Summary: Add automatic per-block interpolation search selection (`kAuto` mode) for index blocks. During SST construction, each index block's key distribution is analyzed using the coefficient of variation (CV) of gaps between restart-point keys. Blocks with uniformly distributed keys are flagged via a new bit in the data block footer, and at read time, `kAuto` resolves to interpolation search for uniform blocks and binary search otherwise. ## Key changes - **New `BlockSearchType::kAuto` enum value**: Resolves per-block at read time to either `kInterpolation` or `kBinary` based on the block's uniformity flag. Falls back to `kBinary` on older versions that don't recognize it. - **Write-path uniformity analysis**: `BlockBuilder::ScanForUniformity()` uses Welford's online algorithm to incrementally compute the CV of key gaps at restart points. The result is stored in a new bit (bit 30) of the data block footer's packed restart count. - **New table option `uniform_cv_threshold`** (default: -1 `disabled`): Controls how strict the uniformity check is. Set to negative to disable. Exposed in C++, Java (JNI), and `db_bench`. - **Code reorganization**: Block entry decode helpers (`DecodeEntry`, `DecodeKey`, `DecodeKeyV4`, `ReadBe64FromKey`) moved from `block.cc` to a new shared header `block_util.h` so they can be reused by `BlockBuilder` on the write path. - **New histogram `BLOCK_KEY_DISTRIBUTION_CV`**: Records the CV (scaled by 10000) of each index block's key distribution for observability. - **Java bindings**: `IndexSearchType.kAuto`, `uniformCvThreshold` getter/setter, JNI portal constructor signature updated, and `HistogramType.BLOCK_KEY_DISTRIBUTION_CV` added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/14383 Test Plan: - `IndexBlockTest.IndexValueEncodingTest` parameterized to include `kAuto` search type alongside `kBinary` and `kInterpolation`, verifying correct seek/iteration behavior across all combinations of key distributions, restart intervals, and key lengths. - Uniformity detection validated: blocks with uniform key distribution correctly set `is_uniform = true`, blocks with clustered/non-uniform keys set `is_uniform = false`. - Stress test coverage - Updated check_format_compatible to also include a "uniform" dataset. By default using uniform_cv_threshold=-1 does not result in an incompatibility issues. When manually changing the threshold (e.g. `uniform_cv_threshold=1000`), I see `bad block contents`, which is expected ## Benchmark readrandom with `fillrandom,compact -seed=1 --statistics`: | Benchmark | Branch | Params | avg ops/s | % change vs main | CV P50 | |-----------|--------|--------|-----------|------------------|--------| | readrandom | main | `binary_search, shortening=1` | 335,791 | baseline | N/A | | readrandom | feature | `binary_search, shortening=1` (default) | 335,749 | -0.0% | 1,500 | | readrandom | feature | `auto_search, shortening=1` (kAuto) | 366,832 | **+9.2%** | 1,500 | | readrandom | feature | `interpolation_search, shortening=1` | 366,598 | **+9.2%** | 1,500 | | readrandom | feature | `auto_search, shortening=2` (kAuto) | 344,631 | **+2.6%** | 1,030,000 | | readrandom | feature | `interpolation_search, shortening=2` | 201,178 | **-40.1%** | 1,030,000 | As seen with shortening=2, a non-uniform distribution produces a high CV, which does not use interpolation search. ## Write benchmark There is a write overhead which scans each restart entry for a block upon Finish. In practice this is very low because currently it is only applied to index blocks. See cpu profile (https://fburl.com/strobelight/io5hwj9h) here of `-benchmarks=fillseq,compact -compression_type=none -disable_wal=1`. Only 0.08% attributed to `ScanForUniformity`. Reviewed By: pdillinger Differential Revision: D94738890 Pulled By: joshkang97 fbshipit-source-id: 9661ac593c5fef89d49f3a8a027f1338a0c96766
92 lines
4 KiB
C++
92 lines
4 KiB
C++
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
|
|
// This source code is licensed under both the GPLv2 (found in the
|
|
// COPYING file in the root directory) and Apache 2.0 License
|
|
// (found in the LICENSE.Apache file in the root directory).
|
|
//
|
|
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
|
|
// Use of this source code is governed by a BSD-style license that can be
|
|
// found in the LICENSE file. See the AUTHORS file for names of contributors.
|
|
|
|
#pragma once
|
|
|
|
#include <cstdint>
|
|
#include <string>
|
|
|
|
#include "rocksdb/slice.h"
|
|
#include "rocksdb/status.h"
|
|
#include "rocksdb/table.h"
|
|
|
|
namespace ROCKSDB_NAMESPACE {
|
|
|
|
// DataBlockFooter represents the footer of a data block, containing metadata
|
|
// about the block's structure and features.
|
|
//
|
|
// Current encoding (may expand in future format versions):
|
|
// - A single uint32_t where:
|
|
// - The low 28 bits store the number of restart points (num_restarts)
|
|
// - The high 4 bits are reserved for metadata/features:
|
|
// - Bit 31: Hash index present (kDataBlockBinaryAndHash)
|
|
// - Bit 30: IMPORTANT: Cannot be used without format version bump.
|
|
// - Bit 29: Uniform keys flag (for kAuto index block search)
|
|
// - Bit 28: Separated KV storage (keys and values stored in separate
|
|
// sections within the block)
|
|
//
|
|
// Note on forward compatibility: Bits 28-29 can be read by older versions of
|
|
// RocksDB and interpreted as an extremely large, which will be caught as a
|
|
// corruption. Bit 30 is special because num_restarts is multipled by 4, causing
|
|
// overflow and the corruption check to be silently ignored
|
|
// (https://github.com/facebook/rocksdb/blob/10.11.fb/table/block_based/block.cc#L1070-L1103)
|
|
//
|
|
// When separated KV is enabled, an additional uint32_t is prepended before the
|
|
// packed footer, storing the offset to the values section within the block.
|
|
//
|
|
// When any unrecognized reserved bit is set, DecodeFrom() returns an error,
|
|
// allowing older versions to fail gracefully on newer formats.
|
|
//
|
|
// The encoding size is not fixed - future format versions may expand it.
|
|
// Use kMaxEncodedLength for buffer sizing.
|
|
struct DataBlockFooter {
|
|
// Maximum number of restarts that can be stored (2^28 - 1 = 268,435,455).
|
|
// This reserves the top 4 bits for metadata (bit 31 for hash index, bits
|
|
// 28-30 for future features). For historical compatibility purposes, the
|
|
// limit is adequate because a 4GiB block (maximum due to 32-bit block size)
|
|
// with restart_interval=1 and minimum entries (12 bytes: 3 varint bytes +
|
|
// 9-byte internal key + empty value) plus 4-byte restart offsets = 16 bytes
|
|
// per restart, fits at most (2^32 - 4) / 16 ≈ 268 million restarts.
|
|
static constexpr uint32_t kMaxNumRestarts = (1u << 28) - 1;
|
|
|
|
// Maximum encoded length of a DataBlockFooter (for buffer sizing).
|
|
// 8 bytes when separated KV is enabled (values_section_offset + packed),
|
|
// 4 bytes otherwise.
|
|
static constexpr uint32_t kMaxEncodedLength = 2 * sizeof(uint32_t);
|
|
|
|
// Minimum encoded length (for current format version)
|
|
static constexpr uint32_t kMinEncodedLength = sizeof(uint32_t);
|
|
|
|
BlockBasedTableOptions::DataBlockIndexType index_type =
|
|
BlockBasedTableOptions::kDataBlockBinarySearch;
|
|
|
|
// Whether the block uses separated KV storage (keys and values in separate
|
|
// sections). When true, values_section_offset indicates where the values
|
|
// section begins within the block data.
|
|
bool separated_kv = false;
|
|
uint32_t values_section_offset = 0;
|
|
|
|
uint32_t num_restarts = 0;
|
|
bool is_uniform = false;
|
|
|
|
DataBlockFooter() = default;
|
|
DataBlockFooter(BlockBasedTableOptions::DataBlockIndexType _index_type,
|
|
uint32_t _num_restarts)
|
|
: index_type(_index_type), num_restarts(_num_restarts) {}
|
|
|
|
// Appends the encoded footer to dst.
|
|
void EncodeTo(std::string* dst) const;
|
|
|
|
// Decodes a footer from the end of input (consumes bytes from the end).
|
|
// Returns an error if reserved/unrecognized feature bits are set.
|
|
// On success, advances input to exclude the consumed footer bytes.
|
|
Status DecodeFrom(Slice* input);
|
|
};
|
|
|
|
} // namespace ROCKSDB_NAMESPACE
|