rocksdb/table/block_based/data_block_footer.cc
Josh Kang 3ad23b2d94 Support automated interpolation search (#14383)
Summary:
Add automatic per-block interpolation search selection (`kAuto` mode) for index blocks. During SST construction, each index block's key distribution is analyzed using the coefficient of variation (CV) of gaps between restart-point keys. Blocks with uniformly distributed keys are flagged via a new bit in the data block footer, and at read time, `kAuto` resolves to interpolation search for uniform blocks and binary search otherwise.

## Key changes

- **New `BlockSearchType::kAuto` enum value**: Resolves per-block at read time to either `kInterpolation` or `kBinary` based on the block's uniformity flag. Falls back to `kBinary` on older versions that don't recognize it.
- **Write-path uniformity analysis**: `BlockBuilder::ScanForUniformity()` uses Welford's online algorithm to incrementally compute the CV of key gaps at restart points. The result is stored in a new bit (bit 30) of the data block footer's packed restart count.
- **New table option `uniform_cv_threshold`** (default: -1 `disabled`): Controls how strict the uniformity check is. Set to negative to disable. Exposed in C++, Java (JNI), and `db_bench`.
- **Code reorganization**: Block entry decode helpers (`DecodeEntry`, `DecodeKey`, `DecodeKeyV4`, `ReadBe64FromKey`) moved from `block.cc` to a new shared header `block_util.h` so they can be reused by `BlockBuilder` on the write path.
- **New histogram `BLOCK_KEY_DISTRIBUTION_CV`**: Records the CV (scaled by 10000) of each index block's key distribution for observability.
- **Java bindings**: `IndexSearchType.kAuto`, `uniformCvThreshold` getter/setter, JNI portal constructor signature updated, and `HistogramType.BLOCK_KEY_DISTRIBUTION_CV` added.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14383

Test Plan:
- `IndexBlockTest.IndexValueEncodingTest` parameterized to include `kAuto` search type alongside `kBinary` and `kInterpolation`, verifying correct seek/iteration behavior across all combinations of key distributions, restart intervals, and key lengths.
- Uniformity detection validated: blocks with uniform key distribution correctly set `is_uniform = true`, blocks with clustered/non-uniform keys set `is_uniform = false`.
- Stress test coverage
- Updated check_format_compatible to also include a "uniform" dataset. By default using uniform_cv_threshold=-1 does not result in an incompatibility issues. When manually changing the threshold (e.g. `uniform_cv_threshold=1000`), I see `bad block contents`, which is expected

## Benchmark

readrandom with `fillrandom,compact -seed=1 --statistics`:

| Benchmark | Branch | Params | avg ops/s | % change vs main | CV P50 |
|-----------|--------|--------|-----------|------------------|--------|
| readrandom | main | `binary_search, shortening=1` | 335,791 | baseline | N/A |
| readrandom | feature | `binary_search, shortening=1` (default) | 335,749 | -0.0% | 1,500 |
| readrandom | feature | `auto_search, shortening=1` (kAuto) | 366,832 | **+9.2%** | 1,500 |
| readrandom | feature | `interpolation_search, shortening=1` | 366,598 | **+9.2%** | 1,500 |
| readrandom | feature | `auto_search, shortening=2` (kAuto) | 344,631 | **+2.6%** | 1,030,000 |
| readrandom | feature | `interpolation_search, shortening=2` | 201,178 | **-40.1%** | 1,030,000 |

As seen with shortening=2, a non-uniform distribution produces a high CV, which does not use interpolation search.

## Write benchmark

There is a write overhead which scans each restart entry for a block upon Finish. In practice this is very low because currently it is only applied to index blocks.

See cpu profile (https://fburl.com/strobelight/io5hwj9h) here of `-benchmarks=fillseq,compact -compression_type=none -disable_wal=1`. Only 0.08% attributed to `ScanForUniformity`.

Reviewed By: pdillinger

Differential Revision: D94738890

Pulled By: joshkang97

fbshipit-source-id: 9661ac593c5fef89d49f3a8a027f1338a0c96766
2026-03-06 10:13:51 -08:00

102 lines
3 KiB
C++

// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.
#include "table/block_based/data_block_footer.h"
#include "util/coding.h"
namespace ROCKSDB_NAMESPACE {
// Hash index bit (bit 31)
constexpr uint32_t kHashIndexBit = 1u << 31;
// Uniform keys bit (bit 29) - indicates keys are uniformly distributed
constexpr uint32_t kUniformKeysBit = 1u << 29;
// Separated KV storage bit (bit 28)
constexpr uint32_t kSeparatedKVBit = 1u << 28;
void DataBlockFooter::EncodeTo(std::string* dst) const {
assert(num_restarts <= kMaxNumRestarts);
// If separated KV, write the values_section_offset before the packed footer
if (separated_kv) {
PutFixed32(dst, values_section_offset);
}
uint32_t packed = num_restarts;
if (index_type == BlockBasedTableOptions::kDataBlockBinaryAndHash) {
packed |= kHashIndexBit;
} else {
assert(index_type == BlockBasedTableOptions::kDataBlockBinarySearch);
}
if (separated_kv) {
packed |= kSeparatedKVBit;
}
if (is_uniform) {
packed |= kUniformKeysBit;
}
PutFixed32(dst, packed);
}
Status DataBlockFooter::DecodeFrom(Slice* input) {
if (input->size() < kMinEncodedLength) {
return Status::Corruption("Block too small for footer");
}
// Decode from the end of the input
const char* footer_ptr = input->data() + input->size() - sizeof(uint32_t);
uint32_t packed = DecodeFixed32(footer_ptr);
if (packed & kHashIndexBit) {
index_type = BlockBasedTableOptions::kDataBlockBinaryAndHash;
packed &= ~kHashIndexBit;
} else {
index_type = BlockBasedTableOptions::kDataBlockBinarySearch;
}
if (packed & kSeparatedKVBit) {
separated_kv = true;
packed &= ~kSeparatedKVBit;
} else {
separated_kv = false;
}
if (packed & kUniformKeysBit) {
is_uniform = true;
packed &= ~kUniformKeysBit;
} else {
is_uniform = false;
}
// Check for reserved/unrecognized feature bits (anything beyond
// kMaxNumRestarts)
if (packed > kMaxNumRestarts) {
return Status::Corruption(
"Unrecognized feature in block footer (reserved bits set)");
}
num_restarts = packed;
input->remove_suffix(sizeof(uint32_t));
// If separated KV, read values_section_offset from before the packed footer
if (separated_kv) {
if (input->size() < sizeof(uint32_t)) {
return Status::Corruption(
"Block too small for separated KV values section offset");
}
values_section_offset =
DecodeFixed32(input->data() + input->size() - sizeof(uint32_t));
input->remove_suffix(sizeof(uint32_t));
}
return Status::OK();
}
} // namespace ROCKSDB_NAMESPACE