rocksdb/table/block_based/binary_search_index_reader.cc
Josh Kang 9f47518676 Add interpolation search as an alternative to binary search (#14247)
Summary:
Interpolation search is an alternative algorithm to binary search, which performs better on uniformly distributed keys. Instead of binary search always computing the mid point of the left and right boundaries, interpolation search "interpolates" the mid point based on the distance to the target. Fortunately, we can re-use existing block format to support interpolation search.

For a given block, we compute the shared_prefix length of the first and last key. Interpolation search is usually done with numerical target values, so for a variable binary length key, we calculate the "value" as the first 8 non-shared bytes. This also means interpolation search would only really be effective for bytewise comparator (guarded via options validations).

#### Fallback to binary search
- if the the val(left_key) == val(right_key) then we fallback to classic binary search (to avoid divide by 0)
- interpolation search is significantly more computationally expensive than binary search, so when the search distance is small, we also fallback to binary search.
- if interpolation search does not make significant progress (i.e. reduces search space by more than half each iteration), we can assume data is non-uniform and fallback.

Interpolation search also performs best when there is minimal shortening, especially shortening of the last block, as it can heavily skew the distribution of the actual keys.

Note that each search algorithm is guaranteed to make progress because at each iteration the search space is guaranteed to be reduce by at least 1.

For now this change only applies to index block seeks, as data block seeks and other blocks do not have as many entries and would not require significant number of search rounds, but it could be easily extended to include that support.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14247

Test Plan:
Updated unit tests and crash test with new search option

### Benchmark
The default benchmark sets up keys in generally uniform distribution, so it was a good way to test performance improvements.

Setup: `./db_bench -benchmarks=fillseq,compact -index_shortening_mode=1`

#### Before this change
```
./db_bench -use_existing_db=true -benchmarks=readrandom -seed=1

readrandom   :       2.899 micros/op 344973 ops/sec 2.899 seconds 1000000 operations;   38.2 MB/s (1000000 of 1000000 found)
```

#### After this change

Notice how key comparison counts are the same between the two.
```
./db_bench -use_existing_db=true -benchmarks=readrandom -seed=1 -index_search_type=binary_search

readrandom   :       2.881 micros/op 347128 ops/sec 2.881 seconds 1000000 operations;   38.4 MB/s (1000000 of 1000000 found)
```

```
./db_bench -use_existing_db=true -benchmarks=readrandom -seed=1 -index_search_type=interpolation_search

readrandom   :       2.609 micros/op 383209 ops/sec 2.610 seconds 1000000 operations;   42.4 MB/s (1000000 of 1000000 found)
```

With a non-uniform distribution, `i.e. index_shortening_mode=2`

```
./db_bench -use_existing_db=true -benchmarks=readrandom -seed=1 -index_search_type=binary_search

readrandom   :       2.958 micros/op 338075 ops/sec 2.958 seconds 1000000 operations;   37.4 MB/s (1000000 of 1000000 found)
```

```
./db_bench -use_existing_db=true -benchmarks=readrandom -seed=1 -index_search_type=interpolation_search

readrandom   :       5.502 micros/op 181750 ops/sec 5.502 seconds 1000000 operations;   20.1 MB/s (1000000 of 1000000 found)
```

Reviewed By: pdillinger

Differential Revision: D91063163

Pulled By: joshkang97

fbshipit-source-id: 151d6aa76f8713740b714de6e406aff40d28ccbc
2026-02-13 17:15:10 -08:00

74 lines
2.6 KiB
C++

// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.
#include "table/block_based/binary_search_index_reader.h"
namespace ROCKSDB_NAMESPACE {
Status BinarySearchIndexReader::Create(
const BlockBasedTable* table, const ReadOptions& ro,
FilePrefetchBuffer* prefetch_buffer, bool use_cache, bool prefetch,
bool pin, BlockCacheLookupContext* lookup_context,
std::unique_ptr<IndexReader>* index_reader) {
assert(table != nullptr);
assert(table->get_rep());
assert(!pin || prefetch);
assert(index_reader != nullptr);
CachableEntry<Block> index_block;
if (prefetch || !use_cache) {
const Status s =
ReadIndexBlock(table, prefetch_buffer, ro, use_cache,
/*get_context=*/nullptr, lookup_context, &index_block);
if (!s.ok()) {
return s;
}
if (use_cache && !pin) {
index_block.Reset();
}
}
index_reader->reset(
new BinarySearchIndexReader(table, std::move(index_block)));
return Status::OK();
}
InternalIteratorBase<IndexValue>* BinarySearchIndexReader::NewIterator(
const ReadOptions& read_options, bool /* disable_prefix_seek */,
IndexBlockIter* iter, GetContext* get_context,
BlockCacheLookupContext* lookup_context) {
const BlockBasedTable::Rep* rep = table()->get_rep();
CachableEntry<Block> index_block;
const Status s = GetOrReadIndexBlock(get_context, lookup_context,
&index_block, read_options);
if (!s.ok()) {
if (iter != nullptr) {
iter->Invalidate(s);
return iter;
}
return NewErrorInternalIterator<IndexValue>(s);
}
Statistics* kNullStats = nullptr;
// We don't return pinned data from index blocks, so no need
// to set `block_contents_pinned`.
auto it = index_block.GetValue()->NewIndexIterator(
internal_comparator()->user_comparator(),
rep->get_global_seqno(BlockType::kIndex), iter, kNullStats, true,
index_has_first_key(), index_key_includes_seq(), index_value_is_full(),
false /* block_contents_pinned */, user_defined_timestamps_persisted(),
nullptr /* prefix_index */, rep->table_options.index_block_search_type);
assert(it != nullptr);
index_block.TransferTo(it);
return it;
}
} // namespace ROCKSDB_NAMESPACE