rocksdb/table/block_based/block_cache.h
Josh Kang 901c88e37b Separate keys and values in data blocks (#14287)
Summary:
Introduce new table option with separated key-value storage in data blocks.

This PR implements a new SST block format where keys and values are stored in separate sections within data blocks, rather than interleaved. Keys are stored first, followed by all values in a contiguous section. The motivation is better cpu cache hit rate during seeks and potentially better compression.

The additional storage cost is a varint per restart point, and 4 bytes additional block footer. For a data block with a restart interval of 16, it is approximately 1 bit of overhead per entry. But compression actually performs better, resulting in ~3% storage savings from benchmark.

For now I've opted to not separate kvs in non-data blocks since restart interval for those blocks is typically 1, and values are typically small and probably better inlined.

### New block layout

```
+------------------+
| Keys Section     |  <- Key entries with delta encoding
+------------------+
| Values Section   |  <- (new) Values stored contiguously
+------------------+
| Restart Array    |  <- Fixed32 offsets to restart points
+------------------+
| Values Offset    |  <- (new) 4 bytes: offset to values section
| Footer           |  <- 4 bytes: packed index_type + num_restarts
+------------------+
```

### Entry Format

**At restart points**
```
+--------------+------------------+----------------+-----------------+-----------+
| shared (v32) | non_shared (v32) | value_sz (v32) | value_off (v32) | key_delta |
+--------------+------------------+----------------+-----------------+-----------+
```

**At non-restart points**
```
+--------------+------------------+----------------+-----------+
| shared (v32) | non_shared (v32) | value_sz (v32) | key_delta |
+--------------+------------------+----------------+-----------+
```

- `value_offset` is only stored at restart points to save space
- For non-restart entries, value offset is computed as: `prev_value_offset + prev_value_size`

### Forward Compatibility
- We make use of reserved block footer bits to mark if a block has separated kv format. Should an older version read this, it will assume a very large block restart interval and result in a corruption error.

### Key Changes
- **BlockBuilder**: Accumulates values in a separate buffer; value offsets are stored only at restart points (other entries derive offset from previous value's position). There is an additional memcpy cost to place the value data after the key data.
- **Block iteration**: Iteration now needs to know if we are at a restart point. This will rely on `cur_entry_idx_`, which was previously only used for per-kv checksum purposes. In this new format, we also need to know the block_restart_interval, which was previously also only calculated for per-kv checksums.
- **Table properties**: Store `data_block_restart_interval`, `index_block_restart_interval`, and `separate_key_value_in_data_block` in table properties

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14287

Test Plan:
- Extended block_test, table_test, compaction_test to contain new separated_kv param
- Added new parameter to crash test

 ---

## Benchmark

### Varying Value Size

**Write:** `db_bench --num=1000000 --separate_key_value_in_data_block=<bool> --value_size=<X> --benchmarks=fillrandom,compact`
**Read:** `db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq`

| Value Size | FillRandom (ops/s) | | Compact (s) | | ReadRandom (ops/s) | | ReadSeq (ops/s) | | SST Size (MB) | |
|----------:|-------------------:|-----:|----------:|-----:|-------------------:|-----:|----------------:|-----:|-------------:|-----:|
| | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv |
| 1 | 253,280 | 264,977 | 0.516 | 0.497 (**-3.7%**) | 596,328 | 586,625 (**-1.6%**) | 5,904,681 | 6,069,581 (**+2.8%**) | 5.1 | 4.2 (**-18.6%**) |
| 16 | 235,757 | 242,367 | 0.572 | 0.533 (**-6.8%**) | 570,751 | 572,815 (**+0.4%**) | 5,371,138 | 5,604,908 (**+4.4%**) | 14.7 | 13.4 (**-8.5%**) |
| 100 | 264,299 | 265,427 | 0.461 | 0.454 (**-1.5%**) | 323,696 | 332,790 (**+2.8%**) | 4,239,725 | 4,232,416 (**-0.2%**) | 38.8 | 37.6 (**-3.2%**) |
| 1,000 | 238,992 | 242,764 | 2.349 | 2.329 (**-0.9%**) | 244,608 | 261,403 (**+6.9%**) | 1,285,394 | 1,265,868 (**-1.5%**) | 342.1 | 342.0 (**-0.0%**) |

### Varying Block Restart Interval

**Write:** `db_bench --num=1000000 --separate_key_value_in_data_block=<bool> --block_restart_interval=<X> --benchmarks=fillrandom,compact`
**Read:** `db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq`

| BRI | FillRandom (ops/s) | | Compact (s) | | ReadRandom (ops/s) | | ReadSeq (ops/s) | | SST Size (MB) | |
|----:|-------------------:|-----:|----------:|-----:|-------------------:|-----:|----------------:|-----:|-------------:|-----:|
| | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv |
| 1 | 251,654 | 263,707 | 0.453 | 0.485 (**+7.1%**) | 334,653 | 328,708 (**-1.8%**) | 4,194,342 | 3,954,291 (**-5.7%**) | 40.6 | 42.4 (**+4.5%**) |
| 4 | 253,797 | 252,394 | 0.476 | 0.476 (**+0.0%**) | 332,719 | 341,676 (**+2.7%**) | 4,135,691 | 4,051,151 (**-2.0%**) | 39.2 | 39.3 (**+0.3%**) |
| 8 | 260,143 | 262,273 | 0.496 | 0.460 (**-7.3%**) | 330,859 | 337,567 (**+2.0%**) | 4,144,081 | 4,187,389 (**+1.0%**) | 38.9 | 38.1 (**-2.1%**) |
| 16 | 252,875 | 263,176 | 0.464 | 0.455 (**-1.9%**) | 323,783 | 335,418 (**+3.6%**) | 4,127,310 | 4,217,028 (**+2.2%**) | 38.8 | 37.6 (**-3.2%**) |
| 32 | 260,224 | 269,422 | 0.464 | 0.451 (**-2.8%**) | 304,001 | 314,989 (**+3.6%**) | 4,310,162 | 4,247,248 (**-1.5%**) | 38.8 | 37.3 (**-3.8%**) |

### Varying Compression

**Write:** `db_bench --num=1000000 --separate_key_value_in_data_block=<bool> --compression_type=<X> [--compression_level=<N>] --benchmarks=fillrandom,compact`
**Read:** `db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq`

| Compression | FillRandom (ops/s) | | Compact (s) | | ReadRandom (ops/s) | | ReadSeq (ops/s) | | SST Size (MB) | |
|------------|-------------------:|-----:|----------:|-----:|-------------------:|-----:|----------------:|-----:|-------------:|-----:|
| | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv |
| None | 252,494 | 260,552 | 0.413 | 0.419 (**+1.5%**) | 356,290 | 371,535 (**+4.3%**) | 4,479,261 | 4,507,133 (**+0.6%**) | 73.5 | 73.6 (**+0.2%**) |
| LZ4 | 246,010 | 256,360 | 0.477 | 0.455 (**-4.6%**) | 342,497 | 345,882 (**+1.0%**) | 4,400,570 | 4,268,102 (**-3.0%**) | 38.3 | 37.6 (**-2.0%**) |
| ZSTD (L3) | 254,748 | 258,556 | 1.067 | 1.055 (**-1.1%**) | 176,724 | 177,566 (**+0.5%**) | 2,736,841 | 2,717,739 (**-0.7%**) | 32.9 | 31.3 (**-4.7%**) |
| ZSTD (L6) | 256,459 | 259,388 | 1.556 | 1.462 (**-6.0%**) | 177,390 | 176,691 (**-0.4%**) | 2,754,336 | 2,688,682 (**-2.4%**) | 32.8 | 31.1 (**-5.1%**) |

### Varying Block Size

**Write:** `db_bench --num=1000000 --separate_key_value_in_data_block=<bool> --block_size=<X> --benchmarks=fillrandom,compact`
**Read:** `db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq`

| Block Size | FillRandom (ops/s) | | Compact (s) | | ReadRandom (ops/s) | | ReadSeq (ops/s) | | SST Size (MB) | |
|-----------:|-------------------:|-----:|----------:|-----:|-------------------:|-----:|----------------:|-----:|-------------:|-----:|
| | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv |
| 4 KB | 263,203 | 260,362 | 0.469 | 0.461 (**-1.7%**) | 324,178 | 332,249 (**+2.5%**) | 4,231,537 | 4,217,763 (**-0.3%**) | 38.8 | 37.6 (**-3.1%**) |
| 16 KB | 252,742 | 263,161 | 0.426 | 0.428 (**+0.5%**) | 227,805 | 222,873 (**-2.2%**) | 5,146,997 | 5,081,080 (**-1.3%**) | 38.1 | 36.7 (**-3.6%**) |
| 64 KB | 257,490 | 260,225 | 0.423 | 0.414 (**-2.1%**) | 86,807 | 91,586 (**+5.5%**) | 5,380,403 | 5,372,372 (**-0.1%**) | 36.3 | 35.0 (**-3.5%**) |

### Varying Key Size

**Write:** `db_bench --num=1000000 --separate_key_value_in_data_block=<bool> --min_key_size=10 --max_key_size=100 --benchmarks=fillrandom,compact`
**Read:** `db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq`

| Key Size | FillRandom (ops/s) | | Compact (s) | | ReadRandom (ops/s) | | ReadSeq (ops/s) | | SST Size (MB) | |
|---------:|-------------------:|-----:|----------:|-----:|-------------------:|-----:|----------------:|-----:|-------------:|-----:|
| | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv |
| 10–100 | 243,740 | 255,183 | 0.618 | 0.622 (**+0.6%**) | 284,304 | 307,569 (**+8.2%**) | 3,738,921 | 3,686,676 (**-1.4%**) | 41.5 | 41.2 (**-0.8%**) |

### CPU Profile Notes
- No compression: DataBlock::SeekForGet uses less cpu (13.2% vs 13.9%)
  - https://fburl.com/strobelight/6mwwebft with separated KV
  - https://fburl.com/strobelight/m9m798ka without
- ZSTD compression: rocksdb::DecompressSerializedBlock uses more CPU (45.8% vs 44.9%), while DataBlock::SeekForGet uses less cpu (5.09% vs 6.52%)
  - https://fburl.com/strobelight/3x5nw1k4 with separated KV
  - https://fburl.com/strobelight/e7809046 without

 ---

Reviewed By: xingbowang, pdillinger

Differential Revision: D92103024

Pulled By: joshkang97

fbshipit-source-id: 47cfeb656ff3c20d34975f0b6c4c0462935a83dc
2026-02-23 12:42:05 -08:00

207 lines
7.6 KiB
C++

// Copyright (c) Meta Platforms, Inc. and affiliates.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
// Code supporting block cache (Cache) access for block-based table, based on
// the convenient APIs in typed_cache.h
#pragma once
#include <type_traits>
#include "cache/typed_cache.h"
#include "port/lang.h"
#include "table/block_based/block.h"
#include "table/block_based/block_type.h"
#include "table/block_based/parsed_full_filter_block.h"
#include "table/format.h"
namespace ROCKSDB_NAMESPACE {
// Metaprogramming wrappers for Block, to give each type a single role when
// used with FullTypedCacheInterface.
// (NOTE: previous attempts to create actual derived classes of Block with
// virtual calls resulted in performance regression)
class Block_kData : public Block {
public:
using Block::Block;
static constexpr CacheEntryRole kCacheEntryRole = CacheEntryRole::kDataBlock;
static constexpr BlockType kBlockType = BlockType::kData;
};
class Block_kIndex : public Block {
public:
using Block::Block;
static constexpr CacheEntryRole kCacheEntryRole = CacheEntryRole::kIndexBlock;
static constexpr BlockType kBlockType = BlockType::kIndex;
};
class Block_kFilterPartitionIndex : public Block {
public:
using Block::Block;
static constexpr CacheEntryRole kCacheEntryRole =
CacheEntryRole::kFilterMetaBlock;
static constexpr BlockType kBlockType = BlockType::kFilterPartitionIndex;
};
class Block_kRangeDeletion : public Block {
public:
using Block::Block;
static constexpr CacheEntryRole kCacheEntryRole = CacheEntryRole::kOtherBlock;
static constexpr BlockType kBlockType = BlockType::kRangeDeletion;
};
// Useful for creating the Block even though meta index blocks are not
// yet stored in block cache
class Block_kMetaIndex : public Block {
public:
using Block::Block;
static constexpr CacheEntryRole kCacheEntryRole = CacheEntryRole::kOtherBlock;
static constexpr BlockType kBlockType = BlockType::kMetaIndex;
};
class Block_kUserDefinedIndex : public BlockContents {
public:
static constexpr CacheEntryRole kCacheEntryRole = CacheEntryRole::kIndexBlock;
static constexpr BlockType kBlockType = BlockType::kUserDefinedIndex;
explicit Block_kUserDefinedIndex(BlockContents&& other)
: BlockContents(std::move(other)) {}
const Slice& ContentSlice() const { return data; }
};
struct BlockCreateContext : public Cache::CreateContext {
BlockCreateContext() {}
BlockCreateContext(const BlockBasedTableOptions* _table_options,
const ImmutableOptions* _ioptions, Statistics* _statistics,
Decompressor* _decompressor,
uint8_t _protection_bytes_per_key,
const Comparator* _raw_ucmp,
bool _index_value_is_full = false,
bool _index_has_first_key = false,
uint32_t _data_block_restart_interval = 0,
uint32_t _index_block_restart_interval = 0)
: table_options(_table_options),
ioptions(_ioptions),
statistics(_statistics),
decompressor(_decompressor),
raw_ucmp(_raw_ucmp),
protection_bytes_per_key(_protection_bytes_per_key),
index_value_is_full(_index_value_is_full),
index_has_first_key(_index_has_first_key),
data_block_restart_interval(_data_block_restart_interval),
index_block_restart_interval(_index_block_restart_interval) {}
const BlockBasedTableOptions* table_options = nullptr;
const ImmutableOptions* ioptions = nullptr;
Statistics* statistics = nullptr;
// TODO: refactor to avoid copying BlockCreateContext for dict in block cache
Decompressor* decompressor = nullptr;
const Comparator* raw_ucmp = nullptr;
uint8_t protection_bytes_per_key = 0;
bool index_value_is_full;
bool index_has_first_key;
// Restart intervals from table properties (0 if not available)
uint32_t data_block_restart_interval = 0;
uint32_t index_block_restart_interval = 0;
// For TypedCacheInterface
template <typename TBlocklike>
inline void Create(std::unique_ptr<TBlocklike>* parsed_out,
size_t* charge_out, const Slice& data,
CompressionType type, MemoryAllocator* alloc) {
BlockContents uncompressed_block_contents;
if (type != CompressionType::kNoCompression) {
assert(decompressor != nullptr);
Status s =
DecompressBlockData(data.data(), data.size(), type, *decompressor,
&uncompressed_block_contents, *ioptions, alloc);
if (!s.ok()) {
parsed_out->reset();
return;
}
} else {
uncompressed_block_contents =
BlockContents(AllocateAndCopyBlock(data, alloc), data.size());
}
Create(parsed_out, std::move(uncompressed_block_contents));
*charge_out = parsed_out->get()->ApproximateMemoryUsage();
}
void Create(std::unique_ptr<Block_kData>* parsed_out, BlockContents&& block);
void Create(std::unique_ptr<Block_kIndex>* parsed_out, BlockContents&& block);
void Create(std::unique_ptr<Block_kFilterPartitionIndex>* parsed_out,
BlockContents&& block);
void Create(std::unique_ptr<Block_kRangeDeletion>* parsed_out,
BlockContents&& block);
void Create(std::unique_ptr<Block_kMetaIndex>* parsed_out,
BlockContents&& block);
void Create(std::unique_ptr<Block_kUserDefinedIndex>* parsed_out,
BlockContents&& block);
void Create(std::unique_ptr<ParsedFullFilterBlock>* parsed_out,
BlockContents&& block);
void Create(std::unique_ptr<DecompressorDict>* parsed_out,
BlockContents&& block);
};
// Convenient cache interface to use for block_cache, with support for
// SecondaryCache.
template <typename TBlocklike>
using BlockCacheInterface =
FullTypedCacheInterface<TBlocklike, BlockCreateContext>;
// Shortcut name for cache handles under BlockCacheInterface
template <typename TBlocklike>
using BlockCacheTypedHandle =
typename BlockCacheInterface<TBlocklike>::TypedHandle;
// Selects the right helper based on BlockType and CacheTier
const Cache::CacheItemHelper* GetCacheItemHelper(
BlockType block_type,
CacheTier lowest_used_cache_tier = CacheTier::kNonVolatileBlockTier);
// For SFINAE check that a type is "blocklike" with a kCacheEntryRole member.
// Can get difficult compiler/linker errors without a good check like this.
template <typename TUse, typename TBlocklike>
using WithBlocklikeCheck = std::enable_if_t<
TBlocklike::kCacheEntryRole == CacheEntryRole::kMisc || true, TUse>;
// Helper for the uncache_aggressiveness option
class UncacheAggressivenessAdvisor {
public:
UncacheAggressivenessAdvisor(uint32_t uncache_aggressiveness) {
assert(uncache_aggressiveness > 0);
allowance_ = std::min(uncache_aggressiveness, uint32_t{3});
threshold_ = std::pow(0.99, uncache_aggressiveness - 1);
}
void Report(bool erased) { ++(erased ? useful_ : not_useful_); }
bool ShouldContinue() {
if (not_useful_ < allowance_) {
return true;
} else {
// See UncacheAggressivenessAdvisor unit test
return (useful_ + 1.0) / (useful_ + not_useful_ - allowance_ + 1.5) >=
threshold_;
}
}
private:
// Baseline minimum number of "not useful" to consider stopping, to allow
// sufficient evidence for checking the threshold. Actual minimum will be
// higher as threshold gets well below 1.0.
int allowance_;
// After allowance, stop if useful ratio is below this threshold
double threshold_;
// Counts
int useful_ = 0;
int not_useful_ = 0;
};
} // namespace ROCKSDB_NAMESPACE