Summary: Introduce new table option with separated key-value storage in data blocks. This PR implements a new SST block format where keys and values are stored in separate sections within data blocks, rather than interleaved. Keys are stored first, followed by all values in a contiguous section. The motivation is better cpu cache hit rate during seeks and potentially better compression. The additional storage cost is a varint per restart point, and 4 bytes additional block footer. For a data block with a restart interval of 16, it is approximately 1 bit of overhead per entry. But compression actually performs better, resulting in ~3% storage savings from benchmark. For now I've opted to not separate kvs in non-data blocks since restart interval for those blocks is typically 1, and values are typically small and probably better inlined. ### New block layout ``` +------------------+ | Keys Section | <- Key entries with delta encoding +------------------+ | Values Section | <- (new) Values stored contiguously +------------------+ | Restart Array | <- Fixed32 offsets to restart points +------------------+ | Values Offset | <- (new) 4 bytes: offset to values section | Footer | <- 4 bytes: packed index_type + num_restarts +------------------+ ``` ### Entry Format **At restart points** ``` +--------------+------------------+----------------+-----------------+-----------+ | shared (v32) | non_shared (v32) | value_sz (v32) | value_off (v32) | key_delta | +--------------+------------------+----------------+-----------------+-----------+ ``` **At non-restart points** ``` +--------------+------------------+----------------+-----------+ | shared (v32) | non_shared (v32) | value_sz (v32) | key_delta | +--------------+------------------+----------------+-----------+ ``` - `value_offset` is only stored at restart points to save space - For non-restart entries, value offset is computed as: `prev_value_offset + prev_value_size` ### Forward Compatibility - We make use of reserved block footer bits to mark if a block has separated kv format. Should an older version read this, it will assume a very large block restart interval and result in a corruption error. ### Key Changes - **BlockBuilder**: Accumulates values in a separate buffer; value offsets are stored only at restart points (other entries derive offset from previous value's position). There is an additional memcpy cost to place the value data after the key data. - **Block iteration**: Iteration now needs to know if we are at a restart point. This will rely on `cur_entry_idx_`, which was previously only used for per-kv checksum purposes. In this new format, we also need to know the block_restart_interval, which was previously also only calculated for per-kv checksums. - **Table properties**: Store `data_block_restart_interval`, `index_block_restart_interval`, and `separate_key_value_in_data_block` in table properties Pull Request resolved: https://github.com/facebook/rocksdb/pull/14287 Test Plan: - Extended block_test, table_test, compaction_test to contain new separated_kv param - Added new parameter to crash test --- ## Benchmark ### Varying Value Size **Write:** `db_bench --num=1000000 --separate_key_value_in_data_block=<bool> --value_size=<X> --benchmarks=fillrandom,compact` **Read:** `db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq` | Value Size | FillRandom (ops/s) | | Compact (s) | | ReadRandom (ops/s) | | ReadSeq (ops/s) | | SST Size (MB) | | |----------:|-------------------:|-----:|----------:|-----:|-------------------:|-----:|----------------:|-----:|-------------:|-----:| | | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | | 1 | 253,280 | 264,977 | 0.516 | 0.497 (**-3.7%**) | 596,328 | 586,625 (**-1.6%**) | 5,904,681 | 6,069,581 (**+2.8%**) | 5.1 | 4.2 (**-18.6%**) | | 16 | 235,757 | 242,367 | 0.572 | 0.533 (**-6.8%**) | 570,751 | 572,815 (**+0.4%**) | 5,371,138 | 5,604,908 (**+4.4%**) | 14.7 | 13.4 (**-8.5%**) | | 100 | 264,299 | 265,427 | 0.461 | 0.454 (**-1.5%**) | 323,696 | 332,790 (**+2.8%**) | 4,239,725 | 4,232,416 (**-0.2%**) | 38.8 | 37.6 (**-3.2%**) | | 1,000 | 238,992 | 242,764 | 2.349 | 2.329 (**-0.9%**) | 244,608 | 261,403 (**+6.9%**) | 1,285,394 | 1,265,868 (**-1.5%**) | 342.1 | 342.0 (**-0.0%**) | ### Varying Block Restart Interval **Write:** `db_bench --num=1000000 --separate_key_value_in_data_block=<bool> --block_restart_interval=<X> --benchmarks=fillrandom,compact` **Read:** `db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq` | BRI | FillRandom (ops/s) | | Compact (s) | | ReadRandom (ops/s) | | ReadSeq (ops/s) | | SST Size (MB) | | |----:|-------------------:|-----:|----------:|-----:|-------------------:|-----:|----------------:|-----:|-------------:|-----:| | | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | | 1 | 251,654 | 263,707 | 0.453 | 0.485 (**+7.1%**) | 334,653 | 328,708 (**-1.8%**) | 4,194,342 | 3,954,291 (**-5.7%**) | 40.6 | 42.4 (**+4.5%**) | | 4 | 253,797 | 252,394 | 0.476 | 0.476 (**+0.0%**) | 332,719 | 341,676 (**+2.7%**) | 4,135,691 | 4,051,151 (**-2.0%**) | 39.2 | 39.3 (**+0.3%**) | | 8 | 260,143 | 262,273 | 0.496 | 0.460 (**-7.3%**) | 330,859 | 337,567 (**+2.0%**) | 4,144,081 | 4,187,389 (**+1.0%**) | 38.9 | 38.1 (**-2.1%**) | | 16 | 252,875 | 263,176 | 0.464 | 0.455 (**-1.9%**) | 323,783 | 335,418 (**+3.6%**) | 4,127,310 | 4,217,028 (**+2.2%**) | 38.8 | 37.6 (**-3.2%**) | | 32 | 260,224 | 269,422 | 0.464 | 0.451 (**-2.8%**) | 304,001 | 314,989 (**+3.6%**) | 4,310,162 | 4,247,248 (**-1.5%**) | 38.8 | 37.3 (**-3.8%**) | ### Varying Compression **Write:** `db_bench --num=1000000 --separate_key_value_in_data_block=<bool> --compression_type=<X> [--compression_level=<N>] --benchmarks=fillrandom,compact` **Read:** `db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq` | Compression | FillRandom (ops/s) | | Compact (s) | | ReadRandom (ops/s) | | ReadSeq (ops/s) | | SST Size (MB) | | |------------|-------------------:|-----:|----------:|-----:|-------------------:|-----:|----------------:|-----:|-------------:|-----:| | | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | | None | 252,494 | 260,552 | 0.413 | 0.419 (**+1.5%**) | 356,290 | 371,535 (**+4.3%**) | 4,479,261 | 4,507,133 (**+0.6%**) | 73.5 | 73.6 (**+0.2%**) | | LZ4 | 246,010 | 256,360 | 0.477 | 0.455 (**-4.6%**) | 342,497 | 345,882 (**+1.0%**) | 4,400,570 | 4,268,102 (**-3.0%**) | 38.3 | 37.6 (**-2.0%**) | | ZSTD (L3) | 254,748 | 258,556 | 1.067 | 1.055 (**-1.1%**) | 176,724 | 177,566 (**+0.5%**) | 2,736,841 | 2,717,739 (**-0.7%**) | 32.9 | 31.3 (**-4.7%**) | | ZSTD (L6) | 256,459 | 259,388 | 1.556 | 1.462 (**-6.0%**) | 177,390 | 176,691 (**-0.4%**) | 2,754,336 | 2,688,682 (**-2.4%**) | 32.8 | 31.1 (**-5.1%**) | ### Varying Block Size **Write:** `db_bench --num=1000000 --separate_key_value_in_data_block=<bool> --block_size=<X> --benchmarks=fillrandom,compact` **Read:** `db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq` | Block Size | FillRandom (ops/s) | | Compact (s) | | ReadRandom (ops/s) | | ReadSeq (ops/s) | | SST Size (MB) | | |-----------:|-------------------:|-----:|----------:|-----:|-------------------:|-----:|----------------:|-----:|-------------:|-----:| | | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | | 4 KB | 263,203 | 260,362 | 0.469 | 0.461 (**-1.7%**) | 324,178 | 332,249 (**+2.5%**) | 4,231,537 | 4,217,763 (**-0.3%**) | 38.8 | 37.6 (**-3.1%**) | | 16 KB | 252,742 | 263,161 | 0.426 | 0.428 (**+0.5%**) | 227,805 | 222,873 (**-2.2%**) | 5,146,997 | 5,081,080 (**-1.3%**) | 38.1 | 36.7 (**-3.6%**) | | 64 KB | 257,490 | 260,225 | 0.423 | 0.414 (**-2.1%**) | 86,807 | 91,586 (**+5.5%**) | 5,380,403 | 5,372,372 (**-0.1%**) | 36.3 | 35.0 (**-3.5%**) | ### Varying Key Size **Write:** `db_bench --num=1000000 --separate_key_value_in_data_block=<bool> --min_key_size=10 --max_key_size=100 --benchmarks=fillrandom,compact` **Read:** `db_bench --db=$DB --use_existing_db --benchmarks=readrandom,readseq` | Key Size | FillRandom (ops/s) | | Compact (s) | | ReadRandom (ops/s) | | ReadSeq (ops/s) | | SST Size (MB) | | |---------:|-------------------:|-----:|----------:|-----:|-------------------:|-----:|----------------:|-----:|-------------:|-----:| | | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | baseline | sep_kv | | 10–100 | 243,740 | 255,183 | 0.618 | 0.622 (**+0.6%**) | 284,304 | 307,569 (**+8.2%**) | 3,738,921 | 3,686,676 (**-1.4%**) | 41.5 | 41.2 (**-0.8%**) | ### CPU Profile Notes - No compression: DataBlock::SeekForGet uses less cpu (13.2% vs 13.9%) - https://fburl.com/strobelight/6mwwebft with separated KV - https://fburl.com/strobelight/m9m798ka without - ZSTD compression: rocksdb::DecompressSerializedBlock uses more CPU (45.8% vs 44.9%), while DataBlock::SeekForGet uses less cpu (5.09% vs 6.52%) - https://fburl.com/strobelight/3x5nw1k4 with separated KV - https://fburl.com/strobelight/e7809046 without --- Reviewed By: xingbowang, pdillinger Differential Revision: D92103024 Pulled By: joshkang97 fbshipit-source-id: 47cfeb656ff3c20d34975f0b6c4c0462935a83dc
117 lines
5.1 KiB
C++
117 lines
5.1 KiB
C++
// Copyright (c) Meta Platforms, Inc. and affiliates.
|
|
// This source code is licensed under both the GPLv2 (found in the
|
|
// COPYING file in the root directory) and Apache 2.0 License
|
|
// (found in the LICENSE.Apache file in the root directory).
|
|
|
|
#include "table/block_based/block_cache.h"
|
|
|
|
#include "table/block_based/block_based_table_reader.h"
|
|
|
|
namespace ROCKSDB_NAMESPACE {
|
|
|
|
void BlockCreateContext::Create(std::unique_ptr<Block_kData>* parsed_out,
|
|
BlockContents&& block) {
|
|
parsed_out->reset(new Block_kData(std::move(block),
|
|
table_options->read_amp_bytes_per_bit,
|
|
statistics, data_block_restart_interval));
|
|
parsed_out->get()->InitializeDataBlockProtectionInfo(protection_bytes_per_key,
|
|
raw_ucmp);
|
|
}
|
|
void BlockCreateContext::Create(std::unique_ptr<Block_kIndex>* parsed_out,
|
|
BlockContents&& block) {
|
|
parsed_out->reset(new Block_kIndex(std::move(block),
|
|
/*read_amp_bytes_per_bit*/ 0, statistics,
|
|
index_block_restart_interval));
|
|
parsed_out->get()->InitializeIndexBlockProtectionInfo(
|
|
protection_bytes_per_key, raw_ucmp, index_value_is_full,
|
|
index_has_first_key);
|
|
}
|
|
void BlockCreateContext::Create(
|
|
std::unique_ptr<Block_kFilterPartitionIndex>* parsed_out,
|
|
BlockContents&& block) {
|
|
parsed_out->reset(new Block_kFilterPartitionIndex(
|
|
std::move(block), /*read_amp_bytes_per_bit*/ 0, statistics,
|
|
index_block_restart_interval));
|
|
parsed_out->get()->InitializeIndexBlockProtectionInfo(
|
|
protection_bytes_per_key, raw_ucmp, index_value_is_full,
|
|
index_has_first_key);
|
|
}
|
|
void BlockCreateContext::Create(
|
|
std::unique_ptr<Block_kRangeDeletion>* parsed_out, BlockContents&& block) {
|
|
parsed_out->reset(new Block_kRangeDeletion(
|
|
std::move(block), /*read_amp_bytes_per_bit*/ 0, statistics));
|
|
}
|
|
void BlockCreateContext::Create(std::unique_ptr<Block_kMetaIndex>* parsed_out,
|
|
BlockContents&& block) {
|
|
parsed_out->reset(new Block_kMetaIndex(
|
|
std::move(block), /*read_amp_bytes_per_bit*/ 0, statistics));
|
|
parsed_out->get()->InitializeMetaIndexBlockProtectionInfo(
|
|
protection_bytes_per_key);
|
|
}
|
|
|
|
void BlockCreateContext::Create(
|
|
std::unique_ptr<Block_kUserDefinedIndex>* parsed_out,
|
|
BlockContents&& block) {
|
|
parsed_out->reset(new Block_kUserDefinedIndex(std::move(block)));
|
|
}
|
|
|
|
void BlockCreateContext::Create(
|
|
std::unique_ptr<ParsedFullFilterBlock>* parsed_out, BlockContents&& block) {
|
|
parsed_out->reset(new ParsedFullFilterBlock(
|
|
table_options->filter_policy.get(), std::move(block)));
|
|
}
|
|
|
|
void BlockCreateContext::Create(std::unique_ptr<DecompressorDict>* parsed_out,
|
|
BlockContents&& block) {
|
|
parsed_out->reset(new DecompressorDict(
|
|
block.data, std::move(block.allocation), *decompressor));
|
|
}
|
|
|
|
namespace {
|
|
// For getting SecondaryCache-compatible helpers from a BlockType. This is
|
|
// useful for accessing block cache in untyped contexts, such as for generic
|
|
// cache warming in table builder.
|
|
const std::array<const Cache::CacheItemHelper*,
|
|
static_cast<unsigned>(BlockType::kInvalid) + 1>
|
|
kCacheItemFullHelperForBlockType{{
|
|
BlockCacheInterface<Block_kData>::GetFullHelper(),
|
|
BlockCacheInterface<ParsedFullFilterBlock>::GetFullHelper(),
|
|
BlockCacheInterface<Block_kFilterPartitionIndex>::GetFullHelper(),
|
|
nullptr, // kProperties
|
|
BlockCacheInterface<DecompressorDict>::GetFullHelper(),
|
|
BlockCacheInterface<Block_kRangeDeletion>::GetFullHelper(),
|
|
nullptr, // kHashIndexPrefixes
|
|
nullptr, // kHashIndexMetadata
|
|
nullptr, // kMetaIndex (not yet stored in block cache)
|
|
BlockCacheInterface<Block_kIndex>::GetFullHelper(),
|
|
nullptr, // kInvalid
|
|
}};
|
|
|
|
// For getting basic helpers from a BlockType (no SecondaryCache support)
|
|
const std::array<const Cache::CacheItemHelper*,
|
|
static_cast<unsigned>(BlockType::kInvalid) + 1>
|
|
kCacheItemBasicHelperForBlockType{{
|
|
BlockCacheInterface<Block_kData>::GetBasicHelper(),
|
|
BlockCacheInterface<ParsedFullFilterBlock>::GetBasicHelper(),
|
|
BlockCacheInterface<Block_kFilterPartitionIndex>::GetBasicHelper(),
|
|
nullptr, // kProperties
|
|
BlockCacheInterface<DecompressorDict>::GetBasicHelper(),
|
|
BlockCacheInterface<Block_kRangeDeletion>::GetBasicHelper(),
|
|
nullptr, // kHashIndexPrefixes
|
|
nullptr, // kHashIndexMetadata
|
|
nullptr, // kMetaIndex (not yet stored in block cache)
|
|
BlockCacheInterface<Block_kIndex>::GetBasicHelper(),
|
|
nullptr, // kInvalid
|
|
}};
|
|
} // namespace
|
|
|
|
const Cache::CacheItemHelper* GetCacheItemHelper(
|
|
BlockType block_type, CacheTier lowest_used_cache_tier) {
|
|
if (lowest_used_cache_tier > CacheTier::kVolatileTier) {
|
|
return kCacheItemFullHelperForBlockType[static_cast<unsigned>(block_type)];
|
|
} else {
|
|
return kCacheItemBasicHelperForBlockType[static_cast<unsigned>(block_type)];
|
|
}
|
|
}
|
|
|
|
} // namespace ROCKSDB_NAMESPACE
|