rocksdb/table/block_based
Peter Dillinger 4d3518951a Option to decouple index and filter partitions (#12939)
Summary:
Partitioned metadata blocks were introduced back in 2017 to deal more gracefully with large DBs where RAM is relatively scarce and some data might be much colder than other data. The feature allows metadata blocks to compete for memory in the block cache against data blocks while alleviating tail latencies and thrash conditions that can arise with large metadata blocks (sometimes megabytes each) that can arise with large SST files. In general, the cost to partitioned metadata is more CPU in accesses (especially for filters where more binary search is needed before hashing can be used) and a bit more memory fragmentation and related overheads.

However the feature has always had a subtle limitation with a subtle effect on performance: index partitions and filter partitions must be cut at the same time, regardless of which wins the space race (hahaha) to metadata_block_size. Commonly filters will be a few times larger than indexes, so index partitions will be under-sized compared to filter (and data) blocks. While this does affect fragmentation and related overheads a bit, I suspect the bigger impact on performance is in the block cache. The coupling of the partition cuts would be defensible if the binary search done to find the filter block was used (on filter hit) to short-circuit binary search to an index partition, but that optimization has not been developed.

Consider two metadata blocks, an under-sized one and a normal-sized one, covering proportional sections of the key space with the same density of read queries. The under-sized one will be more prone to eviction from block cache because it is used less often. This is unfair because of its despite its proportionally smaller cost of keeping in block cache, and most of the cost of a miss to re-load it (random IO) is not proportional to the size (similar latency etc. up to ~32KB).

 ## This change

Adds a new table option decouple_partitioned_filters allows filter blocks and index blocks to be cut independently. To make this work, the partitioned filter block builder needs to know about the previous key, to generate an appropriate separator for the partition index. In most cases, BlockBasedTableBuilder already has easy access to the previous key to provide to the filter block builder.

This change includes refactoring to pass that previous key to the filter builder when available, with the filter building caching the previous key itself when unavailable, such as during compression dictionary training and some unit tests. Access to the previous key eliminates the need to track the previous prefix, which results in a small SST construction CPU win in prefix filtering cases, regardless of coupling, and possibly a small regression for some non-prefix cases, regardless of coupling, but still overall improvement especially with https://github.com/facebook/rocksdb/issues/12931.

Suggested follow-up:
* Update confusing use of "last key" to refer to "previous key"
* Expand unit test coverage with parallel compression and dictionary training
* Consider an option or enhancement to alleviate under-sized metadata blocks "at the end" of an SST file due to no coordination or awareness of when files are cut.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/12939

Test Plan:
unit tests updated. Also did some unit test runs with "hard wired" usage of parallel compression and dictionary training code paths to ensure they were working. Also ran blackbox_crash_test for a while with the new feature.

 ## SST write performance (CPU)

Using the same testing setup as in https://github.com/facebook/rocksdb/issues/12931 but with -decouple_partitioned_filters=1 in the "after" configuration, which benchmarking shows makes almost no difference in terms of SST write CPU. "After" vs. "before" this PR
```
-partition_index_and_filters=0 -prefix_size=0 -whole_key_filtering=1
923691 vs. 924851 (-0.13%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=0
921398 vs. 922973 (-0.17%)
-partition_index_and_filters=0 -prefix_size=8 -whole_key_filtering=1
902259 vs. 908756 (-0.71%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
917932 vs. 916901 (+0.60%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=0
912755 vs. 907298 (+0.60%)
-partition_index_and_filters=1 -prefix_size=8 -whole_key_filtering=1
899754 vs. 892433 (+0.82%)
```
I think this is a pretty good trade, especially in attracting more movement toward partitioned configurations.

 ## Read performance

Let's see how decoupling affects read performance across various degrees of memory constraint. To simplify LSM structure, we're using FIFO compaction. Since decoupling will overall increase metadata block size, we control for this somewhat with an extra "before" configuration with larger metadata block size setting (8k instead of 4k). Basic setup:

```
(for CS in 0300 1200; do TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillrandom,flush,readrandom,block_cache_entry_stats -num=5000000 -duration=30 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=10 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters=1 -statistics=1 -cache_size=${CS}000000 -metadata_block_size=4096 -decouple_partitioned_filters=1 2>&1 | tee results-$CS; done)
```

And read ops/s results:

```CSV
Cache size MB,After/decoupled/4k,Before/4k,Before/8k
3,15593,15158,12826
6,16295,16693,14134
10,20427,20813,18459
20,27035,26836,27384
30,33250,31810,33846
60,35518,32585,35329
100,36612,31805,35292
300,35780,31492,35481
1000,34145,31551,35411
1100,35219,31380,34302
1200,35060,31037,34322
```

If you graph this with log scale on the X axis (internal link: https://pxl.cl/5qKRc), you see that the decoupled/4k configuration is essentially the best of both the before/4k and before/8k configurations: handles really tight memory closer to the old 4k configuration and handles generous memory closer to the old 8k configuration.

Reviewed By: jowlyzhang

Differential Revision: D61376772

Pulled By: pdillinger

fbshipit-source-id: fc2af2aee44290e2d9620f79651a30640799e01f
2024-08-16 15:34:31 -07:00
..
binary_search_index_reader.cc Eliminate some parameters redundant with ReadOptions (#12761) 2024-06-12 15:44:37 -07:00
binary_search_index_reader.h Extend Get/MultiGet deadline support to table open (#6982) 2020-06-29 14:53:17 -07:00
block.cc Add initial support for TimedPut API (#12419) 2024-03-14 15:44:55 -07:00
block.h Remove 'virtual' when implied by 'override' (#12319) 2024-01-31 13:14:42 -08:00
block_based_table_builder.cc Option to decouple index and filter partitions (#12939) 2024-08-16 15:34:31 -07:00
block_based_table_builder.h Fix/cleanup SeqnoToTimeMapping (#12253) 2024-01-19 21:50:38 -08:00
block_based_table_factory.cc Option to decouple index and filter partitions (#12939) 2024-08-16 15:34:31 -07:00
block_based_table_factory.h Record and use the tail size to prefetch table tail (#11406) 2023-05-08 13:14:28 -07:00
block_based_table_iterator.cc Ignore more non-critical IO error in BlockCacheLookupForReadAheadSize() in crash test (#12822) 2024-06-28 12:01:13 -07:00
block_based_table_iterator.h Lazily construct BlockBasedTableIterator::block_handles_ (#12616) 2024-05-03 17:18:13 -07:00
block_based_table_reader.cc Add ticker stats for read corruption retries (#12923) 2024-08-12 15:32:07 -07:00
block_based_table_reader.h Remove redundant no_io parameters to filter functions (#12762) 2024-06-12 18:47:11 -07:00
block_based_table_reader_impl.h Eliminate some parameters redundant with ReadOptions (#12761) 2024-06-12 15:44:37 -07:00
block_based_table_reader_sync_and_async.h Add ticker stats for read corruption retries (#12923) 2024-08-12 15:32:07 -07:00
block_based_table_reader_test.cc Prefer static_cast in place of most reinterpret_cast (#12308) 2024-02-07 10:44:11 -08:00
block_builder.cc internal_repo_rocksdb (-8794174668376270091) (#12114) 2023-12-01 11:10:30 -08:00
block_builder.h Logically strip timestamp during flush (#11557) 2023-06-29 15:50:50 -07:00
block_cache.cc Support compressed and local flash secondary cache stacking (#11812) 2023-09-21 20:30:53 -07:00
block_cache.h Support pro-actively erasing obsolete block cache entries (#12694) 2024-06-07 08:57:11 -07:00
block_prefetcher.cc Respect ReadOptions::read_tier in prefetching (#12782) 2024-06-19 09:53:59 -07:00
block_prefetcher.h Refactor FilePrefetchBuffer code (#12097) 2024-01-05 09:29:01 -08:00
block_prefix_index.cc Fix bug with kHashSearch and changing prefix_extractor with SetOptions (#10128) 2022-06-10 08:51:45 -07:00
block_prefix_index.h Fix bug with kHashSearch and changing prefix_extractor with SetOptions (#10128) 2022-06-10 08:51:45 -07:00
block_test.cc internal_repo_rocksdb (-8794174668376270091) (#12114) 2023-12-01 11:10:30 -08:00
block_type.h Remove deprecated block-based filter (#10184) 2022-06-16 15:51:33 -07:00
cachable_entry.h Support pro-actively erasing obsolete block cache entries (#12694) 2024-06-07 08:57:11 -07:00
data_block_footer.cc Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
data_block_footer.h Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433) 2020-02-20 12:09:57 -08:00
data_block_hash_index.cc Format files under table/ by clang-format (#10852) 2022-10-25 11:50:38 -07:00
data_block_hash_index.h Fix build with gcc 13 by including <cstdint> (#11118) 2023-01-25 14:30:32 -08:00
data_block_hash_index_test.cc Rename IntTblPropCollector -> InternalTblPropColl (#12320) 2024-02-02 14:14:43 -08:00
filter_block.h Option to decouple index and filter partitions (#12939) 2024-08-16 15:34:31 -07:00
filter_block_reader_common.cc Remove redundant no_io parameters to filter functions (#12762) 2024-06-12 18:47:11 -07:00
filter_block_reader_common.h Remove redundant no_io parameters to filter functions (#12762) 2024-06-12 18:47:11 -07:00
filter_policy.cc Attempt fix valgrind FP on std::optional (#12936) 2024-08-15 10:55:29 -07:00
filter_policy_internal.h Optimize, simplify filter block building (fix regression) (#12931) 2024-08-14 15:13:16 -07:00
flush_block_policy.cc Change internal headers with duplicate names (#11408) 2023-05-17 11:27:09 -07:00
flush_block_policy_impl.h Change internal headers with duplicate names (#11408) 2023-05-17 11:27:09 -07:00
full_filter_block.cc Option to decouple index and filter partitions (#12939) 2024-08-16 15:34:31 -07:00
full_filter_block.h Option to decouple index and filter partitions (#12939) 2024-08-16 15:34:31 -07:00
full_filter_block_test.cc Optimize, simplify filter block building (fix regression) (#12931) 2024-08-14 15:13:16 -07:00
hash_index_reader.cc Eliminate some parameters redundant with ReadOptions (#12761) 2024-06-12 15:44:37 -07:00
hash_index_reader.h Extend Get/MultiGet deadline support to table open (#6982) 2020-06-29 14:53:17 -07:00
index_builder.cc Refactor IndexBuilder::AddIndexEntry (#12867) 2024-07-22 14:27:31 -07:00
index_builder.h Refactor IndexBuilder::AddIndexEntry (#12867) 2024-07-22 14:27:31 -07:00
index_reader_common.cc Eliminate some parameters redundant with ReadOptions (#12761) 2024-06-12 15:44:37 -07:00
index_reader_common.h Eliminate some parameters redundant with ReadOptions (#12761) 2024-06-12 15:44:37 -07:00
mock_block_based_table.h Remove deprecated block-based filter (#10184) 2022-06-16 15:51:33 -07:00
parsed_full_filter_block.cc Hide FilterBits{Builder,Reader} from public API (#9592) 2022-02-17 16:34:46 -08:00
parsed_full_filter_block.h Major Cache refactoring, CPU efficiency improvement (#10975) 2023-01-11 14:20:40 -08:00
partitioned_filter_block.cc Option to decouple index and filter partitions (#12939) 2024-08-16 15:34:31 -07:00
partitioned_filter_block.h Option to decouple index and filter partitions (#12939) 2024-08-16 15:34:31 -07:00
partitioned_filter_block_test.cc Option to decouple index and filter partitions (#12939) 2024-08-16 15:34:31 -07:00
partitioned_index_iterator.cc Refactor FilePrefetchBuffer code (#12097) 2024-01-05 09:29:01 -08:00
partitioned_index_iterator.h Format files under table/ by clang-format (#10852) 2022-10-25 11:50:38 -07:00
partitioned_index_reader.cc Eliminate some parameters redundant with ReadOptions (#12761) 2024-06-12 15:44:37 -07:00
partitioned_index_reader.h Support pro-actively erasing obsolete block cache entries (#12694) 2024-06-07 08:57:11 -07:00
reader_common.cc Prefer static_cast in place of most reinterpret_cast (#12308) 2024-02-07 10:44:11 -08:00
reader_common.h Remove unnecessary, confusing 'extern' (#12300) 2024-01-29 10:38:08 -08:00
uncompression_dict_reader.cc Eliminate some parameters redundant with ReadOptions (#12761) 2024-06-12 15:44:37 -07:00
uncompression_dict_reader.h Eliminate some parameters redundant with ReadOptions (#12761) 2024-06-12 15:44:37 -07:00