Summary:
When building a Release on Windows, RTTI is not available, so asserts that use dynamic_cast need to be disabled
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14280
Reviewed By: nmk70
Differential Revision: D91807791
Pulled By: mszeszko-meta
fbshipit-source-id: e29c19c757bcd076a1f09ed40b306bb50ba9e882
Summary:
Causing failures and not yet supported. Also putting a note in db.h about the combination being unsupported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14189
Test Plan: started up blackbox_crash_test_with_ts many times and checked command line to be confident it's excluded.
Reviewed By: hx235
Differential Revision: D89297971
Pulled By: pdillinger
fbshipit-source-id: c5134351d9ecb37879c7e3319c17dd9228d7f12a
Summary:
`PosixRandomAccessFile::MultiRead` was introduced in Dec 2019 in https://github.com/facebook/rocksdb/pull/5881. Two years later, https://github.com/facebook/rocksdb/pull/9578 introduced the `PosixRandomAccessFile::ReadAsync` API, which reused the same `PosixFileSystem` IO ring as the `MultiRead` API, consequently writing to the very same ring's submission queue (without waiting!). This 'shared ring' design is problematic: sequentially interleaving `ReadAsync` and `MultiRead` calls on the very same thread might result in `MultiRead` reading 'unknown' events, leading to `Bad cqe data` errors (falsely perceived as corruption) - which, for some services (running on local flash), is in itself a hard blocker for adopting RocksDB async prefetching ('async IO') that relies heavily on the `ReadAsync` API. This change solves the problem by maintaining separate thread-local IO rings for `async reads` and `multi reads`, ensuring correct execution. In addition, we're adding more robust error handling in the form of retries on kernel interrupts and draining the queue when the process is under memory pressure. Separately, we're enhancing performance by explicitly marking the rings as written to / read from by a single thread (`IORING_SETUP_SINGLE_ISSUER` [if available]) and deferring task work until just before the application intends to process completions (`IORING_SETUP_DEFER_TASKRUN` [if available]). See https://man7.org/linux/man-pages/man2/io_uring_setup.2.html for reference.
## Benchmark
**TLDR**
There's no evident advantage of using `io_uring_submit` (relative to the proposed `io_uring_submit_and_wait`) across batches of size 10, 250 and 1000, simulating batch sizes significantly below, close to, and 4x above `kIoUringDepth`. `io_uring_submit` might be more appealing if at least one of the IOs is slow (which was NOT the case during the benchmark). More notably, with this PR, switching from `io_uring_submit_and_wait` to `io_uring_submit` can be done with a single-line change thanks to the implemented guardrails (we can follow up with an optional config for true ring semantics [if needed]).
**Compilation**
```
DEBUG_LEVEL=0 make db_bench
```
**Create DB**
```
./db_bench \
--db=/db/testdb_2.5m_k100_v6144_16kB_LZ4 \
--benchmarks=fillseq \
--num=2500000 \
--key_size=100 \
--value_size=6144 \
--compression_type=LZ4 \
--block_size=16384 \
--seed=1723056275
```
**LSM**
* L0: 2 files, L1: 5, L2: 49, L3: 79
* Each file is roughly 35M in size
### MultiReadRandom (with caching disabled)
Each run was preceded by OS page cache cleanup with `echo 1 | sudo tee /proc/sys/vm/drop_caches`.
```
./db_bench \
--use_existing_db=true \
--db=/db/testdb_2.5m_k100_v6144_16kB_LZ4 \
--compression_type=LZ4 \
--benchmarks=multireadrandom \
--num=<N> \
--batch_size=<B> \
--io_uring_enabled=true \
--async_io=false \
--optimize_multiget_for_io=false \
--threads=4 \
--cache_size=0 \
--use_direct_reads=true \
--use_direct_io_for_flush_and_compaction=true \
--cache_index_and_filter_blocks=false \
--pin_l0_filter_and_index_blocks_in_cache=false \
--pin_top_level_index_and_filter=false \
--prepopulate_block_cache=0 \
--row_cache_size=0 \
--use_blob_cache=false \
--use_compressed_secondary_cache=false
```
| | B=10; N=100,000 | B=250; N=80,000 | B=1,000; N=20,000
-- | -- | -- | --
baseline | 31.5 (± 0.4) us/op | 17.5 (± 0.5) us/op | 13.5 (± 0.4) us/op
io_uring_submit_and_wait | 31.5 (± 0.6) us/op | 17.7 (± 0.4) us/op | 13.6 (± 0.4) us/op
io_uring_submit | 31.5 (± 0.6) us/op | 17.5 (± 0.5) us/op | 13.4 (± 0.45) us/op
### Specs
| Property | Value
-- | --
RocksDB | version 10.9.0
Date | Tue Dec 9 15:57:03 2025
CPU | 56 * Intel Sapphire Rapids (T10 SPR)
Kernel version | 6.9.0-0_fbk12_0_g28f2d09ad102
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14158
Reviewed By: anand1976
Differential Revision: D88172809
Pulled By: mszeszko-meta
fbshipit-source-id: 5198de3d2f18f76fee661a2ec5f447e79ba06fbd
Summary:
**Context/Summary:**
Truncated range deletion start keys in input files can be output by CompactionIterator with type kTypeMaxValid instead of kTypeRangeDeletion, to satisfy an ordering requirement between the truncated range deletion start key and a file's point keys. There was a plan to skip such keys in https://github.com/facebook/rocksdb/pull/14122 but blockers remain to fulfill that plan.
Resumable compaction is not yet able to handle resumption from a range deletion well and should treat the kTypeMaxValid type the same as kTypeRangeDeletion for resumption. Previously it didn't, and mistakenly allowed resumption from a delete range. That led to an assertion failure complaining about lacking the information needed to update file boundaries in the presence of a range deletion while cutting an output file, after the compaction resumed from that delete range and happened to cut the output file shortly after, without any point keys in between.
```
frame #9: 0x00007f4f4743bc93 libc.so.6`__GI___assert_fail(assertion="meta.smallest.size() > 0", file="db/compaction/compaction_outputs.cc", line=530, function="rocksdb::Status rocksdb::CompactionOutputs::AddRangeDels(rocksdb::CompactionRangeDelAggregator&, const rocksdb::Slice*, const rocksdb::Slice*, rocksdb::CompactionIterationStats&, bool, const rocksdb::InternalKeyComparator&, rocksdb::SequenceNumber, std::pair<long unsigned int, long unsigned int>, const rocksdb::Slice&, const string&)") at assert.c:101:3
frame #10: 0x00007f4f4808c68c librocksdb.so.10.9`rocksdb::CompactionOutputs::AddRangeDels(this=0x00007f4f0c27e1a0, range_del_agg=0x00007f4f0c21ecc0, comp_start_user_key=0x0000000000000000, comp_end_user_key=0x0000000000000000, range_del_out_stats=0x00007f4f0dffa140, bottommost_level=false, icmp=0x00007f4ef4c93040, earliest_snapshot=13108729, keep_seqno_range=<unavailable>, next_table_min_key=0x00007f4ef4c8f540, full_history_ts_low="") at compaction_outputs.cc:530:7
frame #11: 0x00007f4f480480dd librocksdb.so.10.9`rocksdb::CompactionJob::FinishCompactionOutputFile(this=0x00007f4f0dffb890, input_status=<unavailable>, prev_table_last_internal_key=0x00007f4f0dffa650, next_table_min_key=0x00007f4ef4c8f540, comp_start_user_key=0x0000000000000000, comp_end_user_key=0x0000000000000000, c_iter=0x00007f4ef4c8f400, sub_compact=0x00007f4f0c27e000, outputs=0x00007f4f0c27e1a0) at compaction_job.cc:1917:31
```
This PR simply prevents kTypeMaxValid from being a resumption point, just like a regular range deletion - see commit 842d66eb18ea67e965d6acb1fce12c18eeb778d2
Besides that, the PR also improves the testing, variable naming, and logging in the resumable compaction code that were needed to debug this assertion failure - see commit https://github.com/facebook/rocksdb/pull/14184/commits/aecd4e7f971f6dd4df672d9e5f1409fe4747c561. These improvements are covered by existing tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14184
Test Plan:
- The stress test initially surfaced the error. Using the exact same LSM shapes and files from the stress test in a unit test, I was able to get a deterministic repro and confirmed the fix resolves the error. This is the repro test 1075936e69
```
./compaction_service_test --gtest_filter=ResumableCompactionServiceTest.CompactSpecificFilesFromExistingDBWithCancelAndResume
# Before fix
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ResumableCompactionServiceTest
[ RUN ] ResumableCompactionServiceTest.CompactSpecificFilesFromExistingDBWithCancelAndResume
compaction_service_test: db/compaction/compaction_outputs.cc:530: rocksdb::Status rocksdb::CompactionOutputs::AddRangeDels(rocksdb::CompactionRangeDelAggregator&, const rocksdb::Slice*, const rocksdb::Slice*, rocksdb::CompactionIterationStats&, bool, const rocksdb::InternalKeyComparator&, rocksdb::SequenceNumber, std::pair<long unsigned int, long unsigned int>, const rocksdb::Slice&, const string&): Assertion `meta.smallest.size() > 0' failed.
Received signal 6 (Aborted)
Invoking GDB for stack trace...
[New LWP 2621610]
[New LWP 2621611]
[New LWP 2621612]
[New LWP 2621613]
[New LWP 2621614]
[New LWP 2621630]
[New LWP 2621631]
# After fix
Note: Google Test filter = ResumableCompactionServiceTest.CompactSpecificFilesFromExistingDBWithCancelAndResume
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ResumableCompactionServiceTest
[ RUN ] ResumableCompactionServiceTest.CompactSpecificFilesFromExistingDBWithCancelAndResume
[ OK ] ResumableCompactionServiceTest.CompactSpecificFilesFromExistingDBWithCancelAndResume (4722 ms)
[----------] 1 test from ResumableCompactionServiceTest (4722 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (4722 ms total)
[ PASSED ] 1 test.
```
- Follow-up: I tried a couple of times to coerce the truncated range deletion from scratch in the unit test but failed to do so. Considering that kTypeMaxValid may no longer be output by the compaction iterator after https://github.com/facebook/rocksdb/pull/14122/files gets landed again (which would obsolete the bug), AND the simple nature of this fix 842d66eb18ea67e965d6acb1fce12c18eeb778d2, AND that the worst case of this fix going wrong is just less resumption, I decided to leave writing a unit test that coerces a truncated range deletion from scratch as a follow-up. Maybe I will draw inspiration from https://github.com/facebook/rocksdb/pull/14122/files.
Reviewed By: jaykorean
Differential Revision: D88912663
Pulled By: hx235
fbshipit-source-id: 80a01135684c8fea659650faaa00c2dc452c482a
Summary:
**Context/Summary:**
Stress test flags printed by db_crashtest.py, like `./db_stress ... --secondary_cache_uri=compressed_secondary_cache://capacity=8388608;enable_custom_split_merge=true --otherflags=xxxx`, are not copy-paste-run friendly. Directly running this command will cause parsing hiccups due to special characters like `//` or `;`. This PR makes db_crashtest.py print a single-quoted value so that at least copy-paste-run works for unix-like shells (the most common case).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14180
Test Plan:
`python3 tools/db_crashtest.py --simple blackbox ...` displays the following
Before the fix, the value is not single-quoted
```
Use random seed for iteration 9698536012932546857
Running db_stress with pid=1280640:./db_stress --secondary_cache_uri=compressed_secondary_cache://capacity=8388608;enable_custom_split_merge=true ...
// Directly copying, pasting and running the ./db_stress command will encounter
Error: Read(-readpercent=0)+Prefix(-prefixpercent=0)+Write(-writepercent=45)+Delete(-delpercent=0)+DeleteRange(-delrangepercent=30)+Iterate(-iterpercent=40)+CustomOps(-customopspercent=0) percents != 100!
bash: --set_options_one_in=0: command not found
```
After the fix, the value is single-quoted
```
Use random seed for iteration 6017815530972723112
Running db_stress with pid=1234632: ./db_stress --secondary_cache_uri='compressed_secondary_cache://capacity=8388608;enable_custom_split_merge=true' ....
// Directly copying, pasting and running the ./db_stress command is fine
```
Reviewed By: archang19
Differential Revision: D88688584
Pulled By: hx235
fbshipit-source-id: 88b8b2de7c2c5619b6e19900f4144dcd8e032f7b
Summary:
r? cbi42
Exposes RocksDB's remote compaction functionality through the C API, enabling C/FFI clients (Go, Rust, Python, etc.) to offload compaction work to remote workers.
## API Components
### Compaction Service
- Create service with schedule, wait, cancel, and on_installation callbacks
- Ownership transfers to the options object (auto-destroyed, no manual cleanup)
### Job Info (13 getters)
- DB/CF metadata and compaction details (priority, reason, levels, flags)
### Schedule Response
- Create with job ID and status (validated with errptr)
- Status: success, failure, aborted, use_local
### OpenAndCompact (for remote workers)
- Execute compaction on a worker node with environment/comparator overrides
- Cancellation support via atomic flags
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14136
Reviewed By: hx235
Differential Revision: D88316558
Pulled By: jaykorean
fbshipit-source-id: 60a0fee69ff1e650dd785d96ec656649263214f8
Summary:
Crash tests have been failing of late with this assertion failure - db_stress: `./table/block_based/block_based_table_iterator.h:656: void rocksdb::BlockBasedTableIterator::PrepareReadAsyncCallBack(rocksdb::FSReadRequest &, void *): Assertion `async_state->status.IsAborted()' failed.` Instead of asserting, surface the failure status so we can troubleshoot.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14171
Reviewed By: xingbowang
Differential Revision: D88396654
Pulled By: anand1976
fbshipit-source-id: 8d59d7ace0c522c17b7af17c50e16af876911bad
Summary:
To help find potential issues not showing up in ARM unit tests. I'm running it with and without TransactionDB (write-committed) for better coverage. The job expands the size of /dev/shm for adequate space on maximum performance storage, and adds swap space to reduce risk of OOM in case we fill that up.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14172
Test Plan: earlier drafts of this PR added the job to PR jobs, and the last before putting in "nightly" can be seen here: https://github.com/facebook/rocksdb/actions/runs/19945493840/job/57193797390?pr=14172
Reviewed By: archang19
Differential Revision: D88429479
Pulled By: pdillinger
fbshipit-source-id: bd4d9cda9256950c3c6c126c299a44dbbbc30c7e
Summary:
Fix missing const for arg of OptionChangeMigration
We switched the OptionChangeMigration API from std::string to std::string&, which caused the const qualifier to be lost at call sites, causing compilation failures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14173
Test Plan: Unit test
Reviewed By: pdillinger
Differential Revision: D88431457
Pulled By: xingbowang
fbshipit-source-id: a705f3b80cc5ff56dab73aa6a31c940798d8df45
Summary:
Revert "Fix a bug where compaction with range deletion can persist kTypeMaxValid in file metadata (https://github.com/facebook/rocksdb/issues/14122)"
Add a new unit test to capture the situation found by stress test
This reverts commit 8c7c8b8dab.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14170
Test Plan: Unit Test
Reviewed By: anand1976
Differential Revision: D88395956
Pulled By: xingbowang
fbshipit-source-id: 226649dc79a86010ad326ffb2eae35109dc96bc4
Summary:
Continuing work from https://github.com/facebook/rocksdb/issues/13965. Here I'm migrating the "next with shift" kind of bit field and for that I've added an API for atomic additive transformations that can be combined into a single atomic update for multiple fields. (I implemented more features than needed, just in case they are needed someday and to demonstrate what is possible.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14027
Test Plan: BitFields unit test updated/added, existing HCC tests
Reviewed By: xingbowang
Differential Revision: D83895094
Pulled By: pdillinger
fbshipit-source-id: e4487f34f5607b20f94b85a645ca654e6401e35d
Summary:
I want to reduce the time from when we call `StopBackup` to `CreateNewBackup` returning `BackupStopped`. We already check for the `stop_backup_` inside `CopyOrCreateFile` and `ReadFileAndComputeChecksum`, but we should add a check at the top of these methods to abort immediately. This could help save some latency from the file system metadata operations, like creating the sequential file and writable file.
We also want to update the API documentation for `StopBackup` which currently does not indicate that once it is called, all subsequent requests to create backups will fail.
In a follow up PR, we should also add coverage of `StopBackup` to the crash tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14129
Test Plan:
We were missing unit test coverage for `StopBackup`. I added test cases which cancel backups at different points in time.
Once this change is rolled out to production, we can monitor the DB close latencies, which depend on first cancelling ongoing backups
Reviewed By: pdillinger
Differential Revision: D87356536
Pulled By: archang19
fbshipit-source-id: 687094a41f096f6a156be65b2cce0b5054fb26f2
Summary:
Support ccache in make file
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14123
Test Plan: local build
Reviewed By: cbi42
Differential Revision: D87332892
Pulled By: xingbowang
fbshipit-source-id: 2088bd19bdab1bd7070734c886200be80f1a65af
Summary:
... from https://github.com/facebook/rocksdb/issues/14140. The assertion in the default implementation of CompressorWrapper::MaybeCloneSpecialized() could fail because this wrapper wasn't overriding it when it should have. (See the NOTE on that implementation.)
Because this release already has a breaking modification to the Compressor API (adding Clone()), I took this opportunity to add 'const' to MaybeCloneSpecialized(). Also marked some compression classes as 'final' that could be marked as such.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14150
Test Plan: unit test expanded to cover this case (verified failing before). Audited the rest of our CompressorWrappers.
Reviewed By: archang19
Differential Revision: D87793987
Pulled By: pdillinger
fbshipit-source-id: 61c4469b84e4a47451a9942df09277faeeccfe63
Summary:
This change enables a custom CompressionManager / Compressor to adopt custom handling for data and index blocks. In particular, index blocks for format_version >= 4 use a distinct variant of the block format. Thus, a potentially format-aware compression algorithm such as OpenZL should be told which kind of block we are compressing. (And previously I avoided passing block type in CompressBlock for efficient handling of things like dictionaries but also avoiding checks on every CompressBlock call.)
Most of the change is in BlockBasedTableBuilder, to call MaybeCloneSpecialized for both kDataBlock and kIndexBlock. But I also needed some small tweaks/additions to the public API:
* Require a Clone() function from Compressors, to support proper implementations of MaybeCloneSpecialized() in wrapper Compressors.
* Assert that the default implementation of CompressorWrapper::MaybeCloneSpecialized() is only used in allowable cases.
* Convenience function Compressor::CloneMaybeSpecialized()
This also fixes a serious bug/oversight in ManagedPtr (for ManagedWorkingArea) that somehow wasn't showing up before. It probably doesn't need a release note because the CompressionManager stuff is still considered experimental.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14140
Test Plan: Greatly expanded DBCompressionTest.CompressionManagerWrapper to make sure the distinction between data blocks and index blocks is properly communicated to a custom CompressionManager/Compressor. The test includes processing the expected structure of data and index blocks, to serve as a tested example for structure-aware compressors.
Reviewed By: hx235
Differential Revision: D87600019
Pulled By: pdillinger
fbshipit-source-id: 252ef78910073a0e45f2c81dd45ac87ff8a41fc6
Summary:
Range deletion start keys are considered during compaction for cutting output files. Due to an ordering requirement (see the comment above InsertNextValidRangeTombstoneAtLevel()) between a truncated range deletion start key and a file's point keys, there was logic in f6c9c3bf1c/db/range_del_aggregator.cc (L39) that changes the value type to kTypeMaxValid. However, kTypeMaxValid is not supposed to be persisted per f6c9c3bf1c/db/dbformat.h (L75-L76). This can cause forward compatibility issues reported in https://github.com/facebook/rocksdb/issues/14101. This PR fixes the issue by removing the logic that sets kTypeMaxValid and always skipping truncated range deletion start keys in CompactionMergingIterator.
For existing SST files, we want to avoid using this kTypeMaxValid, so this PR also introduces a new placeholder value type. This allows us to re-strengthen the relevant value type checks (IsExtendedValueType()) that were loosened for kTypeMaxValid.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14122
Test Plan:
- a unit test that persists kTypeMaxValid before this fix
- crash test with frequent range deletion: `python3 ./tools/db_crashtest.py blackbox --delrangepercent=11 --readpercent=35`
- Generate SST files with 0x1A as the value type (kTypeMaxValid before this change) in file metadata. Run ldb with the strengthened check in IsExtendedValueType() to dump the MANIFEST. It fails to parse the MANIFEST as expected before this PR and succeeds after this PR.
```
Error in processing file /tmp/rocksdbtest-543376/db_range_del_test_2549357_6547198162080866792/MANIFEST-000005 Corruption: VersionEdit: new-file4 entry The file /tmp/rocksdbtest-543376/db_range_del_test_2549357_6547198162080866792/MANIFEST-000005 may be corrupted.
```
Reviewed By: pdillinger
Differential Revision: D87016541
Pulled By: cbi42
fbshipit-source-id: 9957a095db2cd9947463b403f352bd9a1fd70a76
Summary:
Fixing internal validator failure
```
Every project specific source file must contain a doc block with an appropriate copyright header. Unrelated files must be listed as exceptions in the Copyright Headers Exceptions page in the repo dashboard.
A copyright header clearly indicates that the code is owned by Meta. Every open source file must start with a comment containing "Meta Platforms, Inc. and affiliates"
https://github.com/facebook/rocksdb/blob/main/buckifier/targets_cfg.py:
The first 16 lines of 'buckifier/targets_cfg.py' do not contain the patterns:
(Meta Platforms, Inc. and affiliates)|(Facebook, Inc(\.|,)? and its affiliates)|([0-9]{4}-present(\.|,)? Facebook)|([0-9]{4}(\.|,)? Facebook)
```
While fixing the text to pass the linter, I took the opportunity to modify `format-diff.sh` script to add the copyright header automatically if missing in new files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14143
Test Plan:
```
$> make format
```
**new python file**
```
build_tools/format-diff.sh
Checking format of uncommitted changes...
Checking for copyright headers in new files...
Added copyright header to build_tools/test.py
Copyright headers were added to new files.
Nothing needs to be reformatted!
```
**new header file**
```
build_tools/format-diff.sh
Checking format of uncommitted changes...
Checking for copyright headers in new files...
Added copyright header to db/db_impl/db_impl_jewoongh.h
Copyright headers were added to new files.
Nothing needs to be reformatted!
```
Reviewed By: hx235
Differential Revision: D87653124
Pulled By: jaykorean
fbshipit-source-id: 164322cfcd2c162bb3b41bb8f3bafefa3f20b695
Summary:
**Context/Summary:**
.. because verify_output_flags contains information about the usage of paranoid_file_check, which is currently not yet compatible with resumable remote compaction
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14139
Test Plan: Existing tests
Reviewed By: jaykorean
Differential Revision: D87582635
Pulled By: hx235
fbshipit-source-id: ef21223da53a0696fa3ca9b1617c2c1ee2e19878
Summary:
**Context/Summary:**
Due to double's 53-bit mantissa limitation, large uint64_t values lose precision when converted to double. Values equal to or smaller than UINT64_MAX (but greater than 2^64 - 1024) round up to 2^64 - since rounding up results in less error than rounding down - which exceeds UINT64_MAX. The check `std::numeric_limits<uint64_t>::max() / op1 < op2` won't catch those cases. Casting such out-of-range doubles back to uint64_t causes undefined behavior.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14132
Test Plan:
```
COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 CC=clang-18 CXX=clang++-18 ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j55 db_stress
python3 tools/db_crashtest.py --simple blackbox --compact_range_one_in=5 --target_file_size_base=9223372036854775807 // Half of std::numeric_limits<uint64_t>::max()
```
It fails with
```
stderr:
options/cf_options.cc:1087:32: runtime error: 1.84467e+19 is outside the range of representable values of type 'unsigned long'
UndefinedBehaviorSanitizer: undefined-behavior options/cf_options.cc:1087:32 in
```
before the fix but not after.
Reviewed By: pdillinger
Differential Revision: D87434936
Pulled By: hx235
fbshipit-source-id: 65563edf9faf732410bdba8b9e4b7fd61b958169
Summary:
I have been using sst_dump --command=recompress for some ad hoc automation for compression engineering and these new options help with that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14133
Test Plan: manual
Reviewed By: hx235
Differential Revision: D87453635
Pulled By: pdillinger
fbshipit-source-id: 2ae54e13a9221ec27c6637fea16623465a9163ae
Summary:
Saw a mysterious failure of assertion
`assert(rep_->props.num_data_blocks == 0)` in
DBCompressionTest/CompressionFailuresTest.CompressionFailures/45. This seems to be caused by a parallel compression failure arriving after the emit thread has started Finish() but before the Flush() at the start of Finish(). We can fix this by relaxing the assertion to allow for the !ok() case. Testing revealed more ok() assertions that needed to be relaxed/moved.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14130
Test Plan: Added a sync point to inject a failure status in the right place and added to a unit test to be sure the case is essentially covered. It would arguably be more realistic to force a particular thread interleaving, but I believe simple is good here.
Reviewed By: hx235
Differential Revision: D87377709
Pulled By: pdillinger
fbshipit-source-id: 4bd465673b084afcc235688503d1c2f464eed32d
Summary:
**Context/Summary:**
This PR adds multi-cf support to option migration. The original implementation sets options, opens the db, compacts files and reopens the db in almost all three branches below. Such a design makes expanding to multi-cf difficult, as it requires changing all these places within each branch, causing code redundancy.
```
Status OptionChangeMigration(std::string dbname, const Options& old_opts,
const Options& new_opts) {
if (old_opts.compaction_style == CompactionStyle::kCompactionStyleFIFO) {
// LSM generated by FIFO compaction can be opened by any compaction.
return Status::OK();
} else if (new_opts.compaction_style ==
CompactionStyle::kCompactionStyleUniversal) {
return MigrateToUniversal(dbname, old_opts, new_opts);
} else if (new_opts.compaction_style ==
CompactionStyle::kCompactionStyleLevel) {
return MigrateToLevelBase(dbname, old_opts, new_opts);
} else if (new_opts.compaction_style ==
CompactionStyle::kCompactionStyleFIFO) {
return CompactToLevel(old_opts, dbname, 0, 0 /* l0_file_size */, true);
} else {
return Status::NotSupported(
"Do not how to migrate to this compaction style");
}
}
```
Therefore this PR
- Refactor the option migration implementation by moving the common parts into the high-level `OptionChangeMigration()` through `PrepareNoCompactionCFDescriptors()` and `OpenDBWithCFs()` so `MigrateAllCFs()` can focus on compaction only.
- Treat the original OptionChangeMigration() API as a special case of the multi-cf version option migration
- Add multiple-cf support
A few notes:
- CompactToLevel() originally modifies the compaction-related options conditionally before doing compaction. This is moved into earlier steps through `ApplySpecialSingleLevelSettings()` in `PrepareNoCompactionCFDescriptors()`
- MigrateToUniversal() originally opens the db twice with essentially the same option. This PR reduces that to one open
- Option migration does not always use the old option to compact the db and reopen the db after migration, see ` return CompactToLevel(new_opts, dbname, new_opts.num_levels - 1,/*l0_file_size=*/0, false);`. `PrepareNoCompactionCFDescriptors()` is where we handle those decisions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14059
Test Plan:
- Existing UTs
- New UTs
Reviewed By: cbi42
Differential Revision: D84852970
Pulled By: hx235
fbshipit-source-id: 936b456cf9fb4c3ccb687e5d1387f2d67a1448be
Summary:
This diff introduces async prepare of all iterators within a MultiScan. Currently each iterator is prepared as it's needed; with this diff, we prepare all iterators during the prepare phase of the LevelIterator. This allows more time for each IO to be dispatched and serviced, increasing the odds that a block is ready as the scan seeks to it.
Benchmark is prefilled using
```
KEYSIZE=64
VALUESIZE=512
NUMKEYS=5000000
SCAN_SIZE=100
DISTANCE=25000
NUM_SCANS=15
THREADS=1
./db_bench --db=$DB \
--benchmarks="fillseq" \
--write_buffer_size=5242880 \
--max_write_buffer_number=4 \
--target_file_size_base=5242880 \
--disable_wal=1 --key_size=$KEYSIZE \
--value_size=$VALUESIZE --num=$NUMKEYS --threads=32
```
And benchmark ran is
```
run() {
echo 1 | sudo tee /proc/sys/vm/drop_caches
./db_bench --db=$DB --use_existing_db=1 \
--benchmarks=multiscan \
--disable_auto_compactions=1 --seek_nexts=$SCAN_SIZE \
--multiscan-use-async-io=1 \
--multiscan-size=$NUM_SCANS --multiscan-stride=$DISTANCE \
--key_size=$KEYSIZE --value_size=$VALUESIZE \
--num=$NUMKEYS --threads=$THREADS --duration=60 --statistics
}
```
The benchmark uses large stride sizes to ensure that two scans touch separate files. We reduce the size of the block cache to increase the likelihood of reads (and simulate larger data sets).
**Branch:**
```
Integrated BlobDB: blob cache disabled
RocksDB: version 10.8.0
Date: Tue Nov 11 13:26:29 2025
CPU: 166 * AMD EPYC-Milan Processor
CPUCache: 512 KB
Keys: 64 bytes each (+ 0 bytes user-defined timestamp)
Values: 512 bytes each (256 bytes after compression)
Entries: 5000000
Prefix: 0 bytes
Keys per prefix: 0
RawSize: 2746.6 MB (estimated)
FileSize: 1525.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
multiscan_stride = 25000
multiscan_size = 15
seek_nexts = 100
DB path: [/data/rocksdb/mydb]
multiscan : 837.941 micros/op 1193 ops/sec 60.001 seconds 71605 operations; (multscans:71605)
```
**Baseline:**
```
Set seed to 1762898809121995 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB: version 10.9.0
Date: Tue Nov 11 14:06:49 2025
CPU: 166 * AMD EPYC-Milan Processor
CPUCache: 512 KB
Keys: 64 bytes each (+ 0 bytes user-defined timestamp)
Values: 512 bytes each (256 bytes after compression)
Entries: 5000000
Prefix: 0 bytes
Keys per prefix: 0
RawSize: 2746.6 MB (estimated)
FileSize: 1525.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
multiscan_stride = 25000
multiscan_size = 15
seek_nexts = 100
DB path: [/data/rocksdb/mydb]
multiscan : 1129.916 micros/op 885 ops/sec 60.001 seconds 53102 operations; (multscans:53102)
```
Repeated for confirmation.
This introduces a ~20% improvement in latency and op/s.
Note: Benchmarks are single threaded because, as thread count increases, we start seeing large amounts of overhead induced by block cache contention, eventually resulting in baseline and branch performing equally.
Further, on network-attached storage with high latency, the level iterator prepares all iterators up front, so we expect a ~20% improvement even at high thread counts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14100
Reviewed By: anand1976
Differential Revision: D86913584
Pulled By: krhancoc
fbshipit-source-id: da9d0c890e25e392a33389ce6b80f9bfb84d3f85
Summary:
Oversight in https://github.com/facebook/rocksdb/issues/13964. More detail:
* Applies to cache_bench and db_bench (db_stress already using it)
* Make sure those along with db_stress treat "hyper_clock_cache" as "auto_hyper_clock_cache" because this is now the blessed implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14120
Test Plan: manual runs of the tools
Reviewed By: krhancoc
Differential Revision: D86913202
Pulled By: pdillinger
fbshipit-source-id: 07b425d3522103417f4b034735376b9d759af5fb
Summary:
Right now, Java's Get() calls are handled inefficiently. Status.NotFound is turned into an exception in the JNI layer and caught in the same function to be turned into a not-found return value. This causes significant overhead in scenarios where most queries end up not found. For example, in Spark's deduplication query, this exception-creation overhead is higher than Get() itself. With the proposed change, if the return status is NotFound, we return directly rather than going through the exception path.
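To make the saving concrete, here is a minimal sketch (all names are illustrative, not the actual RocksDB JNI code) of translating a NotFound status into an empty return value instead of constructing an exception:

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Illustrative status code; the real JNI layer inspects rocksdb::Status.
enum class Code { kOk, kNotFound };

struct KvStore {
  std::map<std::string, std::string> data;
  Code Get(const std::string& key, std::string* value) const {
    auto it = data.find(key);
    if (it == data.end()) return Code::kNotFound;
    *value = it->second;
    return Code::kOk;
  }
};

// Maps NotFound directly to an empty result; no exception object is
// constructed or caught on the (common) miss path.
std::optional<std::string> GetOrEmpty(const KvStore& db,
                                      const std::string& key) {
  std::string v;
  if (db.Get(key, &v) == Code::kNotFound) return std::nullopt;
  return v;
}
```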
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14095
Test Plan: Existing tests should cover all Get() cases, and they are passing.
Reviewed By: jaykorean
Differential Revision: D86797594
Pulled By: cbi42
fbshipit-source-id: 1202d24e46a2358976bb7c8ff38a2fd4783d0f99
Summary:
There are instances where an application might be interested in knowing the distribution of SST files for a key range at a particular level.
This implementation creates an overloaded GetColumnFamilyMetaData API where (startKey, endKey) can be passed along with level information to filter the necessary SST files, together with the key ranges for each SST file.
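A rough sketch of the filtering such an overload could perform internally (types and names here are assumptions, not the actual implementation):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for RocksDB's SST file metadata.
struct SstFileMeta {
  int level;
  std::string smallest_key;
  std::string largest_key;
};

// Keep files on `level` whose [smallest_key, largest_key] range overlaps
// the requested [start_key, end_key] range.
std::vector<SstFileMeta> FilterByRange(const std::vector<SstFileMeta>& files,
                                       int level,
                                       const std::string& start_key,
                                       const std::string& end_key) {
  std::vector<SstFileMeta> out;
  for (const auto& f : files) {
    if (f.level == level && f.smallest_key <= end_key &&
        f.largest_key >= start_key) {
      out.push_back(f);
    }
  }
  return out;
}
```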
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14009
Reviewed By: anand1976
Differential Revision: D83389707
fbshipit-source-id: 6df1dc1f9233efe9000b03cc1831b3c618cbcef3
Summary:
Support trivial move in the CompactFiles API, which was not supported previously.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14112
Test Plan: Unit test
Reviewed By: cbi42
Differential Revision: D86546150
Pulled By: xingbowang
fbshipit-source-id: 08a3ae9a055f3d3d41711403b1695f44977e6ea8
Summary:
**Summary:**
Merge the BuiltinFilterBitsBuilder into FilterBitsBuilder. This enables using
CalculateSpace() for accurate filter size estimation instead of hardcoded
bits-per-key which could result in incorrect estimations for different filter types.
The previous hardcoded estimate of 15 bits per key lived in the filter block builders' UpdateFilterSizeEstimate().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14111
Test Plan: - Existing filter tests pass (bloom_test, full_filter_block_test, filter_bench, db_bloom_filter_test)
Reviewed By: pdillinger
Differential Revision: D86473287
Pulled By: nmk70
fbshipit-source-id: cd4a47351e67444e944d5b1b375b3b13274dd6e3
Summary:
For all compactions, RocksDB performs a lightweight sanity check on output SST files before installation (in `CompactionJob::VerifyOutputFiles()`). However, this lightweight check may not catch corruption that is small enough to allow the SST files to still be opened.
There is an existing feature, `paranoid_file_check`, which opens the SST file, iterates through all keys, and checks the hash of each key. While this provides the ultimate level of data integrity checking, it comes at a high computational cost.
In this PR, we introduce a new mutable CF option, `verify_output_flags`. It is a bitmask enum that allows users to select various verification types, including block checksum verification, full key iteration, and file checksum verification (to be added in subsequent PRs). Note that the existing `paranoid_file_check` option is equivalent to a full key iteration check. Block-level checksum verification is much lighter than the full key iteration check.
Please note that the previously deprecated `verify_checksums_in_compaction` option (removed in version 5.3.0) was for verifying the checksum of **input SST files**. RocksDB continues to perform this verification for both local and remote compactions, and this behavior remains unchanged. In contrast, this PR focuses on verifying the **output SST files**.
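As a hedged sketch of how a bitmask option like this is typically defined and combined (enumerator names and values below are illustrative, not necessarily RocksDB's):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative bitmask enum; real enumerator names/values may differ.
enum VerifyOutputFlags : uint32_t {
  kVerifyNone = 0,
  kVerifyBlockChecksums = 1 << 0,    // lightweight block checksum check
  kVerifyFullKeyIteration = 1 << 1,  // equivalent to paranoid_file_checks
  kVerifyFileChecksum = 1 << 2,      // planned in subsequent PRs
};

// Bitwise test for whether a particular verification type is enabled.
inline bool HasFlag(uint32_t flags, VerifyOutputFlags f) {
  return (flags & f) != 0;
}
```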
## To follow up
- File-level Checksum verification for output SST files
- Deprecate `paranoid_file_checks` option in favor of the new option
- Add to stress test / db_bench
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14103
Test Plan:
A new unit test was added. The corruption is detected both by `paranoid_file_check` and by the various types of verification set by this new option, `verify_output_flags`.
```
./compaction_service_test --gtest_filter="*CompactionServiceTest.CorruptedOutput*"
```
Reviewed By: pdillinger
Differential Revision: D86357924
Pulled By: jaykorean
fbshipit-source-id: a9e04798f249c7e977231e179622a0830d6675fe
Summary:
MultiScanUnexpectedSeekTarget() currently uses user key comparison to decide on the next data block for multiscan. This can cause a multiscan to move backward in the following scenario:
data block 1: ..., k@7, k@6
data block 2: k@5, ...
The DB iterator scans through k@7, k@6 and k@5, then decides to seek to k@0 due to the option [`max_sequential_skip_in_iterations`](d56da8c112/include/rocksdb/advanced_options.h (L621-L629)). The multiscan was on data block 2, but moves to data block 1 after the seek.
This can cause an assertion failure in debug mode and a segfault in production, since older data blocks are unpinned and freed as we advance a multiscan. This PR fixes the issue by forcing a multiscan to never go backward.
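The "never go backward" rule can be sketched as a simple clamp on the block index a reseek resolves to (names are illustrative, not the actual iterator code):

```cpp
#include <cassert>
#include <cstddef>

// When a reseek resolves to an earlier data block than the multiscan's
// current block, clamp the target to the current block: earlier blocks
// have already been unpinned and freed, so moving backward is unsafe.
size_t ClampSeekBlock(size_t cur_data_block_idx, size_t seek_resolved_idx) {
  return seek_resolved_idx < cur_data_block_idx ? cur_data_block_idx
                                                : seek_resolved_idx;
}
```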
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14106
Test Plan: - added a new unit test that reproduces the scenario: `./db_iterator_test --gtest_filter="*ReseekAcrossBlocksSameUserKey*"`
Reviewed By: xingbowang
Differential Revision: D86428845
Pulled By: cbi42
fbshipit-source-id: ab623f93e73298a60857fb2ff268366f289092a0
Summary:
This test is now taking > 6 hours, timing out, and has low signal, so creating a weekly job for it, with an explicit timeout of 12 hours.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14110
Test Plan: watch CI
Reviewed By: virajthakur
Differential Revision: D86428262
Pulled By: pdillinger
fbshipit-source-id: 44103518064ca378f3fd2ff8d21967ede698c8ea
Summary:
Adds auto-tuning of manifest file size to avoid the need to scale `max_manifest_file_size` in proportion to things like number of SST files to properly balance (a) manifest file write amp and new file creation, vs. (b) manifest file space amp and replay time, including non-incremental space usage in backups. (Manifest file write amp comes from re-writing a "live" record when the manifest file is re-created, or "compacted"; space amp is usage beyond what would be used by a compacted manifest file.) In more detail,
* Add new option `max_manifest_space_amp_pct` with default value of 500, which defaults to 0.2 write amp and up to roughly 5.0 space amp, except `max_manifest_file_size` is treated as the "minimum" size before re-creating ("compacting") the manifest file.
* `max_manifest_file_size` in a way means the same thing, with the same default of 1GB, but in a way has taken on a new role. What is the same is that we do not re-create the manifest file before reaching this size (except for DB re-open), and so users are very unlikely to see a change in default behavior (auto-tuning only kicking in if auto-tuning would exceed 1GB for effective max size for the current manifest file). The new role is as a file size lower bound before auto-tuning kicks in, to minimize churn in files considered "negligibly small." We recommend a new setting of around 1MB or even smaller like 64KB, and expect something like this to become the default soon.
* These two options along with `manifest_preallocation_size` are now mutable with SetDBOptions. The effect is nearly immediate, affecting the next write to the current manifest file.
Also in this PR:
* Refactoring of VersionSet to allow it to get (more) settings from MutableDBOptions. This touches a number of files in not very interesting ways, but notably we have to be careful about thread-safe access to MutableDBOptions fields, and even fields within VersionSet. I have decided to save copies of relevant fields from MutableDBOptions to simplify testing, etc. by not saving a reference to MutableDBOptions but getting notified of updates.
* Updated some logging in VersionSet to provide some basic data about final and compacted manifest sizes (effects of auto-tuning), making sure to avoid I/O while holding DB mutex.
* Added db_etc3_test.cc which is intended as a successor to db_test and db_test2, but having "test.cc" in its name for easier exclusion of test files when using `git grep`. Intended follow-up: rename db_test2 to db_etc2_test
* Moved+updated `ManifestRollOver` test to the new file to be closer to other manifest file rollover testing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14076
Test Plan:
As for correctness, new unit test AutoTuneManifestSize is pretty thorough. Some other unit tests updated appropriately. Manual tests in the performance section were also audited for expected behavior based on the new logging in the DB LOG. Example LOG data with -max_manifest_file_size=2048 -max_manifest_space_amp_pct=500:
```
2025/10/24-11:12:48.979472 2150678 [/version_set.cc:5927] Created manifest 5, compacted+appended from 52 to 116
2025/10/24-11:12:49.626441 2150682 [/version_set.cc:5927] Created manifest 24, compacted+appended from 2169 to 1801
2025/10/24-11:12:52.194592 2150682 [/version_set.cc:5927] Created manifest 91, compacted+appended from 10913 to 8707
2025/10/24-11:13:02.969944 2150682 [/version_set.cc:5927] Created manifest 362, compacted+appended from 52259 to 13321
2025/10/24-11:13:18.815120 2150681 [/version_set.cc:5927] Created manifest 765, compacted+appended from 80064 to 13304
2025/10/24-11:13:35.590905 2150681 [/version_set.cc:5927] Created manifest 1167, compacted+appended from 79863 to 13304
```
As you can see, it only took a few iterations of ramp-up to settle on the auto-tuned max manifest size for tracking ~122 live SST files, around 80KB and compacting down to about 13KB. (13KB * (500 + 100) / 100 = 78KB). With the default large setting for max_manifest_file_size, we end up with a 232KB manifest, which is more than 90% wasted space. (A long-running DB would be much worse.)
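The ramp-up arithmetic above can be sketched as follows (function and parameter names are assumptions; the real logic lives in VersionSet):

```cpp
#include <cassert>
#include <cstdint>

// The manifest is re-created ("compacted") once it grows past
//   max(compacted_size * (space_amp_pct + 100) / 100, min_size)
// where min_size plays max_manifest_file_size's new role of a lower
// bound before auto-tuning kicks in.
uint64_t EffectiveMaxManifestSize(uint64_t compacted_size,
                                  uint64_t space_amp_pct,
                                  uint64_t min_size) {
  uint64_t tuned = compacted_size * (space_amp_pct + 100) / 100;
  return tuned > min_size ? tuned : min_size;
}
```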
As for performance, we don't expect a difference, even with TransactionDB because actual writing of the manifest is done without holding the DB mutex. I was not able to see a performance regression using db_bench with FIFO compaction and >1000 ~10MB SST files, including settings of -max_manifest_file_size=2048 -max_manifest_space_amp_pct={500,10,0}. No "hiccups" visible with -histogram either.
I also tried seeding a 1 second delay in writing new manifest files (other than the first). This had no significant effect at -max_manifest_space_amp_pct=500 but at 100 started causing write stalls in my test. In many ways this is kind of a worst case scenario and out-of-proportion test, but gives me more confidence that a higher number like 500 is probably the best balance in general.
Reviewed By: xingbowang
Differential Revision: D85445178
Pulled By: pdillinger
fbshipit-source-id: 1e6e07e89c586762dd65c65bb7cb2b8b719513f9
Summary:
**Summary:**
This change introduces tail size estimation during SST construction to improve compaction file cutting accuracy and prevent oversized files. The BlockBasedTableBuilder now estimates the SST tail size (index and filter blocks) and uses this estimate, in addition to the data size, to determine when to cut files during compaction.
**Problem:**
Currently, file cutting logic only considers data size when determining where to cut a file, failing to reserve space for index and filter blocks that are added when the file is finalized. This often leads to SST files that exceed target file size limits.
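A minimal sketch of the improved cut decision, under assumed names:

```cpp
#include <cassert>
#include <cstdint>

// Reserve room for the estimated tail (index + filter blocks) instead of
// comparing the data size alone against the target file size.
// (Illustrative; the real logic is in BlockBasedTableBuilder.)
bool ShouldCutFile(uint64_t data_size, uint64_t estimated_tail_size,
                   uint64_t target_file_size) {
  return data_size + estimated_tail_size >= target_file_size;
}
```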
**Behavior Change:**
Implement size estimation methods for index and filter builders, and integrate these estimates into BlockBasedTableBuilder via a new EstimatedTailSize() method. This method aggregates estimates from all tail components and is used for file cutting decisions during compaction.
**Performance Considerations:**
To minimize CPU overhead, size estimates are updated when data blocks are finalized rather than on every key add. For index builders, estimates are updated when index entries are added (one per data block). For filter builders, the OnDataBlockFinalized() hook triggers estimate updates when data blocks are cut/finalized.
This approach provides:
* Minimal impact to compaction hot path (key additions)
* Near real-time estimates for file cutting decisions
* Meaningful estimate changes only when data blocks are finalized
**Usage:**
* Set true mutable cf option `compaction_use_tail_size_estimation`
to use tail size estimation for compaction file cutting decisions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14051
Test Plan:
* Assert tail size estimate is an overestimate in BlockBasedTableBuilder::Finish
* Add new test to verify compaction output file is below target file size
**Next steps:**
* Enable tail size estimation for compaction file cutting by default (and other improvements)
Reviewed By: pdillinger, cbi42
Differential Revision: D84852285
Pulled By: nmk70
fbshipit-source-id: c43cf5dbd2cb2f623a0622591ef24eee30ce0c87
Summary:
* Fix nightly build-linux-cmake-with-folly-lite-no-test for real this time
with correct include directory. (CMakeLists.txt)
* Add test runs to that build (and rename)
* Improve folly build caching with a folly.mk file with most of the relevant
parts of Makefile that contribute to the checkout_folly and
build_folly builds. This reduces the risk of false passing of CI jobs with a
cached folly build. This caching is still only for folly debug builds (which
is probably OK with just a single nightly build relying on release folly
build, which also serves as a rough canary against false passing
because of caching).
* Use `make VERBOSE=1` after cmake calls for detailed output
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14099
Test Plan:
temporary CI change to put the relevant parts in pr-jobs,
then back to homes including in nightly
Reviewed By: mszeszko-meta
Differential Revision: D86243363
Pulled By: pdillinger
fbshipit-source-id: f7975fa190ef45195c6d0b74417f7886e551516a
Summary:
... caused by public headers depending on build parameters (macro definitions). This change also adds a check under 'make check-headers' (already in CI) looking for potential future violations.
I've audited the uses of '#if' in public headers and either
* Eliminated them
* Systematically excluded them because they are intentional or similar (details in comments in check-public-header.sh)
* Manually excluded them as being ODR-SAFE
In the case of ROCKSDB_USING_THREAD_STATUS, there was no good reason for this to appear in public headers, so I've replaced it with a static bool ThreadStatus::kEnabled. I considered getting rid of the ability to disable this code, but some relatively recent PRs have been submitted for fixing that case. I've added a release note and updated one of the CI jobs to use this build configuration. (I didn't want to combine with some jobs like no_compression and status_checked because the interaction might limit what is checked.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14096
Test Plan: manual 'make check-headers' + manual cmake as in new CI config + CI
Reviewed By: jaykorean
Differential Revision: D86241864
Pulled By: pdillinger
fbshipit-source-id: d16addc9e3480706b174a006720a4def0740bf2e
Summary:
Following up on https://github.com/facebook/rocksdb/pull/14071, updating folly to
8a9fc1e80a or beyond was failing an F14Table assertion for a very subtle reason: ODR violation between the folly build and RocksDB build because folly build was release mode and RocksDB build was debug mode. What was happening was that folly change introduced a dependence on kDebug (whether build is debug) in a hashing implementation in a .h file, and the inconsistency between the inlined implementation during RocksDB build and the linked-to implementation from the folly build was leading to inconsistencies in the data structure.
The primary fix is to ensure we build folly in debug mode for debug mode RocksDB builds. Also,
* Needed to use the `patchelf` tool in `build_folly` to ensure the glog dependency shared library can always find its own gflags dependency. I explored many options for working around this, and this is what would work without reworking folly's own build.
* Updated folly to latest commit.
* Threw in an ad hoc folly patch to use ftp.gnu.org mirrors (the canonical host is super slow)
* Moved the placement of GETDEPS_USE_WGET=1 to apply to local builds also, to avoid the issue of a large download almost reaching completion and then stalling indefinitely.
* Fix failing nightly build-linux-cmake-with-folly-lite-no-test with fmt includes in cmake build (as was done with make build)
* Add a release mode folly+RocksDB to nightly CI, including both cmake and make. This also serves as a non-cached folly build to detect potential problems with PR jobs working from cached folly build.
* Move build-linux-cmake-with-folly to nightly because it's mostly covered by build-linux-cmake-with-folly-coroutines
Intended follow-up:
* folly-lite build with tests
* Make the folly build caching more friendly+accurate by hashing the relevant Makefile parts and tagging whether debug or release. Not in this PR because then you wouldn't be able to see what changed in the folly build steps themselves.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14094
Test Plan: manual + CI
Reviewed By: mszeszko-meta
Differential Revision: D85864871
Pulled By: pdillinger
fbshipit-source-id: 50009b33422d5781074fcbbdf18089be9e36800d
Summary:
Resolving this folly upgrade required fixing the FOLLY_LITE build with header include from the 'fmt' library.
I was close to timing out on fixing USE_FOLLY_LITE and removing it altogether - it could be considered obsolete and/or not worth the maintenance cost.
Follow-up: make the folly build caching more friendly by hashing the relevant makefile parts. Not in this PR because then you wouldn't be able to see what changed in the folly build steps themselves.
UPDATE/NOTE: I wasn't able to fully update to latest due to a failure seen in F14, using the next folly commit or later. The source of the bug is likely outside of F14 but investigation is in progress.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14071
Test Plan: CI
Reviewed By: jaykorean
Differential Revision: D85268833
Pulled By: pdillinger
fbshipit-source-id: 1d0a2d61f095524a20e6ec796ef46c02d0696f4e
Summary:
Change PosixWritableFile's Truncate to seek to the new end offset. This ensures that future appends are written with no holes or overwrites. RocksDB doesn't guarantee this in the FileSystem contract, and it's left up to the specific implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14088
Reviewed By: cbi42
Differential Revision: D85786398
Pulled By: anand1976
fbshipit-source-id: 3520d9d6336362f5128a17bbf396297d821a5da3
Summary:
Comprehensive performance optimizations for the RocksDB C API that eliminate unnecessary memory allocations and copies.
## Key Changes
### 1. PinnableSlice for Get Operations (50% reduction in copies)
- Changed all `rocksdb_get*` functions to use `PinnableSlice` internally instead of `std::string`
- **Before:** RocksDB → std::string → malloc'd buffer (2 copies)
- **After:** RocksDB → malloc'd buffer (1 copy)
- Affects: Get, Transaction Get, TransactionDB Get, WriteBatch Get variants
### 2. Array-Based MultiGet with PinnableSlice (30% allocation reduction)
- Switched MultiGet operations to use optimized array-based RocksDB API with `PinnableSlice`
- Eliminates vector overhead and string allocations
- Affects: MultiGet, Transaction MultiGet, TransactionDB MultiGet variants
### 3. New Zero-Copy APIs
Added high-performance zero-copy functions for applications that can use them:
- `rocksdb_iter_key_slice()` / `value_slice()` / `timestamp_slice()` - Return slices by value (eliminates output param overhead)
- `rocksdb_batched_multi_get_cf_slice()` - Batched get with slice array input
- `rocksdb_slice_t` - ABI-compatible slice type
Note that this PR builds on top of https://github.com/facebook/rocksdb/pull/13911
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14036
Reviewed By: pdillinger
Differential Revision: D85604919
Pulled By: jaykorean
fbshipit-source-id: 7f04b935eea79af1d45b3125a79b90e4706666f6
Summary:
Stress test can fail with assertion inside MultiScan in some reseek scenario. E.g., data block 1 ends with k@9, data block 2 starts with k@8, when a DB iter seeks to k@0 (see option `max_sequential_skip_in_iterations`), MultiScan will land in data block 1 due to fd0b4e0cf0/table/block_based/block_based_table_iterator.cc (L1258-L1263).
We can't just use the internal key as a separator since the index block might not use it. I plan to follow up with a fix that never moves `cur_data_block_idx` backward within a MultiScan.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14087
Test Plan: CI and internal crash tests
Reviewed By: anand1976
Differential Revision: D85701668
Pulled By: cbi42
fbshipit-source-id: d3f1aaff40a12be4e3d1b4b7160bf2547f43b849
Summary:
All remote compaction test failures had `mmap_read=1` in common. Unfortunately, the failure hasn't been very reproducible. Try disabling `mmap_read` to see if that sheds some light.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14083
Test Plan: CI
Reviewed By: hx235
Differential Revision: D85622229
Pulled By: jaykorean
fbshipit-source-id: bbe9e08efc369813f0fec388c910446089e43650
Summary:
As titled, this fixes some internal crash test failures when UDT is enabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14085
Test Plan: monitor crash tests.
Reviewed By: anand1976
Differential Revision: D85617949
Pulled By: cbi42
fbshipit-source-id: da6fb21c0ca5803ea24e8daf7de8558321babcf4
Summary:
Due to some internal requirements, what's being used for `$SSH` and `$SCP` has changed, and it broke the regression test (e.g. tarball streaming to a remote host no longer works).
Minor behavior changes to the script to make the internal workflow work.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14079
Test Plan:
```
./tools/regression_test.sh
```
Meta Internal automation
Reviewed By: pdillinger
Differential Revision: D85502798
Pulled By: jaykorean
fbshipit-source-id: d294c2ee47661fbe368ccc318062e891f3ac7c81
Summary:
The TTL-based WAL archive cleanup logic could incorrectly delete an archived WAL if the system clock moved backwards between the last write to that WAL and `WALManager::PurgeObsoleteWALFiles()`. This happened due to unsigned underflow when subtracting two wall-clock-based timestamps: `now_seconds - file_m_time`.
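The underflow and a guarded fix can be sketched as follows (illustrative, not the exact WALManager code):

```cpp
#include <cassert>
#include <cstdint>

// With unsigned wall-clock timestamps, `now_seconds - file_m_time` wraps
// to a huge value if the clock moved backwards, making a fresh WAL look
// ancient and eligible for TTL deletion. Clamping the age at zero avoids
// the wraparound.
uint64_t WalAgeSeconds(uint64_t now_seconds, uint64_t file_m_time) {
  return now_seconds >= file_m_time ? now_seconds - file_m_time : 0;
}
```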
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14016
Test Plan: unit test repro
Reviewed By: pdillinger
Differential Revision: D83879806
Pulled By: hx235
fbshipit-source-id: 643e7f623c6b5c31711565854314cfd6cbbcf3a7
Summary:
Fixed a missing CV signal when `FindObsoleteFiles()` decides there is nothing to purge and then decrements `pending_purge_obsolete_files_` to zero. This bug could cause `DB::GetSortedWalFiles()` to hang, at least.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14069
Test Plan: unit test repro
Reviewed By: hx235
Differential Revision: D85453534
Pulled By: cbi42
fbshipit-source-id: cf5cfe7f5087459ca1f1f28ce81ea6afc84178f0
Summary:
* Address feedback from https://github.com/facebook/rocksdb/issues/14040
* Add additional test for MultiScan
* Fix a bug when a range deletion and data are in the same file for multi-scan
* Rewrite the cases that need to be handled in SeekMultiScan
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14055
Test Plan: Unit test
Reviewed By: cbi42, anand1976
Differential Revision: D84851788
Pulled By: xingbowang
fbshipit-source-id: 0f69632733afb99685f6341badbf239681010c38
Summary:
Linter complains like this
```
void foo(Arg parameter_name) {}
void bar() {
Arg a;
foo(/*some_other_name=*/ a); // Wrong! Comment/parameter name mismatch
foo(/*parameter_name=*/ a); // This is OK; the names match.
}
```
```
Argument name in comment (`read_only`) does not match parameter name (`unchanging`).
```
This used to be a warning, but is now treated as an error :(
Fixing a few other linter warnings before they become errors in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14074
Test Plan: CI
Reviewed By: archang19
Differential Revision: D85370353
Pulled By: jaykorean
fbshipit-source-id: 20e96aad740d516a29c0424282674e655f99c0a2
Summary:
When a standalone range deletion file is ingested in L0, currently it is compacted with any overlapping L0 files. This is not desirable when we ingest new data on top of the range deletion file. This PR fixes the compaction picking logic to only consider L0 files older than the standalone range deletion file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14061
Test Plan: added a new unit test and updated an existing one.
Reviewed By: xingbowang
Differential Revision: D84930780
Pulled By: cbi42
fbshipit-source-id: 65f4403ccb40ba964b9e65b09e2f7f7efebe81df
Summary:
**Context/Summary:**
- Add resumable compaction to stress test with adaptive progress cancellation
- Add fault injection to remote compaction
- Fix a real minor bug and a couple of testing framework bugs with remote compaction
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14041
Test Plan: - Rehearsal stress test, finding bugs for https://github.com/facebook/rocksdb/pull/13984 effectively and did not create new failures.
Reviewed By: jaykorean
Differential Revision: D84524194
Pulled By: hx235
fbshipit-source-id: 42b4264e428c6739631ed9aa5eb02723367510bc
Summary:
With cache hits and compiler option optimization, compilation time is reduced from 40 min to 2 min. Overall build time is reduced from 60 min to under 20 minutes when the majority of source files hit the cache. On a 100% cache miss, it would be around 40 minutes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14064
Test Plan: Github CI
Reviewed By: mszeszko-meta
Differential Revision: D85023882
Pulled By: xingbowang
fbshipit-source-id: 98551880c98f14d36133ff43e6af8c3be94ab465
Summary:
Fixing a nullptr access in multiscan, under the following situation.
```
Block Based Table: blk1:[k1,k2], blk2:[k3, k8], blk3:[k9]
Scan ranges: [k1, k4), [k5,k6), [k7, k10)
Prepared block ranges: [0,2], [2,2], [1,3]
```
1. Seek key k1 on the first range, read key k1, k2.
2. Seek key k4 on the 2nd range, blocks 0,1 would be unpinned.
3. Seek key k9, block 1 would be accessed, but it is unpinned, which trigger assert failure in debug mode and nullptr access on release build.
This fix changes how blocks are unpinned. A block is now unpinned only when cur_data_block_idx has passed it.
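The revised unpinning rule can be sketched as follows (illustrative names, not the actual iterator code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A prepared block is released only once cur_data_block_idx has moved
// past it, so a later scan range that still needs a block at or beyond
// the current index never sees it unpinned.
std::vector<bool> UnpinPassedBlocks(size_t num_blocks,
                                    size_t cur_data_block_idx) {
  std::vector<bool> pinned(num_blocks, true);
  for (size_t i = 0; i < num_blocks && i < cur_data_block_idx; ++i) {
    pinned[i] = false;  // safe to release: current index has passed block i
  }
  return pinned;
}
```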
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14062
Test Plan:
Unit Test
rand_seed 304010984 on UserDefinedIndexStressTest
Reviewed By: cbi42
Differential Revision: D84976410
Pulled By: xingbowang
fbshipit-source-id: 6b99bf85fc9d4108c5267ae77be77ccfe08923cd
Summary:
**Problem:** RocksDB was making unnecessary prefetch system calls on file systems that don't support prefetch operations, potentially leading to wasted CPU cycles.
**Fix:** Add kFSPrefetch to FSSupportedOps enum to allow file systems to indicate prefetch support capability. File systems can now opt out of prefetch calls by not setting this field.
**Backwards compatibility:** File systems that don't override SupportedOps() continue to receive prefetch calls exactly as before. Only file systems that explicitly opt out by not setting kFSPrefetch will avoid the calls.
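A hedged sketch of the capability check (bit values here are illustrative; the real definitions live in the FSSupportedOps enum of the FileSystem API):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative capability bits; actual enumerator encoding may differ.
enum FSSupportedOps : uint64_t {
  kAsyncIO = 1 << 0,
  kFSBuffer = 1 << 1,
  kFSPrefetch = 1 << 2,
};

// Prefetch system calls are issued only when the file system advertises
// prefetch support, avoiding wasted CPU cycles otherwise.
bool ShouldIssuePrefetch(uint64_t supported_ops) {
  return (supported_ops & kFSPrefetch) != 0;
}
```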
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13917
Test Plan:
- Added a new test in block_based_table_reader.
- Run existing tests: ```make prefetch_test && ./prefetch_test```
Reviewed By: anand1976
Differential Revision: D81607145
Pulled By: nmk70
fbshipit-source-id: 3bbefa05919034e8776ea4e4540cdc695cdc6d3f
Summary:
Currently we return `File is too large for PlainTableReader!` when the file size exceeds our pre-defined constant. There was a request to have the file size information logged when this error is returned.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14056
Reviewed By: nmk70
Differential Revision: D84834869
Pulled By: archang19
fbshipit-source-id: 8f332b6a31d51f320c7e2db06ad49f50798ff70e
Summary:
* Reduce build time of folly from 45m~1hr down to 25m. This is achieved by caching folly build artifact from previous build.
* Reduce the Windows build time of folly from 1hr 15m down to 50m. This is done by increasing the Windows build machine size.
* Fix the build on macOS for other macOS targets.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14057
Test Plan: github CI
Reviewed By: archang19, nmk70
Differential Revision: D84848041
Pulled By: xingbowang
fbshipit-source-id: 00306750737070e7e446ee436d607ed6ecae79ae
Summary:
We simulate remote compaction in our stress test by running a separate set of worker threads to run compactions. In reality, these remote compactions run on a different host (or at least in a different process) where we cannot share the TableFactory and BlockCache with the main DB process.
To make this simulated remote compaction closer to reality, create a new TableFactory for each remote compaction in stress test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14050
Test Plan:
```
python3 -u tools/db_crashtest.py --cleanup_cmd='' --simple blackbox --remote_compaction_worker_threads=8 --interval=10
```
Reviewed By: hx235
Differential Revision: D84775656
Pulled By: jaykorean
fbshipit-source-id: d6203fcbe0eca3539e008a19fd47b742553537ed
Summary:
We are adding more and more tests, so we need to increase the number of shards in macos build to reduce overall CI time.
The macos-15-xlarge image is ARM, which has 5 vCPU cores, but is still 50% faster than the Intel x86 image with 12 vCPUs.
Test time reduced from 1h 37m to 14m.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14048
Reviewed By: archang19
Differential Revision: D84741917
Pulled By: xingbowang
fbshipit-source-id: 9ba9bd696d3b2152f11dec2fb4280572b98233d5
Summary:
Currently in BlockBasedTableIterator's Prepare(), the index lookup for a MultiScan range is expected to return at least 1 data block (unless UDI is in use). This is because there's an implicit assumption that only ranges intersecting with the keys in the file will be prepared. This assumption, however, doesn't hold if there are range deletions and the smallest and/or largest keys in the file extend beyond the keys in the file. The LevelIterator prunes the MultiScan ranges based on the smallest/largest key, so it's possible for a range to only overlap the range deletion portion of the file and not overlap any of the data blocks. Furthermore, the BlockBasedTableIterator is now much more forgiving of Seeks to targets outside of prepared ranges after https://github.com/facebook/rocksdb/issues/14040.
Keeping the above in mind, this PR removes the check in BlockBasedTableIterator for non-empty index result. It adds assertions in LevelIterator to verify that ranges are being properly pruned. Another side effect is we can no longer rely solely on a scan range having 0 data blocks (i.e cur_scan_start_idx >= cur_scan_end_idx) to decide if the iterator is out of bound. We can only do so for all but the last range prepared range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14046
Test Plan:
1. Add unit test in db_iterator_test
2. Run crash test
Reviewed By: xingbowang
Differential Revision: D84623871
Pulled By: anand1976
fbshipit-source-id: 2418e629f92b1c46c555ddea3761140f700819e4
Summary:
The current seek key validation is too strict. This change relaxes it at the block iterator level and adds an additional check at the DB iterator level. The new contract is that when MultiScan is used, after Prepare is called, each following Seek must target the start key of the next prepared scan range, in order. Otherwise, the iterator is set with an error status.
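The contract above can be sketched as follows. This is a minimal, hypothetical illustration (class and method names are not the RocksDB API): after preparing a set of scan ranges, each Seek must match the start key of the next prepared range, in order, or the iterator enters an error state.

```python
# Hypothetical sketch of the prepared-seek contract; names are illustrative.
class PreparedScanIterator:
    def __init__(self, ranges):
        # ranges: list of (start_key, limit) tuples passed to Prepare()
        self.ranges = ranges
        self.next_idx = 0
        self.status = "OK"

    def seek(self, target):
        # A Seek is only valid if it targets the start key of the next
        # prepared range, in order; anything else is an error status.
        if (self.next_idx >= len(self.ranges)
                or target != self.ranges[self.next_idx][0]):
            self.status = ("InvalidArgument: Seek key does not match "
                           "the next prepared scan range")
            return
        self.next_idx += 1

it = PreparedScanIterator([("a", "c"), ("m", "p")])
it.seek("a")   # matches the first prepared range -> OK
it.seek("x")   # not the next prepared start key -> error status
print(it.status)
```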
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14040
Test Plan: Unit test
Reviewed By: anand1976
Differential Revision: D84292297
Pulled By: xingbowang
fbshipit-source-id: 7b31f727e67e7c0bfc53c2f9a6552e0c3d324869
Summary:
Multi scan crash/stress tests are failing when skip_stats_update_on_db_open is true, because LevelIterator::Prepare relies on these stats in FileMetaData to make decisions. Disable it in crash tests until the proper fix is ready.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14039
Reviewed By: archang19
Differential Revision: D84280059
Pulled By: anand1976
fbshipit-source-id: f9f58b94c24d1f455432b05f3bf97f25c7233e3c
Summary:
**Context/Summary:**
There is no way to tag or rate-limit write IO that occurs during FlushWAL() with a priority. Under `Options::manual_wal_flush=true`, it is the major source of write IO during user writes, so we decided to add that support. A new option struct `FlushWALOptions` is introduced to avoid making the API ugly for future new fields.
Also, we can't use the WriteOptions (https://github.com/facebook/rocksdb/blob/main/include/rocksdb/options.h#L2293-L2302) since it is associated with the particular Put/Merge/... call it was passed to, but FlushWAL() can happen after that write; there is no way to carry that write option over in RocksDB. I also avoided using the WriteOptions since it's mostly for live writes.
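The options-struct pattern described above can be sketched like this. The field names here (`sync`, `io_priority`) are assumptions for illustration, not the actual `FlushWALOptions` definition: the point is that bundling parameters in a struct lets future fields be added without changing the function signature.

```python
# Sketch of the options-struct API pattern; field names are assumptions.
from dataclasses import dataclass

@dataclass
class FlushWALOptions:
    sync: bool = False
    # Hypothetical priority tag used to rate-limit the flush IO.
    io_priority: str = "IO_TOTAL"

def flush_wal(opts=None):
    # A real implementation would pass opts.io_priority down to the rate
    # limiter when writing the WAL; here we just echo the options.
    opts = opts or FlushWALOptions()
    return f"flushing WAL (sync={opts.sync}, prio={opts.io_priority})"

print(flush_wal(FlushWALOptions(sync=True, io_priority="IO_LOW")))
```

New knobs can later be appended to the struct with defaults, keeping every existing `flush_wal` call site source-compatible.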
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14037
Test Plan: New UTs `TEST_P(DBRateLimiterOnManualWALFlushTest, ManualWALFlush)`
Reviewed By: archang19
Differential Revision: D84193522
Pulled By: hx235
fbshipit-source-id: 18feb5235672010d19a101ce52c8abdcc4a789f2
Summary:
- Include Status in RemoteCompactionResultMap in SharedState so that we can directly check the status of the remote compaction in `DbStressCompactionService::Wait()`
- If result is empty, populate the result with the status that was returned from `GetRemoteCompactionResult()` so that the status can be bubbled up to the primary (main db thread)
- Get rid of Timeout in `Wait()`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14022
Test Plan:
With fall-back
```
python3 -u tools/db_crashtest.py blackbox --remote_compaction_worker_threads=8 --remote_compaction_failure_fall_back_to_local=1
```
Without fall-back
```
python3 -u tools/db_crashtest.py blackbox --remote_compaction_worker_threads=8 --remote_compaction_failure_fall_back_to_local=0
```
Reviewed By: hx235
Differential Revision: D83789172
Pulled By: jaykorean
fbshipit-source-id: 08f710c4ece5fcc1d4b95b3f9c353831882851b7
Summary:
Fix the binutils truncated download issue by switching to wget in the folly build scripts for downloading dependencies.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14030
Test Plan: make build_folly
Reviewed By: jaykorean
Differential Revision: D84033126
Pulled By: anand1976
fbshipit-source-id: bc6706d7e57c97d6edff149a965aa12c7959825f
Summary:
MultiScan currently doesn't handle delete range properly. In this specific case, a file that contains only delete range operations has an empty index, resulting in BlockBasedTableIterator wrongly concluding from the empty result that a scan doesn't intersect the file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14026
Test Plan: Run crash test
Reviewed By: xingbowang
Differential Revision: D83881266
Pulled By: anand1976
fbshipit-source-id: dc1faa494ea23f36391b700dd1ee0430a1f20ac5
Summary:
When there is an ingested SST file that only contains delete range operations, MultiScan may return the error "Scan does not intersect with file". This is because file selection during Prepare uses the file's smallest and largest keys without considering whether there is any key in the file. This is only a temporary fix.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14028
Test Plan: Unit test
Reviewed By: anand1976
Differential Revision: D83986964
Pulled By: xingbowang
fbshipit-source-id: e0961ca854e2062c2457be4324817ba073ae785d
Summary:
Implicit reseek in the middle of an iteration is not supported with MultiScan. Avoid this for now in crash tests by setting max_sequential_skip_in_iterations to an absurdly high value.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14015
Reviewed By: xingbowang
Differential Revision: D83761612
Pulled By: anand1976
fbshipit-source-id: 16f4e856374b79170c0a79c11c275cbb0fc83a70
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14024
Fix some typos found throughout the codebase
Reviewed By: pdillinger
Differential Revision: D83789182
fbshipit-source-id: feb24d7d47a6faaf735fcfd50dd3ecce4a6c8cd5
Summary:
When inplace_update_support and memtable_verify_per_key_checksum_on_seek are enabled at the same time, they cause a data race in the memtable.
inplace_update_support allows in-place updates of key/value pairs in the memtable.
memtable_verify_per_key_checksum_on_seek performs key checksum verification during seek. It is possible that one thread is updating a key/value pair in place while another thread is reading the same pair for checksum verification during seek.
Therefore, these 2 configurations cannot be enabled at the same time.
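The incompatibility check described above can be sketched as a simple options validation performed at open time. The function and message below are illustrative, not the actual RocksDB validation code:

```python
# Minimal sketch of rejecting an incompatible option combination up front,
# rather than risking a data race between in-place updates and checksum
# reads during seek. Names and message text are assumptions.
def validate_options(inplace_update_support, verify_key_checksum_on_seek):
    if inplace_update_support and verify_key_checksum_on_seek:
        return ("InvalidArgument: inplace_update_support cannot be combined "
                "with per-key checksum verification on seek")
    return "OK"

print(validate_options(True, True))   # incompatible pair -> rejected
print(validate_options(True, False))  # either alone is fine -> OK
```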
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14023
Test Plan: local stress test run stops reporting race condition
Reviewed By: anand1976
Differential Revision: D83812322
Pulled By: xingbowang
fbshipit-source-id: 6cb9f0f3faa8deba97305bfe87266f2fe78e0501
Summary:
In RocksDB 10.6 with https://github.com/facebook/rocksdb/issues/13805, due to inaccurate testing of an async system, it went undetected at the time that LZ4 compression was using more CPU, despite a change to reuse stream objects that dramatically improved LZ4HC compression efficiency.
This change switches to using a basic LZ4 compress API which appears to be faster than all of these:
* Legacy behavior of creating LZ4_stream_t for each compression
* 10.6-10.7 behavior of re-using streams between compressions for the same file (with stream-as-WorkingArea)
* using LZ4's extState APIs without streams (with extState-as-WorkingArea) (data not shown in below results)
Also in this PR: more improvements to sst_dump --recompress, which is arguably the best SST construction benchmark right now, since db_bench seems to be so noisy due to background flush+compaction, even with no compaction (FIFO). Streamlined some output and added an SST read time test, mostly for decompression performance.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14017
Test Plan:
Performance test using sst_dump --recompress with newer sst_dump back-ported to 10.5:
```
./sst_dump --command=recompress --compression_types=kLZ4Compression
test5.sst --compression_level_from=-6 --compression_level_to=-1
```
and with default compression level.
10.5:
```
Cx level: -6 Cx size: 61608137 Write usec: 880404
Cx level: -5 Cx size: 60793749 Write usec: 840903
Cx level: -4 Cx size: 58134030 Write usec: 836365
Cx level: -3 Cx size: 55193773 Write usec: 857113
Cx level: -2 Cx size: 54013891 Write usec: 855642
Cx level: -1 Cx size: 50400393 Write usec: 865194
Cx level: 32767 Cx size: 50400393 Write usec: 886310
```
Before this change (showing the regression, more time, from 10.6):
```
Cx level: -6 Cx size: 61608137 Write usec: 933448
Cx level: -5 Cx size: 60793749 Write usec: 893826
Cx level: -4 Cx size: 58134030 Write usec: 891138
Cx level: -3 Cx size: 55193773 Write usec: 898461
Cx level: -2 Cx size: 54013891 Write usec: 897485
Cx level: -1 Cx size: 50400393 Write usec: 936970
Cx level: 32767 Cx size: 50400393 Write usec: 958764
```
After this change (faster than both the above):
```
Cx level: -6 Cx size: 63641883 Write usec: 874190
Cx level: -5 Cx size: 58860032 Write usec: 834662
Cx level: -4 Cx size: 57150188 Write usec: 832707
Cx level: -3 Cx size: 58791894 Write usec: 850305
Cx level: -2 Cx size: 53145885 Write usec: 839574
Cx level: -1 Cx size: 49809139 Write usec: 845639
Cx level: 32767 Cx size: 49809139 Write usec: 875199
```
Similar tests with dictionary compression show essentially no difference (they need to use the stream APIs, and reuse doesn't seem to matter). LZ4HC is also unaffected (still improved vs. 10.5)
Reviewed By: hx235
Differential Revision: D83722880
Pulled By: pdillinger
fbshipit-source-id: 30149dd187686d5dd98321e6aa7d74bd7653a905
Summary:
Pad block based table based on super block alignment
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13909
Test Plan:
Unit Test
No perf impact observed from the change in the inner loop of flush.
upstream/main branch 202.15 MB/s
```
for i in `seq 1 10`; do ./db_bench --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 >> /tmp/x1 2>&1; done; grep fillseq /tmp/x1 | grep -Po "\d+\.\d+ MB/s" | grep -Po "\d+\.\d+" | awk '{sum+=$1} END {print sum/NR}'
```
After the change without super block alignment 203.44 MB/s
```
for i in `seq 1 10`; do ./db_bench --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 >> /tmp/x1 2>&1; done
```
After the change with super block alignment 204.47 MB/s
```
for i in `seq 1 10`; do ./db_bench --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 --super_block_alignment_size=131072 --super_block_alignment_max_padding_size=4096 >> /tmp/x1 2>&1; done
```
Reviewed By: pdillinger
Differential Revision: D83068913
Pulled By: xingbowang
fbshipit-source-id: eecd65088ab3e9dbc7902aab8c2580f1bc8575df
Summary:
### Context/Summary:
Flow of resuming: DB::OpenAndCompact() -> Compaction progress file -> SubcompactionProgress -> CompactionJob
Flow of persistence: CompactionJob -> SubcompactionProgress -> Compaction progress file -> DB that is called with OpenAndCompact()
This PR focuses on SubcompactionProgress -> CompactionJob and CompactionJob -> SubcompactionProgress -> Compaction progress file. For now only single subcompaction is supported as OpenAndCompact() does not partition compaction anyway.
The actual triggering of progress persistence and resuming (i.e, integration) is through DB::OpenAndCompact() in the upcoming PR.
**Resume Flow**
1. input_iter->Seek(next_internal_key_to_compact) // Position iterator
2. ReadTableProperties() // Validate existing outputs
3. RestoreCompactionOutputs() in CompactionOutputs // Rebuild output file metadata
4. Restore critical statistics about processed input and output records count for verification later
5. AdvanceFileNumbers() // Prevent file number conflicts
6. Continue normal compaction from the positioned iterator, fall back to not resuming the compaction in limited cases, or fail the compaction entirely
**Persistence Strategy**
1. When: At each SST file completion (FinishCompactionOutputFile()). This is the simplest but most expensive frequency. See below for benchmarking and potential follow-up items
2. What: Serialize, write and sync the in-memory SubcompactionProgress to a dedicated manifest-like file
3. For simplicity: Only persist at "clean" boundaries (no overlapping user keys, no range deletions, no timestamp for now)
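The persistence and resume flows above can be sketched as follows. This is a toy model under assumed names (the real progress file is a manifest-like binary format, not JSON): progress is serialized and synced at each output file completion, so a resumed compaction can pick up past what is already done.

```python
# Toy sketch of persist-at-file-completion and resume; field names are
# assumptions based on the description above, not the RocksDB format.
import json
import os
import tempfile

def persist_progress(path, next_key, output_files, processed_records):
    progress = {
        "next_internal_key_to_compact": next_key,
        "output_files": output_files,
        "num_processed_input_records": processed_records,
    }
    # Write then fsync, mimicking the "write and sync" durability step.
    with open(path, "w") as f:
        json.dump(progress, f)
        f.flush()
        os.fsync(f.fileno())

def resume(path):
    # On resume, the saved progress tells us where to Seek the input
    # iterator and which outputs already exist.
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "compaction_progress")
persist_progress(path, "key042", ["000015.sst"], 1000)
print(resume(path)["next_internal_key_to_compact"])
```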
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13983
Test Plan:
- New unit test in CompactionJob level to cover basic compaction progress resumption
- Existing UTs and stress/crash test to test no correctness regression to existing compaction code
- Run benchmark to ensure no performance regression to existing compaction code
```
./db_bench --benchmarks=fillseq[-X10] --db=$db --disable_auto_compactions=true --num=100000 --value_size=25000 --compression_type=none --target_file_size_base=268435456 --write_buffer_size=268435456
```
Pre-PR:
fillseq [AVG 10 runs] : 45127 (± 799) ops/sec; 1076.6 (± 19.1) MB/sec
fillseq [MEDIAN 10 runs] : 45375 ops/sec; 1082.5 MB/sec
Post-PR (regressed 0.057%, ignorable)
fillseq [AVG 10 runs] : 45101 (± 920) ops/sec; 1076.0 (± 22.0) MB/sec
fillseq [MEDIAN 10 runs] : 45385 ops/sec; 1082.8 MB/sec
Reviewed By: jaykorean
Differential Revision: D82889188
Pulled By: hx235
fbshipit-source-id: 8553fd478f134969d331af2c5a125b94bd747268
Summary:
This method will be used to improve the compaction logic by accounting for the tail size, in addition to the data size, when determining when to cut a file.
Problem: Currently the file cutting logic only considers data size when determining where to cut a file, failing to reserve space for index and filter blocks that are added when the file is finalized.
Key changes:
- Add EstimateCurrentIndexSize() to IndexBuilder interface
- Implement in ShortenedIndexBuilder with buffer that accounts for the next index entry. The buffer addresses under-estimation where the current index size doesn't account for the next index entry associated with the data block currently being built. The 2x multiplier bounds the estimate in the right direction and handles outlier cases with large keys.
- Add num_index_entries_ member to track added index entries (== data blocks emitted). This is thread-safe since it's updated/read in the serialized emit step.
Next steps:
- Partitioned index size estimation implementation
- Update compaction file cutting logic to consider index size estimation
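The estimation idea with its 2x buffer can be sketched like this. The formula is an assumption reconstructed from the description above, not the exact RocksDB implementation:

```python
# Rough sketch of index size estimation with an allowance for the pending
# entry of the data block currently being built; the 2x multiplier biases
# the estimate upward to cover outlier cases with large keys.
def estimate_current_index_size(accumulated_entry_bytes, last_entry_bytes):
    return accumulated_entry_bytes + 2 * last_entry_bytes

# 4096 bytes of entries so far, last entry was 32 bytes -> reserve 64 more.
print(estimate_current_index_size(4096, 32))  # -> 4160
```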
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14010
Test Plan: Added a new test class with unit tests for new builder size estimation across all IndexBuilder implementations.
Reviewed By: pdillinger
Differential Revision: D83501741
Pulled By: nmk70
fbshipit-source-id: d58fc2a9e92e12a162f6244d4abd707a9c9e1885
Summary:
This PR fixes a bug in how MultiScan handled a scan range limit falling in the key range between files. The bug was in LevelIterator, where Prepare() relied on FindFile to determine the lower bound file for the range limit. FindFile returns the smallest file index with `range.limit < file.largest_key`. However, that doesn't guarantee that the range overlaps the file, as the `range.limit` could be smaller than `file.smallest_key`.
This also fixes a bug in BlockBasedTableIterator of Valid() returning true even if status() returned error. This was exposed by the previous bug.
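The FindFile pitfall described above can be demonstrated with `bisect` standing in for the binary search: the smallest file whose largest key exceeds the range limit may still not overlap the range, so the file's smallest key must be checked as well.

```python
# Illustration of the lower-bound pitfall; this is a stand-in sketch, not
# the RocksDB FindFile code (which compares internal keys).
import bisect

files = [("a", "c"), ("j", "m")]  # (smallest_key, largest_key) per file
limit = "f"                       # scan range limit between the two files

# First index with file.largest_key > limit, like FindFile's lower bound.
idx = bisect.bisect_right([f[1] for f in files], limit)

# The found file only truly overlaps if the limit reaches its smallest key.
overlaps = idx < len(files) and limit >= files[idx][0]
print(idx, overlaps)  # idx points at ("j", "m"), but "f" < "j": no overlap
```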
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14011
Test Plan: Add unit tests in db_iterator_test and table_test
Reviewed By: cbi42
Differential Revision: D83496439
Pulled By: anand1976
fbshipit-source-id: a9d2d138d69d0c816d9f4160a984b273d00d683f
Summary:
Pretty self-explanatory from the changes, including re-arranging the "COOL" entries for easier tracking of which values are used.
I'm not touching the TICKER_ENUM_MAX issue because IIRC we've gotten in trouble in the past for changing any Java ticker values.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14012
Test Plan: CI, sufficient prompts to get AI to discover the known issues relayed by hx235, to help ensure we found any other outstanding issues.
Reviewed By: hx235
Differential Revision: D83497503
Pulled By: pdillinger
fbshipit-source-id: ec0bd7e28188e0430fb03fc5bd79c2ed7b28f3ad
Summary:
Pass the comparator to UDI interface for both reader and builder.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14001
Test Plan: Unit test
Reviewed By: anand1976
Differential Revision: D83339943
Pulled By: xingbowang
fbshipit-source-id: 7f6541776b0995260e28224329f0cca37f13b3d4
Summary:
Currently BlockBasedTableIterator::Prepare() fails the iterator with a non-OK status if an out-of-range scan option is detected. This is due to the interaction between LevelIterator and BlockBasedTableIterator; see the added comment above BlockBasedTableIterator::Prepare(). This can fail the stress test for L0 files, since they don't use LevelIterator and scan options are not pruned. This PR fixes this by adding an internal option to MultiScanArgs that enables this check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13995
Test Plan:
- new unit test
- stress test that fails before this pr: `python3 -u ./tools/db_crashtest.py whitebox --iterpercent=60 --prefix_size=-1 --prefixpercent=0 --readpercent=0 --test_batches_snapshots=0 --use_multiscan=1 --read_fault_one_in=0 --kill_random_test=88888 --interval=60 --multiscan_use_async_io=0 --mmap_read=0 --level0_file_num_compaction_trigger=20`
Reviewed By: anand1976
Differential Revision: D83166088
Pulled By: cbi42
fbshipit-source-id: 241a7d43c8c00d9a98eea0cabb03d2174d51aae5
Summary:
There can be concurrent reads/writes to fields in `IODebugContext`. One example we have seen is for the `cost_info` field which is of type `std::any`. In fact, in RocksDB's async MultiRead implementation, the same `IODebugContext` is re-used across separate async read requests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13993
Test Plan: Update code which reads/writes to `cost_data` to first acquire shared/exclusive lock on the `mutex` field. There should not be any race conditions when async MultiRead is used.
Reviewed By: pdillinger
Differential Revision: D83091423
Pulled By: archang19
fbshipit-source-id: 4db86d33cf162ed39114b1cd115fcd8964c8ff9b
Summary:
Remove the restriction of only using BytewiseComparator(). In a follow on PR, the UDI interface will be updated to take the Comparator as a parameter.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13999
Test Plan: Add a unit test in table_test.cc
Reviewed By: cbi42
Differential Revision: D83179747
Pulled By: anand1976
fbshipit-source-id: 60222533c71022aa0701ac61c39268d36ca86338
Summary:
In https://github.com/facebook/rocksdb/issues/13964 I changed an expensive DEBUG check in ~AutoHyperClockTable to only run in ASAN builds. It's still expensive so I'm modifying it to scan only about one page beyond what we expect to have written to the anonymous mmap, rather than scanning the whole thing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13998
Test Plan: manually checked that lru_cache_test running time went from 5.0s to 4.0s after the change. Verified that existing unit test ClockCacheTest.Limits uses the full anonymous mmap to be sure it is sized as expected, by temporarily breaking AutoHyperClockTable::Grow() to allow slightly exceeding the anonymous mmap size.
Reviewed By: cbi42
Differential Revision: D83178493
Pulled By: pdillinger
fbshipit-source-id: a2bf093e98bf68b540c073800be7e193021f2692
Summary:
This combination causes MultiScan iteration to fail due to internal reseek by the iterator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13992
Reviewed By: cbi42
Differential Revision: D83094631
Pulled By: anand1976
fbshipit-source-id: 96410747d88de391e6d65857d39063d4fb113d65
Summary:
Fix a bug in "Improve random seed override support in stress test".
The Bug:
`parser.parse_known_args()` is used to parse command line arguments. When it is called without an argument, it uses sys.argv as input. The first element of sys.argv is the command itself, so parse_known_args skips it. Meanwhile, the return value `remain_argv` of `parser.parse_known_args()` does not contain the command. So when `remain_argv` replaces `sys.argv`, its first element is treated as the command and skipped by `parser.parse_known_args()`. In the internal stress test tool, the first argument is `--stress_cmd`, so it was skipped and the default value `./db_stress` was used instead. This is why `./db_stress` showed up in the error message, and also why it works locally, as db_stress is located in the local folder.
The Fix:
When `parser.parse_known_args()` is called the first time, `remain_argv` is saved as a global variable and passed explicitly in the second call, `parser.parse_known_args(remain_argv)`. When arguments are passed to `parser.parse_known_args` directly, the first element is not skipped.
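The pitfall can be reproduced standalone with argparse. The flag values below are made up for illustration; the key point is that the leftover list returned by `parse_known_args` has no program-name placeholder, so treating it like sys.argv drops the first real flag.

```python
# Standalone reproduction of the parse_known_args pitfall described above.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--simple", action="store_true")

# sys.argv-style input: element 0 is the program name and gets skipped.
argv = ["db_crashtest.py", "--stress_cmd=/path/db_stress", "--simple"]
_, remain = parser.parse_known_args(argv[1:])
print(remain)  # the unknown --stress_cmd flag survives the first parse

# The bug: if `remain` replaced sys.argv and parse_known_args() ran again
# with no argument, it would skip remain[0] as if it were the program name.
_, remain2 = parser.parse_known_args(remain[1:])  # simulates that skip
print(remain2)  # the --stress_cmd flag has been silently lost
```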
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13991
Test Plan:
The value of the first argument `--stress_cmd` is parsed correctly and shows up in the error message.
```
/usr/local/bin/python3 -u tools/db_crashtest.py --stress_cmd=/data/sandcastle/boxes/trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/d7db8b24dd42e2db/internal_repo_rocksdb/repo/__db_stress__/db_stress --cleanup_cmd='' --simple blackbox --print_stderr_separately
Start with random seed 11107847853133580500
Running blackbox-crash-test with
interval_between_crash=120
total-duration=6000
Use random seed for iteration 8577470137673434540
Traceback (most recent call last):
File "/home/xbw/workspace/ws1/rocksdb/tools/db_crashtest.py", line 1650, in <module>
main()
File "/home/xbw/workspace/ws1/rocksdb/tools/db_crashtest.py", line 1639, in main
blackbox_crash_main(args, unknown_args)
File "/home/xbw/workspace/ws1/rocksdb/tools/db_crashtest.py", line 1358, in blackbox_crash_main
hit_timeout, retcode, outs, errs = execute_cmd(cmd, cmd_params["interval"])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xbw/workspace/ws1/rocksdb/tools/db_crashtest.py", line 1294, in execute_cmd
child = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/fbcode/platform010/lib/python3.12/subprocess.py", line 1028, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/local/fbcode/platform010/lib/python3.12/subprocess.py", line 1957, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/data/sandcastle/boxes/trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/d7db8b24dd42e2db/internal_repo_rocksdb/repo/__db_stress__/db_stress'
```
Reviewed By: hx235
Differential Revision: D83068960
Pulled By: xingbowang
fbshipit-source-id: 28334d38a444c6f8525444e15f460ec6b257ef38
Summary:
Return a failure status for multi scan if Prepare fails, or if the scan options are unsupported, instead of falling back on a regular scan. This PR also fixes a bug in LevelIterator that caused max_prefetch_size to be ignored.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13974
Test Plan: Add new test in db_iterator_test and table_test
Reviewed By: xingbowang
Differential Revision: D82843944
Pulled By: anand1976
fbshipit-source-id: f12756c40ebd38d8d4e4425e97438b6e766a4663
Summary:
**Context/Summary**
This reverts commit 73432a3f36, because it mysteriously fails our internal CI running with this change to db_crashtest.py. The root cause is unknown, but the error repros frequently with this commit and not with the one before it. The error message suggests that the command parsing leads to the db_stress binary not being found:
```
Traceback (most recent call last):
File "/data/sandcastle/boxes/trunk-hg-full-fbsource/fbcode/internal_repo_rocksdb/repo/tools/db_crashtest.py", line 1638, in <module>
main()
File "/data/sandcastle/boxes/trunk-hg-full-fbsource/fbcode/internal_repo_rocksdb/repo/tools/db_crashtest.py", line 1627, in main
blackbox_crash_main(args, unknown_args)
File "/data/sandcastle/boxes/trunk-hg-full-fbsource/fbcode/internal_repo_rocksdb/repo/tools/db_crashtest.py", line 1347, in blackbox_crash_main
hit_timeout, retcode, outs, errs = execute_cmd(cmd, cmd_params["interval"])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/sandcastle/boxes/trunk-hg-full-fbsource/fbcode/internal_repo_rocksdb/repo/tools/db_crashtest.py", line 1283, in execute_cmd
child = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/fbcode/platform010/lib/python3.12/subprocess.py", line 1028, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/local/fbcode/platform010/lib/python3.12/subprocess.py", line 1957, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: './db_stress'
```
**Test plan**
- Rehearsal crash test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13989
Reviewed By: xingbowang
Differential Revision: D83010751
Pulled By: hx235
fbshipit-source-id: d8cfc70564074065b6bb8a3986d6c1011064dd5e
Summary:
This is causing some internal failure, we decide to revert this for now until we have a proper fix.
This reverts commit 961880b458.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13987
Reviewed By: anand1976
Differential Revision: D82990294
Pulled By: cbi42
fbshipit-source-id: 5f5b4d18d0afe47599738d27e11e3eb2d08d88a0
Summary:
**Context**
Resuming compaction is designed to periodically record the progress of an ongoing compaction and can resume from that saved progress after interruptions such as cancellation, database shutdown, or crashes.
This PR introduces the data structures needed to store subcompaction progress in memory, along with serialization and deserialization support to persist and parse this progress to/from "a manifest-like compaction progress file" (the actual creation of such file is in upcoming PRs).
Flow of resuming: DB::OpenAndCompact() -> Compaction progress file -> SubcompactionProgress -> CompactionJob
Flow of persistence: CompactionJob -> SubcompactionProgress -> Compaction progress file -> DB that is called with OpenAndCompact()
**Summary**
Progress represented by `SubcompactionProgress` will be tracked at the scope of a subcompaction, which is the smallest independent unit of compaction work.
The frequency of recording this progress is once every N compaction output files (to be detailed in future PRs).
When recording, all fields, except for the output files metadata in `SubcompactionProgress`, will directly overwrite the corresponding fields from the last saved progress (See `SubcompactionProgress` and `SubcompactionProgressBuilder` for more).
As a bonus, this PR refactors the file metadata encoding and decoding utilities into two static helper functions, EncodeToNewFile4() and DecodeNewFile4From(), to support subcompaction progress usage.
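The overwrite-vs-accumulate semantics described above can be sketched as a toy builder. The field names are assumptions for illustration, not the actual `SubcompactionProgressBuilder` API:

```python
# Toy sketch: scalar progress fields overwrite the last saved values, while
# output file metadata accumulates across progress records.
class SubcompactionProgressBuilder:
    def __init__(self):
        self.progress = {"next_key": None, "output_files": []}

    def apply(self, record):
        for k, v in record.items():
            if k == "output_files":
                self.progress["output_files"].extend(v)  # accumulate
            else:
                self.progress[k] = v  # overwrite prior value
        return self

b = SubcompactionProgressBuilder()
b.apply({"next_key": "k1", "output_files": ["01.sst"]})
b.apply({"next_key": "k2", "output_files": ["02.sst"]})
print(b.progress)  # next_key overwritten to "k2"; both files retained
```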
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13928
Test Plan:
- Added various `SubcompactionProgressTest` unit tests in version_edit_test.cc to verify basic serialization/deserialization and forward compatibility handling
- Existing UTs and stress/crash test
**Follow up:**
- Move output entry number and file verification to after each file creation so we can remove kNumProcessedOutputRecords persistence support and make resuming compaction work with `paranoid_file_checks=true` (by default false). Output verification will be done before persistence of progress. As long as this follow-up is done before the landing of the integration PR to create the progress file, we can change the manifest-like compaction progress file format freely.
Reviewed By: jaykorean
Differential Revision: D81986583
Pulled By: hx235
fbshipit-source-id: b42766da7d9c2e2f596c892d050c753238d1039f
Summary:
For MultiScan and UDI we now use the bound check from the index iterator, so this assert is removed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13988
Test Plan: existing test
Reviewed By: hx235
Differential Revision: D82993180
Pulled By: cbi42
fbshipit-source-id: 442b2e83cb3aef96fc1a825bf733af9ce59c21c1
Summary:
It is useful to be able to specify output temperatures in the CompactFiles API. For example it may be useful to store small L0 files produced by flushes locally, while larger intra-L0 compactions can store the compacted L0 file remotely.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13955
Test Plan: New unit tests
Reviewed By: jaykorean
Differential Revision: D82492503
Pulled By: joshkang97
fbshipit-source-id: e1225fe572a15d7c5c30a265762b048a4a9e7f0b
Summary:
- updated release note
- updated version to 10.8 in version.h
- added 10.7 to check_format_compatible.sh
- did not update the folly commit hash due to a build failure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13980
Reviewed By: xingbowang
Differential Revision: D82882035
Pulled By: cbi42
fbshipit-source-id: b5e0e78570fdd492d592ee77bd3901e4b39c25fb
Summary:
The test did not consider the ingestion_option settings, which can result in a different error message. This PR fixes the relevant check and ensures we have enough randomness in this test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13979
Test Plan: `gtest-parallel --repeat=20 --workers=20 ./external_sst_file_test --gtest_filter="*VaryingOptions/IngestDBGeneratedFileTest2.NonZeroSeqno/*"`
Reviewed By: hx235
Differential Revision: D82873439
Pulled By: cbi42
fbshipit-source-id: b0d74bf26a502ca3db59b4a0ea9717bf7d027400
Summary:
Start the process of migrating the HCC implementation over to my new system of "bit field atomics" to clean up the code. Here I took on the simplest of the three "bit field atomic" formats in HCC, but ended up moving some things around to end up with less plumbing of definitions and values overall.
In the process, updated BitFields to use the CRTP pattern to simplify some things (see updated example, etc.)
https://en.wikipedia.org/wiki/Curiously_recurring_template_pattern
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13965
Test Plan: existing tests. ClockCacheTest.ClockEvictionEffortCapTest caught a regression during my development, and the crash test has a history of finding subtle HCC bugs.
Reviewed By: xingbowang
Differential Revision: D82669582
Pulled By: pdillinger
fbshipit-source-id: b73dd47361cbe9fbd334413dd4ce01b3c667159e
Summary:
Long wanted, e.g. for easy tab-completion; now implemented.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13978
Test Plan: pretty good unit test updates, manual testing
Reviewed By: cbi42
Differential Revision: D82857671
Pulled By: pdillinger
fbshipit-source-id: d2b63b7d15e61ebf22c58a6ecd3003311e2d03cb
Summary:
* There was a bug where the compression manager would actually not be used for recompress because the options passed to SstFileDumper were not respected. That is now fixed by respecting the Options.
* Refactored SstFileDumper not to take explicit options that could naturally be embedded in Options.
* Report compressed and uncompressed data block sizes (and ratio) instead of total file size (without a useful ratio). Needed to add a new table property to support that.
* Allow --block_size instead of --set_block_size to be consistent with other tools
* Allow --compression_level as shorthand for both _from and _to options, for simplicity and consistency with other tools
* Support --compression_parallel_threads option
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13977
Test Plan:
* sst_dump manual testing
* TableProperties unit tests updated
* Made it much easier to detect when a functional change requires an update to ParseTablePropertiesString() (rather than causing cryptic downstream failures)
Reviewed By: cbi42
Differential Revision: D82841412
Pulled By: pdillinger
fbshipit-source-id: 8d3421be4d2a3e25b7590cd59d204a3779c2a928
Summary:
Currently in MultiScan we only unpin a block after we scan through it. This PR adds unpinning during Seek to release all blocks pinned by the previous scan range. This is useful when users do not scan through the entire scan range. I plan to follow up with support for aborting async IOs from the previous scan.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13972
Test Plan: new test MultiScanUnpinPreviousBlocks validates unpinning behavior
Reviewed By: xingbowang
Differential Revision: D82779504
Pulled By: cbi42
fbshipit-source-id: 17ba7d1e5a6d8ff09ceea57b79c18febfba75584
Summary:
This change adds FFI support for exporting column family checkpoints, basic access to the export/import files metadata, and creating column families by import.
I've been able to successfully use this to [add checkpoint export and import support to `rust-rocksdb`](https://github.com/pcholakov/rust-rocksdb/pull/2), a forked version of which has been successfully used in production for some time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13874
Reviewed By: hx235
Differential Revision: D82343565
Pulled By: jaykorean
fbshipit-source-id: fb4182bdfd5cce10743c021a1ac636fd6ac48df3
Summary:
If there's a static initialization of Options() this could now instantiate an AutoHyperClockTable before kPageSize is initialized. Break the dependency because it's a very minor optimization.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13973
Test Plan: internal CI (not able to reproduce locally)
Reviewed By: hx235
Differential Revision: D82789849
Pulled By: pdillinger
fbshipit-source-id: 3f32b5779a4f56d2071be5aadacda2bf0f4b895d
Summary:
Add a new CF immutable option `paranoid_memory_check_key_checksum_on_seek` that allows additional data integrity validation during seek on the SkipList memtable. When this option is enabled and memtable_protection_bytes_per_key is non-zero, the skiplist-based memtable will validate the checksum of each key visited during a seek operation. The option is opt-in due to performance overhead. This is an enhancement on top of the paranoid_memory_checks option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13902
Test Plan:
* new unit test added for paranoid_memory_check_key_checksum_on_seek=true.
* existing unit test for paranoid_memory_check_key_checksum_on_seek=false.
* enable in stress test.
Performance Benchmark: we check for performance regression in the read path where data is in the memtable only. For each benchmark, the script was run at the same time for main and this PR:
### Memtable-only randomread ops/sec:
* Value size = 100 Bytes
```
for B in 0 1 2 4 8; do (for I in $(seq 1 50);do ./db_bench --benchmarks=fillseq,readrandom --write_buffer_size=268435456 --writes=250000 --value_size=100 --num=250000 --reads=500000 --seed=1723056275 --paranoid_memory_check_key_checksum_on_seek=true --memtable_protection_bytes_per_key=$B 2>&1 | grep "readrandom"; done;) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; done;
```
1. Main: 928999
2. PR with paranoid_memory_check_key_checksum_on_seek=false: 930993 (+0.2%)
3. PR with paranoid_memory_check_key_checksum_on_seek=true:
3.1 memtable_protection_bytes_per_key=1: 464577 (-50%)
3.2 memtable_protection_bytes_per_key=2: 470319 (-49%)
3.3 memtable_protection_bytes_per_key=4: 468457 (-50%)
3.4 memtable_protection_bytes_per_key=8: 465061 (-50%)
* Value size = 1000 Bytes
```
for B in 0 1 2 4 8; do (for I in $(seq 1 50);do ./db_bench --benchmarks=fillseq,readrandom --write_buffer_size=268435456 --writes=250000 --value_size=1000 --num=250000 --reads=500000 --seed=1723056275 --paranoid_memory_check_key_checksum_on_seek=true --memtable_protection_bytes_per_key=$B 2>&1 | grep "readrandom"; done;) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; done;
```
1. Main: 601321
2. PR with paranoid_memory_check_key_checksum_on_seek=false: 607885 (+1.1%)
3. PR with paranoid_memory_check_key_checksum_on_seek=true:
3.1 memtable_protection_bytes_per_key=1: 185742 (-69%)
3.2 memtable_protection_bytes_per_key=2: 177167 (-71%)
3.3 memtable_protection_bytes_per_key=4: 185908 (-69%)
3.4 memtable_protection_bytes_per_key=8: 183639 (-69%)
Reviewed By: pdillinger
Differential Revision: D81199245
Pulled By: xingbowang
fbshipit-source-id: e3c29552ab92f2c5f360361366a293fa26934913
Summary:
Force caller of MultiScanArgs to pass comparator. Pass comparator from CF handle to MultiScanArgs in NewMultiScan.
Expand MultiScanArgs unit test with different comparator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13970
Test Plan: unit test
Reviewed By: cbi42
Differential Revision: D82739270
Pulled By: xingbowang
fbshipit-source-id: e709f4a333ad547c0ba6d24d8fb2b22e50e8a12f
Summary:
**Context/Summary:**
`Status::state` can be nullptr when created with no specific error message. Calling std::strstr on nullptr caused a segfault in our stress test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13968
Test Plan: Monitor stress test
Reviewed By: jaykorean
Differential Revision: D82695541
Pulled By: hx235
fbshipit-source-id: cf08f70163a9ee6c911cdc3a3d79acd3429f0d15
Summary:
After seeing more people hit issues with thrashing small LRUCache shards and AutoHCC running fully in production for a while on a very large service, here I make these updates:
* In the public API, mark the case of `estimated_entry_charge = 0` (which is how you select AutoHCC) as production-ready and generally preferred. That means devoting a lot less space to how to tune FixedHCC (`estimated_entry_charge > 0`) because it is not generally recommended anymore even though in theory it is the fastest (conditional on a fragile configuration).
* In the public API, add more detail about potential problems with LRUCache and explicitly endorse HCC.
* When a default block cache is created, use AutoHCC instead of LRUCache. It's still a 32MB cache but that's just one cache shard for AutoHCC so the risk of issues with small cache shards is dramatically reduced. And a single AutoHCC shard is still essentially wait-free.
* Improve the handling of the hypothetical scenario of a failed anonymous mmap. This is hardly a concern for 64-bit Linux and likely most other OSes. It would in theory be possible to fall back on LRUCache in that case but the code structure makes that annoying/challenging. Instead we crash with an appropriate message.
* Cleaned up some includes
* Fixed some previously unreported leaks (better assertions on HCC perhaps, some subtle behavior changes)
* Added a new mode to cache_bench (detailed below)
* Avoid a particularly costly sanity check in `~AutoHyperClockTable()` even in debug builds so that unit testing, etc., isn't bogged down, except keep it in ASAN build.
Planned follow-up:
* Update HCC implementation to use my new "bit field atomics" API introduced in https://github.com/facebook/rocksdb/issues/13910 to make it easier to read and maintain
Possible follow-up:
* Re-engineer table cache to use AutoHCC also, instead of LRUCache and a single mutex to ensure no duplication across threads. (a) Pad table cache key to 128 bits for AutoHCC. (b) Stripe/shard the no-duplication mutex. (HCC's consistency model is too weak for concurrent threads to use its API to agree on a winner, even if entries could be inserted in an "open in progress" state.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13964
Test Plan:
existing tests. ClockCacheTest.ClockEvictionEffortCapTest caught a regression during my development, and the crash test has a history of finding subtle HCC bugs.
## Performance
Although we've validated AutoHCC performance under high load, etc., before we haven't really considered whether there will be unacceptable overheads for small DBs and CFs, e.g. in unit tests. For this, I have added a new mode to cache_bench: with the -stress_cache_instances=n parameter, it will create and destroy n empty cache instances several times. In the debug build, this found that a particular check in `~AutoHyperClockTable()` was extremely costly for short-lived caches (fixed). Beyond that, we can answer the question of whether it is feasible for a single process to host 1000 DBs each with 1000 CFs with default block cache instances, after moving LRUCache -> AutoHCC, for example:
```
/usr/bin/time ./cache_bench -stress_cache_instances=1000000 -cache_type=auto_hyper_clock_cache -cache_size=33554432
```
Release build:
Average 9.8 us per 32MB LRUCache creation, 2.9 us per destruction, 24.6GB max RSS (~25KB each)
->
Average 4.3 us per 32MB AutoHCC creation, 4.9 us per destruction, 4.8GB max RSS (~5KB each)
Debug build:
Average 10.9 us per 32MB LRUCache creation, 3.5 us per destruction, 28.7GB max RSS (~29KB each)
->
Average 4.5 us per 32MB AutoHCC creation, 4.9 us per destruction, 4.7GB max RSS (~5KB each)
Despite the anonymous mmaps, it's apparently more efficient for default/small/empty structures. This is likely due to the dramatically lower number of cache shards at this size. If we switch to `-stress_cache_instances=10000 -cache_size=1073741824`:
Release build:
Average 10.6 us per 1GB LRUCache, 2.8 us per destruction, 2.3 GB max RSS (~230KB each)
->
Average 130 us per 1GB AutoHCC creation, 153 us per destruction, 1.5 GB max RSS (~150KB each)
Debug build:
Average 11.2 us per 1GB LRUCache, 3.6 us per destruction, 2.4 GB max RSS (~240KB each)
->
Average 130 us per 1GB AutoHCC creation, 150 us per destruction, 1.6 GB max RSS (~160KB each)
Here it's clear that we are paying a price in time for setting up all those mmaps for the larger number of cache shards and potential table growth, even though the RSS is well under control. However, I am not concerned about this at all, as it's unlikely to slow down anything notably such as unit tests. Before and after full testsuite runs confirm:
3327.73user 5188.71system 3:38.88elapsed -> 3312.07user 5704.77system 3:41.61elapsed
There is increased kernel time but acceptable. With ASAN+UBSAN:
11618.70user 15671.30system 5:54.68elapsed -> 12595.81user 16159.67system 6:32.77elapsed
Acceptable given that our ASAN+UBSAN builds are not the slowest in CI
Reviewed By: hx235
Differential Revision: D82661067
Pulled By: pdillinger
fbshipit-source-id: ab25c766ca70f2b8664849c2a838b9e1b4e72d3b
Summary:
When ingesting a DB-generated file with a non-zero sequence number, we need the smallest seqno of each file for its file metadata. To avoid a full table scan, we record this information in a table property and use it during file ingestion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13942
Test Plan: new unit test and updated existing unit test.
Reviewed By: hx235
Differential Revision: D82331802
Pulled By: cbi42
fbshipit-source-id: 3009a6801ca7092cd0fde33692db1a13567068a9
Summary:
This PR fixes a bug in BlockBasedTableIterator::Prepare in conjunction with a user defined index (UDI). If the UDI determines a scan range to be empty and thus returns the kOutOfBound iteration result during Seek, the iteration result is not propagated up and Prepare() assumes end of file and aborts the remaining scans. This results in incorrect behavior and unpredictable multi scan results.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13960
Test Plan: Add unit test to table_test.cc
Reviewed By: xingbowang
Differential Revision: D82590892
Pulled By: anand1976
fbshipit-source-id: 8cfaaae2bb1a9509ddf8ec967cb8a8801748413d
Summary:
* Fix compaction/flush CPU usage stats to include CPU usage by parallel compression workers. (Validated with manual db_bench testing.)
* Disable the parallel compression framework when compression is disabled. See new code comment for details, because in theory it could be useful to hide SST write latency, but manual testing with db_bench and -rate_limiter_bytes_per_sec or -simulate_hdd options shows no useful increase in throughput, just more CPU usage.
* Fix some minor clean-up items in the implementation
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13959
Test Plan: Also ran some tests like in https://github.com/facebook/rocksdb/issues/13910 to ensure the new CPU usage tracking did not regress performance, all good.
Reviewed By: xingbowang
Differential Revision: D82556686
Pulled By: pdillinger
fbshipit-source-id: 77c522159a7e6ab0ab6f7fb1d662070a46661557
Summary:
The stress test runs concurrent transactions through many threads at the same time on a shared key space. It is possible that a deadlock or a timeout is detected at the TransactionDB layer. When this happens, simply return from the function and continue the test, instead of failing the test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13950
Test Plan: Stress test pass locally with the same random seed from stress test 14723229280871643749.
Reviewed By: hx235
Differential Revision: D82373959
Pulled By: xingbowang
fbshipit-source-id: 5d72e89998171c5844fb22f13d8f061f81014c7d
Summary:
... reporting false positive double-lock on some of the new parallel compression code. Switching from std::condition_variable to condition_variable_any simply changes the FP from double-lock to lock inversion. In addition, leaking ParallelCompressionRep instances to avoid memory location reuse fails to fix the FP reports. Thus, I've decided to disable the watchdog with GCC+TSAN.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13958
Test Plan: local crash test runs could reproduce, now don't reproduce. CLANG TSAN doesn't seem to be reporting the same supposed issues
Reviewed By: xingbowang
Differential Revision: D82555968
Pulled By: pdillinger
fbshipit-source-id: 537fbc3a787f917915a6faf0bdedd1449a7f378a
Summary:
Complete redo of parallel compression in block_based_table_builder.cc to greatly reduce cross-thread hand-off and blocking. A ring buffer of blocks-in-progress is used to essentially bound working memory while enabling high throughput. Unlike before, all threads can participate in compression work, for a kind of work-stealing algorithm that reduces the need for threads to block. This builds on improvements in https://github.com/facebook/rocksdb/pull/13850
Previously, there was either
* parallel_threads==1, the *emit thread* (caller from flush/compaction) doing all the work
* parallel_threads > 1, the emit thread generates uncompressed blocks, `parallel_threads` worker threads compress blocks, and a writer thread writes to the SST file. Total of `parallel_threads + 2` threads participating. (Other bookkeeping in emit and write steps omitted from description for simplicity.)
Now we have either
* parallel_threads==1 (same), the emit thread doing all the work
* parallel_threads > 1, the emit thread generates uncompressed blocks and can take up compression work when the ring buffer is full; `parallel_threads` worker threads have as their top priority to write compressed blocks to the SST file but also take up compression work in priority order of next-to-write. Total of `parallel_threads + 1` threads participating. In some cases, this could result in less throughput than before, but arguably the previous implementation was using more threads than explicitly allowed.
## Future/alternate considerations
Although we could likely have used some framework for micro-work sharing across threads, that could be difficult with the asymmetry of work loads and thread affinity. Specifically, (a) it would be quite challenging to allow emit work in other threads, because it happens in the caller of BlockBasedTableBuilder, (b) async programming is unlikely to pay off until we have an async interface for writing SST files, and (c) this implementation will nevertheless serve as a benchmark for what we lose or gain in such a framework vs. a hand-tuned system.
This implementation still creates and destroys threads for each SST file created. We hope in the future to have more governance and/or pooling of worker threads across various flushes and compactions, but that is not available currently and would require significant design and implementation work.
## More details
* This implementation makes use of semaphores for idling and re-waking threads. `std::counting_semaphore` and `binary_semaphore` offer the best performance (see benchmark results below) but some implementations are known to have correctness bugs. Also, my attempt at upgrading CI for C++20 support (required for these) in https://github.com/facebook/rocksdb/pull/13904 is actually incomplete. Therefore, using these structures is opt-in with `-DROCKSDB_USE_STD_SEMAPHORES` at compile time, and a naive semaphore implementation based on mutex and condvar is used by default. A folly alternative (folly::fibers::Semaphore) was dropped in during development and found to be less efficient than the naive implementation. One CI job is upgraded to test with the new opt-in.
* One of the biggest concerns about correctness/reliability for this implementation is the possibility of hitting a deadlock, in part because that is not well checked in the DB crash test (a challenging problem!). Note also that with the parallel compression improvements in this release, I am calling the feature production-ready, so there is an extra level of confidence needed in the reliability of the feature. Thus, for DEBUG builds including crash test, I have added a watchdog thread to each parallel SST construction that heuristically checks for the most likely kinds of deadlock that could happen, including for the case of buggy semaphore implementations. It periodically verifies that some thread is outside of its "idle" state, and if the watchdog wakes up repeatedly to see all live threads stuck in their idle state (even if wake-up was attempted) then it declares a deadlock. This feature was manually verified for several seeded deadlock bugs. (More details in code comments.)
* For CPU efficiency, this implementation greatly simplifies the logic to estimate the outstanding or "inflight" size not yet written to the SST file. I expect this size to generally be insignificant relative to the full SST file size so is not worth careful engineering. And based on Meta's current needs, landing under-size for an SST file is better than over-size. See comments on `estimated_inflight_size` for details.
* Some other existing atomics in block_based_table_builder.cc modified to use safe atomic wrappers.
* Status handling in BlockBasedTableBuilder was streamlined to get rid of essentially redundant `status`+`io_status` fields and associated code. Made small optimizations to reduce unnecessary IOStatus copies (with StatusOk()) and mark status conditional branches as LIKELY or UNLIKELY.
* Prefer inline field initialization to initialization in constructor.
* Minimize references to the `parallel_threads` configuration parameter for better separation of concerns / sanitization / etc. For example, use non-nullity of `pc_rep` to indicate that parallel compression is enabled (and active).
* Some other refactoring to aid the new implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13910
Test Plan:
## Correctness
Already integrated into unit tests and crash test. CI updated for opt-in semaphore implementation. Basic semaphore unit tests added/updated.
As for the tremendous simplification of logic relating to hitting target SST file size, as expected, the new behavior could under-shoot the single-threaded behavior by a small number of blocks, which will typically affect the file size by ~1/1000th or less. I think that's a good trade-off for cutting out unnecessarily complex code with non-trivial CPU cost (FileSizeEstimator).
```
./db_bench -db=/dev/shm/dbbench_filesize_after8 -benchmarks=fillseq,compact -num=10000000 -compression_type=zstd -compression_level=8 -compression_parallel_threads=8
```
Before (PT=8 and PT=1) and After (PT=1) are the same or very similar:
```
-rw-r--r-- 1 peterd users 67474097 Sep 12 15:32 000052.sst
-rw-r--r-- 1 peterd users 67474214 Sep 12 15:32 000053.sst
-rw-r--r-- 1 peterd users 67473834 Sep 12 15:32 000054.sst
-rw-r--r-- 1 peterd users 67473437 Sep 12 15:32 000055.sst
-rw-r--r-- 1 peterd users 67473835 Sep 12 15:32 000056.sst
-rw-r--r-- 1 peterd users 67473204 Sep 12 15:33 000057.sst
-rw-r--r-- 1 peterd users 67473294 Sep 12 15:33 000058.sst
-rw-r--r-- 1 peterd users 67473839 Sep 12 15:33 000059.sst
```
After, PT=8 (worst case here ~0.05% smaller)
```
-rw-r--r-- 1 peterd users 67463189 Sep 12 14:55 000052.sst
-rw-r--r-- 1 peterd users 67465233 Sep 12 14:55 000053.sst
-rw-r--r-- 1 peterd users 67466822 Sep 12 14:55 000054.sst
-rw-r--r-- 1 peterd users 67466221 Sep 12 14:55 000055.sst
-rw-r--r-- 1 peterd users 67441675 Sep 12 14:55 000056.sst
-rw-r--r-- 1 peterd users 67467855 Sep 12 14:55 000057.sst
-rw-r--r-- 1 peterd users 67455132 Sep 12 14:55 000058.sst
-rw-r--r-- 1 peterd users 67458334 Sep 12 14:55 000059.sst
```
## Performance, modest load
We are primarily interested in balancing throughput in building SST files and CPU usage in doing so. (For example, we could maximize throughput by having worker threads only spin waiting for work, but that would likely be extra CPU usage we want to avoid to allow other productive CPU work to be scheduled.) No read path code has been touched.
A benchmark script running "before" and "after" configurations at the same time to minimize random machine load effects:
```
$ SUFFIX=`tty | sed 's|/|_|g'`; for CT in none lz4 zstd; do for PT in 1 2 3 4 6 8; do echo -n "$CT pt=$PT -> "; (for I in `seq 1 10`; do BIN=/tmp/dbbench${SUFFIX}.bin; rm -f $BIN; cp db_bench $BIN; /usr/bin/time $BIN -db=/dev/shm/dbbench$SUFFIX --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 -compression_type=$CT -compression_parallel_threads=$PT 2>&1; done) | awk '/micros.op/ {n++; sum += $5;} /system / { cpu += $1 + $2; } END { print "ops/s: " int(sum/n) " cpu*s: " cpu; }'; done; done
```
Before this change:
```
none pt=1 -> ops/s: 1999603 cpu*s: 72.08
none pt=2 -> ops/s: 1871094 cpu*s: 148.3
none pt=3 -> ops/s: 1882907 cpu*s: 147.7
lz4 pt=1 -> ops/s: 1987858 cpu*s: 94.74
lz4 pt=2 -> ops/s: 1590192 cpu*s: 182.65
lz4 pt=3 -> ops/s: 1896294 cpu*s: 174.7
lz4 pt=4 -> ops/s: 1949174 cpu*s: 172.26
lz4 pt=6 -> ops/s: 1912517 cpu*s: 175.91
lz4 pt=8 -> ops/s: 1930585 cpu*s: 176.71
zstd pt=1 -> ops/s: 1239379 cpu*s: 129.85
zstd pt=2 -> ops/s: 1171742 cpu*s: 226.12
zstd pt=3 -> ops/s: 1832574 cpu*s: 214.21
zstd pt=4 -> ops/s: 1887124 cpu*s: 212.51
zstd pt=6 -> ops/s: 1920936 cpu*s: 211.7
zstd pt=8 -> ops/s: 1885544 cpu*s: 214.87
```
After this change:
```
none pt=1 -> ops/s: 1964361 cpu*s: 72.66
none pt=2 -> ops/s: 1914033 cpu*s: 104.95
none pt=3 -> ops/s: 1978567 cpu*s: 100.24
lz4 pt=1 -> ops/s: 2041703 cpu*s: 92.88
lz4 pt=2 -> ops/s: 1903210 cpu*s: 121.64
lz4 pt=3 -> ops/s: 1973906 cpu*s: 122.22
lz4 pt=4 -> ops/s: 1952605 cpu*s: 123.05
lz4 pt=6 -> ops/s: 1957524 cpu*s: 124.31
lz4 pt=8 -> ops/s: 1986274 cpu*s: 129.06
zstd pt=1 -> ops/s: 1233748 cpu*s: 130.43
zstd pt=2 -> ops/s: 1675226 cpu*s: 158.41
zstd pt=3 -> ops/s: 1929878 cpu*s: 159.77
zstd pt=4 -> ops/s: 1916403 cpu*s: 160.99
zstd pt=6 -> ops/s: 1942526 cpu*s: 166.21
zstd pt=8 -> ops/s: 1966704 cpu*s: 171.56
```
For parallel_threads=1, results are very similar, as expected.
For parallel_threads>1, throughput is usually improved a bit, but cpu consumption is dramatically reduced. For zstd, maximum throughput is essentially achieved with pt=3 rather than the previous roughly pt=4 to 6. And the old used about 30% more CPU.
We can also compare with more expensive compression by raising the compression level.
```
SUFFIX=`tty | sed 's|/|_|g'`; CT=zstd; for CL in 4 6 8; do for PT in 1 4 8; do echo -n "$CT@$CL pt=$PT -> "; (for I in `seq 1 10`; do BIN=/tmp/dbbench${SUFFIX}.bin; rm -f $BIN; cp db_bench $BIN; /usr/bin/time $BIN -db=/dev/shm/dbbench$SUFFIX --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 -compression_type=$CT -compression_parallel_threads=$PT -compression_level=$CL 2>&1; done) | awk '/micros.op/ {n++; sum += $5;} /system / { cpu += $1 + $2; } END { print "ops/s: " int(sum/n) " cpu*s: " cpu; }'; done; done
```
Before:
```
zstd@4 pt=1 -> ops/s: 883630 cpu*s: 161.12
zstd@4 pt=4 -> ops/s: 1878206 cpu*s: 243.25
zstd@4 pt=8 -> ops/s: 1885002 cpu*s: 245.89
zstd@6 pt=1 -> ops/s: 710767 cpu*s: 189.44
zstd@6 pt=4 -> ops/s: 1706377 cpu*s: 277.29
zstd@6 pt=8 -> ops/s: 1866736 cpu*s: 275.07
zstd@8 pt=1 -> ops/s: 529047 cpu*s: 237.87
zstd@8 pt=4 -> ops/s: 1401379 cpu*s: 330.61
zstd@8 pt=8 -> ops/s: 1895601 cpu*s: 321.59
```
After:
```
zstd@4 pt=1 -> ops/s: 889905 cpu*s: 161.03
zstd@4 pt=4 -> ops/s: 1942240 cpu*s: 193.18
zstd@4 pt=8 -> ops/s: 1922367 cpu*s: 205.21
zstd@6 pt=1 -> ops/s: 713870 cpu*s: 188.91
zstd@6 pt=4 -> ops/s: 1832314 cpu*s: 219.66
zstd@6 pt=8 -> ops/s: 1949631 cpu*s: 229.34
zstd@8 pt=1 -> ops/s: 530324 cpu*s: 238.02
zstd@8 pt=4 -> ops/s: 1479767 cpu*s: 271.65
zstd@8 pt=8 -> ops/s: 1949631 cpu*s: 275.6
```
And we can also look at the cumulative effect of this change and https://github.com/facebook/rocksdb/pull/13850 that will combine for the parallel compression improvements in the upcoming 10.7 release:
Before both:
```
lz4 pt=1 -> ops/s: 1954445 cpu*s: 95.14
lz4 pt=3 -> ops/s: 1687043 cpu*s: 186.62
lz4 pt=5 -> ops/s: 1708196 cpu*s: 188.33
zstd pt=1 -> ops/s: 1220649 cpu*s: 131.2
zstd pt=3 -> ops/s: 1658100 cpu*s: 227.08
zstd pt=5 -> ops/s: 1685074 cpu*s: 226.08
```
After:
```
lz4 pt=1 -> ops/s: 2048214 cpu*s: 93.24
lz4 pt=3 -> ops/s: 1922049 cpu*s: 122.9
lz4 pt=5 -> ops/s: 1980165 cpu*s: 122.49
zstd pt=1 -> ops/s: 1245165 cpu*s: 128.84
zstd pt=3 -> ops/s: 1956961 cpu*s: 158.73
zstd pt=5 -> ops/s: 1970458 cpu*s: 161.02
```
In summary, before with zstd default level, you could see only
* about 38% increase in throughput for about 73% increase in CPU usage
Now you can get
* about 58% increase in throughput for about 25% increase in CPU usage
## Performance, high load
To validate this for usage on remote compaction workers, we also need to test whether it falls over at high load or anything concerning like that. For this I did a lot of testing with concurrent db_bench and zstd compression_level=8 and parallel_thread (PT) in {1,8} trying to observe "bad" behaviors such as stalls due to preempted threads and such. On a 166 core machine where a "job" is a db_bench process running a fillseq benchmark similar to above in parallel with others, I could summarize the results like this:
10 jobs PT=8 vs. PT=1 -> 12% more CPU usage, 75% reduction in wall time, 1.9 jobs/sec (vs. 0.5)
50 jobs PT=8 vs. PT=1 -> 89% more CPU usage, 27% reduction in wall time, 3.1 jobs/sec (vs. 2.3)
100 jobs PT=8 vs. PT=1 -> 24% more CPU usage, 5% reduction in wall time, 3.25 jobs/sec (vs. 3.1)
150 jobs PT=8 vs. PT=1 -> 4% more CPU usage, 2% increase in wall time, 3.3 jobs/sec (vs. 3.4)
500 jobs PT=8 vs. PT=1 -> 1% more CPU usage, insignificant difference in wall time, 3.3 jobs/sec
Even when there are 4000 threads potentially competing for 166 cores, the throughput (3.3 jobs / sec) is still very close to maximum (3.4). Enabling parallel compression didn't result in notably less throughput (based on wall clock time for all jobs to complete) in any case tested above, and much higher throughput for many cases. If parallel compression causes us to tip from comfortably under-saturating to over-saturating the cores (as in the 50 jobs case), the overall CPU usage can be much higher, presumably due to lower CPU cache hit rates and maybe clock throttling, but parallel compression still has the throughput advantage in those cases.
In other words, what would we stand to gain from being able to intelligently share worker threads between compaction jobs? It doesn't seem that much.
Reviewed By: xingbowang
Differential Revision: D81365623
Pulled By: pdillinger
fbshipit-source-id: 5db5151a959b5d25b84dbe185bc208bd188f2d1c
Summary:
We saw some crash test failures at f46242cef6/table/block_based/block_based_table_iterator.cc (L964-L965). This is likely due to the timestamp not being considered properly in some places in the MultiScan code paths. This PR fixes the issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13938
Test Plan: crash test with timestamp and multiscan: `python3 -u ./tools/db_crashtest.py whitebox --enable_ts --iterpercent=60 --prefix_size=-1 --prefixpercent=0 --readpercent=0 --test_batches_snapshots=0 --use_multiscan=1 --read_fault_one_in=0 --kill_random_test=88888 --interval=60`
Reviewed By: anand1976
Differential Revision: D82175263
Pulled By: cbi42
fbshipit-source-id: 5d40ede1aec15f8faeaa7fd041b939e68611ff73
Summary:
This PR enables Stress Test to fall back to local compaction when a remote compaction fails, allowing the compaction to be retried on the main thread.
If the local compaction succeeds, the stress test will continue without failing. The main thread will log that the remote compaction failed and was retried locally, while detailed failure logs from the remote compaction attempt will still be printed by the worker thread for further investigation.
This approach allows us to keep collecting useful logs for diagnosing remote compaction failures in Stress Test, while ensuring the test continues to run with remote compaction enabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13945
Test Plan:
```
python3 -u tools/db_crashtest.py --cleanup_cmd='' --simple blackbox --remote_compaction_worker_threads=8 --interval=10
```
# Internal Only
https://www.internalfb.com/sandcastle/workflow/1315051091202224133
https://www.internalfb.com/sandcastle/workflow/3382203320165521367
https://www.internalfb.com/sandcastle/workflow/2616591383512372892
https://www.internalfb.com/sandcastle/workflow/4607182418810099066
Reviewed By: hx235
Differential Revision: D82279337
Pulled By: jaykorean
fbshipit-source-id: 6f663ec2eeb642fd4ad885a90efb344432a32f89
Summary:
**Context/Summary:**
Internally `CompactionJobStats::num_input_records` is only used for input record count [verification](1aca60c089/db/compaction/compaction_job.cc (L2535)) and such verification always checks for `CompactionJobStats::has_num_input_records` (now renamed) before using this field. This is needed because `CompactionJobStats::num_input_records` gets its number from `CompactionIterator::NumInputEntryScanned()` in a subcompaction and this number can be inaccurate purposefully to increase performance, see [CompactionIterator::must_count_input_entries](https://github.com/facebook/rocksdb/pull/13929/files#diff-e6c876f655a21865c0f3dff94b9763f1bd40cf88a8a86f04868201b2e845a890R186-R199) for more.
- This PR renames the `CompactionJobStats::has_num_input_records` to more explicit naming and adds more comments. Not a behavior change.
Also, aggregation of `CompactionJobStats::has_num_input_records` among all subcompactions is done by an [AND](1aca60c089/util/compaction_job_stats_impl.cc (L62)) operation, so it's false if any of the subcompactions has this field set to false. The default value of this field should be "true" so that it is not mistakenly false by default. We are currently fine because `CompactionJobStats::Reset()`, which [sets the value to true](1aca60c089/util/compaction_job_stats_impl.cc (L14)), is always called before such aggregation.
- This PR changes the default value to be true.
- Resumable compaction development plans to set `CompactionJobStats::has_num_input_records` to false if the previous compaction carries inaccurate records. In order for this not to be overwritten by the subsequent progress in [here](1aca60c089/db/compaction/compaction_job.cc (L1540-L1543)), this PR also changes the assignment (`=`) to an AND operation and `+=`. With the default value of `CompactionJobStats::has_num_input_records` now being true (or Reset() already called) and `CompactionJobStats::num_input_records=0` already, this is not a behavior change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13929
Test Plan: - Existing UT to test "...changes the default value to be true" is safe.
Reviewed By: jaykorean
Differential Revision: D82014912
Pulled By: hx235
fbshipit-source-id: 6f211c3b2c9eb7d39abf37271d21a4d3f407b934
Summary:
We should add error logging to be able to pinpoint why RocksDB is returning status `NotSupported` for `ReadAsync`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13936
Test Plan: Look at logs (and client logs of error status)
Reviewed By: anand1976
Differential Revision: D82141529
Pulled By: archang19
fbshipit-source-id: c71b70967457be35ef5168321d449f96b2b9441d
Summary:
Fix an uninitialized-value complaint in valgrind caused by gtest printing a padded struct.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13934
Test Plan: CI. Verified that valgrind no longer complains about it.
Reviewed By: pdillinger
Differential Revision: D82124983
Pulled By: xingbowang
fbshipit-source-id: 99eb7bab99726c45affe0a231777e5951844d73b
Summary:
... and associated statistics, etc. Someone needs it, so here it is.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13927
Test Plan: Updated / extended / added some unit tests
Reviewed By: cbi42
Differential Revision: D81981469
Pulled By: pdillinger
fbshipit-source-id: 52558c08741890b781310906acbc18d9eb479363
Summary:
There are some internal use cases that do not map cleanly onto the existing `IOActivity` enums. This PR creates new custom IOActivity types that internal users can use as they see fit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13924
Test Plan: Wrote a simple unit test
Reviewed By: pdillinger
Differential Revision: D82029992
Pulled By: archang19
fbshipit-source-id: a3e23c360baa96cd2e9adf570e71c6e43947bfc8
Summary:
PointLockManager manages point locks per key. The old implementation partitions the per-key locks into 16 stripes; each stripe handles the point locks for a subset of keys and has only one conditional variable. This conditional variable is used by all the transactions that are waiting for their turn to acquire a lock on a key that belongs to this stripe.
In production, we noticed that when there are multiple transactions trying to write to the same key, all of them wait on the same conditional variable. When the previous lock holder releases the key, all of the waiting transactions are woken up, but only one of them can proceed, and the rest go back to sleep. This wastes a lot of CPU cycles. In addition, when other keys are being locked/unlocked on the same lock stripe, the problem becomes even worse.
In order to solve this issue, we implemented a new PerKeyPointLockManager that keeps a transaction waiter queue at the per-key level. When a transaction cannot acquire a lock immediately, it joins the waiter queue of the key and waits on a dedicated conditional variable. When the previous lock holder releases the lock, it wakes up the next set of transactions from the waiting queue that are eligible to acquire the lock. The queue respects FIFO order, except that it prioritizes lock upgrade/downgrade operations.
However, this waiter queue change increases the deadlock detection cost, because transactions waiting in the queue also need to be considered during deadlock detection. To resolve this issue, a new deadlock_timeout_us (microseconds) configuration is introduced in the transaction options. Essentially, when a transaction is waiting on a lock, it joins the wait queue and waits for the duration configured by deadlock_timeout_us without performing deadlock detection. If the transaction didn't get the lock by the time deadlock_timeout_us is reached, it then performs deadlock detection and waits until lock_timeout is reached. This optimization relies on the heuristic that the majority of transactions are able to get the lock without performing deadlock detection.
The deadlock_timeout_us configuration needs to be tuned for different workloads. If the likelihood of deadlock is very low, deadlock_timeout_us can be configured to be a bit higher than the average transaction execution time, so that the majority of transactions acquire the lock without performing deadlock detection. If the likelihood of deadlock is high, deadlock_timeout_us can be configured with a lower value, so that deadlocks get detected faster.
The new PerKeyPointLockManager is disabled by default. It can be enabled via TransactionDBOptions.use_per_key_point_lock_mgr. The deadlock_timeout_us setting is only effective when PerKeyPointLockManager is used. When deadlock_timeout_us is set to 0, transactions perform deadlock detection immediately before waiting.
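The per-key waiter-queue idea can be sketched standalone as follows. This is illustrative only: the class and method names are invented (it is not the actual PerKeyPointLockManager), it omits shared/upgrade locks and deadlock detection, and a single mutex guards the whole table where a real implementation would shard it.

```cpp
#include <condition_variable>
#include <map>
#include <mutex>
#include <string>

// Illustrative sketch: one waiter count and condition variable per key, so
// releasing a lock wakes only waiters for that key instead of every waiter
// in a multi-key stripe.
class PerKeyLockSketch {
 public:
  void Lock(const std::string& key) {
    std::unique_lock<std::mutex> l(mu_);
    Entry& e = keys_[key];  // creates the entry on first use
    ++e.waiters;
    e.cv.wait(l, [&e] { return !e.held; });  // only this key's waiters sleep here
    --e.waiters;
    e.held = true;
  }

  void Unlock(const std::string& key) {
    std::lock_guard<std::mutex> l(mu_);
    auto it = keys_.find(key);
    it->second.held = false;
    if (it->second.waiters > 0) {
      it->second.cv.notify_one();  // wake exactly one waiter for this key
    } else {
      keys_.erase(it);  // no waiters left: reclaim the entry
    }
  }

  bool IsHeld(const std::string& key) {
    std::lock_guard<std::mutex> l(mu_);
    auto it = keys_.find(key);
    return it != keys_.end() && it->second.held;
  }

 private:
  struct Entry {
    bool held = false;
    int waiters = 0;
    std::condition_variable cv;
  };
  std::mutex mu_;
  std::map<std::string, Entry> keys_;  // node stability keeps Entry refs valid
};
```

With a striped design, `notify_all` on the stripe's single condition variable wakes every waiter in the stripe; here `notify_one` on the key's own condition variable wakes exactly one eligible transaction.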
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13731
Test Plan:
Unit test.
Stress unit test that validates deadlock detection and exclusive, shared lock guarantee.
A new point_lock_bench binary is created to help perform performance test.
Reviewed By: pdillinger
Differential Revision: D77353607
Pulled By: xingbowang
fbshipit-source-id: 21cf93354f9a367a78c8666596ed14013ac7240b
Summary:
A follow-up to https://github.com/facebook/rocksdb/issues/13904 which was incomplete in updating CI jobs to support C++20 because the C++20 usage was only in tests. Here we add subtle C++20 usage in the public API ("using enum" feature in db.h) to force the issue.
A lot of the work for this PR was in updating the Ubuntu22 docker image, for earlier compiler/runtime versions supporting C++20, and generating a new Ubuntu24 docker image, for later compiler/runtime versions. The Ubuntu22 image needed to be updated because there are incompatibilities with clang-13 + c++20 + libstdc++ for gcc 11, seen on these examples
```
#include <chrono>
int main(int argc, char *argv[]) {
  std::chrono::microseconds d = {};
  return 0;
}
```
and
```
#include <coroutine>
int main() { return 0; }
```
The second was causing recurring failures in build-linux-clang-13-asan-ubsan-with-folly, now fixed.
So we have to install clang's libc++ to compile with clang-13. I haven't been able to get this to work with some of the libraries like benchmark, glog, and/or gflags, but I'm able to compile core RocksDB with clang-13. On this docker image, an extra compiler parameter is needed to compile with gcc and glog, because glog is built from source (perhaps not perfectly), as the ubuntu package transitively conflicts with libc++.
The Ubuntu24 image seems to be low-drama and generally work for testing out newer compiler versions. The mingw build uses Ubuntu24 because the mingw package on Ubuntu22 uses a gcc version that is too old.
And the mass of other code changes are trying to work around new warnings, mostly from clang-analyze, which I upgraded to clang-18 in CI.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13915
Test Plan: CI, including temporarily including the nightly jobs in the PR jobs in earlier revisions to test and stabilize
Reviewed By: archang19
Differential Revision: D81933067
Pulled By: pdillinger
fbshipit-source-id: 7e33823006a79d5f3cf5bc1d625f0a3c08a7d74c
Summary:
After running stress test over a week, we've identified more failures to fix. While we work on the fix, disable the remote compaction temporarily to reduce noise and avoid these failures hiding other failures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13925
Test Plan: CI
Reviewed By: anand1976
Differential Revision: D81934248
Pulled By: jaykorean
fbshipit-source-id: 9ac11926429eebe1aebf7b520a548dc5987b7d76
Summary:
This diff adds logging in various places in the external file ingestion code where we check for non-OK status codes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13905
Test Plan: Debugging external file ingestion should be easier with additional logging.
Differential Revision: D81814033
Pulled By: archang19
fbshipit-source-id: 77f8b342cbad892acedc4603c02865c38886f2f4
Summary:
If user_defined_index_factory in BlockBasedTableOptions is configured and we try to open an SST file without the corresponding UDI (either during DB open or file ingestion), ignore a failure to load the UDI by default. If fail_if_no_udi_on_open in BlockBasedTableOptions is true, then treat it as a fatal error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13921
Test Plan: Update unit tests
Reviewed By: xingbowang
Differential Revision: D81826054
Pulled By: anand1976
fbshipit-source-id: f4fe0b13ccb02b9448622af487680131e349c52b
Summary:
Add a new option `MultiScanArgs::max_prefetch_size` that limits the memory usage of per file pinning of prefetched blocks. Note that this only accounts for compressed block size. This is intended to be a stopgap until we implement some kind of global prefetch manager that limits the global multiscan memory usage.
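The budget check can be sketched as follows. This is a hypothetical helper, not the actual `MultiScanArgs` accounting; the "0 means unlimited" convention is an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative budget check in the spirit of max_prefetch_size: count how
// many leading blocks can be pinned before the accumulated compressed size
// would exceed the budget. A budget of 0 means "unlimited" in this sketch.
size_t BlocksWithinBudget(const std::vector<uint64_t>& compressed_sizes,
                          uint64_t max_prefetch_size) {
  if (max_prefetch_size == 0) {
    return compressed_sizes.size();
  }
  uint64_t used = 0;
  size_t n = 0;
  for (uint64_t sz : compressed_sizes) {
    if (used + sz > max_prefetch_size) {
      break;  // pinning this block would exceed the budget
    }
    used += sz;
    ++n;
  }
  return n;
}
```

Note that, as in the change, only compressed block sizes are counted, so actual memory use after decompression can be higher.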
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13920
Test Plan: new unit test `./block_based_table_reader_test --gtest_filter="*MultiScanPrefetchSizeLimit/*"`
Reviewed By: xingbowang
Differential Revision: D81630629
Pulled By: cbi42
fbshipit-source-id: 9f66678915242fe1220620531a4b9fd22747cdea
Summary:
# Summary
Until we get WAL + Remote Compaction in Stress Test working, temporarily disable this
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13919
Test Plan: Meta Internal CI run
Reviewed By: anand1976
Differential Revision: D81605621
Pulled By: jaykorean
fbshipit-source-id: 6e1f9a0a7a0f27e7465512689b51364b63ef3e2b
Summary:
Re-enabling Remote Compaction Stress Test with some changes to stress test feature combo sanitization
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13913
Test Plan:
Ran Meta Internal Tests for a few days
# Follow up
- Skip recovering from WAL in remote worker and re-enable WAL
- Investigate and fix races with Integrated BlobDB
Reviewed By: hx235
Differential Revision: D81509225
Pulled By: jaykorean
fbshipit-source-id: 949762c48ece0a25e3d0281e3510f1e7d3fe3667
Summary:
**Context/Summary:**
A small change as titled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13891
Test Plan: - Existing UT and rehearsal stress test
Reviewed By: jaykorean
Differential Revision: D80588011
Pulled By: hx235
fbshipit-source-id: 6987e08a4855782305ad742eef6c0196da0d67ca
Summary:
I am wanting to use std::counting_semaphore for something and the timing seems good to require C++20 support. The internets suggest:
* GCC >= 10 is adequate, >= 11 preferred
* Clang >= 10 is needed
* Visual Studio >= 2019 is adequate
And popular linux distributions look like this:
* CentOS Stream 9 -> GCC 11.2 (CentOS 8 is EOL)
* Ubuntu 22.04 LTS -> GCC 11.x (Ubuntu 20 just ended standard support)
* Debian 12 (oldstable) -> GCC 12.2
* (Debian 11 has ended security updates, uses GCC 10.2)
This required generating a new docker image based on Ubuntu 22 for CI using gcc. The existing Ubuntu 20 image works for covering appropriate clang versions (though we should maybe add a much later version as well, in the next increment of our Ubuntu 22 image; however the minimum available clang build from apt.llvm.org for Ubuntu 22 is clang 13).
Update to SetDumpFilter is to quiet a mysterious gcc-13 warning-as-error.
Removed --compile-no-warning-as-error from a cmake command line because cmake in the new docker image is too old for this option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13904
Test Plan: CI, one minor unit test added to verify std::counting_semaphore works
Reviewed By: xingbowang
Differential Revision: D81266435
Pulled By: pdillinger
fbshipit-source-id: 26040eeccca7004416e29a6ff4f6ea93f2052684
Summary:
**Context/Summary:**
`ProcessKeyValueCompaction()` has grown too long to reason about or to add any logic for resuming from some key and saving progress for resumable compaction. This PR breaks this function into smaller functions. Almost all of the changes are cosmetic, except for one thing pointed out in the PR conversation below.
Specifically, this PR did the following:
- Added `SubcompactionInternalIterators`, `SubcompactionKeyBoundaries` and `BlobFileResources` to manage the lifetime of the local variables of the original functions to be used across smaller functions
- Moved AutoThreadOperationStageUpdater and some IO stats measurement to a different place that makes more sense
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13879
Test Plan: Existing UT
Reviewed By: jaykorean
Differential Revision: D80216092
Pulled By: hx235
fbshipit-source-id: 515615906e5e5fd5ec191bcdd4126f17d282cac2
Summary:
The implementation of parallel compression has historically scaled rather poorly, or perhaps modestly with heavy compression, topping out around 3x throughput vs. serial and incurring big overheads in CPU consumption relative to the throughput.
This change addresses one source of that extra CPU consumption: stashing all the keys of a block for later processing into building index and filter blocks. Historically with parallel compression, the index and filter block updates were handled in the last stage of processing along with writing each data block to the file writer. This was because the index blocks needed to know the BlockHandle of the new data block, which could only be known after every preceding data block was compressed, to know the starting location for the BlockHandle. And because index and filter partitions were historically coupled (see decouple_partitioned_filters), filter updates had to happen at the same time.
Here we get rid of stashing the keys for later processing and the extra CPU associated with it, by
* Creating a two stage process of adding to index blocks ("prepare" and "finish" each entry; one entry per data block). The two stages must be executable in parallel for separate index entries. NOTE: not yet supported by UserDefinedIndex
* Requiring decouple_partitioned_filters=true for parallel compression, because we now add to filters in the first stage of processing when each key is readily available and we cannot couple that with finalizing index entries in the last stage of processing.
It might seem like adding to filters is something that is expensive (hashing etc.) and should be kept out of the bottle-neck first stage of processing (which includes walking the compaction iterator) but it's probably similar cost to simply stashing the keys away for later processing. (We might be able to reduce a bottle-neck by stashing hashes, but we're not to a point where that is worth the effort.)
And it makes sense to make two more simple public API updates in conjunction with this:
* Set decouple_partitioned_filters=true by default. No signs of problems in production.
* Mark parallel compression as production-ready. It's being thoroughly tested in the crash test, successfully, and in limited production uses.
Follow-up:
* Improve the threading/sychronization model of parallel compression for the next major efficiency improvement
* Consider supporting the parallel-compatible index building APIs with UserDefinedIndex, unless it's considered too dangerous to expect users to safely handle the multi-threading.
* (In a subsequent release) remove all the code associated with coupling filter and index partitions and mark the option as ignored.
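The prepare/finish split described above can be sketched as follows. Names are illustrative (this is not RocksDB's actual index-builder API), and a real implementation must make the first stage safe to run concurrently for separate entries; this sketch is single-threaded for brevity.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative two-stage index-entry flow: Prepare() runs in the first
// stage, when the block's last key is readily available; Finish() runs in
// the last stage, once all preceding blocks are compressed and the
// BlockHandle (offset/size in the file) is finally known.
struct BlockHandleSketch {
  uint64_t offset = 0;
  uint64_t size = 0;
};

class TwoStageIndexSketch {
 public:
  // Stage 1: record the separator key; returns a token to finish later.
  size_t Prepare(const std::string& last_key_in_block) {
    entries_.push_back({last_key_in_block, BlockHandleSketch{}});
    return entries_.size() - 1;
  }
  // Final stage: fill in the handle once offset/size are known.
  void Finish(size_t token, BlockHandleSketch handle) {
    entries_[token].handle = handle;
  }
  size_t num_entries() const { return entries_.size(); }
  uint64_t offset_of(size_t token) const {
    return entries_[token].handle.offset;
  }

 private:
  struct Entry {
    std::string key;
    BlockHandleSketch handle;
  };
  std::vector<Entry> entries_;
};
```

The point of the split is that filter updates (and key capture) can happen in stage 1 while the handle is deferred, instead of stashing every key until the last stage.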
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13850
Test Plan:
for correctness, existing tests
## Performance Data
The "before" data here includes revert of https://github.com/facebook/rocksdb/issues/13828 for combined performance measurement of this change and that one.
```
SUFFIX=`tty | sed 's|/|_|g'`; for CT in lz4 zstd lz4; do for PT in 1 2 3 4 6 8; do echo "$CT pt=$PT"; (for I in `seq 1 1`; do BIN=/dev/shm/dbbench${SUFFIX}.bin; rm -f $BIN; cp db_bench $BIN; /usr/bin/time $BIN -db=/dev/shm/dbbench$SUFFIX --benchmarks=fillseq -num=30000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 -compression_type=$CT -compression_parallel_threads=$PT 2>&1 | tail -n 3 | head -n 2; done); done; done
```
To get a sense of the overall performance relative to number of parallel threads, we vary that with popular fast compression and popular heavier weight compression (some noise in this data, don't interpret each data point too strongly)
lz4 pt=1
2107431 -> 2112941 ops/sec (+0.3% - improvement)
(26.51 + 0.75) = 27.26 CPU sec -> (26.63 + 0.79) = 27.42 CPU sec (+0.6% - regression)
lz4 pt=2
1606660 -> 1580333 ops/sec (-1.6% - regression)
(47.10 + 8.37) = 55.47 CPU sec -> (45.05 + 9.23) = 54.28 CPU sec (-2.2% - improvement)
lz4 pt=3
1701353 -> 1889283 ops/sec (+11.1% - improvement)
(47.23 + 8.29) = 55.52 CPU sec -> (43.89 + 8.33) = 52.22 CPU sec (-6.0% - improvement)
lz4 pt=4
1651504 -> 1817890 ops/sec (+10.1% - improvement)
(48.07 + 8.31) = 56.38 CPU sec -> (44.77 + 8.45) = 53.22 CPU sec (-5.6% - improvement)
lz4 pt=6
1716099 -> 1888523 ops/sec (+10.1% - improvement)
(47.50 + 8.45) = 55.95 CPU sec -> (44.25 + 8.73) = 52.98 CPU sec (-5.3% - improvement)
lz4 pt=8
1696840 -> 1797256 ops/sec (+5.9% - improvement)
(48.09 + 8.61) = 56.70 CPU sec -> (45.90 + 8.68) = 54.58 CPU sec (-3.8% - improvement)
Clearly parallel threads do not help with fast compression like LZ4, but it's not as bad as it was before.
zstd pt=1
1214258 -> 1202863 ops/sec (-0.9% - regression)
(38.26 + 0.66) = 38.92 CPU sec -> (39.37 + 0.69) = 40.06 CPU sec (+2.9% - regression)
zstd pt=2
1194673 -> 1152746 ops/sec (-3.5% - regression)
(61.01 + 9.85) = 70.86 CPU sec -> (58.28 + 9.99) = 68.27 CPU sec (-3.7% - improvement)
zstd pt=3
1653661 -> 1825618 ops/sec (+10.4% - improvement)
(60.07 + 8.45) = 68.52 CPU sec -> (56.03 + 8.43) = 64.46 CPU sec (-5.9% - improvement)
zstd pt=4
1691723 -> 1890976 ops/sec (+11.8% - improvement)
(59.72 + 8.46) = 68.18 CPU sec -> (55.96 + 8.27) = 64.23 CPU sec (-5.7% - improvement)
zstd pt=6
1684982 -> 1900002 ops/sec (+12.8% - improvement)
(58.89 + 8.26) = 67.15 CPU sec -> (55.98 + 8.48) = 64.46 CPU sec (-4.0% - improvement)
zstd pt=8
1648282 -> 1892531 ops/sec (+14.8% - improvement)
(59.43 + 8.63) = 68.06 CPU sec -> (56.49 + 8.32) = 64.81 CPU sec (-4.8% - improvement)
The throughput is now able to increase by *more than half* with lots of parallelism, rather than only *about a third*.
Scalability is a bit better with higher compression level, and we still see a benefit from this change. (We've also enabled partitioned indexes and filters here, which sees essentially the same benefits):
zstd pt=1 compression_level=7
595720 -> 597359 ops/sec (+0.3% - improvement)
(63.45 + 0.73) = 64.18 CPU sec -> (63.25 + 0.71) = 63.96 CPU sec (-0.3% - improvement)
zstd pt=4 compression_level=7
1527116 -> 1501779 ops/sec (-1.7% - regression)
(85.00 + 8.14) = 93.14 CPU sec -> (81.85 + 9.02) = 90.87 CPU sec (-2.5% - improvement)
zstd pt=6 compression_level=7
1678239 -> 1956070 ops/sec (+16.5% - improvement)
(83.77 + 8.11) = 91.88 CPU sec -> (79.87 + 7.78) = 87.65 CPU sec (-4.6% - improvement)
zstd pt=8 compression_level=7
1696132 -> 1953041 ops/sec (+15.1% - improvement)
(83.97 + 8.14) = 92.11 CPU sec -> (80.61 + 7.78) = 88.39 CPU sec (-4.1% - improvement)
With more tests, not really seeing any consistent differences with no parallelism (despite some micro-optimizations thrown in)
Reviewed By: hx235
Differential Revision: D79853111
Pulled By: pdillinger
fbshipit-source-id: 7a34fd7811217fb74fa6d3efaea7ffcce72beec7
Summary:
**Context/Summary:**
RocksDB stress test verifies that IOActivity is set correctly by reusing the passed-in Read/Write options through assertions. This is too strict for APIs that do not take, or do not need to take, Read/WriteOptions yet, hence the assertion failure.
```
stderr:
db_stress: ... db_stress_tool/db_stress_env_wrapper.h:24: void rocksdb::(anonymous namespace)::CheckIOActivity(const IOOptions &): Assertion `io_activity == Env::IOActivity::kUnknown || io_activity == options.io_activity' failed.
Received signal 6 (Aborted)
```
An example is `ManagedSnapshot snapshot_guard(db_);` in TestMultiScan().
This PR ignores such check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13898
Test Plan: The same command repro-ed this assertion failure passes after this fix
Reviewed By: archang19
Differential Revision: D80983214
Pulled By: hx235
fbshipit-source-id: d8b660f8c8771198bc7fa0e805c3e86d2584f03e
Summary:
**Context/Summary:**
Clear the statistics reference from options_ to intentionally shorten the statistics object's lifetime to be the same as the db object's (which is the common case in practice) and detect if RocksDB accesses the statistics beyond its lifetime.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13899
Test Plan: - [Ongoing] Stress test rehearsal
Reviewed By: pdillinger
Differential Revision: D80985435
Pulled By: hx235
fbshipit-source-id: ab238231cd81f47fa451aea12a0c85fa11d9ac81
Summary:
`IngestExternalFileOptions::allow_db_generated_files` requires SST files to have zero sequence number. This PR opens it up for any DB generated SST files. Currently we don't do global sequence number assignment when `allow_db_generated_files` is true, so we require that files do not overlap with any key in the CF. One behavior difference is that now we allow ingesting overlapping files when `allow_db_generated_files` is true. Users need to ensure that files are ordered such that later files have more recent updates.
Intended follow ups:
- Record smallest seqno in table property, so that we don't need to scan the file for it.
- Cover allow_db_generated_files in crash test. We may create a new DB and ingest all files from a CF for verification.
- Add APIs that use allow_db_generated_files. For example, an API for ingesting SST files from a source CF, so that we take care of ingestion file ordering for the user. If we are already getting metadata from the source CF, we may use it as a hint for level placement instead of dividing input files into batches again (`ExternalSstFileIngestionJob::DivideInputFilesIntoBatches`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13878
Test Plan: two new unit tests.
Reviewed By: hx235, xingbowang
Differential Revision: D80233727
Pulled By: cbi42
fbshipit-source-id: 74209386d8426c434bff2d9a734f06db537eb50c
Summary:
I saw a failure when adding some asserts near b9957c991c/table/block_based/block_based_table_iterator.cc (L1201-L1205) in stress test. The decompression failed with an error message like "Corruption: Failed zlib inflate: -3". This PR fixes the issue by using the right decompressor for dictionary compression.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13896
Test Plan: updated unit test that checks no I/O is done after Prepare(), this would fail before this change.
Reviewed By: anand1976
Differential Revision: D80821500
Pulled By: cbi42
fbshipit-source-id: a4322c0da99a2d10e9787d0ec168668567c0c19a
Summary:
RocksDB currently aborts whenever `io_uring_wait_cqe` returns an error code. It also does not log what error code was returned.
While experimenting with `IO_URING`, my application crashed because of this.
I asked the Linux kernel user group about the best way to handle an unsuccessful `io_uring_wait_cqe`.
It was recommended to retry on `EINTR`, `EAGAIN`, and `ETIME`. `ETIME` only happens when waiting with a timeout, so I am not handling it.
I also write to `stderr` so that we have some debugging information if we abort.
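The retry policy can be sketched generically as follows. This is a hypothetical helper with no liburing dependency: `op` stands in for a call like `io_uring_wait_cqe`, returning 0 on success or a negative errno value on failure, and the retry cap is an assumption of this sketch.

```cpp
#include <cerrno>
#include <functional>

// Retry transient errors (EINTR, EAGAIN); return anything else to the
// caller, which can log it to stderr before deciding whether to abort.
// ETIME is not handled here since it only occurs when waiting with a
// timeout.
int RetryOnTransientErrors(const std::function<int()>& op, int max_retries) {
  int ret = op();
  for (int attempt = 0; attempt < max_retries; ++attempt) {
    if (ret != -EINTR && ret != -EAGAIN) {
      break;  // success, or a non-transient error worth reporting
    }
    ret = op();
  }
  return ret;
}
```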
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13890
Test Plan: Unfortunately this is hard to cover through unit/stress tests. We have to see what sort of errors get encountered in production.
Reviewed By: anand1976
Differential Revision: D80639955
Pulled By: archang19
fbshipit-source-id: e3a230bd37552ec0f36be34e6a4e53cfd2a254f1
Summary:
When fill_cache in ReadOptions is false, multi scan Prepare crashes with the following assertion failure. In this case, CreateAndPinBlockInCache needs to directly create a block with full ownership.
```
#9  0x00007f2fc003bc93 in __GI___assert_fail (assertion=0x7f2fc2147361 "pinned_data_blocks_guard[block_idx].GetValue()", file=0x7f2fc2146e08 "table/block_based/block_based_table_iterator.cc", line=1178, function=0x7f2fc2147262 "virtual void rocksdb::BlockBasedTableIterator::Prepare(const rocksdb::MultiScanArgs *)") at assert.c:101
101     in assert.c
#10 0x00007f2fc1d73088 in rocksdb::BlockBasedTableIterator::Prepare(rocksdb::MultiScanArgs const*) () from /data/users/anand76/rocksdb_anand76/librocksdb.so.10.6
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13889
Test Plan: Parameterize the DBMultiScanIteratorTest tests with fill_cache
Reviewed By: cbi42
Differential Revision: D80552069
Pulled By: anand1976
fbshipit-source-id: 1a0b64af1e14c63d826add1f994a832ebff12757
Summary:
I ran multiple runs of crash test jobs internally, so far I've seen one iterator mismatch and one assertion failure. I've added relevant logging improvements to help debugging them. use_multiscan will be stable within a crash test run to make it easier to triage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13888
Test Plan: `python3 tools/db_crashtest.py whitebox --prefix_size=-1 --test_batches_snapshots=0 --use_multiscan=1 --read_fault_one_in=0 --kill_random_test=88888`
Reviewed By: anand1976
Differential Revision: D80627399
Pulled By: cbi42
fbshipit-source-id: 2fa3f77e730f5bc7d1d200dc122cf84e3558c588
Summary:
The assert occasionally throws off the stress test runs. We already have sufficient logging in place to collect the signal about secondary cache capacity exceeding primary cache reservation for further investigation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13885
Reviewed By: anand1976
Differential Revision: D80355513
Pulled By: mszeszko-meta
fbshipit-source-id: b36926f0493a3aca19818a1980ef79277db9fe7e
Summary:
Add the --list_meta_blocks option to sst_dump. This PR also refactors some of the test code in sst_dump_test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13838
Reviewed By: cbi42
Differential Revision: D80320812
Pulled By: anand1976
fbshipit-source-id: 921b6560fbd756f5f8b364893700d240d3b7ad00
Summary:
Two instances of change that are not just cosmetic:
* InlineSkipList<>::Node::CASNext() was implicitly using memory_order_seq_cst to access `next_` while it's intended to be accessed with acquire/release. This is probably not a correctness issue for compare_exchange_strong but potentially a previously missed optimization.
* Similar for `max_height_` in Insert which is otherwise accessed with relaxed memory order.
* One non-relaxed access to `is_range_del_table_empty_` in a function only used in assertions. Access to this atomic is otherwise relaxed (and should be - comment added)
Didn't do all of memtable.h because some of them are more complicated changes and I should probably add FetchMin and FetchMax functions to simplify and take advantage of C++26 functions where available (intended follow-up).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13844
Test Plan: existing tests
Reviewed By: xingbowang
Differential Revision: D79742552
Pulled By: pdillinger
fbshipit-source-id: d97ce72ba9af6c105694b7d40622db9e994720cd
Summary:
This is an important feature for avoiding (reducing) unfair block cache treatment for a lot of blocks. It should also unlock some parallel optimizations (https://github.com/facebook/rocksdb/issues/13850) and code simplification.
Consider for follow-up:
* Feature to avoid majorly undersized data blocks and filter and index partition blocks
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13881
Test Plan: existing tests, been looking good in production
Reviewed By: hx235
Differential Revision: D80288192
Pulled By: pdillinger
fbshipit-source-id: 5e274ffffb044713278d2a286db6bceaab2dadec
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13882
The `expect_valid_internal_key` parameter was always passed as true, with false only used in one unit test. This change removes the parameter and always fails compaction when encountering corrupted internal keys, which is the expected production behavior.
Reviewed By: mszeszko-meta
Differential Revision: D80287672
fbshipit-source-id: e30a282ac30d7fded677504cec11173de8d15167
Summary:
Allow a user defined index to be configured from a string
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13880
Test Plan: Add a unit test in table_test.cc
Reviewed By: bikash-c
Differential Revision: D80237701
Pulled By: anand1976
fbshipit-source-id: 8b3d0bcdfbb4bb76803916ea1b1f940a4d985dfd
Summary:
The original intention of the User Defined Index interface was to use the user key. However, the implementation mixed user and internal key usage. This PR makes it consistent. It also clarifies the UDI contract.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13865
Test Plan: Update tests in table_test.cc
Reviewed By: pdillinger
Differential Revision: D80050344
Pulled By: anand1976
fbshipit-source-id: ace47737d21684ec19709640a09e198cee2d98bd
Summary:
... as we see some issues that rehearsal stress test didn't surface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13869
Reviewed By: cbi42
Differential Revision: D80103341
Pulled By: hx235
fbshipit-source-id: 8b2c1d76d4c3099727ba3a69de44de67afd64369
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13846
This diff addresses a few issues that were identified during testing of the user defined index.
1. During the finishing of the index blocks, we run into an infinite loop because the user defined index wrapper returns
early on incomplete status. This happens because the wrapper blindly returns the status if it is not OK. But, the status
could legitimately be `Incomplete()` for some indices like Partitioned Index (serving as the internal index for the UDI
wrapper). Fix is to exclude `Incomplete()` check from the status check early in the UDI wrapper's finish.
2. Once we fixed (1), we noticed that the meta blocks for the UDI-based index writer were not written out to the final
SST file. This is because the UDI's meta blocks are created after the internal index's meta blocks and the block-based
index builder didn't account for this. The fix is to finish the UDI wrapper first which will create the necessary meta blocks
and then finish the internal index. If the internal index is incomplete, the block-based index builder should still continue
to write out the meta blocks.
3. OnKeyAdded when delegating to the user-defined index should only pass the user key. The UDI builder doesn't
understand RocksDB's internal key format and while that poses interesting challenges when the UDI is used for non
last level SST files, our plan is to restrict the usage of the UDI to last level files only (for now).
Reviewed By: pdillinger
Differential Revision: D79781453
fbshipit-source-id: 2239c8fc016da55df5c24be6aacc8f6357cab029
Summary:
fix the following error showing up in continuous tests:
```
Makefile:186: Warning: Compiling in debug mode. Don't use the resulting binary in production
port/mmap.cc:46:15: error: first argument in call to 'memcpy' is a pointer to non-trivially copyable type 'rocksdb::MemMapping' [-Werror,-Wnontrivial-memcall]
46 | std::memcpy(this, &other, sizeof(*this));
| ^
port/mmap.cc:46:15: note: explicitly cast the pointer to silence this warning
46 | std::memcpy(this, &other, sizeof(*this));
| ^
| (void*)
1 error generated.
make: *** [Makefile:2580: port/mmap.o] Error 1
make: *** Waiting for unfinished jobs....
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13864
Test Plan: `make USE_CLANG=1 j=150 check` with 13f054febb/build_tools/build_detect_platform (L61-L70) commented out.
Reviewed By: mszeszko-meta
Differential Revision: D80033441
Pulled By: cbi42
fbshipit-source-id: b2330eea71fe28243236b75128ec6f3f1e971873
Summary:
while debugging stress test failure, I noticed that sst_dump and ldb do not work if custom db_stress compression manager is used. This PR adds support for it.
```
./sst_dump --command=raw --show_properties --file=/tmp/rocksdb_crashtest_whitebox4ny5mass/000589.sst
options.env is 0x7f2b1f4b9000
Process /tmp/rocksdb_crashtest_whitebox4ny5mass/000589.sst
Sst file format: block-based
/tmp/rocksdb_crashtest_whitebox4ny5mass/000589.sst: Not implemented: Could not load CompressionManager: DbStressCustom1
/tmp/rocksdb_crashtest_whitebox4ny5mass/000589.sst is not a valid SST file
./ldb idump --db=/tmp/rocksdb_crashtest_whiteboxy_emah11 --ignore_unknown_options --hex >> /tmp/i_dump
Failed: Not implemented: Could not load CompressionManager: DbStressCustom1
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13827
Test Plan: manually tested that ldb and sst_dump work with DbStressCustomCompressionManager after this PR
Reviewed By: pdillinger
Differential Revision: D79461175
Pulled By: cbi42
fbshipit-source-id: c8c092b10b4fde3a295b00751057749e8f0cf095
Summary:
To better support future options and changes, we need to convert the std::vector<ScanOptions> to something more malleable.
This diff introduces the MultiScanOptions structure and pipes it through the various points in the code in the Prepare path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13837
Test Plan:
Ensure all associated tests pass
```
make check all
```
Reviewed By: cbi42
Differential Revision: D79655229
Pulled By: krhancoc
fbshipit-source-id: 3a90fb7420e9655021de85ed0158b866f8bfba05
Summary:
**Context/Summary:**
This update, which should have been part of a previous refactoring [PR](d2ac955881), involves simple renaming for clarity and ensures output table properties are only set when compaction succeeds. Output properties are not meaningful if compaction fails, so this change prevents their population in such cases. Additionally, subsequent statistics updates already do not rely on output file table properties, maintaining correctness regardless of compaction success.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13851
Test Plan: Existing unit tests
Reviewed By: jaykorean
Differential Revision: D79862244
Pulled By: hx235
fbshipit-source-id: 1db16b8dc7b820fab3ec1d5c8a4b757466590e2c
Summary:
**Context/Summary:**
The `CompactionJob::Run()` method has grown too large and complex, making it difficult to implement moderate changes or reason about the code flow (e.g., determining where to save compaction progress for resuming). This PR refactors the method into smaller, more focused functions to improve readability and maintainability.
The refactoring consists mostly of cosmetic changes that extract logical sections into separate methods, with two notable functional improvements:
1. **Relocated output processing logic**: Moved code under `RemoveEmptyOutputs()` and `HasNewBlobFiles()` to where it's actually needed, rather than piggy-backing on the subcompaction state loop. While this introduces 2 additional loops over subcompactions, the performance impact should be negligible given the improved code clarity.
2. **Repositioned statistics updates**: Moved `UpdateCompactionJobInputStats()` and `UpdateCompactionJobOutputStats()` from the record verification section to the end `FinalizeCompactionRun()` methods. This change is safe since record verification is a read-only operation that doesn't modify any statistics.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13849
Test Plan: Existing unit tests
Reviewed By: jaykorean
Differential Revision: D79824429
Pulled By: hx235
fbshipit-source-id: 6b73136f32ecc6842a04a77502b7dbb0bbf507f7
Summary:
We temporarily disabled WAL when Remote Compaction is enabled in Stress Test (https://github.com/facebook/rocksdb/pull/13843). There are a few other features that are incompatible with a disabled WAL. Due to the sanitization order, WAL was disabled at the end of sanitization and these incompatible features weren't set properly. Stress Test failed with an error like the following.
e.g. `reopen` stress test is not compatible with `disable_wal` - `Error: Db cannot reopen safely with disable_wal set!`
This PR changes the order of sanitization so that `disable_wal` is set earlier when `remote_compaction_worker_threads > 0`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13845
Test Plan:
```
python3 -u tools/db_crashtest.py blackbox --remote_compaction_worker_threads=8 --interval=5 --duration=6000 --continuous_verification_interval=10 --disable_wal=1 --use_txn=1 --txn_write_policy=2 --enable_pipelined_write=0 --checkpoint_one_in=0 --use_timed_put_one_in=0
```
Reviewed By: cbi42
Differential Revision: D79758670
Pulled By: jaykorean
fbshipit-source-id: aa6f4a74cc86c23f442928c301187b06e8137f53
Summary:
https://github.com/facebook/rocksdb/issues/13676 unfortunately treated some IOErrors as corruption, which is not appropriate when remote storage is involved. To help enforce this, our crash test injects errors that are expected to be propagated back to the user rather than causing some other failure.
Saw crash test failures like this:
```
TestMultiGetEntity (AttributeGroup) error: Corruption: Failed to get file size: Not implemented: GetFileSize Not Supported for file ...
```
So this fixes the handling by not injecting a false Corruption failure and by allowing smooth fallback from FSRandomAccessFile::GetFileSize to FileSystem::GetFileSize.
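The intended fallback shape can be sketched as follows (all types, names, and the returned size here are illustrative stand-ins, not the RocksDB API): a NotSupported result from the handle-level call should fall through to the file-system-level call instead of surfacing as Corruption.

```cpp
#include <cstdint>
#include <string>

// Toy Status with just the codes needed for the sketch.
struct Status {
  enum Code { kOk, kNotSupported, kCorruption } code = kOk;
  bool ok() const { return code == kOk; }
  bool IsNotSupported() const { return code == kNotSupported; }
};

// Pretend this backend does not implement per-handle GetFileSize.
struct File {
  Status GetFileSize(uint64_t* /*size*/) { return {Status::kNotSupported}; }
};

// The file-system-level call that does support it.
struct FileSystem {
  Status GetFileSize(const std::string& /*fname*/, uint64_t* size) {
    *size = 4096;  // illustrative value
    return {};
  }
};

Status SizeWithFallback(File& f, FileSystem& fs, const std::string& fname,
                        uint64_t* size) {
  Status s = f.GetFileSize(size);
  if (s.IsNotSupported()) {
    // Smooth fallback rather than treating NotSupported as Corruption.
    s = fs.GetFileSize(fname, size);
  }
  return s;
}
```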
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13842
Test Plan: unit test added
Reviewed By: xingbowang
Differential Revision: D79728861
Pulled By: pdillinger
fbshipit-source-id: 33f7dfc85d86d88cb4ab24a8defd26618c95c954
Summary:
To reduce the noise, disable the incompatible ones for now when `remote_compaction_worker_threads > 0`. We will investigate each, fix as needed and re-enable them as follow up.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13843
Test Plan:
```
python3 -u tools/db_crashtest.py blackbox --remote_compaction_worker_threads=8 --interval=5 --duration=6000 --continuous_verification_interval=10 --disable_wal=1 --use_txn=1 --enable_pipelined_write=0 --checkpoint_one_in=0 --use_timed_put_one_in=0
```
Reviewed By: cbi42
Differential Revision: D79735166
Pulled By: jaykorean
fbshipit-source-id: ae3be38a21073fd3282d6e8cd7d71f0363df3590
Summary:
This test verifies that compaction respects the min_file_size parameter when triggered by deletions, preventing the compaction of files with deletions smaller than the threshold. The test logic includes two scenarios:
1. Verify that a large L0 file with deletions exceeding the minimum file size threshold triggers deletion-triggered compaction (DTC) and compacts to L1.
2. Verify that a small L0 file with deletions, but below the minimum file size threshold, does not trigger DTC and remains at L0.
Added the DeletionTriggeredCompactionWithMinFileSizeTestListener, which verifies that files selected for compaction based on deletion triggers meet the minimum file size threshold. The listener validates in OnCompactionBegin that all input files have sizes greater than or equal to the configured min_file_size parameter.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13825
Test Plan:
Tested this feature on our devserver using the following commands:
```
DEBUG_LEVEL=2 make -j64 db_compaction_test && KEEP_DB=1 ./db_compaction_test --gtest_filter="*DBCompactionTest.CompactionWith*"
```
Test output confirms the expected behavior:
```
2025/07/31-11:24:49.473181 1431671 [/compaction/compaction_job.cc:2291] [default] [JOB 6] Compacting 2@0 files to L1, score 0.04
2025/07/31-11:24:49.473240 1431671 [/compaction/compaction_job.cc:2297] [default]: Compaction start summary: Base version 6 Base level 0, inputs: [15(52KB) 9(103KB)]
2025/07/31-11:24:49.473304 1431671 EVENT_LOG_v1 {"time_micros": 1753986289473273, "job": 6, "event": "compaction_started", "cf_name": "default", "compaction_reason": "FilesMarkedForCompaction", "files_L0": [15, 9], "score": 0.04, "input_data_size": 159848, "oldest_snapshot_seqno": -1}
```
**Tasks:**
T228156639
Reviewed By: cbi42
Differential Revision: D79395851
Pulled By: nmk70
fbshipit-source-id: 4c2a80a95521b40543981dd81b347f3984cd2a8b
Summary:
Remote Compaction in the stress test previously failed with the following error, so we temporarily disabled it in PR https://github.com/facebook/rocksdb/issues/13815:
```
reference std::vector<rocksdb::ThreadState *>::operator[](size_type) [_Tp = rocksdb::ThreadState *, _Alloc = std::allocator<rocksdb::ThreadState *>]: Assertion '__n < this->size()' failed.
```
The error came from accessing `remote_compaction_worker_threads[i]` when `i >= remote_compaction_worker_threads.size()`, which is undefined behavior. This PR fixes the issue by properly setting the worker thread pointers in `remote_compaction_worker_threads`.
Note: We are still encountering errors when both BlobDB and Remote Compaction are enabled. It appears to be a race condition. For now, BlobDB is temporarily disabled if remote compaction is enabled. We will fix the race condition and re-enable BlobDB as a follow-up.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13835
Test Plan:
```
python3 -u tools/db_crashtest.py blackbox --remote_compaction_worker_threads=16 --interval=2 --duration=180
```
Reviewed By: hx235
Differential Revision: D79684447
Pulled By: jaykorean
fbshipit-source-id: 65f5809f651865c3df76c2cf3b9e7b8d654bb90a
Summary:
This option has the same functionality as `DBOptions::allow_ingest_behind` but allows the feature at a per-CF level. `DBOptions::allow_ingest_behind` is deprecated after this PR and users should use `cf_allow_ingest_behind` instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13810
Test Plan: updated some existing tests to use the new option.
Reviewed By: xingbowang
Differential Revision: D79191969
Pulled By: cbi42
fbshipit-source-id: 0da45f6be472ace6754ad15df93d45ac86313837
Summary:
**Context/Summary:**
The `RoundRobinSubcompactionsAgainstResources` test, specifically the `SubcompactionsUsingResources` case, is now disabled. This decision was made because the test's reliability depends on the absence of any concurrent compactions other than the round-robin compaction. Addressing this issue while maintaining the test's focus on resource reservation requires a deeper investigation, which is currently beyond my available bandwidth. Given the increased frequency of test failures, it has been temporarily disabled to prevent further disruptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13839
Test Plan: - Should be no test failure from RoundRobinSubcompactionsAgainstResources.SubcompactionsUsingResources anymore.
Reviewed By: cbi42
Differential Revision: D79686366
Pulled By: hx235
fbshipit-source-id: 3a226cfd2b67cabc6c585ea567e2b0c25aa5f345
Summary:
Quick follow-up from https://github.com/facebook/rocksdb/pull/13816: `CompactFiles()` and `CompactRange()` in CompactionPickers do not run compaction as their names might suggest. What they actually do is create the Compaction object that will be passed to `CompactionJob` to run the compaction.
Renaming these two functions to better represent their purposes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13831
Test Plan: No functional change. Existing CI should be sufficient.
Reviewed By: hx235
Differential Revision: D79660196
Pulled By: jaykorean
fbshipit-source-id: ca831dbef5120e7115b52fd07b0059ca16c8f1e8
Summary:
... by ensuring that files in a dropped column family are not returned to the caller upon successful, offline MANIFEST iteration.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13832
Test Plan: `DBTest2, GetFileChecksumsFromCurrentManifest_CRC32`
Reviewed By: pdillinger
Differential Revision: D79607298
Pulled By: mszeszko-meta
fbshipit-source-id: e7948e086ba6e6fb953a3959fdcc81300613d73e
Summary:
Introduce `CompactionJob::VerifyOutputRecordCount()` and make it align with `VerifyInputRecordCount()`.
Functionality-wise, it should be the same, except that when `db_options_.compaction_verify_record_count` is false, RocksDB will only print a WARN message upon verification failure rather than return `Status::Corruption()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13830
Test Plan:
Existing tests cover both
```
./compaction_service_test --gtest_filter="*CompactionServiceTest.VerifyInputRecordCount*"
```
```
./compaction_service_test --gtest_filter="*CompactionServiceTest.CorruptedOutput*"
```
Reviewed By: hx235
Differential Revision: D79584795
Pulled By: jaykorean
fbshipit-source-id: 5851328999005601b28504085b688b80880bca7c
Summary:
In anticipation of an enhancement related to parallel compression
* Rename confusing state variables `seperator_is_key_plus_seq_` -> `must_use_separator_with_seq_`
* Eliminate copy-paste code in `PartitionedIndexBuilder::AddIndexEntry`
* Optimize/simplify `PartitionedIndexBuilder::flush_policy_` by allowing a single policy to be re-targeted to different block builders. Added some additional internal APIs to make this work, and it only works because the FlushBlockBySizePolicy is otherwise stateless (after creation).
* Improve some comments, including another proposed optimization especially for the common case of no live snapshots affecting a large compaction
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13828
Test Plan:
existing tests are pretty exhaustive, especially with crash test
Planning to validate performance in combination with next change. (This change is saving some extra allocate/deallocate with partitioned index.)
Reviewed By: cbi42
Differential Revision: D79570576
Pulled By: pdillinger
fbshipit-source-id: f7a16f0e6e6ad2023a3d1a2ebaa3cc22aac717af
Summary:
This diff introduces the IntervalSet data structure, which will be used to help create sets of non-overlapping intervals for MultiScan scan options. Specifically, we add specializations for Slices to assist in this.
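A minimal sketch of the merge-on-insert idea behind such a structure (integer endpoints for brevity; the actual IntervalSet is templated with Slice specializations and its API will differ): inserting a half-open interval absorbs any overlapping or adjacent intervals, so the set always stays non-overlapping.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <map>

// Toy interval set over [start, end) ranges, keyed start -> end.
class IntervalSet {
 public:
  void Add(int start, int end) {
    auto it = intervals_.lower_bound(start);
    // Step back if the previous interval reaches into [start, end).
    if (it != intervals_.begin() && std::prev(it)->second >= start) --it;
    // Absorb every interval that overlaps or touches the new one.
    while (it != intervals_.end() && it->first <= end) {
      start = std::min(start, it->first);
      end = std::max(end, it->second);
      it = intervals_.erase(it);
    }
    intervals_[start] = end;
  }
  std::size_t Size() const { return intervals_.size(); }

 private:
  std::map<int, int> intervals_;
};
```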
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13787
Test Plan: Added test to catch various cases within adding intervals.
Reviewed By: anand1976
Differential Revision: D78624970
Pulled By: krhancoc
fbshipit-source-id: 9a3e4a28738ab8428788467540fc05ab5c1a1b67
Summary:
One of the parameters for constructing a Compaction object is `earliest_snapshot`, which is required for Standalone Range Deletion Optimization (introduced in [https://github.com/facebook/rocksdb/pull/13078](https://github.com/facebook/rocksdb/pull/13078)). Remote Compaction has been using the `CompactionPicker::CompactFiles()` API to create the Compaction object, but this API never sets the `earliest_snapshot` parameter. To address this, update `CompactionPicker::CompactFiles()` to optionally accept `earliest_snapshot` and pass it during the call in `DBImplSecondary::CompactWithoutInstallation()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13816
Test Plan:
```
./compaction_service_test --gtest_filter="*CompactionServiceTest.StandaloneDeleteRangeTombstoneOptimization*"
```
\+ Tested in Meta's internal offload infra.
Reviewed By: hx235
Differential Revision: D79284769
Pulled By: jaykorean
fbshipit-source-id: 164834ef6972d5e0ddfc2970bb9234ef166d6e52
Summary:
Fix a bug in MultiScan where BlockBasedTableIterator should not return out-of-bound when all blocks of the last scan are exhausted. This prevented LevelIterator from entering the next file, so the iterator returned fewer keys than expected.
Also fixed stress testing to specify iterate_upper_bound correctly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13822
Test Plan:
- the following fails quickly before this PR and finishes after this PR
```python3 tools/db_crashtest.py whitebox --iterpercent=60 --prefix_size=-1 --prefixpercent=0 --readpercent=0 --test_batches_snapshots=0 --use_multiscan=1 --seed=1 --fill_cache=1 --read_fault_one_in=0 --column_families=1 --allow_unprepared_value=0 --kill_random_test=88888```
- new unit test that fails before this PR
Reviewed By: krhancoc
Differential Revision: D79308957
Pulled By: cbi42
fbshipit-source-id: c9eafd1c8750b959b0185d7c63199b503493cbd2
Summary:
The main motivation for this change is to more flexibly and efficiently support compressing data without extra copies when we do not want to support saving compressed data that is LARGER than the uncompressed. We believe pretty strongly that for the various workloads served by RocksDB, it is well worth a single byte compression marker so that we have the flexibility to save compressed or uncompressed data when compression is attempted. Why? Compression algorithms can add tens of bytes in fixed overheads and percents of bytes in relative overheads. It is also an advantage for the reader when they can bypass decompression, including at least a buffer copy in most cases, after reading just one byte.
The block-based table format in RocksDB follows this model with a single-byte compression marker, and at least after https://github.com/facebook/rocksdb/pull/13797 so does CompressedSecondaryCache. (Notably, the blob file format DOES NOT. This is left to follow-up work.)
In particular, Compressor::CompressBlock now takes in a fixed size buffer for output rather than a `std::string*`. CompressBlock itself rejects the compression if the output would not fit in the provided buffer. This also works well with `max_compressed_bytes_per_kb` option to reject compression even sooner if its ratio is insufficient (implemented in this change). In the future we might use this functionality to reduce a buffer copy (in many cases) into the WritableFileWriter buffer of the block based table builder.
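To illustrate the marker-byte scheme described above, a toy sketch (the marker values, the fake run-length "compressor", and the helper names are all hypothetical, not RocksDB's format): compression into a fixed-size budget is rejected when the output would not fit, and the stored block always begins with one byte telling the reader whether to decompress.

```cpp
#include <cstddef>
#include <optional>
#include <string>

enum Marker : char { kRaw = 0, kCompressed = 1 };

// Fake "compressor" that collapses runs of repeated characters; returns
// nullopt if the result would exceed max_out bytes (cf. CompressBlock
// rejecting output that does not fit the provided buffer).
std::optional<std::string> TryCompress(const std::string& in,
                                       std::size_t max_out) {
  std::string out;
  for (char c : in) {
    if (out.empty() || out.back() != c) out.push_back(c);
    if (out.size() > max_out) return std::nullopt;
  }
  return out;
}

std::string StoreBlock(const std::string& data) {
  if (!data.empty()) {
    // Only accept compression that actually saves space.
    if (auto c = TryCompress(data, data.size() - 1)) {
      return std::string(1, kCompressed) + *c;
    }
  }
  return std::string(1, kRaw) + data;  // fall back to uncompressed
}
```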
This is a large change because we needed to (or were compelled to)
* Update all the existing callers of CompressBlock, sometimes with substantial changes. This includes introducing GrowableBuffer to reuse between calls rather than std::string, which (at least in C++17) requires zeroing out data when allocating/growing a buffer.
* Re-implement built-in Compressors (V2; V1 is obsolete) to efficiently implement the new version of the API, no longer wrapping the `OLD_CompressData()` function. The new compressors appropriately leverage the CompressBlock virtual call required for the customization interface and no longer rely on a `switch` on compression type for each block. The implementations are largely adaptations of the old implementations, except
* LZ4 and LZ4HC are notably upgraded to take advantage of WorkingArea (see performance tests). And for simplicity in the new implementation, we are dropping support for some super old versions of the library.
* Getting snappy to work with limited-size output buffer required using the Sink/Source interfaces, which appear to be well supported for a long time and efficient (see performance tests).
* Replace awkward old CompressionManager::GetDecompressorForCompressor with Compressor::GetOptimizedDecompressor (which is optional to implement)
* Small behavior change where we treat lack of support for compression closer to not configuring compression, such as incompatibility with block_align. This is motivated by giving CompressionManager the freedom of determining when compression can be excluded for an entire file despite the configured "compression" type, and thus only surfacing actual incompatibilities not hypothetical ones that might be irrelevant to the CompressionManager (or build configuration). Unit tests in `table_test` and `compact_files_test` required update.
* Some lingering clean up of CompressedSecondaryCache and a re-optimization made possible by compressing into an existing buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13805
Test Plan:
for correctness, existing tests
## Performance Test
As I generally only modified compression paths, I'm using a db_bench write benchmark, with before & after configurations running at the same time. vc=1 means verify_compression=1
```
USE_CLANG=1 DEBUG_LEVEL=0 LIB_MODE=static make -j100 db_bench
SUFFIX=`tty | sed 's|/|_|g'`; for CT in zlib bzip2 none snappy zstd lz4 lz4hc none snappy zstd lz4 bzip2; do for VC in 0 1; do echo "$CT vc=$VC"; (for I in `seq 1 20`; do BIN=/dev/shm/dbbench${SUFFIX}.bin; rm -f $BIN; cp db_bench $BIN; $BIN -db=/dev/shm/dbbench$SUFFIX --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 -compression_type=$CT -verify_compression=$VC 2>&1 | grep micros/op; done) | awk '{n++; sum += $5;} END { print int(sum / n); }'; done; done
```
zlib vc=0 524198 -> 524904 (+0.1%)
zlib vc=1 430521 -> 430699 (+0.0%)
bzip2 vc=0 61841 -> 60835 (-1.6%)
bzip2 vc=1 49232 -> 48734 (-1.0%)
none vc=0 1802375 -> 1906227 (+5.8%)
none vc=1 1837181 -> 1950308 (+6.2%)
snappy vc=0 1783266 -> 1901461 (+6.6%)
snappy vc=1 1799703 -> 1879660 (+4.4%)
zstd vc=0 1216779 -> 1230507 (+1.1%)
zstd vc=1 996370 -> 1015415 (+1.9%)
lz4 vc=0 1801473 -> 1943095 (+7.9%)
lz4 vc=1 1799155 -> 1935242 (+7.6%)
lz4hc vc=0 349719 -> 1126909 (+222.2%)
lz4hc vc=1 348099 -> 1108933 (+218.6%)
(Repeating the most important ones)
none vc=0 1816878 -> 1952221 (+7.4%)
none vc=1 1813736 -> 1904622 (+5.0%)
snappy vc=0 1794816 -> 1875062 (+4.5%)
snappy vc=1 1789363 -> 1873771 (+4.7%)
zstd vc=0 1202592 -> 1225164 (+1.9%)
zstd vc=1 994322 -> 1016688 (+2.2%)
lz4 vc=0 1786959 -> 1971518 (+10.3%)
lz4 vc=1 1829483 -> 1935871 (+5.8%)
I confirmed manually that the new WorkingArea for LZ4HC makes the huge difference on that one, but not as much difference for LZ4, presumably because LZ4HC uses much larger buffers/structures/whatever for better compression ratios.
Reviewed By: hx235
Differential Revision: D79111736
Pulled By: pdillinger
fbshipit-source-id: 1ce1b14af9f15365f1b6da49906b5073a8cecc14
Summary:
Unit Test for a repro for the fix that was reported by https://github.com/facebook/rocksdb/pull/13743
There's potential dataloss when Remote Compaction entries are all removed due to various reasons (CompactionFilter, DeleteRange covering all keys of the SST file, etc)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13812
Test Plan:
```
./compaction_service_test --gtest_filter="*CompactionServiceTest.EmptyResult*"
```
Failed before merging https://github.com/facebook/rocksdb/pull/13743, now passing
Reviewed By: cbi42
Differential Revision: D79192829
Pulled By: jaykorean
fbshipit-source-id: e200300c4a7993de21c63cd92bda65b692921b89
Summary:
We were seeing some internal builds apparently failing the `-d /mnt/gvfs/third-party` check. Although third-party2 is likely a better check (see dependencies_platform010.sh), that would create a big headache with check_format_compatible.sh which has to work across codebase versions.
* Report a WARNING when we detect on a Meta machine but the `-d /mnt/gvfs/third-party` check fails
* Let USE_CLANG influence default compiler choice so that things might still work in that case (e.g. `USE_CLANG=1 make -j24 check`)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13820
Test Plan: manual, CI
Reviewed By: jaykorean
Differential Revision: D79277197
Pulled By: pdillinger
fbshipit-source-id: 19b2d45ed794f64bbf838f4414568d77ae9ca6f1
Summary:
Preserve tombstones when `allow_ingest_behind` is enabled so that they can be applied to ingested files. This can be useful when users use ingest-behind to buffer updates where deletions need to be preserved. This fixes https://github.com/facebook/rocksdb/issues/13571.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13807
Test Plan: updated a unit test to verify that tombstones are not dropped during compaction.
Reviewed By: hx235
Differential Revision: D79016109
Pulled By: cbi42
fbshipit-source-id: c4d31ef32c88468ababcc1ea5af5db6de42a3b0d
Summary:
As title. We will re-enable it once fixed
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13815
Test Plan: N/A - Disabling the test.
Reviewed By: archang19
Differential Revision: D79172697
Pulled By: jaykorean
fbshipit-source-id: 936de3743816049cda811bde48b3b2207ed256ee
Summary:
**Issue**:
When running remote compaction, if all entries in the input files are expired, RocksDB incorrectly deletes an active file from the primary DB, leading to data loss and corruption.
**Root Cause**:
The current logic mistakenly mixed up the input and output file paths during the cleanup phase when no keys survive the compaction (all expired). This results in deleting the input files (which belong to the primary DB) instead of the output files (which belong to the SecondaryDB).
**Fix**:
Use `GetTableFileName` (virtual function) instead of `TableFileName`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13743
Reviewed By: hx235
Differential Revision: D79108650
Pulled By: jaykorean
fbshipit-source-id: 1c9ba971a0e9a62c15ebc014436cb8fc961af95c
Summary:
Building db_bench with clang and DEBUG_LEVEL=0 was failing with unused variable. This was not caught by CI so I have added this to the build-linux-clang-13-no_test_run job.
Also, while I was touching CI:
* Fold build-linux-release-rtti into build-linux-release by reducing the number of combinations tested between static/dynamic lib and rtti/not. I don't expect these to interact meaningfully with an extremely mature compiler.
* Combine build-linux-clang10-asan and build-linux-clang10-ubsan because clang is extremely reliable running both together
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13813
Test Plan: manual builds, CI
Reviewed By: krhancoc
Differential Revision: D79112643
Pulled By: pdillinger
fbshipit-source-id: 4ffc672718c05fa4597d637aacbc5a179ad8a0cf
Summary:
• Guard on `__cpp_lib_atomic_shared_ptr` to use `std::atomic<std::shared_ptr<T>>::load()`/`store()`
• Fall back to `std::atomic_load_explicit()`/`std::atomic_store_explicit()` under C++17
When attempting to build with C++20 using clang in a Linux environment, the build fails due to the deprecation of `atomic_load_explicit`.
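A minimal sketch of the guard described above (the wrapper class and method names are hypothetical; only the feature-test-macro pattern matches the commit): C++20 libraries provide `std::atomic<std::shared_ptr<T>>`, while the free-function `shared_ptr` atomics are deprecated in C++20 but remain the only option under C++17.

```cpp
#include <atomic>
#include <memory>

template <typename T>
class AtomicSharedPtr {
 public:
#if defined(__cpp_lib_atomic_shared_ptr)
  // C++20 path: member load()/store() on std::atomic<std::shared_ptr<T>>.
  std::shared_ptr<T> Load() const {
    return ptr_.load(std::memory_order_acquire);
  }
  void Store(std::shared_ptr<T> p) {
    ptr_.store(std::move(p), std::memory_order_release);
  }

 private:
  std::atomic<std::shared_ptr<T>> ptr_;
#else
  // C++17 path: free functions deprecated in C++20.
  std::shared_ptr<T> Load() const {
    return std::atomic_load_explicit(&ptr_, std::memory_order_acquire);
  }
  void Store(std::shared_ptr<T> p) {
    std::atomic_store_explicit(&ptr_, std::move(p),
                               std::memory_order_release);
  }

 private:
  std::shared_ptr<T> ptr_;
#endif
};
```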
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13744
Reviewed By: xingbowang
Differential Revision: D78997919
Pulled By: cbi42
fbshipit-source-id: f829c282cba878f072d4b0ad44192a87f73b8a90
Summary:
Simulate Remote Compaction in Stress Test by running a separate set of threads that runs remote compaction.
Queue and ResultMap for the remote compactions are stored in memory as part of the `SharedState`. They are shared across main worker threads and remote compaction worker threads.
`enable_remote_compaction` is replaced by `remote_compaction_worker_threads`.
If `remote_compaction_worker_threads` is set to 0, remote compaction is not enabled in Stress Test.
**To Follow up**
This PR covers happy path only. Failure injection in the remote worker thread will be added as a follow up.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13800
Test Plan:
```
./db_stress --remote_compaction_worker_threads=4 --flush_one_in=1000 --writepercent=40 --readpercent=40 --iterpercent=10 --prefixpercent=0 --delpercent=10 --destroy_db_initially=0 --clear_column_family_one_in=0 --reopen=0
```
```
python3 -u tools/db_crashtest.py blackbox --remote_compaction_worker_threads=8
```
Reviewed By: hx235
Differential Revision: D78862084
Pulled By: jaykorean
fbshipit-source-id: b262058c92d7fecc5e014cef5df9cca4a209921b
Summary:
To be compatible with some upcoming compression change/refactoring where we supply a fixed size buffer to CompressBlock, we need to support CompressedSecondaryCache storing uncompressed values when the compression ratio is not suitable. It seems crazy that CompressedSecondaryCache currently stores compressed values that are *larger* than the uncompressed value, and even explicitly exercises that case (almost exclusively) in the existing unit tests. But it's true.
This change fixes that with some other nearby refactoring/improvement:
* Update the in-memory representation of these cache entries to support uncompressed entries even when compression is enabled. AFAIK this also allows us to safely get rid of "don't support custom split/merge for the tiered case".
* Use more efficient in-memory representation for non-split entries
* For CompressionType and CacheTier, which are defined as single-byte data types, use a single byte instead of varint32. (I don't know if varint32 was an attempt at future-proofing for a memory-only schema or what.) Now using lossless_cast will raise a compiler error if either of these types is made too large for a single byte.
* Don't wrap entries in a CacheAllocationPtr object; it's not necessary. We can rely on the same allocator being provided at delete time.
* Restructure serialization/deserialization logic, hopefully simpler or easier to read/understand.
* Use a RelaxedAtomic for disable_cache_ to avoid race.
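The single-byte point above can be sketched as follows (the enum and helper here are illustrative stand-ins, and this is only in the spirit of `lossless_cast`, not its actual implementation): for enums with a one-byte underlying type, varint encoding buys nothing, and a compile-time check breaks the build if the type later outgrows a byte.

```cpp
#include <cstdint>
#include <type_traits>

// Illustrative one-byte enum (values are stand-ins).
enum class CacheTier : uint8_t { kVolatileTier = 0, kNonVolatileBlockTier = 1 };

// Serialize an enum as exactly one byte; the static_assert fails at
// compile time if the enum's underlying type grows past one byte,
// forcing a schema update instead of silent truncation.
template <typename Enum>
char ToSingleByte(Enum e) {
  static_assert(sizeof(std::underlying_type_t<Enum>) == 1,
                "enum no longer fits in one byte; update the schema");
  return static_cast<char>(e);
}
```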
Suggested follow-up on CompressedSecondaryCache:
* Refine the exact strategy for rejecting compressions
* Still have a lot of buffer copies; try to reduce
* Revisit the split-merge logic and try to make it more efficient overall, more unified with non-split case
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13797
Test Plan:
Unit tests updated to use actually compressible strings in many places and more testing around non-compressible string.
## Performance Test
There was some pre-existing issue causing decompression failures in the compressed secondary cache with cache_bench that is somehow fixed by this change. These decompression failures were present before the new compression API, but since then they cause assertion failures rather than being quietly ignored. For the "before" test here, they are back to being quietly ignored. And the cache_bench changes here were back-ported to the "before" configuration.
### No compressed secondary (setting expectations)
```
./cache_bench --cache_type=auto_hyper_clock_cache -cache_size=8000000000 -populate_cache
```
Max key : 3906250
Before:
Complete in 12.784 s; Rough parallel ops/sec = 2503123
Thread ops/sec = 160329; Lookup hit ratio: 0.686771
After:
Complete in 12.745 s; Rough parallel ops/sec = 2510717 (in the noise)
Thread ops/sec = 159498; Lookup hit ratio: 0.68686
### Compressed secondary, no split/merge
Same max key and approximate total memory size
```
/usr/bin/time ./cache_bench --cache_type=auto_hyper_clock_cache -cache_size=4000000000 -populate_cache -resident_ratio=0.125 -compressible_to_ratio=0.4 --secondary_cache_uri=compressed_secondary_cache://capacity=4000000000
```
Before:
Complete in 18.690 s; Rough parallel ops/sec = 1712144
Thread ops/sec = 108683; Lookup hit ratio: 0.776683
Latency: P50: 4205.19 P75: 15281.76 P99: 43810.98 P99.9: 71487.41 P99.99: 165453.32
max RSS (according to /usr/bin/time): 9341856
After:
Complete in 17.878 s; Rough parallel ops/sec = 1789951 (+4.5%)
Thread ops/sec = 114957; Lookup hit ratio: 0.792998 (+0.016)
Latency: P50: 4012.70 P75: 14477.63 P99: 40039.70 P99.9: 62521.04 P99.99: 167049.18
max RSS (according to /usr/bin/time): 9235688
The improved hit ratio is probably from fixing the failed decompressions (somehow). And my modifications could have improved CPU efficiency, or it could be the small penalty the benchmark naturally imposes on most misses (generate another value and insert it).
### Compressed secondary, with split/merge
```
/usr/bin/time ./cache_bench --cache_type=auto_hyper_clock_cache -cache_size=4000000000 -populate_cache -resident_ratio=0.125 -compressible_to_ratio=0.4 --secondary_cache_uri='compressed_secondary_cache://capacity=4000000000;enable_custom_split_merge=true'
```
Before:
Complete in 20.062 s; Rough parallel ops/sec = 1595075
Thread ops/sec = 101759; Lookup hit ratio: 0.787129
Latency: P50: 5338.53 P75: 16073.46 P99: 46752.65 P99.9: 73459.11 P99.99: 201318.75
max RSS (according to /usr/bin/time): 9049852
After:
Complete in 18.564 s; Rough parallel ops/sec = 1723771 (+8.1%)
Thread ops/sec = 110724; Lookup hit ratio: 0.813414 (+0.026)
Latency: P50: 5234.75 P75: 14590.43 P99: 41401.03 P99.9: 65606.50 P99.99: 157248.04
max RSS (according to /usr/bin/time): 8917592
Looks like an improvement
Reviewed By: anand1976
Differential Revision: D78842120
Pulled By: pdillinger
fbshipit-source-id: 5f754b160c37ebee789279178ebb5e862071bdb2
Summary:
Create a new API, FileSystem::SyncFile, for file sync, so that we can sync file content to the file system directly where needed, without any other modification. This is mostly used in combination with file linking: on some file systems, linking a file does not guarantee that its content is synced to the file system.
https://github.com/facebook/rocksdb/issues/13741
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13762
Test Plan:
Unit test
T229418750
Reviewed By: pdillinger
Differential Revision: D78121137
Pulled By: xingbowang
fbshipit-source-id: 0ea8a5a3b486e0b61636700400613fed6bbd3faa
Summary:
The yield is of little use because waitFor should already be yielding.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13796
Reviewed By: pdillinger
Differential Revision: D78823656
Pulled By: jainpr
fbshipit-source-id: 040eaf596938ce8db535bc810ad77a9e50b2d551
Summary:
This diff fixes an omission in which the property_bag was not pushed down to the BlockBasedIterator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13795
Reviewed By: anand1976
Differential Revision: D78762294
Pulled By: krhancoc
fbshipit-source-id: 8970b0a87e35d07d5a0dd16f360ec96859f66550
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13794
LLVM has a warning `-Wdeprecated-redundant-constexpr-static-def` which raises the warning:
> warning: out-of-line definition of constexpr static data member is redundant in C++17 and is deprecated
Since we are now on C++20, we can remove the out-of-line definition of constexpr static data members. This diff does so.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Reviewed By: meyering
Differential Revision: D78635037
fbshipit-source-id: a90c68469947705c65f36588b2d575237689dbe8
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13793
LLVM has a warning `-Wdeprecated-redundant-constexpr-static-def` which raises the warning:
> warning: out-of-line definition of constexpr static data member is redundant in C++17 and is deprecated
Since we are now on C++20, we can remove the out-of-line definition of constexpr static data members. This diff does so.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Reviewed By: meyering
Differential Revision: D78635005
fbshipit-source-id: bd7cbfff0580b9579e78237ec4371615d3609536
Summary:
This patch reverts "NewRandomRWFile" back to "ReopenWritableFile" in the external sst file ingestion job when a file is linked instead of copied. The reason is that some file systems do not support "NewRandomRWFile". A long-term fix is in progress.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13791
Test Plan: Unit test
Reviewed By: pdillinger
Differential Revision: D78697825
Pulled By: xingbowang
fbshipit-source-id: d3651223ab1f2369aac34b772bba8049c6c2c628
Summary:
This diff introduces ScanOptions pruning. Previously, the intent was to prefetch for each sub-iterator of the level iterator; however, since BlockBasedIterator does not prefetch asynchronously, that optimization does not make sense just yet.
For now we will prune the ScanOptions to the overlapping ranges and make sure they are properly piped to the underlying layers (during Prepare and Seek).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13780
Reviewed By: cbi42
Differential Revision: D78436869
Pulled By: krhancoc
fbshipit-source-id: 681fe7f7f88b04b5c2d60cb3a5de01e03f6f8431
Summary:
So that we can use --command=recompress with a custom CompressionManager. (It's not required for reading files using a custom CompressionManager because those can already use ObjectLibrary for dependency injection.)
Suggested follow-up:
* These tests should not be using C arrays, snprintf, manual delete, etc. except for thin compatibility with argc/argv.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13783
Test Plan: unit test added, some manual testing
Reviewed By: archang19
Differential Revision: D78574434
Pulled By: pdillinger
fbshipit-source-id: 609e6c6439090e6b7e9b63fbd4c2d3f04b104fcf