Summary:
When building a Release on Windows, RTTI is not available, so asserts that use dynamic_cast need to be disabled
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14280
Reviewed By: nmk70
Differential Revision: D91807791
Pulled By: mszeszko-meta
fbshipit-source-id: e29c19c757bcd076a1f09ed40b306bb50ba9e882
Summary:
Causing failures and not yet supported. Also putting a note in db.h about the combination being unsupported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14189
Test Plan: started up blackbox_crash_test_with_ts many times and checked command line to be confident it's excluded.
Reviewed By: hx235
Differential Revision: D89297971
Pulled By: pdillinger
fbshipit-source-id: c5134351d9ecb37879c7e3319c17dd9228d7f12a
Summary:
`PosixRandomAccessFile::MultiRead` was introduced in Dec 2019 in https://github.com/facebook/rocksdb/pull/5881. Two years later, https://github.com/facebook/rocksdb/pull/9578 introduced the `PosixRandomAccessFile::ReadAsync` API, which reused the same `PosixFileSystem` IO ring as the `MultiRead` API, consequently writing to the very same ring's submission queue (without waiting!). This 'shared ring' design is problematic: sequentially interleaving `ReadAsync` and `MultiRead` calls on the very same thread might result in `MultiRead` reading 'unknown' events, leading to `Bad cqe data` errors (falsely perceived as corruption) - which, for some services (running on local flash), is in itself a hard blocker for adopting RocksDB async prefetching ('async IO') that relies heavily on the `ReadAsync` API. This change solves the problem by maintaining separate thread-local IO rings for `async reads` and `multi reads`, ensuring correct execution. In addition, we're adding more robust error handling in the form of retries on kernel interrupts and draining the queue when the process is under memory pressure. Separately, we're enhancing performance by explicitly marking the rings as written to / read from by a single thread (`IORING_SETUP_SINGLE_ISSUER` [if available]) and deferring task work until just before the application intends to process completions (`IORING_SETUP_DEFER_TASKRUN` [if available]). See https://man7.org/linux/man-pages/man2/io_uring_setup.2.html for reference.
## Benchmark
**TLDR**
There's no evident advantage of using `io_uring_submit` (relative to the proposed `io_uring_submit_and_wait`) across batches of size 10, 250 and 1000, simulating batch sizes significantly below, close to, and 4x above `kIoUringDepth`. `io_uring_submit` might be more appealing if at least one of the IOs is slow (which was NOT the case during the benchmark). More notably, with this PR, switching from `io_uring_submit_and_wait` to `io_uring_submit` can be done with a single-line change thanks to the implemented guardrails (we can follow up with an optional config for true ring semantics [if needed]).
**Compilation**
```
DEBUG_LEVEL=0 make db_bench
```
**Create DB**
```
./db_bench \
--db=/db/testdb_2.5m_k100_v6144_16kB_LZ4 \
--benchmarks=fillseq \
--num=2500000 \
--key_size=100 \
--value_size=6144 \
--compression_type=LZ4 \
--block_size=16384 \
--seed=1723056275
```
**LSM**
* L0: 2 files, L1: 5, L2: 49, L3: 79
* Each file is roughly 35M in size
### MultiReadRandom (with caching disabled)
Each run was preceded by OS page cache cleanup with `echo 1 | sudo tee /proc/sys/vm/drop_caches`.
```
./db_bench \
--use_existing_db=true \
--db=/db/testdb_2.5m_k100_v6144_16kB_LZ4 \
--compression_type=LZ4 \
--benchmarks=multireadrandom \
--num=<N> \
--batch_size=<B> \
--io_uring_enabled=true \
--async_io=false \
--optimize_multiget_for_io=false \
--threads=4 \
--cache_size=0 \
--use_direct_reads=true \
--use_direct_io_for_flush_and_compaction=true \
--cache_index_and_filter_blocks=false \
--pin_l0_filter_and_index_blocks_in_cache=false \
--pin_top_level_index_and_filter=false \
--prepopulate_block_cache=0 \
--row_cache_size=0 \
--use_blob_cache=false \
--use_compressed_secondary_cache=false
```
| | B=10; N=100,000 | B=250; N=80,000 | B=1,000; N=20,000
-- | -- | -- | --
baseline | 31.5 (± 0.4) us/op | 17.5 (± 0.5) us/op | 13.5 (± 0.4) us/op
io_uring_submit_and_wait | 31.5 (± 0.6) us/op | 17.7 (± 0.4) us/op | 13.6 (± 0.4) us/op
io_uring_submit | 31.5 (± 0.6) us/op | 17.5 (± 0.5) us/op | 13.4 (± 0.45) us/op
### Specs
| Property | Value
-- | --
RocksDB | version 10.9.0
Date | Tue Dec 9 15:57:03 2025
CPU | 56 * Intel Sapphire Rapids (T10 SPR)
Kernel version | 6.9.0-0_fbk12_0_g28f2d09ad102
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14158
Reviewed By: anand1976
Differential Revision: D88172809
Pulled By: mszeszko-meta
fbshipit-source-id: 5198de3d2f18f76fee661a2ec5f447e79ba06fbd
Summary:
**Context/Summary:**
Truncated range deletion start keys in input files can be output by CompactionIterator with type kTypeMaxValid instead of kTypeRangeDeletion, to satisfy an ordering requirement between the truncated range deletion start key and a file's point keys. There was a plan to skip such keys in https://github.com/facebook/rocksdb/pull/14122 but blockers remain to fulfill that plan.
Resumable compaction is not yet able to handle resumption from a range deletion well and should treat the kTypeMaxValid type the same as kTypeRangeDeletion for resumption. Previously it didn't, and mistakenly allowed resumption from a delete range. That led to an assertion failure complaining about lacking the information needed to update file boundaries in the presence of a range deletion while cutting an output file, after the compaction resumed from that delete range and happened to cut the output file shortly after, without any point keys in between.
```
frame #9: 0x00007f4f4743bc93 libc.so.6`__GI___assert_fail(assertion="meta.smallest.size() > 0", file="db/compaction/compaction_outputs.cc", line=530, function="rocksdb::Status rocksdb::CompactionOutputs::AddRangeDels(rocksdb::CompactionRangeDelAggregator&, const rocksdb::Slice*, const rocksdb::Slice*, rocksdb::CompactionIterationStats&, bool, const rocksdb::InternalKeyComparator&, rocksdb::SequenceNumber, std::pair<long unsigned int, long unsigned int>, const rocksdb::Slice&, const string&)") at assert.c:101:3
frame #10: 0x00007f4f4808c68c librocksdb.so.10.9`rocksdb::CompactionOutputs::AddRangeDels(this=0x00007f4f0c27e1a0, range_del_agg=0x00007f4f0c21ecc0, comp_start_user_key=0x0000000000000000, comp_end_user_key=0x0000000000000000, range_del_out_stats=0x00007f4f0dffa140, bottommost_level=false, icmp=0x00007f4ef4c93040, earliest_snapshot=13108729, keep_seqno_range=<unavailable>, next_table_min_key=0x00007f4ef4c8f540, full_history_ts_low="") at compaction_outputs.cc:530:7
frame #11: 0x00007f4f480480dd librocksdb.so.10.9`rocksdb::CompactionJob::FinishCompactionOutputFile(this=0x00007f4f0dffb890, input_status=<unavailable>, prev_table_last_internal_key=0x00007f4f0dffa650, next_table_min_key=0x00007f4ef4c8f540, comp_start_user_key=0x0000000000000000, comp_end_user_key=0x0000000000000000, c_iter=0x00007f4ef4c8f400, sub_compact=0x00007f4f0c27e000, outputs=0x00007f4f0c27e1a0) at compaction_job.cc:1917:31
```
This PR simply prevents kTypeMaxValid from being a resumption point, just like a regular range deletion - see commit 842d66eb18ea67e965d6acb1fce12c18eeb778d2
Besides that, the PR also improves the testing, variable naming, and logging in the resumable compaction code that were needed to debug this assertion failure - see commit https://github.com/facebook/rocksdb/pull/14184/commits/aecd4e7f971f6dd4df672d9e5f1409fe4747c561. These improvements are covered by existing tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14184
Test Plan:
- The stress test initially surfaced the error. Using the exact same LSM shapes and files from the stress test in a unit test, I was able to get a deterministic repro and confirmed the fix resolves the error. This is the repro test 1075936e69
```
./compaction_service_test --gtest_filter=ResumableCompactionServiceTest.CompactSpecificFilesFromExistingDBWithCancelAndResume
# Before fix
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ResumableCompactionServiceTest
[ RUN ] ResumableCompactionServiceTest.CompactSpecificFilesFromExistingDBWithCancelAndResume
compaction_service_test: db/compaction/compaction_outputs.cc:530: rocksdb::Status rocksdb::CompactionOutputs::AddRangeDels(rocksdb::CompactionRangeDelAggregator&, const rocksdb::Slice*, const rocksdb::Slice*, rocksdb::CompactionIterationStats&, bool, const rocksdb::InternalKeyComparator&, rocksdb::SequenceNumber, std::pair<long unsigned int, long unsigned int>, const rocksdb::Slice&, const string&): Assertion `meta.smallest.size() > 0' failed.
Received signal 6 (Aborted)
Invoking GDB for stack trace...
[New LWP 2621610]
[New LWP 2621611]
[New LWP 2621612]
[New LWP 2621613]
[New LWP 2621614]
[New LWP 2621630]
[New LWP 2621631]
# After fix
Note: Google Test filter = ResumableCompactionServiceTest.CompactSpecificFilesFromExistingDBWithCancelAndResume
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ResumableCompactionServiceTest
[ RUN ] ResumableCompactionServiceTest.CompactSpecificFilesFromExistingDBWithCancelAndResume
[ OK ] ResumableCompactionServiceTest.CompactSpecificFilesFromExistingDBWithCancelAndResume (4722 ms)
[----------] 1 test from ResumableCompactionServiceTest (4722 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (4722 ms total)
[ PASSED ] 1 test.
```
- Follow-up: I tried a couple of times to coerce the truncated range deletion from scratch in the unit test but failed to do so. Considering that kTypeMaxValid may no longer be output by the compaction iterator after https://github.com/facebook/rocksdb/pull/14122/files gets landed again (which would obsolete the bug), AND the simple nature of this fix 842d66eb18ea67e965d6acb1fce12c18eeb778d2, AND that the worst case of this fix going wrong is just less resumption, I decided to leave writing a unit test that coerces a truncated range deletion from scratch as a follow-up. Maybe I will draw inspiration from https://github.com/facebook/rocksdb/pull/14122/files.
Reviewed By: jaykorean
Differential Revision: D88912663
Pulled By: hx235
fbshipit-source-id: 80a01135684c8fea659650faaa00c2dc452c482a
Summary:
**Context/Summary:**
Stress test flags printed by db_crashtest.py, like `./db_stress ... --secondary_cache_uri=compressed_secondary_cache://capacity=8388608;enable_custom_split_merge=true --otherflags=xxxx`, are not copy-paste-run friendly. Directly running this command will cause parsing hiccups due to special characters like `//` or `;`. This PR makes db_crashtest.py print a single-quoted value so that at least copy-paste-run works for unix-like shells (the most common case).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14180
Test Plan:
`python3 tools/db_crashtest.py --simple blackbox ...` displays the following
Before the fix, the value is not single-quoted
```
Use random seed for iteration 9698536012932546857
Running db_stress with pid=1280640:./db_stress --secondary_cache_uri=compressed_secondary_cache://capacity=8388608;enable_custom_split_merge=true ...
// Directly copying, pasting and running the ./db_stress command will encounter
Error: Read(-readpercent=0)+Prefix(-prefixpercent=0)+Write(-writepercent=45)+Delete(-delpercent=0)+DeleteRange(-delrangepercent=30)+Iterate(-iterpercent=40)+CustomOps(-customopspercent=0) percents != 100!
bash: --set_options_one_in=0: command not found
```
After the fix, the value is single-quoted
```
Use random seed for iteration 6017815530972723112
Running db_stress with pid=1234632: ./db_stress --secondary_cache_uri='compressed_secondary_cache://capacity=8388608;enable_custom_split_merge=true' ....
// Directly copying, pasting and running the ./db_stress command is fine
```
Reviewed By: archang19
Differential Revision: D88688584
Pulled By: hx235
fbshipit-source-id: 88b8b2de7c2c5619b6e19900f4144dcd8e032f7b
Summary:
r? cbi42
Exposes RocksDB's remote compaction functionality through the C API, enabling C/FFI clients (Go, Rust, Python, etc.) to offload compaction work to remote workers.
## API Components
### Compaction Service
- Create service with schedule, wait, cancel, and on_installation callbacks
- Ownership transfers to the options object (auto-destroyed, no manual cleanup)
### Job Info (13 getters)
- DB/CF metadata and compaction details (priority, reason, levels, flags)
### Schedule Response
- Create with job ID and status (validated with errptr)
- Status: success, failure, aborted, use_local
### OpenAndCompact (for remote workers)
- Execute compaction on a worker node with environment/comparator overrides
- Cancellation support via atomic flags
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14136
Reviewed By: hx235
Differential Revision: D88316558
Pulled By: jaykorean
fbshipit-source-id: 60a0fee69ff1e650dd785d96ec656649263214f8
Summary:
Crash tests have been failing of late with this assertion failure - db_stress: `./table/block_based/block_based_table_iterator.h:656: void rocksdb::BlockBasedTableIterator::PrepareReadAsyncCallBack(rocksdb::FSReadRequest &, void *): Assertion `async_state->status.IsAborted()' failed.` Instead of asserting, surface the failure status so we can troubleshoot.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14171
Reviewed By: xingbowang
Differential Revision: D88396654
Pulled By: anand1976
fbshipit-source-id: 8d59d7ace0c522c17b7af17c50e16af876911bad
Summary:
To help find potential issues not showing up in ARM unit tests. I'm running it with and without TransactionDB (write-committed) for better coverage. The job expands the size of /dev/shm for adequate space on maximum performance storage, and adds swap space to reduce risk of OOM in case we fill that up.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14172
Test Plan: earlier drafts of this PR added the job to PR jobs, and the last before putting in "nightly" can be seen here: https://github.com/facebook/rocksdb/actions/runs/19945493840/job/57193797390?pr=14172
Reviewed By: archang19
Differential Revision: D88429479
Pulled By: pdillinger
fbshipit-source-id: bd4d9cda9256950c3c6c126c299a44dbbbc30c7e
Summary:
Fix missing const for arg of OptionChangeMigration
We switched the OptionChangeMigration API from std::string to std::string&, which caused the const qualifier to be lost at call sites, causing compilation failures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14173
Test Plan: Unit test
Reviewed By: pdillinger
Differential Revision: D88431457
Pulled By: xingbowang
fbshipit-source-id: a705f3b80cc5ff56dab73aa6a31c940798d8df45
Summary:
Revert "Fix a bug where compaction with range deletion can persist kTypeMaxValid in file metadata (https://github.com/facebook/rocksdb/issues/14122)"
Add a new unit test to capture the situation found by stress test
This reverts commit 8c7c8b8dab.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14170
Test Plan: Unit Test
Reviewed By: anand1976
Differential Revision: D88395956
Pulled By: xingbowang
fbshipit-source-id: 226649dc79a86010ad326ffb2eae35109dc96bc4
Summary:
Continuing work from https://github.com/facebook/rocksdb/issues/13965. Here I'm migrating the "next with shift" kind of bit field and for that I've added an API for atomic additive transformations that can be combined into a single atomic update for multiple fields. (I implemented more features than needed, just in case they are needed someday and to demonstrate what is possible.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14027
Test Plan: BitFields unit test updated/added, existing HCC tests
Reviewed By: xingbowang
Differential Revision: D83895094
Pulled By: pdillinger
fbshipit-source-id: e4487f34f5607b20f94b85a645ca654e6401e35d
Summary:
I want to reduce the time from when we call `StopBackup` to `CreateNewBackup` returning `BackupStopped`. We already check for the `stop_backup_` inside `CopyOrCreateFile` and `ReadFileAndComputeChecksum`, but we should add a check at the top of these methods to abort immediately. This could help save some latency from the file system metadata operations, like creating the sequential file and writable file.
We also want to update the API documentation for `StopBackup` which currently does not indicate that once it is called, all subsequent requests to create backups will fail.
In a follow up PR, we should also add coverage of `StopBackup` to the crash tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14129
Test Plan:
We were missing unit test coverage for `StopBackup`. I added test cases which cancel backups at different points in time.
Once this change is rolled out to production, we can monitor the DB close latencies, which depend on first cancelling ongoing backups
Reviewed By: pdillinger
Differential Revision: D87356536
Pulled By: archang19
fbshipit-source-id: 687094a41f096f6a156be65b2cce0b5054fb26f2
Summary:
Support ccache in make file
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14123
Test Plan: local build
Reviewed By: cbi42
Differential Revision: D87332892
Pulled By: xingbowang
fbshipit-source-id: 2088bd19bdab1bd7070734c886200be80f1a65af
Summary:
... from https://github.com/facebook/rocksdb/issues/14140. The assertion in the default implementation of CompressorWrapper::MaybeCloneSpecialized() could fail because this wrapper wasn't overriding it when it should have. (See the NOTE on that implementation.)
Because this release already has a breaking modification to the Compressor API (adding Clone()), I took this opportunity to add 'const' to MaybeCloneSpecialized(). Also marked some compression classes as 'final' that could be marked as such.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14150
Test Plan: unit test expanded to cover this case (verified failing before). Audited the rest of our CompressorWrappers.
Reviewed By: archang19
Differential Revision: D87793987
Pulled By: pdillinger
fbshipit-source-id: 61c4469b84e4a47451a9942df09277faeeccfe63
Summary:
This change enables a custom CompressionManager / Compressor to adopt custom handling for data and index blocks. In particular, index blocks for format_version >= 4 use a distinct variant of the block format. Thus, a potentially format-aware compression algorithm such as OpenZL should be told which kind of block we are compressing. (And previously I avoided passing block type in CompressBlock for efficient handling of things like dictionaries but also avoiding checks on every CompressBlock call.)
Most of the change is in BlockBasedTableBuilder, to call MaybeCloneSpecialized for both kDataBlock and kIndexBlock. But I also needed some small tweaks/additions to the public API:
* Require a Clone() function from Compressors, to support proper implementations of MaybeCloneSpecialized() in wrapper Compressors.
* Assert that the default implementation of CompressorWrapper::MaybeCloneSpecialized() is only used in allowable cases.
* Convenience function Compressor::CloneMaybeSpecialized()
This also fixes a serious bug/oversight in ManagedPtr (for ManagedWorkingArea) that somehow wasn't showing up before. It probably doesn't need a release note because the CompressionManager stuff is still considered experimental.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14140
Test Plan: Greatly expanded DBCompressionTest.CompressionManagerWrapper to make sure the distinction between data blocks and index blocks is properly communicated to a custom CompressionManager/Compressor. The test includes processing the expected structure of data and index blocks, to serve as a tested example for structure-aware compressors.
Reviewed By: hx235
Differential Revision: D87600019
Pulled By: pdillinger
fbshipit-source-id: 252ef78910073a0e45f2c81dd45ac87ff8a41fc6
Summary:
Range deletion start keys are considered during compaction for cutting output files. Due to an ordering requirement (see the comment above InsertNextValidRangeTombstoneAtLevel()) between a truncated range deletion start key and a file's point keys, there was logic in f6c9c3bf1c/db/range_del_aggregator.cc (L39) that changes the value type to kTypeMaxValid. However, kTypeMaxValid is not supposed to be persisted per f6c9c3bf1c/db/dbformat.h (L75-L76). This can cause forward compatibility issues reported in https://github.com/facebook/rocksdb/issues/14101. This PR fixes the issue by removing the logic that sets kTypeMaxValid and always skipping truncated range deletion start keys in CompactionMergingIterator.
For existing SST files, we want to avoid using this kTypeMaxValid, so this PR also introduces a new placeholder value type. This allows us to re-strengthen the relevant value type checks (IsExtendedValueType()) that were loosened for kTypeMaxValid.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14122
Test Plan:
- a unit test that persists kTypeMaxValid before this fix
- crash test with frequent range deletion: `python3 ./tools/db_crashtest.py blackbox --delrangepercent=11 --readpercent=35`
- Generate SST files with 0x1A as the value type (kTypeMaxValid before this change) in file metadata. Run ldb with the strengthened check in IsExtendedValueType() to dump the MANIFEST. It fails to parse the MANIFEST as expected before this PR and succeeds after this PR.
```
Error in processing file /tmp/rocksdbtest-543376/db_range_del_test_2549357_6547198162080866792/MANIFEST-000005 Corruption: VersionEdit: new-file4 entry The file /tmp/rocksdbtest-543376/db_range_del_test_2549357_6547198162080866792/MANIFEST-000005 may be corrupted.
```
Reviewed By: pdillinger
Differential Revision: D87016541
Pulled By: cbi42
fbshipit-source-id: 9957a095db2cd9947463b403f352bd9a1fd70a76
Summary:
Fixing internal validator failure
```
Every project specific source file must contain a doc block with an appropriate copyright header. Unrelated files must be listed as exceptions in the Copyright Headers Exceptions page in the repo dashboard.
A copyright header clearly indicates that the code is owned by Meta. Every open source file must start with a comment containing "Meta Platforms, Inc. and affiliates"
https://github.com/facebook/rocksdb/blob/main/buckifier/targets_cfg.py:
The first 16 lines of 'buckifier/targets_cfg.py' do not contain the patterns:
(Meta Platforms, Inc. and affiliates)|(Facebook, Inc(\.|,)? and its affiliates)|([0-9]{4}-present(\.|,)? Facebook)|([0-9]{4}(\.|,)? Facebook)
```
While fixing the text to pass the linter, I took the opportunity to modify `format-diff.sh` script to add the copyright header automatically if missing in new files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14143
Test Plan:
```
$> make format
```
**new python file**
```
build_tools/format-diff.sh
Checking format of uncommitted changes...
Checking for copyright headers in new files...
Added copyright header to build_tools/test.py
Copyright headers were added to new files.
Nothing needs to be reformatted!
```
**new header file**
```
build_tools/format-diff.sh
Checking format of uncommitted changes...
Checking for copyright headers in new files...
Added copyright header to db/db_impl/db_impl_jewoongh.h
Copyright headers were added to new files.
Nothing needs to be reformatted!
```
Reviewed By: hx235
Differential Revision: D87653124
Pulled By: jaykorean
fbshipit-source-id: 164322cfcd2c162bb3b41bb8f3bafefa3f20b695
Summary:
**Context/Summary:**
.. because verify_output_flags contains information about the usage of paranoid_file_check, which is currently not yet compatible with resumable remote compaction
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14139
Test Plan: Existing tests
Reviewed By: jaykorean
Differential Revision: D87582635
Pulled By: hx235
fbshipit-source-id: ef21223da53a0696fa3ca9b1617c2c1ee2e19878
Summary:
**Context/Summary:**
Due to double's 53-bit mantissa limitation, large uint64_t values lose precision when converted to double. Values equal to or smaller than UINT64_MAX (but greater than 2^64 - 1024) round up to 2^64 - since rounding up results in less error than rounding down - which exceeds UINT64_MAX. The check `std::numeric_limits<uint64_t>::max() / op1 < op2` won't catch those cases. Casting such out-of-range doubles back to uint64_t causes undefined behavior.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14132
Test Plan:
```
COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 CC=clang-18 CXX=clang++-18 ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j55 db_stress
python3 tools/db_crashtest.py --simple blackbox --compact_range_one_in=5 --target_file_size_base=9223372036854775807 // Half of std::numeric_limits<uint64_t>::max()
```
It fails with
```
stderr:
options/cf_options.cc:1087:32: runtime error: 1.84467e+19 is outside the range of representable values of type 'unsigned long'
UndefinedBehaviorSanitizer: undefined-behavior options/cf_options.cc:1087:32 in
```
before the fix but not after.
Reviewed By: pdillinger
Differential Revision: D87434936
Pulled By: hx235
fbshipit-source-id: 65563edf9faf732410bdba8b9e4b7fd61b958169
Summary:
I have been using sst_dump --command=recompress for some ad hoc automation for compression engineering and these new options help with that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14133
Test Plan: manual
Reviewed By: hx235
Differential Revision: D87453635
Pulled By: pdillinger
fbshipit-source-id: 2ae54e13a9221ec27c6637fea16623465a9163ae
Summary:
Saw a mysterious failure of assertion
`assert(rep_->props.num_data_blocks == 0)` in
DBCompressionTest/CompressionFailuresTest.CompressionFailures/45. This seems to be caused by a parallel compression failure arriving after the emit thread has started Finish() but before the Flush() at the start of Finish(). We can fix this by relaxing the assertion to allow for the !ok() case. Testing revealed more ok() assertions that needed to be relaxed/moved.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14130
Test Plan: Added a sync point to inject a failure status in the right place and added to a unit test to be sure the case is essentially covered. It would arguably be more realistic to force a particular thread interleaving, but I believe simple is good here.
Reviewed By: hx235
Differential Revision: D87377709
Pulled By: pdillinger
fbshipit-source-id: 4bd465673b084afcc235688503d1c2f464eed32d
Summary:
**Context/Summary:**
This PR adds multi-cf support to option migration. The original implementation sets options, opens the db, compacts files and reopens the db in almost all three branches below. Such a design makes expanding to multi-cf difficult, as it requires changing all these places within each branch, causing code redundancy.
```
Status OptionChangeMigration(std::string dbname, const Options& old_opts,
const Options& new_opts) {
if (old_opts.compaction_style == CompactionStyle::kCompactionStyleFIFO) {
// LSM generated by FIFO compaction can be opened by any compaction.
return Status::OK();
} else if (new_opts.compaction_style ==
CompactionStyle::kCompactionStyleUniversal) {
return MigrateToUniversal(dbname, old_opts, new_opts);
} else if (new_opts.compaction_style ==
CompactionStyle::kCompactionStyleLevel) {
return MigrateToLevelBase(dbname, old_opts, new_opts);
} else if (new_opts.compaction_style ==
CompactionStyle::kCompactionStyleFIFO) {
return CompactToLevel(old_opts, dbname, 0, 0 /* l0_file_size */, true);
} else {
return Status::NotSupported(
"Do not how to migrate to this compaction style");
}
}
```
Therefore this PR
- Refactor the option migration implementation by moving the common parts into the high-level `OptionChangeMigration()` through `PrepareNoCompactionCFDescriptors()` and `OpenDBWithCFs()` so `MigrateAllCFs()` can focus on compaction only.
- Treat the original OptionChangeMigration() API as a special case of the multi-cf version option migration
- Add multiple-cf support
A few notes:
- CompactToLevel() originally modifies the compaction-related options conditionally before doing compaction. This is moved into earlier steps through `ApplySpecialSingleLevelSettings()` in `PrepareNoCompactionCFDescriptors()`
- MigrateToUniversal() originally opens the db twice with essentially the same option. This PR reduces that to one open
- Option migration does not always use the old option to compact the db and reopen the db after migration, see ` return CompactToLevel(new_opts, dbname, new_opts.num_levels - 1,/*l0_file_size=*/0, false);`. `PrepareNoCompactionCFDescriptors()` is where we handle those decisions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14059
Test Plan:
- Existing UTs
- New UTs
Reviewed By: cbi42
Differential Revision: D84852970
Pulled By: hx235
fbshipit-source-id: 936b456cf9fb4c3ccb687e5d1387f2d67a1448be
Summary:
This diff introduces async prepare of all iterators within a MultiScan. Currently each iterator is prepared as it's needed; with this diff, we prepare all iterators during the prepare phase of the LevelIterator. This allows more time for each IO to be dispatched and serviced, increasing the odds that a block is ready as the scan seeks to it.
Benchmark is prefilled using
```
KEYSIZE=64
VALUESIZE=512
NUMKEYS=5000000
SCAN_SIZE=100
DISTANCE=25000
NUM_SCANS=15
THREADS=1
./db_bench --db=$DB \
--benchmarks="fillseq" \
--write_buffer_size=5242880 \
--max_write_buffer_number=4 \
--target_file_size_base=5242880 \
--disable_wal=1 --key_size=$KEYSIZE \
--value_size=$VALUESIZE --num=$NUMKEYS --threads=32
```
And benchmark ran is
```
run() {
echo 1 | sudo tee /proc/sys/vm/drop_caches
./db_bench --db=$DB --use_existing_db=1 \
--benchmarks=multiscan \
--disable_auto_compactions=1 --seek_nexts=$SCAN_SIZE \
--multiscan-use-async-io=1 \
--multiscan-size=$NUM_SCANS --multiscan-stride=$DISTANCE \
--key_size=$KEYSIZE --value_size=$VALUESIZE \
--num=$NUMKEYS --threads=$THREADS --duration=60 --statistics
}
```
The benchmark uses large stride sizes to ensure that two scans touch separate files. We reduce the size of the block cache to increase the likelihood of reads (and simulate larger data sets).
**Branch:**
```
Integrated BlobDB: blob cache disabled
RocksDB: version 10.8.0
Date: Tue Nov 11 13:26:29 2025
CPU: 166 * AMD EPYC-Milan Processor
CPUCache: 512 KB
Keys: 64 bytes each (+ 0 bytes user-defined timestamp)
Values: 512 bytes each (256 bytes after compression)
Entries: 5000000
Prefix: 0 bytes
Keys per prefix: 0
RawSize: 2746.6 MB (estimated)
FileSize: 1525.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
multiscan_stride = 25000
multiscan_size = 15
seek_nexts = 100
DB path: [/data/rocksdb/mydb]
multiscan : 837.941 micros/op 1193 ops/sec 60.001 seconds 71605 operations; (multscans:71605)
```
**Baseline:**
```
Set seed to 1762898809121995 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB: version 10.9.0
Date: Tue Nov 11 14:06:49 2025
CPU: 166 * AMD EPYC-Milan Processor
CPUCache: 512 KB
Keys: 64 bytes each (+ 0 bytes user-defined timestamp)
Values: 512 bytes each (256 bytes after compression)
Entries: 5000000
Prefix: 0 bytes
Keys per prefix: 0
RawSize: 2746.6 MB (estimated)
FileSize: 1525.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
multiscan_stride = 25000
multiscan_size = 15
seek_nexts = 100
DB path: [/data/rocksdb/mydb]
multiscan : 1129.916 micros/op 885 ops/sec 60.001 seconds 53102 operations; (multscans:53102)
```
Repeated for confirmation.
This introduces a ~20% improvement in latency and op/s.
Note: Benchmarks are single threaded because, as thread count increases, we start seeing large amounts of overhead induced by block cache contention, eventually resulting in baseline and branch performing equally.
Further, on network-attached storage with high latency, the level iterator prepares all iterators up front, so we expect a ~20% improvement even at high thread counts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14100
Reviewed By: anand1976
Differential Revision: D86913584
Pulled By: krhancoc
fbshipit-source-id: da9d0c890e25e392a33389ce6b80f9bfb84d3f85
Summary:
Oversight in https://github.com/facebook/rocksdb/issues/13964. More detail:
* Applies to cache_bench and db_bench (db_stress already using it)
* Make sure those along with db_stress treat "hyper_clock_cache" as "auto_hyper_clock_cache" because this is now the blessed implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14120
Test Plan: manual runs of the tools
Reviewed By: krhancoc
Differential Revision: D86913202
Pulled By: pdillinger
fbshipit-source-id: 07b425d3522103417f4b034735376b9d759af5fb
Summary:
Right now, Java's Get() calls are handled inefficiently. Status.NotFound is turned into an exception in the JNI layer and caught in the same function to be turned into a not-found return value. This causes significant overhead in scenarios where most queries end up not found. For example, in Spark's deduplication query, this exception-creation overhead is higher than Get() itself. With the proposed change, if the return status is NotFound, we return directly rather than going through the exception path.
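To make the saving concrete, here is a minimal sketch (all names are illustrative, not the actual RocksDB JNI code) of translating a NotFound status into an empty return value instead of constructing an exception:

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Illustrative status code; the real JNI layer inspects rocksdb::Status.
enum class Code { kOk, kNotFound };

struct KvStore {
  std::map<std::string, std::string> data;
  Code Get(const std::string& key, std::string* value) const {
    auto it = data.find(key);
    if (it == data.end()) return Code::kNotFound;
    *value = it->second;
    return Code::kOk;
  }
};

// Maps NotFound directly to an empty result; no exception object is
// constructed or caught on the (common) miss path.
std::optional<std::string> GetOrEmpty(const KvStore& db,
                                      const std::string& key) {
  std::string v;
  if (db.Get(key, &v) == Code::kNotFound) return std::nullopt;
  return v;
}
```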
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14095
Test Plan: Existing tests should cover all Get() cases, and they are passing.
Reviewed By: jaykorean
Differential Revision: D86797594
Pulled By: cbi42
fbshipit-source-id: 1202d24e46a2358976bb7c8ff38a2fd4783d0f99
Summary:
There are instances where an application might be interested in knowing the distribution of SST files for a key range at a particular level.
This implementation creates an overloaded GetColumnFamilyMetaData API where (startKey, endKey) can be passed along with level information to filter the necessary SST files, together with the key ranges for each SST file.
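A rough sketch of the filtering such an overload could perform internally (types and names here are assumptions, not the actual implementation):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for RocksDB's SST file metadata.
struct SstFileMeta {
  int level;
  std::string smallest_key;
  std::string largest_key;
};

// Keep files on `level` whose [smallest_key, largest_key] range overlaps
// the requested [start_key, end_key] range.
std::vector<SstFileMeta> FilterByRange(const std::vector<SstFileMeta>& files,
                                       int level,
                                       const std::string& start_key,
                                       const std::string& end_key) {
  std::vector<SstFileMeta> out;
  for (const auto& f : files) {
    if (f.level == level && f.smallest_key <= end_key &&
        f.largest_key >= start_key) {
      out.push_back(f);
    }
  }
  return out;
}
```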
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14009
Reviewed By: anand1976
Differential Revision: D83389707
fbshipit-source-id: 6df1dc1f9233efe9000b03cc1831b3c618cbcef3
Summary:
Support trivial move in the CompactFiles API, which was not supported previously.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14112
Test Plan: Unit test
Reviewed By: cbi42
Differential Revision: D86546150
Pulled By: xingbowang
fbshipit-source-id: 08a3ae9a055f3d3d41711403b1695f44977e6ea8
Summary:
**Summary:**
Merge the BuiltinFilterBitsBuilder into FilterBitsBuilder. This enables using
CalculateSpace() for accurate filter size estimation instead of hardcoded
bits-per-key which could result in incorrect estimations for different filter types.
The previous hardcoded estimate of 15 bits per key lived in the filter block builders' UpdateFilterSizeEstimate().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14111
Test Plan: - Existing filter tests pass (bloom_test, full_filter_block_test, filter_bench, db_bloom_filter_test)
Reviewed By: pdillinger
Differential Revision: D86473287
Pulled By: nmk70
fbshipit-source-id: cd4a47351e67444e944d5b1b375b3b13274dd6e3
Summary:
For all compactions, RocksDB performs a lightweight sanity check on output SST files before installation (in `CompactionJob::VerifyOutputFiles()`). However, this lightweight check may not catch corruption that is small enough to allow the SST files to still be opened.
There is an existing feature, `paranoid_file_check`, which opens the SST file, iterates through all keys, and checks the hash of each key. While this provides the ultimate level of data integrity checking, it comes at a high computational cost.
In this PR, we introduce a new mutable CF option, `verify_output_flags`. It is a bitmask enum that allows users to select various verification types, including block checksum verification, full key iteration, and file checksum verification (to be added in subsequent PRs). Note that the existing `paranoid_file_check` option is equivalent to a full key iteration check. Block-level checksum verification is much lighter than the full key iteration check.
Please note that the previously deprecated `verify_checksums_in_compaction` option (removed in version 5.3.0) was for verifying the checksum of **input SST files**. RocksDB continues to perform this verification for both local and remote compactions, and this behavior remains unchanged. In contrast, this PR focuses on verifying the **output SST files**.
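As a hedged sketch of how a bitmask option like this is typically defined and combined (enumerator names and values below are illustrative, not necessarily RocksDB's):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative bitmask enum; real enumerator names/values may differ.
enum VerifyOutputFlags : uint32_t {
  kVerifyNone = 0,
  kVerifyBlockChecksums = 1 << 0,    // lightweight block checksum check
  kVerifyFullKeyIteration = 1 << 1,  // equivalent to paranoid_file_checks
  kVerifyFileChecksum = 1 << 2,      // planned in subsequent PRs
};

// Bitwise test for whether a particular verification type is enabled.
inline bool HasFlag(uint32_t flags, VerifyOutputFlags f) {
  return (flags & f) != 0;
}
```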
## To follow up
- File-level Checksum verification for output SST files
- Deprecate `paranoid_file_checks` option in favor of the new option
- Add to stress test / db_bench
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14103
Test Plan:
A new unit test was added. The corruption is detected both by `paranoid_file_check` and by the various types of verification set by this new option, `verify_output_flags`.
```
./compaction_service_test --gtest_filter="*CompactionServiceTest.CorruptedOutput*"
```
Reviewed By: pdillinger
Differential Revision: D86357924
Pulled By: jaykorean
fbshipit-source-id: a9e04798f249c7e977231e179622a0830d6675fe
Summary:
MultiScanUnexpectedSeekTarget() currently uses user key comparison to decide on the next data block for multiscan. This can cause a multiscan to move backward in the following scenario:
data block 1: ..., k@7, k@6
data block 2: k@5, ...
The DB iterator scans through k@7, k@6 and k@5, then decides to seek to k@0 due to the option [`max_sequential_skip_in_iterations`](d56da8c112/include/rocksdb/advanced_options.h (L621-L629)). The multiscan was on data block 2, but moves to data block 1 after the seek.
This can cause an assertion failure in debug mode and a segfault in production, since older data blocks are unpinned and freed as we advance a multiscan. This PR fixes the issue by forcing a multiscan to never go backward.
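The "never go backward" rule can be sketched as a simple clamp on the block index a reseek resolves to (names are illustrative, not the actual iterator code):

```cpp
#include <cassert>
#include <cstddef>

// When a reseek resolves to an earlier data block than the multiscan's
// current block, clamp the target to the current block: earlier blocks
// have already been unpinned and freed, so moving backward is unsafe.
size_t ClampSeekBlock(size_t cur_data_block_idx, size_t seek_resolved_idx) {
  return seek_resolved_idx < cur_data_block_idx ? cur_data_block_idx
                                                : seek_resolved_idx;
}
```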
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14106
Test Plan: - added a new unit test that reproduces the scenario: `./db_iterator_test --gtest_filter="*ReseekAcrossBlocksSameUserKey*"`
Reviewed By: xingbowang
Differential Revision: D86428845
Pulled By: cbi42
fbshipit-source-id: ab623f93e73298a60857fb2ff268366f289092a0
Summary:
This test is now taking > 6 hours, timing out, and has low signal, so creating a weekly job for it, with an explicit timeout of 12 hours.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14110
Test Plan: watch CI
Reviewed By: virajthakur
Differential Revision: D86428262
Pulled By: pdillinger
fbshipit-source-id: 44103518064ca378f3fd2ff8d21967ede698c8ea
Summary:
Adds auto-tuning of manifest file size to avoid the need to scale `max_manifest_file_size` in proportion to things like number of SST files to properly balance (a) manifest file write amp and new file creation, vs. (b) manifest file space amp and replay time, including non-incremental space usage in backups. (Manifest file write amp comes from re-writing a "live" record when the manifest file is re-created, or "compacted"; space amp is usage beyond what would be used by a compacted manifest file.) In more detail,
* Add new option `max_manifest_space_amp_pct` with default value of 500, which defaults to 0.2 write amp and up to roughly 5.0 space amp, except `max_manifest_file_size` is treated as the "minimum" size before re-creating ("compacting") the manifest file.
* `max_manifest_file_size` in a way means the same thing, with the same default of 1GB, but in a way has taken on a new role. What is the same is that we do not re-create the manifest file before reaching this size (except for DB re-open), and so users are very unlikely to see a change in default behavior (auto-tuning only kicking in if auto-tuning would exceed 1GB for effective max size for the current manifest file). The new role is as a file size lower bound before auto-tuning kicks in, to minimize churn in files considered "negligibly small." We recommend a new setting of around 1MB or even smaller like 64KB, and expect something like this to become the default soon.
* These two options along with `manifest_preallocation_size` are now mutable with SetDBOptions. The effect is nearly immediate, affecting the next write to the current manifest file.
Also in this PR:
* Refactoring of VersionSet to allow it to get (more) settings from MutableDBOptions. This touches a number of files in not very interesting ways, but notably we have to be careful about thread-safe access to MutableDBOptions fields, and even fields within VersionSet. I have decided to save copies of relevant fields from MutableDBOptions to simplify testing, etc. by not saving a reference to MutableDBOptions but getting notified of updates.
* Updated some logging in VersionSet to provide some basic data about final and compacted manifest sizes (effects of auto-tuning), making sure to avoid I/O while holding DB mutex.
* Added db_etc3_test.cc which is intended as a successor to db_test and db_test2, but having "test.cc" in its name for easier exclusion of test files when using `git grep`. Intended follow-up: rename db_test2 to db_etc2_test
* Moved+updated `ManifestRollOver` test to the new file to be closer to other manifest file rollover testing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14076
Test Plan:
As for correctness, new unit test AutoTuneManifestSize is pretty thorough. Some other unit tests updated appropriately. Manual tests in the performance section were also audited for expected behavior based on the new logging in the DB LOG. Example LOG data with -max_manifest_file_size=2048 -max_manifest_space_amp_pct=500:
```
2025/10/24-11:12:48.979472 2150678 [/version_set.cc:5927] Created manifest 5, compacted+appended from 52 to 116
2025/10/24-11:12:49.626441 2150682 [/version_set.cc:5927] Created manifest 24, compacted+appended from 2169 to 1801
2025/10/24-11:12:52.194592 2150682 [/version_set.cc:5927] Created manifest 91, compacted+appended from 10913 to 8707
2025/10/24-11:13:02.969944 2150682 [/version_set.cc:5927] Created manifest 362, compacted+appended from 52259 to 13321
2025/10/24-11:13:18.815120 2150681 [/version_set.cc:5927] Created manifest 765, compacted+appended from 80064 to 13304
2025/10/24-11:13:35.590905 2150681 [/version_set.cc:5927] Created manifest 1167, compacted+appended from 79863 to 13304
```
As you can see, it only took a few iterations of ramp-up to settle on the auto-tuned max manifest size for tracking ~122 live SST files, around 80KB and compacting down to about 13KB. (13KB * (500 + 100) / 100 = 78KB). With the default large setting for max_manifest_file_size, we end up with a 232KB manifest, which is more than 90% wasted space. (A long-running DB would be much worse.)
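The ramp-up arithmetic above can be sketched as follows (function and parameter names are assumptions; the real logic lives in VersionSet):

```cpp
#include <cassert>
#include <cstdint>

// The manifest is re-created ("compacted") once it grows past
//   max(compacted_size * (space_amp_pct + 100) / 100, min_size)
// where min_size plays max_manifest_file_size's new role of a lower
// bound before auto-tuning kicks in.
uint64_t EffectiveMaxManifestSize(uint64_t compacted_size,
                                  uint64_t space_amp_pct,
                                  uint64_t min_size) {
  uint64_t tuned = compacted_size * (space_amp_pct + 100) / 100;
  return tuned > min_size ? tuned : min_size;
}
```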
As for performance, we don't expect a difference, even with TransactionDB because actual writing of the manifest is done without holding the DB mutex. I was not able to see a performance regression using db_bench with FIFO compaction and >1000 ~10MB SST files, including settings of -max_manifest_file_size=2048 -max_manifest_space_amp_pct={500,10,0}. No "hiccups" visible with -histogram either.
I also tried seeding a 1 second delay in writing new manifest files (other than the first). This had no significant effect at -max_manifest_space_amp_pct=500 but at 100 started causing write stalls in my test. In many ways this is kind of a worst case scenario and out-of-proportion test, but gives me more confidence that a higher number like 500 is probably the best balance in general.
Reviewed By: xingbowang
Differential Revision: D85445178
Pulled By: pdillinger
fbshipit-source-id: 1e6e07e89c586762dd65c65bb7cb2b8b719513f9
Summary:
**Summary:**
This change introduces tail size estimation during SST construction to improve compaction file cutting accuracy and prevent oversized files. The BlockBasedTableBuilder now estimates the SST tail size (index and filter blocks) and uses this estimate, in addition to the data size, to determine when to cut files during compaction.
**Problem:**
Currently, file cutting logic only considers data size when determining where to cut a file, failing to reserve space for index and filter blocks that are added when the file is finalized. This often leads to SST files that exceed target file size limits.
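A minimal sketch of the improved cut decision, under assumed names:

```cpp
#include <cassert>
#include <cstdint>

// Reserve room for the estimated tail (index + filter blocks) instead of
// comparing the data size alone against the target file size.
// (Illustrative; the real logic is in BlockBasedTableBuilder.)
bool ShouldCutFile(uint64_t data_size, uint64_t estimated_tail_size,
                   uint64_t target_file_size) {
  return data_size + estimated_tail_size >= target_file_size;
}
```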
**Behavior Change:**
Implement size estimation methods for index and filter builders, and integrate these estimates into BlockBasedTableBuilder via a new EstimatedTailSize() method. This method aggregates estimates from all tail components and is used for file cutting decisions during compaction.
**Performance Considerations:**
To minimize CPU overhead, size estimates are updated when data blocks are finalized rather than on every key add. For index builders, estimates are updated when index entries are added (one per data block). For filter builders, the OnDataBlockFinalized() hook triggers estimate updates when data blocks are cut/finalized.
This approach provides:
* Minimal impact to compaction hot path (key additions)
* Near real-time estimates for file cutting decisions
* Meaningful estimate changes only when data blocks are finalized
**Usage:**
* Set true mutable cf option `compaction_use_tail_size_estimation`
to use tail size estimation for compaction file cutting decisions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14051
Test Plan:
* Assert tail size estimate is an overestimate in BlockBasedTableBuilder::Finish
* Add new test to verify compaction output file is below target file size
**Next steps:**
* Enable tail size estimation for compaction file cutting by default (and other improvements)
Reviewed By: pdillinger, cbi42
Differential Revision: D84852285
Pulled By: nmk70
fbshipit-source-id: c43cf5dbd2cb2f623a0622591ef24eee30ce0c87
Summary:
* Fix nightly build-linux-cmake-with-folly-lite-no-test for real this time
with correct include directory. (CMakeLists.txt)
* Add test runs to that build (and rename)
* Improve folly build caching with a folly.mk file with most of the relevant
parts of Makefile that contribute to the checkout_folly and
build_folly builds. This reduces the risk of false passing of CI jobs with a
cached folly build. This caching is still only for folly debug builds (which
is probably OK with just a single nightly build relying on release folly
build, which also serves as a rough canary against false passing
because of caching).
* Use `make VERBOSE=1` after cmake calls for detailed output
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14099
Test Plan:
temporary CI change to put the relevant parts in pr-jobs,
then back to homes including in nightly
Reviewed By: mszeszko-meta
Differential Revision: D86243363
Pulled By: pdillinger
fbshipit-source-id: f7975fa190ef45195c6d0b74417f7886e551516a
Summary:
... caused by public headers depending on build parameters (macro definitions). This change also adds a check under 'make check-headers' (already in CI) looking for potential future violations.
I've audited the uses of '#if' in public headers and either
* Eliminated them
* Systematically excluded them because they are intentional or similar (details in comments in check-public-header.sh)
* Manually excluded them as being ODR-SAFE
In the case of ROCKSDB_USING_THREAD_STATUS, there was no good reason for this to appear in public headers, so I've replaced it with a static bool ThreadStatus::kEnabled. I considered getting rid of the ability to disable this code, but some relatively recent PRs have been submitted for fixing that case. I've added a release note and updated one of the CI jobs to use this build configuration. (I didn't want to combine with some jobs like no_compression and status_checked because the interaction might limit what is checked.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14096
Test Plan: manual 'make check-headers' + manual cmake as in new CI config + CI
Reviewed By: jaykorean
Differential Revision: D86241864
Pulled By: pdillinger
fbshipit-source-id: d16addc9e3480706b174a006720a4def0740bf2e
Summary:
Following up on https://github.com/facebook/rocksdb/pull/14071, updating folly to
8a9fc1e80a or beyond was failing an F14Table assertion for a very subtle reason: ODR violation between the folly build and RocksDB build because folly build was release mode and RocksDB build was debug mode. What was happening was that folly change introduced a dependence on kDebug (whether build is debug) in a hashing implementation in a .h file, and the inconsistency between the inlined implementation during RocksDB build and the linked-to implementation from the folly build was leading to inconsistencies in the data structure.
The primary fix is to ensure we build folly in debug mode for debug mode RocksDB builds. Also,
* Needed to use the `patchelf` tool in `build_folly` to ensure the glog dependency shared library can always find its own gflags dependency. I explored many options for working around this, and this is what would work without reworking folly's own build.
* Updated folly to latest commit.
* Threw in an ad hoc folly patch to use ftp.gnu.org mirrors (the canonical host is super slow)
* Moved the placement of GETDEPS_USE_WGET=1 to apply to local builds also, to avoid the issue of a large download almost reaching completion and then stalling indefinitely.
* Fix failing nightly build-linux-cmake-with-folly-lite-no-test with fmt includes in cmake build (as was done with make build)
* Add a release mode folly+RocksDB to nightly CI, including both cmake and make. This also serves as a non-cached folly build to detect potential problems with PR jobs working from cached folly build.
* Move build-linux-cmake-with-folly to nightly because it's mostly covered by build-linux-cmake-with-folly-coroutines
Intended follow-up:
* folly-lite build with tests
* Make the folly build caching more friendly+accurate by hashing the relevant Makefile parts and tagging whether debug or release. Not in this PR because then you wouldn't be able to see what changed in the folly build steps themselves.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14094
Test Plan: manual + CI
Reviewed By: mszeszko-meta
Differential Revision: D85864871
Pulled By: pdillinger
fbshipit-source-id: 50009b33422d5781074fcbbdf18089be9e36800d
Summary:
Resolving this folly upgrade required fixing the FOLLY_LITE build with header include from the 'fmt' library.
I was close to timing out on fixing USE_FOLLY_LITE and removing it altogether - it could be considered obsolete and/or not worth the maintenance cost.
Follow-up: make the folly build caching more friendly by hashing the relevant makefile parts. Not in this PR because then you wouldn't be able to see what changed in the folly build steps themselves.
UPDATE/NOTE: I wasn't able to fully update to latest due to a failure seen in F14, using the next folly commit or later. The source of the bug is likely outside of F14 but investigation is in progress.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14071
Test Plan: CI
Reviewed By: jaykorean
Differential Revision: D85268833
Pulled By: pdillinger
fbshipit-source-id: 1d0a2d61f095524a20e6ec796ef46c02d0696f4e
Summary:
Change PosixWritableFile's Truncate to seek to the new end offset. This ensures that future appends are written with no holes or overwrites. RocksDB doesn't guarantee this in the FileSystem contract, and it's left up to the specific implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14088
Reviewed By: cbi42
Differential Revision: D85786398
Pulled By: anand1976
fbshipit-source-id: 3520d9d6336362f5128a17bbf396297d821a5da3
Summary:
Comprehensive performance optimizations for the RocksDB C API that eliminate unnecessary memory allocations and copies.
## Key Changes
### 1. PinnableSlice for Get Operations (50% reduction in copies)
- Changed all `rocksdb_get*` functions to use `PinnableSlice` internally instead of `std::string`
- **Before:** RocksDB → std::string → malloc'd buffer (2 copies)
- **After:** RocksDB → malloc'd buffer (1 copy)
- Affects: Get, Transaction Get, TransactionDB Get, WriteBatch Get variants
### 2. Array-Based MultiGet with PinnableSlice (30% allocation reduction)
- Switched MultiGet operations to use optimized array-based RocksDB API with `PinnableSlice`
- Eliminates vector overhead and string allocations
- Affects: MultiGet, Transaction MultiGet, TransactionDB MultiGet variants
### 3. New Zero-Copy APIs
Added high-performance zero-copy functions for applications that can use them:
- `rocksdb_iter_key_slice()` / `value_slice()` / `timestamp_slice()` - Return slices by value (eliminates output param overhead)
- `rocksdb_batched_multi_get_cf_slice()` - Batched get with slice array input
- `rocksdb_slice_t` - ABI-compatible slice type
Note that this PR builds on top of https://github.com/facebook/rocksdb/pull/13911
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14036
Reviewed By: pdillinger
Differential Revision: D85604919
Pulled By: jaykorean
fbshipit-source-id: 7f04b935eea79af1d45b3125a79b90e4706666f6
Summary:
Stress test can fail with assertion inside MultiScan in some reseek scenario. E.g., data block 1 ends with k@9, data block 2 starts with k@8, when a DB iter seeks to k@0 (see option `max_sequential_skip_in_iterations`), MultiScan will land in data block 1 due to fd0b4e0cf0/table/block_based/block_based_table_iterator.cc (L1258-L1263).
We can't just use the internal key as a separator since the index block might not use it. I plan to follow up with a fix that never moves `cur_data_block_idx` backward within a MultiScan.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14087
Test Plan: CI and internal crash tests
Reviewed By: anand1976
Differential Revision: D85701668
Pulled By: cbi42
fbshipit-source-id: d3f1aaff40a12be4e3d1b4b7160bf2547f43b849
Summary:
All remote compaction test failures had `mmap_read=1` in common. Unfortunately, the failure hasn't been very reproducible. Try disabling `mmap_read` to see if that sheds some light.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14083
Test Plan: CI
Reviewed By: hx235
Differential Revision: D85622229
Pulled By: jaykorean
fbshipit-source-id: bbe9e08efc369813f0fec388c910446089e43650
Summary:
As titled, this fixes some internal crash test failures when UDT is enabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14085
Test Plan: monitor crash tests.
Reviewed By: anand1976
Differential Revision: D85617949
Pulled By: cbi42
fbshipit-source-id: da6fb21c0ca5803ea24e8daf7de8558321babcf4
Summary:
Due to some internal requirements, what's being used for `$SSH` and `$SCP` has changed, and it broke the regression test (e.g. tarball streaming to a remote host no longer works).
Minor behavior changes to the script to make the internal workflow work.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14079
Test Plan:
```
./tools/regression_test.sh
```
Meta Internal automation
Reviewed By: pdillinger
Differential Revision: D85502798
Pulled By: jaykorean
fbshipit-source-id: d294c2ee47661fbe368ccc318062e891f3ac7c81
Summary:
The TTL-based WAL archive cleanup logic could incorrectly delete an archived WAL if the system clock moved backwards between the last write to that WAL and `WALManager::PurgeObsoleteWALFiles()`. This happened due to unsigned underflow when subtracting two wall-clock-based timestamps: `now_seconds - file_m_time`.
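The underflow and a guarded fix can be sketched as follows (illustrative, not the exact WALManager code):

```cpp
#include <cassert>
#include <cstdint>

// With unsigned wall-clock timestamps, `now_seconds - file_m_time` wraps
// to a huge value if the clock moved backwards, making a fresh WAL look
// ancient and eligible for TTL deletion. Clamping the age at zero avoids
// the wraparound.
uint64_t WalAgeSeconds(uint64_t now_seconds, uint64_t file_m_time) {
  return now_seconds >= file_m_time ? now_seconds - file_m_time : 0;
}
```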
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14016
Test Plan: unit test repro
Reviewed By: pdillinger
Differential Revision: D83879806
Pulled By: hx235
fbshipit-source-id: 643e7f623c6b5c31711565854314cfd6cbbcf3a7
Summary:
Fixed a missing CV signal when `FindObsoleteFiles()` decides there is nothing to purge and then decrements `pending_purge_obsolete_files_` to zero. This bug could cause `DB::GetSortedWalFiles()` to hang, at least.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14069
Test Plan: unit test repro
Reviewed By: hx235
Differential Revision: D85453534
Pulled By: cbi42
fbshipit-source-id: cf5cfe7f5087459ca1f1f28ce81ea6afc84178f0
Summary:
* Address feedback from https://github.com/facebook/rocksdb/issues/14040
* Add additional test for MultiScan
* Fix a bug when a range deletion and data are in the same file for multi-scan
* Rewrite the cases that need to be handled in SeekMultiScan
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14055
Test Plan: Unit test
Reviewed By: cbi42, anand1976
Differential Revision: D84851788
Pulled By: xingbowang
fbshipit-source-id: 0f69632733afb99685f6341badbf239681010c38
Summary:
Linter complains like this
```
void foo(Arg parameter_name) {}
void bar() {
Arg a;
foo(/*some_other_name=*/ a); // Wrong! Comment/parameter name mismatch
foo(/*parameter_name=*/ a); // This is OK; the names match.
}
```
```
Argument name in comment (`read_only`) does not match parameter name (`unchanging`).
```
This used to be a warning, but is now treated as an error :(
Fixing a few other linter warnings before they become errors in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14074
Test Plan: CI
Reviewed By: archang19
Differential Revision: D85370353
Pulled By: jaykorean
fbshipit-source-id: 20e96aad740d516a29c0424282674e655f99c0a2
Summary:
When a standalone range deletion file is ingested in L0, currently it is compacted with any overlapping L0 files. This is not desirable when we ingest new data on top of the range deletion file. This PR fixes the compaction picking logic to only consider L0 files older than the standalone range deletion file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14061
Test Plan: added a new unit test and updated an existing one.
Reviewed By: xingbowang
Differential Revision: D84930780
Pulled By: cbi42
fbshipit-source-id: 65f4403ccb40ba964b9e65b09e2f7f7efebe81df
Summary:
**Context/Summary:**
- Add resumable compaction to stress test with adaptive progress cancellation
- Add fault injection to remote compaction
- Fix a real minor bug and a couple of testing framework bugs with remote compaction
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14041
Test Plan: - Rehearsal stress test, finding bugs for https://github.com/facebook/rocksdb/pull/13984 effectively and did not create new failures.
Reviewed By: jaykorean
Differential Revision: D84524194
Pulled By: hx235
fbshipit-source-id: 42b4264e428c6739631ed9aa5eb02723367510bc
Summary:
With cache hits and compiler option optimization, compilation time is reduced from 40 min to 2 min. Overall build time is reduced from 60 min to under 20 minutes when the majority of source files hit the cache. On a 100% cache miss, it would be around 40 minutes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14064
Test Plan: Github CI
Reviewed By: mszeszko-meta
Differential Revision: D85023882
Pulled By: xingbowang
fbshipit-source-id: 98551880c98f14d36133ff43e6af8c3be94ab465
Summary:
Fixing a nullptr access in multiscan, under the following situation.
```
Block Based Table: blk1:[k1,k2], blk2:[k3, k8], blk3:[k9]
Scan ranges: [k1, k4), [k5,k6), [k7, k10)
Prepared block ranges: [0,2], [2,2], [1,3]
```
1. Seek key k1 on the first range, read key k1, k2.
2. Seek key k4 on the 2nd range, blocks 0,1 would be unpinned.
3. Seek key k9, block 1 would be accessed, but it is unpinned, which trigger assert failure in debug mode and nullptr access on release build.
This fix changes how blocks are unpinned. A block is now unpinned only when cur_data_block_idx has passed it.
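The revised unpinning rule can be sketched as follows (illustrative names, not the actual iterator code):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A prepared block is released only once cur_data_block_idx has moved
// past it, so a later scan range that still needs a block at or beyond
// the current index never sees it unpinned.
std::vector<bool> UnpinPassedBlocks(size_t num_blocks,
                                    size_t cur_data_block_idx) {
  std::vector<bool> pinned(num_blocks, true);
  for (size_t i = 0; i < num_blocks && i < cur_data_block_idx; ++i) {
    pinned[i] = false;  // safe to release: current index has passed block i
  }
  return pinned;
}
```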
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14062
Test Plan:
Unit Test
rand_seed 304010984 on UserDefinedIndexStressTest
Reviewed By: cbi42
Differential Revision: D84976410
Pulled By: xingbowang
fbshipit-source-id: 6b99bf85fc9d4108c5267ae77be77ccfe08923cd
Summary:
**Problem:** RocksDB was making unnecessary prefetch system calls on file systems that don't support prefetch operations, potentially leading to wasted CPU cycles.
**Fix:** Add kFSPrefetch to FSSupportedOps enum to allow file systems to indicate prefetch support capability. File systems can now opt out of prefetch calls by not setting this field.
**Backwards compatibility:** File systems that don't override SupportedOps() continue to receive prefetch calls exactly as before. Only file systems that explicitly opt out by not setting kFSPrefetch will avoid the calls.
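A hedged sketch of the capability check (bit values here are illustrative; the real definitions live in the FSSupportedOps enum of the FileSystem API):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative capability bits; actual enumerator encoding may differ.
enum FSSupportedOps : uint64_t {
  kAsyncIO = 1 << 0,
  kFSBuffer = 1 << 1,
  kFSPrefetch = 1 << 2,
};

// Prefetch system calls are issued only when the file system advertises
// prefetch support, avoiding wasted CPU cycles otherwise.
bool ShouldIssuePrefetch(uint64_t supported_ops) {
  return (supported_ops & kFSPrefetch) != 0;
}
```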
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13917
Test Plan:
- Added a new test in block_based_table_reader.
- Run existing tests: ```make prefetch_test && ./prefetch_test```
Reviewed By: anand1976
Differential Revision: D81607145
Pulled By: nmk70
fbshipit-source-id: 3bbefa05919034e8776ea4e4540cdc695cdc6d3f
Summary:
Currently we return `File is too large for PlainTableReader!` when the file size exceeds our pre-defined constant. There was a request to have the file size information logged when this error is returned.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14056
Reviewed By: nmk70
Differential Revision: D84834869
Pulled By: archang19
fbshipit-source-id: 8f332b6a31d51f320c7e2db06ad49f50798ff70e
Summary:
* Reduce build time of folly from 45m~1hr down to 25m. This is achieved by caching folly build artifact from previous build.
* Reduce the Windows build time of folly from 1hr 15m down to 50m. This is done by increasing the Windows build machine size.
* Fix the build on macOS for other macOS targets.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14057
Test Plan: github CI
Reviewed By: archang19, nmk70
Differential Revision: D84848041
Pulled By: xingbowang
fbshipit-source-id: 00306750737070e7e446ee436d607ed6ecae79ae
Summary:
We simulate remote compaction in our stress test by running a separate set of worker threads to run compactions. In reality, these remote compactions run on a different host (or at least in a different process) where we cannot share the TableFactory and BlockCache with the main DB process.
To make this simulated remote compaction closer to reality, create a new TableFactory for each remote compaction in stress test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14050
Test Plan:
```
python3 -u tools/db_crashtest.py --cleanup_cmd='' --simple blackbox --remote_compaction_worker_threads=8 --interval=10
```
Reviewed By: hx235
Differential Revision: D84775656
Pulled By: jaykorean
fbshipit-source-id: d6203fcbe0eca3539e008a19fd47b742553537ed
Summary:
We are adding more and more tests, so we need to increase the number of shards in macos build to reduce overall CI time.
The macos-15-xlarge image is ARM, which has 5 vCPU cores, but is still 50% faster than the Intel x86 image with 12 vCPUs.
Test time reduced from 1h 37m to 14m.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14048
Reviewed By: archang19
Differential Revision: D84741917
Pulled By: xingbowang
fbshipit-source-id: 9ba9bd696d3b2152f11dec2fb4280572b98233d5
Summary:
Currently in BlockBasedTableIterator's Prepare(), the index lookup for a MultiScan range is expected to return at least 1 data block (unless UDI is in use). This is because there's an implicit assumption that only ranges intersecting with the keys in the file will be prepared. This assumption, however, doesn't hold if there are range deletions and the smallest and/or largest keys in the file extend beyond the keys in the file. The LevelIterator prunes the MultiScan ranges based on the smallest/largest key, so it's possible for a range to only overlap the range deletion portion of the file and not overlap any of the data blocks. Furthermore, the BlockBasedTableIterator is now much more forgiving of Seeks to targets outside of prepared ranges after https://github.com/facebook/rocksdb/issues/14040.
Keeping the above in mind, this PR removes the check in BlockBasedTableIterator for non-empty index result. It adds assertions in LevelIterator to verify that ranges are being properly pruned. Another side effect is we can no longer rely solely on a scan range having 0 data blocks (i.e cur_scan_start_idx >= cur_scan_end_idx) to decide if the iterator is out of bound. We can only do so for all but the last range prepared range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14046
Test Plan:
1. Add unit test in db_iterator_test
2. Run crash test
Reviewed By: xingbowang
Differential Revision: D84623871
Pulled By: anand1976
fbshipit-source-id: 2418e629f92b1c46c555ddea3761140f700819e4
Summary:
The current seek key validation is too strict. This change relaxes it at the block iterator level and adds an additional check at the DB iterator level. The new contract is that when MultiScan is used, after Prepare is called, each following Seek must target the start key of the next prepared scan range, in order. Otherwise, the iterator is set with an error status.
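The contract above can be sketched as follows. This is a minimal, hypothetical illustration (class and method names are not the RocksDB API): after preparing a set of scan ranges, each Seek must match the start key of the next prepared range, in order, or the iterator enters an error state.

```python
# Hypothetical sketch of the prepared-seek contract; names are illustrative.
class PreparedScanIterator:
    def __init__(self, ranges):
        # ranges: list of (start_key, limit) tuples passed to Prepare()
        self.ranges = ranges
        self.next_idx = 0
        self.status = "OK"

    def seek(self, target):
        # A Seek is only valid if it targets the start key of the next
        # prepared range, in order; anything else is an error status.
        if (self.next_idx >= len(self.ranges)
                or target != self.ranges[self.next_idx][0]):
            self.status = ("InvalidArgument: Seek key does not match "
                           "the next prepared scan range")
            return
        self.next_idx += 1

it = PreparedScanIterator([("a", "c"), ("m", "p")])
it.seek("a")   # matches the first prepared range -> OK
it.seek("x")   # not the next prepared start key -> error status
print(it.status)
```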
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14040
Test Plan: Unit test
Reviewed By: anand1976
Differential Revision: D84292297
Pulled By: xingbowang
fbshipit-source-id: 7b31f727e67e7c0bfc53c2f9a6552e0c3d324869
Summary:
Multi scan crash/stress tests are failing when skip_stats_update_on_db_open is true, because LevelIterator::Prepare relies on these stats in FileMetaData to make decisions. Disable it in crash tests until the proper fix is ready.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14039
Reviewed By: archang19
Differential Revision: D84280059
Pulled By: anand1976
fbshipit-source-id: f9f58b94c24d1f455432b05f3bf97f25c7233e3c
Summary:
**Context/Summary:**
There is no way to tag or rate-limit write IO that occurs during FlushWAL() with a priority. Under `Options::manual_wal_flush=true`, it is the major source of write IO during user writes, so we decided to add that support. A new option struct `FlushWALOptions` is introduced to avoid making the API ugly for future new fields.
Also, we can't use the WriteOptions (https://github.com/facebook/rocksdb/blob/main/include/rocksdb/options.h#L2293-L2302) since it is associated with the particular Put/Merge/... call it was passed to, but FlushWAL() can happen after that write; there is no way to carry that write option over in RocksDB. I also avoided using the WriteOptions since it's mostly for live writes.
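The options-struct pattern described above can be sketched like this. The field names here (`sync`, `io_priority`) are assumptions for illustration, not the actual `FlushWALOptions` definition: the point is that bundling parameters in a struct lets future fields be added without changing the function signature.

```python
# Sketch of the options-struct API pattern; field names are assumptions.
from dataclasses import dataclass

@dataclass
class FlushWALOptions:
    sync: bool = False
    # Hypothetical priority tag used to rate-limit the flush IO.
    io_priority: str = "IO_TOTAL"

def flush_wal(opts=None):
    # A real implementation would pass opts.io_priority down to the rate
    # limiter when writing the WAL; here we just echo the options.
    opts = opts or FlushWALOptions()
    return f"flushing WAL (sync={opts.sync}, prio={opts.io_priority})"

print(flush_wal(FlushWALOptions(sync=True, io_priority="IO_LOW")))
```

New knobs can later be appended to the struct with defaults, keeping every existing `flush_wal` call site source-compatible.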
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14037
Test Plan: New UTs `TEST_P(DBRateLimiterOnManualWALFlushTest, ManualWALFlush)`
Reviewed By: archang19
Differential Revision: D84193522
Pulled By: hx235
fbshipit-source-id: 18feb5235672010d19a101ce52c8abdcc4a789f2
Summary:
- Include Status in RemoteCompactionResultMap in SharedState so that we can directly check the status of the remote compaction in `DbStressCompactionService::Wait()`
- If result is empty, populate the result with the status that was returned from `GetRemoteCompactionResult()` so that the status can be bubbled up to the primary (main db thread)
- Get rid of Timeout in `Wait()`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14022
Test Plan:
With fall-back
```
python3 -u tools/db_crashtest.py blackbox --remote_compaction_worker_threads=8 --remote_compaction_failure_fall_back_to_local=1
```
Without fall-back
```
python3 -u tools/db_crashtest.py blackbox --remote_compaction_worker_threads=8 --remote_compaction_failure_fall_back_to_local=0
```
Reviewed By: hx235
Differential Revision: D83789172
Pulled By: jaykorean
fbshipit-source-id: 08f710c4ece5fcc1d4b95b3f9c353831882851b7
Summary:
Fix the binutils truncated download issue by switching to wget in the folly build scripts for downloading dependencies.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14030
Test Plan: make build_folly
Reviewed By: jaykorean
Differential Revision: D84033126
Pulled By: anand1976
fbshipit-source-id: bc6706d7e57c97d6edff149a965aa12c7959825f
Summary:
MultiScan currently doesn't handle delete range properly. In this specific case, a file that contains only delete range operations has an empty index, resulting in BlockBasedTableIterator wrongly concluding from the empty result that a scan doesn't intersect the file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14026
Test Plan: Run crash test
Reviewed By: xingbowang
Differential Revision: D83881266
Pulled By: anand1976
fbshipit-source-id: dc1faa494ea23f36391b700dd1ee0430a1f20ac5
Summary:
When there is an ingested SST file that only contains delete range operations, MultiScan may return the error "Scan does not intersect with file". This is because file selection during Prepare uses the file's smallest and largest keys without considering whether there is any key in the file. This is only a temporary fix.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14028
Test Plan: Unit test
Reviewed By: anand1976
Differential Revision: D83986964
Pulled By: xingbowang
fbshipit-source-id: e0961ca854e2062c2457be4324817ba073ae785d
Summary:
Implicit reseek in the middle of an iteration is not supported with MultiScan. Avoid this for now in crash tests by setting max_sequential_skip_in_iterations to an absurdly high value.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14015
Reviewed By: xingbowang
Differential Revision: D83761612
Pulled By: anand1976
fbshipit-source-id: 16f4e856374b79170c0a79c11c275cbb0fc83a70
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14024
Fix some typos found throughout the codebase
Reviewed By: pdillinger
Differential Revision: D83789182
fbshipit-source-id: feb24d7d47a6faaf735fcfd50dd3ecce4a6c8cd5
Summary:
When inplace_update_support and memtable_verify_per_key_checksum_on_seek are enabled at the same time, they cause a data race in the memtable.
inplace_update_support allows in-place updates of key/value pairs in the memtable.
memtable_verify_per_key_checksum_on_seek performs key checksum verification during seek. It is possible that one thread is updating a key/value pair in place while another thread is reading the same pair for checksum verification during seek.
Therefore, these 2 configurations cannot be enabled at the same time.
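The incompatibility check described above can be sketched as a simple options validation performed at open time. The function and message below are illustrative, not the actual RocksDB validation code:

```python
# Minimal sketch of rejecting an incompatible option combination up front,
# rather than risking a data race between in-place updates and checksum
# reads during seek. Names and message text are assumptions.
def validate_options(inplace_update_support, verify_key_checksum_on_seek):
    if inplace_update_support and verify_key_checksum_on_seek:
        return ("InvalidArgument: inplace_update_support cannot be combined "
                "with per-key checksum verification on seek")
    return "OK"

print(validate_options(True, True))   # incompatible pair -> rejected
print(validate_options(True, False))  # either alone is fine -> OK
```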
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14023
Test Plan: local stress test run stops reporting race condition
Reviewed By: anand1976
Differential Revision: D83812322
Pulled By: xingbowang
fbshipit-source-id: 6cb9f0f3faa8deba97305bfe87266f2fe78e0501
Summary:
In RocksDB 10.6 with https://github.com/facebook/rocksdb/issues/13805, due to inaccurate testing of an async system, it went undetected at the time that LZ4 compression was using more CPU, despite a change to reuse stream objects that dramatically improved LZ4HC compression efficiency.
This change switches to using a basic LZ4 compress API which appears to be faster than all of these:
* Legacy behavior of creating LZ4_stream_t for each compression
* 10.6-10.7 behavior of re-using streams between compressions for the same file (with stream-as-WorkingArea)
* using LZ4's extState APIs without streams (with extState-as-WorkingArea) (data not shown in below results)
Also in this PR: more improvements to sst_dump --recompress, which is arguably the best SST construction benchmark right now, since db_bench seems to be so noisy due to background flush+compaction, even with no compaction (FIFO). Streamlined some output and added an SST read time test, mostly for decompression performance.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14017
Test Plan:
Performance test using sst_dump --recompress with newer sst_dump back-ported to 10.5:
```
./sst_dump --command=recompress --compression_types=kLZ4Compression
test5.sst --compression_level_from=-6 --compression_level_to=-1
```
and with default compression level.
10.5:
```
Cx level: -6 Cx size: 61608137 Write usec: 880404
Cx level: -5 Cx size: 60793749 Write usec: 840903
Cx level: -4 Cx size: 58134030 Write usec: 836365
Cx level: -3 Cx size: 55193773 Write usec: 857113
Cx level: -2 Cx size: 54013891 Write usec: 855642
Cx level: -1 Cx size: 50400393 Write usec: 865194
Cx level: 32767 Cx size: 50400393 Write usec: 886310
```
Before this change (showing the regression, more time, from 10.6):
```
Cx level: -6 Cx size: 61608137 Write usec: 933448
Cx level: -5 Cx size: 60793749 Write usec: 893826
Cx level: -4 Cx size: 58134030 Write usec: 891138
Cx level: -3 Cx size: 55193773 Write usec: 898461
Cx level: -2 Cx size: 54013891 Write usec: 897485
Cx level: -1 Cx size: 50400393 Write usec: 936970
Cx level: 32767 Cx size: 50400393 Write usec: 958764
```
After this change (faster than both the above):
```
Cx level: -6 Cx size: 63641883 Write usec: 874190
Cx level: -5 Cx size: 58860032 Write usec: 834662
Cx level: -4 Cx size: 57150188 Write usec: 832707
Cx level: -3 Cx size: 58791894 Write usec: 850305
Cx level: -2 Cx size: 53145885 Write usec: 839574
Cx level: -1 Cx size: 49809139 Write usec: 845639
Cx level: 32767 Cx size: 49809139 Write usec: 875199
```
Similar tests with dictionary compression show essentially no difference (they need to use the stream APIs, and reuse doesn't seem to matter). LZ4HC is also unaffected (still improved vs. 10.5)
Reviewed By: hx235
Differential Revision: D83722880
Pulled By: pdillinger
fbshipit-source-id: 30149dd187686d5dd98321e6aa7d74bd7653a905
Summary:
Pad block based table based on super block alignment
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13909
Test Plan:
Unit Test
No perf impact observed from the change in the inner loop of flush.
upstream/main branch 202.15 MB/s
```
for i in `seq 1 10`; do ./db_bench --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 >> /tmp/x1 2>&1; done; grep fillseq /tmp/x1 | grep -Po "\d+\.\d+ MB/s" | grep -Po "\d+\.\d+" | awk '{sum+=$1} END {print sum/NR}'
```
After the change without super block alignment 203.44 MB/s
```
for i in `seq 1 10`; do ./db_bench --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 >> /tmp/x1 2>&1; done
```
After the change with super block alignment 204.47 MB/s
```
for i in `seq 1 10`; do ./db_bench --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 --super_block_alignment_size=131072 --super_block_alignment_max_padding_size=4096 >> /tmp/x1 2>&1; done
```
Reviewed By: pdillinger
Differential Revision: D83068913
Pulled By: xingbowang
fbshipit-source-id: eecd65088ab3e9dbc7902aab8c2580f1bc8575df
Summary:
### Context/Summary:
Flow of resuming: DB::OpenAndCompact() -> Compaction progress file -> SubcompactionProgress -> CompactionJob
Flow of persistence: CompactionJob -> SubcompactionProgress -> Compaction progress file -> DB that is called with OpenAndCompact()
This PR focuses on SubcompactionProgress -> CompactionJob and CompactionJob -> SubcompactionProgress -> Compaction progress file. For now only single subcompaction is supported as OpenAndCompact() does not partition compaction anyway.
The actual triggering of progress persistence and resuming (i.e, integration) is through DB::OpenAndCompact() in the upcoming PR.
**Resume Flow**
1. input_iter->Seek(next_internal_key_to_compact) // Position iterator
2. ReadTableProperties() // Validate existing outputs
3. RestoreCompactionOutputs() in CompactionOutputs // Rebuild output file metadata
4. Restore critical statistics about processed input and output records count for verification later
5. AdvanceFileNumbers() // Prevent file number conflicts
6. Continue normal compaction from the positioned iterator, fall back to not resuming the compaction in limited cases, or fail the compaction entirely
**Persistence Strategy**
1. When: At each SST file completion (FinishCompactionOutputFile()). This is the simplest but most expensive frequency. See below for benchmarking and potential follow-up items
2. What: Serialize, write and sync the in-memory SubcompactionProgress to a dedicated manifest-like file
3. For simplicity: Only persist at "clean" boundaries (no overlapping user keys, no range deletions, no timestamp for now)
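The persistence and resume flows above can be sketched as follows. This is a toy model under assumed names (the real progress file is a manifest-like binary format, not JSON): progress is serialized and synced at each output file completion, so a resumed compaction can pick up past what is already done.

```python
# Toy sketch of persist-at-file-completion and resume; field names are
# assumptions based on the description above, not the RocksDB format.
import json
import os
import tempfile

def persist_progress(path, next_key, output_files, processed_records):
    progress = {
        "next_internal_key_to_compact": next_key,
        "output_files": output_files,
        "num_processed_input_records": processed_records,
    }
    # Write then fsync, mimicking the "write and sync" durability step.
    with open(path, "w") as f:
        json.dump(progress, f)
        f.flush()
        os.fsync(f.fileno())

def resume(path):
    # On resume, the saved progress tells us where to Seek the input
    # iterator and which outputs already exist.
    with open(path) as f:
        return json.load(f)

path = os.path.join(tempfile.mkdtemp(), "compaction_progress")
persist_progress(path, "key042", ["000015.sst"], 1000)
print(resume(path)["next_internal_key_to_compact"])
```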
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13983
Test Plan:
- New unit test in CompactionJob level to cover basic compaction progress resumption
- Existing UTs and stress/crash test to test no correctness regression to existing compaction code
- Run benchmark to ensure no performance regression to existing compaction code
```
./db_bench --benchmarks=fillseq[-X10] --db=$db --disable_auto_compactions=true --num=100000 --value_size=25000 --compression_type=none --target_file_size_base=268435456 --write_buffer_size=268435456
```
Pre-PR:
fillseq [AVG 10 runs] : 45127 (± 799) ops/sec; 1076.6 (± 19.1) MB/sec
fillseq [MEDIAN 10 runs] : 45375 ops/sec; 1082.5 MB/sec
Post-PR (regressed 0.057%, ignorable)
fillseq [AVG 10 runs] : 45101 (± 920) ops/sec; 1076.0 (± 22.0) MB/sec
fillseq [MEDIAN 10 runs] : 45385 ops/sec; 1082.8 MB/sec
Reviewed By: jaykorean
Differential Revision: D82889188
Pulled By: hx235
fbshipit-source-id: 8553fd478f134969d331af2c5a125b94bd747268
Summary:
This method will be used to improve the compaction logic by accounting for the tail size, in addition to the data size, when determining when to cut a file.
Problem: Currently the file cutting logic only considers data size when determining where to cut a file, failing to reserve space for index and filter blocks that are added when the file is finalized.
Key changes:
- Add EstimateCurrentIndexSize() to IndexBuilder interface
- Implement in ShortenedIndexBuilder with buffer that accounts for the next index entry. The buffer addresses under-estimation where the current index size doesn't account for the next index entry associated with the data block currently being built. The 2x multiplier bounds the estimate in the right direction and handles outlier cases with large keys.
- Add num_index_entries_ member to track added index entries (== data blocks emitted). This is thread-safe since it's updated/read in the serialized emit step.
Next steps:
- Partitioned index size estimation implementation
- Update compaction file cutting logic to consider index size estimation
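The estimation idea with its 2x buffer can be sketched like this. The formula is an assumption reconstructed from the description above, not the exact RocksDB implementation:

```python
# Rough sketch of index size estimation with an allowance for the pending
# entry of the data block currently being built; the 2x multiplier biases
# the estimate upward to cover outlier cases with large keys.
def estimate_current_index_size(accumulated_entry_bytes, last_entry_bytes):
    return accumulated_entry_bytes + 2 * last_entry_bytes

# 4096 bytes of entries so far, last entry was 32 bytes -> reserve 64 more.
print(estimate_current_index_size(4096, 32))  # -> 4160
```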
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14010
Test Plan: Added a new test class with unit tests for new builder size estimation across all IndexBuilder implementations.
Reviewed By: pdillinger
Differential Revision: D83501741
Pulled By: nmk70
fbshipit-source-id: d58fc2a9e92e12a162f6244d4abd707a9c9e1885
Summary:
This PR fixes a bug in how MultiScan handled a scan range limit falling in the key range between files. The bug was in LevelIterator, where Prepare() relied on FindFile to determine the lower bound file for the range limit. FindFile returns the smallest file index with `range.limit < file.largest_key`. However, that doesn't guarantee that the range overlaps the file, as the `range.limit` could be smaller than `file.smallest_key`.
This also fixes a bug in BlockBasedTableIterator of Valid() returning true even if status() returned error. This was exposed by the previous bug.
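The FindFile pitfall described above can be demonstrated with `bisect` standing in for the binary search: the smallest file whose largest key exceeds the range limit may still not overlap the range, so the file's smallest key must be checked as well.

```python
# Illustration of the lower-bound pitfall; this is a stand-in sketch, not
# the RocksDB FindFile code (which compares internal keys).
import bisect

files = [("a", "c"), ("j", "m")]  # (smallest_key, largest_key) per file
limit = "f"                       # scan range limit between the two files

# First index with file.largest_key > limit, like FindFile's lower bound.
idx = bisect.bisect_right([f[1] for f in files], limit)

# The found file only truly overlaps if the limit reaches its smallest key.
overlaps = idx < len(files) and limit >= files[idx][0]
print(idx, overlaps)  # idx points at ("j", "m"), but "f" < "j": no overlap
```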
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14011
Test Plan: Add unit tests in db_iterator_test and table_test
Reviewed By: cbi42
Differential Revision: D83496439
Pulled By: anand1976
fbshipit-source-id: a9d2d138d69d0c816d9f4160a984b273d00d683f
Summary:
Pretty self-explanatory from the changes, including re-arranging the "COOL" entries for easier tracking of which values are used.
I'm not touching the TICKER_ENUM_MAX issue because IIRC we've gotten in trouble in the past for changing any Java ticker values.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14012
Test Plan: CI, sufficient prompts to get AI to discover the known issues relayed by hx235, to help ensure we found any other outstanding issues.
Reviewed By: hx235
Differential Revision: D83497503
Pulled By: pdillinger
fbshipit-source-id: ec0bd7e28188e0430fb03fc5bd79c2ed7b28f3ad
Summary:
Pass the comparator to UDI interface for both reader and builder.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14001
Test Plan: Unit test
Reviewed By: anand1976
Differential Revision: D83339943
Pulled By: xingbowang
fbshipit-source-id: 7f6541776b0995260e28224329f0cca37f13b3d4
Summary:
Currently BlockBasedTableIterator::Prepare() fails the iterator with a non-OK status if an out-of-range scan option is detected. This is due to the interaction between LevelIterator and BlockBasedTableIterator; see the added comment above BlockBasedTableIterator::Prepare(). This can fail the stress test for L0 files, since they don't use LevelIterator and scan options are not pruned. This PR fixes this by adding an internal option to MultiScanArgs that enables this check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13995
Test Plan:
- new unit test
- stress test that fails before this pr: `python3 -u ./tools/db_crashtest.py whitebox --iterpercent=60 --prefix_size=-1 --prefixpercent=0 --readpercent=0 --test_batches_snapshots=0 --use_multiscan=1 --read_fault_one_in=0 --kill_random_test=88888 --interval=60 --multiscan_use_async_io=0 --mmap_read=0 --level0_file_num_compaction_trigger=20`
Reviewed By: anand1976
Differential Revision: D83166088
Pulled By: cbi42
fbshipit-source-id: 241a7d43c8c00d9a98eea0cabb03d2174d51aae5
Summary:
There can be concurrent reads/writes to fields in `IODebugContext`. One example we have seen is for the `cost_info` field which is of type `std::any`. In fact, in RocksDB's async MultiRead implementation, the same `IODebugContext` is re-used across separate async read requests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13993
Test Plan: Update code which reads/writes to `cost_data` to first acquire shared/exclusive lock on the `mutex` field. There should not be any race conditions when async MultiRead is used.
Reviewed By: pdillinger
Differential Revision: D83091423
Pulled By: archang19
fbshipit-source-id: 4db86d33cf162ed39114b1cd115fcd8964c8ff9b
Summary:
Remove the restriction of only using BytewiseComparator(). In a follow on PR, the UDI interface will be updated to take the Comparator as a parameter.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13999
Test Plan: Add a unit test in table_test.cc
Reviewed By: cbi42
Differential Revision: D83179747
Pulled By: anand1976
fbshipit-source-id: 60222533c71022aa0701ac61c39268d36ca86338
Summary:
In https://github.com/facebook/rocksdb/issues/13964 I changed an expensive DEBUG check in ~AutoHyperClockTable to only run in ASAN builds. It's still expensive so I'm modifying it to scan only about one page beyond what we expect to have written to the anonymous mmap, rather than scanning the whole thing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13998
Test Plan: manually checked that lru_cache_test running time went from 5.0s to 4.0s after the change. Verified that existing unit test ClockCacheTest.Limits uses the full anonymous mmap to be sure it is sized as expected, by temporarily breaking AutoHyperClockTable::Grow() to allow slightly exceeding the anonymous mmap size.
Reviewed By: cbi42
Differential Revision: D83178493
Pulled By: pdillinger
fbshipit-source-id: a2bf093e98bf68b540c073800be7e193021f2692
Summary:
This combination causes MultiScan iteration to fail due to internal reseek by the iterator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13992
Reviewed By: cbi42
Differential Revision: D83094631
Pulled By: anand1976
fbshipit-source-id: 96410747d88de391e6d65857d39063d4fb113d65
Summary:
Fix a bug in "Improve random seed override support in stress test".
The Bug:
`parser.parse_known_args()` is used to parse command line arguments. When it is called without an argument, it uses sys.argv as input. The first element of sys.argv is the command itself, so parse_known_args skips it. Meanwhile, the return value `remain_argv` of `parser.parse_known_args()` does not contain the command. So when `remain_argv` replaces `sys.argv`, its first element is treated as the command and skipped by `parser.parse_known_args()`. In the internal stress test tool, the first argument is `--stress_cmd`, so it was skipped and the default value `./db_stress` was used instead. This is why `./db_stress` showed up in the error message, and also why it works locally, as db_stress is located in the local folder.
The Fix:
When `parser.parse_known_args()` is called the first time, `remain_argv` is saved as a global variable and passed explicitly in the second call, `parser.parse_known_args(remain_argv)`. When arguments are passed to `parser.parse_known_args` directly, the first element is not skipped.
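The pitfall can be reproduced standalone with argparse. The flag values below are made up for illustration; the key point is that the leftover list returned by `parse_known_args` has no program-name placeholder, so treating it like sys.argv drops the first real flag.

```python
# Standalone reproduction of the parse_known_args pitfall described above.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--simple", action="store_true")

# sys.argv-style input: element 0 is the program name and gets skipped.
argv = ["db_crashtest.py", "--stress_cmd=/path/db_stress", "--simple"]
_, remain = parser.parse_known_args(argv[1:])
print(remain)  # the unknown --stress_cmd flag survives the first parse

# The bug: if `remain` replaced sys.argv and parse_known_args() ran again
# with no argument, it would skip remain[0] as if it were the program name.
_, remain2 = parser.parse_known_args(remain[1:])  # simulates that skip
print(remain2)  # the --stress_cmd flag has been silently lost
```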
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13991
Test Plan:
The value of the first argument `--stress_cmd` is parsed correctly and shows up in the error message.
```
/usr/local/bin/python3 -u tools/db_crashtest.py --stress_cmd=/data/sandcastle/boxes/trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/d7db8b24dd42e2db/internal_repo_rocksdb/repo/__db_stress__/db_stress --cleanup_cmd='' --simple blackbox --print_stderr_separately
Start with random seed 11107847853133580500
Running blackbox-crash-test with
interval_between_crash=120
total-duration=6000
Use random seed for iteration 8577470137673434540
Traceback (most recent call last):
File "/home/xbw/workspace/ws1/rocksdb/tools/db_crashtest.py", line 1650, in <module>
main()
File "/home/xbw/workspace/ws1/rocksdb/tools/db_crashtest.py", line 1639, in main
blackbox_crash_main(args, unknown_args)
File "/home/xbw/workspace/ws1/rocksdb/tools/db_crashtest.py", line 1358, in blackbox_crash_main
hit_timeout, retcode, outs, errs = execute_cmd(cmd, cmd_params["interval"])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/xbw/workspace/ws1/rocksdb/tools/db_crashtest.py", line 1294, in execute_cmd
child = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/fbcode/platform010/lib/python3.12/subprocess.py", line 1028, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/local/fbcode/platform010/lib/python3.12/subprocess.py", line 1957, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/data/sandcastle/boxes/trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/d7db8b24dd42e2db/internal_repo_rocksdb/repo/__db_stress__/db_stress'
```
Reviewed By: hx235
Differential Revision: D83068960
Pulled By: xingbowang
fbshipit-source-id: 28334d38a444c6f8525444e15f460ec6b257ef38
Summary:
Return a failure status for multi scan if Prepare fails, or if the scan options are unsupported, instead of falling back on a regular scan. This PR also fixes a bug in LevelIterator that caused max_prefetch_size to be ignored.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13974
Test Plan: Add new test in db_iterator_test and table_test
Reviewed By: xingbowang
Differential Revision: D82843944
Pulled By: anand1976
fbshipit-source-id: f12756c40ebd38d8d4e4425e97438b6e766a4663
Summary:
**Context/Summary**
This reverts commit 73432a3f36, because it mysteriously fails our internal CI running with this change to db_crashtest.py. The root cause is unknown, but the error repros frequently with this commit and not with the one before it. The error message suggests that the command parsing leads to the db_stress binary not being found:
```
Traceback (most recent call last):
File "/data/sandcastle/boxes/trunk-hg-full-fbsource/fbcode/internal_repo_rocksdb/repo/tools/db_crashtest.py", line 1638, in <module>
main()
File "/data/sandcastle/boxes/trunk-hg-full-fbsource/fbcode/internal_repo_rocksdb/repo/tools/db_crashtest.py", line 1627, in main
blackbox_crash_main(args, unknown_args)
File "/data/sandcastle/boxes/trunk-hg-full-fbsource/fbcode/internal_repo_rocksdb/repo/tools/db_crashtest.py", line 1347, in blackbox_crash_main
hit_timeout, retcode, outs, errs = execute_cmd(cmd, cmd_params["interval"])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/sandcastle/boxes/trunk-hg-full-fbsource/fbcode/internal_repo_rocksdb/repo/tools/db_crashtest.py", line 1283, in execute_cmd
child = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/fbcode/platform010/lib/python3.12/subprocess.py", line 1028, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/local/fbcode/platform010/lib/python3.12/subprocess.py", line 1957, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: './db_stress'
```
**Test plan**
- Rehearsal crash test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13989
Reviewed By: xingbowang
Differential Revision: D83010751
Pulled By: hx235
fbshipit-source-id: d8cfc70564074065b6bb8a3986d6c1011064dd5e
Summary:
This is causing some internal failure, we decide to revert this for now until we have a proper fix.
This reverts commit 961880b458.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13987
Reviewed By: anand1976
Differential Revision: D82990294
Pulled By: cbi42
fbshipit-source-id: 5f5b4d18d0afe47599738d27e11e3eb2d08d88a0
Summary:
**Context**
Resuming compaction is designed to periodically record the progress of an ongoing compaction and can resume from that saved progress after interruptions such as cancellation, database shutdown, or crashes.
This PR introduces the data structures needed to store subcompaction progress in memory, along with serialization and deserialization support to persist and parse this progress to/from "a manifest-like compaction progress file" (the actual creation of such file is in upcoming PRs).
Flow of resuming: DB::OpenAndCompact() -> Compaction progress file -> SubcompactionProgress -> CompactionJob
Flow of persistence: CompactionJob -> SubcompactionProgress -> Compaction progress file -> DB that is called with OpenAndCompact()
**Summary**
Progress represented by `SubcompactionProgress` will be tracked at the scope of a subcompaction, which is the smallest independent unit of compaction work.
The frequency of recording this progress is once every N compaction output files (to be detailed in future PRs).
When recording, all fields, except for the output files metadata in `SubcompactionProgress`, will directly overwrite the corresponding fields from the last saved progress (See `SubcompactionProgress` and `SubcompactionProgressBuilder` for more).
As a bonus, this PR refactors the file metadata encoding and decoding utilities into two static helper functions, EncodeToNewFile4() and DecodeNewFile4From(), to support subcompaction progress usage.
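The overwrite-vs-accumulate semantics described above can be sketched as a toy builder. The field names are assumptions for illustration, not the actual `SubcompactionProgressBuilder` API:

```python
# Toy sketch: scalar progress fields overwrite the last saved values, while
# output file metadata accumulates across progress records.
class SubcompactionProgressBuilder:
    def __init__(self):
        self.progress = {"next_key": None, "output_files": []}

    def apply(self, record):
        for k, v in record.items():
            if k == "output_files":
                self.progress["output_files"].extend(v)  # accumulate
            else:
                self.progress[k] = v  # overwrite prior value
        return self

b = SubcompactionProgressBuilder()
b.apply({"next_key": "k1", "output_files": ["01.sst"]})
b.apply({"next_key": "k2", "output_files": ["02.sst"]})
print(b.progress)  # next_key overwritten to "k2"; both files retained
```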
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13928
Test Plan:
- Added various `SubcompactionProgressTest` unit tests in version_edit_test.cc to verify basic serialization/deserialization and forward compatibility handling
- Existing UTs and stress/crash test
**Follow up:**
- Move output entry number and file verification to after each file creation so we can remove kNumProcessedOutputRecords persistence support and make resuming compaction work with `paranoid_file_checks=true` (by default false). Output verification will be done before persistence of progress. As long as this follow-up is done before the landing of the integration PR to create the progress file, we can change the manifest-like compaction progress file format freely.
Reviewed By: jaykorean
Differential Revision: D81986583
Pulled By: hx235
fbshipit-source-id: b42766da7d9c2e2f596c892d050c753238d1039f
Summary:
For MultiScan and UDI we now use the bound check from the index iterator, so this assert is removed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13988
Test Plan: existing test
Reviewed By: hx235
Differential Revision: D82993180
Pulled By: cbi42
fbshipit-source-id: 442b2e83cb3aef96fc1a825bf733af9ce59c21c1
Summary:
It is useful to be able to specify output temperatures in the CompactFiles API. For example it may be useful to store small L0 files produced by flushes locally, while larger intra-L0 compactions can store the compacted L0 file remotely.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13955
Test Plan: New unit tests
Reviewed By: jaykorean
Differential Revision: D82492503
Pulled By: joshkang97
fbshipit-source-id: e1225fe572a15d7c5c30a265762b048a4a9e7f0b
Summary:
- updated release note
- updated version to 10.8 in version.h
- added 10.7 to check_format_compatible.sh
- did not update the folly commit hash due to a build failure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13980
Reviewed By: xingbowang
Differential Revision: D82882035
Pulled By: cbi42
fbshipit-source-id: b5e0e78570fdd492d592ee77bd3901e4b39c25fb
Summary:
The test did not consider the ingestion_option settings, which can result in a different error message. This PR fixes the relevant check and ensures we have enough randomness in this test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13979
Test Plan: `gtest-parallel --repeat=20 --workers=20 ./external_sst_file_test --gtest_filter="*VaryingOptions/IngestDBGeneratedFileTest2.NonZeroSeqno/*"`
Reviewed By: hx235
Differential Revision: D82873439
Pulled By: cbi42
fbshipit-source-id: b0d74bf26a502ca3db59b4a0ea9717bf7d027400
Summary:
Start the process of migrating the HCC implementation over to my new system of "bit field atomics" to clean up the code. Here I took on the simplest of the three "bit field atomic" formats in HCC, but ended up moving some things around to end up with less plumbing of definitions and values overall.
In the process, updated BitFields to use the CRTP pattern to simplify some things (see updated example, etc.)
https://en.wikipedia.org/wiki/Curiously_recurring_template_pattern
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13965
Test Plan: existing tests. ClockCacheTest.ClockEvictionEffortCapTest caught a regression during my development, and the crash test has a history of finding subtle HCC bugs.
Reviewed By: xingbowang
Differential Revision: D82669582
Pulled By: pdillinger
fbshipit-source-id: b73dd47361cbe9fbd334413dd4ce01b3c667159e
Summary:
Long wanted, e.g. for easy tab-completion; now implemented.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13978
Test Plan: pretty good unit test updates, manual testing
Reviewed By: cbi42
Differential Revision: D82857671
Pulled By: pdillinger
fbshipit-source-id: d2b63b7d15e61ebf22c58a6ecd3003311e2d03cb
Summary:
* There was a bug where the compression manager would actually not be used for recompress because the options passed to SstFileDumper were not respected. That is now fixed by respecting the Options.
* Refactored SstFileDumper not to take explicit options that could naturally be embedded in Options.
* Report compressed and uncompressed data block sizes (and ratio) instead of total file size (without a useful ratio). Needed to add a new table property to support that.
* Allow --block_size instead of --set_block_size to be consistent with other tools
* Allow --compression_level as shorthand for both _from and _to options, for simplicity and consistency with other tools
* Support --compression_parallel_threads option
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13977
Test Plan:
* sst_dump manual testing
* TableProperties unit tests updated
* Made it much easier to detect when a functional change requires an update to ParseTablePropertiesString() (rather than causing cryptic downstream failures)
Reviewed By: cbi42
Differential Revision: D82841412
Pulled By: pdillinger
fbshipit-source-id: 8d3421be4d2a3e25b7590cd59d204a3779c2a928
Summary:
Currently in MultiScan we only unpin a block after we scan through it. This PR adds unpinning during Seek to release all blocks pinned by the previous scan range. This is useful when users do not scan through the entire scan range. I plan to follow up with support for aborting async IOs from the previous scan.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13972
Test Plan: new test MultiScanUnpinPreviousBlocks validates unpinning behavior
Reviewed By: xingbowang
Differential Revision: D82779504
Pulled By: cbi42
fbshipit-source-id: 17ba7d1e5a6d8ff09ceea57b79c18febfba75584
Summary:
This change adds FFI support for exporting column family checkpoints, basic access to the export/import files metadata, and creating column families by import.
I've been able to successfully use this to [add checkpoint export and import support to `rust-rocksdb`](https://github.com/pcholakov/rust-rocksdb/pull/2), a forked version of which has been successfully used in production for some time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13874
Reviewed By: hx235
Differential Revision: D82343565
Pulled By: jaykorean
fbshipit-source-id: fb4182bdfd5cce10743c021a1ac636fd6ac48df3
Summary:
If there's a static initialization of Options() this could now instantiate an AutoHyperClockTable before kPageSize is initialized. Break the dependency because it's a very minor optimization.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13973
Test Plan: internal CI (not able to reproduce locally)
Reviewed By: hx235
Differential Revision: D82789849
Pulled By: pdillinger
fbshipit-source-id: 3f32b5779a4f56d2071be5aadacda2bf0f4b895d
Summary:
Add a new CF immutable option `paranoid_memory_check_key_checksum_on_seek` that allows additional data integrity validation during seek on the SkipList memtable. When this option is enabled and memtable_protection_bytes_per_key is non-zero, the skiplist-based memtable will validate the checksum of each key visited during a seek operation. The option is opt-in due to performance overhead. This is an enhancement on top of the paranoid_memory_checks option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13902
Test Plan:
* new unit test added for paranoid_memory_check_key_checksum_on_seek=true.
* existing unit test for paranoid_memory_check_key_checksum_on_seek=false.
* enable in stress test.
Performance Benchmark: we check for performance regression in the read path where data is in the memtable only. For each benchmark, the script was run at the same time for main and this PR:
### Memtable-only randomread ops/sec:
* Value size = 100 Bytes
```
for B in 0 1 2 4 8; do (for I in $(seq 1 50);do ./db_bench --benchmarks=fillseq,readrandom --write_buffer_size=268435456 --writes=250000 --value_size=100 --num=250000 --reads=500000 --seed=1723056275 --paranoid_memory_check_key_checksum_on_seek=true --memtable_protection_bytes_per_key=$B 2>&1 | grep "readrandom"; done;) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; done;
```
1. Main: 928999
2. PR with paranoid_memory_check_key_checksum_on_seek=false: 930993 (+0.2%)
3. PR with paranoid_memory_check_key_checksum_on_seek=true:
3.1 memtable_protection_bytes_per_key=1: 464577 (-50%)
3.2 memtable_protection_bytes_per_key=2: 470319 (-49%)
3.3 memtable_protection_bytes_per_key=4: 468457 (-50%)
3.4 memtable_protection_bytes_per_key=8: 465061 (-50%)
* Value size = 1000 Bytes
```
for B in 0 1 2 4 8; do (for I in $(seq 1 50);do ./db_bench --benchmarks=fillseq,readrandom --write_buffer_size=268435456 --writes=250000 --value_size=1000 --num=250000 --reads=500000 --seed=1723056275 --paranoid_memory_check_key_checksum_on_seek=true --memtable_protection_bytes_per_key=$B 2>&1 | grep "readrandom"; done;) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; done;
```
1. Main: 601321
2. PR with paranoid_memory_check_key_checksum_on_seek=false: 607885 (+1.1%)
3. PR with paranoid_memory_check_key_checksum_on_seek=true:
3.1 memtable_protection_bytes_per_key=1: 185742 (-69%)
3.2 memtable_protection_bytes_per_key=2: 177167 (-71%)
3.3 memtable_protection_bytes_per_key=4: 185908 (-69%)
3.4 memtable_protection_bytes_per_key=8: 183639 (-69%)
Reviewed By: pdillinger
Differential Revision: D81199245
Pulled By: xingbowang
fbshipit-source-id: e3c29552ab92f2c5f360361366a293fa26934913
Summary:
Force caller of MultiScanArgs to pass comparator. Pass comparator from CF handle to MultiScanArgs in NewMultiScan.
Expand MultiScanArgs unit test with different comparator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13970
Test Plan: unit test
Reviewed By: cbi42
Differential Revision: D82739270
Pulled By: xingbowang
fbshipit-source-id: e709f4a333ad547c0ba6d24d8fb2b22e50e8a12f
Summary:
**Context/Summary:**
`Status::state` can be nullptr when created with no specific error message. Calling std::strstr on nullptr caused a segfault in our stress test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13968
Test Plan: Monitor stress test
Reviewed By: jaykorean
Differential Revision: D82695541
Pulled By: hx235
fbshipit-source-id: cf08f70163a9ee6c911cdc3a3d79acd3429f0d15
Summary:
After seeing more people hit issues with thrashing small LRUCache shards and AutoHCC running fully in production for a while on a very large service, here I make these updates:
* In the public API, mark the case of `estimated_entry_charge = 0` (which is how you select AutoHCC) as production-ready and generally preferred. That means devoting a lot less space to how to tune FixedHCC (`estimated_entry_charge > 0`) because it is not generally recommended anymore even though in theory it is the fastest (conditional on a fragile configuration).
* In the public API, add more detail about potential problems with LRUCache and explicitly endorse HCC.
* When a default block cache is created, use AutoHCC instead of LRUCache. It's still a 32MB cache but that's just one cache shard for AutoHCC so the risk of issues with small cache shards is dramatically reduced. And a single AutoHCC shard is still essentially wait-free.
* Improve the handling of the hypothetical scenario of a failed anonymous mmap. This is hardly a concern for 64-bit Linux and likely most other OSes. It would in theory be possible to fall back on LRUCache in that case but the code structure makes that annoying/challenging. Instead we crash with an appropriate message.
* Cleaned up some includes
* Fixed some previously unreported leaks (better assertions on HCC perhaps, some subtle behavior changes)
* Added a new mode to cache_bench (detailed below)
* Avoid a particularly costly sanity check in `~AutoHyperClockTable()` even in debug builds so that unit testing, etc., isn't bogged down, except keep it in ASAN build.
Planned follow-up:
* Update HCC implementation to use my new "bit field atomics" API introduced in https://github.com/facebook/rocksdb/issues/13910 to make it easier to read and maintain
Possible follow-up:
* Re-engineer table cache to use AutoHCC also, instead of LRUCache and a single mutex to ensure no duplication across threads. (a) Pad table cache key to 128 bits for AutoHCC. (b) Stripe/shard the no-duplication mutex. (HCC's consistency model is too weak for concurrent threads to use its API to agree on a winner, even if entries could be inserted in an "open in progress" state.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13964
Test Plan:
existing tests. ClockCacheTest.ClockEvictionEffortCapTest caught a regression during my development, and the crash test has a history of finding subtle HCC bugs.
## Performance
Although we've validated AutoHCC performance under high load, etc., before we haven't really considered whether there will be unacceptable overheads for small DBs and CFs, e.g. in unit tests. For this, I have added a new mode to cache_bench: with the -stress_cache_instances=n parameter, it will create and destroy n empty cache instances several times. In the debug build, this found that a particular check in `~AutoHyperClockTable()` was extremely costly for short-lived caches (fixed). Beyond that, we can answer the question of whether it is feasible for a single process to host 1000 DBs each with 1000 CFs with default block cache instances, after moving LRUCache -> AutoHCC, for example:
```
/usr/bin/time ./cache_bench -stress_cache_instances=1000000 -cache_type=auto_hyper_clock_cache -cache_size=33554432
```
Release build:
Average 9.8 us per 32MB LRUCache creation, 2.9 us per destruction, 24.6GB max RSS (~25KB each)
->
Average 4.3 us per 32MB AutoHCC creation, 4.9 us per destruction, 4.8GB max RSS (~5KB each)
Debug build:
Average 10.9 us per 32MB LRUCache creation, 3.5 us per destruction, 28.7GB max RSS (~29KB each)
->
Average 4.5 us per 32MB AutoHCC creation, 4.9 us per destruction, 4.7GB max RSS (~5KB each)
Despite the anonymous mmaps, it's apparently more efficient for default/small/empty structures. This is likely due to the dramatically lower number of cache shards at this size. If we switch to `-stress_cache_instances=10000 -cache_size=1073741824`:
Release build:
Average 10.6 us per 1GB LRUCache, 2.8 us per destruction, 2.3 GB max RSS (~230KB each)
->
Average 130 us per 1GB AutoHCC creation, 153 us per destruction, 1.5 GB max RSS (~150KB each)
Debug build:
Average 11.2 us per 1GB LRUCache, 3.6 us per destruction, 2.4 GB max RSS (~240KB each)
->
Average 130 us per 1GB AutoHCC creation, 150 us per destruction, 1.6 GB max RSS (~160KB each)
Here it's clear that we are paying a price in time for setting up all those mmaps for the larger number of cache shards and potential table growth, even though the RSS is well under control. However, I am not concerned about this at all, as it's unlikely to slow down anything notably such as unit tests. Before and after full testsuite runs confirm:
3327.73user 5188.71system 3:38.88elapsed -> 3312.07user 5704.77system 3:41.61elapsed
There is increased kernel time but acceptable. With ASAN+UBSAN:
11618.70user 15671.30system 5:54.68elapsed -> 12595.81user 16159.67system 6:32.77elapsed
Acceptable given that our ASAN+UBSAN builds are not the slowest in CI
Reviewed By: hx235
Differential Revision: D82661067
Pulled By: pdillinger
fbshipit-source-id: ab25c766ca70f2b8664849c2a838b9e1b4e72d3b
Summary:
When ingesting a DB-generated file with a non-zero sequence number, we need the smallest seqno of each file for its file metadata. To avoid a full table scan, we record this information in a table property and use it during file ingestion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13942
Test Plan: new unit test and updated existing unit test.
Reviewed By: hx235
Differential Revision: D82331802
Pulled By: cbi42
fbshipit-source-id: 3009a6801ca7092cd0fde33692db1a13567068a9
Summary:
This PR fixes a bug in BlockBasedTableIterator::Prepare in conjunction with a user defined index (UDI). If the UDI determines a scan range to be empty and thus returns the kOutOfBound iteration result during Seek, the iteration result is not propagated up and Prepare() assumes end of file and aborts the remaining scans. This results in incorrect behavior and unpredictable multi scan results.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13960
Test Plan: Add unit test to table_test.cc
Reviewed By: xingbowang
Differential Revision: D82590892
Pulled By: anand1976
fbshipit-source-id: 8cfaaae2bb1a9509ddf8ec967cb8a8801748413d
Summary:
* Fix compaction/flush CPU usage stats to include CPU usage by parallel compression workers. (Validated with manual db_bench testing.)
* Disable the parallel compression framework when compression is disabled. See new code comment for details, because in theory it could be useful to hide SST write latency, but manual testing with db_bench and -rate_limiter_bytes_per_sec or -simulate_hdd options shows no useful increase in throughput, just more CPU usage.
* Fix some minor clean-up items in the implementation
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13959
Test Plan: Also ran some tests like in https://github.com/facebook/rocksdb/issues/13910 to ensure the new CPU usage tracking did not regress performance, all good.
Reviewed By: xingbowang
Differential Revision: D82556686
Pulled By: pdillinger
fbshipit-source-id: 77c522159a7e6ab0ab6f7fb1d662070a46661557
Summary:
The stress test runs concurrent transactions through many threads at the same time on a shared key space. It is possible that a deadlock or a timeout is detected at the TransactionDB layer. When this happens, simply return from the function and continue the test, instead of failing the test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13950
Test Plan: Stress test pass locally with the same random seed from stress test 14723229280871643749.
Reviewed By: hx235
Differential Revision: D82373959
Pulled By: xingbowang
fbshipit-source-id: 5d72e89998171c5844fb22f13d8f061f81014c7d
Summary:
... reporting false positive double-lock on some of the new parallel compression code. Switching from std::condition_variable to condition_variable_any simply changes the FP from double-lock to lock inversion. In addition, leaking ParallelCompressionRep instances to avoid memory location reuse fails to fix the FP reports. Thus, I've decided to disable the watchdog with GCC+TSAN.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13958
Test Plan: local crash test runs could reproduce, now don't reproduce. CLANG TSAN doesn't seem to be reporting the same supposed issues
Reviewed By: xingbowang
Differential Revision: D82555968
Pulled By: pdillinger
fbshipit-source-id: 537fbc3a787f917915a6faf0bdedd1449a7f378a
Summary:
Complete redo of parallel compression in block_based_table_builder.cc to greatly reduce cross-thread hand-off and blocking. A ring buffer of blocks-in-progress is used to essentially bound working memory while enabling high throughput. Unlike before, all threads can participate in compression work, for a kind of work-stealing algorithm that reduces the need for threads to block. This builds on improvements in https://github.com/facebook/rocksdb/pull/13850
Previously, there was either
* parallel_threads==1, the *emit thread* (caller from flush/compaction) doing all the work
* parallel_threads > 1, the emit thread generates uncompressed blocks, `parallel_threads` worker threads compress blocks, and a writer thread writes to the SST file. Total of `parallel_threads + 2` threads participating. (Other bookkeeping in emit and write steps omitted from description for simplicity.)
Now we have either
* parallel_threads==1 (same), the emit thread doing all the work
* parallel_threads > 1, the emit thread generates uncompressed blocks and can take up compression work when the ring buffer is full; `parallel_threads` worker threads have as their top priority to write compressed blocks to the SST file but also take up compression work in priority order of next-to-write. Total of `parallel_threads + 1` threads participating. In some cases, this could result in less throughput than before, but arguably the previous implementation was using more threads than explicitly allowed.
## Future/alternate considerations
Although we could likely have used some framework for micro-work sharing across threads, that could be difficult with the asymmetry of work loads and thread affinity. Specifically, (a) it would be quite challenging to allow emit work in other threads, because it happens in the caller of BlockBasedTableBuilder, (b) async programming is unlikely to pay off until we have an async interface for writing SST files, and (c) this implementation will nevertheless serve as a benchmark for what we lose or gain in such a framework vs. a hand-tuned system.
This implementation still creates and destroys threads for each SST file created. We hope in the future to have more governance and/or pooling of worker threads across various flushes and compactions, but that is not available currently and would require significant design and implementation work.
## More details
* This implementation makes use of semaphores for idling and re-waking threads. `std::counting_semaphore` and `binary_semaphore` offer the best performance (see benchmark results below) but some implementations are known to have correctness bugs. Also, my attempt at upgrading CI for C++20 support (required for these) in https://github.com/facebook/rocksdb/pull/13904 is actually incomplete. Therefore, using these structures is opt-in with `-DROCKSDB_USE_STD_SEMAPHORES` at compile time, and a naive semaphore implementation based on mutex and condvar is used by default. A folly alternative (folly::fibers::Semaphore) was dropped in during development and found to be less efficient than the naive implementation. One CI job is upgraded to test with the new opt-in.
* One of the biggest concerns about correctness/reliability for this implementation is the possibility of hitting a deadlock, in part because that is not well checked in the DB crash test (a challenging problem!). Note also that with the parallel compression improvements in this release, I am calling the feature production-ready, so there is an extra level of confidence needed in the reliability of the feature. Thus, for DEBUG builds including crash test, I have added a watchdog thread to each parallel SST construction that heuristically checks for the most likely kinds of deadlock that could happen, including for the case of buggy semaphore implementations. It periodically verifies that some thread is outside of its "idle" state, and if the watchdog wakes up repeatedly to see all live threads stuck in their idle state (even if wake-up was attempted) then it declares a deadlock. This feature was manually verified for several seeded deadlock bugs. (More details in code comments.)
* For CPU efficiency, this implementation greatly simplifies the logic to estimate the outstanding or "inflight" size not yet written to the SST file. I expect this size to generally be insignificant relative to the full SST file size so is not worth careful engineering. And based on Meta's current needs, landing under-size for an SST file is better than over-size. See comments on `estimated_inflight_size` for details.
* Some other existing atomics in block_based_table_builder.cc modified to use safe atomic wrappers.
* Status handling in BlockBasedTableBuilder was streamlined to get rid of essentially redundant `status`+`io_status` fields and associated code. Made small optimizations to reduce unnecessary IOStatus copies (with StatusOk()) and mark status conditional branches as LIKELY or UNLIKELY.
* Prefer inline field initialization to initialization in constructor.
* Minimize references to the `parallel_threads` configuration parameter for better separation of concerns / sanitization / etc. For example, use non-nullity of `pc_rep` to indicate that parallel compression is enabled (and active).
* Some other refactoring to aid the new implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13910
Test Plan:
## Correctness
Already integrated into unit tests and crash test. CI updated for opt-in semaphore implementation. Basic semaphore unit tests added/updated.
As for the tremendous simplification of logic relating to hitting target SST file size, as expected, the new behavior could under-shoot the single-threaded behavior by a small number of blocks, which will typically affect the file size by ~1/1000th or less. I think that's a good trade-off for cutting out unnecessarily complex code with non-trivial CPU cost (FileSizeEstimator).
```
./db_bench -db=/dev/shm/dbbench_filesize_after8 -benchmarks=fillseq,compact -num=10000000 -compression_type=zstd -compression_level=8 -compression_parallel_threads=8
```
Before (PT=8 and PT=1) and After (PT=1) are the same or very similar:
```
-rw-r--r-- 1 peterd users 67474097 Sep 12 15:32 000052.sst
-rw-r--r-- 1 peterd users 67474214 Sep 12 15:32 000053.sst
-rw-r--r-- 1 peterd users 67473834 Sep 12 15:32 000054.sst
-rw-r--r-- 1 peterd users 67473437 Sep 12 15:32 000055.sst
-rw-r--r-- 1 peterd users 67473835 Sep 12 15:32 000056.sst
-rw-r--r-- 1 peterd users 67473204 Sep 12 15:33 000057.sst
-rw-r--r-- 1 peterd users 67473294 Sep 12 15:33 000058.sst
-rw-r--r-- 1 peterd users 67473839 Sep 12 15:33 000059.sst
```
After, PT=8 (worst case here ~0.05% smaller)
```
-rw-r--r-- 1 peterd users 67463189 Sep 12 14:55 000052.sst
-rw-r--r-- 1 peterd users 67465233 Sep 12 14:55 000053.sst
-rw-r--r-- 1 peterd users 67466822 Sep 12 14:55 000054.sst
-rw-r--r-- 1 peterd users 67466221 Sep 12 14:55 000055.sst
-rw-r--r-- 1 peterd users 67441675 Sep 12 14:55 000056.sst
-rw-r--r-- 1 peterd users 67467855 Sep 12 14:55 000057.sst
-rw-r--r-- 1 peterd users 67455132 Sep 12 14:55 000058.sst
-rw-r--r-- 1 peterd users 67458334 Sep 12 14:55 000059.sst
```
## Performance, modest load
We are primarily interested in balancing throughput in building SST files and CPU usage in doing so. (For example, we could maximize throughput by having worker threads only spin waiting for work, but that would likely be extra CPU usage we want to avoid to allow other productive CPU work to be scheduled.) No read path code has been touched.
A benchmark script running "before" and "after" configurations at the same time to minimize random machine load effects:
```
$ SUFFIX=`tty | sed 's|/|_|g'`; for CT in none lz4 zstd; do for PT in 1 2 3 4 6 8; do echo -n "$CT pt=$PT -> "; (for I in `seq 1 10`; do BIN=/tmp/dbbench${SUFFIX}.bin; rm -f $BIN; cp db_bench $BIN; /usr/bin/time $BIN -db=/dev/shm/dbbench$SUFFIX --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 -compression_type=$CT -compression_parallel_threads=$PT 2>&1; done) | awk '/micros.op/ {n++; sum += $5;} /system / { cpu += $1 + $2; } END { print "ops/s: " int(sum/n) " cpu*s: " cpu; }'; done; done
```
Before this change:
```
none pt=1 -> ops/s: 1999603 cpu*s: 72.08
none pt=2 -> ops/s: 1871094 cpu*s: 148.3
none pt=3 -> ops/s: 1882907 cpu*s: 147.7
lz4 pt=1 -> ops/s: 1987858 cpu*s: 94.74
lz4 pt=2 -> ops/s: 1590192 cpu*s: 182.65
lz4 pt=3 -> ops/s: 1896294 cpu*s: 174.7
lz4 pt=4 -> ops/s: 1949174 cpu*s: 172.26
lz4 pt=6 -> ops/s: 1912517 cpu*s: 175.91
lz4 pt=8 -> ops/s: 1930585 cpu*s: 176.71
zstd pt=1 -> ops/s: 1239379 cpu*s: 129.85
zstd pt=2 -> ops/s: 1171742 cpu*s: 226.12
zstd pt=3 -> ops/s: 1832574 cpu*s: 214.21
zstd pt=4 -> ops/s: 1887124 cpu*s: 212.51
zstd pt=6 -> ops/s: 1920936 cpu*s: 211.7
zstd pt=8 -> ops/s: 1885544 cpu*s: 214.87
```
After this change:
```
none pt=1 -> ops/s: 1964361 cpu*s: 72.66
none pt=2 -> ops/s: 1914033 cpu*s: 104.95
none pt=3 -> ops/s: 1978567 cpu*s: 100.24
lz4 pt=1 -> ops/s: 2041703 cpu*s: 92.88
lz4 pt=2 -> ops/s: 1903210 cpu*s: 121.64
lz4 pt=3 -> ops/s: 1973906 cpu*s: 122.22
lz4 pt=4 -> ops/s: 1952605 cpu*s: 123.05
lz4 pt=6 -> ops/s: 1957524 cpu*s: 124.31
lz4 pt=8 -> ops/s: 1986274 cpu*s: 129.06
zstd pt=1 -> ops/s: 1233748 cpu*s: 130.43
zstd pt=2 -> ops/s: 1675226 cpu*s: 158.41
zstd pt=3 -> ops/s: 1929878 cpu*s: 159.77
zstd pt=4 -> ops/s: 1916403 cpu*s: 160.99
zstd pt=6 -> ops/s: 1942526 cpu*s: 166.21
zstd pt=8 -> ops/s: 1966704 cpu*s: 171.56
```
For parallel_threads=1, results are very similar, as expected.
For parallel_threads>1, throughput is usually improved a bit, but cpu consumption is dramatically reduced. For zstd, maximum throughput is essentially achieved with pt=3 rather than the previous roughly pt=4 to 6. And the old used about 30% more CPU.
We can also compare with more expensive compression by raising the compression level.
```
SUFFIX=`tty | sed 's|/|_|g'`; CT=zstd; for CL in 4 6 8; do for PT in 1 4 8; do echo -n "$CT@$CL pt=$PT -> "; (for I in `seq 1 10`; do BIN=/tmp/dbbench${SUFFIX}.bin; rm -f $BIN; cp db_bench $BIN; /usr/bin/time $BIN -db=/dev/shm/dbbench$SUFFIX --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 -compression_type=$CT -compression_parallel_threads=$PT -compression_level=$CL 2>&1; done) | awk '/micros.op/ {n++; sum += $5;} /system / { cpu += $1 + $2; } END { print "ops/s: " int(sum/n) " cpu*s: " cpu; }'; done; done
```
Before:
```
zstd@4 pt=1 -> ops/s: 883630 cpu*s: 161.12
zstd@4 pt=4 -> ops/s: 1878206 cpu*s: 243.25
zstd@4 pt=8 -> ops/s: 1885002 cpu*s: 245.89
zstd@6 pt=1 -> ops/s: 710767 cpu*s: 189.44
zstd@6 pt=4 -> ops/s: 1706377 cpu*s: 277.29
zstd@6 pt=8 -> ops/s: 1866736 cpu*s: 275.07
zstd@8 pt=1 -> ops/s: 529047 cpu*s: 237.87
zstd@8 pt=4 -> ops/s: 1401379 cpu*s: 330.61
zstd@8 pt=8 -> ops/s: 1895601 cpu*s: 321.59
```
After:
```
zstd@4 pt=1 -> ops/s: 889905 cpu*s: 161.03
zstd@4 pt=4 -> ops/s: 1942240 cpu*s: 193.18
zstd@4 pt=8 -> ops/s: 1922367 cpu*s: 205.21
zstd@6 pt=1 -> ops/s: 713870 cpu*s: 188.91
zstd@6 pt=4 -> ops/s: 1832314 cpu*s: 219.66
zstd@6 pt=8 -> ops/s: 1949631 cpu*s: 229.34
zstd@8 pt=1 -> ops/s: 530324 cpu*s: 238.02
zstd@8 pt=4 -> ops/s: 1479767 cpu*s: 271.65
zstd@8 pt=8 -> ops/s: 1949631 cpu*s: 275.6
```
And we can also look at the cumulative effect of this change and https://github.com/facebook/rocksdb/pull/13850 that will combine for the parallel compression improvements in the upcoming 10.7 release:
Before both:
```
lz4 pt=1 -> ops/s: 1954445 cpu*s: 95.14
lz4 pt=3 -> ops/s: 1687043 cpu*s: 186.62
lz4 pt=5 -> ops/s: 1708196 cpu*s: 188.33
zstd pt=1 -> ops/s: 1220649 cpu*s: 131.2
zstd pt=3 -> ops/s: 1658100 cpu*s: 227.08
zstd pt=5 -> ops/s: 1685074 cpu*s: 226.08
```
After:
```
lz4 pt=1 -> ops/s: 2048214 cpu*s: 93.24
lz4 pt=3 -> ops/s: 1922049 cpu*s: 122.9
lz4 pt=5 -> ops/s: 1980165 cpu*s: 122.49
zstd pt=1 -> ops/s: 1245165 cpu*s: 128.84
zstd pt=3 -> ops/s: 1956961 cpu*s: 158.73
zstd pt=5 -> ops/s: 1970458 cpu*s: 161.02
```
In summary, before with zstd default level, you could see only
* about 38% increase in throughput for about 73% increase in CPU usage
Now you can get
* about 58% increase in throughput for about 25% increase in CPU usage
## Performance, high load
To validate this for usage on remote compaction workers, we also need to test whether it falls over at high load or anything concerning like that. For this I did a lot of testing with concurrent db_bench and zstd compression_level=8 and parallel_thread (PT) in {1,8} trying to observe "bad" behaviors such as stalls due to preempted threads and such. On a 166 core machine where a "job" is a db_bench process running a fillseq benchmark similar to above in parallel with others, I could summarize the results like this:
10 jobs PT=8 vs. PT=1 -> 12% more CPU usage, 75% reduction in wall time, 1.9 jobs/sec (vs. 0.5)
50 jobs PT=8 vs. PT=1 -> 89% more CPU usage, 27% reduction in wall time, 3.1 jobs/sec (vs. 2.3)
100 jobs PT=8 vs. PT=1 -> 24% more CPU usage, 5% reduction in wall time, 3.25 jobs/sec (vs. 3.1)
150 jobs PT=8 vs. PT=1 -> 4% more CPU usage, 2% increase in wall time, 3.3 jobs/sec (vs. 3.4)
500 jobs PT=8 vs. PT=1 -> 1% more CPU usage, insignificant difference in wall time, 3.3 jobs/sec
Even when there are 4000 threads potentially competing for 166 cores, the throughput (3.3 jobs / sec) is still very close to maximum (3.4). Enabling parallel compression didn't result in notably less throughput (based on wall clock time for all jobs to complete) in any case tested above, and much higher throughput for many cases. If parallel compression causes us to tip from comfortably under-saturating to over-saturating the cores (as in the 50 jobs case), the overall CPU usage can be much higher, presumably due to lower CPU cache hit rates and maybe clock throttling, but parallel compression still has the throughput advantage in those cases.
In other words, what would we stand to gain from being able to intelligently share worker threads between compaction jobs? It doesn't seem that much.
Reviewed By: xingbowang
Differential Revision: D81365623
Pulled By: pdillinger
fbshipit-source-id: 5db5151a959b5d25b84dbe185bc208bd188f2d1c
Summary:
We saw some crash test failures at f46242cef6/table/block_based/block_based_table_iterator.cc (L964-L965). This is likely due to the timestamp not being considered properly in some places in the MultiScan code paths. This PR fixes the issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13938
Test Plan: crash test with timestamp and multiscan: `python3 -u ./tools/db_crashtest.py whitebox --enable_ts --iterpercent=60 --prefix_size=-1 --prefixpercent=0 --readpercent=0 --test_batches_snapshots=0 --use_multiscan=1 --read_fault_one_in=0 --kill_random_test=88888 --interval=60`
Reviewed By: anand1976
Differential Revision: D82175263
Pulled By: cbi42
fbshipit-source-id: 5d40ede1aec15f8faeaa7fd041b939e68611ff73
Summary:
This PR enables Stress Test to fall back to local compaction when a remote compaction fails, allowing the compaction to be retried on the main thread.
If the local compaction succeeds, the stress test will continue without failing. The main thread will log that the remote compaction failed and was retried locally, while detailed failure logs from the remote compaction attempt will still be printed by the worker thread for further investigation.
This approach allows us to keep collecting useful logs for diagnosing remote compaction failures in Stress Test, while ensuring the test continues to run with remote compaction enabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13945
Test Plan:
```
python3 -u tools/db_crashtest.py --cleanup_cmd='' --simple blackbox --remote_compaction_worker_threads=8 --interval=10
```
# Internal Only
https://www.internalfb.com/sandcastle/workflow/1315051091202224133
https://www.internalfb.com/sandcastle/workflow/3382203320165521367
https://www.internalfb.com/sandcastle/workflow/2616591383512372892
https://www.internalfb.com/sandcastle/workflow/4607182418810099066
Reviewed By: hx235
Differential Revision: D82279337
Pulled By: jaykorean
fbshipit-source-id: 6f663ec2eeb642fd4ad885a90efb344432a32f89
Summary:
**Context/Summary:**
Internally `CompactionJobStats::num_input_records` is only used for input record count [verification](1aca60c089/db/compaction/compaction_job.cc (L2535)) and such verification always checks for `CompactionJobStats::has_num_input_records` (now renamed) before using this field. This is needed because `CompactionJobStats::num_input_records` gets its number from `CompactionIterator::NumInputEntryScanned()` in a subcompaction and this number can be inaccurate purposefully to increase performance, see [CompactionIterator::must_count_input_entries](https://github.com/facebook/rocksdb/pull/13929/files#diff-e6c876f655a21865c0f3dff94b9763f1bd40cf88a8a86f04868201b2e845a890R186-R199) for more.
- This PR renames the `CompactionJobStats::has_num_input_records` to more explicit naming and adds more comments. Not a behavior change.
Also, aggregation of `CompactionJobStats::has_num_input_records` among all subcompactions is done by an [AND](1aca60c089/util/compaction_job_stats_impl.cc (L62)) operation, so it's false if any of the subcompactions has this field set to false. The default value of this field should be "true" so that it is not mistakenly false by default. We are currently fine because `CompactionJobStats::Reset()`, which [sets the value to true](1aca60c089/util/compaction_job_stats_impl.cc (L14)), is always called before such aggregation.
- This PR changes the default value to be true.
- Resumable compaction development plans to set `CompactionJobStats::has_num_input_records` to false if the previous compaction carries inaccurate records. In order for this not to be overwritten by the subsequent progress in [here](1aca60c089/db/compaction/compaction_job.cc (L1540-L1543)), this PR also changes the assignment (`=`) to an AND operation and `+=`. With the default value of `CompactionJobStats::has_num_input_records` now being true (or Reset() already called) and `CompactionJobStats::num_input_records=0` already, this is not a behavior change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13929
Test Plan: - Existing UT to test "...changes the default value to be true" is safe.
Reviewed By: jaykorean
Differential Revision: D82014912
Pulled By: hx235
fbshipit-source-id: 6f211c3b2c9eb7d39abf37271d21a4d3f407b934
Summary:
We should add error logging to be able to pinpoint why RocksDB is returning status `NotSupported` for `ReadAsync`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13936
Test Plan: Look at logs (and client logs of error status)
Reviewed By: anand1976
Differential Revision: D82141529
Pulled By: archang19
fbshipit-source-id: c71b70967457be35ef5168321d449f96b2b9441d
Summary:
Fix an uninitialized-value complaint in valgrind caused by gtest printing a padded struct.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13934
Test Plan: CI. Verified that valgrind no longer complains about it.
Reviewed By: pdillinger
Differential Revision: D82124983
Pulled By: xingbowang
fbshipit-source-id: 99eb7bab99726c45affe0a231777e5951844d73b
Summary:
... and associated statistics, etc. Someone needs it, so here it is.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13927
Test Plan: Updated / extended / added some unit tests
Reviewed By: cbi42
Differential Revision: D81981469
Pulled By: pdillinger
fbshipit-source-id: 52558c08741890b781310906acbc18d9eb479363
Summary:
There are some internal use cases that do not map cleanly onto the existing `IOActivity` enums. This PR creates new custom IOActivity types that internal users can use as they see fit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13924
Test Plan: Wrote a simple unit test
Reviewed By: pdillinger
Differential Revision: D82029992
Pulled By: archang19
fbshipit-source-id: a3e23c360baa96cd2e9adf570e71c6e43947bfc8
Summary:
PointLockManager manages point locks per key. The old implementation partitions the per-key locks into 16 stripes; each stripe handles the point locks for a subset of keys and has only one conditional variable. This conditional variable is used by all the transactions that are waiting for their turn to acquire a lock on a key that belongs to this stripe.
In production, we noticed that when there are multiple transactions trying to write to the same key, all of them wait on the same conditional variable. When the previous lock holder releases the key, all of the waiting transactions are woken up, but only one of them can proceed, and the rest go back to sleep. This wastes a lot of CPU cycles. In addition, when other keys are being locked/unlocked on the same lock stripe, the problem becomes even worse.
In order to solve this issue, we implemented a new PerKeyPointLockManager that keeps a transaction waiter queue at the per-key level. When a transaction cannot acquire a lock immediately, it joins the waiter queue of the key and waits on a dedicated conditional variable. When the previous lock holder releases the lock, it wakes up the next set of transactions from the waiting queue that are eligible to acquire the lock. The queue respects FIFO order, except that it prioritizes lock upgrade/downgrade operations.
However, this waiter queue change increases the deadlock detection cost, because transactions waiting in the queue also need to be considered during deadlock detection. To resolve this issue, a new deadlock_timeout_us (microseconds) configuration is introduced in the transaction options. Essentially, when a transaction is waiting on a lock, it joins the wait queue and waits for the duration configured by deadlock_timeout_us without performing deadlock detection. If the transaction didn't get the lock by the time deadlock_timeout_us is reached, it then performs deadlock detection and waits until lock_timeout is reached. This optimization relies on the heuristic that the majority of transactions are able to get the lock without performing deadlock detection.
The deadlock_timeout_us configuration needs to be tuned for different workloads. If the likelihood of deadlock is very low, deadlock_timeout_us can be configured to be a bit higher than the average transaction execution time, so that the majority of transactions acquire the lock without performing deadlock detection. If the likelihood of deadlock is high, deadlock_timeout_us can be configured with a lower value, so that deadlocks get detected faster.
The new PerKeyPointLockManager is disabled by default. It can be enabled via TransactionDBOptions.use_per_key_point_lock_mgr. The deadlock_timeout_us setting is only effective when PerKeyPointLockManager is used. When deadlock_timeout_us is set to 0, transactions perform deadlock detection immediately before waiting.
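The per-key waiter-queue idea can be sketched standalone as follows. This is illustrative only: the class and method names are invented (it is not the actual PerKeyPointLockManager), it omits shared/upgrade locks and deadlock detection, and a single mutex guards the whole table where a real implementation would shard it.

```cpp
#include <condition_variable>
#include <map>
#include <mutex>
#include <string>

// Illustrative sketch: one waiter count and condition variable per key, so
// releasing a lock wakes only waiters for that key instead of every waiter
// in a multi-key stripe.
class PerKeyLockSketch {
 public:
  void Lock(const std::string& key) {
    std::unique_lock<std::mutex> l(mu_);
    Entry& e = keys_[key];  // creates the entry on first use
    ++e.waiters;
    e.cv.wait(l, [&e] { return !e.held; });  // only this key's waiters sleep here
    --e.waiters;
    e.held = true;
  }

  void Unlock(const std::string& key) {
    std::lock_guard<std::mutex> l(mu_);
    auto it = keys_.find(key);
    it->second.held = false;
    if (it->second.waiters > 0) {
      it->second.cv.notify_one();  // wake exactly one waiter for this key
    } else {
      keys_.erase(it);  // no waiters left: reclaim the entry
    }
  }

  bool IsHeld(const std::string& key) {
    std::lock_guard<std::mutex> l(mu_);
    auto it = keys_.find(key);
    return it != keys_.end() && it->second.held;
  }

 private:
  struct Entry {
    bool held = false;
    int waiters = 0;
    std::condition_variable cv;
  };
  std::mutex mu_;
  std::map<std::string, Entry> keys_;  // node stability keeps Entry refs valid
};
```

With a striped design, `notify_all` on the stripe's single condition variable wakes every waiter in the stripe; here `notify_one` on the key's own condition variable wakes exactly one eligible transaction.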
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13731
Test Plan:
Unit test.
Stress unit test that validates deadlock detection and exclusive, shared lock guarantee.
A new point_lock_bench binary is created to help perform performance test.
Reviewed By: pdillinger
Differential Revision: D77353607
Pulled By: xingbowang
fbshipit-source-id: 21cf93354f9a367a78c8666596ed14013ac7240b
Summary:
A follow-up to https://github.com/facebook/rocksdb/issues/13904 which was incomplete in updating CI jobs to support C++20 because the C++20 usage was only in tests. Here we add subtle C++20 usage in the public API ("using enum" feature in db.h) to force the issue.
A lot of the work for this PR was in updating the Ubuntu22 docker image, for earlier compiler/runtime versions supporting C++20, and generating a new Ubuntu24 docker image, for later compiler/runtime versions. The Ubuntu22 image needed to be updated because there are incompatibilities with clang-13 + c++20 + libstdc++ for gcc 11, seen on these examples
```
#include <chrono>
int main(int argc, char *argv[]) {
  std::chrono::microseconds d = {};
  return 0;
}
```
and
```
#include <coroutine>
int main() { return 0; }
```
The second was causing recurring failures in build-linux-clang-13-asan-ubsan-with-folly, now fixed.
So we have to install clang's libc++ to compile with clang-13. I haven't been able to get this to work with some of the libraries like benchmark, glog, and/or gflags, but I'm able to compile core RocksDB with clang-13. On this docker image, an extra compiler parameter is needed to compile with gcc and glog, because glog is built from source (perhaps not perfectly), as the ubuntu package transitively conflicts with libc++.
The Ubuntu24 image seems to be low-drama and generally work for testing out newer compiler versions. The mingw build uses Ubuntu24 because the mingw package on Ubuntu22 uses a gcc version that is too old.
And the mass of other code changes are trying to work around new warnings, mostly from clang-analyze, which I upgraded to clang-18 in CI.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13915
Test Plan: CI, including temporarily including the nightly jobs in the PR jobs in earlier revisions to test and stabilize
Reviewed By: archang19
Differential Revision: D81933067
Pulled By: pdillinger
fbshipit-source-id: 7e33823006a79d5f3cf5bc1d625f0a3c08a7d74c
Summary:
After running stress test over a week, we've identified more failures to fix. While we work on the fix, disable the remote compaction temporarily to reduce noise and avoid these failures hiding other failures.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13925
Test Plan: CI
Reviewed By: anand1976
Differential Revision: D81934248
Pulled By: jaykorean
fbshipit-source-id: 9ac11926429eebe1aebf7b520a548dc5987b7d76
Summary:
This diff adds logging in various places in the external file ingestion code where we check for non-OK status codes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13905
Test Plan: Debugging external file ingestion should be easier with additional logging.
Differential Revision: D81814033
Pulled By: archang19
fbshipit-source-id: 77f8b342cbad892acedc4603c02865c38886f2f4
Summary:
If user_defined_index_factory in BlockBasedTableOptions is configured and we try to open an SST file without the corresponding UDI (either during DB open or file ingestion), ignore a failure to load the UDI by default. If fail_if_no_udi_on_open in BlockBasedTableOptions is true, then treat it as a fatal error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13921
Test Plan: Update unit tests
Reviewed By: xingbowang
Differential Revision: D81826054
Pulled By: anand1976
fbshipit-source-id: f4fe0b13ccb02b9448622af487680131e349c52b
Summary:
Add a new option `MultiScanArgs::max_prefetch_size` that limits the memory usage of per file pinning of prefetched blocks. Note that this only accounts for compressed block size. This is intended to be a stopgap until we implement some kind of global prefetch manager that limits the global multiscan memory usage.
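The budget check can be sketched as follows. This is a hypothetical helper, not the actual `MultiScanArgs` accounting; the "0 means unlimited" convention is an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative budget check in the spirit of max_prefetch_size: count how
// many leading blocks can be pinned before the accumulated compressed size
// would exceed the budget. A budget of 0 means "unlimited" in this sketch.
size_t BlocksWithinBudget(const std::vector<uint64_t>& compressed_sizes,
                          uint64_t max_prefetch_size) {
  if (max_prefetch_size == 0) {
    return compressed_sizes.size();
  }
  uint64_t used = 0;
  size_t n = 0;
  for (uint64_t sz : compressed_sizes) {
    if (used + sz > max_prefetch_size) {
      break;  // pinning this block would exceed the budget
    }
    used += sz;
    ++n;
  }
  return n;
}
```

Note that, as in the change, only compressed block sizes are counted, so actual memory use after decompression can be higher.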
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13920
Test Plan: new unit test `./block_based_table_reader_test --gtest_filter="*MultiScanPrefetchSizeLimit/*"`
Reviewed By: xingbowang
Differential Revision: D81630629
Pulled By: cbi42
fbshipit-source-id: 9f66678915242fe1220620531a4b9fd22747cdea
Summary:
# Summary
Until we get WAL + Remote Compaction in Stress Test working, temporarily disable this
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13919
Test Plan: Meta Internal CI run
Reviewed By: anand1976
Differential Revision: D81605621
Pulled By: jaykorean
fbshipit-source-id: 6e1f9a0a7a0f27e7465512689b51364b63ef3e2b
Summary:
Re-enabling Remote Compaction Stress Test with some changes to stress test feature combo sanitization
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13913
Test Plan:
Ran Meta Internal Tests for a few days
# Follow up
- Skip recovering from WAL in remote worker and re-enable WAL
- Investigate and fix races with Integrated BlobDB
Reviewed By: hx235
Differential Revision: D81509225
Pulled By: jaykorean
fbshipit-source-id: 949762c48ece0a25e3d0281e3510f1e7d3fe3667
Summary:
**Context/Summary:**
A small change as titled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13891
Test Plan: - Existing UT and rehearsal stress test
Reviewed By: jaykorean
Differential Revision: D80588011
Pulled By: hx235
fbshipit-source-id: 6987e08a4855782305ad742eef6c0196da0d67ca
Summary:
I am wanting to use std::counting_semaphore for something and the timing seems good to require C++20 support. The internets suggest:
* GCC >= 10 is adequate, >= 11 preferred
* Clang >= 10 is needed
* Visual Studio >= 2019 is adequate
And popular linux distributions look like this:
* CentOS Stream 9 -> GCC 11.2 (CentOS 8 is EOL)
* Ubuntu 22.04 LTS -> GCC 11.x (Ubuntu 20 just ended standard support)
* Debian 12 (oldstable) -> GCC 12.2
* (Debian 11 has ended security updates, uses GCC 10.2)
This required generating a new docker image based on Ubuntu 22 for CI using gcc. The existing Ubuntu 20 image works for covering appropriate clang versions (though we should maybe add a much later version as well, in the next increment of our Ubuntu 22 image; however the minimum available clang build from apt.llvm.org for Ubuntu 22 is clang 13).
Update to SetDumpFilter is to quiet a mysterious gcc-13 warning-as-error.
Removed --compile-no-warning-as-error from a cmake command line because cmake in the new docker image is too old for this option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13904
Test Plan: CI, one minor unit test added to verify std::counting_semaphore works
Reviewed By: xingbowang
Differential Revision: D81266435
Pulled By: pdillinger
fbshipit-source-id: 26040eeccca7004416e29a6ff4f6ea93f2052684
Summary:
**Context/Summary:**
`ProcessKeyValueCompaction()` has grown too long to reason about or to add any logic for resuming from some key and saving progress for resumable compaction. This PR breaks this function into smaller functions. Almost all of the changes are cosmetic, except for one thing pointed out in the PR conversation below.
Specifically, this PR did the following:
- Added `SubcompactionInternalIterators`, `SubcompactionKeyBoundaries` and `BlobFileResources` to manage the lifetime of the local variables of the original functions to be used across smaller functions
- Moved AutoThreadOperationStageUpdater and some IO stats measurement to a different place that makes more sense
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13879
Test Plan: Existing UT
Reviewed By: jaykorean
Differential Revision: D80216092
Pulled By: hx235
fbshipit-source-id: 515615906e5e5fd5ec191bcdd4126f17d282cac2
Summary:
The implementation of parallel compression has historically scaled rather poorly, or perhaps modestly with heavy compression, topping out around 3x throughput vs. serial and incurring big overheads in CPU consumption relative to the throughput.
This change addresses one source of that extra CPU consumption: stashing all the keys of a block for later processing into building index and filter blocks. Historically with parallel compression, the index and filter block updates were handled in the last stage of processing along with writing each data block to the file writer. This was because the index blocks needed to know the BlockHandle of the new data block, which could only be known after every preceding data block was compressed, to know the starting location for the BlockHandle. And because index and filter partitions were historically coupled (see decouple_partitioned_filters), filter updates had to happen at the same time.
Here we get rid of stashing the keys for later processing and the extra CPU associated with it, by
* Creating a two stage process of adding to index blocks ("prepare" and "finish" each entry; one entry per data block). The two stages must be executable in parallel for separate index entries. NOTE: not yet supported by UserDefinedIndex
* Requiring decouple_partitioned_filters=true for parallel compression, because we now add to filters in the first stage of processing when each key is readily available and we cannot couple that with finalizing index entries in the last stage of processing.
It might seem like adding to filters is something that is expensive (hashing etc.) and should be kept out of the bottle-neck first stage of processing (which includes walking the compaction iterator) but it's probably similar cost to simply stashing the keys away for later processing. (We might be able to reduce a bottle-neck by stashing hashes, but we're not to a point where that is worth the effort.)
And it makes sense to make two more simple public API updates in conjunction with this:
* Set decouple_partitioned_filters=true by default. No signs of problems in production.
* Mark parallel compression as production-ready. It's being thoroughly tested in the crash test, successfully, and in limited production uses.
Follow-up:
* Improve the threading/sychronization model of parallel compression for the next major efficiency improvement
* Consider supporting the parallel-compatible index building APIs with UserDefinedIndex, unless it's considered too dangerous to expect users to safely handle the multi-threading.
* (In a subsequent release) remove all the code associated with coupling filter and index partitions and mark the option as ignored.
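The prepare/finish split described above can be sketched as follows. Names are illustrative (this is not RocksDB's actual index-builder API), and a real implementation must make the first stage safe to run concurrently for separate entries; this sketch is single-threaded for brevity.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative two-stage index-entry flow: Prepare() runs in the first
// stage, when the block's last key is readily available; Finish() runs in
// the last stage, once all preceding blocks are compressed and the
// BlockHandle (offset/size in the file) is finally known.
struct BlockHandleSketch {
  uint64_t offset = 0;
  uint64_t size = 0;
};

class TwoStageIndexSketch {
 public:
  // Stage 1: record the separator key; returns a token to finish later.
  size_t Prepare(const std::string& last_key_in_block) {
    entries_.push_back({last_key_in_block, BlockHandleSketch{}});
    return entries_.size() - 1;
  }
  // Final stage: fill in the handle once offset/size are known.
  void Finish(size_t token, BlockHandleSketch handle) {
    entries_[token].handle = handle;
  }
  size_t num_entries() const { return entries_.size(); }
  uint64_t offset_of(size_t token) const {
    return entries_[token].handle.offset;
  }

 private:
  struct Entry {
    std::string key;
    BlockHandleSketch handle;
  };
  std::vector<Entry> entries_;
};
```

The point of the split is that filter updates (and key capture) can happen in stage 1 while the handle is deferred, instead of stashing every key until the last stage.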
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13850
Test Plan:
for correctness, existing tests
## Performance Data
The "before" data here includes revert of https://github.com/facebook/rocksdb/issues/13828 for combined performance measurement of this change and that one.
```
SUFFIX=`tty | sed 's|/|_|g'`; for CT in lz4 zstd lz4; do for PT in 1 2 3 4 6 8; do echo "$CT pt=$PT"; (for I in `seq 1 1`; do BIN=/dev/shm/dbbench${SUFFIX}.bin; rm -f $BIN; cp db_bench $BIN; /usr/bin/time $BIN -db=/dev/shm/dbbench$SUFFIX --benchmarks=fillseq -num=30000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 -compression_type=$CT -compression_parallel_threads=$PT 2>&1 | tail -n 3 | head -n 2; done); done; done
```
To get a sense of the overall performance relative to number of parallel threads, we vary that with popular fast compression and popular heavier weight compression (some noise in this data, don't interpret each data point too strongly)
lz4 pt=1
2107431 -> 2112941 ops/sec (+0.3% - improvement)
(26.51 + 0.75) = 27.26 CPU sec -> (26.63 + 0.79) = 27.42 CPU sec (+0.6% - regression)
lz4 pt=2
1606660 -> 1580333 ops/sec (-1.6% - regression)
(47.10 + 8.37) = 55.47 CPU sec -> (45.05 + 9.23) = 54.28 CPU sec (-2.2% - improvement)
lz4 pt=3
1701353 -> 1889283 ops/sec (+11.1% - improvement)
(47.23 + 8.29) = 55.52 CPU sec -> (43.89 + 8.33) = 52.22 CPU sec (-6.0% - improvement)
lz4 pt=4
1651504 -> 1817890 ops/sec (+10.1% - improvement)
(48.07 + 8.31) = 56.38 CPU sec -> (44.77 + 8.45) = 53.22 CPU sec (-5.6% - improvement)
lz4 pt=6
1716099 -> 1888523 ops/sec (+10.1% - improvement)
(47.50 + 8.45) = 55.95 CPU sec -> (44.25 + 8.73) = 52.98 CPU sec (-5.3% - improvement)
lz4 pt=8
1696840 -> 1797256 ops/sec (+5.9% - improvement)
(48.09 + 8.61) = 56.70 CPU sec -> (45.90 + 8.68) = 54.58 CPU sec (-3.8% - improvement)
Clearly parallel threads do not help with fast compression like LZ4, but it's not as bad as it was before.
zstd pt=1
1214258 -> 1202863 ops/sec (-0.9% - regression)
(38.26 + 0.66) = 38.92 CPU sec -> (39.37 + 0.69) = 40.06 CPU sec (+2.9% - regression)
zstd pt=2
1194673 -> 1152746 ops/sec (-3.5% - regression)
(61.01 + 9.85) = 70.86 CPU sec -> (58.28 + 9.99) = 68.27 CPU sec (-3.7% - improvement)
zstd pt=3
1653661 -> 1825618 ops/sec (+10.4% - improvement)
(60.07 + 8.45) = 68.52 CPU sec -> (56.03 + 8.43) = 64.46 CPU sec (-5.9% - improvement)
zstd pt=4
1691723 -> 1890976 ops/sec (+11.8% - improvement)
(59.72 + 8.46) = 68.18 CPU sec -> (55.96 + 8.27) = 64.23 CPU sec (-5.7% - improvement)
zstd pt=6
1684982 -> 1900002 ops/sec (+12.8% - improvement)
(58.89 + 8.26) = 67.15 CPU sec -> (55.98 + 8.48) = 64.46 CPU sec (-4.0% - improvement)
zstd pt=8
1648282 -> 1892531 ops/sec (+14.8% - improvement)
(59.43 + 8.63) = 68.06 CPU sec -> (56.49 + 8.32) = 64.81 CPU sec (-4.8% - improvement)
The throughput is now able to increase by *more than half* with lots of parallelism, rather than only *about a third*.
Scalability is a bit better with higher compression level, and we still see a benefit from this change. (We've also enabled partitioned indexes and filters here, which sees essentially the same benefits):
zstd pt=1 compression_level=7
595720 -> 597359 ops/sec (+0.3% - improvement)
(63.45 + 0.73) = 64.18 CPU sec -> (63.25 + 0.71) = 63.96 CPU sec (-0.3% - improvement)
zstd pt=4 compression_level=7
1527116 -> 1501779 ops/sec (-1.7% - regression)
(85.00 + 8.14) = 93.14 CPU sec -> (81.85 + 9.02) = 90.87 CPU sec (-2.5% - improvement)
zstd pt=6 compression_level=7
1678239 -> 1956070 ops/sec (+16.5% - improvement)
(83.77 + 8.11) = 91.88 CPU sec -> (79.87 + 7.78) = 87.65 CPU sec (-4.6% - improvement)
zstd pt=8 compression_level=7
1696132 -> 1953041 ops/sec (+15.1% - improvement)
(83.97 + 8.14) = 92.11 CPU sec -> (80.61 + 7.78) = 88.39 CPU sec (-4.1% - improvement)
With more tests, not really seeing any consistent differences with no parallelism (despite some micro-optimizations thrown in)
Reviewed By: hx235
Differential Revision: D79853111
Pulled By: pdillinger
fbshipit-source-id: 7a34fd7811217fb74fa6d3efaea7ffcce72beec7
Summary:
**Context/Summary:**
RocksDB stress test verifies that IOActivity is set correctly by reusing the passed-in Read/Write options through assertions. This is too strict for APIs that do not take, or do not need to take, Read/WriteOptions yet, hence the assertion failure.
```
stderr:
db_stress: ... db_stress_tool/db_stress_env_wrapper.h:24: void rocksdb::(anonymous namespace)::CheckIOActivity(const IOOptions &): Assertion `io_activity == Env::IOActivity::kUnknown || io_activity == options.io_activity' failed.
Received signal 6 (Aborted)
```
An example is `ManagedSnapshot snapshot_guard(db_);` in TestMultiScan().
This PR ignores such check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13898
Test Plan: The same command repro-ed this assertion failure passes after this fix
Reviewed By: archang19
Differential Revision: D80983214
Pulled By: hx235
fbshipit-source-id: d8b660f8c8771198bc7fa0e805c3e86d2584f03e
Summary:
**Context/Summary:**
Clear the statistics reference from options_ to intentionally shorten the statistics object's lifetime to be the same as the db object's (which is the common case in practice) and detect if RocksDB accesses the statistics beyond its lifetime.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13899
Test Plan: - [Ongoing] Stress test rehearsal
Reviewed By: pdillinger
Differential Revision: D80985435
Pulled By: hx235
fbshipit-source-id: ab238231cd81f47fa451aea12a0c85fa11d9ac81
Summary:
`IngestExternalFileOptions::allow_db_generated_files` requires SST files to have zero sequence number. This PR opens it up for any DB generated SST files. Currently we don't do global sequence number assignment when `allow_db_generated_files` is true, so we require that files do not overlap with any key in the CF. One behavior difference is that now we allow ingesting overlapping files when `allow_db_generated_files` is true. Users need to ensure that files are ordered such that later files have more recent updates.
Intended follow ups:
- Record smallest seqno in table property, so that we don't need to scan the file for it.
- Cover allow_db_generated_files in crash test. We may create a new DB and ingest all files from a CF for verification.
- Add APIs that use allow_db_generated_files. For example, an API for ingesting SST files from a source CF, so that we take care of ingestion file ordering for the user. If we are already getting metadata from the source CF, we may use it as a hint for level placement instead of dividing input files into batches again (`ExternalSstFileIngestionJob::DivideInputFilesIntoBatches`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13878
Test Plan: two new unit tests.
Reviewed By: hx235, xingbowang
Differential Revision: D80233727
Pulled By: cbi42
fbshipit-source-id: 74209386d8426c434bff2d9a734f06db537eb50c
Summary:
I saw a failure when adding some asserts near b9957c991c/table/block_based/block_based_table_iterator.cc (L1201-L1205) in stress test. The decompression failed with an error message like "Corruption: Failed zlib inflate: -3". This PR fixes the issue by using the right decompressor for dictionary compression.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13896
Test Plan: updated unit test that checks no I/O is done after Prepare(), this would fail before this change.
Reviewed By: anand1976
Differential Revision: D80821500
Pulled By: cbi42
fbshipit-source-id: a4322c0da99a2d10e9787d0ec168668567c0c19a
Summary:
RocksDB currently aborts whenever `io_uring_wait_cqe` returns an error code. It also does not log what error code was returned.
While experimenting with `IO_URING`, my application crashed because of this.
I asked the Linux kernel user group about the best way to handle an unsuccessful `io_uring_wait_cqe`.
It was recommended to retry on `EINTR`, `EAGAIN`, and `ETIME`. `ETIME` only happens when waiting with a timeout, so I am not handling it.
I also write to `stderr` so that we have some debugging information if we abort.
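The retry policy can be sketched generically as follows. This is a hypothetical helper with no liburing dependency: `op` stands in for a call like `io_uring_wait_cqe`, returning 0 on success or a negative errno value on failure, and the retry cap is an assumption of this sketch.

```cpp
#include <cerrno>
#include <functional>

// Retry transient errors (EINTR, EAGAIN); return anything else to the
// caller, which can log it to stderr before deciding whether to abort.
// ETIME is not handled here since it only occurs when waiting with a
// timeout.
int RetryOnTransientErrors(const std::function<int()>& op, int max_retries) {
  int ret = op();
  for (int attempt = 0; attempt < max_retries; ++attempt) {
    if (ret != -EINTR && ret != -EAGAIN) {
      break;  // success, or a non-transient error worth reporting
    }
    ret = op();
  }
  return ret;
}
```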
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13890
Test Plan: Unfortunately this is hard to cover through unit/stress tests. We have to see what sort of errors get encountered in production.
Reviewed By: anand1976
Differential Revision: D80639955
Pulled By: archang19
fbshipit-source-id: e3a230bd37552ec0f36be34e6a4e53cfd2a254f1
Summary:
When fill_cache in ReadOptions is false, multi scan Prepare crashes with the following assertion failure. In this case, CreateAndPinBlockInCache needs to directly create a block with full ownership.
```
#9  0x00007f2fc003bc93 in __GI___assert_fail (assertion=0x7f2fc2147361 "pinned_data_blocks_guard[block_idx].GetValue()", file=0x7f2fc2146e08 "table/block_based/block_based_table_iterator.cc", line=1178, function=0x7f2fc2147262 "virtual void rocksdb::BlockBasedTableIterator::Prepare(const rocksdb::MultiScanArgs *)") at assert.c:101
101     in assert.c
#10 0x00007f2fc1d73088 in rocksdb::BlockBasedTableIterator::Prepare(rocksdb::MultiScanArgs const*) () from /data/users/anand76/rocksdb_anand76/librocksdb.so.10.6
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13889
Test Plan: Parameterize the DBMultiScanIteratorTest tests with fill_cache
Reviewed By: cbi42
Differential Revision: D80552069
Pulled By: anand1976
fbshipit-source-id: 1a0b64af1e14c63d826add1f994a832ebff12757
Summary:
I ran multiple runs of crash test jobs internally, so far I've seen one iterator mismatch and one assertion failure. I've added relevant logging improvements to help debugging them. use_multiscan will be stable within a crash test run to make it easier to triage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13888
Test Plan: `python3 tools/db_crashtest.py whitebox --prefix_size=-1 --test_batches_snapshots=0 --use_multiscan=1 --read_fault_one_in=0 --kill_random_test=88888`
Reviewed By: anand1976
Differential Revision: D80627399
Pulled By: cbi42
fbshipit-source-id: 2fa3f77e730f5bc7d1d200dc122cf84e3558c588
Summary:
The assert occasionally throws off the stress test runs. We already have sufficient logging in place to collect the signal about secondary cache capacity exceeding primary cache reservation for further investigation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13885
Reviewed By: anand1976
Differential Revision: D80355513
Pulled By: mszeszko-meta
fbshipit-source-id: b36926f0493a3aca19818a1980ef79277db9fe7e
Summary:
Add the --list_meta_blocks option to sst_dump. This PR also refactors some of the test code in sst_dump_test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13838
Reviewed By: cbi42
Differential Revision: D80320812
Pulled By: anand1976
fbshipit-source-id: 921b6560fbd756f5f8b364893700d240d3b7ad00
Summary:
Two instances of change that are not just cosmetic:
* InlineSkipList<>::Node::CASNext() was implicitly using memory_order_seq_cst to access `next_` while it's intended to be accessed with acquire/release. This is probably not a correctness issue for compare_exchange_strong but potentially a previously missed optimization.
* Similar for `max_height_` in Insert which is otherwise accessed with relaxed memory order.
* One non-relaxed access to `is_range_del_table_empty_` in a function only used in assertions. Access to this atomic is otherwise relaxed (and should be - comment added)
Didn't do all of memtable.h because some of them are more complicated changes and I should probably add FetchMin and FetchMax functions to simplify and take advantage of C++26 functions where available (intended follow-up).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13844
Test Plan: existing tests
Reviewed By: xingbowang
Differential Revision: D79742552
Pulled By: pdillinger
fbshipit-source-id: d97ce72ba9af6c105694b7d40622db9e994720cd
Summary:
This is an important feature for avoiding (reducing) unfair block cache treatment for a lot of blocks. It should also unlock some parallel optimizations (https://github.com/facebook/rocksdb/issues/13850) and code simplification.
Consider for follow-up:
* Feature to avoid majorly undersized data blocks and filter and index partition blocks
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13881
Test Plan: existing tests, been looking good in production
Reviewed By: hx235
Differential Revision: D80288192
Pulled By: pdillinger
fbshipit-source-id: 5e274ffffb044713278d2a286db6bceaab2dadec
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13882
The `expect_valid_internal_key` parameter was always passed as true, with false only used in one unit test. This change removes the parameter and always fails compaction when encountering corrupted internal keys, which is the expected production behavior.
Reviewed By: mszeszko-meta
Differential Revision: D80287672
fbshipit-source-id: e30a282ac30d7fded677504cec11173de8d15167
Summary:
Allow a user defined index to be configured from a string
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13880
Test Plan: Add a unit test in table_test.cc
Reviewed By: bikash-c
Differential Revision: D80237701
Pulled By: anand1976
fbshipit-source-id: 8b3d0bcdfbb4bb76803916ea1b1f940a4d985dfd
Summary:
The original intention of the User Defined Index interface was to use the user key. However, the implementation mixed user and internal key usage. This PR makes it consistent. It also clarifies the UDI contract.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13865
Test Plan: Update tests in table_test.cc
Reviewed By: pdillinger
Differential Revision: D80050344
Pulled By: anand1976
fbshipit-source-id: ace47737d21684ec19709640a09e198cee2d98bd
Summary:
... as we see some issues that rehearsal stress test didn't surface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13869
Reviewed By: cbi42
Differential Revision: D80103341
Pulled By: hx235
fbshipit-source-id: 8b2c1d76d4c3099727ba3a69de44de67afd64369
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13846
This diff addresses a few issues that were identified during testing of the user defined index.
1. During the finishing of the index blocks, we run into an infinite loop because the user defined index wrapper returns
early on incomplete status. This happens because the wrapper blindly returns the status if it is not OK. But, the status
could legitimately be `Incomplete()` for some indices like Partitioned Index (serving as the internal index for the UDI
wrapper). Fix is to exclude `Incomplete()` check from the status check early in the UDI wrapper's finish.
2. Once we fixed (1), we noticed that the meta blocks for the UDI-based index writer were not written out to the final
SST file. This is because the UDI's meta blocks are created after the internal index's meta blocks and the block-based
index builder didn't account for this. The fix is to finish the UDI wrapper first which will create the necessary meta blocks
and then finish the internal index. If the internal index is incomplete, the block-based index builder should still continue
to write out the meta blocks.
3. OnKeyAdded when delegating to the user-defined index should only pass the user key. The UDI builder doesn't
understand RocksDB's internal key format and while that poses interesting challenges when the UDI is used for non
last level SST files, our plan is to restrict the usage of the UDI to last level files only (for now).
Reviewed By: pdillinger
Differential Revision: D79781453
fbshipit-source-id: 2239c8fc016da55df5c24be6aacc8f6357cab029
Summary:
fix the following error showing up in continuous tests:
```
Makefile:186: Warning: Compiling in debug mode. Don't use the resulting binary in production
port/mmap.cc:46:15: error: first argument in call to 'memcpy' is a pointer to non-trivially copyable type 'rocksdb::MemMapping' [-Werror,-Wnontrivial-memcall]
46 | std::memcpy(this, &other, sizeof(*this));
| ^
port/mmap.cc:46:15: note: explicitly cast the pointer to silence this warning
46 | std::memcpy(this, &other, sizeof(*this));
| ^
| (void*)
1 error generated.
make: *** [Makefile:2580: port/mmap.o] Error 1
make: *** Waiting for unfinished jobs....
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13864
Test Plan: `make USE_CLANG=1 j=150 check` with 13f054febb/build_tools/build_detect_platform (L61-L70) commented out.
Reviewed By: mszeszko-meta
Differential Revision: D80033441
Pulled By: cbi42
fbshipit-source-id: b2330eea71fe28243236b75128ec6f3f1e971873
Summary:
while debugging stress test failure, I noticed that sst_dump and ldb do not work if custom db_stress compression manager is used. This PR adds support for it.
```
./sst_dump --command=raw --show_properties --file=/tmp/rocksdb_crashtest_whitebox4ny5mass/000589.sst
options.env is 0x7f2b1f4b9000
Process /tmp/rocksdb_crashtest_whitebox4ny5mass/000589.sst
Sst file format: block-based
/tmp/rocksdb_crashtest_whitebox4ny5mass/000589.sst: Not implemented: Could not load CompressionManager: DbStressCustom1
/tmp/rocksdb_crashtest_whitebox4ny5mass/000589.sst is not a valid SST file
./ldb idump --db=/tmp/rocksdb_crashtest_whiteboxy_emah11 --ignore_unknown_options --hex >> /tmp/i_dump
Failed: Not implemented: Could not load CompressionManager: DbStressCustom1
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13827
Test Plan: manually tested that ldb and sst_dump work with DbStressCustomCompressionManager after this PR
Reviewed By: pdillinger
Differential Revision: D79461175
Pulled By: cbi42
fbshipit-source-id: c8c092b10b4fde3a295b00751057749e8f0cf095
Summary:
To better support future options and changes, we need to convert the std::vector<ScanOptions> to something more malleable.
This diff introduces the MultiScanOptions structure and pipes it through the various points in the code in the Prepare path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13837
Test Plan:
Ensure all associated tests pass
```
make check all
```
Reviewed By: cbi42
Differential Revision: D79655229
Pulled By: krhancoc
fbshipit-source-id: 3a90fb7420e9655021de85ed0158b866f8bfba05
Summary:
**Context/Summary:**
This update, which should have been part of a previous refactoring [PR](d2ac955881), involves simple renaming for clarity and ensures output table properties are only set when compaction succeeds. Output properties are not meaningful if compaction fails, so this change prevents their population in such cases. Additionally, subsequent statistics updates already do not rely on output file table properties, maintaining correctness regardless of compaction success.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13851
Test Plan: Existing unit tests
Reviewed By: jaykorean
Differential Revision: D79862244
Pulled By: hx235
fbshipit-source-id: 1db16b8dc7b820fab3ec1d5c8a4b757466590e2c
Summary:
**Context/Summary:**
The `CompactionJob::Run()` method has grown too large and complex, making it difficult to implement moderate changes or reason about the code flow (e.g., determining where to save compaction progress for resuming). This PR refactors the method into smaller, more focused functions to improve readability and maintainability.
The refactoring consists mostly of cosmetic changes that extract logical sections into separate methods, with two notable functional improvements:
1. **Relocated output processing logic**: Moved code under `RemoveEmptyOutputs()` and `HasNewBlobFiles()` to where it's actually needed, rather than piggy-backing on the subcompaction state loop. While this introduces 2 additional loops over subcompactions, the performance impact should be negligible given the improved code clarity.
2. **Repositioned statistics updates**: Moved `UpdateCompactionJobInputStats()` and `UpdateCompactionJobOutputStats()` from the record verification section to the end `FinalizeCompactionRun()` methods. This change is safe since record verification is a read-only operation that doesn't modify any statistics.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13849
Test Plan: Existing unit tests
Reviewed By: jaykorean
Differential Revision: D79824429
Pulled By: hx235
fbshipit-source-id: 6b73136f32ecc6842a04a77502b7dbb0bbf507f7
Summary:
We temporarily disabled WAL when Remote Compaction is enabled in Stress Test (https://github.com/facebook/rocksdb/pull/13843). There are a few other features that are incompatible with a disabled WAL. Due to the sanitization order, WAL was disabled at the end of sanitization and these incompatible features weren't set properly. Stress Test failed with an error like the following.
e.g. `reopen` stress test is not compatible with `disable_wal` - `Error: Db cannot reopen safely with disable_wal set!`
This PR changes the order of sanitization so that `disable_wal` is set earlier when `remote_compaction_worker_threads > 0`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13845
Test Plan:
```
python3 -u tools/db_crashtest.py blackbox --remote_compaction_worker_threads=8 --interval=5 --duration=6000 --continuous_verification_interval=10 --disable_wal=1 --use_txn=1 --txn_write_policy=2 --enable_pipelined_write=0 --checkpoint_one_in=0 --use_timed_put_one_in=0
```
Reviewed By: cbi42
Differential Revision: D79758670
Pulled By: jaykorean
fbshipit-source-id: aa6f4a74cc86c23f442928c301187b06e8137f53
Summary:
https://github.com/facebook/rocksdb/issues/13676 unfortunately treated some IOErrors as corruption, which is not appropriate when remote storage is involved. To help enforce this, our crash test injects errors that are expected to be propagated back to the user rather than causing some other failure.
Saw crash test failures like this:
```
TestMultiGetEntity (AttributeGroup) error: Corruption: Failed to get file size: Not implemented: GetFileSize Not Supported for file ...
```
So this fixes the handling by not injecting a false Corruption failure and by allowing smooth fallback from FSRandomAccessFile::GetFileSize to FileSystem::GetFileSize.
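The intended fallback shape can be sketched as follows (all types, names, and the returned size here are illustrative stand-ins, not the RocksDB API): a NotSupported result from the handle-level call should fall through to the file-system-level call instead of surfacing as Corruption.

```cpp
#include <cstdint>
#include <string>

// Toy Status with just the codes needed for the sketch.
struct Status {
  enum Code { kOk, kNotSupported, kCorruption } code = kOk;
  bool ok() const { return code == kOk; }
  bool IsNotSupported() const { return code == kNotSupported; }
};

// Pretend this backend does not implement per-handle GetFileSize.
struct File {
  Status GetFileSize(uint64_t* /*size*/) { return {Status::kNotSupported}; }
};

// The file-system-level call that does support it.
struct FileSystem {
  Status GetFileSize(const std::string& /*fname*/, uint64_t* size) {
    *size = 4096;  // illustrative value
    return {};
  }
};

Status SizeWithFallback(File& f, FileSystem& fs, const std::string& fname,
                        uint64_t* size) {
  Status s = f.GetFileSize(size);
  if (s.IsNotSupported()) {
    // Smooth fallback rather than treating NotSupported as Corruption.
    s = fs.GetFileSize(fname, size);
  }
  return s;
}
```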
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13842
Test Plan: unit test added
Reviewed By: xingbowang
Differential Revision: D79728861
Pulled By: pdillinger
fbshipit-source-id: 33f7dfc85d86d88cb4ab24a8defd26618c95c954
Summary:
To reduce the noise, disable the incompatible ones for now when `remote_compaction_worker_threads > 0`. We will investigate each, fix as needed and re-enable them as follow up.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13843
Test Plan:
```
python3 -u tools/db_crashtest.py blackbox --remote_compaction_worker_threads=8 --interval=5 --duration=6000 --continuous_verification_interval=10 --disable_wal=1 --use_txn=1 --enable_pipelined_write=0 --checkpoint_one_in=0 --use_timed_put_one_in=0
```
Reviewed By: cbi42
Differential Revision: D79735166
Pulled By: jaykorean
fbshipit-source-id: ae3be38a21073fd3282d6e8cd7d71f0363df3590
Summary:
This test verifies that compaction respects the min_file_size parameter when triggered by deletions, preventing the compaction of files with deletions smaller than the threshold. The test logic includes two scenarios:
1. Verify that a large L0 file with deletions exceeding the minimum file size threshold triggers deletion-triggered compaction (DTC) and compacts to L1.
2. Verify that a small L0 file with deletions, but below the minimum file size threshold, does not trigger DTC and remains at L0.
Added the DeletionTriggeredCompactionWithMinFileSizeTestListener, which verifies that files selected for compaction based on deletion triggers meet the minimum file size threshold. The listener validates in OnCompactionBegin that all input files have sizes greater than or equal to the configured min_file_size parameter.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13825
Test Plan:
Tested this feature on our devserver using the following commands:
```
DEBUG_LEVEL=2 make -j64 db_compaction_test && KEEP_DB=1 ./db_compaction_test --gtest_filter="*DBCompactionTest.CompactionWith*"
```
Test output confirms the expected behavior:
```
2025/07/31-11:24:49.473181 1431671 [/compaction/compaction_job.cc:2291] [default] [JOB 6] Compacting 2@0 files to L1, score 0.04
2025/07/31-11:24:49.473240 1431671 [/compaction/compaction_job.cc:2297] [default]: Compaction start summary: Base version 6 Base level 0, inputs: [15(52KB) 9(103KB)]
2025/07/31-11:24:49.473304 1431671 EVENT_LOG_v1 {"time_micros": 1753986289473273, "job": 6, "event": "compaction_started", "cf_name": "default", "compaction_reason": "FilesMarkedForCompaction", "files_L0": [15, 9], "score": 0.04, "input_data_size": 159848, "oldest_snapshot_seqno": -1}
```
**Tasks:**
T228156639
Reviewed By: cbi42
Differential Revision: D79395851
Pulled By: nmk70
fbshipit-source-id: 4c2a80a95521b40543981dd81b347f3984cd2a8b
Summary:
Remote Compaction in the stress test previously failed with the following error, so we temporarily disabled it in PR https://github.com/facebook/rocksdb/issues/13815:
```
reference std::vector<rocksdb::ThreadState *>::operator[](size_type) [_Tp = rocksdb::ThreadState *, _Alloc = std::allocator<rocksdb::ThreadState *>]: Assertion '__n < this->size()' failed.
```
The error came from accessing `remote_compaction_worker_threads[i]` when `i >= remote_compaction_worker_threads.size()`, which is undefined behavior. This PR fixes the issue by properly setting the worker thread pointers in `remote_compaction_worker_threads`.
Note: We are still encountering errors when both BlobDB and Remote Compaction are enabled. It appears to be a race condition. For now, BlobDB is temporarily disabled if remote compaction is enabled. We will fix the race condition and re-enable BlobDB as a follow-up.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13835
Test Plan:
```
python3 -u tools/db_crashtest.py blackbox --remote_compaction_worker_threads=16 --interval=2 --duration=180
```
Reviewed By: hx235
Differential Revision: D79684447
Pulled By: jaykorean
fbshipit-source-id: 65f5809f651865c3df76c2cf3b9e7b8d654bb90a
Summary:
This option has the same functionality as `DBOptions::allow_ingest_behind` but allows the feature at a per-CF level. `DBOptions::allow_ingest_behind` is deprecated after this PR and users should use `cf_allow_ingest_behind` instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13810
Test Plan: updated some existing tests to use the new option.
Reviewed By: xingbowang
Differential Revision: D79191969
Pulled By: cbi42
fbshipit-source-id: 0da45f6be472ace6754ad15df93d45ac86313837
Summary:
**Context/Summary:**
The `RoundRobinSubcompactionsAgainstResources` test, specifically the `SubcompactionsUsingResources` case, is now disabled. This decision was made because the test's reliability depends on the absence of any concurrent compactions other than the round-robin compaction. Addressing this issue while maintaining the test's focus on resource reservation requires a deeper investigation, which is currently beyond my available bandwidth. Given the increased frequency of test failures, it has been temporarily disabled to prevent further disruptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13839
Test Plan: - Should be no test failure from RoundRobinSubcompactionsAgainstResources.SubcompactionsUsingResources anymore.
Reviewed By: cbi42
Differential Revision: D79686366
Pulled By: hx235
fbshipit-source-id: 3a226cfd2b67cabc6c585ea567e2b0c25aa5f345
Summary:
Quick follow-up from https://github.com/facebook/rocksdb/pull/13816: `CompactFiles()` and `CompactRange()` in CompactionPickers do not run compaction as their names might suggest. What they actually do is create the Compaction object that will be passed to `CompactionJob` to run the compaction.
Renaming these two functions to better represent their purposes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13831
Test Plan: No functional change. Existing CI should be sufficient.
Reviewed By: hx235
Differential Revision: D79660196
Pulled By: jaykorean
fbshipit-source-id: ca831dbef5120e7115b52fd07b0059ca16c8f1e8
Summary:
... by ensuring that files in a dropped column family are not returned to the caller upon successful, offline MANIFEST iteration.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13832
Test Plan: `DBTest2, GetFileChecksumsFromCurrentManifest_CRC32`
Reviewed By: pdillinger
Differential Revision: D79607298
Pulled By: mszeszko-meta
fbshipit-source-id: e7948e086ba6e6fb953a3959fdcc81300613d73e
Summary:
Introduce `CompactionJob::VerifyOutputRecordCount()` and make it align with `VerifyInputRecordCount()`.
Functionality-wise, it should be the same, except that when `db_options_.compaction_verify_record_count` is false, RocksDB will only print a WARN message upon verification failure rather than return `Status::Corruption()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13830
Test Plan:
Existing tests cover both
```
./compaction_service_test --gtest_filter="*CompactionServiceTest.VerifyInputRecordCount*"
```
```
./compaction_service_test --gtest_filter="*CompactionServiceTest.CorruptedOutput*"
```
Reviewed By: hx235
Differential Revision: D79584795
Pulled By: jaykorean
fbshipit-source-id: 5851328999005601b28504085b688b80880bca7c
Summary:
In anticipation of an enhancement related to parallel compression
* Rename confusing state variables `seperator_is_key_plus_seq_` -> `must_use_separator_with_seq_`
* Eliminate copy-paste code in `PartitionedIndexBuilder::AddIndexEntry`
* Optimize/simplify `PartitionedIndexBuilder::flush_policy_` by allowing a single policy to be re-targeted to different block builders. Added some additional internal APIs to make this work, and it only works because the FlushBlockBySizePolicy is otherwise stateless (after creation).
* Improve some comments, including another proposed optimization especially for the common case of no live snapshots affecting a large compaction
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13828
Test Plan:
existing tests are pretty exhaustive, especially with crash test
Planning to validate performance in combination with next change. (This change is saving some extra allocate/deallocate with partitioned index.)
Reviewed By: cbi42
Differential Revision: D79570576
Pulled By: pdillinger
fbshipit-source-id: f7a16f0e6e6ad2023a3d1a2ebaa3cc22aac717af
Summary:
This diff introduces the IntervalSet data structure, which will be used to help create sets of non-overlapping intervals for MultiScan scan options. Specifically, we add specializations for Slices to assist in this.
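A minimal sketch of the merge-on-insert idea behind such a structure (integer endpoints for brevity; the actual IntervalSet is templated with Slice specializations and its API will differ): inserting a half-open interval absorbs any overlapping or adjacent intervals, so the set always stays non-overlapping.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <map>

// Toy interval set over [start, end) ranges, keyed start -> end.
class IntervalSet {
 public:
  void Add(int start, int end) {
    auto it = intervals_.lower_bound(start);
    // Step back if the previous interval reaches into [start, end).
    if (it != intervals_.begin() && std::prev(it)->second >= start) --it;
    // Absorb every interval that overlaps or touches the new one.
    while (it != intervals_.end() && it->first <= end) {
      start = std::min(start, it->first);
      end = std::max(end, it->second);
      it = intervals_.erase(it);
    }
    intervals_[start] = end;
  }
  std::size_t Size() const { return intervals_.size(); }

 private:
  std::map<int, int> intervals_;
};
```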
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13787
Test Plan: Added test to catch various cases within adding intervals.
Reviewed By: anand1976
Differential Revision: D78624970
Pulled By: krhancoc
fbshipit-source-id: 9a3e4a28738ab8428788467540fc05ab5c1a1b67
Summary:
One of the parameters for constructing a Compaction object is `earliest_snapshot`, which is required for Standalone Range Deletion Optimization (introduced in [https://github.com/facebook/rocksdb/pull/13078](https://github.com/facebook/rocksdb/pull/13078)). Remote Compaction has been using the `CompactionPicker::CompactFiles()` API to create the Compaction object, but this API never sets the `earliest_snapshot` parameter. To address this, update `CompactionPicker::CompactFiles()` to optionally accept `earliest_snapshot` and pass it during the call in `DBImplSecondary::CompactWithoutInstallation()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13816
Test Plan:
```
./compaction_service_test --gtest_filter="*CompactionServiceTest.StandaloneDeleteRangeTombstoneOptimization*"
```
\+ Tested in Meta's internal offload infra.
Reviewed By: hx235
Differential Revision: D79284769
Pulled By: jaykorean
fbshipit-source-id: 164834ef6972d5e0ddfc2970bb9234ef166d6e52
Summary:
Fix a bug in MultiScan where BlockBasedTableIterator should not return out-of-bound when all blocks of the last scan are exhausted. This prevented LevelIterator from entering the next file, so the iterator returned fewer keys than expected.
Also fixed stress testing to specify iterate_upper_bound correctly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13822
Test Plan:
- the following fails quickly before this PR and finishes after this PR
```python3 tools/db_crashtest.py whitebox --iterpercent=60 --prefix_size=-1 --prefixpercent=0 --readpercent=0 --test_batches_snapshots=0 --use_multiscan=1 --seed=1 --fill_cache=1 --read_fault_one_in=0 --column_families=1 --allow_unprepared_value=0 --kill_random_test=88888```
- new unit test that fails before this PR
Reviewed By: krhancoc
Differential Revision: D79308957
Pulled By: cbi42
fbshipit-source-id: c9eafd1c8750b959b0185d7c63199b503493cbd2
Summary:
The main motivation for this change is to more flexibly and efficiently support compressing data without extra copies when we do not want to support saving compressed data that is LARGER than the uncompressed. We believe pretty strongly that for the various workloads served by RocksDB, it is well worth a single byte compression marker so that we have the flexibility to save compressed or uncompressed data when compression is attempted. Why? Compression algorithms can add tens of bytes in fixed overheads and percents of bytes in relative overheads. It is also an advantage for the reader when they can bypass decompression, including at least a buffer copy in most cases, after reading just one byte.
The block-based table format in RocksDB follows this model with a single-byte compression marker, and at least after https://github.com/facebook/rocksdb/pull/13797 so does CompressedSecondaryCache. (Notably, the blob file format DOES NOT. This is left to follow-up work.)
In particular, Compressor::CompressBlock now takes in a fixed size buffer for output rather than a `std::string*`. CompressBlock itself rejects the compression if the output would not fit in the provided buffer. This also works well with `max_compressed_bytes_per_kb` option to reject compression even sooner if its ratio is insufficient (implemented in this change). In the future we might use this functionality to reduce a buffer copy (in many cases) into the WritableFileWriter buffer of the block based table builder.
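To illustrate the marker-byte scheme described above, a toy sketch (the marker values, the fake run-length "compressor", and the helper names are all hypothetical, not RocksDB's format): compression into a fixed-size budget is rejected when the output would not fit, and the stored block always begins with one byte telling the reader whether to decompress.

```cpp
#include <cstddef>
#include <optional>
#include <string>

enum Marker : char { kRaw = 0, kCompressed = 1 };

// Fake "compressor" that collapses runs of repeated characters; returns
// nullopt if the result would exceed max_out bytes (cf. CompressBlock
// rejecting output that does not fit the provided buffer).
std::optional<std::string> TryCompress(const std::string& in,
                                       std::size_t max_out) {
  std::string out;
  for (char c : in) {
    if (out.empty() || out.back() != c) out.push_back(c);
    if (out.size() > max_out) return std::nullopt;
  }
  return out;
}

std::string StoreBlock(const std::string& data) {
  if (!data.empty()) {
    // Only accept compression that actually saves space.
    if (auto c = TryCompress(data, data.size() - 1)) {
      return std::string(1, kCompressed) + *c;
    }
  }
  return std::string(1, kRaw) + data;  // fall back to uncompressed
}
```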
This is a large change because we needed to (or were compelled to)
* Update all the existing callers of CompressBlock, sometimes with substantial changes. This includes introducing GrowableBuffer to reuse between calls rather than std::string, which (at least in C++17) requires zeroing out data when allocating/growing a buffer.
* Re-implement built-in Compressors (V2; V1 is obsolete) to efficiently implement the new version of the API, no longer wrapping the `OLD_CompressData()` function. The new compressors appropriately leverage the CompressBlock virtual call required for the customization interface and no longer rely on a `switch` on compression type for each block. The implementations are largely adaptations of the old implementations, except
* LZ4 and LZ4HC are notably upgraded to take advantage of WorkingArea (see performance tests). And for simplicity in the new implementation, we are dropping support for some super old versions of the library.
* Getting snappy to work with limited-size output buffer required using the Sink/Source interfaces, which appear to be well supported for a long time and efficient (see performance tests).
* Replace awkward old CompressionManager::GetDecompressorForCompressor with Compressor::GetOptimizedDecompressor (which is optional to implement)
* Small behavior change where we treat lack of support for compression closer to not configuring compression, such as incompatibility with block_align. This is motivated by giving CompressionManager the freedom of determining when compression can be excluded for an entire file despite the configured "compression" type, and thus only surfacing actual incompatibilities not hypothetical ones that might be irrelevant to the CompressionManager (or build configuration). Unit tests in `table_test` and `compact_files_test` required update.
* Some lingering clean up of CompressedSecondaryCache and a re-optimization made possible by compressing into an existing buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13805
Test Plan:
for correctness, existing tests
## Performance Test
As I generally only modified compression paths, I'm using a db_bench write benchmark, with before & after configurations running at the same time. vc=1 means verify_compression=1
```
USE_CLANG=1 DEBUG_LEVEL=0 LIB_MODE=static make -j100 db_bench
SUFFIX=`tty | sed 's|/|_|g'`; for CT in zlib bzip2 none snappy zstd lz4 lz4hc none snappy zstd lz4 bzip2; do for VC in 0 1; do echo "$CT vc=$VC"; (for I in `seq 1 20`; do BIN=/dev/shm/dbbench${SUFFIX}.bin; rm -f $BIN; cp db_bench $BIN; $BIN -db=/dev/shm/dbbench$SUFFIX --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 -compression_type=$CT -verify_compression=$VC 2>&1 | grep micros/op; done) | awk '{n++; sum += $5;} END { print int(sum / n); }'; done; done
```
zlib vc=0 524198 -> 524904 (+0.1%)
zlib vc=1 430521 -> 430699 (+0.0%)
bzip2 vc=0 61841 -> 60835 (-1.6%)
bzip2 vc=1 49232 -> 48734 (-1.0%)
none vc=0 1802375 -> 1906227 (+5.8%)
none vc=1 1837181 -> 1950308 (+6.2%)
snappy vc=0 1783266 -> 1901461 (+6.6%)
snappy vc=1 1799703 -> 1879660 (+4.4%)
zstd vc=0 1216779 -> 1230507 (+1.1%)
zstd vc=1 996370 -> 1015415 (+1.9%)
lz4 vc=0 1801473 -> 1943095 (+7.9%)
lz4 vc=1 1799155 -> 1935242 (+7.6%)
lz4hc vc=0 349719 -> 1126909 (+222.2%)
lz4hc vc=1 348099 -> 1108933 (+218.6%)
(Repeating the most important ones)
none vc=0 1816878 -> 1952221 (+7.4%)
none vc=1 1813736 -> 1904622 (+5.0%)
snappy vc=0 1794816 -> 1875062 (+4.5%)
snappy vc=1 1789363 -> 1873771 (+4.7%)
zstd vc=0 1202592 -> 1225164 (+1.9%)
zstd vc=1 994322 -> 1016688 (+2.2%)
lz4 vc=0 1786959 -> 1971518 (+10.3%)
lz4 vc=1 1829483 -> 1935871 (+5.8%)
I confirmed manually that the new WorkingArea for LZ4HC makes the huge difference on that one, but not as much difference for LZ4, presumably because LZ4HC uses much larger buffers/structures/whatever for better compression ratios.
Reviewed By: hx235
Differential Revision: D79111736
Pulled By: pdillinger
fbshipit-source-id: 1ce1b14af9f15365f1b6da49906b5073a8cecc14
Summary:
Unit Test for a repro for the fix that was reported by https://github.com/facebook/rocksdb/pull/13743
There's potential dataloss when Remote Compaction entries are all removed due to various reasons (CompactionFilter, DeleteRange covering all keys of the SST file, etc)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13812
Test Plan:
```
./compaction_service_test --gtest_filter="*CompactionServiceTest.EmptyResult*"
```
Failed before merging https://github.com/facebook/rocksdb/pull/13743, now passing
Reviewed By: cbi42
Differential Revision: D79192829
Pulled By: jaykorean
fbshipit-source-id: e200300c4a7993de21c63cd92bda65b692921b89
Summary:
We were seeing some internal builds apparently failing the `-d /mnt/gvfs/third-party` check. Although third-party2 is likely a better check (see dependencies_platform010.sh), that would create a big headache with check_format_compatible.sh which has to work across codebase versions.
* Report a WARNING when we detect on a Meta machine but the `-d /mnt/gvfs/third-party` check fails
* Let USE_CLANG influence default compiler choice so that things might still work in that case (e.g. `USE_CLANG=1 make -j24 check`)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13820
Test Plan: manual, CI
Reviewed By: jaykorean
Differential Revision: D79277197
Pulled By: pdillinger
fbshipit-source-id: 19b2d45ed794f64bbf838f4414568d77ae9ca6f1
Summary:
Preserve tombstones when `allow_ingest_behind` is enabled so that they can be applied to ingested files. This can be useful when users use ingest-behind to buffer updates where deletions need to be preserved. This fixes https://github.com/facebook/rocksdb/issues/13571.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13807
Test Plan: updated a unit test to verify that tombstones are not dropped during compaction.
Reviewed By: hx235
Differential Revision: D79016109
Pulled By: cbi42
fbshipit-source-id: c4d31ef32c88468ababcc1ea5af5db6de42a3b0d
Summary:
As title. We will re-enable it once fixed
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13815
Test Plan: N/A - Disabling the test.
Reviewed By: archang19
Differential Revision: D79172697
Pulled By: jaykorean
fbshipit-source-id: 936de3743816049cda811bde48b3b2207ed256ee
Summary:
**Issue**:
When running remote compaction, if all entries in the input files are expired, RocksDB incorrectly deletes an active file from the primary DB, leading to data loss and corruption.
**Root Cause**:
The current logic mistakenly mixed up the input and output file paths during the cleanup phase when no keys survive the compaction (all expired). This results in deleting the input files (which belong to the primary DB) instead of the output files (which belong to the SecondaryDB).
**Fix**:
Use `GetTableFileName` (virtual function) instead of `TableFileName`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13743
Reviewed By: hx235
Differential Revision: D79108650
Pulled By: jaykorean
fbshipit-source-id: 1c9ba971a0e9a62c15ebc014436cb8fc961af95c
Summary:
Building db_bench with clang and DEBUG_LEVEL=0 was failing with unused variable. This was not caught by CI so I have added this to the build-linux-clang-13-no_test_run job.
Also, while I was touching CI:
* Fold build-linux-release-rtti into build-linux-release by reducing the number of combinations tested between static/dynamic lib and rtti/not. I don't expect these to interact meaningfully with an extremely mature compiler.
* Combine build-linux-clang10-asan and build-linux-clang10-ubsan because clang is extremely reliable running both together
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13813
Test Plan: manual builds, CI
Reviewed By: krhancoc
Differential Revision: D79112643
Pulled By: pdillinger
fbshipit-source-id: 4ffc672718c05fa4597d637aacbc5a179ad8a0cf
Summary:
• Guard on `__cpp_lib_atomic_shared_ptr` to use `std::atomic<std::shared_ptr<T>>::load()`/`store()`
• Fall back to `std::atomic_load_explicit()`/`std::atomic_store_explicit()` under C++17
When attempting to build with C++20 using clang in a Linux environment, the build fails due to the deprecation of `atomic_load_explicit`.
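A minimal sketch of the guard described above (the wrapper class and method names are hypothetical; only the feature-test-macro pattern matches the commit): C++20 libraries provide `std::atomic<std::shared_ptr<T>>`, while the free-function `shared_ptr` atomics are deprecated in C++20 but remain the only option under C++17.

```cpp
#include <atomic>
#include <memory>

template <typename T>
class AtomicSharedPtr {
 public:
#if defined(__cpp_lib_atomic_shared_ptr)
  // C++20 path: member load()/store() on std::atomic<std::shared_ptr<T>>.
  std::shared_ptr<T> Load() const {
    return ptr_.load(std::memory_order_acquire);
  }
  void Store(std::shared_ptr<T> p) {
    ptr_.store(std::move(p), std::memory_order_release);
  }

 private:
  std::atomic<std::shared_ptr<T>> ptr_;
#else
  // C++17 path: free functions deprecated in C++20.
  std::shared_ptr<T> Load() const {
    return std::atomic_load_explicit(&ptr_, std::memory_order_acquire);
  }
  void Store(std::shared_ptr<T> p) {
    std::atomic_store_explicit(&ptr_, std::move(p),
                               std::memory_order_release);
  }

 private:
  std::shared_ptr<T> ptr_;
#endif
};
```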
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13744
Reviewed By: xingbowang
Differential Revision: D78997919
Pulled By: cbi42
fbshipit-source-id: f829c282cba878f072d4b0ad44192a87f73b8a90
Summary:
Simulate Remote Compaction in Stress Test by running a separate set of threads that runs remote compaction.
Queue and ResultMap for the remote compactions are stored in memory as part of the `SharedState`. They are shared across main worker threads and remote compaction worker threads.
`enable_remote_compaction` is replaced by `remote_compaction_worker_threads`.
If `remote_compaction_worker_threads` is set to 0, remote compaction is not enabled in Stress Test.
**To Follow up**
This PR covers happy path only. Failure injection in the remote worker thread will be added as a follow up.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13800
Test Plan:
```
./db_stress --remote_compaction_worker_threads=4 --flush_one_in=1000 --writepercent=40 --readpercent=40 --iterpercent=10 --prefixpercent=0 --delpercent=10 --destroy_db_initially=0 --clear_column_family_one_in=0 --reopen=0
```
```
python3 -u tools/db_crashtest.py blackbox --remote_compaction_worker_threads=8
```
Reviewed By: hx235
Differential Revision: D78862084
Pulled By: jaykorean
fbshipit-source-id: b262058c92d7fecc5e014cef5df9cca4a209921b
Summary:
To be compatible with some upcoming compression change/refactoring where we supply a fixed size buffer to CompressBlock, we need to support CompressedSecondaryCache storing uncompressed values when the compression ratio is not suitable. It seems crazy that CompressedSecondaryCache currently stores compressed values that are *larger* than the uncompressed value, and even explicitly exercises that case (almost exclusively) in the existing unit tests. But it's true.
This change fixes that with some other nearby refactoring/improvement:
* Update the in-memory representation of these cache entries to support uncompressed entries even when compression is enabled. AFAIK this also allows us to safely get rid of "don't support custom split/merge for the tiered case".
* Use more efficient in-memory representation for non-split entries
* For CompressionType and CacheTier, which are defined as single-byte data types, use a single byte instead of varint32. (I don't know if varint32 was an attempt at future-proofing for a memory-only schema or what.) Now using lossless_cast will raise a compiler error if either of these types is made too large for a single byte.
* Don't wrap entries in a CacheAllocationPtr object; it's not necessary. We can rely on the same allocator being provided at delete time.
* Restructure serialization/deserialization logic, hopefully simpler or easier to read/understand.
* Use a RelaxedAtomic for disable_cache_ to avoid race.
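The single-byte point above can be sketched as follows (the enum and helper here are illustrative stand-ins, and this is only in the spirit of `lossless_cast`, not its actual implementation): for enums with a one-byte underlying type, varint encoding buys nothing, and a compile-time check breaks the build if the type later outgrows a byte.

```cpp
#include <cstdint>
#include <type_traits>

// Illustrative one-byte enum (values are stand-ins).
enum class CacheTier : uint8_t { kVolatileTier = 0, kNonVolatileBlockTier = 1 };

// Serialize an enum as exactly one byte; the static_assert fails at
// compile time if the enum's underlying type grows past one byte,
// forcing a schema update instead of silent truncation.
template <typename Enum>
char ToSingleByte(Enum e) {
  static_assert(sizeof(std::underlying_type_t<Enum>) == 1,
                "enum no longer fits in one byte; update the schema");
  return static_cast<char>(e);
}
```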
Suggested follow-up on CompressedSecondaryCache:
* Refine the exact strategy for rejecting compressions
* Still have a lot of buffer copies; try to reduce
* Revisit the split-merge logic and try to make it more efficient overall, more unified with non-split case
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13797
Test Plan:
Unit tests updated to use actually compressible strings in many places and more testing around non-compressible string.
## Performance Test
There was some pre-existing issue causing decompression failures in the compressed secondary cache with cache_bench that is somehow fixed by this change. These decompression failures were present before the new compression API, but since then they cause assertion failures rather than being quietly ignored. For the "before" test here, they are back to being quietly ignored. And the cache_bench changes here were back-ported to the "before" configuration.
### No compressed secondary (setting expectations)
```
./cache_bench --cache_type=auto_hyper_clock_cache -cache_size=8000000000 -populate_cache
```
Max key : 3906250
Before:
Complete in 12.784 s; Rough parallel ops/sec = 2503123
Thread ops/sec = 160329; Lookup hit ratio: 0.686771
After:
Complete in 12.745 s; Rough parallel ops/sec = 2510717 (in the noise)
Thread ops/sec = 159498; Lookup hit ratio: 0.68686
### Compressed secondary, no split/merge
Same max key and approximate total memory size
```
/usr/bin/time ./cache_bench --cache_type=auto_hyper_clock_cache -cache_size=4000000000 -populate_cache -resident_ratio=0.125 -compressible_to_ratio=0.4 --secondary_cache_uri=compressed_secondary_cache://capacity=4000000000
```
Before:
Complete in 18.690 s; Rough parallel ops/sec = 1712144
Thread ops/sec = 108683; Lookup hit ratio: 0.776683
Latency: P50: 4205.19 P75: 15281.76 P99: 43810.98 P99.9: 71487.41 P99.99: 165453.32
max RSS (according to /usr/bin/time): 9341856
After:
Complete in 17.878 s; Rough parallel ops/sec = 1789951 (+4.5%)
Thread ops/sec = 114957; Lookup hit ratio: 0.792998 (+0.016)
Latency: P50: 4012.70 P75: 14477.63 P99: 40039.70 P99.9: 62521.04 P99.99: 167049.18
max RSS (according to /usr/bin/time): 9235688
The improved hit ratio is probably from fixing the failed decompressions (somehow). And my modifications could have improved CPU efficiency, or it could be the small penalty the benchmark naturally imposes on most misses (generate another value and insert it).
### Compressed secondary, with split/merge
```
/usr/bin/time ./cache_bench --cache_type=auto_hyper_clock_cache -cache_size=4000000000 -populate_cache -resident_ratio=0.125 -compressible_to_ratio=0.4 --secondary_cache_uri='compressed_secondary_cache://capacity=4000000000;enable_custom_split_merge=true'
```
Before:
Complete in 20.062 s; Rough parallel ops/sec = 1595075
Thread ops/sec = 101759; Lookup hit ratio: 0.787129
Latency: P50: 5338.53 P75: 16073.46 P99: 46752.65 P99.9: 73459.11 P99.99: 201318.75
max RSS (according to /usr/bin/time): 9049852
After:
Complete in 18.564 s; Rough parallel ops/sec = 1723771 (+8.1%)
Thread ops/sec = 110724; Lookup hit ratio: 0.813414 (+0.026)
Latency: P50: 5234.75 P75: 14590.43 P99: 41401.03 P99.9: 65606.50 P99.99: 157248.04
max RSS (according to /usr/bin/time): 8917592
Looks like an improvement
Reviewed By: anand1976
Differential Revision: D78842120
Pulled By: pdillinger
fbshipit-source-id: 5f754b160c37ebee789279178ebb5e862071bdb2
Summary:
Create a new API, FileSystem::SyncFile, for file sync, so that we can sync file content to the file system directly where needed, without any other modification. This is mostly used in combination with file linking: on some file systems, linking a file does not guarantee that its content is synced to the file system.
https://github.com/facebook/rocksdb/issues/13741
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13762
Test Plan:
Unit test
T229418750
Reviewed By: pdillinger
Differential Revision: D78121137
Pulled By: xingbowang
fbshipit-source-id: 0ea8a5a3b486e0b61636700400613fed6bbd3faa
Summary:
The yield is of little use because waitFor should already be yielding.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13796
Reviewed By: pdillinger
Differential Revision: D78823656
Pulled By: jainpr
fbshipit-source-id: 040eaf596938ce8db535bc810ad77a9e50b2d551
Summary:
This diff fixes an omission in which the property_bag was not pushed down to the BlockBasedIterator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13795
Reviewed By: anand1976
Differential Revision: D78762294
Pulled By: krhancoc
fbshipit-source-id: 8970b0a87e35d07d5a0dd16f360ec96859f66550
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13794
LLVM has a warning `-Wdeprecated-redundant-constexpr-static-def` which raises the warning:
> warning: out-of-line definition of constexpr static data member is redundant in C++17 and is deprecated
Since we are now on C++20, we can remove the out-of-line definition of constexpr static data members. This diff does so.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Reviewed By: meyering
Differential Revision: D78635037
fbshipit-source-id: a90c68469947705c65f36588b2d575237689dbe8
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13793
LLVM has a warning `-Wdeprecated-redundant-constexpr-static-def` which raises the warning:
> warning: out-of-line definition of constexpr static data member is redundant in C++17 and is deprecated
Since we are now on C++20, we can remove the out-of-line definition of constexpr static data members. This diff does so.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Reviewed By: meyering
Differential Revision: D78635005
fbshipit-source-id: bd7cbfff0580b9579e78237ec4371615d3609536
Summary:
This patch reverts "NewRandomRWFile" back to "ReopenWritableFile" in the external sst file ingestion job when a file is linked instead of copied. The reason is that some file systems do not support "NewRandomRWFile". A long-term fix is in progress.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13791
Test Plan: Unit test
Reviewed By: pdillinger
Differential Revision: D78697825
Pulled By: xingbowang
fbshipit-source-id: d3651223ab1f2369aac34b772bba8049c6c2c628
Summary:
This diff introduces ScanOptions pruning. Previously, the intent was to prefetch for each sub-iterator of the level iterator; however, since BlockBasedIterator does not prefetch asynchronously, that optimization does not make sense just yet.
For now we will prune the ScanOptions to the overlapping ranges and make sure they are properly piped to the underlying layers (during Prepare and Seek).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13780
Reviewed By: cbi42
Differential Revision: D78436869
Pulled By: krhancoc
fbshipit-source-id: 681fe7f7f88b04b5c2d60cb3a5de01e03f6f8431
Summary:
So that we can use --command=recompress with a custom CompressionManager. (It's not required for reading files using a custom CompressionManager because those can already use ObjectLibrary for dependency injection.)
Suggested follow-up:
* These tests should not be using C arrays, snprintf, manual delete, etc. except for thin compatibility with argc/argv.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13783
Test Plan: unit test added, some manual testing
Reviewed By: archang19
Differential Revision: D78574434
Pulled By: pdillinger
fbshipit-source-id: 609e6c6439090e6b7e9b63fbd4c2d3f04b104fcf