Updates to corresponding submodule position in main fork at v10.10.1 #5

Open
gamesguru wants to merge 221 commits from gamesguru/rocksdb:main-c10y into 10.5.fb

Needs to be tested, both the C++ tests and some integration tests with continuwuity servers.

### See release notes

- https://github.com/zaidoon1/rust-rocksdb/releases/tag/v0.46.0
- https://github.com/facebook/rocksdb/releases/tag/v10.10.1
Summary:
So that we can use --command=recompress with a custom CompressionManager. (It's not required for reading files using a custom CompressionManager because those can already use ObjectLibrary for dependency injection.)

Suggested follow-up:
* These tests should not be using C arrays, snprintf, manual delete, etc. except for thin compatibility with argc/argv.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13783

Test Plan: unit test added, some manual testing

Reviewed By: archang19

Differential Revision: D78574434

Pulled By: pdillinger

fbshipit-source-id: 609e6c6439090e6b7e9b63fbd4c2d3f04b104fcf
Summary:
This includes:
1. Release notes from 10.5 branch
2. Version.h update
3. Format compatibility check
4. Folly commit hash update (I chose https://github.com/facebook/folly/releases/tag/v2025.06.30.00 because later commits were causing CI failures)

Previous release: https://github.com/facebook/rocksdb/pull/13719

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13784

Reviewed By: pdillinger

Differential Revision: D78587604

Pulled By: archang19

fbshipit-source-id: a8611ef4527c3c6ee5c830349b7ae41701c1efb6
Summary:
This diff introduces ScanOptions pruning. Previously the intent was to prefetch for each sub-iterator of the level iterator; however, since BlockBasedIterator does not prefetch asynchronously, that optimization does not make sense just yet.

For now we will prune the ScanOptions to the overlapping ranges and make sure they are properly piped to the underlying layers (during Prepare, and Seek).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13780

Reviewed By: cbi42

Differential Revision: D78436869

Pulled By: krhancoc

fbshipit-source-id: 681fe7f7f88b04b5c2d60cb3a5de01e03f6f8431
Summary:
As title

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13790

Test Plan:
```
./ttl_test --gtest_filter="*TtlTest.ChangeTtlOnOpenDb*"
```

Reviewed By: cbi42

Differential Revision: D78670347

Pulled By: jaykorean

fbshipit-source-id: 1b2538d6cd0f2a0fbf397a5d2f677852f97272c4
Summary:
This patch reverts `NewRandomRWFile` back to `ReopenWritableFile` in the external SST file ingestion job when a file is linked instead of copied, because some file systems do not support `NewRandomRWFile`. A long-term fix is in progress.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13791

Test Plan: Unit test

Reviewed By: pdillinger

Differential Revision: D78697825

Pulled By: xingbowang

fbshipit-source-id: d3651223ab1f2369aac34b772bba8049c6c2c628
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13793

LLVM has a warning `-Wdeprecated-redundant-constexpr-static-def` which raises the warning:

> warning: out-of-line definition of constexpr static data member is redundant in C++17 and is deprecated

Since we are now on C++20, we can remove the out-of-line definition of constexpr static data members. This diff does so.
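A minimal sketch of the pattern this diff removes (`Config` and `kMaxLevels` are hypothetical names, not RocksDB code): since C++17 a `constexpr` static data member is implicitly `inline`, so the out-of-line definition is redundant.

```cpp
// Illustrative sketch, not RocksDB code.
struct Config {
  static constexpr int kMaxLevels = 7;  // implicitly inline since C++17
};

// Pre-C++17 code also needed this out-of-line definition; under C++17 it is
// redundant, clang's -Wdeprecated-redundant-constexpr-static-def warns, and
// this diff removes such lines:
// constexpr int Config::kMaxLevels;

// Taking the address odr-uses the member; this still links without the
// out-of-line definition because the declaration is implicitly inline.
const int* kMaxLevelsAddr = &Config::kMaxLevels;
```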


Reviewed By: meyering

Differential Revision: D78635005

fbshipit-source-id: bd7cbfff0580b9579e78237ec4371615d3609536
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13794

LLVM has a warning `-Wdeprecated-redundant-constexpr-static-def` which raises the warning:

> warning: out-of-line definition of constexpr static data member is redundant in C++17 and is deprecated

Since we are now on C++20, we can remove the out-of-line definition of constexpr static data members. This diff does so.


Reviewed By: meyering

Differential Revision: D78635037

fbshipit-source-id: a90c68469947705c65f36588b2d575237689dbe8
Summary:
This diff fixes up a miss in which the property_bag was not pushed down to the BlockBasedIterator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13795

Reviewed By: anand1976

Differential Revision: D78762294

Pulled By: krhancoc

fbshipit-source-id: 8970b0a87e35d07d5a0dd16f360ec96859f66550
Summary:
The yield is of little use because waitFor should already be doing that.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13796

Reviewed By: pdillinger

Differential Revision: D78823656

Pulled By: jainpr

fbshipit-source-id: 040eaf596938ce8db535bc810ad77a9e50b2d551
Summary:
Create a new API, `FileSystem::SyncFile`, so that we can sync file content directly to the file system, without any modification, in places where that is needed. This is mostly used in combination with file linking: on some file systems, linking a file does not guarantee that its content is synced.

https://github.com/facebook/rocksdb/issues/13741

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13762

Test Plan:
Unit test
T229418750

Reviewed By: pdillinger

Differential Revision: D78121137

Pulled By: xingbowang

fbshipit-source-id: 0ea8a5a3b486e0b61636700400613fed6bbd3faa
Summary:
To be compatible with some upcoming compression change/refactoring where we supply a fixed size buffer to CompressBlock, we need to support CompressedSecondaryCache storing uncompressed values when the compression ratio is not suitable. It seems crazy that CompressedSecondaryCache currently stores compressed values that are *larger* than the uncompressed value, and even explicitly exercises that case (almost exclusively) in the existing unit tests. But it's true.

This change fixes that with some other nearby refactoring/improvement:
* Update the in-memory representation of these cache entries to support uncompressed entries even when compression is enabled. AFAIK this also allows us to safely get rid of "don't support custom split/merge for the tiered case".
* Use more efficient in-memory representation for non-split entries
  * For CompressionType and CacheTier, which are defined as single-byte data types, use a single byte instead of varint32. (I don't know if varint32 was an attempt at future-proofing for a memory-only schema or what.) Now using lossless_cast will raise a compiler error if either of these types is made too large for a single byte.
  * Don't wrap entries in a CacheAllocationPtr object; it's not necessary. We can rely on the same allocator being provided at delete time.
* Restructure serialization/deserialization logic, hopefully simpler or easier to read/understand.
* Use a RelaxedAtomic for disable_cache_ to avoid race.

Suggested follow-up on CompressedSecondaryCache:
* Refine the exact strategy for rejecting compressions
* Still have a lot of buffer copies; try to reduce
* Revisit the split-merge logic and try to make it more efficient overall, more unified with non-split case
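The core idea above can be sketched as follows (names and the stand-in compressor are hypothetical, not the CompressedSecondaryCache API): prefix each entry with a one-byte marker and store the raw value whenever the compressed form would not be smaller.

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch: a cache entry is prefixed with a single-byte
// compression marker. If the "compressed" form would be as large or larger
// than the uncompressed value, store the value raw with marker 0, so readers
// can bypass decompression after reading just one byte.
enum : uint8_t { kNoCompressionMarker = 0, kFakeCompressionMarker = 1 };

// Stand-in for a real compressor; here "compression" merely collapses runs
// of repeated characters and never helps on non-repetitive input.
std::string FakeCompress(const std::string& in) {
  std::string out;
  for (char c : in) {
    if (out.empty() || out.back() != c) out.push_back(c);
  }
  return out;
}

std::string MakeCacheEntry(const std::string& value) {
  std::string compressed = FakeCompress(value);
  if (compressed.size() >= value.size()) {
    // Reject the compression: store uncompressed with a one-byte marker.
    return std::string(1, char(kNoCompressionMarker)) + value;
  }
  return std::string(1, char(kFakeCompressionMarker)) + compressed;
}
```

With this layout, a non-compressible value like "abcd" is stored raw (marker 0) rather than being stored larger than the original.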

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13797

Test Plan:
Unit tests updated to use actually compressible strings in many places and more testing around non-compressible string.

## Performance Test
There was a pre-existing issue causing decompression failures in the compressed secondary cache with cache_bench that is somehow fixed by this change. These decompression failures were present before the new compression API, but since then cause assertion failures rather than being quietly ignored. For the "before" test here, they are back to being quietly ignored, and the cache_bench changes here were back-ported to the "before" configuration.

### No compressed secondary (setting expectations)
```
./cache_bench --cache_type=auto_hyper_clock_cache -cache_size=8000000000 -populate_cache
```
Max key             : 3906250

Before:
Complete in 12.784 s; Rough parallel ops/sec = 2503123
Thread ops/sec = 160329; Lookup hit ratio: 0.686771

After:
Complete in 12.745 s; Rough parallel ops/sec = 2510717 (in the noise)
Thread ops/sec = 159498; Lookup hit ratio: 0.68686

### Compressed secondary, no split/merge
Same max key and approximate total memory size
```
/usr/bin/time ./cache_bench --cache_type=auto_hyper_clock_cache -cache_size=4000000000 -populate_cache -resident_ratio=0.125 -compressible_to_ratio=0.4 --secondary_cache_uri=compressed_secondary_cache://capacity=4000000000
```
Before:
Complete in 18.690 s; Rough parallel ops/sec = 1712144
Thread ops/sec = 108683; Lookup hit ratio: 0.776683
Latency: P50: 4205.19 P75: 15281.76 P99: 43810.98 P99.9: 71487.41 P99.99: 165453.32
max RSS (according to /usr/bin/time): 9341856

After:
Complete in 17.878 s; Rough parallel ops/sec = 1789951 (+4.5%)
Thread ops/sec = 114957; Lookup hit ratio: 0.792998 (+0.016)
Latency: P50: 4012.70 P75: 14477.63 P99: 40039.70 P99.9: 62521.04 P99.99: 167049.18
max RSS (according to /usr/bin/time): 9235688

The improved hit ratio is probably from fixing the failed decompressions (somehow). And my modifications could have improved CPU efficiency, or it could be the small penalty the benchmark naturally imposes on most misses (generate another value and insert it).

### Compressed secondary, with split/merge
```
/usr/bin/time ./cache_bench --cache_type=auto_hyper_clock_cache -cache_size=4000000000 -populate_cache -resident_ratio=0.125 -compressible_to_ratio=0.4 --secondary_cache_uri='compressed_secondary_cache://capacity=4000000000;enable_custom_split_merge=true'
```
Before:
Complete in 20.062 s; Rough parallel ops/sec = 1595075
Thread ops/sec = 101759; Lookup hit ratio: 0.787129
Latency: P50: 5338.53 P75: 16073.46 P99: 46752.65 P99.9: 73459.11 P99.99: 201318.75
max RSS (according to /usr/bin/time): 9049852

After:
Complete in 18.564 s; Rough parallel ops/sec = 1723771 (+8.1%)
Thread ops/sec = 110724; Lookup hit ratio: 0.813414 (+0.026)
Latency: P50: 5234.75 P75: 14590.43 P99: 41401.03 P99.9: 65606.50 P99.99: 157248.04
max RSS (according to /usr/bin/time): 8917592

Looks like an improvement

Reviewed By: anand1976

Differential Revision: D78842120

Pulled By: pdillinger

fbshipit-source-id: 5f754b160c37ebee789279178ebb5e862071bdb2
Summary:
Simulate Remote Compaction in Stress Test by running a separate set of threads that runs remote compaction.
Queue and ResultMap for the remote compactions are stored in memory as part of the `SharedState`. They are shared across main worker threads and remote compaction worker threads.

`enable_remote_compaction` is replaced by `remote_compaction_worker_threads`.
If `remote_compaction_worker_threads` is set to 0, remote compaction is not enabled in Stress Test.

**To Follow up**

This PR covers happy path only. Failure injection in the remote worker thread will be added as a follow up.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13800

Test Plan:
```
./db_stress --remote_compaction_worker_threads=4  --flush_one_in=1000 --writepercent=40 --readpercent=40 --iterpercent=10 --prefixpercent=0 --delpercent=10 --destroy_db_initially=0 --clear_column_family_one_in=0 --reopen=0
```
```
python3 -u tools/db_crashtest.py blackbox --remote_compaction_worker_threads=8
```

Reviewed By: hx235

Differential Revision: D78862084

Pulled By: jaykorean

fbshipit-source-id: b262058c92d7fecc5e014cef5df9cca4a209921b
Summary:
• Guard on __cpp_lib_atomic_shared_ptr to use std::atomic<std::shared_ptr<T>>::load()/store()
• Fallback to std::atomic_load_explicit()/store_explicit() under C++17

When attempting to build with CXX 20 using clang in a Linux environment, the build fails due to deprecation of atomic_load_explicit.
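The guard described above can be sketched like this (the wrapper class is illustrative, not the actual RocksDB code): select `std::atomic<std::shared_ptr<T>>` when the standard library advertises it via the feature-test macro, and otherwise fall back to the C++17 free functions that clang deprecates under C++20.

```cpp
#include <atomic>
#include <memory>

// Illustrative sketch of the guard: __cpp_lib_atomic_shared_ptr is defined
// when std::atomic<std::shared_ptr<T>> is available (C++20 libraries).
template <typename T>
class AtomicSharedPtr {
 public:
#if defined(__cpp_lib_atomic_shared_ptr)
  std::shared_ptr<T> Load() const { return ptr_.load(std::memory_order_acquire); }
  void Store(std::shared_ptr<T> p) {
    ptr_.store(std::move(p), std::memory_order_release);
  }

 private:
  std::atomic<std::shared_ptr<T>> ptr_;
#else
  // C++17 fallback: free functions operating on a plain shared_ptr.
  std::shared_ptr<T> Load() const {
    return std::atomic_load_explicit(&ptr_, std::memory_order_acquire);
  }
  void Store(std::shared_ptr<T> p) {
    std::atomic_store_explicit(&ptr_, std::move(p), std::memory_order_release);
  }

 private:
  std::shared_ptr<T> ptr_;
#endif
};
```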

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13744

Reviewed By: xingbowang

Differential Revision: D78997919

Pulled By: cbi42

fbshipit-source-id: f829c282cba878f072d4b0ad44192a87f73b8a90
Summary:
Building db_bench with clang and DEBUG_LEVEL=0 was failing with an unused-variable warning. This was not caught by CI, so I have added this configuration to the build-linux-clang-13-no_test_run job.

Also, while I was touching CI:
* Fold build-linux-release-rtti into build-linux-release by reducing the number of combinations tested between static/dynamic lib and rtti/not. I don't expect these to interact meaningfully with an extremely mature compiler.
* Combine build-linux-clang10-asan and build-linux-clang10-ubsan because clang is extremely reliable running both together

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13813

Test Plan: manual builds, CI

Reviewed By: krhancoc

Differential Revision: D79112643

Pulled By: pdillinger

fbshipit-source-id: 4ffc672718c05fa4597d637aacbc5a179ad8a0cf
Summary:
Add the new MultiScan operation to db_stress (disabled by default)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13803

Test Plan: python3 tools/db_crashtest.py whitebox --iterpercent=60 --prefix_size=-1 --prefixpercent=0 --readpercent=0 --test_batches_snapshots=0 --use_multiscan=1

Reviewed By: krhancoc

Differential Revision: D78938131

Pulled By: anand1976

fbshipit-source-id: 30fced56e46b79cebebc7ec4d4588c6c2fca232a
Summary:
**Issue**:
When running remote compaction, if all entries in the input files are expired, RocksDB incorrectly deletes an active file from the primary DB, leading to data loss and corruption.

**Root Cause**:
The current logic mistakenly mixed up the input and output file paths during the cleanup phase when no keys survive the compaction (all expired). This results in deleting the input files (which belong to the primary DB) instead of the output files (which belong to the SecondaryDB).

**Fix**:
Use `GetTableFileName` (virtual function) instead of `TableFileName`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13743

Reviewed By: hx235

Differential Revision: D79108650

Pulled By: jaykorean

fbshipit-source-id: 1c9ba971a0e9a62c15ebc014436cb8fc961af95c
Summary:
As title. We will re-enable it once fixed

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13815

Test Plan: N/A - Disabling the test.

Reviewed By: archang19

Differential Revision: D79172697

Pulled By: jaykorean

fbshipit-source-id: 936de3743816049cda811bde48b3b2207ed256ee
Summary:
Preserve tombstones when `allow_ingest_behind` is enabled so that they can be applied to ingested files. This can be useful when users use ingest_behind to buffer updates where Deletions need to be preserved. This fixes https://github.com/facebook/rocksdb/issues/13571.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13807

Test Plan: updated a unit test to verify that tombstones are not dropped during compaction.

Reviewed By: hx235

Differential Revision: D79016109

Pulled By: cbi42

fbshipit-source-id: c4d31ef32c88468ababcc1ea5af5db6de42a3b0d
Summary:
We were seeing some internal builds apparently failing the `-d /mnt/gvfs/third-party` check. Although third-party2 is likely a better check (see dependencies_platform010.sh), that would create a big headache with check_format_compatible.sh which has to work across codebase versions.
* Report a WARNING when we detect we are on a Meta machine but the `-d /mnt/gvfs/third-party` check fails
* Let USE_CLANG influence default compiler choice so that things might still work in that case (e.g. `USE_CLANG=1 make -j24 check`)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13820

Test Plan: manual, CI

Reviewed By: jaykorean

Differential Revision: D79277197

Pulled By: pdillinger

fbshipit-source-id: 19b2d45ed794f64bbf838f4414568d77ae9ca6f1
Summary:
Unit test reproducing the issue fixed by https://github.com/facebook/rocksdb/pull/13743

There is potential data loss when Remote Compaction entries are all removed for various reasons (CompactionFilter, a DeleteRange covering all keys of the SST file, etc.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13812

Test Plan:
```
./compaction_service_test --gtest_filter="*CompactionServiceTest.EmptyResult*"
```

Failed before merging https://github.com/facebook/rocksdb/pull/13743, now passing

Reviewed By: cbi42

Differential Revision: D79192829

Pulled By: jaykorean

fbshipit-source-id: e200300c4a7993de21c63cd92bda65b692921b89
Summary:
The main motivation for this change is to more flexibly and efficiently support compressing data without extra copies when we do not want to support saving compressed data that is LARGER than the uncompressed. We believe pretty strongly that for the various workloads served by RocksDB, it is well worth a single byte compression marker so that we have the flexibility to save compressed or uncompressed data when compression is attempted. Why? Compression algorithms can add tens of bytes in fixed overheads and percents of bytes in relative overheads. It is also an advantage for the reader when they can bypass decompression, including at least a buffer copy in most cases, after reading just one byte.

The block-based table format in RocksDB follows this model with a single-byte compression marker, and at least after https://github.com/facebook/rocksdb/pull/13797 so does CompressedSecondaryCache. (Notably, the blob file format DOES NOT. This is left to follow-up work.)

In particular, Compressor::CompressBlock now takes in a fixed size buffer for output rather than a `std::string*`. CompressBlock itself rejects the compression if the output would not fit in the provided buffer. This also works well with `max_compressed_bytes_per_kb` option to reject compression even sooner if its ratio is insufficient (implemented in this change). In the future we might use this functionality to reduce a buffer copy (in many cases) into the WritableFileWriter buffer of the block based table builder.
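The shape of the new API can be sketched as follows (the function names and the stand-in compressor are hypothetical, not the actual RocksDB signatures): the caller supplies a fixed-size output buffer whose capacity already encodes the acceptable ratio, e.g. derived from `max_compressed_bytes_per_kb`, and a `false` return means "keep the block uncompressed".

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hedged sketch, not RocksDB code: the same idea as the option — allow at
// most N compressed bytes per KB of input.
size_t MaxAcceptableCompressedSize(size_t uncompressed_size,
                                   int max_compressed_bytes_per_kb) {
  return uncompressed_size * size_t(max_compressed_bytes_per_kb) / 1024;
}

// Stand-in compressor that strips spaces. A real Compressor::CompressBlock
// would call into zstd/lz4/etc. and reject the compression if the output
// does not fit the provided fixed-size buffer.
bool CompressBlockSketch(const std::string& input, char* out,
                         size_t out_capacity, size_t* out_size) {
  std::string tmp;
  for (char c : input) {
    if (c != ' ') tmp.push_back(c);
  }
  if (tmp.size() > out_capacity) return false;  // reject: over budget
  std::memcpy(out, tmp.data(), tmp.size());
  *out_size = tmp.size();
  return true;
}
```

Rejecting inside the compressor this way avoids ever materializing an oversized compressed result, which is what enables compressing directly into a caller-owned buffer without extra copies.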

This is a large change because we needed to (or were compelled to)
* Update all the existing callers of CompressBlock, sometimes with substantial changes. This includes introducing GrowableBuffer to reuse between calls rather than std::string, which (at least in C++17) requires zeroing out data when allocating/growing a buffer.
* Re-implement built-in Compressors (V2; V1 is obsolete) to efficiently implement the new version of the API, no longer wrapping the `OLD_CompressData()` function. The new compressors appropriately leverage the CompressBlock virtual call required for the customization interface and no longer rely on a `switch` on compression type for each block. The implementations are largely adaptations of the old implementations, except
  * LZ4 and LZ4HC are notably upgraded to take advantage of WorkingArea (see performance tests). And for simplicity in the new implementation, we are dropping support for some super old versions of the library.
  * Getting snappy to work with limited-size output buffer required using the Sink/Source interfaces, which appear to be well supported for a long time and efficient (see performance tests).
* Replace awkward old CompressionManager::GetDecompressorForCompressor with Compressor::GetOptimizedDecompressor (which is optional to implement)
* Small behavior change where we treat lack of support for compression closer to not configuring compression, such as incompatibility with block_align. This is motivated by giving CompressionManager the freedom of determining when compression can be excluded for an entire file despite the configured "compression" type, and thus only surfacing actual incompatibilities not hypothetical ones that might be irrelevant to the CompressionManager (or build configuration). Unit tests in `table_test` and `compact_files_test` required update.
* Some lingering clean up of CompressedSecondaryCache and a re-optimization made possible by compressing into an existing buffer.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13805

Test Plan:
for correctness, existing tests

## Performance Test

As I generally only modified compression paths, I'm using a db_bench write benchmark, with before & after configurations running at the same time. vc=1 means verify_compression=1

```
USE_CLANG=1 DEBUG_LEVEL=0 LIB_MODE=static make -j100 db_bench
SUFFIX=`tty | sed 's|/|_|g'`; for CT in zlib bzip2 none snappy zstd lz4 lz4hc none snappy zstd lz4 bzip2; do for VC in 0 1; do echo "$CT vc=$VC"; (for I in `seq 1 20`; do BIN=/dev/shm/dbbench${SUFFIX}.bin; rm -f $BIN; cp db_bench $BIN; $BIN -db=/dev/shm/dbbench$SUFFIX --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 -compression_type=$CT -verify_compression=$VC 2>&1 | grep micros/op; done) | awk '{n++; sum += $5;} END { print int(sum / n); }'; done; done
```

zlib vc=0 524198 -> 524904 (+0.1%)
zlib vc=1 430521 -> 430699 (+0.0%)
bzip2 vc=0 61841 -> 60835 (-1.6%)
bzip2 vc=1 49232 -> 48734 (-1.0%)
none vc=0 1802375 -> 1906227 (+5.8%)
none vc=1 1837181 -> 1950308 (+6.2%)
snappy vc=0 1783266 -> 1901461 (+6.6%)
snappy vc=1 1799703 -> 1879660 (+4.4%)
zstd vc=0 1216779 -> 1230507 (+1.1%)
zstd vc=1 996370 -> 1015415 (+1.9%)
lz4 vc=0 1801473 -> 1943095 (+7.9%)
lz4 vc=1 1799155 -> 1935242 (+7.6%)
lz4hc vc=0 349719 -> 1126909 (+222.2%)
lz4hc vc=1 348099 -> 1108933 (+218.6%)
(Repeating the most important ones)
none vc=0 1816878 -> 1952221 (+7.4%)
none vc=1 1813736 -> 1904622 (+5.0%)
snappy vc=0 1794816 -> 1875062 (+4.5%)
snappy vc=1 1789363 -> 1873771 (+4.7%)
zstd vc=0 1202592 -> 1225164 (+1.9%)
zstd vc=1 994322 -> 1016688 (+2.2%)
lz4 vc=0 1786959 -> 1971518 (+10.3%)
lz4 vc=1 1829483 -> 1935871 (+5.8%)

I confirmed manually that the new WorkingArea for LZ4HC makes the huge difference on that one, but not as much difference for LZ4, presumably because LZ4HC uses much larger buffers/structures/whatever for better compression ratios.

Reviewed By: hx235

Differential Revision: D79111736

Pulled By: pdillinger

fbshipit-source-id: 1ce1b14af9f15365f1b6da49906b5073a8cecc14
Summary:
Fix a bug in MultiScan where BlockBasedTableIterator should not report out-of-bound when all blocks of the last scan are exhausted. This prevented LevelIterator from advancing to the next file, so the iterator returned fewer keys than expected.

Also fixed stress testing to specify iterate_upper_bound correctly.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13822

Test Plan:
- the following fails quickly before this PR and finishes after this PR
```
python3 tools/db_crashtest.py whitebox --iterpercent=60 --prefix_size=-1 --prefixpercent=0 --readpercent=0 --test_batches_snapshots=0 --use_multiscan=1 --seed=1 --fill_cache=1 --read_fault_one_in=0 --column_families=1 --allow_unprepared_value=0 --kill_random_test=88888
```
- new unit test that fails before this PR

Reviewed By: krhancoc

Differential Revision: D79308957

Pulled By: cbi42

fbshipit-source-id: c9eafd1c8750b959b0185d7c63199b503493cbd2
Summary:
One of the parameters for constructing a Compaction object is `earliest_snapshot`, which is required for Standalone Range Deletion Optimization (introduced in [https://github.com/facebook/rocksdb/pull/13078](https://github.com/facebook/rocksdb/pull/13078)). Remote Compaction has been using the `CompactionPicker::CompactFiles()` API to create the Compaction object, but this API never sets the `earliest_snapshot` parameter. To address this, update `CompactionPicker::CompactFiles()` to optionally accept `earliest_snapshot` and pass it during the call in `DBImplSecondary::CompactWithoutInstallation()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13816

Test Plan:
```
./compaction_service_test --gtest_filter="*CompactionServiceTest.StandaloneDeleteRangeTombstoneOptimization*"
```

\+ Tested in Meta's internal offload infra.

Reviewed By: hx235

Differential Revision: D79284769

Pulled By: jaykorean

fbshipit-source-id: 164834ef6972d5e0ddfc2970bb9234ef166d6e52
Summary:
This diff introduces the IntervalSet data structure, which will be used to help create sets of non-overlapping intervals for MultiScan scan options. Specifically, we add specializations for Slices to assist with this.
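The idea can be sketched as follows (the class and method names here are hypothetical, not the actual IntervalSet API): maintain a sorted set of non-overlapping `[start, end]` intervals, folding any newly added interval into the ones it overlaps. For MultiScan this would be specialized for Slice bounds rather than `std::string`.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Illustrative sketch of an interval set over string keys.
class IntervalSetSketch {
 public:
  void Add(std::string start, std::string end) {
    std::vector<std::pair<std::string, std::string>> merged;
    for (auto& iv : intervals_) {
      if (iv.second < start || end < iv.first) {
        merged.push_back(iv);  // disjoint: keep as-is
      } else {
        // Overlapping: fold the existing interval into the new one.
        start = std::min(start, iv.first);
        end = std::max(end, iv.second);
      }
    }
    merged.emplace_back(std::move(start), std::move(end));
    std::sort(merged.begin(), merged.end());
    intervals_ = std::move(merged);
  }

  const std::vector<std::pair<std::string, std::string>>& Get() const {
    return intervals_;
  }

 private:
  // Invariant: sorted and pairwise non-overlapping.
  std::vector<std::pair<std::string, std::string>> intervals_;
};
```

For example, adding ["a","c"], ["e","g"], and then ["b","f"] collapses everything into the single interval ["a","g"].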

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13787

Test Plan: Added test to catch various cases within adding intervals.

Reviewed By: anand1976

Differential Revision: D78624970

Pulled By: krhancoc

fbshipit-source-id: 9a3e4a28738ab8428788467540fc05ab5c1a1b67
Summary:
In anticipation of an enhancement related to parallel compression
* Rename confusing state variables `seperator_is_key_plus_seq_` -> `must_use_separator_with_seq_`
* Eliminate copy-paste code in `PartitionedIndexBuilder::AddIndexEntry`
* Optimize/simplify `PartitionedIndexBuilder::flush_policy_` by allowing a single policy to be re-targeted to different block builders. Added some additional internal APIs to make this work, and it only works because the FlushBlockBySizePolicy is otherwise stateless (after creation).
* Improve some comments, including another proposed optimization especially for the common case of no live snapshots affecting a large compaction

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13828

Test Plan:
existing tests are pretty exhaustive, especially with crash test

Planning to validate performance in combination with next change. (This change is saving some extra allocate/deallocate with partitioned index.)

Reviewed By: cbi42

Differential Revision: D79570576

Pulled By: pdillinger

fbshipit-source-id: f7a16f0e6e6ad2023a3d1a2ebaa3cc22aac717af
Summary:
Introduce `CompactionJob::VerifyOutputRecordCount()` and make it align with `VerifyInputRecordCount()`.

Functionality-wise, it should be the same, except that when `db_options_.compaction_verify_record_count` is false, RocksDB will only print a WARN message upon verification failure instead of returning `Status::Corruption()`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13830

Test Plan:
Existing tests cover both
```
 ./compaction_service_test --gtest_filter="*CompactionServiceTest.VerifyInputRecordCount*"
```

```
 ./compaction_service_test --gtest_filter="*CompactionServiceTest.CorruptedOutput*"
```

Reviewed By: hx235

Differential Revision: D79584795

Pulled By: jaykorean

fbshipit-source-id: 5851328999005601b28504085b688b80880bca7c
Summary:
... by ensuring that files in a dropped column family are not returned to the caller upon successful, offline MANIFEST iteration.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13832

Test Plan: `DBTest2, GetFileChecksumsFromCurrentManifest_CRC32`

Reviewed By: pdillinger

Differential Revision: D79607298

Pulled By: mszeszko-meta

fbshipit-source-id: e7948e086ba6e6fb953a3959fdcc81300613d73e
Summary:

Quick follow-up from https://github.com/facebook/rocksdb/pull/13816: `CompactFiles()` and `CompactRange()` in CompactionPickers do not run compaction as their names might suggest. What they actually do is create the Compaction object that will be passed to `CompactionJob` to run the compaction.

Renaming these two functions to better represent their purposes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13831

Test Plan: No functional change. Existing CI should be sufficient.

Reviewed By: hx235

Differential Revision: D79660196

Pulled By: jaykorean

fbshipit-source-id: ca831dbef5120e7115b52fd07b0059ca16c8f1e8
Summary:
**Context/Summary:**

The `RoundRobinSubcompactionsAgainstResources` test, specifically the `SubcompactionsUsingResources` case, is now disabled. This decision was made because the test's reliability depends on the absence of any concurrent compactions other than the round-robin compaction. Addressing this issue while maintaining the test's focus on resource reservation requires a deeper investigation, which is currently beyond my available bandwidth. Given the increased frequency of test failures, it has been temporarily disabled to prevent further disruptions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13839

Test Plan: - Should be no test failure from RoundRobinSubcompactionsAgainstResources.SubcompactionsUsingResources anymore.

Reviewed By: cbi42

Differential Revision: D79686366

Pulled By: hx235

fbshipit-source-id: 3a226cfd2b67cabc6c585ea567e2b0c25aa5f345
Summary:
This option has the same functionality as `DBOptions::allow_ingest_behind` but enables the feature at the per-CF level. `DBOptions::allow_ingest_behind` is deprecated after this PR and users should use `cf_allow_ingest_behind` instead.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13810

Test Plan: updated some existing tests to use the new option.

Reviewed By: xingbowang

Differential Revision: D79191969

Pulled By: cbi42

fbshipit-source-id: 0da45f6be472ace6754ad15df93d45ac86313837
Summary:
Remote Compaction in the stress test previously failed with the following error, so we temporarily disabled it in https://github.com/facebook/rocksdb/issues/13815:

```
reference std::vector<rocksdb::ThreadState *>::operator[](size_type) [_Tp = rocksdb::ThreadState *, _Alloc = std::allocator<rocksdb::ThreadState *>]: Assertion '__n < this->size()' failed.
```

The error came from accessing `remote_compaction_worker_threads[i]` when `i` was not less than `remote_compaction_worker_threads.size()`, which is undefined behavior. This PR fixes the issue by properly setting the worker thread pointers in `remote_compaction_worker_threads`.

Note: We are still encountering errors when both BlobDB and Remote Compaction are enabled. It appears to be a race condition. For now, BlobDB is temporarily disabled if remote compaction is enabled. We will fix the race condition and re-enable BlobDB as a follow-up.
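The bug class described above can be sketched in isolation (`ThreadStateSketch` is a hypothetical stand-in, not db_stress's `ThreadState`): indexing a vector that was never sized for the workers trips the debug-mode assertion `__n < this->size()` and is undefined behavior in release builds, so the fix is to size the vector before populating the slots.

```cpp
#include <vector>

// Illustrative sketch, not db_stress code.
struct ThreadStateSketch {
  int tid = 0;
};

std::vector<ThreadStateSketch*> MakeWorkerSlots(
    int n, std::vector<ThreadStateSketch>& storage) {
  storage.resize(n);                         // own the per-worker states
  std::vector<ThreadStateSketch*> slots(n);  // the fix: size the vector first
  for (int i = 0; i < n; i++) {
    storage[i].tid = i;
    slots[i] = &storage[i];  // safe: i < slots.size() is guaranteed
  }
  return slots;
}
```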

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13835

Test Plan:
```
python3 -u tools/db_crashtest.py blackbox --remote_compaction_worker_threads=16 --interval=2 --duration=180
```

Reviewed By: hx235

Differential Revision: D79684447

Pulled By: jaykorean

fbshipit-source-id: 65f5809f651865c3df76c2cf3b9e7b8d654bb90a
Summary:
**Summary:**
This test verifies that compaction respects the min_file_size parameter when triggered by deletions, preventing the compaction of files with deletions smaller than the threshold. The test logic includes two scenarios:
1. Verify that a large L0 file with deletions exceeding the minimum file size threshold triggers deletion-triggered compaction (DTC) and compacts to L1.
2. Verify that a small L0 file with deletions, but below the minimum file size threshold, does not trigger DTC and remains at L0.

Added the DeletionTriggeredCompactionWithMinFileSizeTestListener, which verifies that files selected for compaction based on deletion triggers meet the minimum file size threshold. The listener validates in OnCompactionBegin that all input files have sizes greater than or equal to the configured min_file_size parameter.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13825

Test Plan:
Tested this feature on our devserver using the following commands:
```
DEBUG_LEVEL=2 make -j64 db_compaction_test && KEEP_DB=1 ./db_compaction_test --gtest_filter="*DBCompactionTest.CompactionWith*"
```

Test output confirms the expected behavior:
```
2025/07/31-11:24:49.473181 1431671 [/compaction/compaction_job.cc:2291] [default] [JOB 6] Compacting 2@0 files to L1, score 0.04
2025/07/31-11:24:49.473240 1431671 [/compaction/compaction_job.cc:2297] [default]: Compaction start summary: Base version 6 Base level 0, inputs: [15(52KB) 9(103KB)]
2025/07/31-11:24:49.473304 1431671 EVENT_LOG_v1 {"time_micros": 1753986289473273, "job": 6, "event": "compaction_started", "cf_name": "default", "compaction_reason": "FilesMarkedForCompaction", "files_L0": [15, 9], "score": 0.04, "input_data_size": 159848, "oldest_snapshot_seqno": -1}

```

**Tasks:**
T228156639

Reviewed By: cbi42

Differential Revision: D79395851

Pulled By: nmk70

fbshipit-source-id: 4c2a80a95521b40543981dd81b347f3984cd2a8b
Summary:
To reduce the noise, disable the incompatible ones for now when `remote_compaction_worker_threads > 0`. We will investigate each, fix as needed and re-enable them as follow up.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13843

Test Plan:
```
python3 -u tools/db_crashtest.py blackbox --remote_compaction_worker_threads=8 --interval=5 --duration=6000 --continuous_verification_interval=10 --disable_wal=1 --use_txn=1 --enable_pipelined_write=0 --checkpoint_one_in=0 --use_timed_put_one_in=0
```

Reviewed By: cbi42

Differential Revision: D79735166

Pulled By: jaykorean

fbshipit-source-id: ae3be38a21073fd3282d6e8cd7d71f0363df3590
Summary:
https://github.com/facebook/rocksdb/issues/13676 unfortunately treated some IOErrors as corruption, which is not appropriate when remote storage is involved. To help enforce this, our crash test injects errors that are expected to be propagated back to the user rather than causing some other failure.

Saw crash test failures like this:
```
TestMultiGetEntity (AttributeGroup) error: Corruption: Failed to get file size: Not implemented: GetFileSize Not Supported for file ...
```

So fixing this handling by not injecting a false Corruption failure and allowing smooth fallback from FSRandomAccessFile::GetFileSize to FileSystem::GetFileSize
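The fallback shape described above can be sketched as follows. The types are simplified stand-ins (a bare status enum and callables instead of RocksDB's FSRandomAccessFile/FileSystem interfaces): a NotSupported result from the per-file call is not an error to surface, let alone a corruption, and should quietly fall through to the filesystem-level call.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

enum class Code { kOk, kNotSupported, kIOError };

// file_get / fs_get are stand-ins for FSRandomAccessFile::GetFileSize
// and FileSystem::GetFileSize respectively.
Code GetFileSizeWithFallback(const std::function<Code(uint64_t*)>& file_get,
                             const std::function<Code(uint64_t*)>& fs_get,
                             uint64_t* size) {
  Code c = file_get(size);
  if (c == Code::kNotSupported) {
    // NotSupported is not corruption: fall back smoothly.
    c = fs_get(size);
  }
  return c;
}
```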

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13842

Test Plan: unit test added

Reviewed By: xingbowang

Differential Revision: D79728861

Pulled By: pdillinger

fbshipit-source-id: 33f7dfc85d86d88cb4ab24a8defd26618c95c954
Summary:
https://github.com/facebook/rocksdb/pull/13404 exposed pretty much everything via the C API except for allowing the user to set the SST file manager that was created

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13826

Reviewed By: hx235

Differential Revision: D79733147

Pulled By: cbi42

fbshipit-source-id: 6a18741581717a8b8b644b9f85bcd8fbeba94e6a
Summary:
We temporarily disabled WAL when Remote Compaction is enabled in Stress Test (https://github.com/facebook/rocksdb/pull/13843). A few other features are incompatible with WAL being disabled. Due to the sanitization order, WAL was disabled at the end of sanitization, so these incompatible features weren't set properly. Stress Test failed with an error like the following.

e.g. `reopen` stress test is not compatible with `disable_wal` - `Error: Db cannot reopen safely with disable_wal set!`

This PR changes the order of sanitization so that `disable_wal` is set earlier when `remote_compaction_worker_threads > 0`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13845

Test Plan:
```
python3 -u tools/db_crashtest.py blackbox --remote_compaction_worker_threads=8 --interval=5 --duration=6000 --continuous_verification_interval=10 --disable_wal=1 --use_txn=1 --txn_write_policy=2 --enable_pipelined_write=0 --checkpoint_one_in=0 --use_timed_put_one_in=0
```

Reviewed By: cbi42

Differential Revision: D79758670

Pulled By: jaykorean

fbshipit-source-id: aa6f4a74cc86c23f442928c301187b06e8137f53
Summary:
Previous attempts were not enough to keep the stress test running with remote compaction enabled - https://github.com/facebook/rocksdb/pull/13845, https://github.com/facebook/rocksdb/pull/13843, https://github.com/facebook/rocksdb/pull/13835

We will disable the remote compaction in stress test and address this with a better strategy (using internal Meta infra)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13848

Test Plan: CI

Reviewed By: cbi42

Differential Revision: D79816733

Pulled By: jaykorean

fbshipit-source-id: e93b037adf4f775202e06c3fd4aa8a3b4b85c274
Summary:
**Context/Summary:**
The `CompactionJob::Run()` method has grown too large and complex, making it difficult to implement moderate changes or reason about the code flow (e.g., determining where to save compaction progress for resuming). This PR refactors the method into smaller, more focused functions to improve readability and maintainability.

The refactoring consists mostly of cosmetic changes that extract logical sections into separate methods, with two notable functional improvements:

1.  **Relocated output processing logic**: Moved code under `RemoveEmptyOutputs()` and `HasNewBlobFiles()` to where it's actually needed, rather than piggy-backing on the subcompaction state loop. While this introduces 2 additional loops over subcompactions, the performance impact should be negligible given the improved code clarity.

2.  **Repositioned statistics updates**: Moved `UpdateCompactionJobInputStats()` and `UpdateCompactionJobOutputStats()` from the record verification section into the `FinalizeCompactionRun()` methods at the end. This change is safe since record verification is a read-only operation that doesn't modify any statistics.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13849

Test Plan: Existing unit tests

Reviewed By: jaykorean

Differential Revision: D79824429

Pulled By: hx235

fbshipit-source-id: 6b73136f32ecc6842a04a77502b7dbb0bbf507f7
Summary:
**Context/Summary:**
This update, which should have been part of a previous refactoring [PR](d2ac955881), involves simple renaming for clarity and ensures output table properties are only set when compaction succeeds. Output properties are not meaningful if compaction fails, so this change prevents their population in such cases. Additionally, subsequent statistics updates already do not rely on output file table properties, maintaining correctness regardless of compaction success.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13851

Test Plan: Existing unit tests

Reviewed By: jaykorean

Differential Revision: D79862244

Pulled By: hx235

fbshipit-source-id: 1db16b8dc7b820fab3ec1d5c8a4b757466590e2c
Summary:
To better support future options and changes, we need to convert the std::vector<ScanOptions> to something more malleable.

This diff introduces the MultiScanOptions structure and pipes it through the various points in the code in the Prepare path.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13837

Test Plan:
Ensure all associated tests pass
```
make check all
```

Reviewed By: cbi42

Differential Revision: D79655229

Pulled By: krhancoc

fbshipit-source-id: 3a90fb7420e9655021de85ed0158b866f8bfba05
Summary:
While debugging a stress test failure, I noticed that sst_dump and ldb do not work if the custom db_stress compression manager is used. This PR adds support for it.

```
 ./sst_dump --command=raw --show_properties --file=/tmp/rocksdb_crashtest_whitebox4ny5mass/000589.sst
options.env is 0x7f2b1f4b9000
Process /tmp/rocksdb_crashtest_whitebox4ny5mass/000589.sst
Sst file format: block-based
/tmp/rocksdb_crashtest_whitebox4ny5mass/000589.sst: Not implemented: Could not load CompressionManager: DbStressCustom1
/tmp/rocksdb_crashtest_whitebox4ny5mass/000589.sst is not a valid SST file

./ldb idump --db=/tmp/rocksdb_crashtest_whiteboxy_emah11 --ignore_unknown_options  --hex >> /tmp/i_dump
Failed: Not implemented: Could not load CompressionManager: DbStressCustom1
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13827

Test Plan: manually tested that ldb and sst_dump work with DbStressCustomCompressionManager after this PR

Reviewed By: pdillinger

Differential Revision: D79461175

Pulled By: cbi42

fbshipit-source-id: c8c092b10b4fde3a295b00751057749e8f0cf095
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/13508 accidentally left track_and_verify_wal disabled; this PR re-enables it.

**Test**
[ongoing] Rehearsal stress test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13853

Reviewed By: pdillinger

Differential Revision: D79909991

Pulled By: hx235

fbshipit-source-id: aea91c98e43f26dec9a8988c837a6ed821979a3c
Summary:
fix the following error showing up in continuous tests:
```
Makefile:186: Warning: Compiling in debug mode. Don't use the resulting binary in production
port/mmap.cc:46:15: error: first argument in call to 'memcpy' is a pointer to non-trivially copyable type 'rocksdb::MemMapping' [-Werror,-Wnontrivial-memcall]
   46 |   std::memcpy(this, &other, sizeof(*this));
      |               ^
port/mmap.cc:46:15: note: explicitly cast the pointer to silence this warning
   46 |   std::memcpy(this, &other, sizeof(*this));
      |               ^
      |               (void*)
1 error generated.
make: *** [Makefile:2580: port/mmap.o] Error 1
make: *** Waiting for unfinished jobs....
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13864

Test Plan: `make USE_CLANG=1 j=150 check` with 13f054febb/build_tools/build_detect_platform (L61-L70) commented out.

Reviewed By: mszeszko-meta

Differential Revision: D80033441

Pulled By: cbi42

fbshipit-source-id: b2330eea71fe28243236b75128ec6f3f1e971873
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13846

This diff addresses a few issues that were identified during testing of the user-defined index.

1. During the finishing of the index blocks, we run into an infinite loop because the user defined index wrapper returns
early on incomplete status. This happens because the wrapper blindly returns the status if it not OK. But, the status
could legitimately be `Incomplete()` for some indices like Partitioned Index (serving as the internal index for the UDI
wrapper). Fix is to exclude `Incomplete()` check from the status check early in the UDI wrapper's finish.

2. Once we fixed (1), we noticed that the meta blocks for the UDI-based index writer were not written out to the final
SST file. This is because the UDI's meta blocks are created after the internal index's meta blocks and the block-based
index builder didn't account for this. The fix is to finish the UDI wrapper first which will create the necessary meta blocks
and then finish the internal index. If the internal index is incomplete, the block-based index builder should still continue
to write out the meta blocks.

3. OnKeyAdded, when delegating to the user-defined index, should only pass the user key. The UDI builder doesn't
understand RocksDB's internal key format, and while that poses interesting challenges when the UDI is used for
non-last-level SST files, our plan is to restrict the usage of the UDI to last-level files only (for now).

Reviewed By: pdillinger

Differential Revision: D79781453

fbshipit-source-id: 2239c8fc016da55df5c24be6aacc8f6357cab029
Summary:
... as we see some issues that rehearsal stress test didn't surface.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13869

Reviewed By: cbi42

Differential Revision: D80103341

Pulled By: hx235

fbshipit-source-id: 8b2c1d76d4c3099727ba3a69de44de67afd64369
Summary:
The original intention of the User Defined Index interface was to use the user key. However, the implementation mixed user and internal key usage. This PR makes it consistent. It also clarifies the UDI contract.
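The user/internal key distinction at issue here: in RocksDB's internal key format, the user key is followed by an 8-byte packed (sequence number, value type) trailer. A minimal sketch of stripping that trailer before handing the key to a UDI (a simplified illustration, not the actual extraction helper RocksDB uses internally):

```cpp
#include <cassert>
#include <string>

// An internal key is: user_key + 8-byte trailer packing the sequence
// number and value type. The user key is everything before the trailer.
std::string UserKeyFromInternalKey(const std::string& internal_key) {
  assert(internal_key.size() >= 8);
  return internal_key.substr(0, internal_key.size() - 8);
}
```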

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13865

Test Plan: Update tests in table_test.cc

Reviewed By: pdillinger

Differential Revision: D80050344

Pulled By: anand1976

fbshipit-source-id: ace47737d21684ec19709640a09e198cee2d98bd
Summary:
**Context/Summary:** ... for better readability

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13876

Test Plan: Existing UT

Reviewed By: mszeszko-meta

Differential Revision: D80185817

Pulled By: hx235

fbshipit-source-id: 534d37dd747369da48fc5903acc66bb9c8f5206d
Summary:
Allow a user defined index to be configured from a string

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13880

Test Plan: Add a unit test in table_test.cc

Reviewed By: bikash-c

Differential Revision: D80237701

Pulled By: anand1976

fbshipit-source-id: 8b3d0bcdfbb4bb76803916ea1b1f940a4d985dfd
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13882

The `expect_valid_internal_key` parameter was always passed as true, with false only used in one unit test. This change removes the parameter and always fails compaction when encountering corrupted internal keys, which is the expected production behavior.

Reviewed By: mszeszko-meta

Differential Revision: D80287672

fbshipit-source-id: e30a282ac30d7fded677504cec11173de8d15167
Summary:
This is an important feature for avoiding (reducing) unfair block cache treatment for a lot of blocks. It should also unlock some parallel optimizations (https://github.com/facebook/rocksdb/issues/13850) and code simplification.

Consider for follow-up:
* Feature to avoid majorly under-sized data blocks and filter and index partition blocks

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13881

Test Plan: existing tests, been looking good in production

Reviewed By: hx235

Differential Revision: D80288192

Pulled By: pdillinger

fbshipit-source-id: 5e274ffffb044713278d2a286db6bceaab2dadec
Summary:
Two instances of change that are not just cosmetic:

* InlineSkipList<>::Node::CASNext() was implicitly using memory_order_seq_cst to access `next_` while it's intended to be accessed with acquire/release. This is probably not a correctness issue for compare_exchange_strong but potentially a previously missed optimization.
* Similar for `max_height_` in Insert which is otherwise accessed with relaxed memory order.
* One non-relaxed access to `is_range_del_table_empty_` in a function only used in assertions. Access to this atomic is otherwise relaxed (and should be - comment added)

Didn't do all of memtable.h because some of them are more complicated changes and I should probably add FetchMin and FetchMax functions to simplify and take advantage of C++26 functions where available (intended follow-up).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13844

Test Plan: existing tests

Reviewed By: xingbowang

Differential Revision: D79742552

Pulled By: pdillinger

fbshipit-source-id: d97ce72ba9af6c105694b7d40622db9e994720cd
Summary:
Add the --list_meta_blocks option to sst_dump. This PR also refactors some of the test code in sst_dump_test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13838

Reviewed By: cbi42

Differential Revision: D80320812

Pulled By: anand1976

fbshipit-source-id: 921b6560fbd756f5f8b364893700d240d3b7ad00
Summary:
The assert occasionally throws off the stress test runs. We already have sufficient logging in place to collect the signal about secondary cache capacity exceeding primary cache reservation for further investigation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13885

Reviewed By: anand1976

Differential Revision: D80355513

Pulled By: mszeszko-meta

fbshipit-source-id: b36926f0493a3aca19818a1980ef79277db9fe7e
Summary:
Add a new field `io_coalesce_threshold` to MultiScanArgs to make the IO coalescing threshold configurable.
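The threshold's effect can be sketched as interval merging over sorted read requests (a hypothetical illustration, not RocksDB's actual coalescing code): two requests are combined into one larger IO whenever the gap between them is at or below the threshold, trading some wasted bytes for fewer requests.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using Range = std::pair<uint64_t, uint64_t>;  // {offset, length}

// Merge sorted, non-overlapping read requests whose inter-request gap
// is <= threshold. threshold == 0 only merges directly adjacent reads.
std::vector<Range> CoalesceReads(const std::vector<Range>& reqs,
                                 uint64_t threshold) {
  std::vector<Range> out;
  for (const Range& r : reqs) {
    if (!out.empty() &&
        r.first <= out.back().first + out.back().second + threshold) {
      // Extend the previous request to cover this one (and the gap).
      out.back().second = r.first + r.second - out.back().first;
    } else {
      out.push_back(r);
    }
  }
  return out;
}
```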

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13886

Test Plan:
db_bench showing less IO requests with higher io_coalesce_threshold
```
Single L0 file, iterator uses BlockBasedTableIterator directly, skipping LevelIterator

DB Set up: ./db_bench --benchmarks="fillseq,compact" --disable_wal=1 --threads=1 --num_levels=1 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=1000 --write_buffer_size=268435456

./db_bench --db="/tmp/rocksdbtest-543376/dbbench" --use_existing_db=1 --benchmarks=multiscan --disable_auto_compactions=1 --seek_nexts=100 --threads=32 --duration=10 --statistics=1 --use_direct_reads=1 ..

--multiscan_coalesce_threshold=0
rocksdb.non.last.level.read.bytes COUNT : 54591304136
rocksdb.non.last.level.read.count COUNT : 7680204
multiscan    :     397.197 micros/op 79401 ops/sec 10.377 seconds 823968 operations; (multscans:24999)

--multiscan_coalesce_threshold=16384
rocksdb.non.last.level.read.bytes COUNT : 95960989272
rocksdb.non.last.level.read.count COUNT : 912008
multiscan    :     389.099 micros/op 81064 ops/sec 10.312 seconds 835968 operations; (multscans:25999)

--multiscan_coalesce_threshold=163840
rocksdb.non.last.level.read.bytes COUNT : 98805008718
rocksdb.non.last.level.read.count COUNT : 827893
multiscan    :     392.831 micros/op 80357 ops/sec 10.353 seconds 831968 operations; (multscans:25999)

DB with multiple files in a level, iterator will use LevelIterator
./db_bench --benchmarks="fillseq,compact" --disable_wal=1 --threads=1 --num_levels=6 --num=10000000

./db_bench --db="/tmp/rocksdbtest-543376/dbbench" --use_existing_db=1 --benchmarks=multiscan --disable_auto_compactions=1 --seek_nexts=100 --threads=32 --duration=10 --statistics=1 --use_direct_reads=1 --num=10000000

--multiscan_coalesce_threshold=0
multiscan    :    1161.734 micros/op 26995 ops/sec 10.667 seconds 287968 operations; (multscans:8999)
rocksdb.non.last.level.read.bytes COUNT : 23917753523
rocksdb.non.last.level.read.count COUNT : 2868907

--multiscan_coalesce_threshold=16384
rocksdb.non.last.level.read.bytes COUNT : 35022281853
rocksdb.non.last.level.read.count COUNT : 287375
multiscan    :    1195.336 micros/op 26265 ops/sec 10.850 seconds 284968 operations; (multscans:8999)

```

Reviewed By: anand1976

Differential Revision: D80381441

Pulled By: cbi42

fbshipit-source-id: 57cc67df4a808e27c3a48ddf3ef6907bec131ee9
Summary:
I ran multiple runs of crash test jobs internally, so far I've seen one iterator mismatch and one assertion failure. I've added relevant logging improvements to help debugging them. use_multiscan will be stable within a crash test run to make it easier to triage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13888

Test Plan: `python3 tools/db_crashtest.py whitebox --prefix_size=-1 --test_batches_snapshots=0 --use_multiscan=1 --read_fault_one_in=0 --kill_random_test=88888`

Reviewed By: anand1976

Differential Revision: D80627399

Pulled By: cbi42

fbshipit-source-id: 2fa3f77e730f5bc7d1d200dc122cf84e3558c588
Summary:
When fill_cache in ReadOptions is false, multi scan Prepare crashes with the following assertion failure. In this case, CreateAndPinBlockInCache needs to directly create a block with full ownership.

#9  0x00007f2fc003bc93 in __GI___assert_fail (assertion=0x7f2fc2147361 "pinned_data_blocks_guard[block_idx].GetValue()", file=0x7f2fc2146e08 "table/block_based/block_based_table_iterator.cc", line=1178, function=0x7f2fc2147262 "virtual void rocksdb::BlockBasedTableIterator::Prepare(const rocksdb::MultiScanArgs *)") at assert.c:101
101 in assert.c
#10 0x00007f2fc1d73088 in rocksdb::BlockBasedTableIterator::Prepare(rocksdb::MultiScanArgs const*) () from /data/users/anand76/rocksdb_anand76/librocksdb.so.10.6

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13889

Test Plan: Parameterize the DBMultiScanIteratorTest tests with fill_cache

Reviewed By: cbi42

Differential Revision: D80552069

Pulled By: anand1976

fbshipit-source-id: 1a0b64af1e14c63d826add1f994a832ebff12757
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/13887

Reviewed By: hx235

Differential Revision: D80717735

Pulled By: cbi42

fbshipit-source-id: efecf436188d473a18359e715df979ff24f2fd2e
Summary:
The function implementation is here: 8f0ab1598e/db/c.cc (L928-L930), but it wasn't fully exposed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13875

Reviewed By: hx235

Differential Revision: D80717828

Pulled By: cbi42

fbshipit-source-id: d6aaa984f24e469aa8ddb81524dc156b85e891f2
Summary:
RocksDB currently aborts whenever `io_uring_wait_cqe` returns an error code. It also does not log what error code was returned.

While experimenting with `IO_URING`, my application crashed because of this.

I asked the Linux Kernel user group the best way to handle unsuccessful `io_uring_wait_cqe`.

It was recommended to retry on `EINTR`, `EAGAIN`, and `ETIME`. `ETIME` only happens when waiting with a timeout, so I am not handling it.

I also write to `stderr` so that we have some debugging information if we abort.
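The retry policy described above can be sketched generically. The wait callable below is a stand-in for `io_uring_wait_cqe` (which returns 0 on success or a negated errno value); the loop shape, not the liburing plumbing, is the point.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>

// Retry the wait on transient errors (EINTR, EAGAIN); surface
// anything else to the caller, who can log it before aborting.
// ETIME is not handled since we do not wait with a timeout.
int WaitWithRetry(const std::function<int()>& wait_cqe) {
  while (true) {
    int ret = wait_cqe();
    if (ret == 0) {
      return 0;  // got a completion
    }
    if (ret == -EINTR || ret == -EAGAIN) {
      continue;  // transient: retry the wait
    }
    return ret;  // real error
  }
}
```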

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13890

Test Plan: Unfortunately this is hard to cover through unit/stress tests. We have to see what sort of errors get encountered in production.

Reviewed By: anand1976

Differential Revision: D80639955

Pulled By: archang19

fbshipit-source-id: e3a230bd37552ec0f36be34e6a4e53cfd2a254f1
Summary:
I saw failure when added some asserts near b9957c991c/table/block_based/block_based_table_iterator.cc (L1201-L1205) in stress test. The decompression failed with error message like "Corruption: Failed zlib inflate: -3". This PR fixes the issue to use the right decompressor for dictionary compression.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13896

Test Plan: updated unit test that checks no I/O is done after Prepare(), this would fail before this change.

Reviewed By: anand1976

Differential Revision: D80821500

Pulled By: cbi42

fbshipit-source-id: a4322c0da99a2d10e9787d0ec168668567c0c19a
Summary:
`IngestExternalFileOptions::allow_db_generated_files` requires SST files to have zero sequence number. This PR opens it up for any DB generated SST files. Currently we don't do global sequence number assignment when `allow_db_generated_files` is true, so we require that files do not overlap with any key in the CF. One behavior difference is that now we allow ingesting overlapping files when `allow_db_generated_files` is true. Users need to ensure that files are ordered such that later files have more recent updates.

Intended follow ups:
- Record smallest seqno in table property, so that we don't need to scan the file for it.
- Cover allow_db_generated_files in crash test. We may create a new DB and ingest all files from a CF for verification.
- Add APIs that use allow_db_generated_files. For example, an API for ingesting SST files from a source CF, so that we take care of ingestion file ordering for the user. If we are already getting metadata from the source CF, we may use it as a hint for level placement instead of dividing input files into batches again (`ExternalSstFileIngestionJob::DivideInputFilesIntoBatches`).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13878

Test Plan: two new unit tests.

Reviewed By: hx235, xingbowang

Differential Revision: D80233727

Pulled By: cbi42

fbshipit-source-id: 74209386d8426c434bff2d9a734f06db537eb50c
Summary:
* Release notes from 10.6 branch
* Update version.h
* Add [10.6.fb](https://github.com/facebook/rocksdb/tree/10.4.fb) to check_format_compatible.sh
* No update to folly commit hash due to build failures

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13897

Reviewed By: mszeszko-meta

Differential Revision: D80971628

Pulled By: anand1976

fbshipit-source-id: a24dbe90b5c54f781b2d017497ea3a22fcf6e148
Summary:
**Context/Summary:**
Clear statistics reference from options_ to intentionally shorten the statistics object lifetime to be same as the db object (which is the common case in practice) and detect if RocksDB access the statistics beyond its lifetime.
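The lifetime idea can be sketched with standard smart pointers (simplified stand-ins, not RocksDB's actual Statistics/options plumbing): if the DB holds the only strong reference, releasing it on close makes any later access detectable through a weak_ptr observer.

```cpp
#include <cassert>
#include <memory>

struct Statistics {
  long tickers = 0;
};

// Hypothetical DB holding the sole strong reference to its statistics,
// so the statistics object's lifetime matches the DB's.
struct Db {
  std::shared_ptr<Statistics> stats = std::make_shared<Statistics>();
  void Close() { stats.reset(); }  // statistics dies with the DB
};
```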

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13899

Test Plan: - [Ongoing] Stress test rehearsal

Reviewed By: pdillinger

Differential Revision: D80985435

Pulled By: hx235

fbshipit-source-id: ab238231cd81f47fa451aea12a0c85fa11d9ac81
Summary:
**Context/Summary:**

RocksDB stress test verifies IOActivity is set correctly by reusing the passed-in Read/Write options through an assertion. This is too strict for APIs that do not take (or do not need to take) Read/WriteOptions, hence the assertion failure.
```
stderr:
 db_stress: ... db_stress_tool/db_stress_env_wrapper.h:24: void rocksdb::(anonymous namespace)::CheckIOActivity(const IOOptions &): Assertion `io_activity == Env::IOActivity::kUnknown || io_activity == options.io_activity' failed.
Received signal 6 (Aborted)
```
An example is `ManagedSnapshot snapshot_guard(db_);` in TestMultiScan().

This PR ignores such checks.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13898

Test Plan: The same command repro-ed this assertion failure passes after this fix

Reviewed By: archang19

Differential Revision: D80983214

Pulled By: hx235

fbshipit-source-id: d8b660f8c8771198bc7fa0e805c3e86d2584f03e
Summary:
**Context/Summary:**

DumpStats() does not skip dropped CFs and can run into a seg fault like the one below
```
2025-08-23T06:44:05.0469230Z [ RUN      ] FormatLatest/ColumnFamilyTest.LiveIteratorWithDroppedColumnFamily/0
2025-08-23T06:44:05.0470050Z Received signal 11 (Segmentation fault: 11)
2025-08-23T06:44:05.0470510Z #0   0x7000069305e0
2025-08-23T06:44:05.0471070Z #1   rocksdb::DBImpl::DumpStats() (in librocksdb.10.6.0.dylib) (db_impl.cc:1076)
```

This PR makes DumpStats() skip dropped CFs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13900

Test Plan:
- Deterministically repro-ed the seg fault before the fix and ensure it doesn't happen after the fix
```
 diff --git a/db/column_family_test.cc b/db/column_family_test.cc
index 3a2ca0617..f57d6f757 100644
 --- a/db/column_family_test.cc
+++ b/db/column_family_test.cc
@@ -2372,11 +2372,17 @@ TEST_P(ColumnFamilyTest, LiveIteratorWithDroppedColumnFamily) {
   int kKeysNum = 10000;
   PutRandomData(1, kKeysNum, 100);
   {
+    ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
+        {{"PostDrop", "BeforeAccessCFD"}, {"PostAccessCFD", "BeforeGo"}});
+
+    ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
     std::unique_ptr<Iterator> iterator(
         db_->NewIterator(ReadOptions(), handles_[1]));
     iterator->SeekToFirst();

     DropColumnFamilies({1});
+    TEST_SYNC_POINT("PostDrop");
+    TEST_SYNC_POINT("BeforeGo");

     // Make sure iterator created can still be used.
     int count = 0;
@@ -2386,6 +2392,9 @@ TEST_P(ColumnFamilyTest, LiveIteratorWithDroppedColumnFamily) {
     }
     ASSERT_OK(iterator->status());
     ASSERT_EQ(count, kKeysNum);
+
+    ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
+    ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
   }

   Reopen();
 diff --git a/db/db_impl/db_impl.cc b/db/db_impl/db_impl.cc
index a8e4f5f8f..a8a0499c0 100644
 --- a/db/db_impl/db_impl.cc
+++ b/db/db_impl/db_impl.cc
@@ -1073,8 +1073,10 @@ void DBImpl::DumpStats() {
         continue;
       }

-      auto* table_factory =
-          cfd->GetCurrentMutableCFOptions().table_factory.get();
+      TEST_SYNC_POINT("BeforeAccessCFD");
+      auto moptions = cfd->GetCurrentMutableCFOptions();
+      auto* table_factory = moptions.table_factory.get();
+      TEST_SYNC_POINT("PostAccessCFD");
       assert(table_factory != nullptr);
       // FIXME: need to a shared_ptr if/when block_cache is going to be mutable
       Cache* cache =
~
```

Reviewed By: archang19

Differential Revision: D81003739

Pulled By: hx235

fbshipit-source-id: bdf3c4cc45988f43e79ebc191a20af5b70ac289f
Summary:
Enable stress testing of deletion-triggered compaction.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13894

Test Plan:
```
 python3 -u tools/db_crashtest.py --simple whitebox --enable_compaction_on_deletion_trigger=true
```

Reviewed By: jaykorean

Differential Revision: D81175559

Pulled By: nmk70

fbshipit-source-id: c5128b7c1e2d07833b0e9385e04b342bc42c65cf
Summary:
The implementation of parallel compression has historically scaled rather poorly, or perhaps modestly with heavy compression, topping out around 3x throughput vs. serial and incurring big overheads in CPU consumption relative to the throughput.

This change addresses one source of that extra CPU consumption: stashing all the keys of a block for later processing into building index and filter blocks. Historically with parallel compression, the index and filter block updates were handled in the last stage of processing along with writing each data block to the file writer. This was because the index blocks needed to know the BlockHandle of the new data block, which could only be known after every preceding data block was compressed, to know the starting location for the BlockHandle. And because index and filter partitions were historically coupled (see decouple_partitioned_filters), filter updates had to happen at the same time.

Here we get rid of stashing the keys for later processing and the extra CPU associated with it, by
* Creating a two stage process of adding to index blocks ("prepare" and "finish" each entry; one entry per data block). The two stages must be executable in parallel for separate index entries. NOTE: not yet supported by UserDefinedIndex
* Requiring decouple_partitioned_filters=true for parallel compression, because we now add to filters in the first stage of processing when each key is readily available and we cannot couple that with finalizing index entries in the last stage of processing.

It might seem like adding to filters is something that is expensive (hashing etc.) and should be kept out of the bottle-neck first stage of processing (which includes walking the compaction iterator) but it's probably similar cost to simply stashing the keys away for later processing. (We might be able to reduce a bottle-neck by stashing hashes, but we're not to a point where that is worth the effort.)
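The two-stage index flow described above can be sketched as follows. Names are illustrative (not RocksDB's actual index-builder API): stage one "prepares" an entry with the separator key while the key is readily available, and a later stage "finishes" it once the data block's BlockHandle is known after compression.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct BlockHandle {
  uint64_t offset = 0;
  uint64_t size = 0;
};

struct IndexEntry {
  std::string separator_key;  // captured in the first stage
  BlockHandle handle;         // filled in the last stage
  bool finished = false;
};

// Hypothetical two-stage builder: Prepare/Finish for separate entries
// must be callable in parallel; each entry is finished exactly once,
// in file order, when its block's offset and size are known.
class TwoStageIndexBuilder {
 public:
  size_t PrepareEntry(std::string separator_key) {
    entries_.push_back({std::move(separator_key), {}, false});
    return entries_.size() - 1;
  }
  void FinishEntry(size_t idx, BlockHandle handle) {
    entries_[idx].handle = handle;
    entries_[idx].finished = true;
  }
  const std::vector<IndexEntry>& entries() const { return entries_; }

 private:
  std::vector<IndexEntry> entries_;
};
```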

And it makes sense to make two more simple public API updates in conjunction with this:
* Set decouple_partitioned_filters=true by default. No signs of problems in production.
* Mark parallel compression as production-ready. It's being thoroughly tested in the crash test, successfully, and in limited production uses.

Follow-up:
* Improve the threading/sychronization model of parallel compression for the next major efficiency improvement
* Consider supporting the parallel-compatible index building APIs with UserDefinedIndex, unless it's considered too dangerous to expect users to safely handle the multi-threading.
* (In a subsequent release) remove all the code associated with coupling filter and index partitions and mark the option as ignored.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13850

Test Plan:
for correctness, existing tests

## Performance Data

The "before" data here includes revert of https://github.com/facebook/rocksdb/issues/13828 for combined performance measurement of this change and that one.

```
SUFFIX=`tty | sed 's|/|_|g'`; for CT in lz4 zstd lz4; do for PT in 1 2 3 4 6 8; do echo "$CT pt=$PT"; (for I in `seq 1 1`; do BIN=/dev/shm/dbbench${SUFFIX}.bin; rm -f $BIN; cp db_bench $BIN; /usr/bin/time $BIN -db=/dev/shm/dbbench$SUFFIX --benchmarks=fillseq -num=30000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 -compression_type=$CT -compression_parallel_threads=$PT 2>&1 | tail -n 3 | head -n 2; done); done; done
```

To get a sense of the overall performance relative to number of parallel threads, we vary that with popular fast compression and popular heavier weight compression (some noise in this data, don't interpret each data point too strongly)

lz4 pt=1
2107431 -> 2112941 ops/sec (+0.3% - improvement)
(26.51 + 0.75) = 27.26 CPU sec -> (26.63 + 0.79) = 27.42 CPU sec (+0.6% - regression)
lz4 pt=2
1606660 -> 1580333 ops/sec (-1.6% - regression)
(47.10 + 8.37) = 55.47 CPU sec -> (45.05 + 9.23) = 54.28 CPU sec (-2.2% - improvement)
lz4 pt=3
1701353 -> 1889283 ops/sec (+11.1% - improvement)
(47.23 + 8.29) = 55.52 CPU sec -> (43.89 + 8.33) = 52.22 CPU sec (-6.0% - improvement)
lz4 pt=4
1651504 -> 1817890 ops/sec (+10.1% - improvement)
(48.07 + 8.31) = 56.38 CPU sec -> (44.77 + 8.45) = 53.22 CPU sec (-5.6% - improvement)
lz4 pt=6
1716099 -> 1888523 ops/sec (+10.1% - improvement)
(47.50 + 8.45) = 55.95 CPU sec -> (44.25 + 8.73) = 52.98 CPU sec (-5.3% - improvement)
lz4 pt=8
1696840 -> 1797256 ops/sec (+5.9% - improvement)
(48.09 + 8.61) = 56.70 CPU sec -> (45.90 + 8.68) = 54.58 CPU sec (-3.8% - improvement)

Clearly parallel threads do not help with fast compression like LZ4, but it's not as bad as it was before.

zstd pt=1
1214258 -> 1202863 ops/sec (-0.9% - regression)
(38.26 + 0.66) = 38.92 CPU sec -> (39.37 + 0.69) = 40.06 CPU sec (+2.9% - regression)
zstd pt=2
1194673 -> 1152746 ops/sec (-3.5% - regression)
(61.01 + 9.85) = 70.86 CPU sec -> (58.28 + 9.99) = 68.27 CPU sec (-3.7% - improvement)
zstd pt=3
1653661 -> 1825618 ops/sec (+10.4% - improvement)
(60.07 + 8.45) = 68.52 CPU sec -> (56.03 + 8.43) = 64.46 CPU sec (-5.9% - improvement)
zstd pt=4
1691723 -> 1890976 ops/sec (+11.8% - improvement)
(59.72 + 8.46) = 68.18 CPU sec -> (55.96 + 8.27) = 64.23 CPU sec (-5.7% - improvement)
zstd pt=6
1684982 -> 1900002 ops/sec (+12.8% - improvement)
(58.89 + 8.26) = 67.15 CPU sec -> (55.98 + 8.48) = 64.46 CPU sec (-4.0% - improvement)
zstd pt=8
1648282 -> 1892531 ops/sec (+14.8% - improvement)
(59.43 + 8.63) = 68.06 CPU sec -> (56.49 + 8.32) = 64.81 CPU sec (-4.8% - improvement)

The throughput is now able to increase by *more than half* with lots of parallelism, rather than only *about a third*.
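The percentage figures in these listings are plain relative deltas between the "before" and "after" runs; a tiny helper (hypothetical, not part of the benchmark harness) shows the arithmetic for the zstd pt=8 row:

```python
def rel_delta(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0

# zstd pt=8 throughput: 1648282 -> 1892531 ops/sec
print(f"{rel_delta(1648282, 1892531):+.1f}%")  # -> +14.8%

# zstd pt=8 CPU time: (59.43 + 8.63) -> (56.49 + 8.32) CPU sec
print(f"{rel_delta(59.43 + 8.63, 56.49 + 8.32):+.1f}%")  # -> -4.8%
```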

Scalability is a bit better with a higher compression level, and we still see a benefit from this change. (We've also enabled partitioned indexes and filters here, which see essentially the same benefits):

zstd pt=1 compression_level=7
595720 -> 597359 ops/sec (+0.3% - improvement)
(63.45 + 0.73) = 64.18 CPU sec -> (63.25 + 0.71) = 63.96 CPU sec (-0.3% - improvement)
zstd pt=4 compression_level=7
1527116 -> 1501779 ops/sec (-1.7% - regression)
(85.00 + 8.14) = 93.14 CPU sec -> (81.85 + 9.02) = 90.87 CPU sec (-2.5% - improvement)
zstd pt=6 compression_level=7
1678239 -> 1956070 ops/sec (+16.5% - improvement)
(83.77 + 8.11) = 91.88 CPU sec -> (79.87 + 7.78) = 87.65 CPU sec (-4.6% - improvement)
zstd pt=8 compression_level=7
1696132 -> 1953041 ops/sec (+15.1% - improvement)
(83.97 + 8.14) = 92.11 CPU sec -> (80.61 + 7.78) = 88.39 CPU sec (-4.1% - improvement)

With more tests, not really seeing any consistent differences with no parallelism (despite some micro-optimizations thrown in)

Reviewed By: hx235

Differential Revision: D79853111

Pulled By: pdillinger

fbshipit-source-id: 7a34fd7811217fb74fa6d3efaea7ffcce72beec7
Summary:
**Context/Summary:**
`ProcessKeyValueCompaction()` has grown too long to reason about or to add any logic for resuming from some key and saving progress for resumable compaction. This PR breaks the function into smaller functions. Almost all of the changes are cosmetic, except for one thing pointed out in the PR conversation below.

Specifically, this PR did the following:
- Added `SubcompactionInternalIterators`, `SubcompactionKeyBoundaries` and `BlobFileResources` to manage the lifetime of the local variables of the original function that are now used across the smaller functions
- Moved `AutoThreadOperationStageUpdater` and some IO stats measurement to a place that makes more sense

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13879

Test Plan: Existing UT

Reviewed By: jaykorean

Differential Revision: D80216092

Pulled By: hx235

fbshipit-source-id: 515615906e5e5fd5ec191bcdd4126f17d282cac2
Summary:
I am wanting to use std::counting_semaphore for something and the timing seems good to require C++20 support. The internets suggest:

* GCC >= 10 is adequate, >= 11 preferred
* Clang >= 10 is needed
* Visual Studio >= 2019 is adequate

And popular linux distributions look like this:
* CentOS Stream 9 -> GCC 11.2  (CentOS 8 is EOL)
* Ubuntu 22.04 LTS -> GCC 11.x  (Ubuntu 20 just ended standard support)
* Debian 12 (oldstable) -> GCC 12.2
  * (Debian 11 has ended security updates, uses GCC 10.2)

This required generating a new docker image based on Ubuntu 22 for CI using gcc. The existing Ubuntu 20 image works for covering appropriate clang versions (though we should maybe add a much later version as well, in the next increment of our Ubuntu 22 image; however the minimum available clang build from apt.llvm.org for Ubuntu 22 is clang 13).

Update to SetDumpFilter is to quiet a mysterious gcc-13 warning-as-error.

Removed --compile-no-warning-as-error from a cmake command line because cmake in the new docker image is too old for this option.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13904

Test Plan: CI, one minor unit test added to verify std::counting_semaphore works

Reviewed By: xingbowang

Differential Revision: D81266435

Pulled By: pdillinger

fbshipit-source-id: 26040eeccca7004416e29a6ff4f6ea93f2052684
Summary:
Add a new argument --random_seed to the script db_crashtest.py to allow reusing the same random seed to produce exactly the same test arguments. When the argument is missing, a random seed is generated and printed. When a developer wants to reproduce exactly the same setup, they can pass that seed via --random_seed. The examples below show running the command without and with the argument. All of the arguments are the same, except --db and --expected_values_dir, which do not use Python's random module.
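A minimal sketch of the seeding behavior (hypothetical names, not the actual db_crashtest.py code): seed one `random.Random` instance from `--random_seed` when supplied, otherwise from the OS, and print the seed so a failing run can be replayed exactly:

```python
import argparse
import random

def make_rng(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--random_seed", type=int, default=None)
    args, _ = parser.parse_known_args(argv)
    if args.random_seed is not None:
        seed = args.random_seed
    else:
        # Without --random_seed, fall back to an OS-generated seed.
        seed = random.SystemRandom().getrandbits(64)
    print(f"Start with random seed {seed}")
    return seed, random.Random(seed)

# Two runs with the same seed draw identical test arguments.
seed, rng = make_rng(["--random_seed", "17953760416546706382"])
```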

* Without --random_seed, a new seed is generated and printed.
```
[xbw@devvm16622.vll0 ~/workspace/ws1/rocksdb (crashtest)]$ /usr/local/bin/python3 -u tools/db_crashtest.py --stress_cmd=./db_stress --cleanup_cmd='' --cf_consistency blackbox --duration=960 --max_key=2500000
Start with random seed 17953760416546706382
Running blackbox-crash-test with
interval_between_crash=120
total-duration=960

Running db_stress with pid=2957716: ./db_stress --WAL_size_limit_MB=0 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --adm_policy=0 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=0 --allow_setting_blob_options_dynamically=1 --allow_unprepared_value=0 --async_io=1 --atomic_flush=1 --auto_readahead_size=0 --auto_refresh_iterator_with_snapshot=1 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --blob_cache_size=2097152 --blob_compaction_readahead_size=4194304 --blob_compression_type=zstd --blob_file_size=1073741824 --blob_file_starting_level=0 --blob_garbage_collection_age_cutoff=0.5 --blob_garbage_collection_force_threshold=0.75 --block_align=1 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=1 --bloom_bits=12 --bottommost_compression_type=none --bottommost_file_compaction_delay=3600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=1 --checkpoint_one_in=10000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=0 --compaction_readahead_size=1048576 --compaction_style=0 --compaction_ttl=100 --compress_format_version=1 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_manager=none --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 
--compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc=23:30-03:15 --data_block_index_type=0 --db=/tmp/rocksdb_crashtest_blackboxqishhgdc --db_write_buffer_size=0 --decouple_partitioned_filters=1 --default_temperature=kWarm --default_write_temperature=kCold --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=10000 --disable_wal=1 --dump_malloc_stats=1 --enable_blob_files=1 --enable_blob_garbage_collection=1 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=1 --enable_do_not_compress_roles=1 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=1 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=0 --exclude_wal_from_write_fault_injection=0 --expected_values_dir=/tmp/rocksdb_crashtest_expected_udz8mw68 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --file_temperature_age_thresholds= --fill_cache=0 --flush_one_in=1000 --format_version=4 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=2097152 --high_pri_pool_ratio=0 --index_block_restart_interval=1 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --ingest_wbwi_one_in=500 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=0 --log_file_time_to_roll=0 --log_readahead_size=16777216 --long_running_snapshots=1 
--low_pri_pool_ratio=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=2500000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=1048576 --memtable_avg_op_scan_flush_trigger=2 --memtable_insert_hint_per_batch=0 --memtable_max_range_deletions=100 --memtable_op_scan_flush_trigger=10 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=8 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=128 --min_blob_size=16 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_bottom_pri_threads=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=0 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=10000 --periodic_compaction_seconds=1000 --prefix_size=5 --prefixpercent=5 --prepopulate_blob_cache=1 --prepopulate_block_cache=1 --preserve_internal_time_seconds=36000 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=32 --read_fault_one_in=1000 --readahead_size=0 --readpercent=45 --recycle_log_file_num=0 --remote_compaction_worker_threads=0 --reopen=0 --report_bg_io_stats=1 --reset_stats_one_in=1000000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 
--secondary_cache_uri=compressed_secondary_cache://capacity=8388608;enable_custom_split_merge=true --set_options_one_in=1000 --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=2 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --table_cache_numshardbits=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=1 --test_ingest_standalone_range_deletion_one_in=0 --top_level_index_pinning=0 --track_and_verify_wals=0 --uncache_aggressiveness=211 --universal_max_read_amp=4 --universal_reduce_file_locking=0 --unpartitioned_pinning=0 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=1 --use_blob_cache=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=1 --use_multiscan=0 --use_put_entity_one_in=0 --use_shared_block_and_blob_cache=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=0 --verify_db_one_in=10000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=1048576 --write_dbid_to_manifest=0 --write_fault_one_in=0 --write_identity_file=1 --writepercent=35
```

* With --random_seed, the seed specified in the argument is used.
```
[xbw@devvm16622.vll0 ~/workspace/ws1/rocksdb (crashtest)]$ /usr/local/bin/python3 -u tools/db_crashtest.py --stress_cmd=./db_stress --cleanup_cmd='' --cf_consistency blackbox --duration=960 --max_key=2500000 --random_seed=17953760416546706382
Start with random seed 17953760416546706382
Running blackbox-crash-test with
interval_between_crash=120
total-duration=960

Running db_stress with pid=2959006: ./db_stress --WAL_size_limit_MB=0 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --adm_policy=0 --advise_random_on_open=1 --allow_data_in_errors=True --allow_fallocate=0 --allow_setting_blob_options_dynamically=1 --allow_unprepared_value=0 --async_io=1 --atomic_flush=1 --auto_readahead_size=0 --auto_refresh_iterator_with_snapshot=1 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=1 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --blob_cache_size=2097152 --blob_compaction_readahead_size=4194304 --blob_compression_type=zstd --blob_file_size=1073741824 --blob_file_starting_level=0 --blob_garbage_collection_age_cutoff=0.5 --blob_garbage_collection_force_threshold=0.75 --block_align=1 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=1 --bloom_bits=12 --bottommost_compression_type=none --bottommost_file_compaction_delay=3600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=8388608 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=0 --charge_filter_construction=0 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=1 --checkpoint_one_in=10000 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=0 --compaction_readahead_size=1048576 --compaction_style=0 --compaction_ttl=100 --compress_format_version=1 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=0 --compression_manager=none --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 --compression_type=none --compression_use_zstd_dict_trainer=1 
--compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc=23:30-03:15 --data_block_index_type=0 --db=/tmp/rocksdb_crashtest_blackbox0kxvhzbm --db_write_buffer_size=0 --decouple_partitioned_filters=1 --default_temperature=kWarm --default_write_temperature=kCold --delete_obsolete_files_period_micros=30000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=10000 --disable_wal=1 --dump_malloc_stats=1 --enable_blob_files=1 --enable_blob_garbage_collection=1 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_custom_split_merge=1 --enable_do_not_compress_roles=1 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=1 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=0 --exclude_wal_from_write_fault_injection=0 --expected_values_dir=/tmp/rocksdb_crashtest_expected_hhk9kcgo --fifo_allow_compaction=0 --file_checksum_impl=crc32c --file_temperature_age_thresholds= --fill_cache=0 --flush_one_in=1000 --format_version=4 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=2097152 --high_pri_pool_ratio=0 --index_block_restart_interval=1 --index_shortening=2 --index_type=0 --ingest_external_file_one_in=0 --ingest_wbwi_one_in=500 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=0 --log_file_time_to_roll=0 --log_readahead_size=16777216 --long_running_snapshots=1 
--low_pri_pool_ratio=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=2500000 --max_key_len=3 --max_log_file_size=0 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16777216 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=1048576 --memtable_avg_op_scan_flush_trigger=2 --memtable_insert_hint_per_batch=0 --memtable_max_range_deletions=100 --memtable_op_scan_flush_trigger=10 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=8 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=32 --metadata_write_fault_one_in=128 --min_blob_size=16 --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_bottom_pri_threads=1 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=0 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=10000 --periodic_compaction_seconds=1000 --prefix_size=5 --prefixpercent=5 --prepopulate_blob_cache=1 --prepopulate_block_cache=1 --preserve_internal_time_seconds=36000 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=32 --read_fault_one_in=1000 --readahead_size=0 --readpercent=45 --recycle_log_file_num=0 --remote_compaction_worker_threads=0 --reopen=0 --report_bg_io_stats=1 --reset_stats_one_in=1000000 --sample_for_compression=5 --secondary_cache_fault_one_in=0 
--secondary_cache_uri=compressed_secondary_cache://capacity=8388608;enable_custom_split_merge=true --set_options_one_in=1000 --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=2 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=10 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --table_cache_numshardbits=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=1 --test_ingest_standalone_range_deletion_one_in=0 --top_level_index_pinning=0 --track_and_verify_wals=0 --uncache_aggressiveness=211 --universal_max_read_amp=4 --universal_reduce_file_locking=0 --unpartitioned_pinning=0 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=1 --use_blob_cache=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=1 --use_multiscan=0 --use_put_entity_one_in=0 --use_shared_block_and_blob_cache=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=0 --verify_db_one_in=10000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=1048576 --write_dbid_to_manifest=0 --write_fault_one_in=0 --write_identity_file=1 --writepercent=35
```

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13906

Test Plan: stress test

Reviewed By: hx235

Differential Revision: D81201034

Pulled By: xingbowang

fbshipit-source-id: 0bb4e0cbcdcf2de9b730492342dcfa18f07e93d6
Summary:
**Context/Summary:**
A small change as titled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13891

Test Plan: - Existing UT and rehearsal stress test

Reviewed By: jaykorean

Differential Revision: D80588011

Pulled By: hx235

fbshipit-source-id: 6987e08a4855782305ad742eef6c0196da0d67ca
Summary:
Re-enabling Remote Compaction Stress Test with some changes to stress test feature combo sanitization changes

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13913

Test Plan:
Ran Meta Internal Tests for a few days

# Follow up
- Skip recovering from WAL in remote worker and re-enable WAL
- Investigate and fix races with Integrated BlobDB

Reviewed By: hx235

Differential Revision: D81509225

Pulled By: jaykorean

fbshipit-source-id: 949762c48ece0a25e3d0281e3510f1e7d3fe3667
Summary:
Fixing "Integrated BlobDB is currently incompatible with Remote Compaction" error

https://github.com/facebook/rocksdb/actions/runs/17417658959/job/49449586139

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13916

Test Plan: CI

Reviewed By: anand1976

Differential Revision: D81537676

Pulled By: jaykorean

fbshipit-source-id: f5e2c40cd498a17cb08486a1cb9404ccf1d812e0
Summary:
# Summary

Until we get WAL + Remote Compaction in Stress Test working, temporarily disable this

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13919

Test Plan: Meta Internal CI run

Reviewed By: anand1976

Differential Revision: D81605621

Pulled By: jaykorean

fbshipit-source-id: 6e1f9a0a7a0f27e7465512689b51364b63ef3e2b
Summary:
Add a new option `MultiScanArgs::max_prefetch_size` that limits the memory usage of per-file pinning of prefetched blocks. Note that this only accounts for compressed block sizes. This is intended as a stopgap until we implement some kind of global prefetch manager that limits global multiscan memory usage.
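The limiting logic can be sketched as a greedy cut-off over a file's candidate blocks (a hypothetical simplification; the real accounting lives in the block-based table reader and counts only compressed block sizes, as the option does):

```python
def blocks_to_prefetch(block_sizes, max_prefetch_size):
    """Return the prefix of blocks whose cumulative compressed size
    stays within max_prefetch_size; remaining blocks are read lazily."""
    picked, total = [], 0
    for size in block_sizes:
        if total + size > max_prefetch_size:
            break
        picked.append(size)
        total += size
    return picked

# With an 8 KB budget only the first two 3 KB blocks get pinned.
print(blocks_to_prefetch([3072, 3072, 3072, 3072], 8192))  # -> [3072, 3072]
```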

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13920

Test Plan: new unit test `./block_based_table_reader_test --gtest_filter="*MultiScanPrefetchSizeLimit/*"`

Reviewed By: xingbowang

Differential Revision: D81630629

Pulled By: cbi42

fbshipit-source-id: 9f66678915242fe1220620531a4b9fd22747cdea
Summary:
If user_defined_index_factory in BlockBasedTableOptions is configured and we try to open an SST file without the corresponding UDI (either during DB open or file ingestion), ignore a failure to load the UDI by default. If fail_if_no_udi_on_open in BlockBasedTableOptions is true, then treat it as a fatal error.
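The open-time policy reduces to a small decision; a sketch (hypothetical names and shape, not the actual table-open code):

```python
def check_udi_on_open(udi_factory_configured, file_has_udi, fail_if_no_udi_on_open):
    """Decide how to treat an SST file missing the user-defined index (UDI).

    Returns "ok" when the file can be opened, or raises when the stricter
    fail_if_no_udi_on_open option demands the UDI be present.
    """
    if not udi_factory_configured or file_has_udi:
        return "ok"
    if fail_if_no_udi_on_open:
        raise RuntimeError("SST file is missing the configured user-defined index")
    return "ok"  # default: ignore the failure to load the UDI
```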

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13921

Test Plan: Update unit tests

Reviewed By: xingbowang

Differential Revision: D81826054

Pulled By: anand1976

fbshipit-source-id: f4fe0b13ccb02b9448622af487680131e349c52b
Summary:
This diff adds logging in various places in the external file ingestion code where we check for non-OK status codes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13905

Test Plan: Debugging external file ingestion should be easier with additional logging.

Differential Revision: D81814033

Pulled By: archang19

fbshipit-source-id: 77f8b342cbad892acedc4603c02865c38886f2f4
Summary:
After running the stress test for over a week, we've identified more failures to fix. While we work on the fixes, disable remote compaction temporarily to reduce noise and avoid having these failures hide other failures.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13925

Test Plan: CI

Reviewed By: anand1976

Differential Revision: D81934248

Pulled By: jaykorean

fbshipit-source-id: 9ac11926429eebe1aebf7b520a548dc5987b7d76
Summary:
**Context/Summary:** This adds formal test coverage for statistics in our stress test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13926

Reviewed By: anand1976, jaykorean

Differential Revision: D81943762

Pulled By: hx235

fbshipit-source-id: 4186be0b35839976b7299667492d0cc722128a06
Summary:
A follow-up to https://github.com/facebook/rocksdb/issues/13904 which was incomplete in updating CI jobs to support C++20 because the C++20 usage was only in tests. Here we add subtle C++20 usage in the public API ("using enum" feature in db.h) to force the issue.

A lot of the work for this PR was in updating the Ubuntu22 docker image, for earlier compiler/runtime versions supporting C++20, and generating a new Ubuntu24 docker image, for later compiler/runtime versions. The Ubuntu22 image needed to be updated because there are incompatibilities with clang-13 + c++20 + libstdc++ for gcc 11, as seen in these examples:

```
#include <chrono>

int main(int argc, char *argv[]) {
  std::chrono::microseconds d = {}; return 0;
}
```

and

```
#include <coroutine>

int main() { return 0; }
```

The second was causing recurring failures in build-linux-clang-13-asan-ubsan-with-folly, now fixed.

So we have to install clang's libc++ to compile with clang-13. I haven't been able to get this to work with some of the libraries like benchmark, glog, and/or gflags, but I'm able to compile core RocksDB with clang-13. On this docker image, an extra compiler parameter is needed to compile with gcc and glog, because glog is built from source (perhaps imperfectly) since the Ubuntu package transitively conflicts with libc++.

The Ubuntu24 image seems to be low-drama and generally work for testing out newer compiler versions. The mingw build uses Ubuntu24 because the mingw package on Ubuntu22 uses a gcc version that is too old.

And the mass of other code changes are trying to work around new warnings, mostly from clang-analyze, which I upgraded to clang-18 in CI.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13915

Test Plan: CI, including temporarily including the nightly jobs in the PR jobs in earlier revisions to test and stabilize

Reviewed By: archang19

Differential Revision: D81933067

Pulled By: pdillinger

fbshipit-source-id: 7e33823006a79d5f3cf5bc1d625f0a3c08a7d74c
Summary:
PointLockManager manages point locks per key. The old implementation partitions the per-key locks into 16 stripes, each handling the point locks for a subset of keys. Each stripe has only one condition variable, which is shared by all the transactions waiting for their turn to acquire a lock on a key that belongs to that stripe.

In production, we noticed that when multiple transactions try to write to the same key, all of them wait on the same condition variable. When the previous lock holder releases the key, all of the transactions are woken up, but only one of them can proceed; the rest go back to sleep. This wastes a lot of CPU cycles. In addition, when other keys are being locked/unlocked on the same lock stripe, the problem becomes even worse.

To solve this issue, we implemented a new PerKeyPointLockManager that keeps a transaction waiter queue per key. When a transaction cannot acquire a lock immediately, it joins the waiter queue of the key and waits on a dedicated condition variable. When the previous lock holder releases the lock, it wakes up the next set of transactions from the waiting queue that are eligible to acquire the lock. The queue respects FIFO order, except that it prioritizes lock upgrade/downgrade operations.

However, this waiter queue change increases the deadlock detection cost, because the transactions waiting in the queue also need to be considered during deadlock detection. To resolve this, a new deadlock_timeout_us (microseconds) configuration is introduced in the transaction options. When a transaction is waiting on a lock, it joins the wait queue and waits for the duration configured by deadlock_timeout_us without performing deadlock detection. If the transaction has not obtained the lock by the time deadlock_timeout_us elapses, it then performs deadlock detection and waits until lock_timeout is reached. This optimization relies on the heuristic that the majority of transactions are able to get the lock without performing deadlock detection.

The deadlock_timeout_us configuration needs to be tuned per workload. If the likelihood of deadlock is very low, deadlock_timeout_us can be set a bit higher than the average transaction execution time, so that the majority of transactions acquire the lock without performing deadlock detection. If the likelihood of deadlock is high, deadlock_timeout_us can be set to a lower value, so that deadlocks are detected faster.

The new PerKeyPointLockManager is disabled by default. It can be enabled via TransactionDBOptions.use_per_key_point_lock_mgr. The deadlock_timeout_us setting is only effective when PerKeyPointLockManager is used. When deadlock_timeout_us is set to 0, a transaction performs deadlock detection immediately before waiting.
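A toy sketch of the per-key waiter queue idea (hypothetical, far simpler than the real PerKeyPointLockManager; it omits shared locks, upgrades, timeouts, and deadlock detection): each key keeps its own FIFO queue of waiters, each with a dedicated condition variable, so an unlock wakes only the head waiter instead of every transaction in a stripe.

```python
import threading
from collections import defaultdict, deque

class PerKeyLockTable:
    def __init__(self):
        self._mu = threading.Lock()
        self._holder = {}                   # key -> txn currently holding the lock
        self._waiters = defaultdict(deque)  # key -> FIFO queue of (txn, Condition)

    def lock(self, key, txn):
        with self._mu:
            if key not in self._holder:
                self._holder[key] = txn
                return
            # Each waiter gets its own condition variable (built on the shared
            # table mutex), so a release wakes exactly one queued transaction.
            cv = threading.Condition(self._mu)
            self._waiters[key].append((txn, cv))
            while self._holder.get(key) != txn:
                cv.wait()

    def unlock(self, key, txn):
        with self._mu:
            assert self._holder.get(key) == txn
            if self._waiters[key]:
                next_txn, cv = self._waiters[key].popleft()
                self._holder[key] = next_txn  # hand off the lock in FIFO order
                cv.notify()                   # wake only the head waiter
            else:
                del self._holder[key]
```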

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13731

Test Plan:
Unit test.
Stress unit test that validates deadlock detection and exclusive, shared lock guarantee.
A new point_lock_bench binary is created to help perform performance test.

Reviewed By: pdillinger

Differential Revision: D77353607

Pulled By: xingbowang

fbshipit-source-id: 21cf93354f9a367a78c8666596ed14013ac7240b
Summary:
There are some internal use cases that do not map cleanly onto the existing `IOActivity` enums. This PR creates new custom IOActivity types that internal users can use as they see fit.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13924

Test Plan: Wrote a simple unit test

Reviewed By: pdillinger

Differential Revision: D82029992

Pulled By: archang19

fbshipit-source-id: a3e23c360baa96cd2e9adf570e71c6e43947bfc8
Summary:
Add copyright notice to any_lock_manager_test.h

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13930

Reviewed By: xingbowang

Differential Revision: D82035581

Pulled By: anand1976

fbshipit-source-id: 2275f7c8b41fbd4384bdae011d244bfa117225f7
Summary:
Fix broken build in PointLockManager change with C++20

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13933

Test Plan: CI

Reviewed By: pdillinger

Differential Revision: D82073490

Pulled By: xingbowang

fbshipit-source-id: 0bd4936fe0a27a28db61ca5f23d3bea90bce73ef
Summary:
... and associated statistics, etc. Someone needs it, so here it is.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13927

Test Plan: Updated / extended / added some unit tests

Reviewed By: cbi42

Differential Revision: D81981469

Pulled By: pdillinger

fbshipit-source-id: 52558c08741890b781310906acbc18d9eb479363
Summary:
Fix an uninitialized-value complaint from valgrind caused by gtest printing a padded struct.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13934

Test Plan: CI. Verified that valgrind no longer complains about it.

Reviewed By: pdillinger

Differential Revision: D82124983

Pulled By: xingbowang

fbshipit-source-id: 99eb7bab99726c45affe0a231777e5951844d73b
Summary:
We should add error logging to be able to pinpoint why RocksDB is returning status `NotSupported` for `ReadAsync`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13936

Test Plan: Look at logs (and client logs of error status)

Reviewed By: anand1976

Differential Revision: D82141529

Pulled By: archang19

fbshipit-source-id: c71b70967457be35ef5168321d449f96b2b9441d
Summary:
**Context/Summary:**
Internally `CompactionJobStats ::num_input_records` is only used for input record count [verification](1aca60c089/db/compaction/compaction_job.cc (L2535)) and such verification always checks for `CompactionJobStats::has_num_input_records` (now renamed) before using this field. This is needed because the `CompactionJobStats::num_input_records` gets its number from `CompactionIterator::NumInputEntryScanned()` in a subcompaction and this number can be inaccurate purposefully to increase performance, see [CompactionIterator::must_count_input_entries](https://github.com/facebook/rocksdb/pull/13929/files#diff-e6c876f655a21865c0f3dff94b9763f1bd40cf88a8a86f04868201b2e845a890R186-R199) for more.
- This PR renames the `CompactionJobStats::has_num_input_records` to more explicit naming and adds more comments. Not a behavior change.

Also, aggregation of `CompactionJobStats::has_num_input_records` among all subcompactions is done by an [AND](1aca60c089/util/compaction_job_stats_impl.cc (L62)) operation, so it is false if any subcompaction has this field set to false. The default value of this field should be "true" so the aggregate is not mistakenly "false" by default. We are currently fine because `CompactionJobStats::Reset()`, which [sets the value to true](1aca60c089/util/compaction_job_stats_impl.cc (L14)), is always called before such aggregation.
 - This PR changes the default value to be true.
 - Resumable compaction development plans to set `CompactionJobStats::has_num_input_records` to false if the previous compaction carries inaccurate records. In order for this not to be overwritten by the subsequent progress in [here](1aca60c089/db/compaction/compaction_job.cc (L1540-L1543)), this PR also changes the plain assignment (=) to an AND operation and +=. With the default value of `CompactionJobStats::has_num_input_records` now true (or Reset() already called) and `CompactionJobStats::num_input_records` already 0, this is not a behavior change.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13929

Test Plan: - Existing UT to test "...changes the default value to be true" is safe.

Reviewed By: jaykorean

Differential Revision: D82014912

Pulled By: hx235

fbshipit-source-id: 6f211c3b2c9eb7d39abf37271d21a4d3f407b934
Summary:
This PR enables Stress Test to fall back to local compaction when a remote compaction fails, allowing the compaction to be retried on the main thread.

If the local compaction succeeds, the stress test will continue without failing. The main thread will log that the remote compaction failed and was retried locally, while detailed failure logs from the remote compaction attempt will still be printed by the worker thread for further investigation.

This approach allows us to keep collecting useful logs for diagnosing remote compaction failures in Stress Test, while ensuring the test continues to run with remote compaction enabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13945

Test Plan:
```
python3 -u tools/db_crashtest.py --cleanup_cmd='' --simple blackbox --remote_compaction_worker_threads=8 --interval=10
```

# Internal Only

https://www.internalfb.com/sandcastle/workflow/1315051091202224133

https://www.internalfb.com/sandcastle/workflow/3382203320165521367

https://www.internalfb.com/sandcastle/workflow/2616591383512372892

https://www.internalfb.com/sandcastle/workflow/4607182418810099066

Reviewed By: hx235

Differential Revision: D82279337

Pulled By: jaykorean

fbshipit-source-id: 6f663ec2eeb642fd4ad885a90efb344432a32f89
Summary:
**Context/Summary:**
Fix a race condition (illustrated below) in FIFO size-based compaction where concurrent threads could select the same non-L0 file, causing assertion failures in debug builds or "Cannot delete table file from LSM tree" errors in release builds.
```
Thread 1                           Thread 2
--------                           --------
FIFO size-based compaction
   ↓
Pick L2 file
   ↓
Mark: file.being_compacted = true (file.being_compacted was false)
   ↓
WriteManifestStart (unlock mutex) ─→ FIFO size-based compaction starts
   ↓                                   ↓
Continue manifest write...          Pick SAME L2 file
   ↓                                Mark: file.being_compacted = true  (file.being_compacted was true) 
   ↓                                   ↓
   ↓                                Unlock mutex, wait for manifest
   ↓                                   ↓
Lock mutex ←─────────────────────────────────┘
   ↓
Delete L2 file 
   ↓
Complete ─────────────────────────────→ Try delete same file 
                                        ↓
                                     ERROR: "file not in LSM tree"

🐛 BUG: Both threads pick the same file!
    Thread 2 doesn't properly check file.being_compacted flag
```

**Test**
New test that fails before the fix and passes after

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13946

Reviewed By: xingbowang

Differential Revision: D82279731

Pulled By: hx235

fbshipit-source-id: b426517f2d1b23dd7d4951157822a2d322fe1435
Summary:
We saw some crash test failures at f46242cef6/table/block_based/block_based_table_iterator.cc (L964-L965). This was likely due to timestamps not being considered properly in some places in the MultiScan code paths. This PR fixes the issue.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13938

Test Plan: crash test with timestamp and multiscan: `python3 -u ./tools/db_crashtest.py whitebox --enable_ts --iterpercent=60 --prefix_size=-1 --prefixpercent=0 --readpercent=0 --test_batches_snapshots=0 --use_multiscan=1 --read_fault_one_in=0 --kill_random_test=88888 --interval=60`

Reviewed By: anand1976

Differential Revision: D82175263

Pulled By: cbi42

fbshipit-source-id: 5d40ede1aec15f8faeaa7fd041b939e68611ff73
Summary:
Complete redo of parallel compression in block_based_table_builder.cc to greatly reduce cross-thread hand-off and blocking. A ring buffer of blocks-in-progress is used to essentially bound working memory while enabling high throughput. Unlike before, all threads can participate in compression work, for a kind of work-stealing algorithm that reduces the need for threads to block. This builds on improvements in https://github.com/facebook/rocksdb/pull/13850

Previously, there was either
* parallel_threads==1, the *emit thread* (caller from flush/compaction) doing all the work
* parallel_threads > 1, the emit thread generates uncompressed blocks, `parallel_threads` worker threads compress blocks, and a writer thread writes to the SST file. Total of `parallel_threads + 2` threads participating. (Other bookkeeping in emit and write steps omitted from description for simplicity.)

Now we have either
* parallel_threads==1 (same), the emit thread doing all the work
* parallel_threads > 1, the emit thread generates uncompressed blocks and can take up compression work when the ring buffer is full; `parallel_threads` worker threads have as their top priority to write compressed blocks to the SST file but also take up compression work in priority order of next-to-write. Total of `parallel_threads + 1` threads participating. In some cases, this could result in less throughput than before, but arguably the previous implementation was using more threads than explicitly allowed.

## Future/alternate considerations
Although we could likely have used some framework for micro-work sharing across threads, that could be difficult with the asymmetry of work loads and thread affinity. Specifically, (a) it would be quite challenging to allow emit work in other threads, because it happens in the caller of BlockBasedTableBuilder, (b) async programming is unlikely to pay off until we have an async interface for writing SST files, and (c) this implementation will nevertheless serve as a benchmark for what we lose or gain in such a framework vs. a hand-tuned system.

This implementation still creates and destroys threads for each SST file created. We hope in the future to have more governance and/or pooling of worker threads across various flushes and compactions, but that is not available currently and would require significant design and implementation work.

## More details
* This implementation makes use of semaphores for idling and re-waking threads. `std::counting_semaphore` and `binary_semaphore` offer the best performance (see benchmark results below) but some implementations are known to have correctness bugs. Also, my attempt at upgrading CI for C++20 support (required for these) in https://github.com/facebook/rocksdb/pull/13904 is actually incomplete. Therefore, using these structures is opt-in with `-DROCKSDB_USE_STD_SEMAPHORES` at compile time, and a naive semaphore implementation based on mutex and condvar is used by default. A folly alternative (folly::fibers::Semaphore) was dropped in during development and found to be less efficient than the naive implementation. One CI job is upgraded to test with the new opt-in.
* One of the biggest concerns about correctness/reliability for this implementation is the possibility of hitting a deadlock, in part because that is not well checked in the DB crash test (a challenging problem!). Note also that with the parallel compression improvements in this release, I am calling the feature production-ready, so there is an extra level of confidence needed in the reliability of the feature. Thus, for DEBUG builds including crash test, I have added a watchdog thread to each parallel SST construction that heuristically checks for the most likely kinds of deadlock that could happen, including for the case of buggy semaphore implementations. It periodically verifies that some thread is outside of its "idle" state, and if the watchdog wakes up repeatedly to see all live threads stuck in their idle state (even if wake-up was attempted) then it declares a deadlock. This feature was manually verified for several seeded deadlock bugs. (More details in code comments.)
* For CPU efficiency, this implementation greatly simplifies the logic to estimate the outstanding or "inflight" size not yet written to the SST file. I expect this size to generally be insignificant relative to the full SST file size so is not worth careful engineering. And based on Meta's current needs, landing under-size for an SST file is better than over-size. See comments on `estimated_inflight_size` for details.
* Some other existing atomics in block_based_table_builder.cc modified to use safe atomic wrappers.
* Status handling in BlockBasedTableBuilder was streamlined to get rid of essentially redundant `status`+`io_status` fields and associated code. Made small optimizations to reduce unnecessary IOStatus copies (with StatusOk()) and mark status conditional branches as LIKELY or UNLIKELY.
* Prefer inline field initialization to initialization in constructor.
* Minimize references to the `parallel_threads` configuration parameter for better separation of concerns / sanitization / etc.  For example, use non-nullity of `pc_rep` to indicate that parallel compression is enabled (and active).
* Some other refactoring to aid the new implementation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13910

Test Plan:
## Correctness
Already integrated into unit tests and crash test. CI updated for opt-in semaphore implementation. Basic semaphore unit tests added/updated.

As for the tremendous simplification of logic relating to hitting target SST file size, as expected, the new behavior could under-shoot the single-threaded behavior by a small number of blocks, which will typically affect the file size by ~1/1000th or less. I think that's a good trade-off for cutting out unnecessarily complex code with non-trivial CPU cost (FileSizeEstimator).
```
./db_bench -db=/dev/shm/dbbench_filesize_after8 -benchmarks=fillseq,compact -num=10000000 -compression_type=zstd -compression_level=8 -compression_parallel_threads=8
```

Before, PT=8 & PT=1, and After PT=1 the same or very similar
```
-rw-r--r-- 1 peterd users 67474097 Sep 12 15:32 000052.sst
-rw-r--r-- 1 peterd users 67474214 Sep 12 15:32 000053.sst
-rw-r--r-- 1 peterd users 67473834 Sep 12 15:32 000054.sst
-rw-r--r-- 1 peterd users 67473437 Sep 12 15:32 000055.sst
-rw-r--r-- 1 peterd users 67473835 Sep 12 15:32 000056.sst
-rw-r--r-- 1 peterd users 67473204 Sep 12 15:33 000057.sst
-rw-r--r-- 1 peterd users 67473294 Sep 12 15:33 000058.sst
-rw-r--r-- 1 peterd users 67473839 Sep 12 15:33 000059.sst
```

After, PT=8 (worst case here ~0.05% smaller)
```
-rw-r--r-- 1 peterd users 67463189 Sep 12 14:55 000052.sst
-rw-r--r-- 1 peterd users 67465233 Sep 12 14:55 000053.sst
-rw-r--r-- 1 peterd users 67466822 Sep 12 14:55 000054.sst
-rw-r--r-- 1 peterd users 67466221 Sep 12 14:55 000055.sst
-rw-r--r-- 1 peterd users 67441675 Sep 12 14:55 000056.sst
-rw-r--r-- 1 peterd users 67467855 Sep 12 14:55 000057.sst
-rw-r--r-- 1 peterd users 67455132 Sep 12 14:55 000058.sst
-rw-r--r-- 1 peterd users 67458334 Sep 12 14:55 000059.sst
```

## Performance, modest load
We are primarily interested in balancing throughput in building SST files and CPU usage in doing so. (For example, we could maximize throughput by having worker threads only spin waiting for work, but that would likely be extra CPU usage we want to avoid to allow other productive CPU work to be scheduled.) No read path code has been touched.

A benchmark script running "before" and "after" configurations at the same time to minimize random machine load effects:
```
$ SUFFIX=`tty | sed 's|/|_|g'`; for CT in none lz4 zstd; do for PT in 1 2 3 4 6 8; do echo -n "$CT pt=$PT -> "; (for I in `seq 1 10`; do BIN=/tmp/dbbench${SUFFIX}.bin; rm -f $BIN; cp db_bench $BIN; /usr/bin/time $BIN -db=/dev/shm/dbbench$SUFFIX --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 -compression_type=$CT -compression_parallel_threads=$PT 2>&1; done) | awk '/micros.op/ {n++; sum += $5;} /system / { cpu += $1 + $2; } END { print "ops/s: " int(sum/n) " cpu*s: " cpu; }'; done; done
```

Before this change:
```
none pt=1 -> ops/s: 1999603 cpu*s: 72.08
none pt=2 -> ops/s: 1871094 cpu*s: 148.3
none pt=3 -> ops/s: 1882907 cpu*s: 147.7
lz4  pt=1 -> ops/s: 1987858 cpu*s: 94.74
lz4  pt=2 -> ops/s: 1590192 cpu*s: 182.65
lz4  pt=3 -> ops/s: 1896294 cpu*s: 174.7
lz4  pt=4 -> ops/s: 1949174 cpu*s: 172.26
lz4  pt=6 -> ops/s: 1912517 cpu*s: 175.91
lz4  pt=8 -> ops/s: 1930585 cpu*s: 176.71
zstd pt=1 -> ops/s: 1239379 cpu*s: 129.85
zstd pt=2 -> ops/s: 1171742 cpu*s: 226.12
zstd pt=3 -> ops/s: 1832574 cpu*s: 214.21
zstd pt=4 -> ops/s: 1887124 cpu*s: 212.51
zstd pt=6 -> ops/s: 1920936 cpu*s: 211.7
zstd pt=8 -> ops/s: 1885544 cpu*s: 214.87
```

After this change:
```
none pt=1 -> ops/s: 1964361 cpu*s: 72.66
none pt=2 -> ops/s: 1914033 cpu*s: 104.95
none pt=3 -> ops/s: 1978567 cpu*s: 100.24
lz4  pt=1 -> ops/s: 2041703 cpu*s: 92.88
lz4  pt=2 -> ops/s: 1903210 cpu*s: 121.64
lz4  pt=3 -> ops/s: 1973906 cpu*s: 122.22
lz4  pt=4 -> ops/s: 1952605 cpu*s: 123.05
lz4  pt=6 -> ops/s: 1957524 cpu*s: 124.31
lz4  pt=8 -> ops/s: 1986274 cpu*s: 129.06
zstd pt=1 -> ops/s: 1233748 cpu*s: 130.43
zstd pt=2 -> ops/s: 1675226 cpu*s: 158.41
zstd pt=3 -> ops/s: 1929878 cpu*s: 159.77
zstd pt=4 -> ops/s: 1916403 cpu*s: 160.99
zstd pt=6 -> ops/s: 1942526 cpu*s: 166.21
zstd pt=8 -> ops/s: 1966704 cpu*s: 171.56
```

For parallel_threads=1, results are very similar, as expected.

For parallel_threads>1, throughput is usually improved a bit, but cpu consumption is dramatically reduced. For zstd, maximum throughput is essentially achieved with pt=3 rather than the previous roughly pt=4 to 6. And the old used about 30% more CPU.

We can also compare with more expensive compression by raising the compression level.
```
SUFFIX=`tty | sed 's|/|_|g'`; CT=zstd; for CL in 4 6 8; do for PT in 1 4 8; do echo -n "$CT@$CL pt=$PT -> "; (for I in `seq 1 10`; do BIN=/tmp/dbbench${SUFFIX}.bin; rm -f $BIN; cp db_bench $BIN; /usr/bin/time $BIN -db=/dev/shm/dbbench$SUFFIX --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 -compression_type=$CT -compression_parallel_threads=$PT -compression_level=$CL 2>&1; done) | awk '/micros.op/ {n++; sum += $5;} /system / { cpu += $1 + $2; } END { print "ops/s: " int(sum/n) " cpu*s: " cpu; }'; done; done
```

Before:
```
zstd@4 pt=1 -> ops/s:  883630 cpu*s: 161.12
zstd@4 pt=4 -> ops/s: 1878206 cpu*s: 243.25
zstd@4 pt=8 -> ops/s: 1885002 cpu*s: 245.89
zstd@6 pt=1 -> ops/s:  710767 cpu*s: 189.44
zstd@6 pt=4 -> ops/s: 1706377 cpu*s: 277.29
zstd@6 pt=8 -> ops/s: 1866736 cpu*s: 275.07
zstd@8 pt=1 -> ops/s:  529047 cpu*s: 237.87
zstd@8 pt=4 -> ops/s: 1401379 cpu*s: 330.61
zstd@8 pt=8 -> ops/s: 1895601 cpu*s: 321.59
```

After:
```
zstd@4 pt=1 -> ops/s:  889905 cpu*s: 161.03
zstd@4 pt=4 -> ops/s: 1942240 cpu*s: 193.18
zstd@4 pt=8 -> ops/s: 1922367 cpu*s: 205.21
zstd@6 pt=1 -> ops/s:  713870 cpu*s: 188.91
zstd@6 pt=4 -> ops/s: 1832314 cpu*s: 219.66
zstd@6 pt=8 -> ops/s: 1949631 cpu*s: 229.34
zstd@8 pt=1 -> ops/s:  530324 cpu*s: 238.02
zstd@8 pt=4 -> ops/s: 1479767 cpu*s: 271.65
zstd@8 pt=8 -> ops/s: 1949631 cpu*s: 275.6
```

And we can also look at the cumulative effect of this change and  https://github.com/facebook/rocksdb/pull/13850 that will combine for the parallel compression improvements in the upcoming 10.7 release:

Before both:
```
lz4 pt=1 -> ops/s: 1954445 cpu*s: 95.14
lz4 pt=3 -> ops/s: 1687043 cpu*s: 186.62
lz4 pt=5 -> ops/s: 1708196 cpu*s: 188.33
zstd pt=1 -> ops/s: 1220649 cpu*s: 131.2
zstd pt=3 -> ops/s: 1658100 cpu*s: 227.08
zstd pt=5 -> ops/s: 1685074 cpu*s: 226.08
```

After:
```
lz4 pt=1 -> ops/s: 2048214 cpu*s: 93.24
lz4 pt=3 -> ops/s: 1922049 cpu*s: 122.9
lz4 pt=5 -> ops/s: 1980165 cpu*s: 122.49
zstd pt=1 -> ops/s: 1245165 cpu*s: 128.84
zstd pt=3 -> ops/s: 1956961 cpu*s: 158.73
zstd pt=5 -> ops/s: 1970458 cpu*s: 161.02
```

In summary, before with zstd default level, you could see only
* about 38% increase in throughput for about 73% increase in CPU usage

Now you can get
* about 58% increase in throughput for about 25% increase in CPU usage

## Performance, high load
To validate this for usage on remote compaction workers, we also need to test whether it falls over at high load or anything concerning like that. For this I did a lot of testing with concurrent db_bench and zstd compression_level=8 and parallel_thread (PT) in {1,8} trying to observe "bad" behaviors such as stalls due to preempted threads and such. On a 166 core machine where a "job" is a db_bench process running a fillseq benchmark similar to above in parallel with others, I could summarize the results like this:

10 jobs PT=8 vs. PT=1 -> 12% more CPU usage, 75% reduction in wall time, 1.9 jobs/sec (vs. 0.5)
50 jobs PT=8 vs. PT=1 -> 89% more CPU usage, 27% reduction in wall time, 3.1 jobs/sec (vs. 2.3)
100 jobs PT=8 vs. PT=1 -> 24% more CPU usage, 5% reduction in wall time, 3.25 jobs/sec (vs. 3.1)
150 jobs PT=8 vs. PT=1 -> 4% more CPU usage, 2% increase in wall time, 3.3 jobs/sec (vs. 3.4)
500 jobs PT=8 vs. PT=1 -> 1% more CPU usage, insignificant difference in wall time, 3.3 jobs/sec

Even when there are 4000 threads potentially competing for 166 cores, the throughput (3.3 jobs / sec) is still very close to maximum (3.4). Enabling parallel compression didn't result in notably less throughput (based on wall clock time for all jobs to complete) in any case tested above, and much higher throughput for many cases. If parallel compression causes us to tip from comfortably under-saturating to over-saturating the cores (as in the 50 jobs case), the overall CPU usage can be much higher, presumably due to lower CPU cache hit rates and maybe clock throttling, but parallel compression still has the throughput advantage in those cases.

In other words, what would we stand to gain from being able to intelligently share worker threads between compaction jobs? It doesn't seem that much.

Reviewed By: xingbowang

Differential Revision: D81365623

Pulled By: pdillinger

fbshipit-source-id: 5db5151a959b5d25b84dbe185bc208bd188f2d1c
Summary:
Add a MultiScanArgs::use_async_io option and an implementation that uses ReadAsync() for multiscan. Read requests are submitted during Prepare() and polled during the actual scan.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13932

Test Plan:
- updated existing unit test to use async_io.
- crash test: `python3 -u ./tools/db_crashtest.py whitebox --iterpercent=60 --prefix_size=-1 --prefixpercent=0 --readpercent=0 --test_batches_snapshots=0 --use_multiscan=1 --read_fault_one_in=0 --kill_random_test=88888 --interval=60 --multiscan_use_async_io=1 --mmap_read=0`

Benchmark:
- Default multiscan benchmark:
```
Set up: /db_bench --benchmarks="fillseq,compact" --disable_wal=1 --threads=1 --num_levels=1 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=1000 --write_buffer_size=268435456

Without async IO:
./db_bench --db="/tmp/rocksdbtest-543376/dbbench" --use_existing_db=1 --benchmarks=multiscan --disable_auto_compactions=1 --seek_nexts=100 --threads=32 --duration=10 --statistics=1 --use_direct_reads=1 --multiscan_use_async_io=0

multiscan    :     415.569 micros/op 75805 ops/sec 10.355 seconds 784968 operations; (multscans:24999)
rocksdb.read.async.micros COUNT : 0

With async IO:
./db_bench --db="/tmp/rocksdbtest-543376/dbbench" --use_existing_db=1 --benchmarks=multiscan --disable_auto_compactions=1 --seek_nexts=100 --threads=32 --duration=10 --statistics=1 --use_direct_reads=1 --multiscan_use_async_io=1

multiscan    :     413.236 micros/op 76044 ops/sec 10.375 seconds 788968 operations; (multscans:24999)
rocksdb.read.async.micros COUNT : 3916499

Similar performance.
```

- Larger scan, more scans per multiscan, do not coalesce IO so that async IO can progress while scanning, and use one thread:
```
multiscan_stride = 1000
multiscan_size = 100
seek_nexts = 1000

./db_bench --db="/tmp/rocksdbtest-543376/dbbench" --use_existing_db=1 --benchmarks=multiscan --disable_auto_compactions=1 --threads=1 --duration=10 --statistics=0 --use_direct_reads=1  --cache_size=2097152 --multiscan_size=100 --multiscan_stride=1000 --seek_nexts=1000 --seed=1 --multiscan_coalesce_threshold=0  --multiscan_use_async_io=0

Without async IO:
multiscan    :   20495.205 micros/op 48 ops/sec 10.002 seconds 488 operations; (multscans:488)

With async IO:
multiscan    :   18337.883 micros/op 54 ops/sec 10.013 seconds 546 operations; (multscans:546)

~10% improvement in throughput
```

Reviewed By: xingbowang

Differential Revision: D82077818

Pulled By: cbi42

fbshipit-source-id: 66e32cf4039183c4841827409286dfbaa6dfbcd8
Summary:
... reporting false positive double-lock on some of the new parallel compression code. Switching from std::condition_variable to condition_variable_any simply changes the FP from double-lock to lock inversion. In addition, leaking ParallelCompressionRep instances to avoid memory location reuse fails to fix the FP reports. Thus, I've decided to disable the watchdog with GCC+TSAN.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13958

Test Plan: local crash test runs could reproduce, now don't reproduce. CLANG TSAN doesn't seem to be reporting the same supposed issues

Reviewed By: xingbowang

Differential Revision: D82555968

Pulled By: pdillinger

fbshipit-source-id: 537fbc3a787f917915a6faf0bdedd1449a7f378a
Summary:
The stress test runs concurrent transactions through many threads at the same time on a shared key space. It is possible that a deadlock or a timeout is detected by the TransactionDB layer. When this happens, simply return from the function and continue the test instead of failing it.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13950

Test Plan: Stress test passes locally with the same random seed from stress test 14723229280871643749.

Reviewed By: hx235

Differential Revision: D82373959

Pulled By: xingbowang

fbshipit-source-id: 5d72e89998171c5844fb22f13d8f061f81014c7d
Summary:
* Fix compaction/flush CPU usage stats to include CPU usage by parallel compression workers. (Validated with manual db_bench testing.)
* Disable the parallel compression framework when compression is disabled. See new code comment for details, because in theory it could be useful to hide SST write latency, but manual testing with db_bench and -rate_limiter_bytes_per_sec or -simulate_hdd options shows no useful increase in throughput, just more CPU usage.
* Fix some minor clean-up items in the implementation

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13959

Test Plan: Also ran some tests like in https://github.com/facebook/rocksdb/issues/13910 to ensure the new CPU usage tracking did not regress performance, all good.

Reviewed By: xingbowang

Differential Revision: D82556686

Pulled By: pdillinger

fbshipit-source-id: 77c522159a7e6ab0ab6f7fb1d662070a46661557
Summary:
This PR fixes a bug in BlockBasedTableIterator::Prepare in conjunction with a user defined index (UDI). If the UDI determines a scan range to be empty and thus returns the kOutOfBound iteration result during Seek, the iteration result is not propagated up and Prepare() assumes end of file and aborts the remaining scans. This results in incorrect behavior and unpredictable multi scan results.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13960

Test Plan: Add unit test to table_test.cc

Reviewed By: xingbowang

Differential Revision: D82590892

Pulled By: anand1976

fbshipit-source-id: 8cfaaae2bb1a9509ddf8ec967cb8a8801748413d
Summary:
When ingesting a DB-generated file with non-zero sequence numbers, we need the smallest seqno of each file for its file metadata. To avoid a full table scan, we record this information in a table property and use it during file ingestion.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13942

Test Plan: new unit test and updated existing unit test.

Reviewed By: hx235

Differential Revision: D82331802

Pulled By: cbi42

fbshipit-source-id: 3009a6801ca7092cd0fde33692db1a13567068a9
Summary:
After seeing more people hit issues with thrashing small LRUCache shards and AutoHCC running fully in production for a while on a very large service, here I make these updates:

* In the public API, mark the case of `estimated_entry_charge = 0` (which is how you select AutoHCC) as production-ready and generally preferred. That means devoting a lot less space to how to tune FixedHCC (`estimated_entry_charge > 0`) because it is not generally recommended anymore even though in theory it is the fastest (conditional on a fragile configuration).
* In the public API, add more detail about potential problems with LRUCache and explicitly endorse HCC.
* When a default block cache is created, use AutoHCC instead of LRUCache. It's still a 32MB cache but that's just one cache shard for AutoHCC so the risk of issues with small cache shards is dramatically reduced. And a single AutoHCC shard is still essentially wait-free.
* Improve the handling of the hypothetical scenario of a failed anonymous mmap. This is hardly a concern for 64-bit Linux and likely most other OSes. It would in theory be possible to fall back on LRUCache in that case but the code structure makes that annoying/challenging. Instead we crash with an appropriate message.
* Cleaned up some includes
* Fixed some previously unreported leaks (better assertions on HCC perhaps, some subtle behavior changes)
* Added a new mode to cache_bench (detailed below)
* Avoid a particularly costly sanity check in `~AutoHyperClockTable()` even in debug builds so that unit testing, etc., isn't bogged down, except keep it in ASAN build.

Planned follow-up:
* Update HCC implementation to use my new "bit field atomics" API introduced in https://github.com/facebook/rocksdb/issues/13910 to make it easier to read and maintain

Possible follow-up:
* Re-engineer table cache to use AutoHCC also, instead of LRUCache and a single mutex to ensure no duplication across threads. (a) Pad table cache key to 128 bits for AutoHCC. (b) Stripe/shard the no-duplication mutex. (HCC's consistency model is too weak for concurrent threads to use its API to agree on a winner, even if entries could be inserted in an "open in progress" state.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13964

Test Plan:
existing tests. ClockCacheTest.ClockEvictionEffortCapTest caught a regression during my development, and the crash test has a history of finding subtle HCC bugs.

## Performance

Although we've validated AutoHCC performance under high load, etc., before we haven't really considered whether there will be unacceptable overheads for small DBs and CFs, e.g. in unit tests. For this, I have added a new mode to cache_bench: with the -stress_cache_instances=n parameter, it will create and destroy n empty cache instances several times. In the debug build, this found that a particular check in `~AutoHyperClockTable()` was extremely costly for short-lived caches (fixed). Beyond that, we can answer the question of whether it is feasible for a single process to host 1000 DBs each with 1000 CFs with default block cache instances, after moving LRUCache -> AutoHCC, for example:

```
/usr/bin/time ./cache_bench -stress_cache_instances=1000000 -cache_type=auto_hyper_clock_cache -cache_size=33554432
```

Release build:
Average 9.8 us per 32MB LRUCache creation, 2.9 us per destruction, 24.6GB max RSS (~25KB each)
->
Average 4.3 us per  32MB AutoHCC creation, 4.9 us per destruction, 4.8GB max RSS (~5KB each)

Debug build:
Average 10.9 us per 32MB LRUCache creation, 3.5 us per destruction, 28.7GB max RSS (~29KB each)
->
Average 4.5 us per 32MB AutoHCC creation, 4.9 us per destruction, 4.7GB max RSS (~5KB each)

Despite the anonymous mmaps, it's apparently more efficient for default/small/empty structures. This is likely due to the dramatically low number of cache shards at this size. If we switch to `-stress_cache_instances=10000  -cache_size=1073741824`:

Release build:
Average 10.6 us per 1GB LRUCache, 2.8 us per destruction, 2.3 GB max RSS (~230KB each)
->
Average 130 us per 1GB AutoHCC creation, 153 us per destruction, 1.5 GB max RSS (~150KB each)

Debug build:
Average 11.2 us per 1GB LRUCache, 3.6 us per destruction, 2.4 GB max RSS (~240KB each)
->
Average 130 us per 1GB AutoHCC creation, 150 us per destruction, 1.6 GB max RSS (~160KB each)

Here it's clear that we are paying a price in time for setting up all those mmaps for the good number of cache shards and potential table growth, even though the RSS is well under control. However, I am not concerned about this at all, as it's unlikely to slow down anything notably such as unit tests. Before and after full testsuite runs confirm:

3327.73user 5188.71system 3:38.88elapsed -> 3312.07user 5704.77system 3:41.61elapsed

There is increased kernel time, but it is acceptable. With ASAN+UBSAN:

11618.70user 15671.30system 5:54.68elapsed -> 12595.81user 16159.67system 6:32.77elapsed

Acceptable, given that our ASAN+UBSAN builds are not the slowest in CI.

Reviewed By: hx235

Differential Revision: D82661067

Pulled By: pdillinger

fbshipit-source-id: ab25c766ca70f2b8664849c2a838b9e1b4e72d3b
Summary:
**Context/Summary:**
`Status::state` can be nullptr when the Status is created with no specific error message. Calling std::strstr on that nullptr caused a segfault in our stress test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13968

Test Plan: Monitor stress test

Reviewed By: jaykorean

Differential Revision: D82695541

Pulled By: hx235

fbshipit-source-id: cf08f70163a9ee6c911cdc3a3d79acd3429f0d15
Summary:
Force callers constructing MultiScanArgs to pass a comparator. Pass the comparator from the CF handle to MultiScanArgs in NewMultiScan.
Expand the MultiScanArgs unit test to cover a different comparator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13970

Test Plan: unit test

Reviewed By: cbi42

Differential Revision: D82739270

Pulled By: xingbowang

fbshipit-source-id: e709f4a333ad547c0ba6d24d8fb2b22e50e8a12f
Summary:
Add a new CF immutable option `paranoid_memory_check_key_checksum_on_seek` that allows additional data integrity validations during seek on the SkipList memtable. When this option is enabled and memtable_protection_bytes_per_key is non-zero, the skiplist-based memtable will validate the checksum of each key visited during a seek operation. The option is opt-in due to performance overhead. This is an enhancement on top of the paranoid_memory_checks option.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13902

Test Plan:
* new unit test added for paranoid_memory_check_key_checksum_on_seek=true
* existing unit test for paranoid_memory_check_key_checksum_on_seek=false
* enabled in stress test

Performance Benchmark: we check for performance regression in the read path where data is in the memtable only. For each benchmark, the script was run at the same time for main and this PR:

### Memtable-only randomread ops/sec:

* Value size = 100 Bytes
```
for B in 0 1 2 4 8; do (for I in $(seq 1 50);do  ./db_bench --benchmarks=fillseq,readrandom --write_buffer_size=268435456 --writes=250000 --value_size=100 --num=250000 --reads=500000  --seed=1723056275 --paranoid_memory_check_key_checksum_on_seek=true --memtable_protection_bytes_per_key=$B 2>&1 | grep "readrandom"; done;) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; done;
```

1. Main: 928999
2. PR with paranoid_memory_check_key_checksum_on_seek=false: 930993 (+0.2%)
3. PR with paranoid_memory_check_key_checksum_on_seek=true:
3.1 memtable_protection_bytes_per_key=1: 464577 (-50%)
3.2 memtable_protection_bytes_per_key=2: 470319 (-49%)
3.3 memtable_protection_bytes_per_key=4: 468457 (-50%)
3.4 memtable_protection_bytes_per_key=8: 465061 (-50%)

* Value size = 1000 Bytes
```
for B in 0 1 2 4 8; do (for I in $(seq 1 50);do  ./db_bench --benchmarks=fillseq,readrandom --write_buffer_size=268435456 --writes=250000 --value_size=1000 --num=250000 --reads=500000  --seed=1723056275 --paranoid_memory_check_key_checksum_on_seek=true --memtable_protection_bytes_per_key=$B 2>&1 | grep "readrandom"; done;) | awk '{ t += $5; c++; print } END { print 1.0 * t / c }'; done;
```

1. Main: 601321
2. PR with paranoid_memory_check_key_checksum_on_seek=false: 607885 (+1.1%)
3. PR with paranoid_memory_check_key_checksum_on_seek=true:
3.1 memtable_protection_bytes_per_key=1: 185742 (-69%)
3.2 memtable_protection_bytes_per_key=2: 177167 (-71%)
3.3 memtable_protection_bytes_per_key=4: 185908 (-69%)
3.4 memtable_protection_bytes_per_key=8: 183639 (-69%)

Reviewed By: pdillinger

Differential Revision: D81199245

Pulled By: xingbowang

fbshipit-source-id: e3c29552ab92f2c5f360361366a293fa26934913
Summary:
If there's a static initialization of Options(), this could now instantiate an AutoHyperClockTable before kPageSize is initialized. Break the dependency, since the kPageSize caching is a very minor optimization.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13973

Test Plan: internal CI (not able to reproduce locally)

Reviewed By: hx235

Differential Revision: D82789849

Pulled By: pdillinger

fbshipit-source-id: 3f32b5779a4f56d2071be5aadacda2bf0f4b895d
Summary:
This change adds FFI support for exporting column family checkpoints, basic access to the export/import files metadata, and creating column families by import.

I've been able to successfully use this to [add checkpoint export and import support to `rust-rocksdb`](https://github.com/pcholakov/rust-rocksdb/pull/2), a forked version of which has been successfully used in production for some time.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13874

Reviewed By: hx235

Differential Revision: D82343565

Pulled By: jaykorean

fbshipit-source-id: fb4182bdfd5cce10743c021a1ac636fd6ac48df3
Summary:
Currently in MultiScan we only unpin a block after we scan through it. This PR adds unpinning during Seek to release all blocks pinned by the previous scan range. This is useful when users do not scan through the entire scan range. I plan to follow up with support for aborting async IOs from the previous scan.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13972

Test Plan: new test MultiScanUnpinPreviousBlocks validates unpinning behavior

Reviewed By: xingbowang

Differential Revision: D82779504

Pulled By: cbi42

fbshipit-source-id: 17ba7d1e5a6d8ff09ceea57b79c18febfba75584
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/13976

Missed an occurrence of kPageSize in the last PR
https://github.com/facebook/rocksdb/pull/13973

Reviewed By: mszeszko-meta

Differential Revision: D82826713

fbshipit-source-id: b112cd7c94b7d6604623ee80274b2b25911245eb
Summary:
* There was a bug where the compression manager would actually not be used for recompress because the options passed to SstFileDumper were not respected. That is now fixed by respecting the Options.
* Refactored SstFileDumper not to take explicit options that could naturally be embedded in Options.
* Report compressed and uncompressed data block sizes (and ratio) instead of total file size (without a useful ratio). Needed to add a new table property to support that.
* Allow --block_size instead of --set_block_size to be consistent with other tools
* Allow --compression_level as shorthand for both _from and _to options, for simplicity and consistency with other tools
* Support --compression_parallel_threads option

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13977

Test Plan:
* sst_dump manual testing
* TableProperties unit tests updated
* Made it much easier to detect when a functional change requires an update to ParseTablePropertiesString() (rather than causing cryptic downstream failures)

Reviewed By: cbi42

Differential Revision: D82841412

Pulled By: pdillinger

fbshipit-source-id: 8d3421be4d2a3e25b7590cd59d204a3779c2a928
Summary:
Long wanted, e.g. for easy tab-completion; now implemented.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13978

Test Plan: pretty good unit test updates, manual testing

Reviewed By: cbi42

Differential Revision: D82857671

Pulled By: pdillinger

fbshipit-source-id: d2b63b7d15e61ebf22c58a6ecd3003311e2d03cb
Summary:
Start the process of migrating the HCC implementation over to my new system of "bit field atomics" to clean up the code. Here I took on the simplest of the three "bit field atomic" formats in HCC, but ended up moving some things around to end up with less plumbing of definitions and values overall.

In the process, updated BitFields to use the CRTP pattern to simplify some things (see updated example, etc.)
https://en.wikipedia.org/wiki/Curiously_recurring_template_pattern

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13965

Test Plan: existing tests. ClockCacheTest.ClockEvictionEffortCapTest caught a regression during my development, and the crash test has a history of finding subtle HCC bugs.

Reviewed By: xingbowang

Differential Revision: D82669582

Pulled By: pdillinger

fbshipit-source-id: b73dd47361cbe9fbd334413dd4ce01b3c667159e
Summary:
Support a random seed for the whitebox test.
Support a per-iteration random seed override, so that we can skip previous iterations, as sometimes a failure only happens after a few iterations.
We still need the initial random seed because some parameters are initialized before each iteration and not all parameters are re-randomized in each iteration: we want some parameters to be stable across the run.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13952

Test Plan:
Example of using the per-iteration random seed override to jump to the second iteration.

Simulate a normal run. 4205502355970671733 is the seed used for the second iteration.
```
[xbw@devvm16622.vll0 ~/workspace/ws2/rocksdb (plm_stress_fix)]$ /usr/local/bin/python3 -u tools/db_crashtest.py --stress_cmd=./db_stress --cleanup_cmd='' --cf_consistency blackbox --duration=96000 --max_key=2500000 --interval=10 --initial_random_seed_override=10
Start with random seed 10
Running blackbox-crash-test with
interval_between_crash=10
total-duration=96000

Use random seed for iteration 13278846177722289202
Running db_stress with pid=2102945: ./db_stress --WAL_size_limit_MB=1 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --adm_policy=2 --advise_random_on_open=0 --allow_data_in_errors=True --allow_fallocate=0 --allow_setting_blob_options_dynamically=1 --allow_unprepared_value=0 --async_io=1 --atomic_flush=1 --auto_readahead_size=1 --auto_refresh_iterator_with_snapshot=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=1000000 --blob_cache_size=2097152 --blob_compaction_readahead_size=4194304 --blob_compression_type=snappy --blob_file_size=1073741824 --blob_file_starting_level=3 --blob_garbage_collection_age_cutoff=0.0 --blob_garbage_collection_force_threshold=0.5 --block_align=0 --block_protection_bytes_per_key=1 --block_size=16384 --bloom_before_level=1 --bloom_bits=27.321469575655275 --bottommost_compression_type=zstd --bottommost_file_compaction_delay=86400 --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=1 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=1 --checkpoint_one_in=0 --checksum_type=kXXH3 --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_style=0 --compaction_ttl=0 --compress_format_version=1 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=1 --compression_manager=autoskip --compression_max_dict_buffer_bytes=8589934591 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4 --compression_use_zstd_dict_trainer=1 
--compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=1 --db=/tmp/rocksdb_crashtest_blackboxs39kubu3 --db_write_buffer_size=0 --decouple_partitioned_filters=1 --default_temperature=kHot --default_write_temperature=kUnknown --delete_obsolete_files_period_micros=30000000 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=10000 --disable_wal=1 --dump_malloc_stats=0 --enable_blob_files=1 --enable_blob_garbage_collection=0 --enable_checksum_handoff=0 --enable_compaction_filter=0 --enable_compaction_on_deletion_trigger=1 --enable_custom_split_merge=1 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=1 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=1 --exclude_wal_from_write_fault_injection=0 --expected_values_dir=/tmp/rocksdb_crashtest_expected_rvq7p3ow --fifo_allow_compaction=0 --file_checksum_impl=xxh64 --file_temperature_age_thresholds= --fill_cache=1 --flush_one_in=1000000 --format_version=6 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0 --index_block_restart_interval=15 --index_shortening=2 --index_type=2 --ingest_external_file_one_in=0 --ingest_wbwi_one_in=500 --initial_auto_readahead_size=16384 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kWarm --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=10000 --log_file_time_to_roll=0 --log_readahead_size=0 
--long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=2500000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=32768 --max_sequential_skip_in_iterations=1 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=1048576 --memtable_avg_op_scan_flush_trigger=20 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_op_scan_flush_trigger=1000 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=0 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --metadata_write_fault_one_in=0 --min_blob_size=8 --min_write_buffer_number_to_merge=1 --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --num_bottom_pri_threads=1 --num_file_reads_for_auto_readahead=0 --open_files=100 --open_metadata_read_fault_one_in=8 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=0 --paranoid_memory_checks=0 --partition_filters=1 --partition_pinning=2 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=1 --prefixpercent=5 --prepopulate_blob_cache=1 --prepopulate_block_cache=1 --preserve_internal_time_seconds=36000 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=32 --read_fault_one_in=0 --readahead_size=524288 --readpercent=45 --recycle_log_file_num=0 --remote_compaction_worker_threads=0 --reopen=0 --report_bg_io_stats=1 --reset_stats_one_in=10000 --sample_for_compression=0 --secondary_cache_fault_one_in=32 
--secondary_cache_uri=compressed_secondary_cache://capacity=8388608;enable_custom_split_merge=true --set_options_one_in=1000 --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=1048576 --sqfc_name=foo --sqfc_version=0 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --statistics=1 --stats_dump_period_sec=0 --stats_history_buffer_size=0 --strict_bytes_per_sync=0 --subcompactions=3 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=-1 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=1 --test_cf_consistency=1 --test_ingest_standalone_range_deletion_one_in=0 --top_level_index_pinning=0 --track_and_verify_wals=0 --uncache_aggressiveness=3225 --universal_max_read_amp=-1 --universal_reduce_file_locking=0 --unpartitioned_pinning=1 --use_adaptive_mutex=1 --use_adaptive_mutex_lru=1 --use_attribute_group=1 --use_blob_cache=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=1 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=0 --use_multi_get_entity=0 --use_multiget=1 --use_multiscan=0 --use_put_entity_one_in=10 --use_shared_block_and_blob_cache=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000 --verify_compression=0 --verify_db_one_in=10000 --verify_file_checksums_one_in=1000000 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=1048576 --write_dbid_to_manifest=0 --write_fault_one_in=0 --write_identity_file=1 --writepercent=35

KILLED 2102945

stdout:

Use random seed for iteration 4205502355970671733
Running db_stress with pid=2107447: ./db_stress --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --adm_policy=3 --advise_random_on_open=0 --allow_data_in_errors=True --allow_fallocate=0 --allow_setting_blob_options_dynamically=1 --allow_unprepared_value=1 --async_io=1 --auto_readahead_size=0 --auto_refresh_iterator_with_snapshot=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --blob_cache_size=4194304 --blob_compaction_readahead_size=1048576 --blob_compression_type=snappy --blob_file_size=1048576 --blob_file_starting_level=2 --blob_garbage_collection_age_cutoff=1.0 --blob_garbage_collection_force_threshold=0.75 --block_align=0 --block_protection_bytes_per_key=8 --block_size=16384 --bloom_before_level=2147483647 --bloom_bits=0 --bottommost_compression_type=disable --bottommost_file_compaction_delay=600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=33554432 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=1 --check_multiget_entity_consistency=0 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_style=0 --compaction_ttl=2 --compress_format_version=1 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=1 --compression_manager=randommixed --compression_max_dict_buffer_bytes=34359738367 --compression_max_dict_bytes=16384 --compression_parallel_threads=4 --compression_type=none --compression_use_zstd_dict_trainer=1 
--compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0 --db=/tmp/rocksdb_crashtest_blackboxs39kubu3 --db_write_buffer_size=0 --decouple_partitioned_filters=1 --default_temperature=kHot --default_write_temperature=kUnknown --delete_obsolete_files_period_micros=30000000 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=1 --enable_blob_files=1 --enable_blob_garbage_collection=1 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_compaction_on_deletion_trigger=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=1 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=0 --exclude_wal_from_write_fault_injection=1 --expected_values_dir=/tmp/rocksdb_crashtest_expected_rvq7p3ow --fifo_allow_compaction=1 --file_checksum_impl=none --file_temperature_age_thresholds= --fill_cache=1 --flush_one_in=1000000 --format_version=5 --get_all_column_family_metadata_one_in=10000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=100000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=11 --index_shortening=2 --index_type=2 --ingest_external_file_one_in=0 --ingest_wbwi_one_in=0 --initial_auto_readahead_size=0 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kCold --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=0 --log_file_time_to_roll=0 --log_readahead_size=16777216 
--long_running_snapshots=0 --low_pri_pool_ratio=0.5 --lowest_used_cache_tier=1 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=2 --max_bytes_for_level_base=10485760 --max_key=2500000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=2 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=2097152 --memtable_avg_op_scan_flush_trigger=20 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_op_scan_flush_trigger=10 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --metadata_write_fault_one_in=0 --min_blob_size=16 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_bottom_pri_threads=20 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=8 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_hits=0 --optimize_filters_for_memory=0 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=1 --partition_pinning=2 --pause_background_one_in=10000 --periodic_compaction_seconds=1000 --prefix_size=7 --prefixpercent=5 --prepopulate_blob_cache=1 --prepopulate_block_cache=0 --preserve_internal_time_seconds=36000 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=0 --remote_compaction_worker_threads=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=0 --secondary_cache_fault_one_in=0 
--secondary_cache_uri=compressed_secondary_cache://capacity=8388608;enable_custom_split_merge=true --set_options_one_in=1000 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=0 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --statistics=1 --stats_dump_period_sec=600 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=0 --subcompactions=1 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=1 --test_cf_consistency=1 --test_ingest_standalone_range_deletion_one_in=0 --top_level_index_pinning=1 --track_and_verify_wals=0 --uncache_aggressiveness=136 --universal_max_read_amp=-1 --universal_reduce_file_locking=1 --unpartitioned_pinning=2 --use_adaptive_mutex=0 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_blob_cache=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=1 --use_multi_get_entity=0 --use_multiget=0 --use_multiscan=0 --use_put_entity_one_in=10 --use_shared_block_and_blob_cache=1 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=5 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=1048576 --write_dbid_to_manifest=0 --write_fault_one_in=0 --write_identity_file=1 --writepercent=35
```

Override the per-iteration random seed directly with 4205502355970671733 to jump to the second iteration's parameter set. Only the file path names differ; the rest of the parameters are all the same.
```
[xbw@devvm16622.vll0 ~/workspace/ws2/rocksdb (plm_stress_fix)]$ /usr/local/bin/python3 -u tools/db_crashtest.py --stress_cmd=./db_stress --cleanup_cmd='' --cf_consistency blackbox --duration=96000 --max_key=2500000 --interval=10 --initial_random_seed_override=10 --per_iteration_random_seed_override=4205502355970671733
Start with random seed 10
Running blackbox-crash-test with
interval_between_crash=10
total-duration=96000

Use random seed for iteration 4205502355970671733
Running db_stress with pid=2110794: ./db_stress --WAL_size_limit_MB=0 --WAL_ttl_seconds=0 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --adm_policy=3 --advise_random_on_open=0 --allow_data_in_errors=True --allow_fallocate=0 --allow_setting_blob_options_dynamically=1 --allow_unprepared_value=1 --async_io=1 --auto_readahead_size=0 --auto_refresh_iterator_with_snapshot=0 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --blob_cache_size=4194304 --blob_compaction_readahead_size=1048576 --blob_compression_type=snappy --blob_file_size=1048576 --blob_file_starting_level=2 --blob_garbage_collection_age_cutoff=1.0 --blob_garbage_collection_force_threshold=0.75 --block_align=0 --block_protection_bytes_per_key=8 --block_size=16384 --bloom_before_level=2147483647 --bloom_bits=0 --bottommost_compression_type=disable --bottommost_file_compaction_delay=600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=1 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=33554432 --cache_type=auto_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=1 --check_multiget_entity_consistency=0 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000 --compaction_pri=4 --compaction_readahead_size=1048576 --compaction_style=0 --compaction_ttl=2 --compress_format_version=1 --compressed_secondary_cache_ratio=0.0 --compressed_secondary_cache_size=0 --compression_checksum=1 --compression_manager=randommixed --compression_max_dict_buffer_bytes=34359738367 --compression_max_dict_bytes=16384 --compression_parallel_threads=4 --compression_type=none --compression_use_zstd_dict_trainer=1 
--compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=0 --db=/tmp/rocksdb_crashtest_blackboxo1xvo_2n --db_write_buffer_size=0 --decouple_partitioned_filters=1 --default_temperature=kHot --default_write_temperature=kUnknown --delete_obsolete_files_period_micros=30000000 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=1 --disable_file_deletions_one_in=10000 --disable_manual_compaction_one_in=1000000 --disable_wal=0 --dump_malloc_stats=1 --enable_blob_files=1 --enable_blob_garbage_collection=1 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_compaction_on_deletion_trigger=0 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=1 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=1 --enable_thread_tracking=0 --enable_write_thread_adaptive_yield=1 --error_recovery_with_no_fault_injection=0 --exclude_wal_from_write_fault_injection=1 --expected_values_dir=/tmp/rocksdb_crashtest_expected_s0kmvlrj --fifo_allow_compaction=1 --file_checksum_impl=none --file_temperature_age_thresholds= --fill_cache=1 --flush_one_in=1000000 --format_version=5 --get_all_column_family_metadata_one_in=10000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=10000 --get_properties_of_all_tables_one_in=100000 --get_property_one_in=100000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0.5 --index_block_restart_interval=11 --index_shortening=2 --index_type=2 --ingest_external_file_one_in=0 --ingest_wbwi_one_in=0 --initial_auto_readahead_size=0 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100000 --last_level_temperature=kCold --level_compaction_dynamic_level_bytes=1 --lock_wal_one_in=0 --log_file_time_to_roll=0 --log_readahead_size=16777216 
--long_running_snapshots=0 --low_pri_pool_ratio=0.5 --lowest_used_cache_tier=1 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=0 --max_auto_readahead_size=524288 --max_background_compactions=2 --max_bytes_for_level_base=10485760 --max_key=2500000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=2 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=2097152 --memtable_avg_op_scan_flush_trigger=20 --memtable_insert_hint_per_batch=1 --memtable_max_range_deletions=0 --memtable_op_scan_flush_trigger=10 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=0 --memtable_whole_key_filtering=1 --memtablerep=skip_list --metadata_charge_policy=1 --metadata_read_fault_one_in=0 --metadata_write_fault_one_in=0 --min_blob_size=16 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_bottom_pri_threads=20 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=8 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_hits=0 --optimize_filters_for_memory=0 --optimize_multiget_for_io=0 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=1 --partition_pinning=2 --pause_background_one_in=10000 --periodic_compaction_seconds=1000 --prefix_size=7 --prefixpercent=5 --prepopulate_blob_cache=1 --prepopulate_block_cache=0 --preserve_internal_time_seconds=36000 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=0 --remote_compaction_worker_threads=0 --reopen=0 --report_bg_io_stats=0 --reset_stats_one_in=1000000 --sample_for_compression=0 --secondary_cache_fault_one_in=0 
--secondary_cache_uri=compressed_secondary_cache://capacity=8388608;enable_custom_split_merge=true --set_options_one_in=1000 --skip_stats_update_on_db_open=1 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo --sqfc_version=0 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --statistics=1 --stats_dump_period_sec=600 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=0 --subcompactions=1 --sync=0 --sync_fault_injection=1 --table_cache_numshardbits=6 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=1 --test_cf_consistency=1 --test_ingest_standalone_range_deletion_one_in=0 --top_level_index_pinning=1 --track_and_verify_wals=0 --uncache_aggressiveness=136 --universal_max_read_amp=-1 --universal_reduce_file_locking=1 --unpartitioned_pinning=2 --use_adaptive_mutex=0 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_blob_cache=0 --use_delta_encoding=1 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=0 --use_get_entity=0 --use_merge=0 --use_multi_cf_iterator=1 --use_multi_get_entity=0 --use_multiget=0 --use_multiscan=0 --use_put_entity_one_in=10 --use_shared_block_and_blob_cache=1 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=5 --use_write_buffer_manager=0 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=1048576 --write_dbid_to_manifest=0 --write_fault_one_in=0 --write_identity_file=1 --writepercent=35
```

Reviewed By: jaykorean

Differential Revision: D82399857

Pulled By: xingbowang

fbshipit-source-id: 38f3bfefdd0adc7f527fd68982e2edc22b2304f4
Summary:
the test did not consider the ingestion_option settings, which can result in different error messages. This PR fixes the relevant check and ensures we have enough randomness in this test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13979

Test Plan: `gtest-parallel --repeat=20 --workers=20 ./external_sst_file_test --gtest_filter="*VaryingOptions/IngestDBGeneratedFileTest2.NonZeroSeqno/*"`

Reviewed By: hx235

Differential Revision: D82873439

Pulled By: cbi42

fbshipit-source-id: b0d74bf26a502ca3db59b4a0ea9717bf7d027400
Summary:
- updated release note
- updated version to 10.8 in version.h
- added 10.7 to check_format_compatible.sh
- did not update the folly commit hash due to a build failure.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13980

Reviewed By: xingbowang

Differential Revision: D82882035

Pulled By: cbi42

fbshipit-source-id: b5e0e78570fdd492d592ee77bd3901e4b39c25fb
Summary:
It is useful to be able to specify output temperatures in the CompactFiles API. For example, it may be useful to store small L0 files produced by flushes locally, while larger intra-L0 compactions can store the compacted L0 file remotely.
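A placement policy like the one described might be sketched as follows; the threshold, temperature labels, and function name are hypothetical illustrations, not RocksDB's API:

```python
def pick_output_temperature(compaction_input_bytes, local_threshold=64 << 20):
    """Hypothetical policy sketch: keep small flush/L0 outputs on fast
    local storage, send larger compacted outputs to a colder/remote tier.
    Threshold and temperature names are illustrative only."""
    if compaction_input_bytes < local_threshold:
        return "kUnknown"  # default placement, i.e. local
    return "kWarm"         # colder / remote tier
```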

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13955

Test Plan: New unit tests

Reviewed By: jaykorean

Differential Revision: D82492503

Pulled By: joshkang97

fbshipit-source-id: e1225fe572a15d7c5c30a265762b048a4a9e7f0b
Summary:
For MultiScan and UDI we now use the bound check from the index iterator, so this assert is removed.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13988

Test Plan: existing test

Reviewed By: hx235

Differential Revision: D82993180

Pulled By: cbi42

fbshipit-source-id: 442b2e83cb3aef96fc1a825bf733af9ce59c21c1
Summary:
**Context**
Resuming compaction is designed to periodically record the progress of an ongoing compaction and can resume from that saved progress after interruptions such as cancellation, database shutdown, or crashes.

This PR introduces the data structures needed to store subcompaction progress in memory, along with serialization and deserialization support to persist this progress to, and parse it from, a "manifest-like compaction progress file" (the actual creation of that file comes in upcoming PRs).

Flow of resuming: DB::OpenAndCompact() -> Compaction progress file  -> SubcompactionProgress -> CompactionJob
Flow of persistence: CompactionJob -> SubcompactionProgress -> Compaction progress file  -> DB that is called with OpenAndCompact()

**Summary**
Progress represented by `SubcompactionProgress` will be tracked at the scope of a subcompaction, which is the smallest independent unit of compaction work.

The frequency of recording this progress is once every N compaction output files (to be detailed in future PRs).

When recording, all fields, except for the output files metadata in `SubcompactionProgress`, will directly overwrite the corresponding fields from the last saved progress (See `SubcompactionProgress` and `SubcompactionProgressBuilder` for more).

As a bonus, this PR refactors the file metadata encoding and decoding utilities into two static helper functions, EncodeToNewFile4() and DecodeNewFile4From(), to support subcompaction progress usage.
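The overwrite-scalars, accumulate-output-files behavior described above can be sketched as follows; the field names are hypothetical and JSON stands in for RocksDB's manifest-style varint encoding:

```python
import dataclasses
import json

@dataclasses.dataclass
class SubcompactionProgress:
    # Hypothetical field names for illustration; the real struct uses
    # RocksDB's manifest-style binary encoding, not JSON.
    next_key_to_compact: str
    num_processed_input_records: int
    output_files: list

def merge_progress(last, new):
    # Scalar fields from the newest record overwrite the previous ones,
    # while output-file metadata accumulates across records.
    merged = dataclasses.replace(new)
    if last is not None:
        merged.output_files = last.output_files + new.output_files
    return merged

def serialize(progress):
    return json.dumps(dataclasses.asdict(progress))
```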

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13928

Test Plan:
- Added various `SubcompactionProgressTest` unit tests in version_edit_test.cc to verify basic serialization/deserialization and forward compatibility handling
- Existing UTs and stress/crash test

**Follow up:**
- Move output entry number and file verification to after each file creation so we can remove kNumProcessedOutputRecords persistence support and make resuming compaction work with `paranoid_file_checks=true` (by default false). Output verification will be done before persistence of progress. As long as this follow-up is done before the landing of the integration PR to create the progress file, we can change the manifest-like compaction progress file format freely.

Reviewed By: jaykorean

Differential Revision: D81986583

Pulled By: hx235

fbshipit-source-id: b42766da7d9c2e2f596c892d050c753238d1039f
Summary:
This is causing some internal failures; we have decided to revert it for now until we have a proper fix.

This reverts commit 961880b458.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13987

Reviewed By: anand1976

Differential Revision: D82990294

Pulled By: cbi42

fbshipit-source-id: 5f5b4d18d0afe47599738d27e11e3eb2d08d88a0
Summary:
**Context/Summary**
This reverts commit 73432a3f36 because it mysteriously fails our internal CI when run with this change to db_crashtest.py. The root cause is unknown, but the error reproduced frequently with this commit and not with the one before it. The error message suggests that command parsing breaks and the db_stress binary can't be found:

```
Traceback (most recent call last):
  File "/data/sandcastle/boxes/trunk-hg-full-fbsource/fbcode/internal_repo_rocksdb/repo/tools/db_crashtest.py", line 1638, in <module>
    main()
  File "/data/sandcastle/boxes/trunk-hg-full-fbsource/fbcode/internal_repo_rocksdb/repo/tools/db_crashtest.py", line 1627, in main
    blackbox_crash_main(args, unknown_args)
  File "/data/sandcastle/boxes/trunk-hg-full-fbsource/fbcode/internal_repo_rocksdb/repo/tools/db_crashtest.py", line 1347, in blackbox_crash_main
    hit_timeout, retcode, outs, errs = execute_cmd(cmd, cmd_params["interval"])
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/sandcastle/boxes/trunk-hg-full-fbsource/fbcode/internal_repo_rocksdb/repo/tools/db_crashtest.py", line 1283, in execute_cmd
    child = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/fbcode/platform010/lib/python3.12/subprocess.py", line 1028, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/local/fbcode/platform010/lib/python3.12/subprocess.py", line 1957, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: './db_stress'
```

**Test plan**
- Rehearsal crash test

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13989

Reviewed By: xingbowang

Differential Revision: D83010751

Pulled By: hx235

fbshipit-source-id: d8cfc70564074065b6bb8a3986d6c1011064dd5e
Summary:
Return a failure status for multi scan if Prepare fails, or if the scan options are unsupported, instead of falling back on a regular scan. This PR also fixes a bug in LevelIterator that caused max_prefetch_size to be ignored.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13974

Test Plan: Add new test in db_iterator_test and table_test

Reviewed By: xingbowang

Differential Revision: D82843944

Pulled By: anand1976

fbshipit-source-id: f12756c40ebd38d8d4e4425e97438b6e766a4663
Summary:
Fix a bug in "Improve random seed override support" in the stress test.

The Bug:
`parser.parse_known_args()` is used to parse command line arguments. When called without any argument, it reads `sys.argv`, whose first element is the command itself, so the parser skips it. Meanwhile, the returned `remain_argv` of `parser.parse_known_args()` does not contain the command. When `remain_argv` replaces `sys.argv` and the parser is invoked again without arguments, the first remaining argument is treated as the command and skipped. In the internal stress test tool, the first argument is `--stress_cmd`, so it is skipped and the default value `./db_stress` is used instead. This is why `./db_stress` showed up in the error message, and also why the test works locally, where db_stress is located in the local folder.

The Fix:
When `parser.parse_known_args()` is called the first time, `remain_argv` is saved as a global variable and passed explicitly to the second call, `parser.parse_known_args(remain_argv)`. When the arguments are passed to `parser.parse_known_args` directly, the first one is not skipped.
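The skipping behavior described above can be demonstrated in a few lines; the flag names mirror the ones in this PR, but the parsers here are a standalone sketch, not the db_crashtest.py code:

```python
import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument("--known")

# Simulated command line: program name, one known flag, one unknown flag.
argv = ["db_crashtest.py", "--known=1", "--stress_cmd=./custom_db_stress"]

# When a list is passed explicitly, argparse parses every element.
args, remain_argv = parser.parse_known_args(argv[1:])
assert remain_argv == ["--stress_cmd=./custom_db_stress"]

stress_parser = argparse.ArgumentParser()
stress_parser.add_argument("--stress_cmd", default="./db_stress")

# Buggy pattern: stuffing the remainder back into sys.argv and calling
# parse_known_args() with no argument makes argparse read sys.argv[1:],
# silently dropping remain_argv[0] as if it were the program name.
sys.argv = remain_argv
buggy_args, _ = stress_parser.parse_known_args()
assert buggy_args.stress_cmd == "./db_stress"  # fell back to the default

# Fixed pattern: pass the saved remainder explicitly; nothing is skipped.
fixed_args, _ = stress_parser.parse_known_args(remain_argv)
assert fixed_args.stress_cmd == "./custom_db_stress"
```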

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13991

Test Plan:
The value of the first argument `--stress_cmd` is parsed correctly and shows up in the error message.

```
/usr/local/bin/python3 -u tools/db_crashtest.py --stress_cmd=/data/sandcastle/boxes/trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/d7db8b24dd42e2db/internal_repo_rocksdb/repo/__db_stress__/db_stress --cleanup_cmd='' --simple blackbox  --print_stderr_separately
Start with random seed 11107847853133580500
Running blackbox-crash-test with
interval_between_crash=120
total-duration=6000

Use random seed for iteration 8577470137673434540
Traceback (most recent call last):
  File "/home/xbw/workspace/ws1/rocksdb/tools/db_crashtest.py", line 1650, in <module>
    main()
  File "/home/xbw/workspace/ws1/rocksdb/tools/db_crashtest.py", line 1639, in main
    blackbox_crash_main(args, unknown_args)
  File "/home/xbw/workspace/ws1/rocksdb/tools/db_crashtest.py", line 1358, in blackbox_crash_main
    hit_timeout, retcode, outs, errs = execute_cmd(cmd, cmd_params["interval"])
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xbw/workspace/ws1/rocksdb/tools/db_crashtest.py", line 1294, in execute_cmd
    child = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/fbcode/platform010/lib/python3.12/subprocess.py", line 1028, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/local/fbcode/platform010/lib/python3.12/subprocess.py", line 1957, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/data/sandcastle/boxes/trunk-hg-full-fbsource/buck-out/v2/gen/fbcode/d7db8b24dd42e2db/internal_repo_rocksdb/repo/__db_stress__/db_stress'
```

Reviewed By: hx235

Differential Revision: D83068960

Pulled By: xingbowang

fbshipit-source-id: 28334d38a444c6f8525444e15f460ec6b257ef38
Summary:
This combination causes MultiScan iteration to fail due to internal reseek by the iterator.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13992

Reviewed By: cbi42

Differential Revision: D83094631

Pulled By: anand1976

fbshipit-source-id: 96410747d88de391e6d65857d39063d4fb113d65
Summary:
In https://github.com/facebook/rocksdb/issues/13964 I changed an expensive DEBUG check in ~AutoHyperClockTable to only run in ASAN builds. It's still expensive so I'm modifying it to scan only about one page beyond what we expect to have written to the anonymous mmap, rather than scanning the whole thing.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13998

Test Plan: manually checked that lru_cache_test running time went from 5.0s to 4.0s after the change. Verified that existing unit test ClockCacheTest.Limits uses the full anonymous mmap to be sure it is sized as expected, by temporarily breaking AutoHyperClockTable::Grow() to allow slightly exceeding the anonymous mmap size.

Reviewed By: cbi42

Differential Revision: D83178493

Pulled By: pdillinger

fbshipit-source-id: a2bf093e98bf68b540c073800be7e193021f2692
Summary:
Remove the restriction of only using BytewiseComparator(). In a follow on PR, the UDI interface will be updated to take the Comparator as a parameter.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13999

Test Plan: Add a unit test in table_test.cc

Reviewed By: cbi42

Differential Revision: D83179747

Pulled By: anand1976

fbshipit-source-id: 60222533c71022aa0701ac61c39268d36ca86338
Summary:
There can be concurrent reads/writes to fields in `IODebugContext`. One example we have seen is for the `cost_info` field which is of type `std::any`. In fact, in RocksDB's async MultiRead implementation, the same `IODebugContext` is re-used across separate async read requests.
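The fix amounts to guarding the shared debug state with a lock; a simplified sketch follows (the real class is C++, the guarded field is a `std::any`, and the method names here are illustrative):

```python
import threading

class IODebugContext:
    """Sketch of mutex-guarded shared debug state. Concurrent async reads
    may reuse the same context, so readers and writers take one lock."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._cost_info = None

    def set_cost_info(self, value):
        # Exclusive access for writers.
        with self._mutex:
            self._cost_info = value

    def cost_info(self):
        # Readers take the same lock, so a reader can never observe a
        # half-written value while another request updates the field.
        with self._mutex:
            return self._cost_info
```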

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13993

Test Plan: Update code which reads/writes to `cost_data` to first acquire shared/exclusive lock on the `mutex` field. There should not be any race conditions when async MultiRead is used.

Reviewed By: pdillinger

Differential Revision: D83091423

Pulled By: archang19

fbshipit-source-id: 4db86d33cf162ed39114b1cd115fcd8964c8ff9b
Summary:
also requested by internal user, like kIce in https://github.com/facebook/rocksdb/issues/13927

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14000

Test Plan: unit tests updated

Reviewed By: archang19

Differential Revision: D83200479

Pulled By: pdillinger

fbshipit-source-id: 31f2842d87bcad40227aeee9687ff5772393689c
Summary:
Currently BlockBasedTableIterator::Prepare() fails the iterator with a non-OK status if an out-of-range scan option is detected. This is due to the interaction between LevelIterator and BlockBasedTableIterator; see the added comment above BlockBasedTableIterator::Prepare(). This can fail the stress test for L0 files, since it doesn't use LevelIterator and scan options are not pruned. This PR fixes this by adding an internal option to MultiScanArgs that enables this check.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13995

Test Plan:
- new unit test
- stress test that fails before this pr: `python3 -u ./tools/db_crashtest.py whitebox --iterpercent=60 --prefix_size=-1 --prefixpercent=0 --readpercent=0 --test_batches_snapshots=0 --use_multiscan=1 --read_fault_one_in=0 --kill_random_test=88888 --interval=60 --multiscan_use_async_io=0 --mmap_read=0 --level0_file_num_compaction_trigger=20`

Reviewed By: anand1976

Differential Revision: D83166088

Pulled By: cbi42

fbshipit-source-id: 241a7d43c8c00d9a98eea0cabb03d2174d51aae5
Summary:
* Delete obsolete double-underscore version macros, `__ROCKSDB_MAJOR__` etc.
* Add convenient ROCKSDB_VERSION_GE(x, y, z) macro for conditional compilation
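The semantics of the new macro amount to a lexicographic (major, minor, patch) comparison, which can be sketched as follows; the `current` tuple below is an assumed example, not read from version.h:

```python
def rocksdb_version_ge(major, minor, patch, current=(10, 8, 0)):
    # Tuple comparison mirrors the intent of ROCKSDB_VERSION_GE(x, y, z):
    # compare major first, then minor, then patch. `current` stands in for
    # the values compiled into version.h (assumed example only).
    return current >= (major, minor, patch)
```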

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14004

Test Plan: Unit test added

Reviewed By: jaykorean

Differential Revision: D83264938

Pulled By: pdillinger

fbshipit-source-id: 23dcfb2760751fb87e232b8e0bbda610fd4ac73c
Summary:
Pass the comparator to UDI interface for both reader and builder.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14001

Test Plan: Unit test

Reviewed By: anand1976

Differential Revision: D83339943

Pulled By: xingbowang

fbshipit-source-id: 7f6541776b0995260e28224329f0cca37f13b3d4
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/14005

Test Plan: verify according to https://github.com/facebook/rocksdb/tree/main/docs

Reviewed By: archang19

Differential Revision: D83365540

Pulled By: hx235

fbshipit-source-id: b674aca6a9977721b64cafcdfaf8690d1c5940b7
Summary:
In https://github.com/facebook/rocksdb/pull/13816, we added `earliest_snapshot` in the Compaction object picked by remote compaction which is required for Standalone Range Deletion Optimization (introduced in https://github.com/facebook/rocksdb/pull/13078)

The Standalone Range Deletion Optimization was supposed to be supported by Universal Compaction only. This PR properly skips the feature when the compaction style is not kUniversal

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14007

Test Plan:
Unit Test updated to include Leveled Compaction
```
./compaction_service_test --gtest_filter="*CompactionServiceTest.StandaloneDeleteRangeTombstoneOptimization*"
```

In the stress test, we were able to repro the issue before this change, but not anymore
```
./db_stress --WAL_size_limit_MB=0 --WAL_ttl_seconds=60 --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --adm_policy=2 --advise_random_on_open=0 --allow_data_in_errors=True --allow_fallocate=1 --allow_setting_blob_options_dynamically=0 --allow_unprepared_value=1 --async_io=1 --atomic_flush=1 --auto_readahead_size=1 --auto_refresh_iterator_with_snapshot=1 --avoid_flush_during_recovery=0 --avoid_flush_during_shutdown=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=1000 --batch_protection_bytes_per_key=0 --bgerror_resume_retry_interval=100 --block_align=0 --block_protection_bytes_per_key=0 --block_size=16384 --bloom_before_level=2147483647 --bloom_bits=3.4547746144863423 --bottommost_compression_type=lz4hc --bottommost_file_compaction_delay=3600 --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_index_and_filter_blocks_with_high_priority=0 --cache_size=8388608 --cache_type=tiered_fixed_hyper_clock_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 --charge_table_reader=0 --check_multiget_consistency=0 --check_multiget_entity_consistency=0 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000 --compaction_pri=3 --compaction_readahead_size=1048576 --compaction_style=0 --compaction_ttl=2 --compress_format_version=1 --compressed_secondary_cache_ratio=0.5 --compressed_secondary_cache_size=0 --compression_checksum=1 --compression_manager=mixed --compression_max_dict_buffer_bytes=15 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=lz4hc --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --daily_offpeak_time_utc= --data_block_index_type=1 --db=/tmp/jewoongh/rocksdb_crashtest_blackbox_remote_compaction --db_write_buffer_size=1048576 --decouple_partitioned_filters=1 
--default_temperature=kWarm --default_write_temperature=kWarm --delete_obsolete_files_period_micros=21600000000 --delpercent=4 --delrangepercent=1 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_file_deletions_one_in=1000000 --disable_manual_compaction_one_in=10000 --disable_wal=1 --dump_malloc_stats=0 --enable_blob_files=0 --enable_blob_garbage_collection=0 --enable_checksum_handoff=1 --enable_compaction_filter=0 --enable_compaction_on_deletion_trigger=1 --enable_custom_split_merge=0 --enable_do_not_compress_roles=0 --enable_index_compression=0 --enable_memtable_insert_with_hint_prefix_extractor=0 --enable_pipelined_write=0 --enable_sst_partitioner_factory=1 --enable_thread_tracking=1 --enable_write_thread_adaptive_yield=0 --error_recovery_with_no_fault_injection=0 --exclude_wal_from_write_fault_injection=0 --expected_values_dir=/tmp/jewoongh/rocksdb_crashtest_expected_remote_compaction --fifo_allow_compaction=1 --file_checksum_impl=none --file_temperature_age_thresholds= --fill_cache=1 --flush_one_in=1000 --format_version=6 --get_all_column_family_metadata_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_apis_one_in=1000000 --get_properties_of_all_tables_one_in=100000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --hard_pending_compaction_bytes_limit=274877906944 --high_pri_pool_ratio=0 --index_block_restart_interval=7 --index_shortening=2 --index_type=2 --ingest_external_file_one_in=0 --ingest_wbwi_one_in=500 --initial_auto_readahead_size=0 --inplace_update_support=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --key_may_exist_one_in=100 --last_level_temperature=kUnknown --level_compaction_dynamic_level_bytes=0 --lock_wal_one_in=1000000 --log_file_time_to_roll=60 --log_readahead_size=16777216 --long_running_snapshots=1 --low_pri_pool_ratio=0 --lowest_used_cache_tier=2 --manifest_preallocation_size=5120 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 
--max_background_compactions=2 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_log_file_size=1048576 --max_manifest_file_size=1073741824 --max_sequential_skip_in_iterations=8 --max_total_wal_size=0 --max_write_batch_group_size_bytes=16 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_avg_op_scan_flush_trigger=2 --memtable_insert_hint_per_batch=0 --memtable_max_range_deletions=0 --memtable_op_scan_flush_trigger=100 --memtable_prefix_bloom_size_ratio=0.1 --memtable_protection_bytes_per_key=4 --memtable_veirfy_per_key_checksum_on_seek=0 --memtable_whole_key_filtering=1 --metadata_charge_policy=1 --metadata_read_fault_one_in=1000 --metadata_write_fault_one_in=0 --min_write_buffer_number_to_merge=1 --mmap_read=1 --mock_direct_io=False --multiscan_use_async_io=0 --nooverwritepercent=1 --num_bottom_pri_threads=20 --num_file_reads_for_auto_readahead=1 --open_files=-1 --open_metadata_read_fault_one_in=0 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_hits=1 --optimize_filters_for_memory=1 --optimize_multiget_for_io=1 --paranoid_file_checks=1 --paranoid_memory_checks=0 --partition_filters=1 --partition_pinning=1 --pause_background_one_in=10000 --periodic_compaction_seconds=0 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --preserve_internal_time_seconds=36000 --progress_reports=0 --promote_l0_one_in=0 --read_amp_bytes_per_bit=0 --read_fault_one_in=32 --readahead_size=16384 --readpercent=50 --recycle_log_file_num=0 --remote_compaction_failure_fall_back_to_local=1 --remote_compaction_worker_threads=8 --reopen=0 --report_bg_io_stats=1 --reset_stats_one_in=1000000 --sample_for_compression=5 --secondary_cache_fault_one_in=32 --secondary_cache_uri= --set_options_one_in=1000 --skip_stats_update_on_db_open=0 --snapshot_hold_ops=100000 --soft_pending_compaction_bytes_limit=68719476736 --sqfc_name=foo 
--sqfc_version=2 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --statistics=1 --stats_dump_period_sec=600 --stats_history_buffer_size=1048576 --strict_bytes_per_sync=0 --subcompactions=4 --sync=0 --sync_fault_injection=0 --table_cache_numshardbits=6 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_ingest_standalone_range_deletion_one_in=0 --test_secondary=0 --top_level_index_pinning=2 --track_and_verify_wals=0 --uncache_aggressiveness=72 --universal_max_read_amp=0 --universal_reduce_file_locking=1 --unpartitioned_pinning=1 --use_adaptive_mutex=0 --use_adaptive_mutex_lru=1 --use_attribute_group=0 --use_delta_encoding=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_get_entity=1 --use_merge=1 --use_multi_cf_iterator=1 --use_multi_get_entity=0 --use_multiget=0 --use_multiscan=0 --use_put_entity_one_in=0 --use_sqfc_for_range_queries=1 --use_timed_put_one_in=0 --use_write_buffer_manager=1 --user_timestamp_size=0 --value_size_mult=32 --verification_only=0 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_compression=0 --verify_db_one_in=100000 --verify_file_checksums_one_in=0 --verify_iterator_with_expected_state_one_in=5 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=1048576 --write_dbid_to_manifest=1 --write_fault_one_in=0 --write_identity_file=0 --writepercent=35
```

Reviewed By: hx235

Differential Revision: D83375779

Pulled By: jaykorean

fbshipit-source-id: 6dad06e3a825c4e9a7101ab8603d1c966be6a4f4
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/14002

Test Plan: verify according to https://github.com/facebook/rocksdb/tree/main/docs

Reviewed By: jaykorean

Differential Revision: D83209262

Pulled By: hx235

fbshipit-source-id: 688c855387e08c9b22644d4de3bc539e51a0ba0a
Summary:
Pretty self-explanatory from the changes, including re-arranging the "COOL" entries for easier tracking of which values are used.

I'm not touching the TICKER_ENUM_MAX issue because IIRC we've gotten in trouble in the past for changing any Java ticker values.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14012

Test Plan: CI, sufficient prompts to get AI to discover the known issues relayed by hx235, to help ensure we found any other outstanding issues.

Reviewed By: hx235

Differential Revision: D83497503

Pulled By: pdillinger

fbshipit-source-id: ec0bd7e28188e0430fb03fc5bd79c2ed7b28f3ad
Summary:
This PR fixes a bug in how MultiScan handled a scan range limit falling in the key range between files. The bug was in LevelIterator, where Prepare() relied on FindFile to determine the lower bound file for the range limit. FindFile returns the smallest file index with `range.limit < file.largest_key`. However, that doesn't guarantee that the range overlaps the file, as the `range.limit` could be smaller than `file.smallest_key`.

This also fixes a bug in BlockBasedTableIterator where Valid() returned true even though status() returned an error. This was exposed by the previous bug.
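The gap case in the first bug can be illustrated with a binary search over file key ranges; this is a simplified sketch in the spirit of FindFile, not RocksDB's actual code:

```python
import bisect

# Sorted, disjoint files as (smallest_key, largest_key) pairs.
files = [(10, 20), (40, 50), (70, 80)]

def find_file(limit):
    # Smallest index i with limit <= files[i].largest_key
    # (FindFile-style lower bound, simplified).
    return bisect.bisect_left([f[1] for f in files], limit)

# limit=30 falls in the gap between (10,20) and (40,50): find_file
# returns index 1, but file (40,50) does not actually overlap the range.
idx = find_file(30)
assert idx == 1
overlaps = files[idx][0] <= 30  # the missing check: limit >= smallest_key
assert overlaps is False

# limit=45 genuinely lands inside file 1, so the same check passes.
idx2 = find_file(45)
assert idx2 == 1 and files[idx2][0] <= 45
```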

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14011

Test Plan: Add unit tests in db_iterator_test and table_test

Reviewed By: cbi42

Differential Revision: D83496439

Pulled By: anand1976

fbshipit-source-id: a9d2d138d69d0c816d9f4160a984b273d00d683f
Summary:
This method will be used to improve the compaction logic by accounting for the tail size, in addition to the data size,  when determining when to cut a file.

Problem: Currently the file cutting logic only considers data size when determining where to cut a file, failing to reserve space for index and filter blocks that are added when the file is finalized.

Key changes:
- Add EstimateCurrentIndexSize() to IndexBuilder interface
- Implement in ShortenedIndexBuilder with buffer that accounts for the next index entry. The buffer addresses under-estimation where the current index size doesn't account for the next index entry associated with the data block currently being built. The 2x multiplier bounds the estimate in the right direction and handles outlier cases with large keys.
- Add num_index_entries_ member to track added index entries (== data blocks emitted). This is thread-safe since it's updated/read in the serialized emit step.
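The estimate with its 2x next-entry buffer can be sketched as follows; the per-entry sizing is a hypothetical simplification of the real index block format:

```python
def estimate_current_index_size(added_key_lens, next_key_len, handle_len=8):
    # Hypothetical simplification: each index entry is roughly its key
    # bytes plus a fixed-size block handle. Reserve 2x the size of the
    # pending entry (for the data block currently being built) so the
    # estimate errs on the high side even for outlier large keys.
    current = sum(k + handle_len for k in added_key_lens)
    return current + 2 * (next_key_len + handle_len)
```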

Next steps:
- Partitioned index size estimation implementation
- Update compaction file cutting logic to consider index size estimation

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14010

Test Plan: Added a new test class with unit tests for new builder size estimation across all IndexBuilder implementations.

Reviewed By: pdillinger

Differential Revision: D83501741

Pulled By: nmk70

fbshipit-source-id: d58fc2a9e92e12a162f6244d4abd707a9c9e1885
Summary:
### Context/Summary:
Flow of resuming: DB::OpenAndCompact() -> Compaction progress file  -> SubcompactionProgress -> CompactionJob
Flow of persistence: CompactionJob -> SubcompactionProgress -> Compaction progress file  -> DB that is called with OpenAndCompact()

This PR focuses on SubcompactionProgress -> CompactionJob and  CompactionJob -> SubcompactionProgress -> Compaction progress file. For now only single subcompaction is supported as OpenAndCompact() does not partition compaction anyway.

The actual triggering of progress persistence and resuming (i.e, integration) is through DB::OpenAndCompact() in the upcoming PR.

**Resume Flow**
1. input_iter->Seek(next_internal_key_to_compact)  // Position iterator
2. ReadTableProperties()                           // Validate existing outputs
3. RestoreCompactionOutputs() in CompactionOutputs                     // Rebuild output file metadata
4. Restore critical statistics about processed input and output records count for verification later
5. AdvanceFileNumbers()                            // Prevent file number conflicts
6. Continue normal compaction from the positioned iterator, or fall back to non-resumed compaction in limited cases, or fail the compaction entirely

**Persistence Strategy**
1. When: At each SST file completion (FinishCompactionOutputFile()). This is the simplest but most expensive frequency. See below for benchmarking and potential follow-up items
2. What: Serialize, write and sync the in-memory SubcompactionProgress to a dedicated manifest-like file
3. For simplicity: Only persist at "clean" boundaries (no overlapping user keys, no range deletions, no timestamp for now)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13983

Test Plan:
- New unit test in CompactionJob level to cover basic compaction progress resumption
- Existing UTs and stress/crash test to test no correctness regression to existing compaction code
- Run benchmark to ensure no performance regression to existing compaction code
```
./db_bench --benchmarks=fillseq[-X10] --db=$db --disable_auto_compactions=true --num=100000 --value_size=25000 --compression_type=none --target_file_size_base=268435456 --write_buffer_size=268435456
```
Pre-PR:
fillseq [AVG    10 runs] : 45127 (± 799) ops/sec; 1076.6 (± 19.1) MB/sec
fillseq [MEDIAN 10 runs] : 45375 ops/sec; 1082.5 MB/sec
Post-PR (regressed 0.057%, ignorable)
fillseq [AVG    10 runs] : 45101 (± 920) ops/sec; 1076.0 (± 22.0) MB/sec
fillseq [MEDIAN 10 runs] : 45385 ops/sec; 1082.8 MB/sec

Reviewed By: jaykorean

Differential Revision: D82889188

Pulled By: hx235

fbshipit-source-id: 8553fd478f134969d331af2c5a125b94bd747268
Summary:
Pad block based table based on super block alignment
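The padding arithmetic can be sketched as follows; this is a simplified model of the behavior the `--super_block_alignment_size` / `--super_block_alignment_max_padding_size` flags in the test plan suggest, not the actual table builder code:

```python
def super_block_padding(write_offset, alignment=131072, max_padding=4096):
    # Pad the file up to the next super-block boundary, but only when the
    # pad fits within the max-padding budget; otherwise skip padding.
    pad = (-write_offset) % alignment
    return pad if pad <= max_padding else 0
```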

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13909

Test Plan:
Unit Test

No perf impact observed from the change in the inner loop of flush.

upstream/main branch 202.15 MB/s
```
for i in `seq 1 10`; do ./db_bench --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 >> /tmp/x1 2>&1; done; grep fillseq /tmp/x1 | grep -Po "\d+\.\d+ MB/s" | grep -Po "\d+\.\d+" | awk '{sum+=$1} END {print sum/NR}'
```

After the change without super block alignment 203.44 MB/s
```
for i in `seq 1 10`; do ./db_bench --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 >> /tmp/x1 2>&1; done
```

After the change with super block alignment 204.47 MB/s
```
for i in `seq 1 10`; do ./db_bench --benchmarks=fillseq -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=1000 -fifo_compaction_allow_compaction=0 -disable_wal -write_buffer_size=12000000 -format_version=7 --super_block_alignment_size=131072 --super_block_alignment_max_padding_size=4096 >> /tmp/x1 2>&1; done
```

Reviewed By: pdillinger

Differential Revision: D83068913

Pulled By: xingbowang

fbshipit-source-id: eecd65088ab3e9dbc7902aab8c2580f1bc8575df
Summary:
In RocksDB 10.6 with https://github.com/facebook/rocksdb/issues/13805, due to inaccurate testing of an async system, it went undetected at the time that LZ4 compression started using more CPU, despite a change to reuse stream objects that dramatically improved LZ4HC compression efficiency.

This change switches to using a basic LZ4 compress API which appears to be faster than all of these:
* Legacy behavior of creating LZ4_stream_t for each compression
* 10.6-10.7 behavior of re-using streams between compressions for the same file (with stream-as-WorkingArea)
* using LZ4's extState APIs without streams (with extState-as-WorkingArea) (data not shown in below results)

Also in this PR: more improvements to sst_dump --recompress, which is arguably the best SST construction benchmark right now, since db_bench seems to be so noisy due to background flush+compaction, even with no compaction (FIFO). Streamlined some output and added an SST read time test, mostly for decompression performance.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14017

Test Plan:
Performance test using sst_dump --recompress with newer sst_dump back-ported to 10.5:
```
./sst_dump --command=recompress --compression_types=kLZ4Compression
test5.sst --compression_level_from=-6 --compression_level_to=-1
```
and with default compression level.

10.5:
```
Cx level: -6    Cx size:   61608137 Write usec:     880404
Cx level: -5    Cx size:   60793749 Write usec:     840903
Cx level: -4    Cx size:   58134030 Write usec:     836365
Cx level: -3    Cx size:   55193773 Write usec:     857113
Cx level: -2    Cx size:   54013891 Write usec:     855642
Cx level: -1    Cx size:   50400393 Write usec:     865194
Cx level: 32767 Cx size:   50400393 Write usec:     886310
```

Before this change, from 10.6 (showing the regression: more time):
```
Cx level: -6    Cx size:   61608137 Write usec:     933448
Cx level: -5    Cx size:   60793749 Write usec:     893826
Cx level: -4    Cx size:   58134030 Write usec:     891138
Cx level: -3    Cx size:   55193773 Write usec:     898461
Cx level: -2    Cx size:   54013891 Write usec:     897485
Cx level: -1    Cx size:   50400393 Write usec:     936970
Cx level: 32767 Cx size:   50400393 Write usec:     958764
```

After this change (faster than both the above):
```
Cx level: -6    Cx size:   63641883 Write usec:     874190
Cx level: -5    Cx size:   58860032 Write usec:     834662
Cx level: -4    Cx size:   57150188 Write usec:     832707
Cx level: -3    Cx size:   58791894 Write usec:     850305
Cx level: -2    Cx size:   53145885 Write usec:     839574
Cx level: -1    Cx size:   49809139 Write usec:     845639
Cx level: 32767 Cx size:   49809139 Write usec:     875199
```

Similar tests with dictionary compression show essentially no difference (they need to use the stream APIs, and reuse doesn't seem to matter). LZ4HC is also unaffected (still improved vs. 10.5).

Reviewed By: hx235

Differential Revision: D83722880

Pulled By: pdillinger

fbshipit-source-id: 30149dd187686d5dd98321e6aa7d74bd7653a905
Summary:
When inplace_update_support and memtable_verify_per_key_checksum_on_seek are enabled at the same time, they can cause a data race in the memtable.

inplace_update_support allows in-place updates of key/value pairs in the memtable.

memtable_verify_per_key_checksum_on_seek performs key checksum verification during seek. It is possible for one thread to be updating a key/value pair in place while another thread is reading the same pair for checksum verification during seek.

Therefore, these two configurations cannot be enabled at the same time.
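The incompatibility amounts to a simple option-validation rule. A minimal standalone sketch (option and function names here are illustrative assumptions, not RocksDB's actual API):

```cpp
#include <string>

// Hypothetical validation of the mutually exclusive options described
// above. Names are illustrative; RocksDB's real option handling differs.
struct MemtableOpts {
  bool inplace_update_support = false;
  bool verify_per_key_checksum_on_seek = false;
};

// Returns a non-empty error message when both options are enabled, since
// in-place writers could race with checksum-verifying readers.
std::string ValidateMemtableOpts(const MemtableOpts& opts) {
  if (opts.inplace_update_support && opts.verify_per_key_checksum_on_seek) {
    return "inplace_update_support is incompatible with per-key "
           "checksum verification on seek";
  }
  return "";  // OK
}
```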

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14023

Test Plan: local stress test run stops reporting race condition

Reviewed By: anand1976

Differential Revision: D83812322

Pulled By: xingbowang

fbshipit-source-id: 6cb9f0f3faa8deba97305bfe87266f2fe78e0501
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14024

Fix some typo found along the codebase

Reviewed By: pdillinger

Differential Revision: D83789182

fbshipit-source-id: feb24d7d47a6faaf735fcfd50dd3ecce4a6c8cd5
Summary:
Implicit reseek in the middle of an iteration is not supported with MultiScan. Avoid this for now in crash tests by setting max_sequential_skip_in_iterations to an absurdly high value.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14015

Reviewed By: xingbowang

Differential Revision: D83761612

Pulled By: anand1976

fbshipit-source-id: 16f4e856374b79170c0a79c11c275cbb0fc83a70
Summary:
When there is an ingested SST file that only contains range deletion operations, MultiScan may return the error "Scan does not intersect with file". This is because file selection during Prepare uses the file's smallest and largest keys without considering whether the file contains any point keys. This is only a temporary fix.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14028

Test Plan: Unit test

Reviewed By: anand1976

Differential Revision: D83986964

Pulled By: xingbowang

fbshipit-source-id: e0961ca854e2062c2457be4324817ba073ae785d
Summary:
MultiScan currently doesn't handle delete range properly. In this specific case, a file containing only a range deletion has an empty index, which leads BlockBasedTableIterator to wrongly conclude from the empty result that a scan doesn't intersect the file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14026

Test Plan: Run crash test

Reviewed By: xingbowang

Differential Revision: D83881266

Pulled By: anand1976

fbshipit-source-id: dc1faa494ea23f36391b700dd1ee0430a1f20ac5
Summary:
Fix the binutils truncated download issue by switching to wget in the folly build scripts for downloading dependencies.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14030

Test Plan: make build_folly

Reviewed By: jaykorean

Differential Revision: D84033126

Pulled By: anand1976

fbshipit-source-id: bc6706d7e57c97d6edff149a965aa12c7959825f
Summary:
Fix assertion failure in crash tests with timestamp due to the wrong comparator passed to MultiScanArgs

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14033

Reviewed By: xingbowang

Differential Revision: D84036954

Pulled By: anand1976

fbshipit-source-id: 526be21c0754dcccf8e4d2b9fba33716fe35860a
Summary:
- Include Status in RemoteCompactionResultMap in SharedState so that we can directly check the status of the remote compaction in `DbStressCompactionService::Wait()`
- If result is empty, populate the result with the status that was returned from `GetRemoteCompactionResult()` so that the status can be bubbled up to the primary (main db thread)
- Get rid of Timeout in `Wait()`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14022

Test Plan:
With fall-back
```
python3 -u tools/db_crashtest.py blackbox --remote_compaction_worker_threads=8 --remote_compaction_failure_fall_back_to_local=1
```

Without fall-back
```
python3 -u tools/db_crashtest.py blackbox --remote_compaction_worker_threads=8 --remote_compaction_failure_fall_back_to_local=0
```

Reviewed By: hx235

Differential Revision: D83789172

Pulled By: jaykorean

fbshipit-source-id: 08f710c4ece5fcc1d4b95b3f9c353831882851b7
Summary:
**Context/Summary:**
There is no way to tag or rate-limit, by priority, the write IO that occurs during FlushWAL(). Under `Options::manual_wal_flush=true`, it is the major source of write IO during user writes, so we decided to add that support. A new option struct `FlushWALOptions` is introduced to avoid making the API ugly for future new fields.

Also, we can't reuse the WriteOptions (https://github.com/facebook/rocksdb/blob/main/include/rocksdb/options.h#L2293-L2302) since it is associated with a particular Put/Merge/etc., while FlushWAL() can happen after that write; there is no way to carry that write option over in RocksDB. WriteOptions is also mostly intended for live writes.
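The options-struct pattern mentioned above can be sketched as follows; the field names are assumptions for illustration, not necessarily the actual FlushWALOptions fields:

```cpp
#include <cstdint>

// Sketch of an extensible options struct: callers pass a struct rather
// than bare arguments, so new fields (e.g. an IO priority used for
// tagging and rate limiting) can be added later without breaking the
// API. Field names are illustrative assumptions.
enum class IOPriority : uint8_t { kLow = 0, kMid = 1, kHigh = 2 };

struct FlushWALOptionsSketch {
  bool sync = false;                                     // fsync after flushing
  IOPriority rate_limiter_priority = IOPriority::kHigh;  // priority for write IO
};

IOPriority EffectivePriority(const FlushWALOptionsSketch& opts) {
  return opts.rate_limiter_priority;
}
```

A caller can rely on the defaults and override only what it needs, e.g. `FlushWALOptionsSketch{true, IOPriority::kLow}` for a synced, low-priority flush.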

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14037

Test Plan: New UTs `TEST_P(DBRateLimiterOnManualWALFlushTest, ManualWALFlush)`

Reviewed By: archang19

Differential Revision: D84193522

Pulled By: hx235

fbshipit-source-id: 18feb5235672010d19a101ce52c8abdcc4a789f2
Summary:
Self-explanatory. First drafts were written with AI and then heavily revised; AI also helped with the diagram.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14035

Test Plan: https://pdillinger.github.io/rocksdb/blog/2025/10/08/parallel-compression-revamp.html

Reviewed By: hx235

Differential Revision: D84277660

Pulled By: pdillinger

fbshipit-source-id: 4d76f60f3f7304392836fa4df7a819e67d531a52
Summary:
Multi scan crash/stress tests are failing when skip_stats_update_on_db_open is true, because LevelIterator::Prepare relies on these stats in FileMetaData to make decisions. Disable it in crash tests until the proper fix is ready.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14039

Reviewed By: archang19

Differential Revision: D84280059

Pulled By: anand1976

fbshipit-source-id: f9f58b94c24d1f455432b05f3bf97f25c7233e3c
Summary:
The current seek key validation is too strict. This change relaxes it at the block iterator level and adds an additional check at the DB iterator level. The new contract is that when MultiScan is used, after Prepare() is called, each following seek must target the start key of a prepared scan range, in order. Otherwise, the iterator is set to an error status.
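The new contract can be modeled with a small standalone sketch (purely illustrative of the ordering rule, not the real iterator code):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Model of the MultiScan seek contract: given the prepared ranges' start
// keys, every subsequent seek must target the next start key in order,
// otherwise the iterator enters an error state.
bool SeeksObeyContract(const std::vector<std::string>& prepared_starts,
                       const std::vector<std::string>& seek_targets) {
  std::size_t next = 0;
  for (const auto& target : seek_targets) {
    if (next >= prepared_starts.size() || target != prepared_starts[next]) {
      return false;  // out-of-order or unexpected seek -> error status
    }
    ++next;
  }
  return true;
}
```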

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14040

Test Plan: Unit test

Reviewed By: anand1976

Differential Revision: D84292297

Pulled By: xingbowang

fbshipit-source-id: 7b31f727e67e7c0bfc53c2f9a6552e0c3d324869
Summary:
Currently in BlockBasedTableIterator's Prepare(), the index lookup for a MultiScan range is expected to return at least one data block (unless UDI is in use). This is because there's an implicit assumption that only ranges intersecting the keys in the file will be prepared. This assumption, however, doesn't hold if there are range deletions and the smallest and/or largest key in the file extends beyond the point keys in the file. The LevelIterator prunes the MultiScan ranges based on the smallest/largest key, so it's possible for a range to overlap only the range deletion portion of the file and none of the data blocks. Furthermore, the BlockBasedTableIterator is now much more forgiving of seeks to targets outside of prepared ranges after https://github.com/facebook/rocksdb/issues/14040 .

Keeping the above in mind, this PR removes the check in BlockBasedTableIterator for a non-empty index result. It adds assertions in LevelIterator to verify that ranges are being properly pruned. Another side effect is that we can no longer rely solely on a scan range having 0 data blocks (i.e. cur_scan_start_idx >= cur_scan_end_idx) to decide whether the iterator is out of bounds; we can only do so for all but the last prepared range.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14046

Test Plan:
1. Add unit test in db_iterator_test
2. Run crash test

Reviewed By: xingbowang

Differential Revision: D84623871

Pulled By: anand1976

fbshipit-source-id: 2418e629f92b1c46c555ddea3761140f700819e4
Summary:
### Context/Summary:
This is stacked on top of https://github.com/facebook/rocksdb/pull/13983 and integrates compaction resumption with OpenAndCompact().

Flow of resuming: DB::OpenAndCompact() -> Compaction progress file  -> SubcompactionProgress -> CompactionJob
Flow of persistence: CompactionJob -> SubcompactionProgress -> Compaction progress file  -> DB that is called with OpenAndCompact()

This PR focuses on DB::OpenAndCompact() -> Compaction progress file  -> SubcompactionProgress  and Compaction progress file -> DB that is called with OpenAndCompact()

**Resume Flow**
1. Check configuration. Right now paranoid_file_check=true (default false) is not yet compatible with allow_resumption=true. Also, only a single subcompaction is supported, as OpenAndCompact() does not partition the compaction anyway.
2. Scan compaction output files for the latest, old, and temporary compaction progress files and output files. If the latest compaction progress file exists, we should resume.
3. Clean up older or temporary progress files, if any. They can exist if the last OpenAndCompact() crashed during the resume flow.
4. If present, parse the latest progress file into CompactionProgress and clean up extra compaction output files that are not yet tracked. Such files can exist because tracking every output file is best-effort, and an output file interrupted in the middle is not yet tracked as progress.
5. If allow_resumption=false, or no valid compaction progress is found or parsed, clean up the latest progress file and existing compaction output files to start a fresh compaction. If the cleanup itself fails, fail the OpenAndCompact() call to prevent resuming with an inconsistency between output files and the progress file.

**Progress File Creation**
1. Create a temporary progress file
2. Persist the progress from the latest compaction progress file to the temporary progress file. This simplifies resuming from an interrupted compaction that was itself just resumed, similar to how manifest recovery works.
3. Rename the temporary progress file to the newer compaction progress file name, so it atomically becomes the new latest progress file
4. Delete the old latest progress file since it's now useless.
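The steps above follow the classic write-temp-then-rename pattern; a minimal sketch (illustrative file handling only — a real implementation would also fsync the file and its directory):

```cpp
#include <cstdio>
#include <fstream>
#include <iterator>
#include <string>

// Persist `contents` to `path` so that readers never observe a
// half-written file: write a temporary file, then atomically rename it
// over the target (rename() also replaces any previous file at `path`).
bool PersistProgressAtomically(const std::string& path,
                               const std::string& contents) {
  const std::string tmp = path + ".tmp";
  {
    std::ofstream out(tmp, std::ios::trunc);
    if (!out) return false;
    out << contents;  // write the full progress into the temp file
    out.flush();
    if (!out) return false;
  }
  // Atomically promote the temp file to be the latest progress file.
  return std::rename(tmp.c_str(), path.c_str()) == 0;
}

std::string ReadAll(const std::string& path) {
  std::ifstream in(path);
  return std::string(std::istreambuf_iterator<char>(in), {});
}
```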

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13984

Test Plan:
- Integrated unit tests to simulate OpenAndCompact gets canceled and optionally resumed for remote compaction
- Existing UTs and stress/crash test
- Manual stress test with https://github.com/facebook/rocksdb/pull/14041

### Performance testing:
**1. Latency**
Using
```
./db_bench --benchmarks=OpenAndCompact[X5] --openandcompact_test_cancel_on_odd=false --openandcompact_cancel_after_seconds=0 --openandcompact_allow_resumption=$openandcompact_allow_resumption  --use_existing_db=true --db=$db --disable_auto_compactions=true --compression_type=none --secondary_path=$secondary_path  --target_file_size_base=268435456
```
**allow_resumption = false**
Input files: 101 files, 10000 keys
OpenAndCompact() API call : 26766256.000 micros/op 0 ops/sec 26.766 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 27837249.000 micros/op 0 ops/sec 27.837 seconds 1 operations;

Input files: 101 files, 10000 keys
OpenAndCompact() API call : 26546234.000 micros/op 0 ops/sec 26.546 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 27918621.000 micros/op 0 ops/sec 27.919 seconds 1 operations;
OpenAndCompact [AVG 2 runs] : 0 (± 0) ops/sec

Input files: 101 files, 10000 keys
OpenAndCompact() API call : 42243571.000 micros/op 0 ops/sec 42.244 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 43497581.000 micros/op 0 ops/sec 43.498 seconds 1 operations;
OpenAndCompact [AVG 3 runs] : 0 (± 0) ops/sec

Input files: 101 files, 10000 keys
OpenAndCompact() API call : 34241357.000 micros/op 0 ops/sec 34.241 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 35655346.000 micros/op 0 ops/sec 35.655 seconds 1 operations;
OpenAndCompact [AVG 4 runs] : 0 (± 0) ops/sec

Input files: 101 files, 10000 keys
OpenAndCompact() API call : 27083361.000 micros/op 0 ops/sec 27.083 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 28487999.000 micros/op 0 ops/sec 28.488 seconds 1 operations;
OpenAndCompact [AVG 5 runs] : 0 (± 0) ops/sec

OpenAndCompact [AVG    5 runs] : 0 (± 0) ops/sec; 31669.681 ms/op
OpenAndCompact [MEDIAN 5 runs] : 0 ops/sec

**allow_resumption= true**
Input files: 101 files, 10000 keys
OpenAndCompact() API call : 25446470.000 micros/op 0 ops/sec 25.446 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 26833415.000 micros/op 0 ops/sec 26.833 seconds 1 operations;

Input files: 101 files, 10000 keys
OpenAndCompact() API call : 240745.000 micros/op 0 ops/sec 0.241 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact :  244934.000 micros/op 4 ops/sec 0.245 seconds 1 operations;
OpenAndCompact [AVG 2 runs] : 2 (± 3) ops/sec

Input files: 101 files, 10000 keys
OpenAndCompact() API call : 24843383.000 micros/op 0 ops/sec 24.843 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 26192235.000 micros/op 0 ops/sec 26.192 seconds 1 operations;
OpenAndCompact [AVG 3 runs] : 1 (± 2) ops/sec

Input files: 101 files, 10000 keys
OpenAndCompact() API call : 270819.000 micros/op 0 ops/sec 0.271 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact :  275140.000 micros/op 3 ops/sec 0.275 seconds 1 operations;
OpenAndCompact [AVG 4 runs] : 1 (± 2) ops/sec

Input files: 101 files, 10000 keys
OpenAndCompact() API call : 23038311.000 micros/op 0 ops/sec 23.038 seconds 1 operations;
OpenAndCompact status: OK
Output: 92 files, average size: 271747380 bytes (259.16 MB)
OpenAndCompact : 24439097.000 micros/op 0 ops/sec 24.439 seconds 1 operations;
OpenAndCompact [AVG 5 runs] : 1 (± 1) ops/sec

OpenAndCompact [AVG    5 runs] : 1 (± 1) ops/sec; 638.417 ms/op
OpenAndCompact [MEDIAN 5 runs] : 0 ops/sec

**Persistence cost:** Comparing the odd-numbered OpenAndCompact() calls (which run the full compaction), the API is actually faster with persistence enabled.
**Resumption saving:** (26.766 - 0.2) / 26.766 * 100 = 99.25% improvement versus redoing all the compaction progress without the allow_resumption feature

**2.  Memory usage** (in case SubcompactionProgress storing its own memory copies of output filemetadata in https://github.com/facebook/rocksdb/pull/13983/files is a trouble)
 ```
// ~= 90 output files
/usr/bin/time -f "
Resource Summary:
  Wall time: %e seconds
  CPU time: %U user + %S system (%P total)
  Peak memory: %M KB
  Page faults: %F major + %R minor
" ./db_bench --benchmarks=OpenAndCompact[X1] --openandcompact_test_cancel_on_odd=false --openandcompact_cancel_after_seconds=0 --openandcompact_allow_resumption=$openandcompact_allow_resumption  --use_existing_db=true --db=$db --disable_auto_compactions=true --compression_type=none --secondary_path=$secondary_path  --target_file_size_base=268435456
```
**allow_resumption = false**
Peak memory: 275828 KB
**allow_resumption = true**
Peak memory: 277204 KB (a 0.5% regression in memory usage, most likely due to storing copies of the output files' metadata in the subcompaction progress)

### Near-term follow up:
- Add statistics to record the successfully resumed compaction output files bytes
- Add stress/crash test support to cover error paths (including progress file sync error), crash/cancel OpenAndCompact() at random compaction progress point and surface feature incompatibility
   - See https://github.com/facebook/rocksdb/pull/14041
- Resolve the TODO https://github.com/facebook/rocksdb/pull/13984/files#diff-17fbdec07244b1f07d1a4e5aed0a6feecf4474d20b3129818c10fc0ff9f3d547R1303-R1314
   - See https://github.com/facebook/rocksdb/pull/14042

Reviewed By: jaykorean

Differential Revision: D84299662

Pulled By: hx235

fbshipit-source-id: 69bbf395401604172a1a5c557ca834011a3d51d7
Summary:
We are adding more and more tests, so we need to increase the number of shards in macos build to reduce overall CI time.

The macos-15-xlarge image is ARM with 5 vCPUs, but it is still 50% faster than the Intel x86 image with 12 vCPUs.

Test time reduced from 1h 37m to 14m.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14048

Reviewed By: archang19

Differential Revision: D84741917

Pulled By: xingbowang

fbshipit-source-id: 9ba9bd696d3b2152f11dec2fb4280572b98233d5
Summary:
We simulate remote compaction in our stress test by running a separate set of worker threads to run compactions. In reality, these remote compactions run on a different host (or at least in a different process), where we cannot share the TableFactory and BlockCache with the main DB process.

To make this simulated remote compaction closer to reality, create a new TableFactory for each remote compaction in stress test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14050

Test Plan:
```
python3 -u tools/db_crashtest.py --cleanup_cmd='' --simple blackbox --remote_compaction_worker_threads=8 --interval=10
```

Reviewed By: hx235

Differential Revision: D84775656

Pulled By: jaykorean

fbshipit-source-id: d6203fcbe0eca3539e008a19fd47b742553537ed
Summary:
* Reduce the folly build time from 45m-1h down to 25m by caching the folly build artifact from the previous build.
* Reduce the Windows folly build time from 1h 15m down to 50m by increasing the Windows build machine size.
* Fix the macOS build for other macOS targets.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14057

Test Plan: github CI

Reviewed By: archang19, nmk70

Differential Revision: D84848041

Pulled By: xingbowang

fbshipit-source-id: 00306750737070e7e446ee436d607ed6ecae79ae
Summary:
**Context/Summary:**

This PR adds support to dump compaction progress file in ldb for debugging resumable compaction issue

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14058

Test Plan:
```
/data/users/huixiao/rocksdb$ ./ldb compaction_progress_dump --path=/home/huixiao/COMPACTION_PROGRESS-123
Compaction Progress File: /home/huixiao/COMPACTION_PROGRESS-123
============================================
Progress Record 0:
SubcompactionProgress{ next_internal_key_to_compact=user_key="b" (hex:62), seq=kMaxSequenceNumber, type=24, num_processed_input_records=1, output_level_progress=SubcompactionProgressPerLevel{ num_processed_output_records=1, output_files_count=1, last_persisted_output_files_count=0 }, proximal_output_level_progress=SubcompactionProgressPerLevel{ num_processed_output_records=0, output_files_count=0, last_persisted_output_files_count=0 } }
Progress Record 1:
SubcompactionProgress{ next_internal_key_to_compact=user_key="bb" (hex:6262), seq=kMaxSequenceNumber, type=24, num_processed_input_records=2, output_level_progress=SubcompactionProgressPerLevel{ num_processed_output_records=2, output_files_count=1, last_persisted_output_files_count=0 }, proximal_output_level_progress=SubcompactionProgressPerLevel{ num_processed_output_records=0, output_files_count=0, last_persisted_output_files_count=0 } }
Progress Record 2:
SubcompactionProgress{ next_internal_key_to_compact=user_key="cancel_before_this_key" (hex:63616E63656C5F6265666F72655F746869735F6B6579), seq=kMaxSequenceNumber, type=24, num_processed_input_records=3, output_level_progress=SubcompactionProgressPerLevel{ num_processed_output_records=3, output_files_count=1, last_persisted_output_files_count=0 }, proximal_output_level_progress=SubcompactionProgressPerLevel{ num_processed_output_records=0, output_files_count=0, last_persisted_output_files_count=0 } }

Total records: 3
```

Reviewed By: jaykorean

Differential Revision: D84840680

Pulled By: hx235

fbshipit-source-id: 8e448c50348eb1dba92c4ffdbd2d1bb6306288d6
Summary:
Currently we return `File is too large for PlainTableReader!` when the file size exceeds our pre-defined constant. There was a request to have the file size information logged when this error is returned.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14056

Reviewed By: nmk70

Differential Revision: D84834869

Pulled By: archang19

fbshipit-source-id: 8f332b6a31d51f320c7e2db06ad49f50798ff70e
Summary:
Context/Summary: as titled

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14054

Test Plan: new UT, manually checking

Reviewed By: jaykorean

Differential Revision: D84828431

Pulled By: hx235

fbshipit-source-id: 56e1a9159f7597a10d6c549657d8b22788aa0599
Summary:
**Problem:** RocksDB was making unnecessary prefetch system calls on file systems that don't support prefetch operations, potentially leading to wasted CPU cycles.

**Fix:** Add kFSPrefetch to FSSupportedOps enum to allow file systems to indicate prefetch support capability. File systems can now opt out of prefetch calls by not setting this field.

**Backwards compatibility:** File systems that don't override SupportedOps() continue to receive prefetch calls exactly as before. Only file systems that override SupportedOps() without setting kFSPrefetch will avoid the calls.
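The opt-in mechanism is a capability bitmask; a standalone sketch of the check (enumerator names and values here are illustrative, not RocksDB's actual FSSupportedOps definition):

```cpp
#include <cstdint>

// A file system advertises supported operations as bits in a mask;
// callers consult the mask before issuing a Prefetch() call.
enum FSSupportedOpsSketch : uint64_t {
  kAsyncIO = 1u << 0,
  kFSBuffer = 1u << 1,
  kFSPrefetch = 1u << 2,  // set only if the file system handles prefetch
};

// Skip the prefetch system call entirely when the bit is absent.
bool ShouldIssuePrefetch(uint64_t supported_ops) {
  return (supported_ops & kFSPrefetch) != 0;
}
```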

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13917

Test Plan:
- Added a new test in block_based_table_reader.
- Run existing tests: ```make prefetch_test && ./prefetch_test```

Reviewed By: anand1976

Differential Revision: D81607145

Pulled By: nmk70

fbshipit-source-id: 3bbefa05919034e8776ea4e4540cdc695cdc6d3f
Summary:
Fixing a nullptr access in MultiScan, under the following situation.

```
Block Based Table: blk1:[k1,k2], blk2:[k3,                k8], blk3:[k9]
Scan ranges:            [k1,             k4), [k5,k6), [k7,            k10)
Prepared block ranges:  [0,2],                [2,2],   [1,3]
```

1. Seek key k1 on the first range; read keys k1, k2.
2. Seek key k4 on the 2nd range; blocks 0 and 1 would be unpinned.
3. Seek key k9; block 1 would be accessed, but it is unpinned, which triggers an assert failure in debug mode and a nullptr access in release builds.

This fix changes how blocks are unpinned: a block is now only unpinned once cur_data_block_idx has passed it.
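The revised unpinning rule can be illustrated with a toy model (not the real iterator code): only blocks strictly before the current data block index are released, so blocks still needed by overlapping prepared ranges stay pinned.

```cpp
#include <cstddef>
#include <vector>

// Unpin only blocks the scan has moved past, then return how many remain
// pinned. With overlapping prepared ranges, a block at or ahead of the
// current index may be revisited, so it must stay pinned.
std::size_t PinnedAfterAdvance(std::vector<bool> pinned,
                               std::size_t cur_data_block_idx) {
  for (std::size_t i = 0; i < pinned.size() && i < cur_data_block_idx; ++i) {
    pinned[i] = false;  // safe: block i will never be accessed again
  }
  std::size_t still_pinned = 0;
  for (bool p : pinned) still_pinned += p ? 1 : 0;
  return still_pinned;
}
```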

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14062

Test Plan:
Unit Test
rand_seed 304010984 on UserDefinedIndexStressTest

Reviewed By: cbi42

Differential Revision: D84976410

Pulled By: xingbowang

fbshipit-source-id: 6b99bf85fc9d4108c5267ae77be77ccfe08923cd
Summary:
With cache hits and compiler-option optimization, compilation time is reduced from 40 min to 2 min. Overall build time is reduced from 60 min to less than 20 minutes with cache hits on the majority of source files. On a 100% cache miss, it is around 40 minutes.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14064

Test Plan: Github CI

Reviewed By: mszeszko-meta

Differential Revision: D85023882

Pulled By: xingbowang

fbshipit-source-id: 98551880c98f14d36133ff43e6af8c3be94ab465
Summary:
**Context/Summary:** as titled.

This can be used to benchmark OpenAndCompact() and OpenAndCompactionOptions::allow_resumption. See below for usage.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14003

Test Plan:
1. Simple OpenAndCompact()
```
openandcompact_allow_resumption=false

./db_bench --use_existing_db=true --db=$db --benchmarks=openandcompact --openandcompact_test_cancel_on_odd=false --openandcompact_cancel_after_millseconds=0 --openandcompact_allow_resumption=$openandcompact_allow_resumption  --disable_auto_compactions=true --compression_type=none --secondary_path=$secondary_path

...
DB path: [/dev/shm/test]

Input files: 101 files, 10000 keys
OpenAndCompact() API call : 39746440.000 micros/op 39.746 seconds/op
OpenAndCompact status: OK
Output: 358 files, average size: 69835396 bytes (66.60 MB)
openandcompact : 39977603.000 micros/op 0 ops/sec 39.978 seconds 1 operations;
```

2. OpenAndCompact() with cancellation (after the whole compaction essentially finishes) and resumption
```
openandcompact_allow_resumption=true
./db_bench --use_existing_db=true --db=$db --benchmarks=openandcompact[X2] --openandcompact_test_cancel_on_odd=false --openandcompact_cancel_after_millseconds=0 --openandcompact_allow_resumption=$openandcompact_allow_resumption  --disable_auto_compactions=true --compression_type=none --secondary_path=$secondary_path

..
DB path: [/dev/shm/test]
Running benchmark for 2 times

Input files: 101 files, 10000 keys
OpenAndCompact() API call : 40095045.000 micros/op 40.095 seconds/op
OpenAndCompact status: OK
Output: 358 files, average size: 69835396 bytes (66.60 MB)
openandcompact : 41471226.000 micros/op 0 ops/sec 41.471 seconds 1 operations;

Input files: 101 files, 10000 keys
OpenAndCompact() API call : 336588.000 micros/op 0.337 seconds/op // Resume
OpenAndCompact status: OK
Output: 358 files, average size: 69835396 bytes (66.60 MB)
openandcompact :  573885.000 micros/op 1 ops/sec 0.574 seconds 1 operations;
openandcompact [AVG 2 runs] : 0 (± 1) ops/sec

openandcompact [AVG    2 runs] : 0 (± 1) ops/sec; 1132.236 ms/op
openandcompact [MEDIAN 2 runs] : 0 ops/sec
```

3. OpenAndCompact() with cancellation at a fixed point and resumption
```
openandcompact_allow_resumption=true
./db_bench --use_existing_db=true --db=$db --benchmarks=openandcompact[X2] --openandcompact_test_cancel_on_odd=true --openandcompact_cancel_after_millseconds=6000 --openandcompact_allow_resumption=$openandcompact_allow_resumption  --disable_auto_compactions=true --compression_type=none --secondary_path=$secondary_path

...
DB path: [/dev/shm/test]
Running benchmark for 2 times

 --- Run 1 (odd - will cancel) ---

Input files: 101 files, 10000 keys
OpenAndCompact() API call : 6005787.000 micros/op 6.006 seconds/op // Cancel accordingly
OpenAndCompact status: Result incomplete: Manual compaction paused
openandcompact : 7255346.000 micros/op 0 ops/sec 7.255 seconds 1 operations;

 --- Run 2 (even - resume) ---

Input files: 101 files, 10000 keys
OpenAndCompact() API call : 33013725.000 micros/op 33.014 seconds/op // Resume
OpenAndCompact status: OK
Output: 358 files, average size: 69835396 bytes (66.60 MB)
openandcompact : 33244026.000 micros/op 0 ops/sec 33.244 seconds 1 operations;
openandcompact [AVG 2 runs] : 0 (± 0) ops/sec

openandcompact [AVG    2 runs] : 0 (± 0) ops/sec; 11911.234 ms/op
openandcompact [MEDIAN 2 runs] : 0 ops/sec
```

Reviewed By: jaykorean

Differential Revision: D84839965

Pulled By: hx235

fbshipit-source-id: 21c4cd01be67da0a128e2de1c3aae93aa97662bd
Summary:
**Context/Summary:**
- Add resumable compaction to the stress test with adaptive progress cancellation
- Add fault injection to remote compaction
- Fix a real minor bug and a couple of testing-framework bugs with remote compaction

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14041

Test Plan: Rehearsal stress tests, which effectively found bugs for https://github.com/facebook/rocksdb/pull/13984 and did not create new failures.

Reviewed By: jaykorean

Differential Revision: D84524194

Pulled By: hx235

fbshipit-source-id: 42b4264e428c6739631ed9aa5eb02723367510bc
Summary:
10.8.0 branch has been cut.

Updated
- HISTORY.md
- include/rocksdb/version.h
- tools/check_format_compatible.sh

To follow up
- folly update

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14067

Test Plan: CI

Reviewed By: pdillinger

Differential Revision: D85186398

Pulled By: jaykorean

fbshipit-source-id: 44920156aa2a62ba40626766dc4ebdbc02f23fa8
Summary:
When a standalone range deletion file is ingested in L0, currently it is compacted with any overlapping L0 files. This is not desirable when we ingest new data on top of the range deletion file. This PR fixes the compaction picking logic to only consider L0 files older than the standalone range deletion file.
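The picking rule can be sketched with a toy model in which file age is approximated by largest sequence number (illustrative only; the real picker also checks key-range overlap and other constraints):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Given the largest sequence numbers of overlapping L0 files, select only
// the files older than the ingested standalone range-deletion file; files
// ingested on top of it are left out of the compaction.
std::vector<std::size_t> PickOlderL0Files(
    const std::vector<uint64_t>& l0_largest_seqnos,
    uint64_t range_del_file_seqno) {
  std::vector<std::size_t> picked;
  for (std::size_t i = 0; i < l0_largest_seqnos.size(); ++i) {
    if (l0_largest_seqnos[i] < range_del_file_seqno) {
      picked.push_back(i);  // older than the range deletion file
    }
  }
  return picked;
}
```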

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14061

Test Plan: added a new unit test and updated an existing one.

Reviewed By: xingbowang

Differential Revision: D84930780

Pulled By: cbi42

fbshipit-source-id: 65f4403ccb40ba964b9e65b09e2f7f7efebe81df
Summary:
The linter complains like this:
```
  void foo(Arg parameter_name) {}
    void bar() {
    Arg a;
    foo(/*some_other_name=*/ a); // Wrong! Comment/parameter name mismatch
    foo(/*parameter_name=*/ a);  // This is OK; the names match.
  }
```
```
Argument name in comment (`read_only`) does not match parameter name (`unchanging`).
```

This used to be a warning, but is now treated as an error :(

Fixing a few other linter warnings before they become errors in the future.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14074

Test Plan: CI

Reviewed By: archang19

Differential Revision: D85370353

Pulled By: jaykorean

fbshipit-source-id: 20e96aad740d516a29c0424282674e655f99c0a2
Summary:
* Address feedback from https://github.com/facebook/rocksdb/issues/14040
* Add additional test for MultiScan
* Fix a bug when del range and data are in same file for multi-scan
* Rewrite the cases need to be handled in SeekMultiScan

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14055

Test Plan: Unit test

Reviewed By: cbi42, anand1976

Differential Revision: D84851788

Pulled By: xingbowang

fbshipit-source-id: 0f69632733afb99685f6341badbf239681010c38
Summary:
Should not use async_io when it is not supported, to avoid the assert failure here: dce33f9443/table/block_based/block_based_table_iterator.cc (L1710).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14077

Test Plan: monitor future CI failure.

Reviewed By: anand1976

Differential Revision: D85456447

Pulled By: cbi42

fbshipit-source-id: dccc865a5aedf194029a53616f4bbc99d0162691
Summary:
Fixed a missing CV signal when `FindObsoleteFiles()` decides there is nothing to purge and then decrements `pending_purge_obsolete_files_` to zero.  This bug could cause `DB::GetSortedWalFiles()` to hang, at least.
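The bug class is general: every path that changes a condition variable's predicate must signal it. A standalone sketch of the corrected pattern (illustrative, not the actual FindObsoleteFiles() code):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// A waiter blocks until no purges are pending. The essence of the fix:
// the path that decrements the count to zero must notify, or the waiter
// hangs forever.
struct PurgeState {
  std::mutex mu;
  std::condition_variable cv;
  int pending_purges = 0;

  void FinishPurge() {
    std::lock_guard<std::mutex> lk(mu);
    if (--pending_purges == 0) {
      cv.notify_all();  // the signal that was missing in one code path
    }
  }

  void WaitForNoPendingPurges() {
    std::unique_lock<std::mutex> lk(mu);
    cv.wait(lk, [this] { return pending_purges == 0; });
  }
};

// Run one waiter against a finisher thread; returns only once the waiter
// has been woken (it would block forever without the notify above).
bool RunPurgeScenario(int n) {
  PurgeState s;
  s.pending_purges = n;
  std::thread waiter([&s] { s.WaitForNoPendingPurges(); });
  std::thread finisher([&s, n] {
    for (int i = 0; i < n; ++i) s.FinishPurge();
  });
  finisher.join();
  waiter.join();
  return true;
}
```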

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14069

Test Plan: unit test repro

Reviewed By: hx235

Differential Revision: D85453534

Pulled By: cbi42

fbshipit-source-id: cf5cfe7f5087459ca1f1f28ce81ea6afc84178f0
Summary:
The TTL-based WAL archive cleanup logic could incorrectly delete an archived WAL if the system clock moved backwards between the last write to that WAL and `WALManager::PurgeObsoleteWALFiles()`. This happened due to unsigned underflow in the subtraction of two wall-clock-based timestamps: `now_seconds - file_m_time`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14016

Test Plan: unit test repro

Reviewed By: pdillinger

Differential Revision: D83879806

Pulled By: hx235

fbshipit-source-id: 643e7f623c6b5c31711565854314cfd6cbbcf3a7
Summary:
### Problem
The current C API implementation has inefficiencies that impact performance in production environments:

1. **Double allocations in Get operations**: Values are first copied into a `std::string`, then copied again into a malloc'd buffer
2. **Unnecessary string temporaries**: Using `std::string` as intermediate storage adds allocation/deallocation overhead
3. **No zero-copy read path**: All read operations require at least one allocation and copy
4. **Redundant operations**: CopyString performed an unnecessary `sizeof(char)` multiplication

### Solution

#### 1. Use PinnableSlice for Get Operations
- **Before**: `DB::Get() → std::string → malloc'd buffer` (2 allocations, 2 copies)
- **After**: `DB::Get() → PinnableSlice → malloc'd buffer` (1 allocation, 1 copy)
- **Impact**: 50% reduction in allocations and copies

#### 2. Optimize CopyString Helper
- Removed redundant `sizeof(char)` multiplication
- Single implementation using `Slice` parameter (works with all types via implicit conversion)
- Added `inline` for better optimization

#### 3. New Zero-Copy API Functions
Added high-performance alternatives for allocation-sensitive workloads:
- `rocksdb_get_pinned_v2` / `rocksdb_get_pinned_cf_v2` - Zero-copy read access
- `rocksdb_get_into_buffer` / `rocksdb_get_into_buffer_cf` - Copy into user-provided buffer
- `rocksdb_pinnable_handle_*` - Handle management functions

### Performance Improvements

| Operation | Allocations | Improvement |
|-----------|------------|-------------|
| `rocksdb_get` | 2 → 1 | **50% reduction** |
| `rocksdb_get_cf` | 2 → 1 | **50% reduction** |
| `rocksdb_multi_get` (per key) | 2 → 1 | **50% reduction** |
| `rocksdb_transaction_get` | 2 → 1 | **50% reduction** |
| `rocksdb_writebatch_wi_get_from_batch` | 2 → 1 | **50% reduction** |
| `rocksdb_get_pinned_v2` (new) | 0 | **100% reduction** |

### Functions Optimized (30+)
- All Get variants (regular, CF, with timestamps)
- All MultiGet variants
- All Transaction Get/MultiGet operations
- All WriteBatch Get operations
- KeyMayExist operations
- Metadata getters (column family names, SST file keys, transaction names, DB identity)

### Testing
- Added tests for new zero-copy functions
- Added tests for previously untested functions `rocksdb_column_family_handle_get_name`, `rocksdb_transaction_get_name`

### Migration Path
Applications can adopt improvements in three ways:
1. **No changes needed** - Existing code automatically benefits from 50% allocation reduction
2. **Incremental adoption** - Replace hot-path calls with zero-copy variants
3. **Full optimization** - Use `rocksdb_get_into_buffer`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/13911

Reviewed By: cbi42

Differential Revision: D83508431

Pulled By: jaykorean

fbshipit-source-id: 96146a59b0f9e839f6603b376d4e51f0e97c3a8c
Summary:
Due to some internal requirements, what's being used for `$SSH` and `$SCP` has changed, and it broke the regression test (e.g. tarball streaming to a remote host no longer works).

Minor behavior changes to the script to make the internal workflow work.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14079

Test Plan:
```
./tools/regression_test.sh
```
Meta Internal automation

Reviewed By: pdillinger

Differential Revision: D85502798

Pulled By: jaykorean

fbshipit-source-id: d294c2ee47661fbe368ccc318062e891f3ac7c81
Summary:
As titled, this fixes some internal crash test failures when UDT is enabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14085

Test Plan: monitor crash tests.

Reviewed By: anand1976

Differential Revision: D85617949

Pulled By: cbi42

fbshipit-source-id: da6fb21c0ca5803ea24e8daf7de8558321babcf4
Summary:
All remote compaction test failures had `mmap_read=1` in common. Unfortunately, the failure hasn't been very reproducible. Try disabling `mmap_read` to see if that sheds some light.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14083

Test Plan: CI

Reviewed By: hx235

Differential Revision: D85622229

Pulled By: jaykorean

fbshipit-source-id: bbe9e08efc369813f0fec388c910446089e43650
Summary:
Stress test can fail with an assertion inside MultiScan in some reseek scenarios. E.g., data block 1 ends with k@9 and data block 2 starts with k@8; when a DB iter seeks to k@0 (see option `max_sequential_skip_in_iterations`), MultiScan will land in data block 1 due to fd0b4e0cf0/table/block_based/block_based_table_iterator.cc (L1258-L1263).

We can't just use internal key as separator since index block might not use it. I plan to follow up with a fix that never moves `cur_data_block_idx` backward within a MultiScan.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14087

Test Plan: CI and internal crash tests

Reviewed By: anand1976

Differential Revision: D85701668

Pulled By: cbi42

fbshipit-source-id: d3f1aaff40a12be4e3d1b4b7160bf2547f43b849
Summary:
Comprehensive performance optimizations for the RocksDB C API that eliminate unnecessary memory allocations and copies.

## Key Changes

### 1. PinnableSlice for Get Operations (50% reduction in copies)
- Changed all `rocksdb_get*` functions to use `PinnableSlice` internally instead of `std::string`
- **Before:** RocksDB → std::string → malloc'd buffer (2 copies)
- **After:** RocksDB → malloc'd buffer (1 copy)
- Affects: Get, Transaction Get, TransactionDB Get, WriteBatch Get variants

### 2. Array-Based MultiGet with PinnableSlice (30% allocation reduction)
- Switched MultiGet operations to use optimized array-based RocksDB API with `PinnableSlice`
- Eliminates vector overhead and string allocations
- Affects: MultiGet, Transaction MultiGet, TransactionDB MultiGet variants

### New Zero-Copy APIs
Added high-performance zero-copy functions for applications that can use them:
- `rocksdb_iter_key_slice()` / `value_slice()` / `timestamp_slice()` - Return slices by value (eliminates output param overhead)
- `rocksdb_batched_multi_get_cf_slice()` - Batched get with slice array input
- `rocksdb_slice_t` - ABI-compatible slice type

Note that this PR builds on top of https://github.com/facebook/rocksdb/pull/13911

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14036

Reviewed By: pdillinger

Differential Revision: D85604919

Pulled By: jaykorean

fbshipit-source-id: 7f04b935eea79af1d45b3125a79b90e4706666f6
Summary:
Change PosixWritableFile's Truncate to reposition to the new end offset. This ensures that future appends are written with no holes or overwrites. RocksDB doesn't guarantee this in the FileSystem contract; it's left up to the specific implementation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14088

Reviewed By: cbi42

Differential Revision: D85786398

Pulled By: anand1976

fbshipit-source-id: 3520d9d6336362f5128a17bbf396297d821a5da3
Summary:
Resolving this folly upgrade required fixing the FOLLY_LITE build with a header include from the 'fmt' library.

I came close to giving up on fixing USE_FOLLY_LITE and removing it altogether - it could be considered obsolete and/or not worth the maintenance cost.

Follow-up: make the folly build caching more friendly by hashing the relevant makefile parts. Not in this PR because then you wouldn't be able to see what changed in the folly build steps themselves.

UPDATE/NOTE: I wasn't able to fully update to latest due to a failure seen in F14, using the next folly commit or later. The source of the bug is likely outside of F14 but investigation is in progress.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14071

Test Plan: CI

Reviewed By: jaykorean

Differential Revision: D85268833

Pulled By: pdillinger

fbshipit-source-id: 1d0a2d61f095524a20e6ec796ef46c02d0696f4e
Summary:
Following up on https://github.com/facebook/rocksdb/pull/14071, updating folly to
8a9fc1e80a or beyond was failing an F14Table assertion for a very subtle reason: an ODR violation between the folly build and the RocksDB build, because the folly build was release mode and the RocksDB build was debug mode. What was happening was that a folly change introduced a dependence on kDebug (whether the build is debug) in a hashing implementation in a .h file, and the inconsistency between the inlined implementation during the RocksDB build and the linked-to implementation from the folly build was leading to inconsistencies in the data structure.

The primary fix is to ensure we build folly in debug mode for debug mode RocksDB builds. Also,

* Needed to use the `patchelf` tool in `build_folly` to ensure the glog dependency shared library can always find its own gflags dependency. I explored many options for working around this, and this is what would work without reworking folly's own build.
* Updated folly to latest commit.
* Threw in an ad hoc folly patch to use ftp.gnu.org mirrors (the canonical one is super slow)
* Moved the placement of GETDEPS_USE_WGET=1 to apply to local builds also, to avoid the issue of a large download almost reaching completion and then stalling indefinitely.
* Fix failing nightly build-linux-cmake-with-folly-lite-no-test with fmt includes in cmake build (as was done with make build)
* Add a release mode folly+RocksDB to nightly CI, including both cmake and make. This also serves as a non-cached folly build to detect potential problems with PR jobs working from cached folly build.
* Move build-linux-cmake-with-folly to nightly because it's mostly covered by build-linux-cmake-with-folly-coroutines

Intended follow-up:
* folly-lite build with tests
* Make the folly build caching more friendly+accurate by hashing the relevant Makefile parts and tagging whether debug or release. Not in this PR because then you wouldn't be able to see what changed in the folly build steps themselves.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14094

Test Plan: manual + CI

Reviewed By: mszeszko-meta

Differential Revision: D85864871

Pulled By: pdillinger

fbshipit-source-id: 50009b33422d5781074fcbbdf18089be9e36800d
Summary:
... caused by public headers depending on build parameters (macro definitions). This change also adds a check under 'make check-headers' (already in CI) looking for potential future violations.

I've audited the uses of '#if' in public headers and either
* Eliminated them
* Systematically excluded them because they are intentional or similar (details in comments in check-public-header.sh)
* Manually excluded them as being ODR-SAFE

In the case of ROCKSDB_USING_THREAD_STATUS, there was no good reason for this to appear in public headers so I've replaced it with a static bool ThreadStatus::kEnabled. I considered getting rid of the ability to disable this code but some relatively recent PRs have been submitted for fixing that case. I've added a release note and updated one of the CI jobs to use this build configuration. (I didn't want to combine with some jobs like no_compression and status_checked because the interaction might limit what is checked.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14096

Test Plan: manual 'make check-headers' + manual cmake as in new CI config + CI

Reviewed By: jaykorean

Differential Revision: D86241864

Pulled By: pdillinger

fbshipit-source-id: d16addc9e3480706b174a006720a4def0740bf2e
Summary:
* Fix nightly build-linux-cmake-with-folly-lite-no-test for real this time
  with correct include directory. (CMakeLists.txt)
* Add test runs to that build (and rename)
* Improve folly build caching with a folly.mk file with most of the relevant
  parts of Makefile that contribute to the checkout_folly and
  build_folly builds. This reduces the risk of false passing of CI job with
  cache folly build. This caching is still only for folly debug builds, (which
  is probably OK with just a single nightly build relying on release folly
  build, which also serves as a rough canary against false passing
  because of caching).
* Use `make VERBOSE=1` after cmake calls for detailed output

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14099

Test Plan:
temporary CI change to put the relevant parts in pr-jobs,
then back to homes including in nightly

Reviewed By: mszeszko-meta

Differential Revision: D86243363

Pulled By: pdillinger

fbshipit-source-id: f7975fa190ef45195c6d0b74417f7886e551516a
Summary:
**Summary:**
This change introduces tail size estimation during SST construction to improve compaction file cutting accuracy and prevent oversized files. The BlockBasedTableBuilder now estimates the SST tail size (index and filter blocks) and uses this estimate, in addition to the data size, to determine when to cut files during compaction.

**Problem:**
Currently, file cutting logic only considers data size when determining where to cut a file, failing to reserve space for index and filter blocks that are added when the file is finalized. This often leads to SST files that exceed target file size limits.

**Behavior Change:**
Implement size estimation methods for index and filter builders, and integrate these estimates into BlockBasedTableBuilder via a new EstimatedTailSize() method. This method aggregates estimates from all tail components and is used for file cutting decisions during compaction.

**Performance Considerations:**
To minimize CPU overhead, size estimates are updated when data blocks are finalized rather than on every key add. For index builders, estimates are updated when index entries are added (one per data block). For filter builders, the OnDataBlockFinalized() hook triggers estimate updates when data blocks are cut/finalized.

This approach provides:
* Minimal impact to compaction hot path (key additions)
* Near real-time estimates for file cutting decisions
* Meaningful estimate changes only when data blocks are finalized

**Usage:**
* Set the mutable CF option `compaction_use_tail_size_estimation` to true
to use tail size estimation for compaction file cutting decisions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14051

Test Plan:
* Assert tail size estimate is an overestimate in BlockBasedTableBuilder::Finish
* Add new test to verify compaction output file is below target file size

**Next steps:**
* Enable tail size estimation for compaction file cutting by default (and other improvements)

Reviewed By: pdillinger, cbi42

Differential Revision: D84852285

Pulled By: nmk70

fbshipit-source-id: c43cf5dbd2cb2f623a0622591ef24eee30ce0c87
Summary:
Adds auto-tuning of manifest file size to avoid the need to scale `max_manifest_file_size` in proportion to things like number of SST files to properly balance (a) manifest file write amp and new file creation, vs. (b) manifest file space amp and replay time, including non-incremental space usage in backups. (Manifest file write amp comes from re-writing a "live" record when the manifest file is re-created, or "compacted"; space amp is usage beyond what would be used by a compacted manifest file.) In more detail,

* Add new option `max_manifest_space_amp_pct` with default value of 500, which defaults to 0.2 write amp and up to roughly 5.0 space amp, except `max_manifest_file_size` is treated as the "minimum" size before re-creating ("compacting") the manifest file.
* `max_manifest_file_size` in a way means the same thing, with the same default of 1GB, but in a way has taken on a new role. What is the same is that we do not re-create the manifest file before reaching this size (except for DB re-open), and so users are very unlikely to see a change in default behavior (auto-tuning only kicking in if auto-tuning would exceed 1GB for effective max size for the current manifest file). The new role is as a file size lower bound before auto-tuning kicks in, to minimize churn in files considered "negligibly small." We recommend a new setting of around 1MB or even smaller like 64KB, and expect something like this to become the default soon.
* These two options along with `manifest_preallocation_size` are now mutable with SetDBOptions. The effect is nearly immediate, affecting the next write to the current manifest file.

Also in this PR:
* Refactoring of VersionSet to allow it to get (more) settings from MutableDBOptions. This touches a number of files in not very interesting ways, but notably we have to be careful about thread-safe access to MutableDBOptions fields, and even fields within VersionSet. I have decided to save copies of relevant fields from MutableDBOptions to simplify testing, etc. by not saving a reference to MutableDBOptions but getting notified of updates.
* Updated some logging in VersionSet to provide some basic data about final and compacted manifest sizes (effects of auto-tuning), making sure to avoid I/O while holding DB mutex.
* Added db_etc3_test.cc which is intended as a successor to db_test and db_test2, but having "test.cc" in its name for easier exclusion of test files when using `git grep`. Intended follow-up: rename db_test2 to db_etc2_test
* Moved+updated `ManifestRollOver` test to the new file to be closer to other manifest file rollover testing.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14076

Test Plan:
As for correctness, new unit test AutoTuneManifestSize is pretty thorough. Some other unit tests updated appropriately. Manual tests in the performance section were also audited for expected behavior based on the new logging in the DB LOG. Example LOG data with -max_manifest_file_size=2048 -max_manifest_space_amp_pct=500:

```
2025/10/24-11:12:48.979472 2150678 [/version_set.cc:5927] Created manifest 5, compacted+appended from 52 to 116
2025/10/24-11:12:49.626441 2150682 [/version_set.cc:5927] Created manifest 24, compacted+appended from 2169 to 1801
2025/10/24-11:12:52.194592 2150682 [/version_set.cc:5927] Created manifest 91, compacted+appended from 10913 to 8707
2025/10/24-11:13:02.969944 2150682 [/version_set.cc:5927] Created manifest 362, compacted+appended from 52259 to 13321
2025/10/24-11:13:18.815120 2150681 [/version_set.cc:5927] Created manifest 765, compacted+appended from 80064 to 13304
2025/10/24-11:13:35.590905 2150681 [/version_set.cc:5927] Created manifest 1167, compacted+appended from 79863 to 13304
```

As you can see, it only took a few iterations of ramp-up to settle on the auto-tuned max manifest size for tracking ~122 live SST files, around 80KB and compacting down to about 13KB. (13KB * (500 + 100) / 100 = 78KB). With the default large setting for max_manifest_file_size, we end up with a 232KB manifest, which is more than 90% wasted space. (A long-running DB would be much worse.)

As for performance, we don't expect a difference, even with TransactionDB because actual writing of the manifest is done without holding the DB mutex. I was not able to see a performance regression using db_bench with FIFO compaction and >1000 ~10MB SST files, including settings of -max_manifest_file_size=2048 -max_manifest_space_amp_pct={500,10,0}. No "hiccups" visible with -histogram either.

I also tried seeding a 1 second delay in writing new manifest files (other than the first). This had no significant effect at -max_manifest_space_amp_pct=500 but at 100 started causing write stalls in my test. In many ways this is kind of a worst case scenario and out-of-proportion test, but gives me more confidence that a higher number like 500 is probably the best balance in general.

Reviewed By: xingbowang

Differential Revision: D85445178

Pulled By: pdillinger

fbshipit-source-id: 1e6e07e89c586762dd65c65bb7cb2b8b719513f9
Summary:
This test is now taking > 6 hours, timing out, and has low signal, so we are creating a weekly job for it, with an explicit timeout of 12 hours.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14110

Test Plan: watch CI

Reviewed By: virajthakur

Differential Revision: D86428262

Pulled By: pdillinger

fbshipit-source-id: 44103518064ca378f3fd2ff8d21967ede698c8ea
Summary:
MultiScanUnexpectedSeekTarget() currently uses user key comparison to decide on the next data block for multiscan. This can cause a multiscan to move backward in the following scenario:

data block 1: ..., k@7, k@6
data block 2: k@5, ...

The DB iter scans through k@7, k@6 and k@5 and decides to seek to k@0 due to option [`max_sequential_skip_in_iterations`](d56da8c112/include/rocksdb/advanced_options.h (L621-L629)). Multiscan was on data block 2, but moves to data block 1 after the seek.

This can cause an assertion failure in debug mode and a seg fault in prod, since older data blocks are unpinned and freed as we advance a multiscan. This PR fixes the issue by forcing a multiscan to never go backward.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14106

Test Plan: - added a new unit test that reproduces the scenario: `./db_iterator_test --gtest_filter="*ReseekAcrossBlocksSameUserKey*"`

Reviewed By: xingbowang

Differential Revision: D86428845

Pulled By: cbi42

fbshipit-source-id: ab623f93e73298a60857fb2ff268366f289092a0
Summary:
For all compactions, RocksDB performs a lightweight sanity check on output SST files before installation (in `CompactionJob::VerifyOutputFiles()`). However, this lightweight check may not catch corruption that is small enough to allow the SST files to still be opened.

There is an existing feature, `paranoid_file_checks`, which opens the SST file, iterates through all keys, and checks the hash of each key. While this provides the ultimate level of data integrity checking, it comes at a high computational cost.

In this PR, we introduce a new mutable CF option, `verify_output_flags`. `verify_output_flags` is a bitmask enum that allows users to select various verification types, including block checksum verification, full key iteration, and file checksum verification (to be added in subsequent PRs). Note that the existing `paranoid_file_checks` option is equivalent to a full key iteration check. Block-level checksum verification is much lighter than the full key iteration check.

Please note that the previously deprecated `verify_checksums_in_compaction` option (removed in version 5.3.0) was for verifying the checksum of **input SST files**. RocksDB continues to perform this verification for both local and remote compactions, and this behavior remains unchanged. In contrast, this PR focuses on verifying the **output SST files**.

## To follow up
- File-level Checksum verification for output SST files
- Deprecate `paranoid_file_checks` option in favor of the new option
- Add to stress test / db_bench

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14103

Test Plan:
New Unit Test added. The corruption is both detected by `paranoid_file_check` and various types of verification set by this new option, `verify_output_flags`
```
./compaction_service_test --gtest_filter="*CompactionServiceTest.CorruptedOutput*"
```

Reviewed By: pdillinger

Differential Revision: D86357924

Pulled By: jaykorean

fbshipit-source-id: a9e04798f249c7e977231e179622a0830d6675fe
Summary:
**Summary:**
Merge the BuiltinFilterBitsBuilder into FilterBitsBuilder. This enables using
CalculateSpace() for accurate filter size estimation instead of a hardcoded
bits-per-key value, which could result in incorrect estimations for different filter types.
The previous hardcoded estimate of 15 bits per key was in the filter block builders' UpdateFilterSizeEstimate().

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14111

Test Plan: - Existing filter tests pass (bloom_test, full_filter_block_test, filter_bench, db_bloom_filter_test)

Reviewed By: pdillinger

Differential Revision: D86473287

Pulled By: nmk70

fbshipit-source-id: cd4a47351e67444e944d5b1b375b3b13274dd6e3
Summary:
Support trivial move in CompactFiles API, which is not supported previously.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14112

Test Plan: Unit test

Reviewed By: cbi42

Differential Revision: D86546150

Pulled By: xingbowang

fbshipit-source-id: 08a3ae9a055f3d3d41711403b1695f44977e6ea8
Summary:
There are instances where an application might be interested in knowing the distribution of SST files for a key range in a particular level.

This implementation creates an overloaded GetColumnFamilyMetaData API where (startKey, endKey) can be passed along with level information to filter the relevant SST files, along with the key ranges for each SST file.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14009

Reviewed By: anand1976

Differential Revision: D83389707

fbshipit-source-id: 6df1dc1f9233efe9000b03cc1831b3c618cbcef3
Summary:
Right now, Java's Get() calls are handled inefficiently. Status.NotFound is turned into an exception in the JNI layer, which is caught in the same function and turned into a not-found return. This causes significant overhead in the scenario where most queries end up not found. For example, in Spark's deduplication query, this exception-creation overhead is higher than Get() itself. With the proposed change, if the return status is NotFound, we return directly rather than going through the exception path.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14095

Test Plan: Existing tests should cover all Get() cases, and they are passing.

Reviewed By: jaykorean

Differential Revision: D86797594

Pulled By: cbi42

fbshipit-source-id: 1202d24e46a2358976bb7c8ff38a2fd4783d0f99
Summary:
We are seeing Github actions failures due to a compiler error:

https://github.com/facebook/rocksdb/actions/runs/19190877461/job/54865138898

When UpdatedMutableDbOptions is called from the VersionSet constructor, manifest_file_size_ is 0, and mu is nullptr. This is expected and fine, and we never enter the block where AssertHeld is called.

All other times UpdatedMutableDbOptions is called, the mutex must be held. This PR just checks that mu is not null, to satisfy the compiler. We could alternatively intentionally crash if there is concern over a silent failure if mu is passed as nullptr.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14115

Reviewed By: pdillinger

Differential Revision: D86733318

Pulled By: virajthakur

fbshipit-source-id: ce9ed6275c9495a3ea2a12f984dbceef7b441e24
Summary:
Oversight in https://github.com/facebook/rocksdb/issues/13964. More detail:
* Applies to cache_bench and db_bench (db_stress already using it)
* Make sure those along with db_stress treat "hyper_clock_cache" as "auto_hyper_clock_cache" because this is now the blessed implementation.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14120

Test Plan: manual runs of the tools

Reviewed By: krhancoc

Differential Revision: D86913202

Pulled By: pdillinger

fbshipit-source-id: 07b425d3522103417f4b034735376b9d759af5fb
Summary:
This diff introduces the async prepare of all iterators within a MultiScan. Currently each iterator is prepared as it's needed; with this diff, we prepare all iterators during the prepare phase of the Level Iterator. This allows more time for each IO to be dispatched and serviced, increasing the odds that a block is ready when the scan seeks to it.

Benchmark is prefilled using
```
KEYSIZE=64
VALUESIZE=512
NUMKEYS=5000000
SCAN_SIZE=100
DISTANCE=25000
NUM_SCANS=15
THREADS=1

./db_bench --db=$DB \
    --benchmarks="fillseq" \
    --write_buffer_size=5242880 \
    --max_write_buffer_number=4 \
    --target_file_size_base=5242880 \
    --disable_wal=1 --key_size=$KEYSIZE \
    --value_size=$VALUESIZE --num=$NUMKEYS --threads=32
```

And benchmark ran is
```
run() {
echo 1 | sudo tee /proc/sys/vm/drop_caches
./db_bench --db=$DB --use_existing_db=1 \
    --benchmarks=multiscan \
    --disable_auto_compactions=1 --seek_nexts=$SCAN_SIZE \
    --multiscan-use-async-io=1 \
    --multiscan-size=$NUM_SCANS --multiscan-stride=$DISTANCE \
    --key_size=$KEYSIZE --value_size=$VALUESIZE \
    --num=$NUMKEYS --threads=$THREADS --duration=60 --statistics
}
```

The benchmark uses large stride sizes to ensure that two scans would touch separate files. We reduce the size of the block cache to increase the likelihood of reads (and simulate larger data sets).

**Branch:**

```
Integrated BlobDB: blob cache disabled
RocksDB:    version 10.8.0
Date:       Tue Nov 11 13:26:29 2025
CPU:        166 * AMD EPYC-Milan Processor
CPUCache:   512 KB
Keys:       64 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    5000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    2746.6 MB (estimated)
FileSize:   1525.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
multiscan_stride = 25000
multiscan_size = 15
seek_nexts = 100
DB path: [/data/rocksdb/mydb]
multiscan    :     837.941 micros/op 1193 ops/sec 60.001 seconds 71605 operations; (multscans:71605)
```

**Baseline:**

```
Set seed to 1762898809121995 because --seed was 0
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
Integrated BlobDB: blob cache disabled
RocksDB:    version 10.9.0
Date:       Tue Nov 11 14:06:49 2025
CPU:        166 * AMD EPYC-Milan Processor
CPUCache:   512 KB
Keys:       64 bytes each (+ 0 bytes user-defined timestamp)
Values:     512 bytes each (256 bytes after compression)
Entries:    5000000
Prefix:    0 bytes
Keys per prefix:    0
RawSize:    2746.6 MB (estimated)
FileSize:   1525.9 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
multiscan_stride = 25000
multiscan_size = 15
seek_nexts = 100
DB path: [/data/rocksdb/mydb]
multiscan    :    1129.916 micros/op 885 ops/sec 60.001 seconds 53102 operations; (multscans:53102)
```
Repeated for confirmation.

This introduces a ~20% improvement in latency and op/s.

Note: Benchmarks are single-threaded because, when increasing the thread count, we start seeing large amounts of overhead induced by block cache contention, eventually resulting in baseline and branch becoming equal.

Further, on network-attached storage with high latency, the level iterator preparing all iterators yields a ~20% improvement even at high thread counts.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14100

Reviewed By: anand1976

Differential Revision: D86913584

Pulled By: krhancoc

fbshipit-source-id: da9d0c890e25e392a33389ce6b80f9bfb84d3f85
Summary:
**Context/Summary:**
This PR adds multi-CF support to option migration. The original implementation sets options, opens the db, compacts files and reopens the db in almost all of the three branches below. Such a design makes expanding to multi-CF difficult, as it needs to change all these places within each branch, causing code redundancy.
```
Status OptionChangeMigration(std::string dbname, const Options& old_opts,
                             const Options& new_opts) {
  if (old_opts.compaction_style == CompactionStyle::kCompactionStyleFIFO) {
    // LSM generated by FIFO compaction can be opened by any compaction.
    return Status::OK();
  } else if (new_opts.compaction_style ==
             CompactionStyle::kCompactionStyleUniversal) {
    return MigrateToUniversal(dbname, old_opts, new_opts);
  } else if (new_opts.compaction_style ==
             CompactionStyle::kCompactionStyleLevel) {
    return MigrateToLevelBase(dbname, old_opts, new_opts);
  } else if (new_opts.compaction_style ==
             CompactionStyle::kCompactionStyleFIFO) {
    return CompactToLevel(old_opts, dbname, 0, 0 /* l0_file_size */, true);
  } else {
    return Status::NotSupported(
        "Do not how to migrate to this compaction style");
  }
}
```

Therefore this PR
- Refactors the option migration implementation by moving the common parts into the high-level `OptionChangeMigration()` through `PrepareNoCompactionCFDescriptors()` and `OpenDBWithCFs()`, so `MigrateAllCFs()` can focus on compaction only
- Treats the original `OptionChangeMigration()` API as a special case of the multi-CF version of option migration
- Adds multi-CF support

A few notes:
- `CompactToLevel()` originally modified the compaction-related options conditionally before compacting. This is moved into earlier steps through `ApplySpecialSingleLevelSettings()` in `PrepareNoCompactionCFDescriptors()`
- `MigrateToUniversal()` originally opened the db twice with essentially the same options. This PR reduces that to one open
- Option migration does not always use the old options to compact the db and reopen the db after migration; see `return CompactToLevel(new_opts, dbname, new_opts.num_levels - 1, /*l0_file_size=*/0, false);`. `PrepareNoCompactionCFDescriptors()` is where we handle those decisions.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14059

Test Plan:
- Existing UTs
- New UTs

Reviewed By: cbi42

Differential Revision: D84852970

Pulled By: hx235

fbshipit-source-id: 936b456cf9fb4c3ccb687e5d1387f2d67a1448be
Summary:
Saw a mysterious failure of assertion
`assert(rep_->props.num_data_blocks == 0)` in
DBCompressionTest/CompressionFailuresTest.CompressionFailures/45. This seems to be caused by a parallel compression failure arriving after the emit thread has started Finish() but before the Flush() at the start of Finish(). We can fix this by relaxing the assertion to allow for the !ok() case. Testing revealed more ok() assertions that needed to be relaxed/moved.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14130

Test Plan: Added a sync point to inject a failure status in the right place and extended a unit test to be sure the case is essentially covered. It would arguably be a more realistic test to force a particular thread interleaving, but I believe simple is good here.

Reviewed By: hx235

Differential Revision: D87377709

Pulled By: pdillinger

fbshipit-source-id: 4bd465673b084afcc235688503d1c2f464eed32d
Summary:
I have been using sst_dump --command=recompress for some ad hoc automation for compression engineering and these new options help with that.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14133

Test Plan: manual

Reviewed By: hx235

Differential Revision: D87453635

Pulled By: pdillinger

fbshipit-source-id: 2ae54e13a9221ec27c6637fea16623465a9163ae
Summary:
As title

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14134

Test Plan:
The following command generated the BUCK file correctly
```
python3 buckifier/buckify_rocksdb.py
```

Reviewed By: anand1976

Differential Revision: D87469877

Pulled By: jaykorean

fbshipit-source-id: 9ec330084cfe96ad9b71aa13c8eb16593256a5ac
Summary:
**Context/Summary:**
Due to double's 53-bit mantissa limitation, large uint64_t values lose precision when converted to double. Values equal to or smaller than UINT64_MAX (but greater than 2^64 - 1024) round up to 2^64, since rounding up results in less error than rounding down; the rounded result therefore exceeds UINT64_MAX. The check `std::numeric_limits<uint64_t>::max() / op1 < op2` won't catch those cases. Casting such an out-of-range double back to uint64_t causes undefined behavior. The following UBSAN error appears

```
UndefinedBehaviorSanitizer: undefined-behavior options/cf_options.cc:1087:32 in
```
before the fix but not after.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14132

Test Plan:
```
COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 CC=clang-18 CXX=clang++-18 ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j55 db_stress

python3 tools/db_crashtest.py --simple blackbox --compact_range_one_in=5 --target_file_size_base=9223372036854775807 // Half of std::numeric_limits<uint64_t>::max()
```
It fails with
```
stderr:
 options/cf_options.cc:1087:32: runtime error: 1.84467e+19 is outside the range of representable values of type 'unsigned long'
```

Reviewed By: pdillinger

Differential Revision: D87434936

Pulled By: hx235

fbshipit-source-id: 65563edf9faf732410bdba8b9e4b7fd61b958169
Summary:
**Context/Summary:**

.. because verify_output_flags contains information about the usage of paranoid_file_check, which is currently not yet compatible with resumable remote compaction

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14139

Test Plan: Existing tests

Reviewed By: jaykorean

Differential Revision: D87582635

Pulled By: hx235

fbshipit-source-id: ef21223da53a0696fa3ca9b1617c2c1ee2e19878
Summary:
Fixing internal validator failure

```
Every project specific source file must contain a doc block with an appropriate copyright header. Unrelated files must be listed as exceptions in the Copyright Headers Exceptions page in the repo dashboard.
A copyright header clearly indicates that the code is owned by Meta. Every open source file must start with a comment containing "Meta Platforms, Inc. and affiliates"
https://github.com/facebook/rocksdb/blob/main/buckifier/targets_cfg.py:
The first 16 lines of 'buckifier/targets_cfg.py' do not contain the patterns:
	(Meta Platforms, Inc. and affiliates)|(Facebook, Inc(\.|,)? and its affiliates)|([0-9]{4}-present(\.|,)? Facebook)|([0-9]{4}(\.|,)? Facebook)
```

While fixing the text to pass the linter, I took the opportunity to modify `format-diff.sh` script to add the copyright header automatically if missing in new files.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14143

Test Plan:
```
$> make format
```
**new python file**
```
build_tools/format-diff.sh
Checking format of uncommitted changes...
Checking for copyright headers in new files...
Added copyright header to build_tools/test.py
Copyright headers were added to new files.
Nothing needs to be reformatted!
```
**new header file**
```
build_tools/format-diff.sh
Checking format of uncommitted changes...
Checking for copyright headers in new files...
Added copyright header to db/db_impl/db_impl_jewoongh.h
Copyright headers were added to new files.
Nothing needs to be reformatted!
```

Reviewed By: hx235

Differential Revision: D87653124

Pulled By: jaykorean

fbshipit-source-id: 164322cfcd2c162bb3b41bb8f3bafefa3f20b695
Summary:
Fixing crash test failure caused by `prepared_iters_.size() == 0`

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14144

Test Plan:
```
python3 -u tools/db_crashtest.py --stress_cmd=./db_stress --cleanup_cmd='' --simple blackbox
```

Reviewed By: krhancoc

Differential Revision: D87656914

Pulled By: jaykorean

fbshipit-source-id: 9ef7cf4ea5d34fe9dee6219b32323e91a2ea3e5f
Summary:
Range deletion start keys are considered during compaction for cutting output files. Due to an ordering requirement (see the comment above InsertNextValidRangeTombstoneAtLevel()) between a truncated range deletion start key and a file's point keys, there was logic in f6c9c3bf1c/db/range_del_aggregator.cc (L39) that changed the value type to kTypeMaxValid. However, kTypeMaxValid is not supposed to be persisted per f6c9c3bf1c/db/dbformat.h (L75-L76). This can cause the forward compatibility issues reported in https://github.com/facebook/rocksdb/issues/14101. This PR fixes the issue by removing the logic that sets kTypeMaxValid and always skipping the truncated range deletion start key in CompactionMergingIterator.

For existing SST files, we want to avoid using this kTypeMaxValid, so this PR also introduces a new placeholder value type. This allows us to re-strengthen the relevant value type checks (IsExtendedValueType()) that were loosened for kTypeMaxValid.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14122

Test Plan:
- a unit test that persists kTypeMaxValid before this fix
- crash test with frequent range deletion: `python3 ./tools/db_crashtest.py blackbox --delrangepercent=11 --readpercent=35`
- Generate SST files with 0x1A as value type (kTypeMaxValid before this change) in file metadata. Run ldb with the strengthened check in IsExtendedValueType() to dump the MANIFEST. It failed to parse MANIFEST as expected before this PR and succeeds after this PR.
```
Error in processing file /tmp/rocksdbtest-543376/db_range_del_test_2549357_6547198162080866792/MANIFEST-000005 Corruption: VersionEdit: new-file4 entry  The file /tmp/rocksdbtest-543376/db_range_del_test_2549357_6547198162080866792/MANIFEST-000005 may be corrupted.
```

Reviewed By: pdillinger

Differential Revision: D87016541

Pulled By: cbi42

fbshipit-source-id: 9957a095db2cd9947463b403f352bd9a1fd70a76
Summary:
This change enables a custom CompressionManager / Compressor to adopt custom handling for data and index blocks. In particular, index blocks for format_version >= 4 use a distinct variant of the block format. Thus, a potentially format-aware compression algorithm such as OpenZL should be told which kind of block we are compressing. (And previously I avoided passing block type in CompressBlock for efficient handling of things like dictionaries but also avoiding checks on every CompressBlock call.)

Most of the change is in BlockBasedTableBuilder to call MaybeCloneSpecialized for both kDataBlock and for kIndexBlock. But I also needed some small tweaks/additions to the public API also:
* Require a Clone() function from Compressors, to support proper implementations of MaybeCloneSpecialized() in wrapper Compressors.
* Assert that the default implementation of CompressorWrapper::MaybeCloneSpecialized() is only used in allowable cases.
* Convenience function Compressor::CloneMaybeSpecialized()

This also fixes a serious bug/oversight in ManagedPtr (for ManagedWorkingArea) that somehow wasn't showing up before. It probably doesn't need a release note because the CompressionManager stuff is still considered experimental.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14140

Test Plan: Greatly expanded DBCompressionTest.CompressionManagerWrapper to make sure the distinction between data blocks and index blocks is properly communicated to a custom CompressionManager/Compressor. The test includes processing the expected structure of data and index blocks, to serve as a tested example for structure-aware compressors.

Reviewed By: hx235

Differential Revision: D87600019

Pulled By: pdillinger

fbshipit-source-id: 252ef78910073a0e45f2c81dd45ac87ff8a41fc6
Summary:
10.9.0 branch has been cut.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14148

Reviewed By: nmk70

Differential Revision: D87688882

Pulled By: xingbowang

fbshipit-source-id: 5fe95d3c64851b4f9490aed5d92451b38abe008d
Summary:
... from https://github.com/facebook/rocksdb/issues/14140. The assertion in the default implementation of CompressorWrapper::MaybeCloneSpecialized() could fail because this wrapper wasn't overriding it when it should. (See the NOTE on that implementation.)

Because this release already has a breaking modification to the Compressor API (adding Clone()), I took this opportunity to add 'const' to MaybeCloneSpecialized(). Also marked some compression classes as 'final' that could be marked as such.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14150

Test Plan: unit test expanded to cover this case (verified failing before). Audited the rest of our CompressorWrappers.

Reviewed By: archang19

Differential Revision: D87793987

Pulled By: pdillinger

fbshipit-source-id: 61c4469b84e4a47451a9942df09277faeeccfe63
Summary:
Support ccache in make file

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14123

Test Plan: local build

Reviewed By: cbi42

Differential Revision: D87332892

Pulled By: xingbowang

fbshipit-source-id: 2088bd19bdab1bd7070734c886200be80f1a65af
Summary:
I want to reduce the time from when we call `StopBackup` to `CreateNewBackup` returning `BackupStopped`. We already check the `stop_backup_` flag inside `CopyOrCreateFile` and `ReadFileAndComputeChecksum`, but we should add a check at the top of these methods to abort immediately. This could save some latency from the file system metadata operations, like creating the sequential file and writable file.

We also want to update the API documentation for `StopBackup` which currently does not indicate that once it is called, all subsequent requests to create backups will fail.

In a follow up PR, we should also add coverage of `StopBackup` to the crash tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14129

Test Plan:
We were missing unit test coverage for `StopBackup`. I added test cases which cancel backups at different points in time.

Once this change is rolled out to production, we can monitor the DB close latencies, which depend on first cancelling ongoing backups

Reviewed By: pdillinger

Differential Revision: D87356536

Pulled By: archang19

fbshipit-source-id: 687094a41f096f6a156be65b2cce0b5054fb26f2
Summary:
Continuing work from https://github.com/facebook/rocksdb/issues/13965. Here I'm migrating the "next with shift" kind of bit field and for that I've added an API for atomic additive transformations that can be combined into a single atomic update for multiple fields. (I implemented more features than needed, just in case they are needed someday and to demonstrate what is possible.)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14027

Test Plan: BitFields unit test updated/added, existing HCC tests

Reviewed By: xingbowang

Differential Revision: D83895094

Pulled By: pdillinger

fbshipit-source-id: e4487f34f5607b20f94b85a645ca654e6401e35d
Summary:
**Context/Summary:**

Best-efforts recovery can cause a use-after-free bug after retrying for a failed recovery attempt. The issue occurs in VersionSet::Reset():
- First recovery attempt: Opens SST files, caching BlockBasedTable objects in table_cache_
ac412b1095/db/version_edit_handler.cc (L565)
- Recovery fails: Calls Reset() which deletes the old ColumnFamilySet (and all CFDs)
ac412b1095/db/version_set.cc (L6631)
- Creates new CFDs: But reuses the same table_cache_
ac412b1095/db/version_set.cc (L5579)
- Bug: Cached BlockBasedTable objects contain now-dangling references to the previous CFD's members, such as rep_->internal_comparator or rep_->ioptions, as shown below. References instead of object copies are used for memory efficiency
```
struct BlockBasedTable::Rep {
  Rep(const ImmutableOptions& _ioptions, ..
      const InternalKeyComparator& _internal_comparato...)) {}
  ~Rep() { status.PermitUncheckedError(); }
  const ImmutableOptions& ioptions;
  ...
  const InternalKeyComparator& internal_comparator;
```
- Crash: Accessing any of the above reference in cached tables during read or compaction after recovery finishes triggers use-after-free

This PR calls table_cache_->EraseUnRefEntries() to clear tables containing the dangling references in VersionSet::Reset() before creating the new ColumnFamilySet.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14155

Test Plan:
- Add new unit test that fails before the fix under ASAN run and pass after
```
[ RUN      ] DBBasicTest.BestEffortRecoveryFailureWithTableCacheUseAfterFree
=================================================================
==1976446==ERROR: AddressSanitizer: heap-use-after-free on address 0x61e00000a8c8 at pc 0x7f6b21beae57 bp 0x7ffd65bacec0 sp 0x7ffd65baceb8
READ of size 8 at 0x61e00000a8c8 thread T0
    #0 0x7f6b21beae56 in rocksdb::UserComparatorWrapper::user_comparator() const util/user_comparator_wrapper.h:29 // rep_->ioptions
    #1 0x7f6b21beb02b in rocksdb::InternalKeyComparator::user_comparator() const db/dbformat.h:421
    #2 0x7f6b229a7a50 in rocksdb::BinarySearchIndexReader::NewIterator(rocksdb::ReadOptions const&, bool, rocksdb::IndexBlockIter*, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*) table/block_based/binary_search_index_reader.cc:62
    #3 0x7f6b22a9a649 in rocksdb::BlockBasedTable::NewIndexIterator(rocksdb::ReadOptions const&, bool, rocksdb::IndexBlockIter*, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*) const table/block_based/block_based_table_reader.cc:1683
    #4 0x7f6b22aa39be in rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform const*, bool) table/block_based/block_based_table_reader.cc:2533
    #5 0x7f6b2241201c in rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::MutableCFOptions const&, rocksdb::HistogramImpl*, bool, int, unsigned long) db/table_cache.cc:492

0x61e00000a8c8 is located 72 bytes inside of 2784-byte region [0x61e00000a880,0x61e00000b360)
freed by thread T0 here:
    #0 0x7f6b248d20d7 in operator delete(void*, unsigned long) /home/engshare/third-party2/gcc/11.x/src/gcc-11.x/libsanitizer/asan/asan_new_delete.cpp:172
    #1 0x7f6b21ca8703 in rocksdb::ColumnFamilyData::UnrefAndTryDelete() db/column_family.cc:785
    #2 0x7f6b21cb25ee in rocksdb::ColumnFamilySet::~ColumnFamilySet() db/column_family.cc:1771
    #3 0x7f6b225683df in std::default_delete<rocksdb::ColumnFamilySet>::operator()(rocksdb::ColumnFamilySet*) const (/data/users/huixiao/rocksdb/librocksdb.so.10.10+0x1f683df)
    #4 0x7f6b22568ceb in std::__uniq_ptr_impl<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset(rocksdb::ColumnFamilySet*) /mnt/gvfs/third-party2/libgcc/d1129753c8361ac8e9453c0f4291337a4507ebe6/11.x/platform010/5684a5a/include/c++/trunk/bits/unique_ptr.h:182
    #5 0x7f6b22550c52 in std::unique_ptr<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset(rocksdb::ColumnFamilySet*) /mnt/gvfs/third-party2/libgcc/d1129753c8361ac8e9453c0f4291337a4507ebe6/11.x/platform010/5684a5a/include/c++/trunk/bits/unique_ptr.h:456
    #6 0x7f6b224fa09e in rocksdb::VersionSet::Reset() db/version_set.cc:5587
    #7 0x7f6b2250752c in rocksdb::VersionSet::TryRecover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool*) db/version_set.cc:6640
    #8 0x7f6b220c5a88 in rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, bool, unsigned long*, rocksdb::DBImpl::RecoveryContext*, bool*) db/db_impl/db_impl_open.cc:565

previously allocated by thread T0 here:
    #0 0x7f6b248d1257 in operator new(unsigned long) /home/engshare/third-party2/gcc/11.x/src/gcc-11.x/libsanitizer/asan/asan_new_delete.cpp:99
    #1 0x7f6b21cb30e0 in rocksdb::ColumnFamilySet::CreateColumnFamily(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned int, rocksdb::Version*, rocksdb::ColumnFamilyOptions const&, bool) db/column_family.cc:1827
    #2 0x7f6b22516a11 in rocksdb::VersionSet::CreateColumnFamily(rocksdb::ColumnFamilyOptions const&, rocksdb::ReadOptions const&, rocksdb::VersionEdit const*, bool) db/version_set.cc:7715
    #3 0x7f6b22494910 in rocksdb::VersionEditHandler::CreateCfAndInit(rocksdb::ColumnFamilyOptions const&, rocksdb::VersionEdit const&) db/version_edit_handler.cc:494
    #4 0x7f6b2249005f in rocksdb::VersionEditHandler::Initialize() db/version_edit_handler.cc:209
    #5 0x7f6b2248cd13 in rocksdb::VersionEditHandlerBase::Iterate(rocksdb::log::Reader&, rocksdb::Status*) db/version_edit_handler.cc:32
    #6 0x7f6b225081db in rocksdb::VersionSet::TryRecoverFromOneManifest(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool*) db/version_set.cc:6679
    #7 0x7f6b225074a1 in rocksdb::VersionSet::TryRecover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, bool*) db/version_set.cc:6635
    #8 0x7f6b220c5a88 in rocksdb::DBImpl::Recover(std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, bool, bool, bool, bool, unsigned long*, rocksdb::DBImpl::RecoveryContext*, bool*) db/db_impl/db_impl_open.cc:565
```

Reviewed By: anand1976

Differential Revision: D87991593

Pulled By: hx235

fbshipit-source-id: 2379b297ff592cadf02659e355cdc8e170917cfc
Summary:
Add a new option in sst_dump command to show seq no and value type in raw mode

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14166

Test Plan:
Sample output

```
sst_dump --file=rocksdb_crashtest_blackbox/000010.sst  --command=raw --show_sequence_number_type

...

Range deletions:
--------------------------------------
  HEX    000000000000038D000000000000012B000000000000029A  seq: 3016892  type: 15 : 000000000000038D000000000000012B000000000000029E
  ASCII  \0 \0 \0 \0 \0 \0   \0 \0 \0 \0 \0 \0  + \0 \0 \0 \0 \0 \0   : \0 \0 \0 \0 \0 \0   \0 \0 \0 \0 \0 \0  + \0 \0 \0 \0 \0 \0
  ------

Data Block # 1 @ 0073
--------------------------------------
  HEX    000000000000038D000000000000012B000000000000029D  seq: 3004554  type: 0 :
  ASCII  \0 \0 \0 \0 \0 \0   \0 \0 \0 \0 \0 \0  + \0 \0 \0 \0 \0 \0   :
  ------
  HEX    000000000000038D000000000000012B000000000000029D  seq: 0  type: 1 : 03000000070605040B0A09080F0E0D0C13121110171615141B1A19181F1E1D1C
  ASCII  \0 \0 \0 \0 \0 \0   \0 \0 \0 \0 \0 \0  + \0 \0 \0 \0 \0 \0   :  \0 \0 \0
```

Reviewed By: hx235

Differential Revision: D88396223

Pulled By: xingbowang

fbshipit-source-id: b006cd7f51f941951349e4ec60ed5ef1e838919d
Summary:
Revert "Fix a bug where compaction with range deletion can persist kTypeMaxValid in file metadata (https://github.com/facebook/rocksdb/issues/14122)"

Add a new unit test to capture the situation found by stress test

This reverts commit 8c7c8b8dab.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14170

Test Plan: Unit Test

Reviewed By: anand1976

Differential Revision: D88395956

Pulled By: xingbowang

fbshipit-source-id: 226649dc79a86010ad326ffb2eae35109dc96bc4
Summary:
Fix missing const for arg of OptionChangeMigration

We switched the `OptionChangeMigration` API from `std::string` to `std::string&`, which caused the `const` qualifier to be lost at call sites and led to compilation failures.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14173

Test Plan: Unit test

Reviewed By: pdillinger

Differential Revision: D88431457

Pulled By: xingbowang

fbshipit-source-id: a705f3b80cc5ff56dab73aa6a31c940798d8df45
Summary:
To help find potential issues not showing up in ARM unit tests. I'm running it with and without TransactionDB (write-committed) for better coverage. The job expands the size of /dev/shm for adequate space on maximum performance storage, and adds swap space to reduce risk of OOM in case we fill that up.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14172

Test Plan: earlier drafts of this PR added the job to PR jobs, and the last before putting in "nightly" can be seen here: https://github.com/facebook/rocksdb/actions/runs/19945493840/job/57193797390?pr=14172

Reviewed By: archang19

Differential Revision: D88429479

Pulled By: pdillinger

fbshipit-source-id: bd4d9cda9256950c3c6c126c299a44dbbbc30c7e
Summary:
Crash tests have been failing of late with this assertion failure - db_stress: `./table/block_based/block_based_table_iterator.h:656: void rocksdb::BlockBasedTableIterator::PrepareReadAsyncCallBack(rocksdb::FSReadRequest &, void *): Assertion `async_state->status.IsAborted()' failed.` Instead of asserting, surface the failure status so we can troubleshoot.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14171

Reviewed By: xingbowang

Differential Revision: D88396654

Pulled By: anand1976

fbshipit-source-id: 8d59d7ace0c522c17b7af17c50e16af876911bad
Summary:
r? cbi42

Exposes RocksDB's remote compaction functionality through the C API, enabling C/FFI clients (Go, Rust, Python, etc.) to offload compaction work to remote workers.

## API Components
### Compaction Service

- Create service with schedule, wait, cancel, and on_installation callbacks
- Ownership transfers to the options object (auto-destroyed, no manual cleanup)

### Job Info (13 getters)

DB/CF metadata and compaction details (priority, reason, levels, flags)

### Schedule Response

- Create with job ID and status (validated with errptr)
- Status: success, failure, aborted, use_local

### OpenAndCompact (for remote workers)

- Execute compaction on a worker node with environment/comparator overrides
- Cancellation support via atomic flags

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14136

Reviewed By: hx235

Differential Revision: D88316558

Pulled By: jaykorean

fbshipit-source-id: 60a0fee69ff1e650dd785d96ec656649263214f8
Summary:
**Context/Summary:**
Stress test flags printed by db_crashtest.py, like `./db_stress ... --secondary_cache_uri=compressed_secondary_cache://capacity=8388608;enable_custom_split_merge=true --otherflags=xxxx`, are not copy-paste-run friendly. Directly running such a command causes parsing hiccups due to special characters like `//` or `;`. This PR makes db_crashtest.py print single-quoted values, so that copy-paste-run works at least for Unix-like shells (the most common case).

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14180

Test Plan:
`python3 tools/db_crashtest.py --simple blackbox ...` display the following

Before fix, no single-quoted
```
Use random seed for iteration 9698536012932546857
Running db_stress with pid=1280640:./db_stress --secondary_cache_uri=compressed_secondary_cache://capacity=8388608;enable_custom_split_merge=true  ...

// Directly copy, paste and run the ./db_stress command will encounter
Error: Read(-readpercent=0)+Prefix(-prefixpercent=0)+Write(-writepercent=45)+Delete(-delpercent=0)+DeleteRange(-delrangepercent=30)+Iterate(-iterpercent=40)+CustomOps(-customopspercent=0) percents != 100!
bash: --set_options_one_in=0: command not found
```
After fix, has single-quoted
```
Use random seed for iteration 6017815530972723112
Running db_stress with pid=1234632: ./db_stress --secondary_cache_uri='compressed_secondary_cache://capacity=8388608;enable_custom_split_merge=true' ....

// Directly copy, paste and run the ./db_stress command is fine
```

Reviewed By: archang19

Differential Revision: D88688584

Pulled By: hx235

fbshipit-source-id: 88b8b2de7c2c5619b6e19900f4144dcd8e032f7b
Summary:
**Context/Summary:**

Truncated range deletions in input files can be output by CompactionIterator with type kMaxValid instead of kTypeRangeDeletion, to satisfy the ordering requirement between a truncated range deletion start key and a file's point keys. There was a plan to skip such keys in https://github.com/facebook/rocksdb/pull/14122, but blockers remain to fulfill it.

Resumable compaction is not yet able to handle resumption from a range deletion well, and should treat the kMaxValid type the same as kTypeRangeDeletion for resumption purposes. Previously it didn't, and mistakenly allowed resumption from a delete range. That led to an assertion failure complaining about missing information needed to update file boundaries in the presence of range deletions while cutting an output file, after the compaction resumed from that delete range and happened to cut the output file shortly afterward without any point keys in between.

```
frame https://github.com/facebook/rocksdb/issues/9: 0x00007f4f4743bc93 libc.so.6`__GI___assert_fail(assertion="meta.smallest.size() > 0", file="db/compaction/compaction_outputs.cc", line=530, function="rocksdb::Status rocksdb::CompactionOutputs::AddRangeDels(rocksdb::CompactionRangeDelAggregator&, const rocksdb::Slice*, const rocksdb::Slice*, rocksdb::CompactionIterationStats&, bool, const rocksdb::InternalKeyComparator&, rocksdb::SequenceNumber, std::pair<long unsigned int, long unsigned int>, const rocksdb::Slice&, const string&)") at assert.c:101:3
frame https://github.com/facebook/rocksdb/issues/10: 0x00007f4f4808c68c librocksdb.so.10.9`rocksdb::CompactionOutputs::AddRangeDels(this=0x00007f4f0c27e1a0, range_del_agg=0x00007f4f0c21ecc0, comp_start_user_key=0x0000000000000000, comp_end_user_key=0x0000000000000000, range_del_out_stats=0x00007f4f0dffa140, bottommost_level=false, icmp=0x00007f4ef4c93040, earliest_snapshot=13108729, keep_seqno_range=<unavailable>, next_table_min_key=0x00007f4ef4c8f540, full_history_ts_low="") at compaction_outputs.cc:530:7
frame https://github.com/facebook/rocksdb/issues/11: 0x00007f4f480480dd librocksdb.so.10.9`rocksdb::CompactionJob::FinishCompactionOutputFile(this=0x00007f4f0dffb890, input_status=<unavailable>, prev_table_last_internal_key=0x00007f4f0dffa650, next_table_min_key=0x00007f4ef4c8f540, comp_start_user_key=0x0000000000000000, comp_end_user_key=0x0000000000000000, c_iter=0x00007f4ef4c8f400, sub_compact=0x00007f4f0c27e000, outputs=0x00007f4f0c27e1a0) at compaction_job.cc:1917:31
```

This PR simply prevents kMaxValid from being a resumption point, just like a regular range deletion - see commit 842d66eb18ea67e965d6acb1fce12c18eeb778d2

Besides that, the PR also improves the testing, variable naming, and logging in the resumable compaction code, which were needed to debug this assertion failure - see commit https://github.com/facebook/rocksdb/pull/14184/commits/aecd4e7f971f6dd4df672d9e5f1409fe4747c561. These improvements are covered by existing tests.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14184

Test Plan:
- The stress test initially surfaced the error. Using the exact same LSM shapes and files from the stress test in a unit test, I was able to get a deterministic repro and confirmed the fix resolves the error. This is the repro test 1075936e69
```
./compaction_service_test --gtest_filter=ResumableCompactionServiceTest.CompactSpecificFilesFromExistingDBWithCancelAndResume
# Before fix
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ResumableCompactionServiceTest
[ RUN      ] ResumableCompactionServiceTest.CompactSpecificFilesFromExistingDBWithCancelAndResume
compaction_service_test: db/compaction/compaction_outputs.cc:530: rocksdb::Status rocksdb::CompactionOutputs::AddRangeDels(rocksdb::CompactionRangeDelAggregator&, const rocksdb::Slice*, const rocksdb::Slice*, rocksdb::CompactionIterationStats&, bool, const rocksdb::InternalKeyComparator&, rocksdb::SequenceNumber, std::pair<long unsigned int, long unsigned int>, const rocksdb::Slice&, const string&): Assertion `meta.smallest.size() > 0' failed.
Received signal 6 (Aborted)
Invoking GDB for stack trace...
[New LWP 2621610]
[New LWP 2621611]
[New LWP 2621612]
[New LWP 2621613]
[New LWP 2621614]
[New LWP 2621630]
[New LWP 2621631]

# After fix
Note: Google Test filter = ResumableCompactionServiceTest.CompactSpecificFilesFromExistingDBWithCancelAndResume
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ResumableCompactionServiceTest
[ RUN      ] ResumableCompactionServiceTest.CompactSpecificFilesFromExistingDBWithCancelAndResume
[       OK ] ResumableCompactionServiceTest.CompactSpecificFilesFromExistingDBWithCancelAndResume (4722 ms)
[----------] 1 test from ResumableCompactionServiceTest (4722 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (4722 ms total)
[  PASSED  ] 1 test.

```
- Follow-up: I tried a couple of times to coerce the truncated range deletion from scratch in the unit test but failed to do so. Considering that kMaxValid may no longer be output by the compaction iterator once https://github.com/facebook/rocksdb/pull/14122/files gets landed again (which would make the bug obsolete), AND the simple nature of this fix 842d66eb18ea67e965d6acb1fce12c18eeb778d2, AND that the worst case of the fix going wrong is just less resumption, I decided to leave writing a unit test that coerces a truncated range deletion from scratch as a follow-up. Maybe I will draw inspiration from https://github.com/facebook/rocksdb/pull/14122/files.

Reviewed By: jaykorean

Differential Revision: D88912663

Pulled By: hx235

fbshipit-source-id: 80a01135684c8fea659650faaa00c2dc452c482a
Summary:
`PosixRandomAccessFile::MultiRead` was introduced in Dec 2019 in https://github.com/facebook/rocksdb/pull/5881. Two years later, we introduced the `PosixRandomAccessFile::ReadAsync` API in https://github.com/facebook/rocksdb/pull/9578, which reused the same `PosixFileSystem` IO ring as the `MultiRead` API, consequently writing to the very same ring's submission queue (without waiting!). This 'shared ring' design is problematic: sequentially interleaving `ReadAsync` and `MultiRead` API calls on the very same thread might result in `MultiRead` reading 'unknown' events, leading to `Bad cqe data` errors (and therefore falsely perceived corruption). For some services (running on local flash), this in itself is a hard blocker for adopting RocksDB async prefetching ('async IO'), which heavily relies on the `ReadAsync` API. This change solves the problem by maintaining separate thread-local IO rings for async reads and multi reads, assuring correct execution. In addition, we add more robust error handling in the form of retries on kernel interrupts and draining the queue when the process is under memory pressure. Separately, we enhance performance by explicitly marking each ring as written to / read from by a single thread (`IORING_SETUP_SINGLE_ISSUER` [if available]) and deferring task work until just before the application intends to process completions (`IORING_SETUP_DEFER_TASKRUN` [if available]). See https://man7.org/linux/man-pages/man2/io_uring_setup.2.html for reference.

## Benchmark

**TLDR**
There's no evident advantage to using `io_uring_submit` (relative to the proposed `io_uring_submit_and_wait`) across batches of size 10, 250 and 1000, which simulate batch sizes significantly below, close to, and 4x above `kIoUringDepth`. `io_uring_submit` might be more appealing if at least one of the IOs is slow (which was NOT the case during the benchmark). More notably, with this PR, switching from `io_uring_submit_and_wait` to `io_uring_submit` can be done with a single-line change thanks to the implemented guardrails (we can follow up with an optional config for true ring semantics [if needed]).

**Compilation**
```
DEBUG_LEVEL=0 make db_bench
```

**Create DB**

```
./db_bench \
    --db=/db/testdb_2.5m_k100_v6144_16kB_LZ4 \
    --benchmarks=fillseq \
    --num=2500000 \
    --key_size=100 \
    --value_size=6144 \
    --compression_type=LZ4 \
    --block_size=16384 \
    --seed=1723056275
```

**LSM**

* L0: 2 files, L1: 5, L2: 49, L3: 79
* Each file is roughly ~35M in size

### MultiReadRandom (with caching disabled)

Each run was preceded by OS page cache cleanup with `echo 1 | sudo tee /proc/sys/vm/drop_caches`.

```
./db_bench \
    --use_existing_db=true \
    --db=/db/testdb_2.5m_k100_v6144_16kB_LZ4 \
    --compression_type=LZ4 \
    --benchmarks=multireadrandom \
    --num=<N> \
    --batch_size=<B> \
    --io_uring_enabled=true \
    --async_io=false \
    --optimize_multiget_for_io=false \
    --threads=4 \
    --cache_size=0 \
    --use_direct_reads=true \
    --use_direct_io_for_flush_and_compaction=true \
    --cache_index_and_filter_blocks=false \
    --pin_l0_filter_and_index_blocks_in_cache=false \
    --pin_top_level_index_and_filter=false \
    --prepopulate_block_cache=0 \
    --row_cache_size=0 \
    --use_blob_cache=false \
    --use_compressed_secondary_cache=false
```

| | B=10; N=100,000 | B=250; N=80,000 | B=1,000; N=20,000 |
| -- | -- | -- | -- |
| baseline | 31.5 (± 0.4) us/op | 17.5 (± 0.5) us/op | 13.5 (± 0.4) us/op |
| io_uring_submit_and_wait | 31.5 (± 0.6) us/op | 17.7 (± 0.4) us/op | 13.6 (± 0.4) us/op |
| io_uring_submit | 31.5 (± 0.6) us/op | 17.5 (± 0.5) us/op | 13.4 (± 0.45) us/op |

### Specs

| Property | Value |
| -- | -- |
| RocksDB | version 10.9.0 |
| Date | Tue Dec 9 15:57:03 2025 |
| CPU | 56 * Intel Sapphire Rapids (T10 SPR) |
| Kernel version | 6.9.0-0_fbk12_0_g28f2d09ad102 |

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14158

Reviewed By: anand1976

Differential Revision: D88172809

Pulled By: mszeszko-meta

fbshipit-source-id: 5198de3d2f18f76fee661a2ec5f447e79ba06fbd
Summary:
This combination is causing failures and is not yet supported. A note is also added in db.h about the combination being unsupported.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14189

Test Plan: started up blackbox_crash_test_with_ts many times and checked the command line to be confident it's excluded.

Reviewed By: hx235

Differential Revision: D89297971

Pulled By: pdillinger

fbshipit-source-id: c5134351d9ecb37879c7e3319c17dd9228d7f12a
Summary:
When building a Release build on Windows, RTTI is not available, so asserts that use `dynamic_cast` need to be disabled.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14280

Reviewed By: nmk70

Differential Revision: D91807791

Pulled By: mszeszko-meta

fbshipit-source-id: e29c19c757bcd076a1f09ed40b306bb50ba9e882
10.10.1 Patch Release

Checkout

From your project repository, check out a new branch and test the changes.
git fetch -u main-c10y:gamesguru-main-c10y
git switch gamesguru-main-c10y

Merge

Merge the changes and update on Forgejo.

Warning: The "Autodetect manual merge" setting is not enabled for this repository, you will have to mark this pull request as manually merged afterwards.

git switch 10.5.fb
git merge --no-ff gamesguru-main-c10y
git push origin 10.5.fb