rocksdb/env
Maciej Szeszko 5a06787a26 IO uring improvements (#14158)
Summary:
`PosixRandomAccessFile::MultiRead` was introduced in Dec 2019 in https://github.com/facebook/rocksdb/pull/5881. About two years later, we introduced the `PosixRandomAccessFile::ReadAsync` API in https://github.com/facebook/rocksdb/pull/9578. It reused the same `PosixFileSystem` IO ring as the `MultiRead` API and therefore wrote to the very same ring's submission queue (without waiting!). This 'shared ring' design is problematic: sequentially interleaving `ReadAsync` and `MultiRead` API calls on the same thread can cause `MultiRead` to read 'unknown' events, leading to `Bad cqe data` errors that are falsely perceived as corruption. For some services (running on local flash), this in itself is a hard blocker for adopting RocksDB async prefetching ('async IO'), which relies heavily on the `ReadAsync` API.

This change aims to solve the problem by maintaining separate thread-local IO rings for `async reads` and `multi reads`, ensuring correct execution. In addition, we're adding more robust error handling: retries on kernel interrupts, and draining the queue when the process is experiencing tight memory conditions.

Separately, we're improving performance by explicitly marking the rings as written to / read from by a single thread (`IORING_SETUP_SINGLE_ISSUER` [if available]) and by deferring task work until just before the application intends to process completions (`IORING_SETUP_DEFER_TASKRUN` [if available]). See https://man7.org/linux/man-pages/man2/io_uring_setup.2.html for reference.
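
To make the 'shared ring' hazard concrete, here is a minimal toy model, not the code from this PR. `ToyRing`, `ThreadRings`, `LocalRings` and the scenario functions are hypothetical names introduced only for illustration; the toy ring completes every submission immediately and does not use liburing. It assumes that `MultiRead` reaps exactly as many completions as it submitted, and shows how a leftover `ReadAsync` completion on a shared ring ends up in that batch, while per-purpose thread-local rings keep the two streams apart.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Which API submitted a request (stands in for the user data on a CQE).
enum class Op { kMultiRead, kReadAsync };

struct ToyCqe {
  Op op;
  uint64_t id;
};

// Toy stand-in for an io_uring instance: every submission "completes"
// immediately and is appended to the completion queue in order.
class ToyRing {
 public:
  void Submit(Op op, uint64_t id) { cq_.push_back({op, id}); }

  bool Reap(ToyCqe* out) {
    if (cq_.empty()) return false;
    *out = cq_.front();
    cq_.pop_front();
    return true;
  }

 private:
  std::deque<ToyCqe> cq_;
};

// One ring per purpose, per thread (hypothetical layout).
struct ThreadRings {
  ToyRing multi_read;
  ToyRing async_read;
};

inline ThreadRings& LocalRings() {
  thread_local ThreadRings rings;
  return rings;
}

// Submits n MultiRead requests, reaps n completions, and counts how many of
// the reaped completions were not submitted by MultiRead.
inline int MultiReadForeignCompletions(ToyRing& ring, int n) {
  for (int i = 0; i < n; ++i) ring.Submit(Op::kMultiRead, i);
  int foreign = 0;
  ToyCqe cqe;
  for (int reaped = 0; reaped < n && ring.Reap(&cqe); ++reaped) {
    if (cqe.op != Op::kMultiRead) ++foreign;
  }
  return foreign;
}

// ReadAsync submits on the shared ring and returns without waiting; the
// following MultiRead then reaps the stale ReadAsync completion.
inline int SharedRingScenario() {
  ToyRing shared;
  shared.Submit(Op::kReadAsync, 100);
  return MultiReadForeignCompletions(shared, 3);
}

// Same call sequence, but each API uses its own thread-local ring.
inline int SeparateRingScenario() {
  ThreadRings& rings = LocalRings();
  rings.async_read.Submit(Op::kReadAsync, 100);
  return MultiReadForeignCompletions(rings.multi_read, 3);
}
```

In this model `SharedRingScenario()` reports one foreign completion and `SeparateRingScenario()` reports none. The real rings complete asynchronously, so the actual interleaving depends on timing; the sketch only captures the ownership problem.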

## Benchmark

**TLDR**
There's no evident advantage to using `io_uring_submit` (relative to the proposed `io_uring_submit_and_wait`) across batches of size 10, 250 and 1000, which simulate batch sizes significantly below, close to, and 4x above `kIoUringDepth`. `io_uring_submit` might be more appealing if at least one of the IOs is slow (which was NOT the case during the benchmark). More notably, with this PR, switching from `io_uring_submit_and_wait` to `io_uring_submit` can be done with a single-line change thanks to the implemented guardrails (we can follow up by adding an optional config for true ring semantics [if needed]).

**Compilation**
```
DEBUG_LEVEL=0 make db_bench
```

**Create DB**

```
./db_bench \
    --db=/db/testdb_2.5m_k100_v6144_16kB_LZ4 \
    --benchmarks=fillseq \
    --num=2500000 \
    --key_size=100 \
    --value_size=6144 \
    --compression_type=LZ4 \
    --block_size=16384 \
    --seed=1723056275
```

**LSM**

* L0: 2 files, L1: 5, L2: 49, L3: 79
* Each file is roughly 35M in size

### MultiReadRandom (with caching disabled)

Each run was preceded by OS page cache cleanup with `echo 1 | sudo tee /proc/sys/vm/drop_caches`.

```
./db_bench \
    --use_existing_db=true \
    --db=/db/testdb_2.5m_k100_v6144_16kB_LZ4 \
    --compression_type=LZ4 \
    --benchmarks=multireadrandom \
    --num=<N> \
    --batch_size=<B> \
    --io_uring_enabled=true \
    --async_io=false \
    --optimize_multiget_for_io=false \
    --threads=4 \
    --cache_size=0 \
    --use_direct_reads=true \
    --use_direct_io_for_flush_and_compaction=true \
    --cache_index_and_filter_blocks=false \
    --pin_l0_filter_and_index_blocks_in_cache=false \
    --pin_top_level_index_and_filter=false \
    --prepopulate_block_cache=0 \
    --row_cache_size=0 \
    --use_blob_cache=false \
    --use_compressed_secondary_cache=false
```

Variant | B = 10; N = 100,000 | B = 250; N = 80,000 | B = 1,000; N = 20,000
-- | -- | -- | --
baseline | 31.5 (± 0.4) us/op | 17.5 (± 0.5) us/op | 13.5 (± 0.4) us/op
io_uring_submit_and_wait |  31.5 (± 0.6) us/op |  17.7 (± 0.4) us/op |  13.6 (± 0.4) us/op
io_uring_submit | 31.5 (± 0.6) us/op | 17.5 (± 0.5) us/op | 13.4 (± 0.45) us/op
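
As a quick sanity check on the "no evident advantage" reading, the sketch below tests whether each variant's interval overlaps the baseline's, using the numbers from the table above. It assumes the `±` values can be treated as symmetric spreads around the mean (the source does not say whether they are standard deviations or confidence intervals); `Measurement`, `WithinNoise` and `AllWithinNoiseOfBaseline` are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

struct Measurement {
  double mean_us;    // mean latency, us/op
  double spread_us;  // the "±" value, treated as a symmetric spread
};

// True when [mean - spread, mean + spread] of a and b overlap.
inline bool WithinNoise(const Measurement& a, const Measurement& b) {
  return std::fabs(a.mean_us - b.mean_us) <= a.spread_us + b.spread_us;
}

// Columns: B=10, B=250, B=1,000 (values copied from the table).
constexpr Measurement kBaseline[] = {{31.5, 0.4}, {17.5, 0.5}, {13.5, 0.4}};
constexpr Measurement kSubmitAndWait[] = {{31.5, 0.6}, {17.7, 0.4}, {13.6, 0.4}};
constexpr Measurement kSubmit[] = {{31.5, 0.6}, {17.5, 0.5}, {13.4, 0.45}};

// Checks every batch size of a variant against the baseline.
inline bool AllWithinNoiseOfBaseline(const Measurement (&variant)[3]) {
  for (std::size_t i = 0; i < 3; ++i) {
    if (!WithinNoise(kBaseline[i], variant[i])) return false;
  }
  return true;
}
```

Under that assumption, both `AllWithinNoiseOfBaseline(kSubmitAndWait)` and `AllWithinNoiseOfBaseline(kSubmit)` hold, and the two variants also overlap each other at every batch size.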

### Specs

Property | Value
-- | --
RocksDB | version 10.9.0
Date | Tue Dec 9 15:57:03 2025
CPU | 56 * Intel Sapphire Rapids (T10 SPR)
Kernel version | 6.9.0-0_fbk12_0_g28f2d09ad102

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14158

Reviewed By: anand1976

Differential Revision: D88172809

Pulled By: mszeszko-meta

fbshipit-source-id: 5198de3d2f18f76fee661a2ec5f447e79ba06fbd
2025-12-12 14:25:40 -08:00
composite_env.cc Support GetFileSize API in FSRandomAccessFile (#13676) 2025-07-09 10:40:28 -07:00
composite_env_wrapper.h Revert "Create a new API FileSystem::SyncFile for file sync (#13762)" (#13987) 2025-09-22 15:30:24 -07:00
emulated_clock.h Remove 'virtual' when implied by 'override' (#12319) 2024-01-31 13:14:42 -08:00
env.cc Revert "Create a new API FileSystem::SyncFile for file sync (#13762)" (#13987) 2025-09-22 15:30:24 -07:00
env_basic_test.cc internal_repo_rocksdb (4372117296613874540) (#12117) 2023-12-04 11:17:32 -08:00
env_chroot.cc internal_repo_rocksdb (4372117296613874540) (#12117) 2023-12-04 11:17:32 -08:00
env_chroot.h Remove RocksDB LITE (#11147) 2023-01-27 13:14:19 -08:00
env_encryption.cc Revert "Create a new API FileSystem::SyncFile for file sync (#13762)" (#13987) 2025-09-22 15:30:24 -07:00
env_encryption_ctr.h Standardize on clang-format version 18 (#13233) 2024-12-19 10:58:40 -08:00
env_posix.cc Port codemod changes from fbcode/rocksdb (#13714) 2025-06-20 17:56:24 -07:00
env_test.cc IO uring improvements (#14158) 2025-12-12 14:25:40 -08:00
file_system.cc Revert "Create a new API FileSystem::SyncFile for file sync (#13762)" (#13987) 2025-09-22 15:30:24 -07:00
file_system_tracer.cc Use C++20 in public API, fix CI (#13915) 2025-09-08 13:11:28 -07:00
file_system_tracer.h Change ReadAsync callback API to remove const from FSReadRequest (#11649) 2024-02-16 09:14:55 -08:00
fs_on_demand.cc Fix compile error in Clang (#12588) 2024-05-02 16:54:21 -07:00
fs_on_demand.h Basic RocksDB follower implementation (#12540) 2024-04-19 19:13:31 -07:00
fs_posix.cc IO uring improvements (#14158) 2025-12-12 14:25:40 -08:00
fs_readonly.h Standardize on clang-format version 18 (#13233) 2024-12-19 10:58:40 -08:00
fs_remap.cc Standardize on clang-format version 18 (#13233) 2024-12-19 10:58:40 -08:00
fs_remap.h Standardize on clang-format version 18 (#13233) 2024-12-19 10:58:40 -08:00
io_posix.cc IO uring improvements (#14158) 2025-12-12 14:25:40 -08:00
io_posix.h IO uring improvements (#14158) 2025-12-12 14:25:40 -08:00
io_posix_test.cc Change PosixWritableFile Truncate to reseek to new end of file (#14088) 2025-10-29 12:58:03 -07:00
mock_env.cc Revert "Create a new API FileSystem::SyncFile for file sync (#13762)" (#13987) 2025-09-22 15:30:24 -07:00
mock_env.h Revert "Create a new API FileSystem::SyncFile for file sync (#13762)" (#13987) 2025-09-22 15:30:24 -07:00
mock_env_test.cc internal_repo_rocksdb (4372117296613874540) (#12117) 2023-12-04 11:17:32 -08:00
unique_id_gen.cc Fix windows build errors (rdtsc and fnptr) (#12008) 2023-10-24 16:20:37 -07:00
unique_id_gen.h Internal API for generating semi-random salt (#11331) 2023-06-21 11:32:49 -07:00