forked from continuwuation/rocksdb
Summary: `PosixRandomAccessFile::MultiRead` was introduced in Dec 2019 in https://github.com/facebook/rocksdb/pull/5881. Two years later, we introduced the `PosixRandomAccessFile::ReadAsync` API in https://github.com/facebook/rocksdb/pull/9578, which reused the same `PosixFileSystem` IO ring as the `MultiRead` API and consequently wrote to the very same ring's submission queue (without waiting!).

This 'shared ring' design is problematic: sequentially interleaving `ReadAsync` and `MultiRead` API calls on the very same thread might result in `MultiRead` reading 'unknown' events, leading to `Bad cqe data` errors that are falsely perceived as corruption. For some services (running on local flash), this in itself is a hard blocker for adopting RocksDB async prefetching ('async IO'), which heavily relies on the `ReadAsync` API.

This change aims to solve the problem by maintaining separate thread-local IO rings for `async reads` and `multi reads`, assuring correct execution. In addition, we're adding more robust error handling in the form of retries on kernel interrupts and draining the queue when the process is under memory pressure. Separately, we're improving performance by explicitly marking the rings as written to / read from by a single thread (`IORING_SETUP_SINGLE_ISSUER` [if available]) and deferring task work until just before the application intends to process completions (`IORING_SETUP_DEFER_TASKRUN` [if available]). See https://man7.org/linux/man-pages/man2/io_uring_setup.2.html for reference.

## Benchmark

**TLDR** There's no evident advantage of using `io_uring_submit` (relative to the proposed `io_uring_submit_and_wait`) across batch sizes of 10, 250 and 1000, which simulate batches significantly below, close to and 4x above `kIoUringDepth`. `io_uring_submit` might be more appealing if (at least) one of the IOs is slow (which was NOT the case during the benchmark). More notably, with this PR, switching from `io_uring_submit_and_wait` to `io_uring_submit` can be done with a single-line change thanks to the implemented guardrails (we can follow up by adding an optional config for true ring semantics [if needed]).

**Compilation**

```
DEBUG_LEVEL=0 make db_bench
```

**Create DB**

```
./db_bench \
  --db=/db/testdb_2.5m_k100_v6144_16kB_LZ4 \
  --benchmarks=fillseq \
  --num=2500000 \
  --key_size=100 \
  --value_size=6144 \
  --compression_type=LZ4 \
  --block_size=16384 \
  --seed=1723056275
```

**LSM**

* L0: 2 files, L1: 5, L2: 49, L3: 79
* Each file is roughly 35M in size

### MultiReadRandom (with caching disabled)

Each run was preceded by OS page cache cleanup with `echo 1 | sudo tee /proc/sys/vm/drop_caches`.

```
./db_bench \
  --use_existing_db=true \
  --db=/db/testdb_2.5m_k100_v6144_16kB_LZ4 \
  --compression_type=LZ4 \
  --benchmarks=multireadrandom \
  --num=<N> \
  --batch_size=<B> \
  --io_uring_enabled=true \
  --async_io=false \
  --optimize_multiget_for_io=false \
  --threads=4 \
  --cache_size=0 \
  --use_direct_reads=true \
  --use_direct_io_for_flush_and_compaction=true \
  --cache_index_and_filter_blocks=false \
  --pin_l0_filter_and_index_blocks_in_cache=false \
  --pin_top_level_index_and_filter=false \
  --prepopulate_block_cache=0 \
  --row_cache_size=0 \
  --use_blob_cache=false \
  --use_compressed_secondary_cache=false
```

| | B=10; N=100,000 | B=250; N=80,000 | B=1,000; N=20,000 |
| -- | -- | -- | -- |
| baseline | 31.5 (± 0.4) us/op | 17.5 (± 0.5) us/op | 13.5 (± 0.4) us/op |
| io_uring_submit_and_wait | 31.5 (± 0.6) us/op | 17.7 (± 0.4) us/op | 13.6 (± 0.4) us/op |
| io_uring_submit | 31.5 (± 0.6) us/op | 17.5 (± 0.5) us/op | 13.4 (± 0.45) us/op |

### Specs

| Property | Value |
| -- | -- |
| RocksDB | version 10.9.0 |
| Date | Tue Dec 9 15:57:03 2025 |
| CPU | 56 * Intel Sapphire Rapids (T10 SPR) |
| Kernel version | 6.9.0-0_fbk12_0_g28f2d09ad102 |

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14158

Reviewed By: anand1976

Differential Revision: D88172809

Pulled By: mszeszko-meta

fbshipit-source-id: 5198de3d2f18f76fee661a2ec5f447e79ba06fbd
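
The shared-ring hazard described in the commit summary can be illustrated with a toy model. This is a minimal sketch and not RocksDB or liburing code: `ToyRing`, `ToyReadAsync`, `ToyMultiRead` and the two scenario functions are hypothetical names invented for illustration, and the model assumes every submission completes immediately and in order. It only shows why a completion left behind by a fire-and-forget read can surface in a later batched read on the same ring, and why giving each API its own thread-local ring avoids that.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <set>
#include <vector>

// Toy stand-in for an io_uring instance (hypothetical, NOT liburing):
// each submission completes instantly and is appended to the completion queue.
struct ToyRing {
  std::deque<std::uint64_t> cq;  // completion queue holding user_data tags
  void Submit(std::uint64_t user_data) { cq.push_back(user_data); }
};

// ReadAsync-style call: submit and return without reaping the completion.
void ToyReadAsync(ToyRing& ring, std::uint64_t tag) { ring.Submit(tag); }

// MultiRead-style call: submit a batch, then reap exactly that many
// completions, expecting each one to belong to this batch.
// Returns false (the analogue of "Bad cqe data") on an unknown completion.
bool ToyMultiRead(ToyRing& ring, const std::vector<std::uint64_t>& tags) {
  std::set<std::uint64_t> pending(tags.begin(), tags.end());
  for (std::uint64_t t : tags) ring.Submit(t);
  for (std::size_t i = 0; i < tags.size(); ++i) {
    std::uint64_t done = ring.cq.front();
    ring.cq.pop_front();
    if (pending.erase(done) == 0) return false;  // completion from someone else
  }
  return true;
}

// Shared ring: the un-reaped async completion leaks into the batched read.
bool SharedRingScenario() {
  ToyRing shared;
  ToyReadAsync(shared, 100);
  return ToyMultiRead(shared, {1, 2, 3});
}

// Separate thread-local rings: the batched read only sees its own completions.
bool SeparateRingsScenario() {
  thread_local ToyRing async_ring;
  thread_local ToyRing multi_ring;
  ToyReadAsync(async_ring, 100);
  return ToyMultiRead(multi_ring, {1, 2, 3});
}
```

In this model `SharedRingScenario()` fails because the first completion reaped by the batched read carries the async tag, while `SeparateRingsScenario()` succeeds. A real ring completes IOs asynchronously and possibly out of order, so the toy captures only the ownership problem, not timing.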
||
|---|---|---|
| composite_env.cc | ||
| composite_env_wrapper.h | ||
| emulated_clock.h | ||
| env.cc | ||
| env_basic_test.cc | ||
| env_chroot.cc | ||
| env_chroot.h | ||
| env_encryption.cc | ||
| env_encryption_ctr.h | ||
| env_posix.cc | ||
| env_test.cc | ||
| file_system.cc | ||
| file_system_tracer.cc | ||
| file_system_tracer.h | ||
| fs_on_demand.cc | ||
| fs_on_demand.h | ||
| fs_posix.cc | ||
| fs_readonly.h | ||
| fs_remap.cc | ||
| fs_remap.h | ||
| io_posix.cc | ||
| io_posix.h | ||
| io_posix_test.cc | ||
| mock_env.cc | ||
| mock_env.h | ||
| mock_env_test.cc | ||
| unique_id_gen.cc | ||
| unique_id_gen.h | ||