Summary:
Add `open_files_async` option for faster DB startup. When enabled, SST file opening and validation are deferred to a background thread after `DB::Open` returns, reducing startup latency for databases with many SST files. WAL recovery remains synchronous.

To support this, `FindTable` is extended with a pinning mechanism that stores the cache handle directly on `FileMetaData` via a new `PinnedTableReader` class, and sets the table reader atomically so subsequent reads skip cache lookups. `FileDescriptor::table_reader` is replaced with `PinnedTableReader pinned_reader`, which wraps a `std::atomic<TableReader*>` with acquire/release ordering to safely handle concurrent access between the background opener and read threads.

If validation fails, the background opener sets a `kAsyncFileOpen` background error. Future read requests look up the table reader again via the cache, and any validation failure there is propagated to the user (existing behavior when `max_open_files > 0`).

This feature is most useful when `max_open_files=-1`, because otherwise file opening is already capped at 16 files and DB open should be fast.

## Restrictions
- This feature is incompatible with FIFO compaction, because FIFO compaction requires reading table properties under the DB mutex. When the table reader is unpinned, this may cause a DB hang.
- This feature is also incompatible with `skip_stats_update_on_db_open=false`, because that setting would make DB open even longer.

## Key changes
- New `open_files_async` DB option with C, Java, and `db_bench` bindings
- `BGWorkAsyncFileOpen` background worker that opens all SST files post-`DB::Open`, with shutdown awareness via the `shutting_down_` flag
- New `PinnedTableReader` class in `version_edit.h`: a thread-safe wrapper holding `std::atomic<TableReader*>` and `Cache::Handle*` with proper acquire/release ordering. It replaces the old `FileDescriptor::table_reader` raw pointer and `FileMetaData::table_reader_handle`
- Extract `LoadTableHandlersHelper` into `db/version_util.cc`, shared between `VersionBuilder::LoadTableHandlers` (for version edits during recovery) and `BGWorkAsyncFileOpen` (for base storage post-open)
- `FindTable` extended with `pin_table_handle` and `out_table_reader` params. When pinning is enabled, the table reader is stored on `FileMetaData` so Get/MultiGet/Iterator skip redundant cache lookups. `FindTable` now performs the pinned-reader fast-path check internally instead of requiring callers to check `fd.table_reader` beforehand
  - Note: pinning is explicit (not default) because some callers create temporary `FileMetaData`s that would need to properly clean up table handles
- `CompactedDBImpl` updated to use `FindTable` + pinning instead of raw `fd.table_reader` access for Get/MultiGet
- New `kAsyncFileOpen` background error reason in `listener.h` and `error_handler.cc`
- Add a check in `~DBImpl` that catches (future) subclasses of `DBImpl` that forget to schedule the async file open task. Subclasses that never use the task must explicitly mark that.
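
The acquire/release pinning described above can be illustrated with a small, self-contained sketch. This is not the `PinnedTableReader` from this change: `ToyTableReader`, `ToyPinnedReader`, `TryPin`, and `RaceOpenerAndReader` are hypothetical names, the `Cache::Handle*` is omitted, and the first-publisher-wins compare-exchange is an assumption about how a race between the background opener and a reader could be resolved.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for a table reader; not the RocksDB class.
struct ToyTableReader {
  int file_number;
};

// Simplified sketch of a pinned-reader slot. The real class also keeps a
// cache handle alongside the pointer; that part is left out here.
class ToyPinnedReader {
 public:
  // Fast path for reads: the acquire load pairs with the release in TryPin,
  // so a non-null result points at a fully constructed reader.
  ToyTableReader* Get() const {
    return reader_.load(std::memory_order_acquire);
  }

  // Publishes `candidate` only if nothing is pinned yet. Returns the reader
  // that ends up pinned, which may come from a thread that won the race.
  ToyTableReader* TryPin(ToyTableReader* candidate) {
    ToyTableReader* expected = nullptr;
    if (reader_.compare_exchange_strong(expected, candidate,
                                        std::memory_order_acq_rel,
                                        std::memory_order_acquire)) {
      return candidate;
    }
    return expected;  // Someone else pinned first; use their reader.
  }

 private:
  std::atomic<ToyTableReader*> reader_{nullptr};
};

// Simulates a background opener and a foreground read racing to pin the same
// file. Returns true if both observed the same pinned reader.
bool RaceOpenerAndReader() {
  ToyPinnedReader slot;
  ToyTableReader from_opener{7};
  ToyTableReader from_read{7};
  ToyTableReader* seen_by_opener = nullptr;
  std::thread opener([&] { seen_by_opener = slot.TryPin(&from_opener); });
  ToyTableReader* seen_by_reader = slot.TryPin(&from_read);
  opener.join();
  assert(slot.Get() != nullptr);
  return seen_by_opener == seen_by_reader && slot.Get() == seen_by_reader;
}
```

In this sketch the thread that loses the race would be responsible for releasing its own candidate, which is one reason a change like this might keep pinning explicit rather than on by default.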

Pull Request resolved: https://github.com/facebook/rocksdb/pull/14322

Test Plan:
- `OpenFilesAsyncTest` parameterized over `num_flushes` (1, 20), `ReadType` (Get, MultiGet, Iterator), `max_open_files` (-1, 10), and `read_only` (true, false)
  - **ConcurrentFileAccess**: concurrent reads and compactions race with async opener
  - **AfterRead**: reads happen before async opener, verifying lazy open and that the opener sees already-pinned readers
  - **BeforeRead**: async opener completes first, verifying reads use pre-loaded table readers
  - **Shutdown**: DB closes before async opener starts, verifying clean cancellation with 0 file opens
  - **Error**: corrupted SST files, verifying `kAsyncFileOpen` background error is set and reads return corruption
  - **DropColumnFamily**: CF dropped before async opener runs, verifying the opener gracefully skips dropped CFs
- Added to crash test

### Benchmark
To simulate a high-latency remote filesystem, I set up a virtual filesystem with dm-delay using 10 ms reads, 0 ms writes.

```
# Generate a DB with many L0 files
TEST_TMPDIR=/data/users/jkangs/dm-delay-test/mnt ./db_bench -benchmarks=fillseq -disable_auto_compactions=true -write_buffer_size=1000 -num=1000000
```

```
./db_bench -use_existing_db=true -db=/data/users/jkangs/dm-delay-test/mnt/dbbench -benchmarks=readrandom -reads=1 -report_open_timing=true -open_files_async=true -use_direct_reads -file_opening_threads=1 -skip_stats_update_on_db_open

OpenDb: 25.1419 milliseconds
```

```
./db_bench -use_existing_db=true -db=/data/users/jkangs/dm-delay-test/mnt/dbbench -benchmarks=readrandom -reads=1 -report_open_timing=true -open_files_async=false -use_direct_reads -file_opening_threads=1 -skip_stats_update_on_db_open

OpenDb: 23109.4 milliseconds
```

### No read regressions
On main branch
```
./db_bench -use_existing_db=true -db=/dev/shm/dbbench -benchmarks=readrandom -seed=1 -threads=8 -duration=30

readrandom : 4.827 micros/op 1657100 ops/sec 30.005 seconds 49720992 operations; 183.3 MB/s (6198999 of 6198999 found)
```

On this branch
```
./db_bench -use_existing_db=true -db=/dev/shm/dbbench -benchmarks=readrandom -seed=1 -threads=8 -duration=30

readrandom : 4.863 micros/op 1644808 ops/sec 30.007 seconds 49354992 operations; 182.0 MB/s (6099999 of 6099999 found)

./db_bench -use_existing_db=true -db=/dev/shm/dbbench -benchmarks=readrandom -seed=1 -threads=8 -duration=30 -open_files_async=true

readrandom : 4.803 micros/op 1665392 ops/sec 30.004 seconds 49968992 operations; 184.2 MB/s (6222999 of 6222999 found)
```

Reviewed By: pdillinger, xingbowang
Differential Revision: D93538033
Pulled By: joshkang97
fbshipit-source-id: 32ac70c112cd733b7c1e1c1e2e7ce6422318a5ae

142 lines
5.9 KiB
C++
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.
//
#pragma once

#include <memory>
#include <string>
#include <vector>

#include "db/version_edit.h"
#include "rocksdb/file_system.h"
#include "rocksdb/metadata.h"
#include "rocksdb/slice_transform.h"

namespace ROCKSDB_NAMESPACE {

struct ImmutableCFOptions;
class TableCache;
class VersionStorageInfo;
class VersionEdit;
struct FileMetaData;
class InternalStats;
class Version;
class VersionSet;
class VersionEditHandler;
class ColumnFamilyData;
class CacheReservationManager;

// A helper class so we can efficiently apply a whole sequence
// of edits to a particular state without creating intermediate
// Versions that contain full copies of the intermediate state.
class VersionBuilder {
 public:
  VersionBuilder(const FileOptions& file_options,
                 const ImmutableCFOptions* ioptions, TableCache* table_cache,
                 VersionStorageInfo* base_vstorage, VersionSet* version_set,
                 std::shared_ptr<CacheReservationManager>
                     file_metadata_cache_res_mgr = nullptr,
                 ColumnFamilyData* cfd = nullptr,
                 VersionEditHandler* version_edit_handler = nullptr,
                 bool track_found_and_missing_files = false,
                 bool allow_incomplete_valid_version = false);
  ~VersionBuilder();

  bool CheckConsistencyForNumLevels();

  Status Apply(const VersionEdit* edit);

  // Save the current Version to the provided `vstorage`.
  Status SaveTo(VersionStorageInfo* vstorage) const;

  // Load table handlers for newly added files in the builder. This does not
  // load any files in the base storage.
  Status LoadTableHandlers(InternalStats* internal_stats, int max_threads,
                           bool prefetch_index_and_filter_in_cache,
                           bool is_initial_load,
                           const MutableCFOptions& mutable_cf_options,
                           size_t max_file_size_for_l0_meta_pin,
                           const ReadOptions& read_options);

  //============APIs only used by VersionEditHandlerPointInTime ============//

  // Creates a save point for the Version that has been built so far. Subsequent
  // VersionEdits applied to the builder will not affect the Version in this
  // save point. VersionBuilder currently only supports creating one save point,
  // so when `CreateOrReplaceSavePoint` is called again, the previous save point
  // is cleared. `ClearSavePoint` can be called explicitly to clear
  // the save point too.
  void CreateOrReplaceSavePoint();

  // Returns true if the builder can find all the files needed to build a
  // `Version`, or if `allow_incomplete_valid_version_` is true, the version
  // history is never edited in an atomic group, and only a suffix of L0 SST
  // files and their associated blob files is missing.
  // From the users' perspective, missing a suffix of L0 files means missing the
  // user's most recently written data. So the remaining available files still
  // present a valid point-in-time view, although for some previous time.
  // This validity check result will be cached and reused if the Version is not
  // updated between two validity checks.
  bool ValidVersionAvailable();

  bool HasMissingFiles() const;

  // When applying a sequence of VersionEdits, intermediate files are the ones
  // that are added and then deleted. The caller should clear the
  // intermediate-file tracking after calling this API so that the tracking for
  // subsequent VersionEdits can start over with a clean state.
  std::vector<std::string>& GetAndClearIntermediateFiles();

  // Clears all the found files in this Version.
  void ClearFoundFiles();

  // Save the Version in the save point to the provided `vstorage`.
  // Non-OK status will be returned if there is not a valid save point.
  Status SaveSavePointTo(VersionStorageInfo* vstorage) const;

  // Load all the table handlers for the Version in the save point.
  // Non-OK status will be returned if there is not a valid save point.
  Status LoadSavePointTableHandlers(InternalStats* internal_stats,
                                    int max_threads,
                                    bool prefetch_index_and_filter_in_cache,
                                    bool is_initial_load,
                                    const MutableCFOptions& mutable_cf_options,
                                    size_t max_file_size_for_l0_meta_pin,
                                    const ReadOptions& read_options);

  void ClearSavePoint();

  //======= End of APIs only used by VersionEditHandlerPointInTime ==========//

 private:
  class Rep;
  std::unique_ptr<Rep> savepoint_;
  std::unique_ptr<Rep> rep_;
};

// A wrapper around VersionBuilder that references the current version in its
// constructor and unrefs it in its destructor.
// Both the constructor and the destructor must be called while holding the
// DB mutex.
class BaseReferencedVersionBuilder {
 public:
  explicit BaseReferencedVersionBuilder(
      ColumnFamilyData* cfd, VersionEditHandler* version_edit_handler = nullptr,
      bool track_found_and_missing_files = false,
      bool allow_incomplete_valid_version = false);
  BaseReferencedVersionBuilder(
      ColumnFamilyData* cfd, Version* v,
      VersionEditHandler* version_edit_handler = nullptr,
      bool track_found_and_missing_files = false,
      bool allow_incomplete_valid_version = false);
  ~BaseReferencedVersionBuilder();
  VersionBuilder* version_builder() const { return version_builder_.get(); }

 private:
  std::unique_ptr<VersionBuilder> version_builder_;
  Version* version_;
};
}  // namespace ROCKSDB_NAMESPACE