Summary:
Add a new mutable DB option `verify_manifest_content_on_close` (default: false).
When enabled, on DB close the MANIFEST file is read back and all records are
validated (CRC checksums via log::Reader and logical content via
VersionEdit::DecodeFrom). If corruption is detected, a fresh MANIFEST is written
from in-memory state using the existing LogAndApply recovery path.
This complements the existing size validation in VersionSet::Close() with content
validation, reusing the same manifest reading pattern as VersionSet::Recover().
Implementation plan:
## Part 1: New DB Option — verify_manifest_content_on_close
- A new mutable bool DB option (default: false) that can be dynamically toggled
via SetDBOptions() at runtime, following the pattern of other mutable manifest
options like max_manifest_file_size.
- Propagation: SetDBOptions() -> DBImpl::mutable_db_options_ ->
versions_->UpdatedMutableDbOptions() -> VersionSet::verify_manifest_content_on_close_
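The propagation chain above can be pictured with a small toy model. This is only a sketch: `ToyMutableDBOptions`, `ToyVersionSet`, and `ToyDB` are hypothetical stand-ins, not RocksDB classes. Only the option name and the idea of a string-keyed `SetDBOptions()` that pushes the new value down to the version set are taken from this change.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the mutable DB options struct.
struct ToyMutableDBOptions {
  bool verify_manifest_content_on_close = false;  // default: off
};

// Hypothetical stand-in for VersionSet: caches the flag it consults on close.
class ToyVersionSet {
 public:
  void UpdatedMutableDbOptions(const ToyMutableDBOptions& opts) {
    verify_manifest_content_on_close_ = opts.verify_manifest_content_on_close;
  }
  bool verify_manifest_content_on_close() const {
    return verify_manifest_content_on_close_;
  }

 private:
  bool verify_manifest_content_on_close_ = false;
};

// Hypothetical stand-in for DBImpl.
class ToyDB {
 public:
  // Returns false and changes nothing on an unknown key or a bad value.
  bool SetDBOptions(const std::unordered_map<std::string, std::string>& kv) {
    ToyMutableDBOptions updated = mutable_db_options_;
    for (const auto& entry : kv) {
      if (entry.first != "verify_manifest_content_on_close") return false;
      if (entry.second == "true") {
        updated.verify_manifest_content_on_close = true;
      } else if (entry.second == "false") {
        updated.verify_manifest_content_on_close = false;
      } else {
        return false;
      }
    }
    mutable_db_options_ = updated;
    // Push the new value down so the version set sees it at close time.
    versions_.UpdatedMutableDbOptions(mutable_db_options_);
    return true;
  }
  const ToyVersionSet& versions() const { return versions_; }

 private:
  ToyMutableDBOptions mutable_db_options_;
  ToyVersionSet versions_;
};
```

Against a real DB handle the equivalent call would presumably be `db->SetDBOptions({{"verify_manifest_content_on_close", "true"}})`, since `SetDBOptions()` takes a string-to-string map.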
## Part 2: Core Implementation — Content Validation in VersionSet::Close()
- Inserted after existing size check, before closed_ = true
- Opens manifest as SequentialFileReader, creates log::Reader with checksum=true
- Loops ReadRecord with WALRecoveryMode::kAbsoluteConsistency, decodes each
record as VersionEdit
- On corruption: fires OnIOError listeners, logs error, calls LogAndApply with
empty edit to trigger manifest rewrite from in-memory state
- If manifest can't be opened for reading: logs warning, doesn't fail close
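The validation loop described in Part 2 can be sketched without any RocksDB dependencies. Everything here is an assumption made for illustration: `ToyRecord`, `ToyCrc32`, `ToyDecodeEdit`, and `ToyValidateManifest` are hypothetical names, the record format is invented, and the plain CRC-32 below merely stands in for the checksum that `log::Reader` verifies. The sketch shows the three outcomes only: clean, corrupt (checksum or decode failure), and skipped because the file could not be opened.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Stand-in checksum: bitwise CRC-32 (reflected polynomial 0xEDB88320).
uint32_t ToyCrc32(const std::string& data) {
  uint32_t crc = 0xFFFFFFFFu;
  for (unsigned char byte : data) {
    crc ^= byte;
    for (int i = 0; i < 8; ++i) {
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// Hypothetical on-disk record: a stored checksum plus the payload it covers.
struct ToyRecord {
  uint32_t stored_crc;
  std::string payload;
};

ToyRecord ToyMakeRecord(const std::string& payload) {
  return ToyRecord{ToyCrc32(payload), payload};
}

// Hypothetical logical check, standing in for VersionEdit::DecodeFrom:
// a valid toy edit is "add:<digits>" or "del:<digits>".
bool ToyDecodeEdit(const std::string& payload) {
  if (payload.size() < 5) return false;
  const std::string tag = payload.substr(0, 4);
  if (tag != "add:" && tag != "del:") return false;
  for (size_t i = 4; i < payload.size(); ++i) {
    if (payload[i] < '0' || payload[i] > '9') return false;
  }
  return true;
}

enum class ToyResult { kClean, kCorrupt, kSkipped };

// A null manifest models "could not open for reading": warn, do not fail.
ToyResult ToyValidateManifest(const std::vector<ToyRecord>* manifest) {
  if (manifest == nullptr) return ToyResult::kSkipped;
  for (const ToyRecord& rec : *manifest) {
    // Physical check first, then the logical decode of the same record.
    if (ToyCrc32(rec.payload) != rec.stored_crc) return ToyResult::kCorrupt;
    if (!ToyDecodeEdit(rec.payload)) return ToyResult::kCorrupt;
  }
  return ToyResult::kClean;
}
```

Running both checks per record matters because they fail independently: a flipped bit breaks the checksum, while a record that was written incorrectly can carry a valid checksum and still fail to decode.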
## Part 3: Unit Tests (in version_set_test.cc)
- ManifestContentValidationOnClose_Clean: enable option, normal close, verify
no manifest rotation
- ManifestContentValidationOnClose_CorruptRecord: enable option, corrupt manifest
via SyncPoint, verify rotation occurs and DB reopens cleanly
- ManifestContentValidationOnClose_Disabled: default off, verify content
validation does not run
- ManifestContentValidationOnClose_SizeCheckFails: truncate manifest so size
check fails first, verify recovery via size-check path
## What Happens If Corruption Is Detected
If corruption is detected, four things happen:
1. **Notify listeners** — Fires `OnIOError` on all registered event listeners
(from db_options_->listeners) so monitoring/alerting systems can observe
the corruption event. Uses `FileOperationType::kVerify` to categorize it.
2. **Permit unchecked errors** — `PermitUncheckedError()` silences RocksDB's
debug-mode assertion that every `IOStatus` must be inspected. These statuses
are informational-only here; the real recovery is via `LogAndApply`.
3. **Log the error** — Writes a `ROCKS_LOG_ERROR` message with the filename
for operational visibility (grep-able in production logs).
4. **Rewrite the manifest via `LogAndApply`** — This is the actual recovery.
`LogAndApply` is called with an empty `VersionEdit` (no changes). Internally,
`LogAndApply` detects that the current `descriptor_log_` is null (it was
reset at line 5551, or by the previous `LogAndApply` in the size-check
path) and creates a brand-new MANIFEST file. It serializes the entire
current in-memory LSM state — all column families, all levels, all file
metadata, sequence numbers, etc. — into this new file. It then atomically
updates the `CURRENT` file pointer to reference the new MANIFEST.
This works because the in-memory state was built from the original manifest
during `DB::Open()` and has been kept fully up to date through all
subsequent operations (flushes, compactions, etc.) during the DB's lifetime.
The on-disk manifest is essentially a journal of changes; `LogAndApply`
with an empty edit produces a fresh, compacted snapshot of that state.
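The journal-versus-snapshot idea can be illustrated with a deliberately tiny model. This is a sketch under heavy simplification: `ToyEdit`, `ToyState`, `ToyReplay`, and `ToySnapshot` are hypothetical names, and the state is reduced to a set of live file numbers, whereas the real rewrite also covers column families, levels, file metadata, and sequence numbers.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical journal entry: add or delete one file number.
struct ToyEdit {
  bool is_add;
  uint64_t file_number;
};

// Hypothetical in-memory state: the set of live files.
using ToyState = std::set<uint64_t>;

void ToyApply(const ToyEdit& edit, ToyState* state) {
  if (edit.is_add) {
    state->insert(edit.file_number);
  } else {
    state->erase(edit.file_number);
  }
}

// Rebuild state by replaying a journal from the start, as recovery would.
ToyState ToyReplay(const std::vector<ToyEdit>& journal) {
  ToyState state;
  for (const ToyEdit& edit : journal) ToyApply(edit, &state);
  return state;
}

// Models the rewrite: serialize the full in-memory state as a fresh journal
// that contains only what is live now, with no history.
std::vector<ToyEdit> ToySnapshot(const ToyState& state) {
  std::vector<ToyEdit> fresh;
  for (uint64_t file_number : state) {
    fresh.push_back(ToyEdit{true, file_number});
  }
  return fresh;
}
```

Replaying the snapshot yields the same state as replaying the original journal, which is why the old file's contents are not needed for the rewrite: the in-memory state alone is sufficient.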
## Flow Diagram of Manifest Content Validation
VersionSet::Close()
│
├─ Close descriptor_log_ and check size
│ └─ Size mismatch? → LogAndApply (rewrite manifest)
│
├─ Content validation (if s.ok() && option enabled)
│ ├─ Open manifest for sequential reading
│ │ └─ Can't open? → WARN log, continue
│ │
│ ├─ For each record:
│ │ ├─ ReadRecord (CRC32 check, kAbsoluteConsistency)
│ │ └─ DecodeFrom (VersionEdit logical check)
│ │
│ └─ Corruption detected?
│ ├─ Notify OnIOError listeners
│ ├─ LOG_ERROR
│ └─ LogAndApply (rewrite manifest from in-memory state)
│
└─ closed_ = true; return s;
## How This Relates to the Existing Size Check
The existing size check in `VersionSet::Close()` (lines 5556-5582) and the new
content validation are complementary:
| Check | What it catches | How it checks |
|----------------|-----------------------------------------|----------------------------|
| Size check | Truncation, partial writes, extra bytes | Compare expected vs actual file size |
| Content check | Bit-rot, silent corruption, bad records | CRC32 + VersionEdit decode |
The size check catches gross corruption (file too short or too long). The
content check catches subtle corruption where the file is the right size but
individual bytes have been flipped (e.g., storage media bit-rot, a buggy
filesystem, or an incomplete block write).
Both recovery paths use the same mechanism: `LogAndApply` with an empty
`VersionEdit` to rewrite the manifest from in-memory state.
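A minimal sketch of why the two checks are complementary, under simplifying assumptions: `ToyExpected`, `ToyChecksum`, `ToySizeCheck`, and `ToyContentCheck` are hypothetical names, and a single whole-file FNV-1a hash stands in for the per-record CRC verification that the real content check performs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Stand-in checksum (32-bit FNV-1a), not the CRC that RocksDB uses.
uint32_t ToyChecksum(const std::string& bytes) {
  uint32_t hash = 2166136261u;
  for (unsigned char b : bytes) {
    hash ^= b;
    hash *= 16777619u;
  }
  return hash;
}

// What the writer believes it wrote.
struct ToyExpected {
  size_t size;
  uint32_t checksum;
};

// Size check: compares lengths only, never looks at the bytes.
bool ToySizeCheck(const std::string& on_disk, const ToyExpected& expected) {
  return on_disk.size() == expected.size;
}

// Content check: reads the bytes back and verifies them.
bool ToyContentCheck(const std::string& on_disk, const ToyExpected& expected) {
  return ToyChecksum(on_disk) == expected.checksum;
}
```

A truncated file fails `ToySizeCheck`, while a file with one flipped byte passes it and is caught only by `ToyContentCheck`.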
Pull Request resolved: https://github.com/facebook/rocksdb/pull/14451
Reviewed By: xingbowang
Differential Revision: D96004906
Pulled By: dannyhchen
fbshipit-source-id: 0b0ecdada3a74e97d2cadbba2091b8b577f1d684