RocksDB corruption - any way to fix? #1107

Open
opened 2025-10-08 23:33:11 +00:00 by mcronce · 3 comments

(Without throwing away all my existing data and starting over; I'm expecting some loss at this point.)

After a power outage, I was getting this error on startup:

Critical error starting server: I/O error: Corruption: no next_file_number entry in MANIFEST The file /data/db/MANIFEST-11265429 may be corrupted

I backed up the (corrupted) database before attempting any repair, so starting from that point again is possible. (And I've done it several times.)

After a repair completes (CONDUWUIT_ROCKSDB_REPAIR=true), with CONDUWUIT_ROCKSDB_RECOVERY_MODE set to 1, 2, or 3, I get the following error:

Critical error starting server: I/O error: Invalid argument: Column families not opened: roomid_lasttypingupdate, userid_lastpresenceupdate, typingid_userid
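
The retry procedure described above (restore the untouched corrupted copy, then try one recovery mode) could be scripted roughly as follows. This is a minimal sketch: the helper names and directory arguments are hypothetical, and only the two environment variable names come from this report.

```python
import shutil
from pathlib import Path

# Recovery modes tried in this report.
RECOVERY_MODES = (1, 2, 3)


def reset_from_backup(backup_dir, work_dir):
    """Replace work_dir with a fresh copy of the untouched corrupted backup."""
    work = Path(work_dir)
    if work.exists():
        shutil.rmtree(work)  # discard whatever the previous attempt left behind
    shutil.copytree(backup_dir, work)
    return work


def attempt_env(mode, repair=True):
    """Environment variables for one repair attempt (names taken from the report)."""
    if mode not in RECOVERY_MODES:
        raise ValueError(f"untested recovery mode: {mode}")
    return {
        "CONDUWUIT_ROCKSDB_REPAIR": "true" if repair else "false",
        "CONDUWUIT_ROCKSDB_RECOVERY_MODE": str(mode),
    }
```

Calling `reset_from_backup` before every attempt keeps each recovery mode working from the same starting state, so one failed repair cannot influence the next.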

When I let it restart after that, it either hangs indefinitely (seemingly rewriting the database into new files over and over, judging by strace and the DB directory) or continuwuity comes up healthy but with the database fully wiped.

Any additional troubleshooting I can perform or tips for possible repair would be extremely appreciated. If I were to lose those three column families - I'm assuming they represent last typing updates, last presence updates, and user typing relations - I wouldn't be particularly heartbroken, but private (unfederated) room event histories, space/room hierarchy, users, etc. are things I'm hoping to be able to recover, at least mostly.

Running Continuwuity v0.5.0-rc6 in a container, in case it's relevant

Owner

If none of the recovery modes got you going again, it's highly likely that you have severe data loss, and whatever you could get back even from a successful recovery might not be worth much.

Your best bet is to restore from a backup taken before the corruption, although from the tone of your issue I'm assuming you don't have one of those. You can try using some external rocksdb tools to tinker with the db - a quick search of the error message you get after a repair completes seems to indicate that the columns roomid_lasttypingupdate, userid_lastpresenceupdate, and typingid_userid don't exist, so you could try manually creating some empty columns and retrying the repair, but I don't know enough about rocksdb myself to help you more than that :(
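
A minimal sketch of how such an inspection could start, assuming RocksDB's `ldb` tool is installed and pointed at a copy of the database (the copy path is hypothetical). RocksDB normally reports "Column families not opened" when the database on disk contains column families that the opener did not ask for, so it is worth confirming which families are actually present before creating or dropping anything. The exact `ldb` argument syntax varies between RocksDB versions, so the `--db=` form used here is an assumption to check against `ldb --help`.

```python
# Names copied from the error message in this thread.
REPORTED = (
    "roomid_lasttypingupdate",
    "userid_lastpresenceupdate",
    "typingid_userid",
)


def ldb_command(db_path, subcommand, *args):
    """Build an argv list for RocksDB's ldb tool; nothing is executed here."""
    return ["ldb", f"--db={db_path}", subcommand, *args]


def unexpected_families(present, expected):
    """Column families found on disk that the server would not open."""
    return sorted(set(present) - set(expected))


# Hypothetical working copy; never point this at the only copy of the data.
list_cmd = ldb_command("/tmp/db-copy", "list_column_families")
```

The resulting list can be passed to `subprocess.run(list_cmd)` and the printed names compared against `REPORTED` by hand. `ldb` also has `create_column_family` and `drop_column_family` subcommands, which can be built with the same helper once it is clear whether the three families are missing or left over.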

Author

Nope, no working backups. I don't know much about rocksdb, and a couple brief searches over the years never yielded any results for backup utilities that wouldn't require shutting the server down.

It's hard to imagine that the bulk of the data is cooked - it's not like days/weeks/months old files were being written to at the time of the power loss. It would only be recent data and (by the looks of things) that MANIFEST file. Hoping that there's a rocksdb expert floating around here somewhere :)

Owner

> a couple brief searches over the years never yielded any results for backup utilities that wouldn't require shutting the server down.

There's built-in support for online backups by sending the !admin server backup-database command to the admin room (possibly on a crontab) - see https://continuwuity.org/maintenance#backups for future reference.
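
For the crontab idea, one way to send that command from a script is the standard Matrix client-server send-message endpoint. This is a minimal sketch, not an official tool: the homeserver URL, admin room ID, and access token are placeholders, and the function name is made up for illustration. It only builds the request, so nothing is sent until it is passed to `urllib.request.urlopen`.

```python
import json
import urllib.parse
import urllib.request
import uuid

BACKUP_COMMAND = "!admin server backup-database"


def build_backup_request(homeserver, room_id, access_token, txn_id=None):
    """Build a PUT request that posts the backup command into the admin room."""
    txn_id = txn_id or uuid.uuid4().hex  # the transaction ID must be unique per message
    url = "{}/_matrix/client/v3/rooms/{}/send/m.room.message/{}".format(
        homeserver.rstrip("/"),
        urllib.parse.quote(room_id, safe=""),  # room IDs contain '!' and ':'
        urllib.parse.quote(txn_id, safe=""),
    )
    body = json.dumps({"msgtype": "m.text", "body": BACKUP_COMMAND}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
    )
```

A cron job could call `urllib.request.urlopen(build_backup_request(...))` with the token of an account that is allowed to use the admin room. The linked maintenance page remains the reference for how the backup itself must be configured.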

> Hoping that there's a rocksdb expert floating around here somewhere :)

Feel free to mention it in #main:continuwuity.org and/or our dev room at #dev:continuwuity.org; there are generally more eyes there than on the issue tracker (no guarantees though, rocksdb is a little bit of magic scribed in a GitHub wiki of all things)

Reference
continuwuation/continuwuity#1107