perf: refactor the service manager to achieve significantly faster server initialization #1482
Closed
gamesguru
wants to merge 27 commits from
gamesguru/continuwuity:guru/fix/cli-optimization/faster-startup-safer-shutdown into main
4 participants
No description provided.
Update (3/14): All systems go. The missing console Ctrl+C handling is fixed with:

`cargo build --profile release --features default,console`

(It was previously busted due to moving some things to another PR; Ctrl+C in particular was badly busted.)

Update (3/5/26): Going to remove the work regarding the shutdown, since I realized there's a better way that I don't have time for at the moment. The startup refactor I feel pretty confident in. The amount of speedup depends on your hardware: it may be only 3x faster on a powerful setup, but could be close to 100x on systems that are completely choking up.
Refactor the service manager to achieve 30x faster server initialization, avoiding full database scans by implementing index-aware presence resets on a background thread.

Also fix the shutdown manager by terminating workers in the correct order of their dependency hierarchy. This prevents shutdown hangs and yields an extremely fast shutdown sequence that gives special preference only to the RocksDB worker and the integrity of the database.
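Dependency-ordered teardown can be sketched with a topological sort. This is a minimal std-only illustration, not Continuwuity's actual service graph (the service names and dependency edges are hypothetical): dependents stop first, and the database worker stops last.

```rust
use std::collections::HashMap;

// Compute a safe stop order for services: a service is stopped only after
// everything that depends on it has already stopped, so the database
// worker (which everything ultimately depends on) goes last.
fn shutdown_order<'a>(deps: &HashMap<&'a str, Vec<&'a str>>) -> Vec<String> {
    // Depth-first topological sort: dependencies are emitted first, so
    // reversing the result yields dependents-before-dependencies.
    fn visit<'a>(
        name: &'a str,
        deps: &HashMap<&'a str, Vec<&'a str>>,
        seen: &mut Vec<&'a str>,
        out: &mut Vec<String>,
    ) {
        if seen.contains(&name) {
            return;
        }
        seen.push(name);
        for &dep in deps.get(name).into_iter().flatten() {
            visit(dep, deps, seen, out);
        }
        out.push(name.to_string());
    }

    let mut seen = Vec::new();
    let mut out = Vec::new();
    for &name in deps.keys() {
        visit(name, deps, &mut seen, &mut out);
    }
    out.reverse(); // dependents first, database last
    out
}
```

This is only a sketch of the ordering idea; the real services hold async handles and need corking/flushing, not just an ordering.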
Pull request checklist:

- The changes are based on the `main` branch, and the branch is named something other than `main`.
- I have tested the changes myself, if applicable. This includes ensuring code compiles.

Commit b3e4f9d97e: "authorization was missing"

Title changed from "Refactor the service manager to achieve 30x faster server initialization" to "feat: refactor the service manager to achieve 30x faster server initialization".

On the diff at lines 10-12 of the ruma imports (`events::{AnyGlobalAccountDataEventContent, AnyRoomAccountDataEventContent, GlobalAccountDataEventType, RoomAccountDataEventType,` → `RoomAccountDataEventType,`):

FWIW, I believe this change is covered by #1479, so you may want to rebase depending on the outcome of that one.
Title changed from "feat: refactor the service manager to achieve 30x faster server initialization" to "WIP: feat: refactor the service manager to achieve 30x faster server initialization".

Commits affecting the files below will be moved to other branches. It is advised to use the `git` CLI when reviewing my PRs, especially large ones.

@lveneris I'm not able to see your comment due to some mobile bug. But referring back to my email...
I did manual testing to verify approximately correct behavior.
The commit about the admin service was added (perhaps only on my nightly build branch?) as a direct result of unexpected behavior.
As far as I recall, router is king. He holds the database handles, which in turn hold handles to subprocesses calling out to the DB driver.
The code you are referencing occurs at the end, not the start. So I'm not understanding your specific concern: that the database must continue to function after closing the app? Or that this permanently corrupts the database?
The proper, safe way, according to a Rust expert from a bulletin board system, involves both traceability and parallelism refactors, and likely some changes in DB driver behavior.
All mine really achieves in its present form on this PR is the avoidance of deadlocks at shutdown time due to congestion along the dependency chain from various services still trying to operate.

It achieves consistently quick shutdowns, perhaps at the expense that, if your computer loses power immediately afterwards, you may need to rebuild from the WAL on next boot, which is painfully slow and mildly risky.
@gamesguru That wasn't a mobile bug - I reviewed your PR from mobile, didn't see the full filenames when I was looking over things because my phone is small, noticed my mistake within a minute of posting my review, and promptly retracted it. Not the highlight of my career, that's for sure. Apologies for the misunderstanding.
I will go over this on a more appropriate device at a later date. When do you plan to remove / relocate the changes you have deemed irrelevant to this PR?
@lveneris
Ok no worries man. I've done the same thing, but usually not leaving an email record of it 😅
You don't have to hang your head in shame; next time just edit the comment: "Edit: whoops, realized that was at the end, not the start! Carry on, I'll see if I notice anything else here."

We're all human. You don't know how many hours I spent staring at the screen for this one, how many builds I cut where I added log statements that didn't tell me what I hoped, or how many of those builds crapped out after 10 minutes from missed formatting changes or basic syntax errors. Any participation here is good, man. Thanks for commenting; it gave me a chance to explain a bit better something that probably left a lot of people unclear.
On the sender shutdown timeout default:

```diff
@@ -2806,2 +2805,3 @@
 fn default_client_shutdown_timeout() -> u64 { 10 }
-fn default_sender_shutdown_timeout() -> u64 { 5 }
+fn default_sender_shutdown_timeout() -> u64 { 3 }
```

There's really no point in waiting longer than 3 seconds after corking. I'm not sure that 5 seconds guarantees any safer shutdown.
Some people have slower or more overloaded devices.
My device is quite overloaded: 150,000 users and 1 GB of RAM. Yet waiting 1 second in my experiments changed nothing about the state of the RocksDB handlers compared to even 120 seconds.
Regardless, it's going into a separate PR for shutdown cleanup logic.
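A bounded shutdown wait like the timeout discussed above can be sketched with a condition variable: wait for the worker to signal that it has flushed, but never longer than the configured timeout, so shutdown cannot hang indefinitely. This is a std-only sketch with hypothetical names, not the server's actual shutdown code.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

// Wait for a worker to signal completion (flag set to true + notify),
// giving up after `timeout`. Returns true if the worker finished in time.
fn wait_for_flush(done: &Arc<(Mutex<bool>, Condvar)>, timeout: Duration) -> bool {
    let (lock, cvar) = &**done;
    let mut finished = lock.lock().unwrap();
    // Loop to tolerate spurious wakeups; break once the timeout fires.
    while !*finished {
        let (guard, result) = cvar.wait_timeout(finished, timeout).unwrap();
        finished = guard;
        if result.timed_out() {
            break;
        }
    }
    *finished
}
```

The design point is simply that the timeout bounds the wait; whether 3 or 5 seconds is "enough" is a policy question the wait mechanism doesn't decide.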
```diff
@@ -29,3 +29,3 @@
 opts.set_max_subcompactions(num_threads::<u32>(config)?);
 opts.set_avoid_unnecessary_blocking_io(true);
-opts.set_max_file_opening_threads(0);
+opts.set_max_file_opening_threads(num_threads::<i32>(config)?);
```

Supposedly this allows for faster ingestion of the WAL on startup, as well as runtime parallelism for read ops. I'm not sure what the default behavior of zero did.

This appears undocumented.

Internally, or at RocksDB? See, for example, their C# wrapper.

Whoever set it to zero, I'm not sure where they got the idea to use zero.
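For reference, a sketch of the tuning above using the rust-rocksdb `Options` API. The `configured_threads` parameter is a hypothetical stand-in for the server's `num_threads` helper, and clamping to at least 1 is an assumption, since (as noted above) the effect of 0 is unclear.

```rust
use rocksdb::Options;

// Sketch, not the actual Continuwuity code. `configured_threads` stands in
// for the server's configured worker-thread count.
fn tuned_options(configured_threads: i32) -> Options {
    let mut opts = Options::default();
    opts.set_avoid_unnecessary_blocking_io(true);
    // Open table files in parallel during DB::open / WAL ingestion.
    // Clamped to at least 1 (assumption) because the effect of 0 is unclear.
    opts.set_max_file_opening_threads(configured_threads.max(1));
    opts
}
```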
```diff
@@ -373,0 +374,4 @@
+    "Admin command handler is not yet loaded. The server may still be booting or \
+     the admin module failed to load.",
+));
+};
```

This was necessary, as interrupting the program during the initialization sequence led to some pretty horrible crashes.
```diff
@@ -61,0 +67,4 @@
+} else {
+    None
+};
```

Presence updates are directly related to this PR.
A dumb, full database scan was the primary culprit behind slow startups (see the comment below on the removed code block and its call chains).
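The PR description mentions running the presence reset on a background thread instead of inline during startup. A minimal std-only sketch of that hand-off (the function and thread name are hypothetical): the server keeps booting while the reset runs, and the handle can be joined later if the result matters.

```rust
use std::thread;

// Spawn the (expensive) presence reset off the startup path so the server
// can begin serving requests immediately. The closure returns how many
// presence entries it reset, purely for illustration.
fn spawn_presence_reset<F>(reset: F) -> thread::JoinHandle<usize>
where
    F: FnOnce() -> usize + Send + 'static,
{
    thread::Builder::new()
        .name("presence-reset".into())
        .spawn(reset)
        .expect("failed to spawn presence reset thread")
}
```

In the real server this would be an async task rather than an OS thread, but the structural point is the same: the reset no longer blocks initialization.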
```diff
@@ -185,4 +196,0 @@
-.list_local_users()
-.map(ToOwned::to_owned)
-.collect::<Vec<OwnedUserId>>()
-.await
```

I determined this was the ultimate source of the problem. It scanned every user, a notoriously slow raw query.
This only scans local users (`pub fn list_local_users(&self) -> impl Stream<Item = &UserId> + Send + '_`), so it is unlikely to be the culprit.
The issue is that the function call chain does not behave as the English names of the methods would suggest! It actually reads all users into memory (incredibly expensive and slow, basically a full raw query), and only then filters on local or not.
Let me ask you this, Jade: how often do you restart your server? Have you ever made an effort to run it from a debugger and pause it during the very obnoxiously slow startup?
Please give it a try if you're still skeptical about this PR. Then build this PR and try it out yourself, your jaw will absolutely drop at the startup improvements.
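The scan-then-filter claim above can be illustrated with an ordered map. This is a generic sketch (the `server/user` key layout is hypothetical, not the actual schema): filtering a full scan touches every key, while a range scan over the key prefix seeks straight to the matching rows and stops at their end.

```rust
use std::collections::BTreeMap;

// O(total users): read every key, then filter by prefix.
fn local_users_full_scan(db: &BTreeMap<String, ()>, server: &str) -> Vec<String> {
    db.keys()
        .filter(|k| k.starts_with(&format!("{server}/")))
        .cloned()
        .collect()
}

// O(local users): seek to the prefix and stop at its end.
fn local_users_prefix_range(db: &BTreeMap<String, ()>, server: &str) -> Vec<String> {
    let start = format!("{server}/");
    let end = format!("{server}0"); // '0' (0x30) sorts just after '/' (0x2F)
    db.range(start..end).map(|(k, _)| k.clone()).collect()
}
```

Both return the same rows; the difference is how much of the store they touch, which is exactly the distinction at issue in the startup-time argument.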
```diff
@@ -135,4 +131,0 @@
-.insert(Manager::new(self))
-.clone()
-.start()
-.await?;
```

This was similarly a bit naively implemented.

It's probably also related to the bug in the `--read-only` option to the CLI, which was ripped out of the code recently.

```diff
@@ -229,3 +229,3 @@
 if let Some(key_ids) = missing.get_mut(server) {
-    key_ids.retain(|key_id| key_exists(&server_keys, key_id));
+    key_ids.retain(|key_id| !key_exists(&server_keys, key_id));
```

This is a logic error fix and performance improvement I would like to get in ASAP.
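The flipped predicate matters because `missing` should keep only the key IDs still absent after a fetch; without the `!`, it retained exactly the keys that had already been found. A standalone illustration (simplified: a slice stands in for the fetched server keys):

```rust
// `missing` starts as everything we asked for; after a fetch, retain only
// what is *still* missing. The original bug retained what was found.
fn prune_found(missing: &mut Vec<&str>, fetched: &[&str]) {
    missing.retain(|key_id| !fetched.contains(key_id));
}
```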
To get a fix in sooner, open a PR that includes only that fix, with absolutely no other changes. Otherwise it will be delayed by the slowest thing in this PR.

This fix is not important to me, as it doesn't really speed things up as much as I hoped. It has anyway been pruned from here and merged separately...
What is important is getting my boot times down from half an hour to 10 seconds.
Please expedite the review process for this PR. I'm completely frustrated with the startup time of half an hour. It's absurd!
Title changed from "WIP: feat: refactor the service manager to achieve 30x faster server initialization" back to "feat: refactor the service manager to achieve 30x faster server initialization".

```diff
@@ -142,0 +136,4 @@
+let manager = {
+    let mut lock = self.manager.lock().await;
+    let manager = Manager::new(self);
+    _ = lock.insert(Arc::clone(&manager));
```

This is admittedly unnecessary, and part of the problem. Since we're doing it everywhere, it's hard to trace the DB handles or flush them properly at shutdown.

Think I'll revert this part.

I simplified this.
```diff
@@ -144,4 +148,0 @@
-_ = self
-    .presence
-    .ping_presence(&self.globals.server_user, &ruma::presence::PresenceState::Online)
-    .await;
```

However, this, again, was rather slow and needed help.

Title changed from "feat: refactor the service manager to achieve 30x faster server initialization" to "perf: refactor the service manager to achieve significantly faster server initialization".

I'm going to re-test this. Since there were significant changes requested, it's effectively untested code.
Luckily this is the one branch that starts up quickly and doesn't take half an hour, phew!
Title changed from "perf: refactor the service manager to achieve significantly faster server initialization" to "wip: perf: refactor the service manager to achieve significantly faster server initialization", and then back to "perf: refactor the service manager to achieve significantly faster server initialization".

This is the output from testing today. Notice it takes only 6 seconds to open the database, 2 more seconds to open the socket, and 4 more seconds to clear the presence updates.

Prior to this PR (before the threaded DB init and avoiding the full users-table scan), each of those steps took 2-10 minutes.
Not much of a flamegraph, but I can't replicate this on debug builds. Here are logs showing in excess of a minute (19:06:41 -> 19:07:46), which is already way longer than I've seen on my branches. It seems the issue gets worse the more I use the release binary, and tends to diminish as I run my allegedly healthier WAL'd, parallel, index-based, etc. branches.

flamegraph2.svg

Note: my main domain nutra tk only has ~80,000 linked users through rooms. The mdev nutra tk domain, which sadly is even more impossible to test due to being on a sync-tokenless v19 schema, has closer to ~150,000, and that is where the issue was absolutely more than twice as bad (i.e., a more-than-linear trend, imho).
Closed due to moderation action
Pull request closed