perf: refactor the service manager to achieve significantly faster server initialization #1482
Closed
gamesguru
wants to merge 27 commits from
gamesguru/continuwuity:guru/fix/cli-optimization/faster-startup-safer-shutdown into main
4 participants
No description provided.
Update (3/14): All systems go. The missing console Ctrl+C handling is fixed with:

`cargo build --profile release --features default,console`

(It was previously busted due to moving some things to another PR; Ctrl+C in particular was badly busted.)

Update (3/5/26): Going to remove the work regarding the shutdown, since I realized there's a better way that I don't have time for at the moment. The startup refactor I feel pretty confident in. The amount of speedup depends on your hardware: it may be only 3x faster on a powerful setup, but could be close to 100x on systems that are completely choking up.
Refactor the service manager to achieve 30x faster server initialization, avoiding full database scans by implementing index-aware presence resets on a background thread.

Also fix the shutdown manager by terminating workers in the correct order of their dependency hierarchy. This prevents shutdown hangs and yields an extremely fast shutdown sequence that gives special preference only to the RocksDB worker and the integrity of the database.
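Dependency-ordered teardown can be sketched with a topological sort. This is a minimal std-only illustration, not Continuwuity's actual service graph (the service names and dependency edges are hypothetical): dependents stop first, and the database worker stops last.

```rust
use std::collections::HashMap;

// Compute a safe stop order for services: a service is stopped only after
// everything that depends on it has already stopped, so the database
// worker (which everything ultimately depends on) goes last.
fn shutdown_order<'a>(deps: &HashMap<&'a str, Vec<&'a str>>) -> Vec<String> {
    // Depth-first topological sort: dependencies are emitted first, so
    // reversing the result yields dependents-before-dependencies.
    fn visit<'a>(
        name: &'a str,
        deps: &HashMap<&'a str, Vec<&'a str>>,
        seen: &mut Vec<&'a str>,
        out: &mut Vec<String>,
    ) {
        if seen.contains(&name) {
            return;
        }
        seen.push(name);
        for &dep in deps.get(name).into_iter().flatten() {
            visit(dep, deps, seen, out);
        }
        out.push(name.to_string());
    }

    let mut seen = Vec::new();
    let mut out = Vec::new();
    for &name in deps.keys() {
        visit(name, deps, &mut seen, &mut out);
    }
    out.reverse(); // dependents first, database last
    out
}
```

This is only a sketch of the ordering idea; the real services hold async handles and need corking/flushing, not just an ordering.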
Pull request checklist:

- The changes are based on the `main` branch, and the branch is named something other than `main`.
- I have tested the changes myself, if applicable. This includes ensuring code compiles.

Commit b3e4f9d97e: "authorization was missing"

Title changed from "Refactor the service manager to achieve 30x faster server initialization" to "feat: refactor the service manager to achieve 30x faster server initialization".

On the diff at lines 10-12 of the ruma imports (`events::{AnyGlobalAccountDataEventContent, AnyRoomAccountDataEventContent, GlobalAccountDataEventType, RoomAccountDataEventType,` → `RoomAccountDataEventType,`):

FWIW, I believe this change is covered by #1479, so you may want to rebase depending on the outcome of that one.
Title changed from "feat: refactor the service manager to achieve 30x faster server initialization" to "WIP: feat: refactor the service manager to achieve 30x faster server initialization".

Commits affecting the files below will be moved to other branches. It is advised to use the `git` CLI when reviewing my PRs, especially large ones.

@lveneris I'm not able to see your comment due to some mobile bug. But referring back to my email...
I did manual testing to verify approximately correct behavior.
The commit about the admin service was added (perhaps only on my nightly build branch?) as a direct result of unexpected behavior.
As far as I recall, router is king. He holds the database handles, which in turn hold handles to subprocesses calling out to the DB driver.
The code you are referencing occurs at the end, not the start. So I'm not understanding your specific concern: that the database must continue to function after closing the app? Or that this permanently corrupts the database?
The proper, safe way, according to a Rust expert from a bulletin board system, involves both traceability and parallelism refactors, and likely some changes in DB driver behavior.
All mine really achieves in its present form on this PR is the avoidance of deadlocks at shutdown time due to congestion along the dependency chain from various services still trying to operate.

It achieves consistently quick shutdowns, perhaps at the expense that, if your computer loses power immediately afterwards, you may need to rebuild from the WAL on next boot, which is painfully slow and mildly risky.
@gamesguru That wasn't a mobile bug - I reviewed your PR from mobile, didn't see the full filenames when I was looking over things because my phone is small, noticed my mistake within a minute of posting my review, and promptly retracted it. Not the highlight of my career, that's for sure. Apologies for the misunderstanding.
I will go over this on a more appropriate device at a later date. When do you plan to remove / relocate the changes you have deemed irrelevant to this PR?
@lveneris
Ok no worries man. I've done the same thing, but usually not leaving an email record of it 😅
You don't have to hang your head in shame; next time just edit the comment: "Edit: whoops, realized that was at the end, not the start! Carry on, I'll see if I notice anything else here."

We're all human. You don't know how many hours I spent staring at the screen for this one, how many builds I cut where I added log statements that didn't tell me what I hoped, or how many of those builds crapped out after 10 minutes from missed formatting changes or basic syntax errors. Any participation here is good, man. Thanks for commenting; it gave me a chance to explain a bit better something that probably left a lot of people unclear.
On the sender shutdown timeout default:

```diff
@@ -2806,2 +2805,3 @@
 fn default_client_shutdown_timeout() -> u64 { 10 }
-fn default_sender_shutdown_timeout() -> u64 { 5 }
+fn default_sender_shutdown_timeout() -> u64 { 3 }
```

There's really no point in waiting longer than 3 seconds after corking. I'm not sure that 5 seconds guarantees any safer shutdown.
Some people have slower or more overloaded devices.
My device is quite overloaded: 150,000 users and 1 GB of RAM. Yet waiting 1 second in my experiments changed nothing about the state of the RocksDB handlers compared to even 120 seconds.
Regardless, it's going into a separate PR for shutdown cleanup logic.
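A bounded shutdown wait like the timeout discussed above can be sketched with a condition variable: wait for the worker to signal that it has flushed, but never longer than the configured timeout, so shutdown cannot hang indefinitely. This is a std-only sketch with hypothetical names, not the server's actual shutdown code.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

// Wait for a worker to signal completion (flag set to true + notify),
// giving up after `timeout`. Returns true if the worker finished in time.
fn wait_for_flush(done: &Arc<(Mutex<bool>, Condvar)>, timeout: Duration) -> bool {
    let (lock, cvar) = &**done;
    let mut finished = lock.lock().unwrap();
    // Loop to tolerate spurious wakeups; break once the timeout fires.
    while !*finished {
        let (guard, result) = cvar.wait_timeout(finished, timeout).unwrap();
        finished = guard;
        if result.timed_out() {
            break;
        }
    }
    *finished
}
```

The design point is simply that the timeout bounds the wait; whether 3 or 5 seconds is "enough" is a policy question the wait mechanism doesn't decide.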
```diff
@@ -29,3 +29,3 @@
 opts.set_max_subcompactions(num_threads::<u32>(config)?);
 opts.set_avoid_unnecessary_blocking_io(true);
-opts.set_max_file_opening_threads(0);
+opts.set_max_file_opening_threads(num_threads::<i32>(config)?);
```

Supposedly this allows for faster ingestion of the WAL on startup, as well as runtime parallelism for read ops. I'm not sure what the default behavior of zero did.

This appears undocumented.

Internally, or at RocksDB? See, for example, their C# wrapper.

Whoever set it to zero, I'm not sure where they got the idea to use zero.
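For reference, a sketch of the tuning above using the rust-rocksdb `Options` API. The `configured_threads` parameter is a hypothetical stand-in for the server's `num_threads` helper, and clamping to at least 1 is an assumption, since (as noted above) the effect of 0 is unclear.

```rust
use rocksdb::Options;

// Sketch, not the actual Continuwuity code. `configured_threads` stands in
// for the server's configured worker-thread count.
fn tuned_options(configured_threads: i32) -> Options {
    let mut opts = Options::default();
    opts.set_avoid_unnecessary_blocking_io(true);
    // Open table files in parallel during DB::open / WAL ingestion.
    // Clamped to at least 1 (assumption) because the effect of 0 is unclear.
    opts.set_max_file_opening_threads(configured_threads.max(1));
    opts
}
```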
```diff
@@ -373,0 +374,4 @@
+    "Admin command handler is not yet loaded. The server may still be booting or \
+     the admin module failed to load.",
+));
+};
```

This was necessary, as interrupting the program during the initialization sequence led to some pretty horrible crashes.
```diff
@@ -61,0 +67,4 @@
+} else {
+    None
+};
```

Presence updates are directly related to this PR.
A dumb, full database scan was the primary culprit behind slow startups (see the comment below on the removed code block and its call chains).
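The PR description mentions running the presence reset on a background thread instead of inline during startup. A minimal std-only sketch of that hand-off (the function and thread name are hypothetical): the server keeps booting while the reset runs, and the handle can be joined later if the result matters.

```rust
use std::thread;

// Spawn the (expensive) presence reset off the startup path so the server
// can begin serving requests immediately. The closure returns how many
// presence entries it reset, purely for illustration.
fn spawn_presence_reset<F>(reset: F) -> thread::JoinHandle<usize>
where
    F: FnOnce() -> usize + Send + 'static,
{
    thread::Builder::new()
        .name("presence-reset".into())
        .spawn(reset)
        .expect("failed to spawn presence reset thread")
}
```

In the real server this would be an async task rather than an OS thread, but the structural point is the same: the reset no longer blocks initialization.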
```diff
@@ -185,4 +196,0 @@
-.list_local_users()
-.map(ToOwned::to_owned)
-.collect::<Vec<OwnedUserId>>()
-.await
```

I determined this was the ultimate source of the problem. It scanned every user, a notoriously slow raw query.
This only scans local users (`pub fn list_local_users(&self) -> impl Stream<Item = &UserId> + Send + '_`), so it is unlikely to be the culprit.
The issue is that the function call chain does not behave as the English names of the methods would suggest! It actually reads all users into memory (incredibly expensive and slow, basically a full raw query), and only then filters on local or not.
Let me ask you this, Jade: how often do you restart your server? Have you ever made an effort to run it from a debugger and pause it during the very obnoxiously slow startup?
Please give it a try if you're still skeptical about this PR. Then build this PR and try it out yourself, your jaw will absolutely drop at the startup improvements.
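The scan-then-filter claim above can be illustrated with an ordered map. This is a generic sketch (the `server/user` key layout is hypothetical, not the actual schema): filtering a full scan touches every key, while a range scan over the key prefix seeks straight to the matching rows and stops at their end.

```rust
use std::collections::BTreeMap;

// O(total users): read every key, then filter by prefix.
fn local_users_full_scan(db: &BTreeMap<String, ()>, server: &str) -> Vec<String> {
    db.keys()
        .filter(|k| k.starts_with(&format!("{server}/")))
        .cloned()
        .collect()
}

// O(local users): seek to the prefix and stop at its end.
fn local_users_prefix_range(db: &BTreeMap<String, ()>, server: &str) -> Vec<String> {
    let start = format!("{server}/");
    let end = format!("{server}0"); // '0' (0x30) sorts just after '/' (0x2F)
    db.range(start..end).map(|(k, _)| k.clone()).collect()
}
```

Both return the same rows; the difference is how much of the store they touch, which is exactly the distinction at issue in the startup-time argument.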
```diff
@@ -135,4 +131,0 @@
-.insert(Manager::new(self))
-.clone()
-.start()
-.await?;
```

This was similarly a bit naively implemented.

It's probably also related to the bug in the `--read-only` option to the CLI, which was ripped out of the code recently.

```diff
@@ -229,3 +229,3 @@
 if let Some(key_ids) = missing.get_mut(server) {
-    key_ids.retain(|key_id| key_exists(&server_keys, key_id));
+    key_ids.retain(|key_id| !key_exists(&server_keys, key_id));
```

This is a logic error fix and performance improvement I would like to get in ASAP.
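The flipped predicate matters because `missing` should keep only the key IDs still absent after a fetch; without the `!`, it retained exactly the keys that had already been found. A standalone illustration (simplified: a slice stands in for the fetched server keys):

```rust
// `missing` starts as everything we asked for; after a fetch, retain only
// what is *still* missing. The original bug retained what was found.
fn prune_found(missing: &mut Vec<&str>, fetched: &[&str]) {
    missing.retain(|key_id| !fetched.contains(key_id));
}
```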
To get a fix in sooner, open a PR that includes only that fix, with absolutely no other changes. Otherwise it will be delayed by the slowest thing in this PR.

This fix is not important to me, as it doesn't really speed things up as much as I hoped. It has anyway been pruned from here and merged separately...
What is important is getting my boot times down from half an hour to 10 seconds.
Please expedite the review process for this PR. I'm completely frustrated with the startup time of half an hour. It's absurd!
Title changed from "WIP: feat: refactor the service manager to achieve 30x faster server initialization" back to "feat: refactor the service manager to achieve 30x faster server initialization".

```diff
@@ -142,0 +136,4 @@
+let manager = {
+    let mut lock = self.manager.lock().await;
+    let manager = Manager::new(self);
+    _ = lock.insert(Arc::clone(&manager));
```

This is admittedly unnecessary, and part of the problem. Since we're doing it everywhere, it's hard to trace the DB handles or flush them properly at shutdown.

Think I'll revert this part.

I simplified this.
```diff
@@ -144,4 +148,0 @@
-_ = self
-    .presence
-    .ping_presence(&self.globals.server_user, &ruma::presence::PresenceState::Online)
-    .await;
```

However, this, again, was rather slow and needed help.

Title changed from "feat: refactor the service manager to achieve 30x faster server initialization" to "perf: refactor the service manager to achieve significantly faster server initialization".

I'm going to re-test this. Since there were significant changes requested, it's effectively untested code.
Luckily this is the one branch that starts up quickly and doesn't take half an hour, phew!
Title changed from "perf: refactor the service manager to achieve significantly faster server initialization" to "wip: perf: refactor the service manager to achieve significantly faster server initialization", and then back to "perf: refactor the service manager to achieve significantly faster server initialization".

This is the output from testing today. Notice it takes only 6 seconds to open the database, 2 more seconds to open the socket, and 4 more seconds to clear the presence updates.

Prior to this PR (before the threaded DB init and avoiding the full users-table scan), each of those steps took 2-10 minutes.
Not much of a flamegraph, but I can't replicate this on debug builds. Here are logs showing in excess of a minute (19:06:41 -> 19:07:46), which is already way longer than I've seen on my branches. It seems the issue gets worse the more I use the release binary, and tends to diminish as I run my allegedly healthier WAL'd, parallel, index-based, etc. branches.

flamegraph2.svg

Note: my main domain nutra tk only has ~80,000 linked users through rooms. The mdev nutra tk domain, which sadly is even more impossible to test due to being on a sync-tokenless v19 schema, has closer to ~150,000, and that is where the issue was absolutely more than twice as bad (i.e., a more-than-linear trend, imho).
Closed due to moderation action
Pull request closed