feat: Support processing concurrent background transactions #1428

Merged
nex merged 19 commits from nex/feat/better-inbound-txn-handle into main 2026-02-23 17:48:12 +00:00
Owner

This pull request fixes an issue where servers that send you transactions with events for large rooms may time out, causing them to back off on your server. This can result in things like missed encryption keys, sometimes for weeks on end. This also causes huge amounts of repeated work, which can mean your server may be burning through CPU and RAM for no real reason.
It fixes this by adding some new constraints:

  1. Servers can no longer send more than one transaction to us concurrently (healthy servers only send one transaction at a time)
  2. Incoming transactions are immediately cast to a background task which can be resumed by a later request should the first one time out
  3. Incoming transaction requests are now almost truly replay-safe, meaning the same transaction request being sent more than once will always* (*the cache is in-memory, so it is flushed on restart) return the same response
  4. There is now a customizable global limit on how many transactions your server will process concurrently before it starts rejecting new ones for being overloaded, which will massively alleviate the thundering herd effect when returning from an outage
  5. Incoming events are now properly sorted before being processed. This usually will have no effect, but will improve reliability if the sending server did not sort the events before sending them to you.

This is a huge performance and reliability improvement for servers that are in large or old rooms that may take a long time to process.

**Pull request checklist:**

- [x] This pull request targets the `main` branch, and the branch is named something other than `main`.
- [x] I have written an appropriate pull request title and my description is clear.
- [x] I understand I am responsible for the contents of this pull request.
- I have followed the [contributing guidelines][c1]:
  - [x] My contribution follows the [code style][c2], if applicable.
  - [x] I ran [pre-commit checks][c1pc] before opening/drafting this pull request.
  - [x] I have [tested my contribution][c1t] (or proof-read it for documentation-only changes) myself, if applicable. This includes ensuring code compiles.
  - [x] My commit messages follow the [commit message format][c1cm] and are descriptive.
- [x] I have written a [news fragment][n1] for this PR, if applicable.

[c1]: https://forgejo.ellis.link/continuwuation/continuwuity/src/branch/main/CONTRIBUTING.md
[c2]: https://forgejo.ellis.link/continuwuation/continuwuity/src/branch/main/docs/development/code_style.mdx
[c1pc]: https://forgejo.ellis.link/continuwuation/continuwuity/src/branch/main/CONTRIBUTING.md#pre-commit-checks
[c1t]: https://forgejo.ellis.link/continuwuation/continuwuity/src/branch/main/CONTRIBUTING.md#running-tests-locally
[c1cm]: https://forgejo.ellis.link/continuwuation/continuwuity/src/branch/main/CONTRIBUTING.md#commit-messages
[n1]: https://towncrier.readthedocs.io/en/stable/tutorial.html#creating-news-fragments
nex added this to the next milestone 2026-02-21 03:35:15 +00:00
nex self-assigned this 2026-02-21 03:35:15 +00:00
Adds two new in-memory maps to the service to prepare for better handlers
feat: Instrument process_inbound_transaction
Some checks failed
Update flake hashes / update-flake-hashes (pull_request) Successful in 1m11s
Documentation / Build and Deploy Documentation (pull_request) Successful in 2m5s
Checks / Prek / Clippy and Cargo Tests (pull_request) Has been cancelled
Checks / Prek / Pre-commit & Formatting (pull_request) Has been cancelled
791d2d7387
fix: Remove duplicate fields from logs
Some checks failed
Checks / Prek / Pre-commit & Formatting (pull_request) Has been cancelled
Checks / Prek / Clippy and Cargo Tests (pull_request) Has been cancelled
Documentation / Build and Deploy Documentation (pull_request) Has been cancelled
Update flake hashes / update-flake-hashes (pull_request) Successful in 1m10s
879da73d90
nex force-pushed nex/feat/better-inbound-txn-handle from 879da73d90
Some checks failed
Checks / Prek / Pre-commit & Formatting (pull_request) Has been cancelled
Checks / Prek / Clippy and Cargo Tests (pull_request) Has been cancelled
Documentation / Build and Deploy Documentation (pull_request) Has been cancelled
Update flake hashes / update-flake-hashes (pull_request) Successful in 1m10s
to 4c506df99f
Some checks failed
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m28s
Checks / Prek / Pre-commit & Formatting (pull_request) Failing after 6m9s
Checks / Prek / Clippy and Cargo Tests (pull_request) Failing after 15m37s
2026-02-21 03:39:06 +00:00
Compare
Author
Owner

Warning to prospective testers and anyone tempted to review this before I mark it as ready: while this does work (and it instantly shows massive improvements, I've got it deployed to my main server rn) the caches are unbounded in size and the channels are potentially leaky with a capacity to deadlock until I re-add proper error handling to the internal handle call. There should not be explosions but have a fire extinguisher on standby.
nex requested review from Owners 2026-02-21 03:41:28 +00:00
feat: Warn when server is overloaded
All checks were successful
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m26s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 2m6s
Checks / Prek / Clippy and Cargo Tests (pull_request) Successful in 27m36s
71a59af286
feat: Attempt to build localised DAG before processing PDUs
All checks were successful
Update flake hashes / update-flake-hashes (pull_request) Successful in 24s
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m51s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 3m34s
Checks / Prek / Clippy and Cargo Tests (pull_request) Successful in 18m2s
b8f22af642
nex changed title from WIP: feat: Support processing concurrent background transactions to feat: Support processing concurrent background transactions 2026-02-21 19:35:35 +00:00
@ -67,0 +72,4 @@
return Ok(response);
}
// Or are currently processing it
if let Some(receiver) = services.transaction_ids.get_active_federation_txn(&txn_key) {
Owner

There's technically a race condition here but I don't care that much
Owner

It seems to only result in an error rather than duplicate transactions.
Author
Owner

Yeah, that's something I accounted for. The chances of this race condition actually being triggered are slim to none, and because of the way `get_active_federation_txn` holds the lock anyway, it'd just result in the duplicate being rejected.
Jade marked this conversation as resolved
nex force-pushed nex/feat/better-inbound-txn-handle from b8f22af642
All checks were successful
Update flake hashes / update-flake-hashes (pull_request) Successful in 24s
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m51s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 3m34s
Checks / Prek / Clippy and Cargo Tests (pull_request) Successful in 18m2s
to 47fd9ea6ed
All checks were successful
Documentation / Build and Deploy Documentation (pull_request) Successful in 3m7s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 7m26s
Checks / Prek / Clippy and Cargo Tests (pull_request) Successful in 32m36s
2026-02-21 20:57:44 +00:00
Compare
fix: Clean up cache, prevent several race conditions
All checks were successful
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m15s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 2m1s
Checks / Prek / Clippy and Cargo Tests (pull_request) Successful in 18m27s
b11d1fdc82
We use one map which is only ever held for a short time.
refactor: Make federation transaction handling infallible
All checks were successful
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m27s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 2m13s
Checks / Prek / Clippy and Cargo Tests (pull_request) Successful in 29m23s
05f8536a18
The only things that could fail anyway were:
1. sorting (we can ignore that)
2. server shutdown (we can partially process
   the transaction and return errors for everything
   that's not processed)
@ -209,0 +301,4 @@
if let Err(e) = services.server.check_running() {
debug_warn!("Server shutting down, returning partial transaction results: {e}");
results.push((event_id, Err(e)));
results.extend(event_ids.map(|id| (id, Err(err!("Server is shutting down")))));
Author
Owner

We should not allow partial failures in this condition - servers don't use returned transaction errors for anything but debugging at the moment, so this will induce event loss. A 5XX error (semantically 503) should be used to get the sender to re-try the whole transaction - re-processing part of the transaction is better than losing part of it.
Owner

Ah, didn't know that.
Owner

Should be better now
Jade marked this conversation as resolved
@ -6,0 +27,4 @@
/// Minimum interval between cache cleanup runs.
/// Exists to prevent thrashing when the cache is full of things that can't be
/// cleared
const CLEANUP_INTERVAL_SECS: u64 = 30;
Author
Owner

60 seconds would make my brain happier but i don't think it matters that much
Owner

:3
Jade marked this conversation as resolved
Jade force-pushed nex/feat/better-inbound-txn-handle from 05f8536a18
All checks were successful
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m27s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 2m13s
Checks / Prek / Clippy and Cargo Tests (pull_request) Successful in 29m23s
to 914a8ab2eb
All checks were successful
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m50s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 2m16s
Checks / Prek / Clippy and Cargo Tests (pull_request) Successful in 22m39s
2026-02-22 01:21:13 +00:00
Compare
@ -131,0 +222,4 @@
/// Converts a TransactionError into an appropriate HTTP error response.
fn transaction_error_to_response(err: &TransactionError) -> Error {
match err {
| TransactionError::ShuttingDown => Error::Request(
Author
Owner

Isn't a match with one arm a bit redundant?
Owner

This is meant to be if we actually get any other error types. Not sure if that'll actually happen tho
nex marked this conversation as resolved
@ -6,0 +37,4 @@
impl fmt::Display for TransactionError {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
match self {
| Self::ShuttingDown => write!(f, "Server is shutting down"),
Author
Owner

again a single-arm match feels redundant
nex marked this conversation as resolved
nex force-pushed nex/feat/better-inbound-txn-handle from 914a8ab2eb
All checks were successful
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m50s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 2m16s
Checks / Prek / Clippy and Cargo Tests (pull_request) Successful in 22m39s
to 92351df925
Some checks failed
Documentation / Build and Deploy Documentation (pull_request) Successful in 2m50s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 4m59s
Checks / Prek / Clippy and Cargo Tests (pull_request) Has been cancelled
2026-02-23 16:36:51 +00:00
Compare
nex left a comment
Author
Owner

I am nex and I approve this pull request ✅
chore: Add news frag
All checks were successful
Documentation / Build and Deploy Documentation (pull_request) Successful in 2m9s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 2m44s
Checks / Prek / Clippy and Cargo Tests (pull_request) Successful in 24m11s
d4481b07ac
ginger requested changes 2026-02-23 16:55:38 +00:00
Dismissed
ginger left a comment
Owner

very solid PR in general :3 just a few nits. also consider renaming the service to just `transactions`
@ -89,0 +110,4 @@
async fn wait_for_result(
mut recv: Receiver<WrappedTransactionResponse>,
) -> Result<send_transaction_message::v1::Response> {
if tokio::time::timeout(Duration::from_secs(50), recv.changed())
Owner

why 50 seconds 🧌
Author
Owner

Synapse only waits for 60 seconds, elsewhere (and in my own patches) I use 55 seconds, but there needs to be a few seconds of leeway to *return* the response within the deadline. Since a timeout means the sender will just retry the now de-duplicated transaction immediately after, this doesn't really have any issues
Author
Owner

Or more concisely: waiting the exact amount of time we think the sender is also going to wait may result in them disconnecting before we can finish returning the response to them
Owner

sounds good 👍
ginger marked this conversation as resolved
@ -131,0 +216,4 @@
// Send the error to any waiters
sender
.send(Some(Err(err)))
.expect("couldn't send error to channel");
Owner

this probably shouldn't be an expect(), it could panic if (for example) there's only one waiter and it hits the timeout and stops listening. I think this should be a `let _ = `
Author
Owner

The channels are buffered aren't they? Not sure why sending to a sender with no receiver would cause a panic, but maybe I haven't read the docs enough
Owner
https://docs.rs/tokio/latest/tokio/sync/watch/struct.Sender.html#method.send:~:text=This%20method%20fails%20if%20the%20channel%20is%20closed%2C%20which%20is%20the%20case%20when%20every%20receiver%20has%20been%20dropped%2E
Author
Owner

yeah okay I just didn't read the docs for it then lmao
nex marked this conversation as resolved
@ -172,0 +276,4 @@
/// dependencies, however it is ultimately the sender's responsibility to send
/// them in a processable order, so this is just a best effort attempt. It does
/// not account for power levels or other tie breaks.
async fn build_local_dag(
Owner

this will panic if the `CanonicalJsonObject` is shaped weird, is that a good thing?
Author
Owner

don't send weird shaped objects
Owner

yes but what if someone does
Author
Owner

their transaction will fail
Author
Owner

![image](/attachments/2e3083e0-8d6c-4078-96b0-6625990c31ea)
nex marked this conversation as resolved
@ -54,0 +192,4 @@
let max_active_txns = self.services.config.max_concurrent_inbound_transactions;
// Check if we're at capacity
if state.len() >= max_active_txns
Owner

could a semaphore be used for this capacity logic?
Author
Owner

Wouldn't a semaphore imply waiting for a free slot rather than rejecting when there's no more free slots
Owner

shrug. I guess you know what you're doing
ginger marked this conversation as resolved
@ -54,0 +243,4 @@
.send(Some(Ok(response)))
.expect("couldn't send response to channel");
// explicitly close
Owner

inconsistent comment capitalization 🤨
nex marked this conversation as resolved
fix: Don't panic if nobody's listening
Some checks failed
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m38s
Checks / Prek / Pre-commit & Formatting (pull_request) Has been cancelled
Checks / Prek / Clippy and Cargo Tests (pull_request) Has been cancelled
8702f55cf5
chore: Fix incorrect capitalisation
Some checks failed
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m35s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 2m36s
Checks / Prek / Clippy and Cargo Tests (pull_request) Failing after 17m42s
d311b87579
I didn't realise I agreed to take an English class with @ginger while
working on this server lol
ginger approved these changes 2026-02-23 17:28:56 +00:00
chore: Refactor transaction_ids -> transactions
Some checks failed
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m32s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 2m24s
Documentation / Build and Deploy Documentation (push) Successful in 3m58s
Checks / Prek / Pre-commit & Formatting (push) Successful in 6m38s
Release Docker Image / Build linux-amd64 (release) (push) Has been cancelled
Release Docker Image / Build linux-arm64 (release) (push) Has been cancelled
Checks / Prek / Clippy and Cargo Tests (push) Has been cancelled
Release Docker Image / Create Multi-arch Release Manifest (push) Has been cancelled
Release Docker Image / Build linux-amd64 (max-perf) (push) Has been cancelled
Release Docker Image / Build linux-arm64 (max-perf) (push) Has been cancelled
Release Docker Image / Create Max-Perf Manifest (push) Has been cancelled
Checks / Prek / Clippy and Cargo Tests (pull_request) Successful in 41m27s
558262dd1f
nex merged commit 558262dd1f into main 2026-02-23 17:48:12 +00:00
nex deleted branch nex/feat/better-inbound-txn-handle 2026-02-23 17:48:12 +00:00
Reference: continuwuation/continuwuity!1428