feat: Support processing concurrent background transactions #1428

Merged
nex merged 19 commits from nex/feat/better-inbound-txn-handle into main 2026-02-23 17:48:12 +00:00
Owner

This pull request fixes an issue where servers that send you transactions with events for large rooms may time out, causing them to back off on your server. This can result in things like missed encryption keys, sometimes for weeks on end. This also causes huge amounts of repeated work, which can mean your server may be burning through CPU and RAM for no real reason.
It fixes this by adding some new constraints:

  1. Servers can no longer send more than one transaction to us concurrently (healthy servers only send one transaction at a time)
  2. Incoming transactions are immediately cast to a background task which can be resumed by a later request should the first one time out
  3. Incoming transaction requests are now almost truly replay-safe, meaning the same transaction request being sent more than once will always* (*the cache is in-memory, so it is flushed on restart) return the same response
  4. There is now a customizable global limit on how many transactions your server will process concurrently before it starts rejecting new ones for being overloaded, which will massively alleviate the thundering herd effect when returning from an outage
  5. Incoming events are now properly sorted before being processed. This usually will have no effect, but will improve reliability if the sending server did not sort the events before sending them to you.

This is a huge performance and reliability improvement for servers that are in large or old rooms that may take a long time to process.

**Pull request checklist:**

- [x] This pull request targets the `main` branch, and the branch is named something other than `main`.
- [x] I have written an appropriate pull request title and my description is clear.
- [x] I understand I am responsible for the contents of this pull request.
- I have followed the [contributing guidelines][c1]:
  - [x] My contribution follows the [code style][c2], if applicable.
  - [x] I ran [pre-commit checks][c1pc] before opening/drafting this pull request.
  - [x] I have [tested my contribution][c1t] (or proof-read it for documentation-only changes) myself, if applicable. This includes ensuring code compiles.
  - [x] My commit messages follow the [commit message format][c1cm] and are descriptive.
- [x] I have written a [news fragment][n1] for this PR, if applicable.

[c1]: https://forgejo.ellis.link/continuwuation/continuwuity/src/branch/main/CONTRIBUTING.md
[c2]: https://forgejo.ellis.link/continuwuation/continuwuity/src/branch/main/docs/development/code_style.mdx
[c1pc]: https://forgejo.ellis.link/continuwuation/continuwuity/src/branch/main/CONTRIBUTING.md#pre-commit-checks
[c1t]: https://forgejo.ellis.link/continuwuation/continuwuity/src/branch/main/CONTRIBUTING.md#running-tests-locally
[c1cm]: https://forgejo.ellis.link/continuwuation/continuwuity/src/branch/main/CONTRIBUTING.md#commit-messages
[n1]: https://towncrier.readthedocs.io/en/stable/tutorial.html#creating-news-fragments
nex added this to the next milestone 2026-02-21 03:35:15 +00:00
nex self-assigned this 2026-02-21 03:35:15 +00:00
Adds two new in-memory maps to the service to prepare for better handlers
feat: Instrument process_inbound_transaction
Some checks failed
Update flake hashes / update-flake-hashes (pull_request) Successful in 1m11s
Documentation / Build and Deploy Documentation (pull_request) Successful in 2m5s
Checks / Prek / Clippy and Cargo Tests (pull_request) Has been cancelled
Checks / Prek / Pre-commit & Formatting (pull_request) Has been cancelled
791d2d7387
fix: Remove duplicate fields from logs
Some checks failed
Checks / Prek / Pre-commit & Formatting (pull_request) Has been cancelled
Checks / Prek / Clippy and Cargo Tests (pull_request) Has been cancelled
Documentation / Build and Deploy Documentation (pull_request) Has been cancelled
Update flake hashes / update-flake-hashes (pull_request) Successful in 1m10s
879da73d90
nex force-pushed nex/feat/better-inbound-txn-handle from 879da73d90
Some checks failed
Checks / Prek / Pre-commit & Formatting (pull_request) Has been cancelled
Checks / Prek / Clippy and Cargo Tests (pull_request) Has been cancelled
Documentation / Build and Deploy Documentation (pull_request) Has been cancelled
Update flake hashes / update-flake-hashes (pull_request) Successful in 1m10s
to 4c506df99f
Some checks failed
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m28s
Checks / Prek / Pre-commit & Formatting (pull_request) Failing after 6m9s
Checks / Prek / Clippy and Cargo Tests (pull_request) Failing after 15m37s
2026-02-21 03:39:06 +00:00
Compare
Author
Owner

Warning to prospective testers and anyone tempted to review this before I mark it as ready: while this does work (and it instantly shows massive improvements, I've got it deployed to my main server rn) the caches are unbounded in size and the channels are potentially leaky with a capacity to deadlock until I re-add proper error handling to the internal handle call. There should not be explosions but have a fire extinguisher on standby.
nex requested review from Owners 2026-02-21 03:41:28 +00:00
feat: Warn when server is overloaded
All checks were successful
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m26s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 2m6s
Checks / Prek / Clippy and Cargo Tests (pull_request) Successful in 27m36s
71a59af286
feat: Attempt to build localised DAG before processing PDUs
All checks were successful
Update flake hashes / update-flake-hashes (pull_request) Successful in 24s
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m51s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 3m34s
Checks / Prek / Clippy and Cargo Tests (pull_request) Successful in 18m2s
b8f22af642
nex changed title from WIP: feat: Support processing concurrent background transactions to feat: Support processing concurrent background transactions 2026-02-21 19:35:35 +00:00
@ -67,0 +72,4 @@
return Ok(response);
}
// Or are currently processing it
if let Some(receiver) = services.transaction_ids.get_active_federation_txn(&txn_key) {
Owner

There's technically a race condition here but I don't care that much
Owner

It seems to only result in an error rather than duplicate transactions.
Author
Owner

Yeah, that's something I accounted for. The chances of this race condition actually being triggered are slim to none, and because of the way `get_active_federation_txn` holds the lock anyway, it'd just result in the duplicate being rejected.
Jade marked this conversation as resolved
nex force-pushed nex/feat/better-inbound-txn-handle from b8f22af642
All checks were successful
Update flake hashes / update-flake-hashes (pull_request) Successful in 24s
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m51s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 3m34s
Checks / Prek / Clippy and Cargo Tests (pull_request) Successful in 18m2s
to 47fd9ea6ed
All checks were successful
Documentation / Build and Deploy Documentation (pull_request) Successful in 3m7s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 7m26s
Checks / Prek / Clippy and Cargo Tests (pull_request) Successful in 32m36s
2026-02-21 20:57:44 +00:00
Compare
fix: Clean up cache, prevent several race conditions
All checks were successful
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m15s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 2m1s
Checks / Prek / Clippy and Cargo Tests (pull_request) Successful in 18m27s
b11d1fdc82
We use one map which is only ever held for a short time.
refactor: Make federation transaction handling infallible
All checks were successful
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m27s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 2m13s
Checks / Prek / Clippy and Cargo Tests (pull_request) Successful in 29m23s
05f8536a18
The only things that could fail anyway were:
1. sorting (we can ignore that)
2. server shutdown (we can partially process
   the transaction and return errors for everything
   that's not processed)
@ -209,0 +301,4 @@
if let Err(e) = services.server.check_running() {
debug_warn!("Server shutting down, returning partial transaction results: {e}");
results.push((event_id, Err(e)));
results.extend(event_ids.map(|id| (id, Err(err!("Server is shutting down")))));
Author
Owner

We should not allow partial failures in this condition - servers don't use returned transaction errors for anything but debugging at the moment, so this will induce event loss. A 5XX error (semantically 503) should be used to get the sender to re-try the whole transaction - re-processing part of the transaction is better than losing part of it.
Owner

Ah, didn't know that.
Owner

Should be better now
Jade marked this conversation as resolved
@ -6,0 +27,4 @@
/// Minimum interval between cache cleanup runs.
/// Exists to prevent thrashing when the cache is full of things that can't be
/// cleared
const CLEANUP_INTERVAL_SECS: u64 = 30;
Author
Owner

60 seconds would make my brain happier but i don't think it matters that much
Owner

:3
Jade marked this conversation as resolved
Jade force-pushed nex/feat/better-inbound-txn-handle from 05f8536a18
All checks were successful
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m27s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 2m13s
Checks / Prek / Clippy and Cargo Tests (pull_request) Successful in 29m23s
to 914a8ab2eb
All checks were successful
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m50s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 2m16s
Checks / Prek / Clippy and Cargo Tests (pull_request) Successful in 22m39s
2026-02-22 01:21:13 +00:00
Compare
@ -131,0 +222,4 @@
/// Converts a TransactionError into an appropriate HTTP error response.
fn transaction_error_to_response(err: &TransactionError) -> Error {
match err {
| TransactionError::ShuttingDown => Error::Request(
Author
Owner

Isn't a match with one arm a bit redundant?
Owner

This is meant to be if we actually get any other error types. Not sure if that'll actually happen tho
nex marked this conversation as resolved
@ -6,0 +37,4 @@
impl fmt::Display for TransactionError {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
match self {
| Self::ShuttingDown => write!(f, "Server is shutting down"),
Author
Owner

again a single-arm match feels redundant
nex marked this conversation as resolved
nex force-pushed nex/feat/better-inbound-txn-handle from 914a8ab2eb
All checks were successful
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m50s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 2m16s
Checks / Prek / Clippy and Cargo Tests (pull_request) Successful in 22m39s
to 92351df925
Some checks failed
Documentation / Build and Deploy Documentation (pull_request) Successful in 2m50s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 4m59s
Checks / Prek / Clippy and Cargo Tests (pull_request) Has been cancelled
2026-02-23 16:36:51 +00:00
Compare
nex left a comment
Author
Owner

I am nex and I approve this pull request ✅
chore: Add news frag
All checks were successful
Documentation / Build and Deploy Documentation (pull_request) Successful in 2m9s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 2m44s
Checks / Prek / Clippy and Cargo Tests (pull_request) Successful in 24m11s
d4481b07ac
ginger requested changes 2026-02-23 16:55:38 +00:00
Dismissed
ginger left a comment
Owner

very solid PR in general :3 just a few nits. also consider renaming the service to just `transactions`
@ -89,0 +110,4 @@
async fn wait_for_result(
mut recv: Receiver<WrappedTransactionResponse>,
) -> Result<send_transaction_message::v1::Response> {
if tokio::time::timeout(Duration::from_secs(50), recv.changed())
Owner

why 50 seconds 🧌
Author
Owner

Synapse only waits for 60 seconds, elsewhere (and in my own patches) I use 55 seconds, but there needs to be a few seconds of leeway to *return* the response within the deadline. Since a timeout means the sender will just retry the now de-duplicated transaction immediately after, this doesn't really have any issues
Author
Owner

Or more concisely: waiting the exact amount of time we think the sender is also going to wait may result in them disconnecting before we can finish returning the response to them
Owner

sounds good 👍
ginger marked this conversation as resolved
@ -131,0 +216,4 @@
// Send the error to any waiters
sender
.send(Some(Err(err)))
.expect("couldn't send error to channel");
Owner

this probably shouldn't be an expect(), it could panic if (for example) there's only one waiter and it hits the timeout and stops listening. I think this should be a `let _ = `
Author
Owner

The channels are buffered aren't they? Not sure why sending to a sender with no receiver would cause a panic, but maybe I haven't read the docs enough
Owner
https://docs.rs/tokio/latest/tokio/sync/watch/struct.Sender.html#method.send:~:text=This%20method%20fails%20if%20the%20channel%20is%20closed%2C%20which%20is%20the%20case%20when%20every%20receiver%20has%20been%20dropped%2E
Author
Owner

yeah okay I just didn't read the docs for it then lmao
nex marked this conversation as resolved
@ -172,0 +276,4 @@
/// dependencies, however it is ultimately the sender's responsibility to send
/// them in a processable order, so this is just a best effort attempt. It does
/// not account for power levels or other tie breaks.
async fn build_local_dag(
Owner

this will panic if the `CanonicalJsonObject` is shaped weird, is that a good thing?
Author
Owner

don't send weird shaped objects
Owner

yes but what if someone does
Author
Owner

their transaction will fail
Author
Owner

![image](/attachments/2e3083e0-8d6c-4078-96b0-6625990c31ea)
nex marked this conversation as resolved
@ -54,0 +192,4 @@
let max_active_txns = self.services.config.max_concurrent_inbound_transactions;
// Check if we're at capacity
if state.len() >= max_active_txns
Owner

could a semaphore be used for this capacity logic?
Author
Owner

Wouldn't a semaphore imply waiting for a free slot rather than rejecting when there's no more free slots
Owner

shrug. I guess you know what you're doing
ginger marked this conversation as resolved
@ -54,0 +243,4 @@
.send(Some(Ok(response)))
.expect("couldn't send response to channel");
// explicitly close
Owner

inconsistent comment capitalization 🤨
nex marked this conversation as resolved
fix: Don't panic if nobody's listening
Some checks failed
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m38s
Checks / Prek / Pre-commit & Formatting (pull_request) Has been cancelled
Checks / Prek / Clippy and Cargo Tests (pull_request) Has been cancelled
8702f55cf5
chore: Fix incorrect capitalisation
Some checks failed
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m35s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 2m36s
Checks / Prek / Clippy and Cargo Tests (pull_request) Failing after 17m42s
d311b87579
I didn't realise I agreed to take an English class with @ginger while
working on this server lol
ginger approved these changes 2026-02-23 17:28:56 +00:00
chore: Refactor transaction_ids -> transactions
Some checks failed
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m32s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 2m24s
Documentation / Build and Deploy Documentation (push) Successful in 3m58s
Checks / Prek / Pre-commit & Formatting (push) Successful in 6m38s
Release Docker Image / Build linux-amd64 (release) (push) Has been cancelled
Release Docker Image / Build linux-arm64 (release) (push) Has been cancelled
Checks / Prek / Clippy and Cargo Tests (push) Has been cancelled
Release Docker Image / Create Multi-arch Release Manifest (push) Has been cancelled
Release Docker Image / Build linux-amd64 (max-perf) (push) Has been cancelled
Release Docker Image / Build linux-arm64 (max-perf) (push) Has been cancelled
Release Docker Image / Create Max-Perf Manifest (push) Has been cancelled
Checks / Prek / Clippy and Cargo Tests (pull_request) Successful in 41m27s
558262dd1f
nex merged commit 558262dd1f into main 2026-02-23 17:48:12 +00:00
nex deleted branch nex/feat/better-inbound-txn-handle 2026-02-23 17:48:12 +00:00
Reference: continuwuation/continuwuity!1428