WIP: Improve federation backoff logic #1182

Draft
nex wants to merge 1 commit from nex/fix/backoff into main
Owner

This PR reworks part of the federation sender code to fix the issue where servers get banished to backoff purgatory for an undefined amount of time, which may also differ depending on which sender worker fails.
This will not fix issues like lookup errors poisoning the destinations cache; that is a separate issue.

Things to do:

  • Expose the transaction statuses to the entire sender service so that external callers can mutate them
  • Reset backoffs when clearing caches
  • Command to force reset specific server name's backoff & destination?
  • Reset server backoff when we receive a successful transaction from it (mirrors highly desired Synapse behaviour)
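For illustration only, the to-do items above could be sketched roughly as follows. This is a minimal, self-contained sketch and not the code in this PR: `BackoffRegistry`, `record_failure`, `reset`, `clear`, `may_send`, and the capped exponential delay (30 s base, 24 h cap) are all hypothetical names and choices. The real sender tracks `CurTransactionStatus` values and may compute delays differently.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical per-destination backoff state (not the PR's actual type).
#[derive(Debug)]
struct Backoff {
    failures: u32,
    last_failure: Instant,
}

impl Backoff {
    /// Assumed policy: BASE * 2^(failures - 1), capped at MAX so the wait is always bounded.
    fn delay(&self) -> Duration {
        let exp = self.failures.saturating_sub(1).min(16);
        BackoffRegistry::BASE
            .saturating_mul(1u32 << exp)
            .min(BackoffRegistry::MAX)
    }
}

/// Hypothetical shared registry, so callers outside a sender worker can reset backoffs.
#[derive(Default)]
struct BackoffRegistry {
    inner: Mutex<HashMap<String, Backoff>>,
}

impl BackoffRegistry {
    const BASE: Duration = Duration::from_secs(30);
    const MAX: Duration = Duration::from_secs(60 * 60 * 24);

    /// Record a failed send to `server`.
    fn record_failure(&self, server: &str) {
        let mut map = self.inner.lock().unwrap();
        let entry = map.entry(server.to_owned()).or_insert(Backoff {
            failures: 0,
            last_failure: Instant::now(),
        });
        entry.failures = entry.failures.saturating_add(1);
        entry.last_failure = Instant::now();
    }

    /// Reset one server: e.g. after a successful send, after receiving a
    /// transaction from it, or from an admin command. Returns whether an entry existed.
    fn reset(&self, server: &str) -> bool {
        self.inner.lock().unwrap().remove(server).is_some()
    }

    /// Reset every server, e.g. when caches are cleared.
    fn clear(&self) {
        self.inner.lock().unwrap().clear();
    }

    /// Current backoff delay for `server`, or `None` if it is not backing off.
    fn delay(&self, server: &str) -> Option<Duration> {
        self.inner.lock().unwrap().get(server).map(Backoff::delay)
    }

    /// Whether a send to `server` is allowed at time `now`.
    fn may_send(&self, server: &str, now: Instant) -> bool {
        match self.inner.lock().unwrap().get(server) {
            Some(b) => now >= b.last_failure + b.delay(),
            None => true,
        }
    }
}
```

Keeping the state behind one shared handle is what would let a cache-clear path or an admin command call `clear` or `reset` without going through a specific sender worker; whether that fits the per-worker design discussed in the review is left open here.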
nex added this to the 0.5.0 milestone 2025-11-19 19:15:10 +00:00
nex self-assigned this 2025-11-19 19:15:10 +00:00
feat(wip;sender): Centralise status registries
Some checks failed
Documentation / Build and Deploy Documentation (pull_request) Failing after 12s
Update flake hashes / update-flake-hashes (pull_request) Successful in 18s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 3m10s
Release Docker Image / Build linux-amd64 (release) (pull_request) Successful in 3m1s
Release Docker Image / Build linux-arm64 (release) (pull_request) Successful in 3m47s
Release Docker Image / Create Multi-arch Release Manifest (pull_request) Failing after 15s
Release Docker Image / Build linux-amd64 (max-perf) (pull_request) Successful in 2m36s
Release Docker Image / Build linux-arm64 (max-perf) (pull_request) Successful in 2m31s
Release Docker Image / Create Max-Perf Manifest (pull_request) Failing after 13s
Checks / Prek / Clippy and Cargo Tests (pull_request) Failing after 18m6s
97c13d8c77
Cargo.toml Outdated
@ -556,4 +556,1 @@
[workspace.dependencies.resolv-conf]
version = "0.7.5"
Author
Owner

where'd resolv-conf go wtf

@ -101,6 +103,7 @@ impl crate::Service for Service {
federation: args.depend::<federation::Service>("federation"),
},
channels: (0..num_senders).map(|_| loole::unbounded()).collect(),
statuses: vec![sender::CurTransactionStatus::new(); num_senders],
Owner

Is this still splitting into multiple statuses, a different one per worker?

Author
Owner

Yes, we need that for concurrent sending. The status registry tracks the state of a destination per worker. If one worker is actively sending and the registry is global, all other workers will refuse to send to that destination until that one completes.
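A rough sketch of the behaviour described above, using simplified stand-ins rather than the real types: `Status`, `Registry`, `try_begin`, and `reset_backoff` are hypothetical names, and the actual `CurTransactionStatus` carries more state than this.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a destination's transaction status.
#[derive(Clone, Debug, PartialEq)]
enum Status {
    Running,
    Failed,
}

/// One registry: destination server name -> status.
type Registry = HashMap<String, Status>;

/// Try to start sending to `dest`; refuse if this registry already has a send in flight.
fn try_begin(registry: &mut Registry, dest: &str) -> bool {
    if registry.get(dest) == Some(&Status::Running) {
        return false;
    }
    registry.insert(dest.to_owned(), Status::Running);
    true
}

/// Clear a failed status for `dest` in every worker's registry.
/// Returns how many registries had one.
fn reset_backoff(registries: &mut [Registry], dest: &str) -> usize {
    let mut cleared = 0;
    for registry in registries.iter_mut() {
        if registry.get(dest) == Some(&Status::Failed) {
            registry.remove(dest);
            cleared += 1;
        }
    }
    cleared
}
```

With `vec![Registry::new(); 2]`, `try_begin` succeeds once per worker for the same destination, whereas a single shared `Registry` refuses the second call. The flip side is that a failure recorded by one worker is invisible to the others, so any reset has to walk every registry, as `reset_backoff` does.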

ginger force-pushed nex/fix/backoff from 97c13d8c77
to ae595dd0d1
Some checks failed
Documentation / Build and Deploy Documentation (pull_request) Successful in 57s
Update flake hashes / update-flake-hashes (pull_request) Successful in 17s
Checks / Prek / Pre-commit & Formatting (pull_request) Successful in 1m31s
Checks / Prek / Clippy and Cargo Tests (pull_request) Failing after 10m15s
2025-11-21 17:11:15 +00:00
This pull request is marked as a work in progress.
This branch is out-of-date with the base branch
Reference
continuwuation/continuwuity!1182