perf: Attempt to prevent people joining known busted rooms #1503
Open
nex wants to merge 3 commits from nex/feat/block-busted-rooms into main
7 participants
Reference
continuwuation/continuwuity!1503
People keep setting up a server and immediately trying to join rooms that have caused performance issues for even some of the beefiest servers in the network. This PR introduces a more drastic measure to prevent people footgunning: a list of room IDs is now hardcoded to be blocked, which prevents even admins from joining them unless a config option is enabled.
This is necessary since people keep trying to join, for example, the Matrix Community space, and being unable to do so, or being able to do so, but later having their machines absolutely crushed trying to resolve the room's state some time later (see: pretty much everyone on the maintainer team, federated.nexus, even some big name public deployments have recently started banning this room). Once joined, leaving itself is a difficult process, and simply participating in the room is enough to cause performance issues, which is terrible for anyone who is just getting started.
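The gate described above can be sketched roughly as follows. This is a minimal standalone illustration, not the PR's actual code: the helper name `join_blocked` and its plain `bool` parameter are assumptions, and the list reproduces only two of the room IDs quoted in the review hunks.

```rust
/// Room IDs refused by default. Only two entries from the PR's list are
/// reproduced here for illustration.
const BROKEN_ROOM_IDS: &[&str] = &[
    "!iMZEhwCvbfeAYUxAjZ:t2l.io",     // Matrix community space
    "!OGEhHVWSdvArJzumhm:matrix.org", // Old Matrix HQ
];

/// Hypothetical helper: returns true when a join should be refused.
/// `allow_joining_broken_rooms` mirrors the config option of the same name.
fn join_blocked(room_id: &str, allow_joining_broken_rooms: bool) -> bool {
    !allow_joining_broken_rooms && BROKEN_ROOM_IDS.contains(&room_id)
}
```

With the option left at its default of false, a listed room is refused for everyone, admins included. Setting it to true disables the check entirely.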
Pull request checklist:
… main branch, and the branch is named something other than main.
… myself, if applicable. This includes ensuring code compiles.
@ -79,0 +87,4 @@
if !services.config.allow_joining_broken_rooms
    && BROKEN_ROOM_IDS.contains(&room_id.as_str())
{
    return Err!(Request(Forbidden("This room is too complex.")));
we may want to add a new section to the FAQ and have this error message link to it
it could probably also do with being more specific, maybe something like "This room is known to (be broken / cause issues)."
i would hazard a guess that most users will not understand what "too complex" means for a room
"This room is too complex" is easy enough to find in documentation without overloading the actual error response with too much information. "This room is known to be broken" sounds scary and also isn't exactly true (these rooms aren't broken, just so complex that they're essentially broken) and "this room is known to cause issues" is too vague. "too complex" describes the problem precisely.
Perhaps "This room is known to cause issues due to being too complex"?
That is not a very fun compromise. I agree that complexity is a precise description, but it still sounds odd to unfamiliar ears. I would lean more towards "This room's history is too complex" to be clear about what exactly complexity entails in this instance.
@ -1522,0 +1534,4 @@
# forgo your right to complain about any slowdowns or inflated resource
# usage you encounter.
#
#allow_joining_broken_rooms = false
couldn't it be better if admins could tweak the list? removing or adding individual rooms? We just gave them the default list as guidance.
!admin rooms moderation ban-room exists for user-configurable room bans; this is merely meant to provide a "default" set that prevents new users joining rooms that will just destroy their server while they're none the wiser.
as in brick their server? or just require they evict a user in a race condition?
i agree there is overlap between the functions ban-room and the proposed room filter list. But unless there's a pattern of complete bricking, i'm against taking control away from admins completely. If they want to erase or comment out our recommendations, they will learn about race conditions and log spam.
I saw a similar thing indented deep in the code, a hard-coded rule. We should probably find out why these rooms break. Like if this room breaks because it's v5, we can adjust our interface in general for all room v5s. If it's just a bad room, we can add it to the proposed list. I am not sure we should be hard-coding edge cases throughout the production code, I would think it's better to leave them configurable.
perhaps we could even just combine the two lists before even feeding it to the server, and treat ban/disable as the same
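As a rough sketch of the "combine the two lists" idea, assuming a hypothetical `BlockedRooms` type (nothing in the PR defines one), the built-in defaults and the admin-managed bans could be merged into a single set that is consulted once per join:

```rust
use std::collections::HashSet;

/// Hypothetical merged view over both ban sources.
struct BlockedRooms {
    blocked: HashSet<String>,
}

impl BlockedRooms {
    /// Builds one set from the built-in defaults and the admin-managed bans.
    /// When `use_defaults` is false, only the admin bans apply.
    fn new(defaults: &[&str], admin_bans: &[&str], use_defaults: bool) -> Self {
        let mut blocked: HashSet<String> =
            admin_bans.iter().map(|id| (*id).to_owned()).collect();
        if use_defaults {
            blocked.extend(defaults.iter().map(|id| (*id).to_owned()));
        }
        Self { blocked }
    }

    /// True when the room appears in either source.
    fn is_blocked(&self, room_id: &str) -> bool {
        self.blocked.contains(room_id)
    }
}
```

One cost of this design is that a merged set no longer records which source a given entry came from, so the server could not tell a default block apart from a deliberate admin ban without extra bookkeeping.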
Pretty much.
I'm not sure what control this takes away? It simply prevents people joining a room without realising it will blow up their server. It can be turned on and off at will, and is independent from the runtime-configured bans, which are typically for a different purpose.
The problem is this is the current system and it is resulting in almost daily people coming into our main room and complaining of slow joins / slow server / high CPU & RAM usage / insane disk usage inflation. We clearly need something to prevent people who are uneducated on the rooms they're joining from blowing up their server and not knowing until it's too late.
It's because of state resets and/or insanely deep auth chains
We can't blanket affect like this - some v5 rooms (for example) are perfectly fine, whereas some are practically unusable. There's no one-size-fits-all :(
This is basically a last-resort. I don't want to do this either but I'm not seeing another option.
I still don't understand this - you can manually ban and unban rooms with the relevant room moderation commands, that is configuration. Why would you add to the hardcoded list if you can just... use the admin command for the same effect?
When I said this room breaks bc v5 i meant the quoted rust block giving a custom legacy route to the nheko room. IF ALL v5 rooms need that legacy route, handle it as such... don't hard-code one or two common or popular cases. Right? Does that make sense?
I hadn't considered the impact of having other servers with race conditions join ours. Well, that still feels like something we should be filtering on our end. And we can't control if it afflicts conduwuit, tuwunel, or synapse. So we need a solution from our side, imo.
I would rather see a redundant configuration value (as silly as it feels) than have hard-coded strings (cough tech debt) living throughout the code.
There could be tons of people already in a race condition and their admins just never check the logs.
Based on my federation metrics, very few people even adopt the latest version. I think we can do a much better job investigating this and come up with a long-term solution, sooner and before we will see significant return on these hard-coded rules.
The route you highlighted isn't a "legacy" route, that's just the unstable path for /_matrix/client/v1/room_summary/{roomId}, presumably just missing a route definition in ruwuma
I disagree with the notion of "tech debt" here - redundant configuration leaves room for error due to additional moving parts, and there's very little "tech debt" when this is basically just an append-only array of strings, which ideally won't even need updating all that often.
This is primarily targeting new deployments, not existing ones - and nobody in their right mind is deploying outdated versions. Also, 0.5.6 has been out less than a week. People often have auto-updates set up, which means there will probably be a delay in doing so.
I'll happily accept alternatives later down the line in another PR or whatever, but we need something right now, and the team lacks the time to investigate this more than "we know what the issue is". Given it's taking down even huge servers that aren't even continuwuity, I fear this may not be something we can "fix".
I see upwards of half of peers are not even on 0.5.5 yet.
I'm not explaining myself the best here, ugh. I'm going to log in my nightly account and explain some points in the morning hopefully.
I'm okay if this gets merged as is, but I would rather not have you depend on others submitting "PRs down the line." The original author really needs to be the one most committed to following up with a better implementation.
@ -61,0 +67,4 @@
"!MBrxZRUoApYYjmyion:t2bot.io", // Old t2bot room - insane auth chain depths
"izahlpcyIDeymNjiOd:matrix.debian.social", // #debian-next:matrix.debian.social
"!mefQhZzgTaxNCNzAeK:kde.org", // KDE user help
"!OTxETzuhBDbnPqBqbP:kde.org", // KDE space
oh hell yeah homie, i'm gonna join them all on my nightly account 🙌 LFG keep it maintained
Please avoid hard-coding configuration values in production rust code. Brainstorm an approach which uses dynamic configuration if possible.
also, please request a complement test report from me. my PR is not yet merged, but i can get you test results in 15-20 minutes. This looks relatively harmless from a regression standpoint, but i want to start testing our code more please
There's no reason to run complement against this
@ -61,0 +62,4 @@
"!iMZEhwCvbfeAYUxAjZ:t2l.io", // Matrix community space - insanely broken state
"!OGEhHVWSdvArJzumhm:matrix.org", // Old Matrix HQ - huge room, very broken
"!IemiTbwVankHTFiEoh:matrix.org", // Old Element Web - huge room, very broken
"!brXHJeAtqliwNGqHQx:lossy.network", // NixOS space - frequent bug reports, huge state
IDK, I haven't had issues with the NixOS space, and I'd prefer to be able to use it. IMO either only ban the really really bad ones or make the banned room list configurable. I'd prefer to not need to patch the code for my builds.
Update to my previous reply: nevermind, I am not exactly sure how or why this is the actual NixOS space, but I was wrong. I don't think this should be banned either.
Considering the issue is likely to persist, I think it's desirable to have the list being configurable instead of hardcoded.
It's hard to see an outcome that everyone is happy with. The point of the PR is to have hard-coded maintainer-curated rooms banned by default. We don't have issues with one of them, but nex wouldn't put it here for no reason, so I assume lots of people have had issues with it. The best solution to this I can see is to make the banned rooms accessible from configuration, not just admin commands, and set the list from this PR as its default.
Now, that could introduce confusing behaviour, since there will be two separate points of configuration via the configuration and admin commands. My suggestion for that would be to make the configuration something like bootstrap_banned_rooms, so the list pre-fills the banned rooms list the first time the server starts and it's admin commands from there onwards. Either that, or you could cut off the admin command access when the value is set in config, but I think the first option makes more sense.
At the end of the day, it's up to the maintainers.
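The bootstrap_banned_rooms suggestion could look roughly like the sketch below. Everything here is hypothetical: `BanState`, its `bootstrapped` flag, and the `unban` stand-in for the admin command are illustrative names, and a real server would persist this state in its database rather than in memory.

```rust
use std::collections::BTreeSet;

/// Hypothetical persisted ban state.
#[derive(Default)]
struct BanState {
    /// Set once the config defaults have been copied in.
    bootstrapped: bool,
    banned: BTreeSet<String>,
}

impl BanState {
    /// Seeds the ban list from config on first start only; later starts
    /// leave the list alone. Returns true when seeding happened.
    fn bootstrap(&mut self, bootstrap_banned_rooms: &[&str]) -> bool {
        if self.bootstrapped {
            return false;
        }
        self.banned
            .extend(bootstrap_banned_rooms.iter().map(|id| (*id).to_owned()));
        self.bootstrapped = true;
        true
    }

    /// Stand-in for the admin unban command. Returns true if the room was banned.
    fn unban(&mut self, room_id: &str) -> bool {
        self.banned.remove(room_id)
    }
}
```

Because seeding happens once, a room an admin has unbanned stays unbanned across restarts. The trade-off is that rooms added to the default list in a later release would not reach servers that have already bootstrapped.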
This all feels like over-complication and an introduction point for confusing behaviour and unhelpful error messages. If users want to take over this list, they can disable it and ban the rooms themselves.
Such an approach, I believe, wouldn't account for auto_deactivate_banned_room_attempts; the proposed mechanism is a much earlier nope-out that could've been used for "soft" banning complex rooms as well. In any case, I don't have a strong opinion about this PR from the server operator's point of view, I believe the maintainers could make the proper judgement.
I would like more details of performance issues attached in the future. Unless you offer a better impl, it's hard to judge where the code's old sticking points were.
Whether that's a heatmap SVG visual or just some explanation or adding me to a group chat.
But words like "insanely broken" are inherently vague. On my stateless sync branch, I was able to fully load the 28K members (very shocking the Cinny and Element UI can actually handle lists that long) by adjusting state resolution logic regarding outlier PDUs.
I am left questioning whether the state of the room, or our resolution logic is the more broken thing.
Regardless, I think I'm not alone in this fugue, and we would all benefit immensely from more diagnostic details better outlining the exact root cause of the performance degradations. That would better put us in position to understand the PR and what it means/does.
@gamesguru wrote in #1503 (comment):
I mean, I think this PR is ultimately helpful. It's known that these rooms cause a lot of issues: check your processing PDUs and you can often see PDUs from these rooms taking multiple minutes and using up a lot of CPU. I just would like the list to be configurable, maybe. But, I guess Jade is right, and an admin can deactivate this measure and ban the rooms manually.