Updating profile information with a large account effectively crashes the server #1205

Open
opened 2025-11-27 23:49:29 +00:00 by nex · 1 comment

The other day, I temporarily updated my global profile picture, which required sending a new membership event to over 800 rooms. For better or worse, continuwuity does not send these updates concurrently, meaning that in this case I had set off a chain reaction that would eventually result in my homeserver being DDoSed by remote servers. I ended up adding a sleep between each profile update in an attempt to combat this (nex/continuwuity@fb38be9c84, since commented out as I no longer need it).
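For context, a minimal sketch of the throttling workaround described above (this is not continuwuity's actual code; `send_profile_update` and the room IDs are illustrative):

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// Hypothetical stand-in for sending one membership update to a room;
// the real network call is elided in this sketch.
fn send_profile_update(room_id: &str) {
    let _ = room_id;
}

// Space the updates out so remote servers don't all react at once.
fn update_rooms_throttled(rooms: &[&str], delay: Duration) {
    for (i, room) in rooms.iter().enumerate() {
        send_profile_update(room);
        // Sleep between sends, but not after the last one.
        if i + 1 < rooms.len() {
            sleep(delay);
        }
    }
}

fn main() {
    let rooms = ["!a:example.org", "!b:example.org", "!c:example.org"];
    let start = Instant::now();
    update_rooms_throttled(&rooms, Duration::from_millis(10));
    // Two inter-send gaps of 10 ms each => at least 20 ms elapsed.
    assert!(start.elapsed() >= Duration::from_millis(20));
}
```

With 800 rooms and a 5-second gap this serializes the fan-out over roughly an hour, which trades latency for not having every remote server hit the homeserver at the same moment.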

However, my initial suspicion that it was the media fetching that was bringing my homeserver to a halt (I mean, I had 8000 incoming stalled connections all waiting on *something*, and media is generally larger than most API responses) turned out to be incorrect: using an MXC provided by another server *still* caused my homeserver to get absolutely hammered, bringing it down a third time, even with the 5-second sleep between each update.

After investigating my reverse proxy logs, I saw thousands of (as mentioned, concurrent) requests to [`/_matrix/federation/v1/state_ids/...`](https://spec.matrix.org/v1.16/server-server-api/#get_matrixfederationv1state_idsroomid) (fetching the state at the membership event I was sending), followed closely by a similar number of calls to [`/_matrix/federation/v1/get_missing_events/...`](https://spec.matrix.org/v1.16/server-server-api/#post_matrixfederationv1get_missing_eventsroomid) (fetching the events that are missing). After blocking the `state_ids` endpoint in my reverse proxy, a bunch of servers started rejecting my new membership event, whereas another set continued to just retry `get_missing_events` (presumably they already had the required state locally?). I blocked `get_missing_events` as well, and it looks like the origins started fetching each event individually instead, which was *much* easier for my server to process.
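For anyone wanting to reproduce the mitigation, a sketch of the kind of blocking described above, assuming nginx fronts the homeserver (the paths match the spec endpoints; the choice of `403` is arbitrary):

```nginx
# Temporarily refuse the expensive federation endpoints while the
# homeserver recovers; as observed above, remote servers then fall
# back to fetching each event individually.
location ~ ^/_matrix/federation/v1/state_ids/ {
    return 403;
}
location ~ ^/_matrix/federation/v1/get_missing_events/ {
    return 403;
}
```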

I think we should investigate the performance of `state_ids` and `get_missing_events`, and see whether we could benefit from adding a cache or two in there.
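Since every remote server was asking for the state at the *same* event, even a short-lived cache would collapse those thousands of identical requests into one computation. A minimal TTL-cache sketch under that assumption (the key scheme and types are illustrative, not continuwuity's actual internals):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Minimal TTL cache sketch for expensive federation responses such as
// /state_ids. The key could be e.g. "room_id|event_id"; the value is
// whatever serialized response the server would otherwise recompute.
struct TtlCache<V> {
    ttl: Duration,
    entries: HashMap<String, (Instant, V)>,
}

impl<V: Clone> TtlCache<V> {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    // Return a clone of the value if it is still fresh.
    fn get(&self, key: &str) -> Option<V> {
        self.entries.get(key).and_then(|(stored_at, value)| {
            if stored_at.elapsed() < self.ttl {
                Some(value.clone())
            } else {
                None
            }
        })
    }

    fn put(&mut self, key: String, value: V) {
        self.entries.insert(key, (Instant::now(), value));
    }
}

fn main() {
    let mut cache: TtlCache<Vec<String>> = TtlCache::new(Duration::from_secs(60));
    cache.put(
        "!room:example.org|$event".to_string(),
        vec!["$some_state_event_id".to_string()],
    );
    assert!(cache.get("!room:example.org|$event").is_some());
    assert!(cache.get("!other:example.org|$event").is_none());
}
```

A real implementation would also need eviction and invalidation when new state arrives, but for this traffic pattern (a burst of identical lookups right after one event is sent) even a 60-second TTL would absorb most of the load.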

Reference
continuwuation/continuwuity#1205