calling policy servers #857

Merged
nex merged 16 commits from nex/serving-policy into main 2025-07-23 23:27:11 +00:00
Owner

Allows us to call MSC4284 policy servers to determine if an event should be soft failed or not.

  • Call policy server on incoming event
  • Don't send redactions for spam events to clients
  • Prevent sending events that fail the policy checks
Allows us to call MSC4284 policy servers to determine if an event should be soft failed or not. - [X] Call policy server on incoming event - [x] Don't send redactions for spam events to clients - [x] Prevent sending events that fail the policy checks
nex self-assigned this 2025-06-20 00:55:45 +00:00
Author
Owner

There's some artifacts of some other branches in here, I'll want to filter those out before merge. Truly am a git mastermind

There's some artifacts of some other branches in here, I'll want to filter those out before merge. Truly am a git mastermind
Jade requested changes 2025-06-20 09:51:55 +00:00
Dismissed
Jade left a comment
Owner

It looks mostly good aside from what happened overnight in offtopic. If you want I can handle some of these changes for you.

It looks mostly good aside from what happened overnight in offtopic. If you want I can handle some of these changes for you.
@ -0,0 +59,4 @@
return Ok(());
},
};
if response.recommendation == "spam" {
Owner

Are there any other recommendations we should handle there? Or is it just spam so far?

Are there any other recommendations we should handle there? Or is it just spam so far?
Author
Owner

The MSC says spam and ok are the only two defined responses, however namespaced ones are possible. Synapse only actions if the recommendation is spam, so we do the same

The MSC says `spam` and `ok` are the only two defined responses, however namespaced ones are possible. Synapse only actions if the recommendation is `spam`, so we do the same
Jade marked this conversation as resolved
@ -45,3 +45,3 @@
}
debug!("Upgrading to timeline pdu");
debug!("Upgrading pdu {} from outlier to timeline pdu", incoming_pdu.event_id);
Owner

Use structured logging parameters at the start of the log macro rather than interpolating the dynamic value into the string.
Could probably add this as a code style guide.

Use structured logging parameters at the start of the log macro rather than interpolating the dynamic value into the string. Could probably add this as a code style guide.
Author
Owner

Could probably add this as a code style guide.

Would be a good idea given I've been doing this everywhere else too lol

> Could probably add this as a code style guide. Would be a good idea given I've been doing this everywhere else too lol
nex marked this conversation as resolved
@ -217,1 +219,4 @@
// 14-pre. If the event is not a state event, ask the policy server about it
if incoming_pdu.state_key.is_none()
&& incoming_pdu.sender().server_name() != self.services.globals.server_name()
Owner

Aren't we supposed to use the policy server to check our own events?

Aren't we supposed to use the policy server to check our own events?
Author
Owner

We aren't supposed to send federation requests to ourselves (or at least Synapse said they didn't do this, intentionally, but that was probably something to do with "unexpected behaviours")

We aren't supposed to send federation requests to ourselves (or at least Synapse said they didn't do this, intentionally, but that was probably something to do with "unexpected behaviours")
Owner

That's not what this is doing though, you're checking the sender of the event against our server name.
To avoid sending federation requests to ourselves, compare in src/service/rooms/event_handler/call_policyserv.rs:25 the via vs our server name.

That's not what this is doing though, you're checking the sender of the event against our server name. To avoid sending federation requests to ourselves, compare in src/service/rooms/event_handler/call_policyserv.rs:25 the via vs our server name.
Author
Owner

call_policyserv will be used to check ourselves later, so I can't put the check there. Upgrading an outlier PDU should only happen for remote events anyway afaik, so comparing the sender's homeserver to ours before calling the check is just a guard rail

call_policyserv will be used to check ourselves later, so I can't put the check there. Upgrading an outlier PDU should only happen for remote events anyway afaik, so comparing the sender's homeserver to ours before calling the check is just a guard rail
Owner

That's the function that has the federation request, so the guard should be in that function anyway.

call_policyserv will be used to check ourselves later, so I can't put the check there.

Are we misunderstanding each other?

  • Event sender
  • Policy server
  • Our server

Policy server != our server as self-federation isn't allowed
Current check is event sender != our server - but that check should never be false as upgrading an outlier PDU should only happen for remote events ??

That's the function that has the federation request, so the guard *should* be in that function anyway. > call_policyserv will be used to check ourselves later, so I can't put the check there. Are we misunderstanding each other? - Event sender - Policy server - Our server Policy server != our server as self-federation isn't allowed Current check is event sender != our server - but that check should never be false as upgrading an outlier PDU should only happen for remote events ??
Author
Owner

I haven't checked, but are you sure our own events can never be outliers? I'm inclined to keep the check just as a safety net since it doesn't exactly cost anything to compare, but if you're sure we've never got own-outliers I can remove it

I haven't checked, but are you sure our own events can never be outliers? I'm inclined to keep the check just as a safety net since it doesn't exactly cost anything to compare, but if you're sure we've never got own-outliers I can remove it
nex marked this conversation as resolved
@ -218,0 +222,4 @@
&& incoming_pdu.sender().server_name() != self.services.globals.server_name()
{
debug!("Checking policy server for event {}", incoming_pdu.event_id);
let policy = self.policyserv_check(&incoming_pdu, room_id);
Owner

We're missing a timeout here

We're missing a timeout here
nex marked this conversation as resolved
@ -218,0 +223,4 @@
{
debug!("Checking policy server for event {}", incoming_pdu.event_id);
let policy = self.policyserv_check(&incoming_pdu, room_id);
if let Err(e) = policy.await {
Owner

Can we send off the policy server check at an earlier point so that it's done in parallel to another check? Might require moving a part of this function into its own async function that can be async-joined

Can we send off the policy server check at an earlier point so that it's done in parallel to another check? Might require moving a part of this function into its own async function that can be async-joined
Author
Owner

not really sure if doing it in parallel is a good use of resources - if an event fails the auth checks because of its own auth events, there's no poing making an additional network request to see if it's spam, for example. It only needs to call the policy server if it'll potentially be sent to the client, so if it's not going to be sent to the client there's not much point checking.
Unless you think it's worth it anyway? I'm not actually sure how heavy the cost of checking is yet

not really sure if doing it in parallel is a good use of resources - if an event fails the auth checks because of its own auth events, there's no poing making an additional network request to see if it's spam, for example. It only needs to call the policy server if it'll potentially be sent to the client, so if it's not going to be sent to the client there's not much point checking. Unless you think it's worth it anyway? I'm not actually sure how heavy the cost of checking is yet
Owner

My intuition is that it is time-slow but resource-cheap, so we want to start it as early as possible. However rooting through the code here again, during normal operation the only thing in there that would be slow is state events, and this doesn't check those anyway, so ignore this for now

My intuition is that it is time-slow but resource-cheap, so we want to start it as early as possible. However rooting through the code here again, during normal operation the only thing in there that would be slow is state events, and this doesn't check those anyway, so ignore this for now
Author
Owner

State events should be checked eventually, but I want to battle test it on timeline events before I start soft-failing state events, since the latter is quite dangerous in comparison

State events *should* be checked eventually, but I want to battle test it on timeline events before I start soft-failing state events, since the latter is quite dangerous in comparison
nex marked this conversation as resolved
@ -701,0 +701,4 @@
if state_key.is_none() {
if prev_events.is_empty() {
warn!("Timeline event had zero prev_events, something broke.");
return Err!(Request(Unknown("Timeline event had zero prev_events.")));
Owner

Does this block belong in this PR? Should probably be tested separately

Does this block belong in this PR? Should probably be tested separately
Author
Owner

No, this is from the fix-create-auth branch, got caught up in the cherry pick

No, this is from the fix-create-auth branch, got caught up in the cherry pick
nex marked this conversation as resolved
Author
Owner

Note, the aforementioned offtopic explosion was a Meowlnir bug, not bug with this impl.

Note, the aforementioned offtopic explosion was a Meowlnir bug, not bug with this impl.
nex force-pushed nex/serving-policy from 67cfd5fd99
Some checks failed
Release Docker Image / define-variables (push) Successful in 3s
Rust Checks / Format (push) Successful in 1m4s
Rust Checks / Clippy (push) Failing after 2m30s
Documentation / Build and Deploy Documentation (pull_request) Successful in 37s
Release Docker Image / build-image (linux/arm64, linux-arm64) (push) Successful in 10m27s
Rust Checks / Cargo Test (push) Failing after 16m1s
Release Docker Image / build-image (linux/amd64, linux-amd64) (push) Successful in 10m52s
Release Docker Image / merge (push) Successful in 24s
to 5ec8e378c2
Some checks failed
Release Docker Image / define-variables (push) Successful in 2s
Checks / Prefligit / prefligit (push) Failing after 43s
Checks / Rust / Format (push) Successful in 40s
Documentation / Build and Deploy Documentation (pull_request) Successful in 41s
Checks / Prefligit / prefligit (pull_request) Failing after 26s
Checks / Rust / Clippy (push) Failing after 4m14s
Checks / Rust / Cargo Test (push) Failing after 4m39s
Release Docker Image / build-image (linux/amd64, release, linux-amd64, base) (push) Failing after 8m35s
Release Docker Image / build-image (linux/arm64, release, linux-arm64, base) (push) Failing after 8m44s
Release Docker Image / merge (push) Has been skipped
2025-06-30 16:17:06 +00:00
Compare
nex force-pushed nex/serving-policy from 5ec8e378c2
Some checks failed
Release Docker Image / define-variables (push) Successful in 2s
Checks / Prefligit / prefligit (push) Failing after 43s
Checks / Rust / Format (push) Successful in 40s
Documentation / Build and Deploy Documentation (pull_request) Successful in 41s
Checks / Prefligit / prefligit (pull_request) Failing after 26s
Checks / Rust / Clippy (push) Failing after 4m14s
Checks / Rust / Cargo Test (push) Failing after 4m39s
Release Docker Image / build-image (linux/amd64, release, linux-amd64, base) (push) Failing after 8m35s
Release Docker Image / build-image (linux/arm64, release, linux-arm64, base) (push) Failing after 8m44s
Release Docker Image / merge (push) Has been skipped
to ce94e241b3
Some checks failed
Release Docker Image / define-variables (push) Successful in 2s
Checks / Prefligit / prefligit (push) Successful in 31s
Checks / Rust / Format (push) Successful in 46s
Checks / Prefligit / prefligit (pull_request) Successful in 26s
Documentation / Build and Deploy Documentation (pull_request) Successful in 50s
Checks / Rust / Clippy (push) Failing after 2m49s
Checks / Rust / Cargo Test (push) Failing after 3m9s
Release Docker Image / build-image (linux/amd64, release, linux-amd64, base) (push) Failing after 9m33s
Release Docker Image / build-image (linux/arm64, release, linux-arm64, base) (push) Failing after 11m12s
Release Docker Image / merge (push) Has been skipped
2025-07-19 19:23:35 +00:00
Compare
nex force-pushed nex/serving-policy from cb4a7b5b2e
Some checks failed
Release Docker Image / merge (push) Blocked by required conditions
Release Docker Image / define-variables (push) Successful in 13s
Checks / Prefligit / prefligit (push) Successful in 15s
Checks / Prefligit / prefligit (pull_request) Successful in 34s
Documentation / Build and Deploy Documentation (pull_request) Successful in 42s
Checks / Rust / Format (push) Successful in 55s
Checks / Rust / Cargo Test (push) Has been cancelled
Release Docker Image / build-image (linux/arm64, release, linux-arm64, base) (push) Has been cancelled
Release Docker Image / build-image (linux/amd64, release, linux-amd64, base) (push) Has been cancelled
Checks / Rust / Clippy (push) Has been cancelled
to 977fddf4c5
Some checks failed
Release Docker Image / merge (push) Blocked by required conditions
Release Docker Image / define-variables (push) Successful in 5s
Checks / Prefligit / prefligit (push) Successful in 12s
Checks / Rust / Format (push) Successful in 36s
Checks / Prefligit / prefligit (pull_request) Successful in 29s
Documentation / Build and Deploy Documentation (pull_request) Successful in 44s
Checks / Rust / Clippy (push) Has been cancelled
Release Docker Image / build-image (linux/amd64, release, linux-amd64, base) (push) Has been cancelled
Release Docker Image / build-image (linux/arm64, release, linux-arm64, base) (push) Has been cancelled
Checks / Rust / Cargo Test (push) Has been cancelled
2025-07-19 22:52:10 +00:00
Compare
nex changed title from WIP: calling policy servers to calling policy servers 2025-07-19 22:54:34 +00:00
nex requested reviews from Jade, Owners 2025-07-19 22:54:36 +00:00
Jade approved these changes 2025-07-20 00:04:05 +00:00
Dismissed
@ -0,0 +10,4 @@
/// Returns Ok if the policy server allows the event
#[implement(super::Service)]
#[tracing::instrument(skip_all, level = "debug")]
pub async fn policyserv_check(&self, pdu: &PduEvent, room_id: &RoomId) -> Result {
Owner

I get the feeling that this should actually be a result(bool) - fail due to an error is different from rejection, and I don't think we're supposed to fail closed? For clarity at least.

I get the feeling that this should actually be a result(bool) - fail due to an error is different from rejection, and I don't think we're supposed to fail closed? For clarity at least.
Owner

And a comment more like

/// Checks with the room's policy server (if configured) whether the given event should
/// be allowed in the room.
///
/// Returns `Ok(())` if the event is permitted, or an error if the event is
/// denied as spam. Skips the check for policy server meta-events or if no
/// policy server is configured for the room.
And a comment more like ``` /// Checks with the room's policy server (if configured) whether the given event should /// be allowed in the room. /// /// Returns `Ok(())` if the event is permitted, or an error if the event is /// denied as spam. Skips the check for policy server meta-events or if no /// policy server is configured for the room. ```
Author
Owner

I don't think we're supposed to fail closed?

Ooh, good shout, it's supposed to be fail open. Oops.

> I don't think we're supposed to fail closed? Ooh, good shout, it's supposed to be fail open. Oops.
nex marked this conversation as resolved
@ -0,0 +25,4 @@
return Ok(());
};
let via = match policyserver.via {
Owner

via? This is not a via server for routing!

via? This is not a via server for routing!
Author
Owner

via is what it's called in the spec 🤷‍♀️

`via` is what it's called in the spec 🤷‍♀️
Owner

Spec being funny

Spec being funny
Jade marked this conversation as resolved
Jade force-pushed nex/serving-policy from 4f90379ac1
All checks were successful
Release Docker Image / define-variables (push) Successful in 2s
Checks / Prefligit / prefligit (push) Successful in 32s
Checks / Rust / Format (push) Successful in 43s
Checks / Prefligit / prefligit (pull_request) Successful in 28s
Documentation / Build and Deploy Documentation (pull_request) Successful in 47s
Checks / Rust / Clippy (push) Successful in 3m41s
Checks / Rust / Cargo Test (push) Successful in 4m21s
Release Docker Image / build-image (linux/amd64, release, linux-amd64, base) (push) Successful in 10m25s
Release Docker Image / build-image (linux/arm64, release, linux-arm64, base) (push) Successful in 20m16s
Release Docker Image / merge (push) Successful in 43s
to dd3ace92e9
All checks were successful
Checks / Prefligit / prefligit (push) Successful in 16s
Release Docker Image / define-variables (push) Successful in 4s
Checks / Prefligit / prefligit (pull_request) Successful in 33s
Documentation / Build and Deploy Documentation (pull_request) Successful in 1m6s
Release Docker Image / build-image (linux/amd64, release, linux-amd64, base) (push) Successful in 17m1s
Release Docker Image / build-image (linux/arm64, release, linux-arm64, base) (push) Successful in 17m41s
Release Docker Image / merge (push) Successful in 26s
Checks / Rust / Format (push) Successful in 48s
Checks / Rust / Clippy (push) Successful in 4m21s
Checks / Rust / Cargo Test (push) Successful in 4m52s
2025-07-20 20:01:47 +00:00
Compare
Author
Owner

TODO: customisable timeout for policy server call

TODO: customisable timeout for policy server call
nex force-pushed nex/serving-policy from 6ab209504d
Some checks failed
Release Docker Image / merge (push) Blocked by required conditions
Release Docker Image / define-variables (push) Successful in 3s
Checks / Prefligit / prefligit (push) Successful in 18s
Checks / Rust / Format (push) Successful in 41s
Documentation / Build and Deploy Documentation (pull_request) Successful in 46s
Checks / Prefligit / prefligit (pull_request) Successful in 19s
Release Docker Image / build-image (linux/arm64, release, linux-arm64, base) (push) Has been cancelled
Checks / Rust / Cargo Test (push) Has been cancelled
Release Docker Image / build-image (linux/amd64, release, linux-amd64, base) (push) Has been cancelled
Checks / Rust / Clippy (push) Has been cancelled
to f335f45017
Some checks failed
Release Docker Image / merge (push) Blocked by required conditions
Release Docker Image / define-variables (push) Successful in 2s
Checks / Prefligit / prefligit (push) Successful in 11s
Checks / Rust / Format (push) Successful in 39s
Checks / Prefligit / prefligit (pull_request) Successful in 15s
Documentation / Build and Deploy Documentation (pull_request) Successful in 40s
Checks / Rust / Clippy (push) Successful in 4m4s
Checks / Rust / Cargo Test (push) Successful in 5m0s
Release Docker Image / build-image (linux/arm64, release, linux-arm64, base) (push) Has been cancelled
Release Docker Image / build-image (linux/amd64, release, linux-amd64, base) (push) Has been cancelled
2025-07-23 16:49:14 +00:00
Compare
Author
Owner

I don't see anything left to do here. Requesting code review from @Jade and general feature review from testers @tcpipuk and @Aranjedeath

I don't see anything left to do here. Requesting code review from @Jade and general feature review from testers @tcpipuk and @Aranjedeath
@ -0,0 +12,4 @@
events::{StateEventType, room::policy::RoomPolicyEventContent},
};
/// Returns Ok if the policy server allows the event
Owner

outdated comment (see prev review on this line too)

outdated comment (see prev review on this line too)
Author
Owner

guh thought I caught that one

guh thought I caught that one
nex marked this conversation as resolved
@ -0,0 +104,4 @@
room_id = %room_id,
"Event was marked as spam by policy server",
);
return Err!(Request(Forbidden("Event was marked as spam by policy server")));
Owner

Ok(false)?? Right?

`Ok(false)`?? Right?
Author
Owner

I actually prefer opening +300-30 no-op feature PRs

I actually prefer opening +300-30 no-op feature PRs
nex marked this conversation as resolved
fix(policy-server): Return the correct result when an event is marked as spam
All checks were successful
Documentation / Build and Deploy Documentation (pull_request) Successful in 35s
Checks / Prefligit / prefligit (pull_request) Successful in 27s
Release Docker Image / define-variables (push) Successful in 7s
Checks / Prefligit / prefligit (push) Successful in 25s
Documentation / Build and Deploy Documentation (push) Successful in 31s
Checks / Rust / Format (push) Successful in 40s
Checks / Rust / Clippy (push) Successful in 4m57s
Checks / Rust / Cargo Test (push) Successful in 5m30s
Release Docker Image / build-image (linux/arm64, release, linux-arm64, base) (push) Successful in 12m49s
Release Docker Image / build-image (linux/amd64, release, linux-amd64, base) (push) Successful in 13m46s
Release Docker Image / merge (push) Successful in 25s
f32f60d056
@ -343,0 +344,4 @@
# servers should respond near instantly, however may slow down under
# load. If a policy server doesn't respond in a short amount of time, the
# room it is configured in may become unusable if this limit is set too
# high. 10 seconds is a good default, however dropping this to 3-5 seconds
Author
Owner

Does suggesting lower values make this sound more complicated than it needs to be?

Does suggesting lower values make this sound more complicated than it needs to be?
Owner

I would only provide tuning advice here if different values have been tested and we have something to recommend. Everybody else will take the default.

I would only provide tuning advice here if different values have been tested and we have something to recommend. Everybody else will take the default.
Author
Owner

Perhaps splitting stuff like this off into a dedicated tuning guide in the docs is more appropriate then, since it'll be easier to waffle about the pros and cons of stuff there rather than in docstrings

Perhaps splitting stuff like this off into a dedicated tuning guide in the docs is more appropriate then, since it'll be easier to waffle about the pros and cons of stuff there rather than in docstrings
Jade approved these changes 2025-07-23 17:44:41 +00:00
nex merged commit f32f60d056 into main 2025-07-23 23:27:11 +00:00
nex deleted branch nex/serving-policy 2025-07-23 23:27:11 +00:00
Jade added this to the v0.5.0-rc.7 milestone 2025-07-25 12:44:30 +00:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
3 participants
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
continuwuation/continuwuity!857
No description provided.