-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add redis based reliability reporting #778
base: master
Are you sure you want to change the base?
Conversation
This PR introduces tracking throughput for the database. It also introduces the PushReliability reporting skeleton. This will be fleshed out with full reporting later. Closes: #SYNC-4324
… feat/SYNC-4324_track_db
… feat/SYNC-4324_track_db
… feat/SYNC-4324_track_db
… feat/SYNC-4324_track_db
This adds a feature flag `reliable_report` that optionally enables Push message reliablity reporting. The report is done in two parts. The first part uses a Redis like storage system to note message states. This will require a regularly run "cleanup" script to sweep for expired messages and adust the current counts, as well log those states to some sequential logging friendly storage (e.g. common logging or steamed to a file). The clean-up script should be a singleton to prevent possible race conditions. The second component will write a record of the state transition times for tracked messages to a storage system that is indexed by the tracking_id. This will allow for more "in depth" analysis by external tooling. The idea being that reporting will be comprised of two parts: One part which shows active states of messages (with a log of prior states to show trends over time), and an optional "in-depth" record that could be used to show things like length of time in storage, overall success rates, survivability rates, etc. This patch also: * fixes a few typos * changes several methods that should consume Notifications, actually consume them. * convert from `tracking_id` to `reliability_id` * convert instance of specialized `Metrics` to generic Cadence (to make calls more consistent) * adds a `RELIABLE_REPORT` flag to testing. Closes: SYNC-4327
* alter `setup_bt` to include reliability family * alter config.yml for eventual integration test changes
… feat/SYNC-4327_redis
… feat/SYNC-4327_redis
5ff3bb9
to
3de403d
Compare
Well, that was less than entertaining, but at least educational. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The integration test changes LGTM!
… feat/SYNC-4327_redis
… feat/SYNC-4327_redis
… feat/SYNC-4327_redis
… feat/SYNC-4327_redis
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few nitpicks and cleanup to the recording code in this branch:
feat/SYNC-4327_redis...feat/SYNC-4327_redis-pjenvey
I'm still thinking about some of the state changes we discussed last week but that can happen in a later PR.
scripts/setup_bt.sh
Outdated
cbt -project $PROJECT -instance $INSTANCE setgcpolicy $TABLE_NAME $MESSAGE_FAMILY "maxage=1s or maxversions=1" | ||
cbt -project $PROJECT -instance $INSTANCE setgcpolicy $TABLE_NAME $MESSAGE_TOPIC_FAMILY "maxage=1s or maxversions=1" | ||
cbt -project $PROJECT -instance $INSTANCE setgcpolicy $TABLE_NAME $ROUTER_FAMILY maxversions=1 | ||
cbt -project $PROJECT -instance $INSTANCE setgcpolicy $TABLE_NAME $MESSAGE_TOPIC_FAMILY "maxage=60d or maxversions=1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cbt -project $PROJECT -instance $INSTANCE setgcpolicy $TABLE_NAME $MESSAGE_TOPIC_FAMILY "maxage=60d or maxversions=1" | |
cbt -project $PROJECT -instance $INSTANCE setgcpolicy $TABLE_NAME $RELIABILITY_FAMILY "maxage=60d or maxversions=1" |
scripts/setup_bt.sh
Outdated
cbt -project $PROJECT -instance $INSTANCE setgcpolicy $TABLE_NAME $MESSAGE_FAMILY "maxage=1s or maxversions=1" | ||
cbt -project $PROJECT -instance $INSTANCE setgcpolicy $TABLE_NAME $MESSAGE_TOPIC_FAMILY "maxage=1s or maxversions=1" | ||
cbt -project $PROJECT -instance $INSTANCE setgcpolicy $TABLE_NAME $ROUTER_FAMILY maxversions=1 | ||
cbt -project $PROJECT -instance $INSTANCE setgcpolicy $TABLE_NAME $MESSAGE_TOPIC_FAMILY "maxage=60d or maxversions=1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but log_report
calls the cell timestamp
an expiry
and also sets it to this 60d
from now -- so I don't understand. timestamp of 60d from now + 60d gcpolicy = 120d lifetime
let expiry = if subscription.reliability_id.is_some() { | ||
Some(timestamp + headers.ttl as u64) | ||
} else { | ||
None | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: would be worth if we're keeping this but I think we can kill it (my branch)
let expiry = if subscription.reliability_id.is_some() { | |
Some(timestamp + headers.ttl as u64) | |
} else { | |
None | |
}; | |
let expiry = subscription.reliability_id.map(|_| timestamp + headers.ttl as u64); |
/// The UTC expiration timestamp for this message | ||
pub expiry: Option<u64>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An Option
is confusing as all notifs have an expiry. I'll argue the pre-computed expiry
field isn't necessary, calculating it on the fly is cheap (killed this in my branch)
// Record that we've sent the message out to APNS. | ||
// We can't set the state here because the notification isn't | ||
// mutable, but we are also essentially consuming the | ||
// notification nothing else should modify it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be mutable now that Router::route_notification
was changed to own it: the webpush router mutates it -- but you have a point here, with the notification being consumed it doesn't need to be modified here or in fcm -- does it even need to be modified in the webpush router?
… feat/SYNC-4327_redis
…topush-rs into feat/SYNC-4327_redis
This adds a feature flag
reliable_report
that optionally enables Pushmessage reliablity reporting. The report is done in two parts.
The first part uses a Redis like storage system to note message states.
This will require a regularly run "cleanup" script to sweep for expired
messages and adust the current counts, as well log those states to some
sequential logging friendly storage (e.g. common logging or steamed to
a file). The clean-up script should be a singleton to prevent possible
race conditions.
The second component will write a record of the state transition
times for tracked messages to a storage system that is indexed by the
tracking_id. This will allow for more "in depth" analysis by external
tooling.
The idea being that reporting will be comprised of two parts:
One part which shows active states of messages (with a log of prior
states to show trends over time), and an optional "in-depth" record
that could be used to show things like length of time in storage,
overall success rates, survivability rates, etc.
This patch also:
tracking_id
toreliability_id
Metrics
to generic Cadence (to make calls more consistent)RELIABLE_REPORT
flag to testing.Closes: SYNC-4327