Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add redis based reliability reporting #778

Open
wants to merge 48 commits into
base: master
Choose a base branch
from

Conversation

jrconlin
Copy link
Member

@jrconlin jrconlin commented Oct 17, 2024

This adds a feature flag reliable_report that optionally enables Push
message reliablity reporting. The report is done in two parts.
The first part uses a Redis like storage system to note message states.
This will require a regularly run "cleanup" script to sweep for expired
messages and adust the current counts, as well log those states to some
sequential logging friendly storage (e.g. common logging or steamed to
a file). The clean-up script should be a singleton to prevent possible
race conditions.

The second component will write a record of the state transition
times for tracked messages to a storage system that is indexed by the
tracking_id. This will allow for more "in depth" analysis by external
tooling.

The idea being that reporting will be comprised of two parts:
One part which shows active states of messages (with a log of prior
states to show trends over time), and an optional "in-depth" record
that could be used to show things like length of time in storage,
overall success rates, survivability rates, etc.

This patch also:

  • fixes a few typos
  • changes several methods that should consume Notifications, actually consume them.
  • convert from tracking_id to reliability_id
  • convert instance of specialized Metrics to generic Cadence (to make calls more consistent)
  • adds a RELIABLE_REPORT flag to testing.

Closes: SYNC-4327

This PR introduces tracking throughput for the database.

It also introduces the PushReliability reporting skeleton. This will be
fleshed out with full reporting later.

Closes: #SYNC-4324
This adds a feature flag `reliable_report` that optionally enables Push
message reliablity reporting. The report is done in two parts.
The first part uses a Redis like storage system to note message states.
This will require a regularly run "cleanup" script to sweep for expired
messages and adust the current counts, as well log those states to some
sequential logging friendly storage (e.g. common logging or steamed to
a file). The clean-up script should be a singleton to prevent possible
race conditions.

The second component will write a record of the state transition
times for tracked messages to a storage system that is indexed by the
tracking_id. This will allow for more "in depth" analysis by external
tooling.

The idea being that reporting will be comprised of two parts:
One part which shows active states of messages (with a log of prior
states to show trends over time), and an optional "in-depth" record
that could be used to show things like length of time in storage,
overall success rates, survivability rates, etc.

This patch also:
* fixes a few typos
* changes several methods that should consume Notifications, actually
  consume them.
* convert from `tracking_id` to `reliability_id`
* convert instance of specialized `Metrics` to generic Cadence (to make
  calls more consistent)
* adds a `RELIABLE_REPORT` flag to testing.

Closes: SYNC-4327
* alter `setup_bt` to include reliability family
* alter config.yml for eventual integration test changes
@jrconlin jrconlin marked this pull request as ready for review October 18, 2024 17:34
@jrconlin jrconlin requested review from pjenvey and taddes October 18, 2024 17:34
@jrconlin jrconlin force-pushed the feat/SYNC-4327_redis branch from 5ff3bb9 to 3de403d Compare December 3, 2024 22:19
@jrconlin jrconlin requested a review from pjenvey December 3, 2024 22:34
@jrconlin
Copy link
Member Author

jrconlin commented Dec 3, 2024

Well, that was less than entertaining, but at least educational.

@jrconlin jrconlin requested a review from Trinaa December 4, 2024 19:07
Makefile Outdated Show resolved Hide resolved
@jrconlin jrconlin requested a review from Trinaa December 6, 2024 23:26
Copy link
Contributor

@Trinaa Trinaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The integration test changes LGTM!

Makefile Outdated Show resolved Hide resolved
@jrconlin jrconlin requested a review from Trinaa December 17, 2024 18:32
@jrconlin jrconlin dismissed pjenvey’s stale review December 17, 2024 18:34

Changes applied

Copy link
Member

@pjenvey pjenvey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few nitpicks and cleanup to the recording code in this branch:

feat/SYNC-4327_redis...feat/SYNC-4327_redis-pjenvey

I'm still thinking about some of the state changes we discussed last week but that can happen in a later PR.

cbt -project $PROJECT -instance $INSTANCE setgcpolicy $TABLE_NAME $MESSAGE_FAMILY "maxage=1s or maxversions=1"
cbt -project $PROJECT -instance $INSTANCE setgcpolicy $TABLE_NAME $MESSAGE_TOPIC_FAMILY "maxage=1s or maxversions=1"
cbt -project $PROJECT -instance $INSTANCE setgcpolicy $TABLE_NAME $ROUTER_FAMILY maxversions=1
cbt -project $PROJECT -instance $INSTANCE setgcpolicy $TABLE_NAME $MESSAGE_TOPIC_FAMILY "maxage=60d or maxversions=1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cbt -project $PROJECT -instance $INSTANCE setgcpolicy $TABLE_NAME $MESSAGE_TOPIC_FAMILY "maxage=60d or maxversions=1"
cbt -project $PROJECT -instance $INSTANCE setgcpolicy $TABLE_NAME $RELIABILITY_FAMILY "maxage=60d or maxversions=1"

cbt -project $PROJECT -instance $INSTANCE setgcpolicy $TABLE_NAME $MESSAGE_FAMILY "maxage=1s or maxversions=1"
cbt -project $PROJECT -instance $INSTANCE setgcpolicy $TABLE_NAME $MESSAGE_TOPIC_FAMILY "maxage=1s or maxversions=1"
cbt -project $PROJECT -instance $INSTANCE setgcpolicy $TABLE_NAME $ROUTER_FAMILY maxversions=1
cbt -project $PROJECT -instance $INSTANCE setgcpolicy $TABLE_NAME $MESSAGE_TOPIC_FAMILY "maxage=60d or maxversions=1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but log_report calls the cell timestamp an expiry and also sets it to this 60d from now -- so I don't understand. timestamp of 60d from now + 60d gcpolicy = 120d lifetime

Comment on lines 81 to 85
let expiry = if subscription.reliability_id.is_some() {
Some(timestamp + headers.ttl as u64)
} else {
None
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would be worth if we're keeping this but I think we can kill it (my branch)

Suggested change
let expiry = if subscription.reliability_id.is_some() {
Some(timestamp + headers.ttl as u64)
} else {
None
};
let expiry = subscription.reliability_id.map(|_| timestamp + headers.ttl as u64);

Comment on lines 33 to 34
/// The UTC expiration timestamp for this message
pub expiry: Option<u64>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An Option is confusing as all notifs have an expiry. I'll argue the pre-computed expiry field isn't necessary, calculating it on the fly is cheap (killed this in my branch)

Comment on lines +485 to +488
// Record that we've sent the message out to APNS.
// We can't set the state here because the notification isn't
// mutable, but we are also essentially consuming the
// notification nothing else should modify it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be mutable now that Router::route_notification was changed to own it: the webpush router mutates it -- but you have a point here, with the notification being consumed it doesn't need to be modified here or in fcm -- does it even need to be modified in the webpush router?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants