feat(proof-data-handler): exclude batches without object file in GCS #2980

Open: wants to merge 1 commit into base: main


pbeza
Collaborator

@pbeza pbeza commented Sep 27, 2024

What ❔

The /tee/proof_inputs endpoint no longer returns batches whose object file has been missing from Google Cloud Storage for an extended period.

Why ❔

Since the recent 24.25.0 redeployment on mainnet, the TEE proof-data-handler has been flooding the logs with warnings (despite the wording, they are not actually fatal in this context):

Failed request with a fatal error

(...)

Blobs for batch numbers 490520 to 490555 not found in the object store. Marked as unpicked.

The issue is caused by the code behind the /tee/proof_inputs endpoint (which is equivalent to the /proof_generation_data endpoint) – it finds the next batch to send to the requesting tee-prover by looking for the first batch that has a corresponding object in the Google object store. As it skips over batches that don’t have the objects, it logs Failed request with a fatal error for each one (unless the skipped batch was successfully proven, in which case it doesn’t log the error). This happens with every request the tee-prover sends, which is why we're getting so much noise in the logs.
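
The scan described above can be sketched roughly as follows. This is a minimal illustration, not the handler's actual code: `ObjectStore`, `ScanResult`, and `pick_next_batch` are hypothetical names, and the object store is modeled as an in-memory set of batch numbers that have a blob.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the object store: the batch numbers that have a blob.
struct ObjectStore {
    present: BTreeSet<u32>,
}

/// Outcome of one scan: the batch handed to the prover (if any)
/// and the batches skipped on the way because their blob was missing.
#[derive(Debug, PartialEq)]
struct ScanResult {
    picked: Option<u32>,
    skipped: Vec<u32>,
}

/// Walks the candidates in order and returns the first batch that has a blob.
/// Each batch without a blob is recorded as skipped; in the scenario described
/// above, every such miss corresponds to one logged error.
fn pick_next_batch(candidates: &[u32], store: &ObjectStore) -> ScanResult {
    let mut skipped = Vec::new();
    for &batch in candidates {
        if store.present.contains(&batch) {
            return ScanResult { picked: Some(batch), skipped };
        }
        skipped.push(batch);
    }
    ScanResult { picked: None, skipped }
}
```

Because the skipped batches stay eligible, a caller that repeats this scan on every request walks over the same missing batches each time, which matches the repeated log noise described above.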

One possible solution is to flag the problematic batches as permanently_ignored, like Thomas did before on mainnet.

Checklist

  • PR title corresponds to the body of PR (we generate changelog entries from PRs).
  • Tests for the changes have been added / updated.
  • Documentation comments have been added / updated.
  • Code has been formatted via zk fmt and zk lint.

@pbeza pbeza force-pushed the tee/flag-old-batches-as-permanently-ignored-automatically branch 3 times, most recently from f1b8ad3 to 65cc26e Compare September 30, 2024 11:22
@pbeza pbeza marked this pull request as ready for review September 30, 2024 12:02
@pbeza
Collaborator Author

pbeza commented Sep 30, 2024

@popzxc, I remember you mentioned not to ask for code reviews this wave, but you're probably the most familiar with this code (along with @slowli). So, if you could make an exception this time, I’d really appreciate it. If you're busy, no worries – feel free to ignore, and I’ll ask @RomanBrodetski to find someone else. Thanks!

@pbeza
Collaborator Author

pbeza commented Oct 1, 2024

Kindly ping @slowli @RomanBrodetski. I need a reviewer.

skip(f), // output request and store as a part of structured logs
fields(retries) // Will be recorded before returning from the function
)]
async fn retry_optional<T, Fut, F>(
Contributor

AFAICT, this function is unused. It's not detected as such because of the instrument attribute.

Contributor

Dumb question: Why was this module moved to zksync_types? It has a lot of deps which are not usually necessary.

@@ -17,6 +19,17 @@ impl fmt::Display for TeeType {
}
}

/// Representation of a locked batch. Used in DAL to fetch details about the locked batch to
Contributor

If this belongs to the DAL domain, it makes sense to define this type in DAL. AFAICT, there are some TEE-related types defined there already (like TeeProofGenerationJobStatus).

};
let datetime_utc = Utc.from_utc_datetime(&locked_batch.created_at);
let duration = Utc::now().signed_duration_since(datetime_utc);
let status = if duration > ChronoDuration::days(10) {
Contributor

Nit: It may make sense to extract this value into a constant so that it's more visible.
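
One way the suggestion could look, as a sketch only: the constant name `MAX_MISSING_BLOB_AGE`, the `JobStatus` enum, and `status_for_missing_blob` are hypothetical, and `std::time::Duration` is used here instead of the `ChronoDuration` from the diff so the snippet needs no external crates. The 10-day value and the `PermanentlyIgnored` outcome come from the diff; falling back to `Unpicked` is an assumption based on the "Marked as unpicked" log message.

```rust
use std::time::Duration;

/// Hypothetical constant name; the 10-day threshold is the value from the diff.
const MAX_MISSING_BLOB_AGE: Duration = Duration::from_secs(10 * 24 * 60 * 60);

/// Simplified stand-in for the job status enum used in the DAL.
#[derive(Debug, PartialEq)]
enum JobStatus {
    Unpicked,
    PermanentlyIgnored,
}

/// Decides the status of a batch whose blob is missing, based on the batch's age.
/// Batches strictly older than the threshold are ignored for good;
/// younger ones are left unpicked (assumed) so they can be retried later.
fn status_for_missing_blob(age: Duration) -> JobStatus {
    if age > MAX_MISSING_BLOB_AGE {
        JobStatus::PermanentlyIgnored
    } else {
        JobStatus::Unpicked
    }
}
```

With the threshold named at module level, the cutoff is visible in one place and can be referenced from tests or log messages.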

let datetime_utc = Utc.from_utc_datetime(&locked_batch.created_at);
let duration = Utc::now().signed_duration_since(datetime_utc);
let status = if duration > ChronoDuration::days(10) {
TeeProofGenerationJobStatus::PermanentlyIgnored
Contributor

Nit: Maybe it makes sense to log this status assignment, so that it is easier to debug?

Some((start, _)) => Some((start, batch_number)),
None => Some((batch_number, batch_number)),
};
let datetime_utc = Utc.from_utc_datetime(&locked_batch.created_at);
Contributor

It may make sense to store the timestamp as DateTime<Utc> in LockedBatch.

Collaborator

@RomanBrodetski RomanBrodetski left a comment

@pbeza to be honest I don't fully follow this solution. I understand what we are trying to do (mark older unresolved jobs as skipped), but I'm not sure I understand the Why here. We can discuss over a huddle or async

@@ -0,0 +1 @@
UPDATE tee_proof_generation_details SET status = 'permanently_ignore' WHERE status = 'permanently_ignored';
Collaborator

I don't think we need any logic in the down migration. The current logic does not precisely roll it back either. For this migration, I'd just keep the down script empty.

@@ -17,6 +19,17 @@ impl fmt::Display for TeeType {
}
}

/// Representation of a locked batch. Used in DAL to fetch details about the locked batch to
/// determine whether it should be flagged as permanently ignored if it has no corresponding file in
/// the object store for an extended period.
Collaborator

thanks for adding a comment, but I still don't understand what a "locked" batch is. Please elaborate

@pbeza
Collaborator Author

pbeza commented Oct 8, 2024

JFYI: This PR is on hold because the code it is based on was recently radically redesigned/refactored here: #3017. This PR may be cherry-picked/revisited once #3017 is merged into main.

@pbeza pbeza force-pushed the tee/flag-old-batches-as-permanently-ignored-automatically branch 17 times, most recently from 4ee505b to bfeddc9 Compare October 31, 2024 18:29
The /tee/proof_inputs endpoint no longer returns batches whose object
file has been missing from Google Cloud Storage for an extended period.

Since the recent `24.25.0` redeployment on `mainnet`, we've been
[flooded with warnings][warnings] for the `proof-data-handler`
(the warnings are actually _not_ fatal in this context):

```
Failed request with a fatal error

(...)

Blobs for batch numbers 490520 to 490555 not found in the object store.
Marked as unpicked.
```

The issue was caused [by the code][code] behind the `/tee/proof_inputs`
[endpoint][endpoint_proof_inputs] (which is equivalent to the
`/proof_generation_data` [endpoint][endpoint_proof_generation_data]) –
it finds the next batch to send to the [requesting][requesting]
`tee-prover` by looking for the first batch that has a corresponding
object in the Google object store. As it skips over batches that don’t
have the objects, [it logs][logging] `Failed request with a fatal error`
for each one (unless the skipped batch was successfully proven, in which
case it doesn’t log the error). This happens with every
[request][request] the `tee-prover` sends, which is why we were getting
so much noise in the logs.

One possible solution was to manually flag the problematic batches as
`permanently_ignored`, like Thomas [did before][Thomas] on `mainnet`.
It was a quick and dirty workaround, but now we have a more automated
solution.

[warnings]: https://grafana.matterlabs.dev/goto/TjlaXQgHg?orgId=1
[code]: https://github.com/matter-labs/zksync-era/blob/3f406c7d0c0e76d798c2d838abde57ca692822c0/core/node/proof_data_handler/src/tee_request_processor.rs#L35-L79
[endpoint_proof_inputs]: https://github.com/matter-labs/zksync-era/blob/3f406c7d0c0e76d798c2d838abde57ca692822c0/core/node/proof_data_handler/src/lib.rs#L96
[endpoint_proof_generation_data]: https://github.com/matter-labs/zksync-era/blob/3f406c7d0c0e76d798c2d838abde57ca692822c0/core/node/proof_data_handler/src/lib.rs#L67
[requesting]: https://github.com/matter-labs/zksync-era/blob/3f406c7d0c0e76d798c2d838abde57ca692822c0/core/bin/zksync_tee_prover/src/tee_prover.rs#L93
[logging]: https://github.com/matter-labs/zksync-era/blob/3f406c7d0c0e76d798c2d838abde57ca692822c0/core/lib/object_store/src/retries.rs#L56
[Thomas]: https://matter-labs-workspace.slack.com/archives/C05ANUCGCKV/p1725284962312929
@pbeza pbeza force-pushed the tee/flag-old-batches-as-permanently-ignored-automatically branch from bfeddc9 to cf2cf1d Compare October 31, 2024 19:47