Zebra full sync is taking at least 50% longer over the last 3 weeks #7618
Comments
Here is a full sync on the v1.2.0 release tag: If it finishes within about 2 days and 5 hours, then there's a serious performance bug in the current code.
This was very slow and I cancelled it after 24 hours so we could run the other job. It hadn't verified any new blocks for 90 minutes. It's hard to conclude anything at this point, because testnet could be slow, or maybe our code could be slow.
This is running now; let's see if it's any faster. If it's faster, then the problem is probably in our code. If it's just as slow, then it's probably testnet that's slow.
This appears to be a testnet issue, so the testnet performance jobs are not going to give us any useful information:
https://github.com/ZcashFoundation/zebra/actions/runs/6305365448/job/17158137748#step:8:2273
The job is still running, but it looks like the performance of v1.2.0 is about the same as it was 3 weeks ago, so it doesn't seem to be a problem with mainnet. It seems to be a problem that was merged in the last 3 weeks, in either:
If we can find a short test that reproduces the problem, it will be much easier to track down. I'll also check the logs to see if they narrow it down.
Block timeouts seem to be much more frequent in the failing job logs:
https://github.com/ZcashFoundation/zebra/actions/runs/6274009969/job/17038728970#step:9:21125
The first block timeout is at around 53% of the way through the sync, so we might be able to use the checkpoint update sync job as a quick test for this issue:
I wasn't able to find an existing short test that narrows it down. The logs suggest that it's an issue in Zebra's production code, or possibly the Docker build. So we need to make a list of PRs that were merged after the v1.2.0 tag.
It finished in 2 days 5.5 hours. I'd expect it to be slightly slower than the last successful run.
I've pushed two branches to track the range of the performance issue. Here is the list of all the commits; please double-check I did it right. You'll need to click on "Load more commits" to see the commits after September 20.
Here are the possible PRs I found, in order from v1.2.0:
I tried to include every possible PR, even unlikely ones, because missing the bug would be worse than having extras. Please add any PRs that I've missed, and check off any PRs you think are likely to have caused issues. Also feel free to add more comments. Then we can focus on those PRs.
Another possible diagnostic method is performance analysis: we could compare flamegraphs from both versions. But we'd have to reproduce the issue on a smaller block range first. So we could try comparing the sync times of:
Then if there's a significant difference, we could do a flamegraph to see where the problem is.
There was a significant change in CI performance between:
So that's where I'd suggest we start looking. https://dashboard.mergify.com/github/ZcashFoundation/repo/zebra/stats?queue=batched
From ticket #7649:
Find the oldest zebrad tip cached disk, then open a PR that hard-codes that cached disk name just above this line: zebra/.github/workflows/deploy-gcp-tests.yml, line 359 (commit 04568d2).
Then, when CI runs on that branch, it will sync all those blocks (currently 30K). You can also manually run a CI workflow, but you probably just want to restart the update sync job. If you want to keep that disk, open a PR that skips deleting it:
@teor2345 It looks like it's gotten far slower near the beginning of the initial sync.

[Graphs omitted: sync rate vs. block height for the latest run (https://github.com/ZcashFoundation/zebra/actions/runs/6351284888) and for v1.2.0 (https://github.com/ZcashFoundation/zebra/actions/runs/6305243389/job/17119807358). X-axis is height, Y-axis is the sync rate produced by the script below.]

The script used to extract the sync rates from the CI logs:

```js
const fs = require("fs");

// Parse a zebrad log file into (timestamp, height) pairs taken from the
// "current_height" progress lines.
const parse_logs = async (file_name) =>
new Promise((resolve, reject) => {
fs.readFile(`./${file_name}`, "utf8", (err, data) => {
if (err) {
        return reject(err);
}
resolve(
data
.split("\n")
.map((line) => ({
timestamp: Date.parse(line.split(" ")[0]),
line,
}))
        .filter(
          // Keep only lines with a parseable timestamp that report the
          // current height.
          ({ timestamp, line }) =>
            !Number.isNaN(timestamp) && line.includes("current_height")
        )
.map(({ timestamp, line }) => ({
timestamp: timestamp,
height: (
line.split("current_height")[1].split("Height(")[1] || ""
).split(")")[0],
}))
);
});
});
(async () => {
const logs = await parse_logs("logs.txt");
// const logs = await parse_logs("logs-1.2.0.txt");
  // Compute a per-block sync rate (milliseconds per block) between each pair
  // of consecutive progress lines.
  let prev_timestamp = logs[0].timestamp;
  let prev_height = 0;
  const sync_rates = logs
.slice(1)
.map(({ timestamp, height }) => {
const time_diff = timestamp - prev_timestamp;
const height_diff = height - prev_height;
const sync_rate = time_diff / height_diff;
prev_timestamp = timestamp;
prev_height = height;
return {
height,
sync_rate,
timestamp,
time_diff,
height_diff,
};
})
.filter(
({ height, sync_rate, timestamp, time_diff, height_diff }) =>
time_diff > 0 &&
height > 0 &&
sync_rate > 0 &&
height_diff > 0
);
  // Print CSV rows of height, sync rate (milliseconds per block).
  for (const { height, sync_rate } of sync_rates) {
    console.log(`${height}, ${sync_rate}`);
  }
})();
```
That's interesting and not what I expected. But it's good news, because we only need to sync to 250,000 to see a significant difference. What's in the logs around the heights where the rate changes? Can you link to the logs you used for each graph?
It is slow at the start. It took 5 hours to get to 350_000 for the latest run.
It looks like PR #7590 introduces the bug we're dealing with. The slowness shows up within the first five minutes of syncing from scratch, and is introduced in 2dce686. At commit 2dce686 and four subsequent commits that I tried, Zebra syncs ~30_000 blocks in ~5 minutes. At commit 92d6da3, which directly precedes 2dce686, and at three earlier commits that I tried, Zebra syncs ~100_000 blocks in ~5 minutes.
From looking at the code, these might be the spots that cause the slowness:
Yep, we've had problems with database iterator performance before. I'm going to look up the rocksdb docs to see if there's any way to configure the database differently to avoid that (or code the iterator differently). PR #7531 also uses database iterators during an upgrade (but so did #7437), and #7602 runs some of that code every 5 minutes in CI, so we might need to fix that code as well. I wonder if it's the existence of an iterator that's the problem, or the number of iterators we're creating?
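As a concrete illustration of "coding the iterator differently", here's a minimal sketch using the plain `rocksdb` and `tempfile` crates (not Zebra's state code; the column family name and key layout are made up): if we know an upper bound on the live keys, giving the iterator explicit bounds via `ReadOptions` lets a reverse seek start at that bound instead of at the end of the column family, so it never touches tombstones above the bound. Whether that helps in our case depends on where the tombstones sit relative to the live keys.

```rust
use rocksdb::{DB, IteratorMode, Options, ReadOptions};

fn main() -> Result<(), rocksdb::Error> {
    let dir = tempfile::tempdir().expect("create temp dir");

    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.create_missing_column_families(true);

    // Hypothetical column family keyed by big-endian block height.
    let db = DB::open_cf(&opts, dir.path(), ["by_height"])?;
    let cf = db.cf_handle("by_height").expect("column family was opened");

    // Write heights 0..1000, then delete everything from 500 up, leaving
    // tombstones at the top of the key range.
    for height in 0u32..1_000 {
        db.put_cf(cf, height.to_be_bytes(), [0u8; 32])?;
    }
    for height in 500u32..1_000 {
        db.delete_cf(cf, height.to_be_bytes())?;
    }

    // Unbounded reverse seek: starts at the end of the column family and has
    // to step over every tombstone before it reaches the last live key (499).
    let last_unbounded = db
        .iterator_cf(cf, IteratorMode::End)
        .next()
        .transpose()?
        .map(|(key, _value)| key);

    // Bounded reverse seek: if we already know the live keys end below 500,
    // an exclusive upper bound lets RocksDB seek straight to that point and
    // skip the deleted range entirely.
    let mut read_opts = ReadOptions::default();
    read_opts.set_iterate_upper_bound(500u32.to_be_bytes().to_vec());
    let last_bounded = db
        .iterator_cf_opt(cf, read_opts, IteratorMode::End)
        .next()
        .transpose()?
        .map(|(key, _value)| key);

    assert_eq!(last_unbounded, last_bounded);
    println!("last live key: {last_bounded:?}");
    Ok(())
}
```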
This might be a tricky bug to diagnose, because we've been using the code at zebra/zebra-state/src/service/finalized_state/zebra_db/block.rs, lines 65 to 67 (commit 04568d2).
Edit: this is the only existing use of that code.
Here's one possible benchmark test for this code:
It's possible that the iterator only performs badly when we're creating, updating, or deleting column family entries. So we might need to do that in the test as well. I guess we'll find out when we write it. It might also help to call all the database read functions that use iterators, not just one of them.
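Not the actual test, but a rough sketch of the shape such a benchmark could take, using the plain `rocksdb` and `tempfile` crates rather than Zebra's state code (the key layout, value sizes, and counts are arbitrary). It writes a batch of entries, deletes most of them so the deletes pile up as tombstones, and times a "find the last key" reverse seek before the deletes, after them, and again after a manual compaction clears the tombstones:

```rust
use std::time::Instant;

use rocksdb::{DB, IteratorMode, Options, WriteBatch};

/// Reverse-seek to the last live key in the default column family.
/// With many tombstones at the top of the key range, RocksDB has to skip
/// every deleted entry before it finds a live one.
fn last_key(db: &DB) -> Option<Box<[u8]>> {
    db.iterator(IteratorMode::End)
        .next()
        .and_then(Result::ok)
        .map(|(key, _value)| key)
}

fn time_last_key(db: &DB, label: &str) {
    let start = Instant::now();
    let _ = last_key(db);
    println!("{label}: {:?}", start.elapsed());
}

fn main() -> Result<(), rocksdb::Error> {
    let dir = tempfile::tempdir().expect("create temp dir");
    let mut opts = Options::default();
    opts.create_if_missing(true);
    let db = DB::open(&opts, dir.path())?;

    // Write 1 million entries keyed by big-endian "height".
    let mut batch = WriteBatch::default();
    for height in 0u32..1_000_000 {
        batch.put(height.to_be_bytes(), [0u8; 32]);
    }
    db.write(batch)?;
    time_last_key(&db, "no tombstones   ");

    // Delete the top 900k entries, leaving a pile of tombstones right where
    // the reverse seek starts.
    let mut batch = WriteBatch::default();
    for height in 100_000u32..1_000_000 {
        batch.delete(height.to_be_bytes());
    }
    db.write(batch)?;
    time_last_key(&db, "900k tombstones ");

    // A full manual compaction drops the tombstones, so the same seek should
    // be fast again afterwards.
    db.flush()?;
    db.compact_range(None::<&[u8]>, None::<&[u8]>);
    time_last_key(&db, "after compaction");

    Ok(())
}
```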
It looks like this bug only happens when:
So most column families aren't impacted. This is consistent with what we're seeing in these graphs:
The performance gradually decreases as the number of tombstones increases; then the tombstones are periodically cleared and the performance improves again. Here's what I think we could do to fix it: PRs #7663 and #7392 fix this bug for the history and sprout trees.
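For illustration only, and not necessarily what the linked PRs do: one generic mitigation is to avoid open-ended iterator seeks when the key we want is already known (for example, the current tip height), because a point lookup with `get` doesn't have to step over tombstones the way an iterator seek does. A minimal sketch with the plain `rocksdb` and `tempfile` crates and a made-up key layout:

```rust
use rocksdb::{DB, Options};

/// Hypothetical helper: fetch the entry for a known height with a point
/// lookup rather than scanning for "the first/last live key" with an iterator.
fn entry_at_height(db: &DB, height: u32) -> Result<Option<Vec<u8>>, rocksdb::Error> {
    db.get(height.to_be_bytes())
}

fn main() -> Result<(), rocksdb::Error> {
    let dir = tempfile::tempdir().expect("create temp dir");
    let mut opts = Options::default();
    opts.create_if_missing(true);
    let db = DB::open(&opts, dir.path())?;

    // Simulate a column family where old entries are deleted as the tip moves
    // forward, leaving a long run of tombstones behind.
    for height in 0u32..10_000 {
        db.put(height.to_be_bytes(), height.to_be_bytes())?;
    }
    for height in 0u32..9_999 {
        db.delete(height.to_be_bytes())?;
    }

    // A forward iterator looking for the first live key would have to skip
    // all 9_999 tombstones. If the height we want is already tracked
    // elsewhere (e.g. the tip height), we can read it directly instead.
    let tip_entry = entry_at_height(&db, 9_999)?;
    println!("tip entry: {tip_entry:?}");
    Ok(())
}
```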
I did a local test of the fix in those PRs, and Zebra is up to 1.2 million blocks in under 5 hours. But I don't know how fast v1.2.0 was on my machine, so I'd appreciate someone else checking.
I can confirm PR #7663 brings the sync performance back to where it was at v1.2.0.
@teor2345 can I please get an estimate here?
Motivation
The scheduled full sync test timed out after 3 days. It usually takes just over 2 days.
The last time it ran successfully was 3 weeks ago.
There are two possible causes:
Logs
The failure is a timeout:
https://github.com/ZcashFoundation/zebra/actions/runs/6274009969/job/17038728970#step:9:21231
GitHub Actions workflow CI Docker #6194 failed.
Event: schedule
Branch: main
Commit: b737ccf5709b2905c7c53ce91eea17dd753a6c03
Created by jayqi/failed-build-issue-action