Zebra full sync is taking at least 50% longer over the last 3 weeks #7618
Comments
Here is a full sync on the v1.2.0 release tag: If it finishes within about 2 days and 5 hours, then there's a serious performance bug in the current code.
This was very slow and I cancelled it after 24 hours so we could run the other job. It hadn't verified any new blocks for 90 minutes. It's hard to conclude anything at this point, because testnet could be slow, or maybe our code could be slow.
This is running now; let's see if it's any faster. If it's faster, then the problem is probably in our code. If it's just as slow, then it's probably testnet that's slow.
This appears to be a testnet issue, so the testnet performance jobs are not going to give us any useful information:
https://github.com/ZcashFoundation/zebra/actions/runs/6305365448/job/17158137748#step:8:2273
The job is still running, but it looks like the performance of v1.2.0 is about the same as it was 3 weeks ago, so it doesn't seem to be a problem with mainnet. It seems to be a problem that was merged in the last 3 weeks, in either:
If we can find a short test that reproduces the problem, it will be much easier to track down. I'll also check the logs to see if they narrow it down.
Block timeouts seem to be much more frequent in the failing job logs:
https://github.com/ZcashFoundation/zebra/actions/runs/6274009969/job/17038728970#step:9:21125
The first block timeout is at around 53% of the way through the sync, so we might be able to use the checkpoint update sync job as a quick test for this issue:
I wasn't able to find an existing short test that narrows it down. The logs suggest that it's an issue in Zebra's production code, or possibly the Docker build. So we need to make a list of PRs that were merged after the v1.2.0 tag.
It finished in 2 days 5.5 hours. I'd expect it to be slightly slower than the last successful run.
I've pushed two branches to track the range of the performance issue. Here is the list of all the commits; please double-check I did it right. You'll need to click on "Load more commits" to see the commits after September 20.
Here are the possible PRs I found, in order from v1.2.0:
I tried to include every possible PR, even unlikely ones, because missing the bug would be worse than having extras. Please add any PRs that I've missed, and check off any PRs you think are likely to have caused issues. Also feel free to add more comments. Then we can focus on those PRs.
Another possible diagnostic method is performance analysis: we could compare flamegraphs from both versions. But we'd have to reproduce the issue on a smaller block range first. So we could try comparing the sync times of:
Then if there's a significant difference, we could do a flamegraph to see where the problem is.
There was a significant change in CI performance between:
So that's where I'd suggest we start looking. https://dashboard.mergify.com/github/ZcashFoundation/repo/zebra/stats?queue=batched
From ticket #7649:
Find the oldest zebrad tip cached disk, then open a PR that hard-codes that cached disk name just above this line: zebra/.github/workflows/deploy-gcp-tests.yml, line 359 (commit 04568d2).
Then, when CI runs on that branch, it will sync all those blocks (currently 30K). You can also manually run a CI workflow, but you probably just want to restart the update sync job. If you want to keep that disk, open a PR that skips deleting it:
@teor2345 It looks like it's gotten far slower near the beginning of the initial sync.

[Graphs omitted: sync rate vs. block height for the latest run (https://github.com/ZcashFoundation/zebra/actions/runs/6351284888) and for v1.2.0 (https://github.com/ZcashFoundation/zebra/actions/runs/6305243389/job/17119807358). X-axis is height, Y-axis is the sync rate produced by the script below.]

The script used to extract the sync rates from the CI logs:

```js
const fs = require("fs");

// Parse a zebrad log file into (timestamp, height) pairs taken from the
// "current_height" progress lines.
const parse_logs = async (file_name) =>
new Promise((resolve, reject) => {
fs.readFile(`./${file_name}`, "utf8", (err, data) => {
if (err) {
        return reject(err);
}
resolve(
data
.split("\n")
.map((line) => ({
timestamp: Date.parse(line.split(" ")[0]),
line,
}))
        .filter(
          // Keep only lines with a parseable timestamp that report the
          // current height.
          ({ timestamp, line }) =>
            !Number.isNaN(timestamp) && line.includes("current_height")
        )
.map(({ timestamp, line }) => ({
timestamp: timestamp,
height: (
line.split("current_height")[1].split("Height(")[1] || ""
).split(")")[0],
}))
);
});
});
(async () => {
const logs = await parse_logs("logs.txt");
// const logs = await parse_logs("logs-1.2.0.txt");
  // Compute a per-block sync rate (milliseconds per block) between each pair
  // of consecutive progress lines.
  let prev_timestamp = logs[0].timestamp;
  let prev_height = 0;
  const sync_rates = logs
.slice(1)
.map(({ timestamp, height }) => {
const time_diff = timestamp - prev_timestamp;
const height_diff = height - prev_height;
const sync_rate = time_diff / height_diff;
prev_timestamp = timestamp;
prev_height = height;
return {
height,
sync_rate,
timestamp,
time_diff,
height_diff,
};
})
.filter(
({ height, sync_rate, timestamp, time_diff, height_diff }) =>
time_diff > 0 &&
height > 0 &&
sync_rate > 0 &&
height_diff > 0
);
  // Print CSV rows of height, sync rate (milliseconds per block).
  for (const { height, sync_rate } of sync_rates) {
    console.log(`${height}, ${sync_rate}`);
  }
})();
```
That's interesting and not what I expected. But it's good news, because we only need to sync to 250,000 to see a significant difference. What's in the logs around the heights where the rate changes? Can you link to the logs you used for each graph?
It is slow at the start. It took 5 hours to get to 350_000 for the latest run.
It looks like PR #7590 introduces the bug we're dealing with. The slowness shows up within the first five minutes of syncing from scratch, and is introduced in 2dce686. At commit 2dce686 and four subsequent commits that I tried, Zebra syncs ~30_000 blocks in ~5 minutes. At commit 92d6da3, which directly precedes 2dce686, and at three earlier commits that I tried, Zebra syncs ~100_000 blocks in ~5 minutes.
From looking at the code, these might be the spots that cause the slowness:
Yep, we've had problems with database iterator performance before. I'm going to look up the rocksdb docs to see if there's any way to configure the database differently to avoid that (or code the iterator differently). PR #7531 also uses database iterators during an upgrade (but so did #7437), and #7602 runs some of that code every 5 minutes in CI, so we might need to fix that code as well. I wonder if it's the existence of an iterator that's the problem, or the number of iterators we're creating?
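As a concrete illustration of "coding the iterator differently", here's a minimal sketch using the plain `rocksdb` and `tempfile` crates (not Zebra's state code; the column family name and key layout are made up): if we know an upper bound on the live keys, giving the iterator explicit bounds via `ReadOptions` lets a reverse seek start at that bound instead of at the end of the column family, so it never touches tombstones above the bound. Whether that helps in our case depends on where the tombstones sit relative to the live keys.

```rust
use rocksdb::{DB, IteratorMode, Options, ReadOptions};

fn main() -> Result<(), rocksdb::Error> {
    let dir = tempfile::tempdir().expect("create temp dir");

    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.create_missing_column_families(true);

    // Hypothetical column family keyed by big-endian block height.
    let db = DB::open_cf(&opts, dir.path(), ["by_height"])?;
    let cf = db.cf_handle("by_height").expect("column family was opened");

    // Write heights 0..1000, then delete everything from 500 up, leaving
    // tombstones at the top of the key range.
    for height in 0u32..1_000 {
        db.put_cf(cf, height.to_be_bytes(), [0u8; 32])?;
    }
    for height in 500u32..1_000 {
        db.delete_cf(cf, height.to_be_bytes())?;
    }

    // Unbounded reverse seek: starts at the end of the column family and has
    // to step over every tombstone before it reaches the last live key (499).
    let last_unbounded = db
        .iterator_cf(cf, IteratorMode::End)
        .next()
        .transpose()?
        .map(|(key, _value)| key);

    // Bounded reverse seek: if we already know the live keys end below 500,
    // an exclusive upper bound lets RocksDB seek straight to that point and
    // skip the deleted range entirely.
    let mut read_opts = ReadOptions::default();
    read_opts.set_iterate_upper_bound(500u32.to_be_bytes().to_vec());
    let last_bounded = db
        .iterator_cf_opt(cf, read_opts, IteratorMode::End)
        .next()
        .transpose()?
        .map(|(key, _value)| key);

    assert_eq!(last_unbounded, last_bounded);
    println!("last live key: {last_bounded:?}");
    Ok(())
}
```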
This might be a tricky bug to diagnose, because we've been using the code at zebra/zebra-state/src/service/finalized_state/zebra_db/block.rs, lines 65 to 67 (commit 04568d2).
Edit: this is the only existing use of that code.
Here's one possible benchmark test for this code:
It's possible that the iterator only performs badly when we're creating, updating, or deleting column family entries. So we might need to do that in the test as well. I guess we'll find out when we write it. It might also help to call all the database read functions that use iterators, not just one of them.
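Not the actual test, but a rough sketch of the shape such a benchmark could take, using the plain `rocksdb` and `tempfile` crates rather than Zebra's state code (the key layout, value sizes, and counts are arbitrary). It writes a batch of entries, deletes most of them so the deletes pile up as tombstones, and times a "find the last key" reverse seek before the deletes, after them, and again after a manual compaction clears the tombstones:

```rust
use std::time::Instant;

use rocksdb::{DB, IteratorMode, Options, WriteBatch};

/// Reverse-seek to the last live key in the default column family.
/// With many tombstones at the top of the key range, RocksDB has to skip
/// every deleted entry before it finds a live one.
fn last_key(db: &DB) -> Option<Box<[u8]>> {
    db.iterator(IteratorMode::End)
        .next()
        .and_then(Result::ok)
        .map(|(key, _value)| key)
}

fn time_last_key(db: &DB, label: &str) {
    let start = Instant::now();
    let _ = last_key(db);
    println!("{label}: {:?}", start.elapsed());
}

fn main() -> Result<(), rocksdb::Error> {
    let dir = tempfile::tempdir().expect("create temp dir");
    let mut opts = Options::default();
    opts.create_if_missing(true);
    let db = DB::open(&opts, dir.path())?;

    // Write 1 million entries keyed by big-endian "height".
    let mut batch = WriteBatch::default();
    for height in 0u32..1_000_000 {
        batch.put(height.to_be_bytes(), [0u8; 32]);
    }
    db.write(batch)?;
    time_last_key(&db, "no tombstones   ");

    // Delete the top 900k entries, leaving a pile of tombstones right where
    // the reverse seek starts.
    let mut batch = WriteBatch::default();
    for height in 100_000u32..1_000_000 {
        batch.delete(height.to_be_bytes());
    }
    db.write(batch)?;
    time_last_key(&db, "900k tombstones ");

    // A full manual compaction drops the tombstones, so the same seek should
    // be fast again afterwards.
    db.flush()?;
    db.compact_range(None::<&[u8]>, None::<&[u8]>);
    time_last_key(&db, "after compaction");

    Ok(())
}
```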
It looks like this bug only happens when:
So most column families aren't impacted. This is consistent with what we're seeing in these graphs:
The performance gradually decreases as the number of tombstones increases; then the tombstones are periodically cleared and the performance improves again. Here's what I think we could do to fix it: PRs #7663 and #7392 fix this bug for the history and sprout trees.
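For illustration only, and not necessarily what the linked PRs do: one generic mitigation is to avoid open-ended iterator seeks when the key we want is already known (for example, the current tip height), because a point lookup with `get` doesn't have to step over tombstones the way an iterator seek does. A minimal sketch with the plain `rocksdb` and `tempfile` crates and a made-up key layout:

```rust
use rocksdb::{DB, Options};

/// Hypothetical helper: fetch the entry for a known height with a point
/// lookup rather than scanning for "the first/last live key" with an iterator.
fn entry_at_height(db: &DB, height: u32) -> Result<Option<Vec<u8>>, rocksdb::Error> {
    db.get(height.to_be_bytes())
}

fn main() -> Result<(), rocksdb::Error> {
    let dir = tempfile::tempdir().expect("create temp dir");
    let mut opts = Options::default();
    opts.create_if_missing(true);
    let db = DB::open(&opts, dir.path())?;

    // Simulate a column family where old entries are deleted as the tip moves
    // forward, leaving a long run of tombstones behind.
    for height in 0u32..10_000 {
        db.put(height.to_be_bytes(), height.to_be_bytes())?;
    }
    for height in 0u32..9_999 {
        db.delete(height.to_be_bytes())?;
    }

    // A forward iterator looking for the first live key would have to skip
    // all 9_999 tombstones. If the height we want is already tracked
    // elsewhere (e.g. the tip height), we can read it directly instead.
    let tip_entry = entry_at_height(&db, 9_999)?;
    println!("tip entry: {tip_entry:?}");
    Ok(())
}
```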
I did a local test of the fix in those PRs, and Zebra is up to 1.2 million blocks in under 5 hours. But I don't know how fast v1.2.0 was on my machine, so I'd appreciate someone else checking.
I can confirm PR #7663 brings the sync performance back to where it was at v1.2.0.
@teor2345 can I please get an estimate here?
Motivation
The scheduled full sync test timed out after 3 days. It usually takes just over 2 days.
The last time it ran successfully was 3 weeks ago.
There are two possible causes:
Logs
The failure is a timeout:
https://github.com/ZcashFoundation/zebra/actions/runs/6274009969/job/17038728970#step:9:21231
GitHub Actions workflow CI Docker #6194 failed.
Event: schedule
Branch: main
Commit: b737ccf5709b2905c7c53ce91eea17dd753a6c03
Created by jayqi/failed-build-issue-action