precomputes Turbine trees speculatively before shreds arrive #5014

Merged
behzadnouri merged 3 commits into anza-xyz:master from addr-cache-full on Mar 4, 2025

Conversation

@behzadnouri behzadnouri commented Feb 16, 2025

Problem

Faster turbine tree calculation.

Summary of Changes

The commit maintains the number of shreds observed in each slot within a rolling window.
If the channel is empty and there are no shreds to retransmit, it uses the idle cycles to speculatively precompute the turbine tree for the slots that have received the most shreds within the rolling window.
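
To make the mechanism concrete, here is a minimal sketch of the rolling-window bookkeeping, with hypothetical names (`ShredCounts`, `hottest_slot`); the PR's actual implementation is more involved:

```rust
use std::collections::HashMap;

/// Hypothetical sketch: count shreds per slot over a rolling window and,
/// when the retransmit channel is idle, pick the busiest slot as the
/// candidate for speculative turbine-tree precomputation.
#[derive(Default)]
struct ShredCounts {
    counts: HashMap<u64, usize>, // slot -> shreds observed in the window
}

impl ShredCounts {
    fn record(&mut self, slot: u64) {
        *self.counts.entry(slot).or_default() += 1;
    }

    /// Slot with the most shreds observed in the window.
    fn hottest_slot(&self) -> Option<u64> {
        self.counts
            .iter()
            .max_by_key(|&(_, &count)| count)
            .map(|(&slot, _)| slot)
    }
}

fn main() {
    let mut window = ShredCounts::default();
    for slot in [7, 7, 8, 7] {
        window.record(slot);
    }
    // When the channel is empty, idle cycles would go toward precomputing
    // the turbine tree for this slot.
    assert_eq!(window.hottest_slot(), Some(7));
}
```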

@behzadnouri behzadnouri changed the title precomputes retransmit addresses speculatively before shreds arrive precomputes turbine tree speculatively before shreds arrive Feb 16, 2025
@behzadnouri (Author)

`retransmit-stage::{compute_turbine,total_time}` per shred.

mainnet node. Left side is this code, right is master.
[chart: retransmit_stage_mainnet_250217_1125]

staked testnet node. Left side is this code, right is master.
[chart: retransmit_stage_testnet_250217_1126]

@behzadnouri behzadnouri force-pushed the addr-cache-full branch 4 times, most recently from 46833b2 to 25261d1 on February 17, 2025 at 20:56
@alexpyattaev

What happens with CPU usage in the solRetransmit pool?

@behzadnouri (Author)

> What happens with CPU usage in the solRetransmit pool?

Roughly the same number of turbine trees are calculated either way,
so nearly all the CPU spent on speculative pre-calculation is saved when there is a cache hit.

@behzadnouri behzadnouri changed the title precomputes turbine tree speculatively before shreds arrive precomputes Turbine trees speculatively before shreds arrive Feb 17, 2025
@behzadnouri behzadnouri force-pushed the addr-cache-full branch 2 times, most recently from 00203d9 to 888178e on February 18, 2025 at 01:37
@behzadnouri behzadnouri force-pushed the addr-cache-full branch 2 times, most recently from af3a2e3 to d60edaa on February 27, 2025 at 16:15
break;
};
// Leave some capacity for the 2nd most frequent slot.
let count = count.min(num_shreds * 3 / 4);

Curious how you arrived at this. When testing, did you see that we wasted time speculating trees on a big block where it was not necessary?

@behzadnouri (Author)

Just some simple heuristics to start with.
The idea is that you usually want to allocate most of your resources to the slot that is receiving the most shreds within the rolling window. But at the slot boundary you may get some shreds from the next slot, so it might be worth leaving some capacity for the next slot as well.

Open to testing other ideas but, again, I would rather have some basic core functionality in place, and then iterate on top of that.
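
For illustration, a rough sketch of how the cap in the snippet above plays out, with hypothetical numbers and function name:

```rust
// Hypothetical illustration: cap the busiest slot's share of the
// precompute budget at 3/4 so that the runner-up slot (e.g. the next
// slot near a slot boundary) keeps at least a quarter of it.
fn capped_count(count: usize, num_shreds: usize) -> usize {
    count.min(num_shreds * 3 / 4)
}

fn main() {
    // With a budget of 1024, the most frequent slot is capped at 768,
    // leaving at least 256 for the second most frequent slot.
    assert_eq!(capped_count(5000, 1024), 768);
    // A slot already below the cap keeps its own count.
    assert_eq!(capped_count(100, 1024), 100);
}
```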

};
std::iter::from_fn(move || {
// Find next missing code and data entries in the cache.
let code = code.find(|&k| matches!(self.code.get(k), None | Some(None)));

nit: perhaps it would be better to have a positional mut `code_index` in this closure so we can avoid repeated calls to `find`.

Or maybe even something like this:

code.filter_map(|k| matches!(self.code.get(k), None | Some(None)).then_some((ShredType::Code, k)))
    .interleave(
        data.filter_map(|k| matches!(self.data.get(k), None | Some(None)).then_some((ShredType::Data, k))),
    )

@behzadnouri (Author)

Yes, `interleave` is much better; switched to that, thanks.
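
For reference, `interleave` comes from the itertools crate; a minimal standalone example of the pattern:

```rust
use itertools::Itertools; // itertools dependency in Cargo.toml

fn main() {
    let code = [0u32, 2, 4];
    let data = [1u32, 3, 5];
    // Alternates items from the two iterators until both are exhausted.
    let merged: Vec<u32> = code.into_iter().interleave(data).collect();
    assert_eq!(merged, vec![0, 1, 2, 3, 4, 5]);
}
```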

}
let mut entries: Vec<((Slot, CacheEntry), /*count:*/ usize)> = self
.cache
.drain()

might be better to avoid the `drain` here by collecting just the `(count, slot)` pairs to find the cutoff, and `retain` instead of `extend` below

@behzadnouri (Author)

But then we need to allocate another hash-map to look up which slots we want to retain, no?

Also, this cache is pretty small, ~9 slots currently, so whatever we do won't have a noticeable impact anyway.
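
For context, a minimal sketch of the drain-and-re-insert pattern under discussion, simplified to a map of slot to count (the real cache entries are richer):

```rust
use std::collections::HashMap;

const NUM_SLOTS: usize = 9; // approximate current cache size

// Keep only the NUM_SLOTS busiest slots: drain the map, sort by count,
// and re-insert the survivors. With a cache this small, the extra
// allocation is negligible either way.
fn prune(cache: &mut HashMap<u64, usize>) {
    let mut entries: Vec<(u64, usize)> = cache.drain().collect();
    entries.sort_unstable_by_key(|&(_, count)| std::cmp::Reverse(count));
    entries.truncate(NUM_SLOTS);
    cache.extend(entries);
}

fn main() {
    let mut cache: HashMap<u64, usize> =
        (0..12u64).map(|slot| (slot, slot as usize)).collect();
    prune(&mut cache);
    assert_eq!(cache.len(), NUM_SLOTS);
}
```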

entry.last_shred_in_slot |= stats.last_shred_in_slot;
for (shred, root_distance, addrs) in std::mem::take(&mut stats.addrs) {
debug_assert_eq!(shred.slot(), slot);
entry.put(shred.shred_type(), shred.index(), (root_distance, addrs));

At this point why do we need to insert the shred into the cache? `record` is called after we've retransmitted the shred, so barring a duplicate shred we should not expect to have to retransmit it again, right?

@behzadnouri (Author)

Yeah, it is just for the case of a duplicate shred.

socket_addr_space: &SocketAddrSpace,
stats: &RetransmitStats,
) -> Option<(/*root_distance:*/ u8, Cow<'a, [SocketAddr]>)> {
if let Some((root_distance, addrs)) = addr_cache.get(shred) {

At this point can we also remove the shred from the cache?

@behzadnouri (Author)

We might need to keep it due to duplicate shreds.
Also, this is called within a thread-pool, so mutations would at least need some locking, etc.

.iter()
.map(|(&slot, &count)| (count, slot))
.collect();
counts.sort_unstable();

@AshwinSekar AshwinSekar Mar 3, 2025

Does it make sense to bias by how big the cache for this slot is as well? For instance, if we initially see 1000 shreds for the current slot and there is some network delay, we might end up wastefully precomputing many thousands of shreds for the slot.

Also, thoughts on precomputing future slots? Based on the average shred count per block over our window, it might be more beneficial to stop precomputing for a slot once it's reached a certain size and instead speculate on upcoming slots.

@behzadnouri (Author)

> we might end up wastefully precomputing many thousands of shreds for the slot

The cache only precomputes up to `EXTEND_BUFFER` beyond the observed `max_index_{code,data}`.

> Also, thoughts on precomputing future slots?

Yeah, we can try out those ideas and see what works.
For now I just started with some basic heuristics.

I think having some basic core functionality in place would make it easier to iterate and test other ideas.

let entry = self.get_cache_entry_mut(slot);
entry.max_index_code = entry.max_index_code.max(stats.max_index_code);
entry.max_index_data = entry.max_index_data.max(stats.max_index_data);
entry.last_shred_in_slot |= stats.last_shred_in_slot;

Could we have a metric here to report how many data shreds we precomputed vs. how many were necessary, based on this flag? It would give some insight into any wasted computation for future tuning.

@behzadnouri (Author)

Sure, but can we do that in follow-up PRs?
This code is already ~700 lines.

Also, `EXTEND_BUFFER` limits how far the cache goes beyond `max_index_{code,data}`.
With current values it does not go more than 512 indices beyond that
(and once it has observed `last_shred_in_slot` it stops extending).
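
A minimal sketch of that bound, with hypothetical names:

```rust
const EXTEND_BUFFER: u32 = 512;

// Hypothetical sketch: speculate at most EXTEND_BUFFER indices past the
// highest shred index observed so far, and stop extending altogether
// once the last shred in the slot has been seen.
fn precompute_upper_bound(max_index_observed: u32, last_shred_in_slot: bool) -> u32 {
    if last_shred_in_slot {
        max_index_observed
    } else {
        max_index_observed + EXTEND_BUFFER
    }
}

fn main() {
    assert_eq!(precompute_upper_bound(100, false), 612);
    assert_eq!(precompute_upper_bound(100, true), 100);
}
```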

Ah got it, I overlooked the part where it's limited by the max index.

@AshwinSekar AshwinSekar left a comment

Looks great! Nice performance improvement

@behzadnouri behzadnouri merged commit 703677b into anza-xyz:master Mar 4, 2025
47 checks passed
@behzadnouri behzadnouri deleted the addr-cache-full branch March 4, 2025 16:53