precomputes Turbine trees speculatively before shreds arrive #5014

Merged
behzadnouri merged 3 commits into anza-xyz:master from addr-cache-full on Mar 4, 2025

Conversation

@behzadnouri behzadnouri commented Feb 16, 2025

Problem

Faster turbine tree calculation.

Summary of Changes

The commit maintains the number of shreds observed in each slot within a rolling window.
If the channel is empty and there are no shreds to retransmit, it uses the idle cycles to speculatively precompute the turbine tree for the slots that have received the most shreds within the rolling window.
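
To make the mechanism concrete, here is a minimal sketch of the rolling-window bookkeeping, with hypothetical names (`ShredCounts`, `hottest_slot`); the PR's actual implementation is more involved:

```rust
use std::collections::HashMap;

/// Hypothetical sketch: count shreds per slot over a rolling window and,
/// when the retransmit channel is idle, pick the busiest slot as the
/// candidate for speculative turbine-tree precomputation.
#[derive(Default)]
struct ShredCounts {
    counts: HashMap<u64, usize>, // slot -> shreds observed in the window
}

impl ShredCounts {
    fn record(&mut self, slot: u64) {
        *self.counts.entry(slot).or_default() += 1;
    }

    /// Slot with the most shreds observed in the window.
    fn hottest_slot(&self) -> Option<u64> {
        self.counts
            .iter()
            .max_by_key(|&(_, &count)| count)
            .map(|(&slot, _)| slot)
    }
}

fn main() {
    let mut window = ShredCounts::default();
    for slot in [7, 7, 8, 7] {
        window.record(slot);
    }
    // When the channel is empty, idle cycles would go toward precomputing
    // the turbine tree for this slot.
    assert_eq!(window.hottest_slot(), Some(7));
}
```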

@behzadnouri behzadnouri changed the title precomputes retransmit addresses speculatively before shreds arrive precomputes turbine tree speculatively before shreds arrive Feb 16, 2025
@behzadnouri (Author)

`retransmit-stage::{compute_turbine,total_time}` per shred.

mainnet node. Left side is this code, right is master.
[chart: retransmit_stage_mainnet_250217_1125]

staked testnet node. Left side is this code, right is master.
[chart: retransmit_stage_testnet_250217_1126]

@behzadnouri behzadnouri force-pushed the addr-cache-full branch 4 times, most recently from 46833b2 to 25261d1 on February 17, 2025 at 20:56
@alexpyattaev

What happens with CPU usage in the solRetransmit pool?

@behzadnouri (Author)

> What happens with CPU usage in the solRetransmit pool?

Roughly the same number of turbine trees are calculated either way,
so nearly all the CPU spent on speculative pre-calculation is saved when there is a cache hit.

@behzadnouri behzadnouri changed the title precomputes turbine tree speculatively before shreds arrive precomputes Turbine trees speculatively before shreds arrive Feb 17, 2025
@behzadnouri behzadnouri force-pushed the addr-cache-full branch 2 times, most recently from 00203d9 to 888178e on February 18, 2025 at 01:37
@behzadnouri behzadnouri force-pushed the addr-cache-full branch 2 times, most recently from af3a2e3 to d60edaa on February 27, 2025 at 16:15
break;
};
// Leave some capacity for the 2nd most frequent slot.
let count = count.min(num_shreds * 3 / 4);

Curious how you arrived at this. When testing, did you see that we wasted time speculating trees on a big block where it was not necessary?

@behzadnouri (Author)

Just some simple heuristics to start with.
The idea is that you usually want to allocate most of your resources to the slot that is receiving the most shreds within the rolling window. But at the slot boundary you may get some shreds from the next slot, so it might be worth leaving some capacity for the next slot as well.

Open to testing other ideas but, again, I would rather have some basic core functionality in place, and then iterate on top of that.
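
For illustration, a rough sketch of how the cap in the snippet above plays out, with hypothetical numbers and function name:

```rust
// Hypothetical illustration: cap the busiest slot's share of the
// precompute budget at 3/4 so that the runner-up slot (e.g. the next
// slot near a slot boundary) keeps at least a quarter of it.
fn capped_count(count: usize, num_shreds: usize) -> usize {
    count.min(num_shreds * 3 / 4)
}

fn main() {
    // With a budget of 1024, the most frequent slot is capped at 768,
    // leaving at least 256 for the second most frequent slot.
    assert_eq!(capped_count(5000, 1024), 768);
    // A slot already below the cap keeps its own count.
    assert_eq!(capped_count(100, 1024), 100);
}
```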

};
std::iter::from_fn(move || {
// Find next missing code and data entries in the cache.
let code = code.find(|&k| matches!(self.code.get(k), None | Some(None)));

nit: perhaps it would be better to have a positional mut `code_index` in this closure so we can avoid repeated calls to `find`.

Or maybe even something like this:

code.filter_map(|k| matches!(self.code.get(k), None | Some(None)).then_some((ShredType::Code, k)))
    .interleave(
        data.filter_map(|k| matches!(self.data.get(k), None | Some(None)).then_some((ShredType::Data, k))),
    )

@behzadnouri (Author)

Yes, `interleave` is much better; switched to that, thanks.
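
For reference, `interleave` comes from the itertools crate; a minimal standalone example of the pattern:

```rust
use itertools::Itertools; // itertools dependency in Cargo.toml

fn main() {
    let code = [0u32, 2, 4];
    let data = [1u32, 3, 5];
    // Alternates items from the two iterators until both are exhausted.
    let merged: Vec<u32> = code.into_iter().interleave(data).collect();
    assert_eq!(merged, vec![0, 1, 2, 3, 4, 5]);
}
```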

}
let mut entries: Vec<((Slot, CacheEntry), /*count:*/ usize)> = self
.cache
.drain()

might be better to avoid the `drain` here by collecting just the `(count, slot)` pairs to find the cutoff, and `retain` instead of `extend` below

@behzadnouri (Author)

But then we need to allocate another hash-map to look up which slots we want to retain, no?

Also, this cache is pretty small, ~9 slots currently, so whatever we do won't have a noticeable impact anyway.
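
For context, a minimal sketch of the drain-and-re-insert pattern under discussion, simplified to a map of slot to count (the real cache entries are richer):

```rust
use std::collections::HashMap;

const NUM_SLOTS: usize = 9; // approximate current cache size

// Keep only the NUM_SLOTS busiest slots: drain the map, sort by count,
// and re-insert the survivors. With a cache this small, the extra
// allocation is negligible either way.
fn prune(cache: &mut HashMap<u64, usize>) {
    let mut entries: Vec<(u64, usize)> = cache.drain().collect();
    entries.sort_unstable_by_key(|&(_, count)| std::cmp::Reverse(count));
    entries.truncate(NUM_SLOTS);
    cache.extend(entries);
}

fn main() {
    let mut cache: HashMap<u64, usize> =
        (0..12u64).map(|slot| (slot, slot as usize)).collect();
    prune(&mut cache);
    assert_eq!(cache.len(), NUM_SLOTS);
}
```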

entry.last_shred_in_slot |= stats.last_shred_in_slot;
for (shred, root_distance, addrs) in std::mem::take(&mut stats.addrs) {
debug_assert_eq!(shred.slot(), slot);
entry.put(shred.shred_type(), shred.index(), (root_distance, addrs));

At this point why do we need to insert the shred into the cache? `record` is called after we've retransmitted the shred, so barring a duplicate shred we should not expect to have to retransmit it again, right?

@behzadnouri (Author)

Yeah, it is just for the case of a duplicate shred.

socket_addr_space: &SocketAddrSpace,
stats: &RetransmitStats,
) -> Option<(/*root_distance:*/ u8, Cow<'a, [SocketAddr]>)> {
if let Some((root_distance, addrs)) = addr_cache.get(shred) {

At this point can we also remove the shred from the cache?

@behzadnouri (Author)

We might need to keep it due to duplicate shreds.
Also, this is called within a thread-pool, so mutations would at least need some locking, etc.

.iter()
.map(|(&slot, &count)| (count, slot))
.collect();
counts.sort_unstable();

@AshwinSekar AshwinSekar Mar 3, 2025

Does it make sense to bias by how big the cache for this slot is as well? For instance, if we initially see 1000 shreds for the current slot and there is some network delay, we might end up wastefully precomputing many thousands of shreds for the slot.

Also, thoughts on precomputing future slots? Based on the average shred count per block over our window, it might be more beneficial to stop precomputing for a slot once it's reached a certain size and instead speculate on upcoming slots.

@behzadnouri (Author)

> we might end up wastefully precomputing many thousands of shreds for the slot

The cache only precomputes up to `EXTEND_BUFFER` beyond the observed `max_index_{code,data}`.

> Also, thoughts on precomputing future slots?

Yeah, we can try out those ideas and see what works.
For now I just started with some basic heuristics.

I think having some basic core functionality in place would make it easier to iterate and test other ideas.

let entry = self.get_cache_entry_mut(slot);
entry.max_index_code = entry.max_index_code.max(stats.max_index_code);
entry.max_index_data = entry.max_index_data.max(stats.max_index_data);
entry.last_shred_in_slot |= stats.last_shred_in_slot;

Could we have a metric here to report how many data shreds we precomputed vs. how many were necessary, based on this flag? It would give some insight into any wasted computation for future tuning.

@behzadnouri (Author)

Sure, but can we do that in follow-up PRs?
This code is already ~700 lines.

Also, `EXTEND_BUFFER` limits how far the cache goes beyond `max_index_{code,data}`.
With current values it does not go more than 512 indices beyond that
(and once it has observed `last_shred_in_slot` it stops extending).
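
A minimal sketch of that bound, with hypothetical names:

```rust
const EXTEND_BUFFER: u32 = 512;

// Hypothetical sketch: speculate at most EXTEND_BUFFER indices past the
// highest shred index observed so far, and stop extending altogether
// once the last shred in the slot has been seen.
fn precompute_upper_bound(max_index_observed: u32, last_shred_in_slot: bool) -> u32 {
    if last_shred_in_slot {
        max_index_observed
    } else {
        max_index_observed + EXTEND_BUFFER
    }
}

fn main() {
    assert_eq!(precompute_upper_bound(100, false), 612);
    assert_eq!(precompute_upper_bound(100, true), 100);
}
```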

Ah got it, I overlooked the part where it's limited by the max index.

@AshwinSekar AshwinSekar left a comment

Looks great! Nice performance improvement

@behzadnouri behzadnouri merged commit 703677b into anza-xyz:master Mar 4, 2025
47 checks passed
@behzadnouri behzadnouri deleted the addr-cache-full branch March 4, 2025 16:53