precomputes Turbine trees speculatively before shreds arrive #5014
Conversation
What happens with CPU usage in the solRetransmit pool?
Roughly the same number of turbine trees are calculated anyways.
```rust
        break;
    };
    // Leave some capacity for the 2nd most frequent slot.
    let count = count.min(num_shreds * 3 / 4);
```
Curious how you arrived at this. When testing, did you see that we wasted time speculating trees on a big block where it was not necessary?
Just some simple heuristics to start with.
The idea is that you usually want to allocate most of your resources to the slot which is receiving the most shreds within the rolling window. But at the slot boundary you may get some shreds from the next slot, so it makes sense to leave some capacity for the next slot as well.
Open to testing other ideas, but again, I would rather have some basic core functionality in place and then iterate on top of that.
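A minimal sketch of the capacity-splitting heuristic just described, assuming `counts` holds `(shred_count, slot)` pairs from the rolling window sorted ascending (as by the `counts.sort_unstable()` hunk further down); `allocate_budget` and `num_shreds` are hypothetical names, and only the `3 / 4` cap comes from the diff:

```rust
type Slot = u64;

/// Splits a speculation budget of `num_shreds` across the most active slots.
fn allocate_budget(mut num_shreds: usize, counts: &[(usize, Slot)]) -> Vec<(Slot, usize)> {
    let mut out = Vec::new();
    // Walk the slots from most to least frequent.
    for &(count, slot) in counts.iter().rev() {
        // Cap the busiest slot's share so the runner-up (often the next
        // slot at a slot boundary) still gets some capacity.
        let count = count.min(num_shreds * 3 / 4);
        if count == 0 {
            break;
        }
        out.push((slot, count));
        num_shreds -= count;
    }
    out
}
```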
turbine/src/addr_cache.rs
```rust
};
std::iter::from_fn(move || {
    // Find next missing code and data entries in the cache.
    let code = code.find(|&k| matches!(self.code.get(k), None | Some(None)));
```
nit: perhaps it would be better to have a positional `mut code_index` in this closure, so we can avoid repeated calls to `find`.
Or maybe even something like this:
```rust
code.filter_map(|k| matches!(self.code.get(k), None | Some(None)).then_some((ShredType::Code, k)))
    .interleave(
        data.filter_map(|k| matches!(self.data.get(k), None | Some(None)).then_some((ShredType::Data, k))),
    )
```
yes, `interleave` is much better, switched to that, thanks.
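For reference, a self-contained sketch of the interleaved lookup, assuming the `itertools` crate; the map value type and the function signature are illustrative stand-ins for the actual cache fields:

```rust
use itertools::Itertools;
use std::collections::HashMap;

#[derive(Clone, Copy, Debug)]
enum ShredType {
    Code,
    Data,
}

/// Yields the (type, index) pairs whose Turbine addresses are still missing,
/// alternating between code and data so neither type starves the other.
fn missing_entries<'a>(
    code_cache: &'a HashMap<u32, Option<u8>>,
    data_cache: &'a HashMap<u32, Option<u8>>,
    code: impl Iterator<Item = u32> + 'a,
    data: impl Iterator<Item = u32> + 'a,
) -> impl Iterator<Item = (ShredType, u32)> + 'a {
    // An entry is missing if the key was never inserted (None) or was
    // inserted without computed addresses yet (Some(None)).
    let code = code
        .filter(|k| matches!(code_cache.get(k), None | Some(None)))
        .map(|k| (ShredType::Code, k));
    let data = data
        .filter(|k| matches!(data_cache.get(k), None | Some(None)))
        .map(|k| (ShredType::Data, k));
    code.interleave(data)
}
```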
```rust
}
let mut entries: Vec<((Slot, CacheEntry), /*count:*/ usize)> = self
    .cache
    .drain()
```
might be better to avoid the `drain` here by collecting just the `(count, slot)` to find the cutoff, and `retain` instead of `extend` below
But then we need to allocate another hash map to look up which slots we want to retain, no?
Also, this cache is pretty small, ~9 slots currently, so whatever we do it won't have a noticeable impact anyways.
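A hedged sketch of the drain-based eviction being debated, with stand-in types (`String` for the cache entry, `NUM_SLOTS` for the actual retention cap):

```rust
use std::collections::HashMap;

type Slot = u64;

const NUM_SLOTS: usize = 9; // the "~9 slots" mentioned above

/// Keeps only the slots with the highest rolling-window shred counts.
fn evict(cache: &mut HashMap<Slot, String>, counts: &HashMap<Slot, usize>) {
    // Drain everything, rank by shred count, and re-insert the busiest
    // slots. With only ~9 entries, the extra moves relative to a
    // retain-based approach are negligible.
    let mut entries: Vec<(usize, Slot, String)> = cache
        .drain()
        .map(|(slot, entry)| (counts.get(&slot).copied().unwrap_or(0), slot, entry))
        .collect();
    entries.sort_unstable_by_key(|&(count, slot, _)| std::cmp::Reverse((count, slot)));
    cache.extend(
        entries
            .into_iter()
            .take(NUM_SLOTS)
            .map(|(_, slot, entry)| (slot, entry)),
    );
}
```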
```rust
entry.last_shred_in_slot |= stats.last_shred_in_slot;
for (shred, root_distance, addrs) in std::mem::take(&mut stats.addrs) {
    debug_assert_eq!(shred.slot(), slot);
    entry.put(shred.shred_type(), shred.index(), (root_distance, addrs));
```
At this point why do we need to insert the shred into the cache? `record` is called after we've retransmitted the shred, so barring a duplicate shred we should not expect to have to retransmit it again, right?
yeah, it is just in case there is a duplicate shred.
```rust
socket_addr_space: &SocketAddrSpace,
stats: &RetransmitStats,
) -> Option<(/*root_distance:*/ u8, Cow<'a, [SocketAddr]>)> {
    if let Some((root_distance, addrs)) = addr_cache.get(shred) {
```
At this point can we also remove the `shred` from the cache?
We might need to keep it due to duplicate shreds.
Also, this is called within a thread pool, so mutations would at least need some locking, etc.
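A minimal sketch (using `rayon` as a stand-in for the retransmit thread pool) of why the lookup stays read-only: the cache is shared immutably across workers, so `get` only needs `&self`, whereas removing entries would force a lock onto the hot path. All types here are illustrative:

```rust
use rayon::prelude::*;
use std::collections::HashMap;
use std::net::SocketAddr;

struct AddrCache {
    addrs: HashMap<u64, (/*root_distance:*/ u8, Vec<SocketAddr>)>,
}

impl AddrCache {
    // Shared-reference lookup: safe to call concurrently from the pool.
    fn get(&self, key: u64) -> Option<&(u8, Vec<SocketAddr>)> {
        self.addrs.get(&key)
    }
}

fn retransmit_all(cache: &AddrCache, keys: &[u64]) {
    keys.par_iter().for_each(|&key| {
        if let Some((_root_distance, addrs)) = cache.get(key) {
            // ... send the shred to each address ...
            let _ = addrs;
        }
    });
}
```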
```rust
.iter()
.map(|(&slot, &count)| (count, slot))
.collect();
counts.sort_unstable();
```
Does it make sense to also bias by how big the cache for this slot already is? For instance, if we initially see 1000 shreds for the current slot and there is some network delay, we might end up wastefully computing many thousands of shreds for the slot.
Also, thoughts on precomputing future slots? Based on the average shred count per block over our window, it might be more beneficial to stop precomputing for a slot once it's reached a certain size and instead speculate on upcoming slots.
> we might end up wastefully computing many thousands of shreds for the slot

The cache only precomputes up to `EXTEND_BUFFER` beyond the observed `max_index_{code,data}`.

> Also, thoughts on precomputing future slots?

Yeah, we can try out those ideas and see what works.
For now I just started with some basic heuristics.
I think having some basic core functionality in place would make it easier to iterate and test other ideas.
let entry = self.get_cache_entry_mut(slot); | ||
entry.max_index_code = entry.max_index_code.max(stats.max_index_code); | ||
entry.max_index_data = entry.max_index_data.max(stats.max_index_data); | ||
entry.last_shred_in_slot |= stats.last_shred_in_slot; |
Could we have a metric here to report how many data shreds we precomputed vs. how many were necessary, based on this flag? That would give some insight into any wasted computation for future tuning.
sure, but can we do that in follow-up PRs? This code is already ~700 lines.
Also, the `EXTEND_BUFFER` limits how far the cache goes beyond the `max_index_{code,data}`. With current values it does not go more than 512 indices beyond that (and if it has observed `last_shred_in_slot`, it stops extending).
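A sketch of the extension bound just described, using the field names from the diff; `should_extend` is a hypothetical helper, and the value 512 is taken from the comment above rather than from the actual constant definition:

```rust
const EXTEND_BUFFER: u32 = 512; // per the "512 indices" figure above

struct CacheEntry {
    max_index_data: u32,
    last_shred_in_slot: bool,
}

impl CacheEntry {
    /// Whether a data-shred index is still worth precomputing speculatively.
    fn should_extend(&self, index: u32) -> bool {
        // Once the last shred in the slot has been observed, stop
        // extending past the highest index seen.
        if self.last_shred_in_slot && index > self.max_index_data {
            return false;
        }
        // Never speculate more than EXTEND_BUFFER beyond the observed max.
        index <= self.max_index_data.saturating_add(EXTEND_BUFFER)
    }
}
```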
ah got it, I overlooked the part where it's limited by the max index.
The commit maintains the number of shreds observed in each slot within a rolling window. If the channel is empty and there are no shreds to retransmit, it uses the idle cycles to speculatively precompute Turbine trees for the slots which have received the most shreds within the rolling window.
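A hedged sketch of the loop shape the commit message describes, using `crossbeam_channel::recv_timeout` as a stand-in for the real channel wiring; the timeout value and both helper stubs are hypothetical:

```rust
use crossbeam_channel::{Receiver, RecvTimeoutError};
use std::time::Duration;

struct Shred; // stand-in for the real shred type

fn retransmit_loop(receiver: Receiver<Vec<Shred>>) {
    loop {
        match receiver.recv_timeout(Duration::from_millis(10)) {
            // Normal path: look up (or compute) Turbine addresses and
            // retransmit the batch.
            Ok(_shreds) => { /* retransmit(...) */ }
            // Channel empty: spend the idle cycles speculatively
            // precomputing Turbine trees for the busiest slots in the
            // rolling window.
            Err(RecvTimeoutError::Timeout) => { /* precompute(...) */ }
            Err(RecvTimeoutError::Disconnected) => return,
        }
    }
}
```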
Looks great! Nice performance improvement.
Problem
Faster turbine tree calculation.
Summary of Changes
The commit maintains the number of shreds observed in each slot within a rolling window.
If the channel is empty and there are no shreds to retransmit, it uses the idle cycles to speculatively precompute Turbine trees for the slots which have received the most shreds within the rolling window.
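To make the rolling-window bookkeeping concrete, a self-contained sketch; the window size, types, and method names are illustrative, not the PR's actual implementation:

```rust
use std::collections::{HashMap, VecDeque};

type Slot = u64;

struct RollingShredCounts {
    window: VecDeque<Slot>,       // slots of recently observed shreds, oldest first
    counts: HashMap<Slot, usize>, // shreds per slot within the window
    capacity: usize,              // window size, in shreds
}

impl RollingShredCounts {
    fn record(&mut self, slot: Slot) {
        self.window.push_back(slot);
        *self.counts.entry(slot).or_default() += 1;
        // Slide the window: forget the oldest shred once over capacity.
        if self.window.len() > self.capacity {
            let old = self.window.pop_front().unwrap();
            match self.counts.get_mut(&old) {
                Some(count) if *count > 1 => *count -= 1,
                _ => {
                    self.counts.remove(&old);
                }
            }
        }
    }

    /// Slot that received the most shreds within the window, if any;
    /// this is the first candidate for speculative precomputation.
    fn busiest_slot(&self) -> Option<Slot> {
        self.counts
            .iter()
            .max_by_key(|&(_, &count)| count)
            .map(|(&slot, _)| slot)
    }
}
```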