Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wip: test transport layer improvements #386

Closed
wants to merge 3 commits into from

Conversation

csegarragonz
Copy link
Collaborator

@csegarragonz csegarragonz commented Feb 27, 2024

This PR combines:

we will merge them separately, but ideally we want to have both PRs above green, and then measure the benefits (hopefully we are on par with #382)

@lgarithm
Copy link
Contributor

lgarithm commented Feb 28, 2024

benchmark results with #385 and #379

BGN ========================================bench_allreduce========================================
bench_all_reduce(np=4) took 0.0354s, total workload: 3.662MiB, rate: 0.101GiB/s
bench_all_reduce(np=4) took 0.0319s, total workload: 3.662MiB, rate: 0.112GiB/s
bench_all_reduce(np=4) took 0.0316s, total workload: 3.662MiB, rate: 0.113GiB/s
bench_all_reduce(np=4) took 0.0319s, total workload: 3.662MiB, rate: 0.112GiB/s
bench_all_reduce(np=4) took 0.0321s, total workload: 3.662MiB, rate: 0.112GiB/s
bench_all_reduce(np=4) took 0.0328s, total workload: 3.662MiB, rate: 0.109GiB/s
bench_all_reduce(np=4) took 0.0317s, total workload: 3.662MiB, rate: 0.113GiB/s
bench_all_reduce(np=4) took 0.0316s, total workload: 3.662MiB, rate: 0.113GiB/s
bench_all_reduce(np=4) took 0.0314s, total workload: 3.662MiB, rate: 0.114GiB/s
bench_all_reduce(np=4) took 0.0315s, total workload: 3.662MiB, rate: 0.113GiB/s
bench_all_reduce(np=4) took 0.2235s, total workload: 1.144GiB, rate: 5.117GiB/s
bench_all_reduce(np=4) took 0.2024s, total workload: 1.144GiB, rate: 5.652GiB/s
bench_all_reduce(np=4) took 0.2050s, total workload: 1.144GiB, rate: 5.579GiB/s
bench_all_reduce(np=4) took 0.2047s, total workload: 1.144GiB, rate: 5.586GiB/s
bench_all_reduce(np=4) took 0.2059s, total workload: 1.144GiB, rate: 5.556GiB/s
bench_all_reduce(np=4) took 0.2053s, total workload: 1.144GiB, rate: 5.570GiB/s
bench_all_reduce(np=4) took 0.2055s, total workload: 1.144GiB, rate: 5.565GiB/s
bench_all_reduce(np=4) took 0.2045s, total workload: 1.144GiB, rate: 5.594GiB/s
bench_all_reduce(np=4) took 0.2030s, total workload: 1.144GiB, rate: 5.633GiB/s
bench_all_reduce(np=4) took 0.2067s, total workload: 1.144GiB, rate: 5.533GiB/s
END ========================================bench_allreduce========================================

benchmark results with #385 and #379 and #387 (result of 98947e7212c60ad2e9124377eac4c25d9b825770)

BGN ========================================bench_allreduce========================================
bench_all_reduce(np=4) took 0.0370s, total workload: 3.662MiB, rate: 0.097GiB/s
bench_all_reduce(np=4) took 0.0337s, total workload: 3.662MiB, rate: 0.106GiB/s
bench_all_reduce(np=4) took 0.0343s, total workload: 3.662MiB, rate: 0.104GiB/s
bench_all_reduce(np=4) took 0.0333s, total workload: 3.662MiB, rate: 0.107GiB/s
bench_all_reduce(np=4) took 0.0342s, total workload: 3.662MiB, rate: 0.105GiB/s
bench_all_reduce(np=4) took 0.0335s, total workload: 3.662MiB, rate: 0.107GiB/s
bench_all_reduce(np=4) took 0.0341s, total workload: 3.662MiB, rate: 0.105GiB/s
bench_all_reduce(np=4) took 0.0340s, total workload: 3.662MiB, rate: 0.105GiB/s
bench_all_reduce(np=4) took 0.0336s, total workload: 3.662MiB, rate: 0.106GiB/s
bench_all_reduce(np=4) took 0.0337s, total workload: 3.662MiB, rate: 0.106GiB/s
bench_all_reduce(np=4) took 0.3165s, total workload: 1.144GiB, rate: 3.614GiB/s
bench_all_reduce(np=4) took 0.3176s, total workload: 1.144GiB, rate: 3.601GiB/s
bench_all_reduce(np=4) took 0.3104s, total workload: 1.144GiB, rate: 3.685GiB/s
bench_all_reduce(np=4) took 0.3175s, total workload: 1.144GiB, rate: 3.602GiB/s
bench_all_reduce(np=4) took 0.3138s, total workload: 1.144GiB, rate: 3.644GiB/s
bench_all_reduce(np=4) took 0.3176s, total workload: 1.144GiB, rate: 3.601GiB/s
bench_all_reduce(np=4) took 0.3115s, total workload: 1.144GiB, rate: 3.672GiB/s
bench_all_reduce(np=4) took 0.3146s, total workload: 1.144GiB, rate: 3.635GiB/s
bench_all_reduce(np=4) took 0.3102s, total workload: 1.144GiB, rate: 3.687GiB/s
bench_all_reduce(np=4) took 0.3150s, total workload: 1.144GiB, rate: 3.631GiB/s
END ========================================bench_allreduce========================================

@lgarithm
Copy link
Contributor

looks like it is on par with #382. What's their difference?

@csegarragonz
Copy link
Collaborator Author

csegarragonz commented Feb 28, 2024

@lgarithm that is great news! this PR is meant to be the same than #382 but is just a pure merge of #385 and #379 (both passing all the tests)

so before merging both of them i wanted to make sure that in the upstreaming process i did not mess up in any way

i am also re-building the production containers to triple-check if we can see some improvement (or no regression) in our production workload. after that, i will merge the two linked prs in.

@csegarragonz csegarragonz force-pushed the spinlock-plus-struct-test branch from d260e99 to b5b8123 Compare February 28, 2024 14:21
@csegarragonz csegarragonz changed the title wip: test spinlock + c-struct wip: test transport layer improvements Feb 28, 2024
@csegarragonz csegarragonz force-pushed the spinlock-plus-struct-test branch 2 times, most recently from 3db65f3 to 578c079 Compare February 28, 2024 14:48
@csegarragonz
Copy link
Collaborator Author

@lgarithm could you run the benchmarks with the latest commit i have added and include the results where it says PASTE HERE in the message above? (i have edited it slightly)

note that i've cleaned-up the commit history to make sure we know which commit introduces what, so this may affect your merging/rebasing.

@csegarragonz csegarragonz force-pushed the spinlock-plus-struct-test branch from 98947e7 to 578c079 Compare February 28, 2024 16:23
This commit moves the abstraction of a PTP message from a protobuf
object to a fixed-size C-struct with a heap pointer. The rationale
is that these PTP messages move through the system, and even when
careful it is challenging to keep track of the copies allocations
that protobuf is doing under the hood. In exchange, we are very
explicity about the copies and allocations we do.
@csegarragonz csegarragonz force-pushed the spinlock-plus-struct-test branch from 3f0f4d7 to b7f0042 Compare March 13, 2024 09:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants