[CPU] SHM based allreduce improvement for small message size #5571

Merged: adk9 merged 37 commits into deepspeedai:master from delock:gma/symmetric_naive_allreduce on Jun 12, 2024
Conversation
Hi @awan-10, this PR is ready for review. Could it be reviewed? Thanks!
adk9 suggested changes on Jun 11, 2024

csrc/cpu/comm/shm.cpp (outdated):
    parallel_memcpy(slice_data(data_ptr, chunk_el, data_size, rank),
                    slice_data(workspace[rank]->buffer, chunk_el, chunk_size / chunk_el, rank),
                    slice_size(chunk_el, rank) * data_size);
    wait_buffer_state_until_2(i, reduce_current, copy_next, state_group);
Suggested change:

    - wait_buffer_state_until_2(i, reduce_current, copy_next, state_group);
    + if (i != world_rank) { wait_buffer_state_until_2(i, reduce_current, copy_next, state_group); }
Hi @adk9, the code has been updated according to the comments. Thanks!
adk9 approved these changes on Jun 12, 2024
github-merge-queue bot pushed a commit referencing this pull request on Jul 16, 2024:
#5604) This PR allows `deepspeed.comm.inference_all_reduce()` to enter the torch.compile graph even though it is implemented as a C++ kernel in DeepSpeed.

The previous implementation registered the `inference_all_reduce()` C++ kernel as a pybind function so it could be called from Python code. However, a pybind function cannot be recognized by PyTorch, so the graph breaks when `inference_all_reduce` is called. We address this issue by registering `inference_all_reduce` as a PyTorch custom op, `torch.ops.deepspeed.inference_all_reduce`, so it can be built into the PyTorch graph.

The output trace code from TorchInductor:

```
class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[5, 4]", primals_2: "f32[5]", primals_3: "f32[4, 4]"):
        # File: /home/gma/DeepSpeed/deepspeed/comm/torch.py:161 in inference_all_reduce, code: return torch.ops.deepspeed.inference_all_reduce_(tensor)
        inference_all_reduce: "f32[4, 4]" = torch.ops.deepspeed.inference_all_reduce.default(primals_3)

        # File: /home/gma/allreduce_graph/test_allreduce.py:33 in forward, code: return self.linear(input)
        permute: "f32[4, 5]" = torch.ops.aten.permute.default(primals_1, [1, 0]);  primals_1 = None
        addmm: "f32[4, 5]" = torch.ops.aten.addmm.default(primals_2, inference_all_reduce, permute);  primals_2 = permute = None

        # No stacktrace found for following nodes
        copy_: "f32[4, 4]" = torch.ops.aten.copy_.default(primals_3, inference_all_reduce);  primals_3 = None
        return [addmm, inference_all_reduce]
```

Note that in this PR the `inference_all_reduce` op for CPU does not handle multi-node setups or the FP16 data type. For FP16 support, we will align with the PyTorch CPU FP16 plan. For multi-node, we are still looking at the possibility of upstreaming the oneCCL integration into PyTorch, so that oneCCL can be used for multi-node tensor-parallel inference with PyTorch.

This PR is independent of #5571; they can work separately or together without issue.

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
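The registration pattern that commit describes can be sketched roughly as follows in a PyTorch C++ extension. This is a minimal illustration with a trivial stand-in kernel body; the schema, in-place semantics, and source layout are assumptions, not the actual DeepSpeed code.

```cpp
#include <torch/extension.h>

// Hypothetical stand-in for the real SHM-based CPU kernel; it returns the
// tensor unchanged so the registration pattern compiles on its own.
at::Tensor inference_all_reduce(at::Tensor tensor)
{
    // A real implementation would reduce `tensor` across ranks via shared memory.
    return tensor;
}

// Declaring the op schema in the `deepspeed` namespace makes it visible to
// PyTorch as torch.ops.deepspeed.inference_all_reduce, so torch.compile can
// keep it inside the graph instead of breaking at a pybind call.
TORCH_LIBRARY(deepspeed, m)
{
    m.def("inference_all_reduce(Tensor self) -> Tensor");
}

// Bind the CPU implementation to that schema.
TORCH_LIBRARY_IMPL(deepspeed, CPU, m)
{
    m.impl("inference_all_reduce", inference_all_reduce);
}
```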
On a CPU server, when running SHM-based allreduce for small messages, performance is pretty much dominated by synchronization latency. This latency comes from the following two situations:

1. Waiting for buffer-state changes from other ranks.
2. Dispatching `#pragma omp parallel for` to accelerate memory-bandwidth-bound operations such as `parallel_memcpy` or `reduce`.

Each synchronization adds a little time to the allreduce latency. In the current implementation, five syncs on rank 0 are needed for small messages: 1) copy-in; 2) wait for other ranks to finish copy-in; 3) reduce; 4) copy-out; 5) wait for other ranks to finish copy-out.
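For orientation, here is a minimal sketch of that five-step rank-0 flow, assuming hypothetical helper names (the real helpers in csrc/cpu/comm/shm.cpp differ):

```cpp
// Illustrative-only sketch of the original small-message path as seen by rank 0.
// Helper names are hypothetical stand-ins, not the actual functions in shm.cpp.
#include <cstddef>

static void copy_into_shared_buffer(const void* src, size_t bytes) { /* parallel_memcpy-style copy-in */ }
static void wait_for_all_ranks(int expected_state) { /* spin on per-rank state flags */ }
static void reduce_shared_buffers(void* dst, size_t bytes) { /* sum all ranks' slices */ }
static void copy_out_result(void* dst, size_t bytes) { /* copy reduced data back out */ }

void naive_small_allreduce_rank0(void* data, size_t bytes)
{
    copy_into_shared_buffer(data, bytes);     // sync 1: copy-in
    wait_for_all_ranks(/*copy-in done*/ 1);   // sync 2: wait for every rank's copy-in
    reduce_shared_buffers(data, bytes);       // sync 3: reduce
    copy_out_result(data, bytes);             // sync 4: copy-out
    wait_for_all_ranks(/*copy-out done*/ 2);  // sync 5: wait for every rank's copy-out
}
```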
We redesigned the algorithm for small-message allreduce (called `symmetric_naive_allreduce`) to need only three syncs, with every rank performing exactly the same steps: 1) copy-in; 2) wait for other ranks to finish copy-in; 3) reduce directly into the output buffer. We use double buffering so we can skip the last wait and go directly to the next call using the other buffer, and a carefully designed state check avoids a global barrier among ranks.

Tests show that for message sizes < 1 MB, allreduce latency is reduced by 30% to 50%. This is especially helpful for tensor-parallel decoding with small batch sizes, where the tensor size is usually a few tens of kilobytes.
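A minimal sketch of the symmetric, double-buffered scheme described above, again with hypothetical helper names rather than the actual functions in shm.cpp:

```cpp
// Illustrative-only sketch of symmetric_naive_allreduce: every rank performs the
// same three steps and reduces straight into its own output buffer; two buffer
// slots let the trailing "wait for copy-out" sync be dropped entirely.
#include <cstddef>

constexpr int NUM_SLOTS = 2;  // double buffering: alternate slots between calls

static void copy_into_shared_slot(int slot, const void* src, size_t bytes) { /* copy-in to this rank's region of the slot */ }
static void wait_for_all_ranks_in_slot(int slot) { /* per-slot state check, no global barrier */ }
static void reduce_slot_into(int slot, void* dst, size_t bytes) { /* sum every rank's slice directly into dst */ }

void symmetric_naive_allreduce(void* data, size_t bytes)
{
    static int next_slot = 0;
    const int slot = next_slot;
    next_slot = (next_slot + 1) % NUM_SLOTS;   // the next call uses the other slot

    copy_into_shared_slot(slot, data, bytes);  // sync 1: copy-in
    wait_for_all_ranks_in_slot(slot);          // sync 2: all ranks finished copy-in
    reduce_slot_into(slot, data, bytes);       // sync 3: reduce directly into the output
    // No trailing wait: the per-slot state check at the next call keeps a slot
    // from being reused before every rank has consumed it.
}
```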
Because the new method computes the reduce value independently on every rank, care must be taken to ensure the result is identical across ranks. We use the test at https://github.com/delock/ds_allreduce_bench/blob/main/ds_comm_bench.py#L70 to verify that the implementation is correct; https://github.com/delock/ds_allreduce_bench/blob/main/validate.sh is a test script that provides broader coverage.