
[CPU] SHM based allreduce improvement for small message size #5571

Merged: 37 commits merged into deepspeedai:master on Jun 12, 2024

Conversation

delock (Collaborator) commented May 27, 2024

On CPU servers, when running SHM-based allreduce for small messages, the performance is largely dominated by synchronization latency. This latency comes from two sources:

  1. Waiting for status changes from other ranks.
  2. Using `#pragma omp parallel for` to accelerate memory-bandwidth-bound operations such as parallel_memcpy or reduce (see the sketch after this list).
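
For context, here is a minimal sketch of an OpenMP-parallelized copy of the kind described in item 2. It is not the actual shm.cpp parallel_memcpy; the chunk size and function name are placeholders. The implicit barrier at the end of the parallel region is the kind of fixed synchronization cost that dominates for tiny messages.

```cpp
// Minimal sketch, not the actual DeepSpeed parallel_memcpy: split a copy into
// fixed-size chunks and let OpenMP threads handle them (compile with -fopenmp).
#include <cstdint>
#include <cstring>

static void parallel_memcpy_sketch(void* dst, const void* src, std::size_t nbytes)
{
    const std::size_t chunk = 4096;  // hypothetical per-thread granularity
    const std::int64_t nchunks = (std::int64_t)((nbytes + chunk - 1) / chunk);
#pragma omp parallel for
    for (std::int64_t i = 0; i < nchunks; ++i) {
        const std::size_t off = (std::size_t)i * chunk;
        const std::size_t len = (off + chunk <= nbytes) ? chunk : nbytes - off;
        // Each thread copies its own chunk; the implicit barrier at the end of
        // the parallel region is the synchronization cost mentioned above.
        std::memcpy((char*)dst + off, (const char*)src + off, len);
    }
}
```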

Each synchronization adds a little time to the allreduce latency. In the current implementation, for small messages, five syncs are needed on rank 0: 1) copy-in; 2) wait for other ranks to finish copying; 3) reduce; 4) copy-out; 5) wait for other ranks to finish copy-out.

We redesigned the algorithm for small-message allreduce (called symmetric_naive_allreduce) to use only three syncs, with each rank doing exactly the same steps: 1) copy-in; 2) wait for other ranks to finish copying; 3) reduce directly into the output buffer. We use double buffering so we can skip the last wait and go directly to the next call using the other buffer. A carefully designed state check avoids the need for a global barrier among ranks.
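
To make the three-step flow concrete, here is a minimal single-process sketch of the idea, with threads standing in for ranks. The names, the sequence-number state check, and the fp32 sum are illustrative placeholders, not the actual shm.cpp implementation.

```cpp
// Illustrative sketch only (assumptions: 4 "ranks" emulated by threads in one
// process, fp32 sum; the real code lives in POSIX shared memory across
// processes and uses different names and state encoding).
#include <atomic>
#include <cstddef>
#include <vector>

constexpr int kRanks = 4;
constexpr int kBufs  = 2;  // double buffering: even calls use buffer 0, odd calls buffer 1

struct Slot {
    std::vector<float> data;   // staging copy of one rank's input
    std::atomic<long> seq{0};  // sequence number published after copy-in
};
static Slot g_slots[kBufs][kRanks];

// Every rank runs the same three steps. The reduction order is fixed
// (rank 0..N-1), so all ranks compute identical results and nobody has to wait
// for the others to finish step 3: the next call simply uses the other buffer.
void symmetric_naive_allreduce_sketch(int rank, long call_id, float* buf, std::size_t numel)
{
    const int b = (int)(call_id % kBufs);
    Slot& mine = g_slots[b][rank];
    mine.data.assign(buf, buf + numel);                          // 1) copy-in
    mine.seq.store(call_id + 1, std::memory_order_release);
    for (int r = 0; r < kRanks; ++r)                             // 2) wait for peers' copy-in
        while (g_slots[b][r].seq.load(std::memory_order_acquire) < call_id + 1) {}
    for (std::size_t i = 0; i < numel; ++i) {                    // 3) reduce into output
        float acc = 0.f;
        for (int r = 0; r < kRanks; ++r) acc += g_slots[b][r].data[i];
        buf[i] = acc;
    }
    // No copy-out and no final wait: double buffering plus the per-buffer
    // sequence check keeps a fast rank from overwriting a buffer that a slower
    // rank is still reading.
}
```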

Tests show that for message sizes < 1MB, allreduce latency is reduced by 30% to 50%. This is especially helpful for tensor-parallel decoding with small batch sizes, where the tensor size is usually a few tens of kilobytes.

| message size (bytes) | new method latency (us) | old method latency (us) |
| ---: | ---: | ---: |
| 2 | 13.34 | 20.39 |
| 4 | 13.44 | 19.57 |
| 8 | 13.70 | 19.76 |
| 16 | 13.27 | 20.43 |
| 32 | 13.42 | 19.75 |
| 64 | 13.38 | 19.80 |
| 128 | 13.70 | 19.44 |
| 256 | 13.99 | 20.33 |
| 512 | 13.91 | 20.28 |
| 1024 | 15.00 | 22.86 |
| 2048 | 15.82 | 20.93 |
| 4096 | 16.00 | 21.08 |
| 8192 | 16.31 | 21.50 |
| 16384 | 16.27 | 22.95 |
| 32768 | 16.13 | 25.17 |
| 65536 | 18.92 | 25.90 |
| 131072 | 21.12 | 27.42 |
| 262144 | 23.09 | 32.36 |
| 524288 | 32.78 | 42.80 |

Because the new method computes the same reduced value independently on all ranks, caution is needed to ensure the result is identical across ranks. We use the test at https://github.com/delock/ds_allreduce_bench/blob/main/ds_comm_bench.py#L70 to ensure the implementation is correct; https://github.com/delock/ds_allreduce_bench/blob/main/validate.sh is a test script that gives better coverage.

delock (Collaborator, Author) commented Jun 6, 2024

Hi @awan-10, this PR is ready for review, can this PR be reviewed? Thanks!

@tjruwase tjruwase requested review from adk9 and tjruwase and removed request for arashb, awan-10 and mrwyattii June 9, 2024 22:52
tjruwase (Contributor) commented Jun 9, 2024

> Hi @awan-10, this PR is ready for review, can this PR be reviewed? Thanks!

@delock, we are reviewing now. Thanks for the PR!

Several review threads on csrc/cpu/comm/shm.cpp (now outdated and resolved). One of them, from a reviewer (Contributor), commented on this code:

    parallel_memcpy(slice_data(data_ptr, chunk_el, data_size, rank),
                    slice_data(workspace[rank]->buffer, chunk_el, chunk_size / chunk_el, rank),
                    slice_size(chunk_el, rank) * data_size);
    wait_buffer_state_until_2(i, reduce_current, copy_next, state_group);

Suggested change:

    - wait_buffer_state_until_2(i, reduce_current, copy_next, state_group);
    + if (i != world_rank) { wait_buffer_state_until_2(i, reduce_current, copy_next, state_group); }
delock (Collaborator, Author) commented Jun 12, 2024

Hi @adk9, code updated according to comments, thanks!

adk9 (Contributor) commented Jun 12, 2024

> Hi @adk9, code updated according to comments, thanks!

Hi @delock, thanks for your changes! The formatting check seems to fail for this PR. Could you run pre-commit or clang-format on your changes?

delock (Collaborator, Author) commented Jun 12, 2024

> Hi @delock, thanks for your changes! The formatting check seems to fail for this PR. Could you run pre-commit or clang-format on your changes?

Hi @adk9, the formatting has been fixed in the latest CI run. Thanks!

@adk9 adk9 added this pull request to the merge queue Jun 12, 2024
Merged via the queue into deepspeedai:master with commit eda5075 Jun 12, 2024
12 checks passed
github-merge-queue bot pushed a commit that referenced this pull request on Jul 16, 2024 (#5604):
This PR allows `deepspeed.comm.inference_all_reduce()` to enter the torch.compile graph even though it is implemented as a C++ kernel in DeepSpeed.

The previous implementation registered the `inference_all_reduce()` C++ kernel as a pybind function so it could be called from Python code. However, a pybind function cannot be recognized by PyTorch, so the graph breaks when `inference_all_reduce` is called.

We address this issue by registering `inference_all_reduce` as a PyTorch custom op, `torch.ops.deepspeed.inference_all_reduce`, so it can be built into the PyTorch graph.
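
For illustration, here is a minimal sketch of the custom-op registration pattern that makes a C++ kernel visible to torch.compile. The namespace and the trivial kernel body are hypothetical placeholders, not the actual DeepSpeed binding code.

```cpp
// Minimal sketch of exposing a C++ kernel as a PyTorch custom op so that
// torch.compile can capture it instead of graph-breaking on a plain pybind
// call. The namespace "myns" and the placeholder kernel body are illustrative,
// not the actual DeepSpeed implementation.
#include <torch/library.h>
#include <ATen/ATen.h>

static at::Tensor my_inference_all_reduce(const at::Tensor& input)
{
    // Placeholder: a real kernel would perform the SHM-based allreduce here.
    return input.clone();
}

// Declare the op schema and register the CPU implementation. After this, the
// op is callable from Python as torch.ops.myns.inference_all_reduce(tensor)
// and appears to torch.compile / TorchInductor as a first-class graph node.
TORCH_LIBRARY(myns, m) {
    m.def("inference_all_reduce(Tensor input) -> Tensor");
}

TORCH_LIBRARY_IMPL(myns, CPU, m) {
    m.impl("inference_all_reduce", &my_inference_all_reduce);
}
```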

The output trace code from TorchInductor:
```
class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[5, 4]", primals_2: "f32[5]", primals_3: "f32[4, 4]"):
        # File: /home/gma/DeepSpeed/deepspeed/comm/torch.py:161 in inference_all_reduce, code: return torch.ops.deepspeed.inference_all_reduce_(tensor)
        inference_all_reduce: "f32[4, 4]" = torch.ops.deepspeed.inference_all_reduce.default(primals_3)

        # File: /home/gma/allreduce_graph/test_allreduce.py:33 in forward, code: return self.linear(input)
        permute: "f32[4, 5]" = torch.ops.aten.permute.default(primals_1, [1, 0]);  primals_1 = None
        addmm: "f32[4, 5]" = torch.ops.aten.addmm.default(primals_2, inference_all_reduce, permute);  primals_2 = permute = None

        # No stacktrace found for following nodes
        copy_: "f32[4, 4]" = torch.ops.aten.copy_.default(primals_3, inference_all_reduce);  primals_3 = None
        return [addmm, inference_all_reduce]
```

Note that in this PR the inference_all_reduce op for CPU does not handle multi-node setups or the FP16 data type. For FP16 support, we will align with the PyTorch CPU FP16 plan. For multi-node, we are still looking at the possibility of upstreaming the oneCCL integration into PyTorch, so that we can make use of oneCCL for multi-node tensor-parallel inference with PyTorch.

This PR is independent of #5571. They can work separately or together without issue.

---------

Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>