
Update Domino for Llama3 #7084

Open

wants to merge 20 commits into master
Conversation

shenzheyu

No description provided.

@GuanhuaWang (Contributor)

@hwchen2017, please follow up on this PR. Thank you!

loadams and others added 19 commits March 5, 2025 17:55
Signed-off-by: Zheyu SHEN <[email protected]>
Signed-off-by: Zheyu SHEN <[email protected]>
Propagate API change.

Signed-off-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Zheyu SHEN <[email protected]>
- add zero2 test
- minor fix with transformer version update & ds master merge.

Signed-off-by: inkcherry <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Zheyu SHEN <[email protected]>
With bf16 + MoE, refreshing optimizer state from a bf16 checkpoint raises
`IndexError: list index out of range`.

Signed-off-by: shaomin <[email protected]>
Co-authored-by: shaomin <[email protected]>
Co-authored-by: Hongwei Chen <[email protected]>
Signed-off-by: Zheyu SHEN <[email protected]>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.4
Author           - @loadams

Co-authored-by: loadams <[email protected]>
Signed-off-by: Zheyu SHEN <[email protected]>
@jeffra and I fixed this many years ago, so this brings the doc back to a
correct state.

---------

Signed-off-by: Stas Bekman <[email protected]>
Signed-off-by: Zheyu SHEN <[email protected]>
Description
This PR includes Tecorigin SDAA accelerator support.
With this PR, DeepSpeed supports SDAA as a backend for training tasks.
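
As a rough illustration (not part of this PR's diff), such a backend surfaces through DeepSpeed's accelerator abstraction; the `"sdaa"` device name below is an assumption based on this description, not a documented value:

```python
# Hedged sketch: query the active accelerator through DeepSpeed's accelerator
# abstraction. The "sdaa" name is an assumption, not a documented value.
from deepspeed.accelerator import get_accelerator

acc = get_accelerator()
print(acc.is_available())  # True when the backend's devices are usable
print(acc.device_name())   # expected to report the SDAA backend, e.g. "sdaa"
```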

---------

Signed-off-by: siqi <[email protected]>
Co-authored-by: siqi <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Co-authored-by: Logan Adams <[email protected]>
Signed-off-by: Zheyu SHEN <[email protected]>
More information on libuv in PyTorch:
https://pytorch.org/tutorials/intermediate/TCPStore_libuv_backend.html
Issue tracking the prevalence of the error on Windows (unresolved at the
time of this PR): pytorch/pytorch#139990
LibUV GitHub: https://github.com/libuv/libuv

Windows error:
```
  File "C:\hostedtoolcache\windows\Python\3.12.7\x64\Lib\site-packages\torch\distributed\rendezvous.py", line 189, in _create_c10d_store
    return TCPStore(
           ^^^^^^^^^
RuntimeError: use_libuv was requested but PyTorch was build without libuv support
```

use_libuv isn't well supported on Windows in PyTorch < 2.4, so we need to
guard against this case.
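
A minimal sketch of such a guard (illustrative, not the exact DeepSpeed patch); it assumes torch's rendezvous code honors the `USE_LIBUV` environment variable when creating the TCPStore:

```python
# Hedged sketch: disable libuv for the c10d TCPStore on Windows builds of
# torch older than 2.4, where libuv support may be missing.
import os
import sys

import torch
from packaging import version

if sys.platform == "win32" and version.parse(torch.__version__) < version.parse("2.4"):
    # torch's rendezvous code consults USE_LIBUV when constructing the TCPStore
    os.environ["USE_LIBUV"] = "0"
```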

---------

Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: Zheyu SHEN <[email protected]>
@fukun07 and I discovered a bug when using the `offload_states` and
`reload_states` APIs of the Zero3 optimizer. When using grouped
parameters (for example, in weight decay or grouped lr scenarios), the
order of the parameter mapping in `reload_states`
([here](https://github.com/deepspeedai/DeepSpeed/blob/14b3cce4aaedac69120d386953e2b4cae8c2cf2c/deepspeed/runtime/zero/stage3.py#L2953))
does not correspond with the initialization of `self.lp_param_buffer`
([here](https://github.com/deepspeedai/DeepSpeed/blob/14b3cce4aaedac69120d386953e2b4cae8c2cf2c/deepspeed/runtime/zero/stage3.py#L731)),
which leads to misaligned parameter loading. This issue was overlooked
by the corresponding unit tests
([here](https://github.com/deepspeedai/DeepSpeed/blob/master/tests/unit/runtime/zero/test_offload_states.py)),
so we fixed the bug in our PR and added the corresponding unit tests.
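
For context, a minimal sketch of the scenario the fix covers (the tiny model and config values are illustrative; launch under the `deepspeed` launcher, and it assumes a DeepSpeed version that exposes the `offload_states`/`reload_states` engine APIs):

```python
# Hedged sketch: grouped parameters (per-group weight decay) with the ZeRO-3
# offload_states / reload_states round trip that this PR fixes.
import torch
import deepspeed

model = torch.nn.Linear(8, 8)
param_groups = [  # grouped parameters, e.g. separate weight-decay settings
    {"params": [model.weight], "weight_decay": 0.01},
    {"params": [model.bias], "weight_decay": 0.0},
]
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=param_groups, config=ds_config
)

engine.offload_states()  # move parameters/gradients/optimizer states to CPU
engine.reload_states()   # must restore parameters in the original flattening order
```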

---------

Signed-off-by: Wei Wu <[email protected]>
Co-authored-by: Masahiro Tanaka <[email protected]>
Signed-off-by: Zheyu SHEN <[email protected]>
Following changes in PyTorch's trace rules, my previous PR to avoid graph
breaks caused by the logger is no longer relevant. Instead, I've added this
functionality to torch dynamo:
pytorch/pytorch@16ea0dd
That commit allows the user to configure torch to ignore logger methods and
avoid the associated graph breaks.

To ignore all logger methods:
`os.environ["DISABLE_LOGS_WHILE_COMPILING"] = "1"`

To ignore logger methods except for specific ones (for example, info and
isEnabledFor):
`os.environ["DISABLE_LOGS_WHILE_COMPILING"] = "1"` and
`os.environ["LOGGER_METHODS_TO_EXCLUDE_FROM_DISABLE"] = "info,isEnabledFor"`
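
A minimal usage sketch of the configuration above; the placement is an assumption (the variables presumably need to be set before deepspeed is imported so the compile-time patching can see them):

```python
# Hedged usage sketch: skip logger methods during torch.compile while keeping
# info and isEnabledFor active. Assumes the environment variables must be set
# before deepspeed is imported.
import os

os.environ["DISABLE_LOGS_WHILE_COMPILING"] = "1"
os.environ["LOGGER_METHODS_TO_EXCLUDE_FROM_DISABLE"] = "info,isEnabledFor"

import deepspeed  # noqa: E402  (imported after configuring the environment)
```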

Signed-off-by: ShellyNR <[email protected]>
Co-authored-by: snahir <[email protected]>
Signed-off-by: Zheyu SHEN <[email protected]>
The partition tensor doesn't need to be moved to the current device when
meta load is used.

Signed-off-by: Lai, Yejing <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
Signed-off-by: Zheyu SHEN <[email protected]>
…t` (deepspeedai#7069)

With future changes coming to pip/python/etc., we need to stop calling
`python setup.py ...` and replace those invocations, per:
https://packaging.python.org/en/latest/guides/modernize-setup-py-project/#should-setup-py-be-deleted


![image](https://github.com/user-attachments/assets/ea39ef7b-3cbe-4916-86f0-bc46a5fce96d)

This means we need to install the `build` package, which is added here as
well.

Additionally, we pass the `--sdist` flag to build only the sdist rather
than the wheel.
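
A hedged sketch of the replacement invocation (commands only; the actual CI wiring in this PR may differ):

```python
# Hedged sketch: build only the sdist via the `build` package instead of the
# legacy `python setup.py sdist` call.
import subprocess
import sys

subprocess.run([sys.executable, "-m", "pip", "install", "build"], check=True)
subprocess.run([sys.executable, "-m", "build", "--sdist"], check=True)
```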

---------

Signed-off-by: Logan Adams <[email protected]>
Signed-off-by: Zheyu SHEN <[email protected]>
Add deepseekv3 autotp.

Signed-off-by: Lai, Yejing <[email protected]>
Signed-off-by: Zheyu SHEN <[email protected]>