Enable torch.autocast with ZeRO #6993

tohtana · 2025-02-03T07:19:20Z

DeepSpeed supports mixed precision training, but its behavior differs from torch.autocast. DeepSpeed keeps parameters and gradients in both FP32 and a lower precision (FP16/BF16), in the style of NVIDIA Apex AMP, and computes all modules in the lower precision. torch.autocast keeps parameters in FP32 and computes only certain operators in the lower precision.
This leads to differences in:

- performance: torch.autocast needs to downcast during forward/backward
- memory usage: DeepSpeed needs more memory to keep copies of parameters and gradients in the lower precision
- accuracy: torch.autocast has a list of operators that can safely be computed in the lower precision. Precision-sensitive operators (e.g. softmax) are computed in FP32.
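The torch.autocast side of this comparison can be observed directly. The following is a minimal sketch, assuming PyTorch is installed; it uses CPU autocast with bfloat16 only so that it runs without a GPU, and the layer sizes are arbitrary.

```python
import torch

# A single linear layer; its parameters are created in FP32.
layer = torch.nn.Linear(8, 4)
x = torch.randn(2, 8)

# Inside the autocast region, linear is one of the operators that runs
# in the lower precision. The stored parameters are not modified.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = layer(x)

weight_dtype = layer.weight.dtype  # dtype of the stored parameter
output_dtype = y.dtype             # dtype produced by the autocast region
print(weight_dtype, output_dtype)
```

The parameter stays in FP32 while the output of the linear operator is in bfloat16, which is the "FP32 parameters, selected operators in lower precision" behavior described above.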

To align DeepSpeed's behavior with torch.autocast when necessary, this PR integrates torch.autocast with ZeRO. Here is an example configuration.

"torch_autocast": {
  "enabled": true,
  "dtype": "bfloat16",
  "lower_precision_safe_modules": ["torch.nn.Linear", "torch.nn.Conv2d"]
}

Each option works as follows:

- enabled: Enables the integration with torch.autocast when set to true. You don't need to call torch.autocast in your code. The grad scaler is also applied in the DeepSpeed optimizer.
- dtype: The lower precision dtype passed to torch.autocast. Gradients for allreduce (reduce-scatter) and parameters for allgather (ZeRO3 only) of lower_precision_safe_modules are also downcast to this dtype.
- lower_precision_safe_modules: Downcasting for allreduce (reduce-scatter) and allgather (ZeRO3) is applied only to the modules specified in this list. (The precision of PyTorch operators in forward/backward follows torch.autocast's policy, not this list.) Specify class names together with their packages. If you don't set this item, DeepSpeed uses the default list: [torch.nn.Linear, torch.nn.Conv1d, torch.nn.Conv2d, torch.nn.Conv3d].
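Because lower_precision_safe_modules takes class names with their packages, the strings have to be turned into class objects at some point. The sketch below shows one way this could be done with importlib; `resolve_class` and `resolve_classes` are hypothetical helpers for illustration, not DeepSpeed's implementation. It is demonstrated with standard-library classes so that it runs without torch; a name such as "torch.nn.Linear" would resolve the same way when torch is installed.

```python
import importlib


def resolve_class(qualified_name: str) -> type:
    """Resolve a dotted name such as 'torch.nn.Linear' to a class object."""
    module_name, _, class_name = qualified_name.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'package.ClassName', got {qualified_name!r}")
    obj = getattr(importlib.import_module(module_name), class_name)
    if not isinstance(obj, type):
        raise TypeError(f"{qualified_name!r} is not a class")
    return obj


def resolve_classes(names):
    """Resolve a list of dotted names into a tuple usable with isinstance()."""
    return tuple(resolve_class(name) for name in names)


# Standard-library stand-ins for the module classes named in the config.
safe_types = resolve_classes(["collections.OrderedDict", "decimal.Decimal"])
print(safe_types)
```

A tuple like `safe_types` can be passed to `isinstance(module, safe_types)` to decide whether a given module's communication should be downcast.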

Note that only FP32 parameters are maintained when this feature is enabled. For consistency, you cannot enable fp16 or bf16 in the DeepSpeed config at the same time.
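The restriction above can be checked before a config is handed to DeepSpeed. This is a minimal sketch: `check_autocast_config` is a hypothetical helper, and the batch size and ZeRO stage are illustrative values that are not taken from this PR. Only the torch_autocast section mirrors the example above.

```python
import json


def check_autocast_config(ds_config: dict) -> dict:
    """Hypothetical check: torch_autocast must not be combined with fp16/bf16."""
    autocast = ds_config.get("torch_autocast", {})
    if autocast.get("enabled", False):
        for section in ("fp16", "bf16"):
            if ds_config.get(section, {}).get("enabled", False):
                raise ValueError(
                    f"'{section}' cannot be enabled together with 'torch_autocast'"
                )
    return ds_config


ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # illustrative value
    "zero_optimization": {"stage": 3},    # illustrative value
    "torch_autocast": {
        "enabled": True,
        "dtype": "bfloat16",
        "lower_precision_safe_modules": ["torch.nn.Linear", "torch.nn.Conv2d"],
    },
}

check_autocast_config(ds_config)
print(json.dumps(ds_config, indent=2))
```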

The NVIDIA Blackwell GPU generation has number 10. The SM code and architecture should be `100`, but the current code generates `1.` because it expects a two-character string. This change treats the value as a string that contains a `.`, splits it on that character, and uses the resulting list of strings. Signed-off-by: Fabien Dupont <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>
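The failure mode described here can be reproduced in a few lines. This is a sketch only: both function names are hypothetical, and the fixed-position indexing in `legacy_sm_code` is an assumption that happens to reproduce the `1.` output mentioned above, not a copy of the actual DeepSpeed code.

```python
def legacy_sm_code(compute_capability: str) -> str:
    # Assumed old behavior: take the characters at positions 0 and 2,
    # which only works for strings like "8.6" with a one-digit major version.
    return compute_capability[0] + compute_capability[2]


def sm_code(compute_capability: str) -> str:
    # Split on the dot so that multi-digit major versions are handled.
    major, minor = compute_capability.split(".")
    return major + minor


print(legacy_sm_code("8.6"), legacy_sm_code("10.0"))
print(sm_code("8.6"), sm_code("10.0"))
```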

Same as [this PR](#6922). [affeb88](affeb88) I noticed the CI updated the DCO check recently. Using the suggested rebase method for sign-off would reintroduce many conflicts, so I opted for a squash merge with sign-off instead. thanks: ) Signed-off-by: inkcherry <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

Those files have code that runs when they are imported, so on systems that don't support triton but have it installed, this causes issues. In general, I think it is better to import triton only when it is both installed and supported. Signed-off-by: Omar Elayan <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

BUGFIX for Apple Silicon hostname #6497 --------- Signed-off-by: Fabien Dupont <[email protected]> Signed-off-by: Olatunji Ruwase <[email protected]> Signed-off-by: Logan Adams <[email protected]> Signed-off-by: inkcherry <[email protected]> Signed-off-by: Roman Fitzjalen <[email protected]> Co-authored-by: Logan Adams <[email protected]> Co-authored-by: Fabien Dupont <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Liangliang Ma <[email protected]> Co-authored-by: inkcherry <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

- Update existing workflows that use cu121 to cu124. Note, this means that where we download torch latest, we will now be getting torch 2.6 rather than the torch latest 2.5 provided with cuda 12.1. - Note, nv-nightly is failing in master currently due to unrelated errors, so this could be ignored in this PR (nv-nightly tested locally, where it passes with 12.1 and it also passes with 12.4). --------- Signed-off-by: Fabien Dupont <[email protected]> Signed-off-by: Logan Adams <[email protected]> Signed-off-by: Olatunji Ruwase <[email protected]> Signed-off-by: inkcherry <[email protected]> Signed-off-by: Omar Elayan <[email protected]> Co-authored-by: Fabien Dupont <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Liangliang Ma <[email protected]> Co-authored-by: inkcherry <[email protected]> Co-authored-by: Omar Elayan <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

@fukun07 and I discovered a bug when using the `offload_states` and `reload_states` APIs of the Zero3 optimizer. When using grouped parameters (for example, in weight decay or grouped lr scenarios), the order of the parameters mapping in `reload_states` ([here](https://github.com/deepspeedai/DeepSpeed/blob/14b3cce4aaedac69120d386953e2b4cae8c2cf2c/deepspeed/runtime/zero/stage3.py#L2953)) does not correspond with the initialization of `self.lp_param_buffer` ([here](https://github.com/deepspeedai/DeepSpeed/blob/14b3cce4aaedac69120d386953e2b4cae8c2cf2c/deepspeed/runtime/zero/stage3.py#L731)), which leads to misaligned parameter loading. This issue was overlooked by the corresponding unit tests ([here](https://github.com/deepspeedai/DeepSpeed/blob/master/tests/unit/runtime/zero/test_offload_states.py)), so we fixed the bug in our PR and added the corresponding unit tests. --------- Signed-off-by: Wei Wu <[email protected]> Co-authored-by: Masahiro Tanaka <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

Following changes in PyTorch trace rules, my previous PR to avoid graph breaks caused by the logger is no longer relevant. Instead, I've added this functionality to torch dynamo: pytorch/pytorch@16ea0dd. This commit allows the user to configure torch to ignore logger methods and avoid the associated graph breaks. To ignore all logger methods, set os.environ["DISABLE_LOGS_WHILE_COMPILING"] = "1". To ignore logger methods except for specific ones (for example, info and isEnabledFor), set os.environ["DISABLE_LOGS_WHILE_COMPILING"] = "1" and os.environ["LOGGER_METHODS_TO_EXCLUDE_FROM_DISABLE"] = "info, isEnabledFor". Signed-off-by: ShellyNR <[email protected]> Co-authored-by: snahir <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

…t` (#7069) With future changes coming to pip/python/etc, we need to modify to no longer call `python setup.py ...` and replace it instead: https://packaging.python.org/en/latest/guides/modernize-setup-py-project/#should-setup-py-be-deleted ![image](https://github.com/user-attachments/assets/ea39ef7b-3cbe-4916-86f0-bc46a5fce96d) This means we need to install the build package which is added here as well. Additionally, we pass the `--sdist` flag to only build the sdist rather than the wheel as well here. --------- Signed-off-by: Logan Adams <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

tjruwase and others added 30 commits February 28, 2025 22:53

Use ds-specific module id to avoid conflicts (#6847)

a4fbc3a

Fix #6772 --------- Co-authored-by: Logan Adams <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

add autocast support and ds_config item

9d48dc6

Signed-off-by: Masahiro Tanaka <[email protected]>

prepare ipg buckets for multiple dtypes

8957009

Signed-off-by: Masahiro Tanaka <[email protected]>

switch communication data type

c390c76

Signed-off-by: Masahiro Tanaka <[email protected]>

add gradscaler

458797d

Signed-off-by: Masahiro Tanaka <[email protected]>

Update A6000 workflows to use newer docker container - 24.09 vs 24.03 (…

2984415

…#6967) - Issues with nv-sd updates, will follow up with a subsequent PR Signed-off-by: Masahiro Tanaka <[email protected]>

fix import and formatting

6b6a600

Signed-off-by: Masahiro Tanaka <[email protected]>

convert comm type for z3

2817c02

Signed-off-by: Masahiro Tanaka <[email protected]>

Update GH org references (#6998)

fcdeda3

Signed-off-by: Olatunji Ruwase <[email protected]> Signed-off-by: Logan Adams <[email protected]> Signed-off-by: Fabien Dupont <[email protected]> Co-authored-by: Fabien Dupont <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

Update CNAME

b476d07

Signed-off-by: Masahiro Tanaka <[email protected]>

Update CNAME

72f9687

Signed-off-by: Masahiro Tanaka <[email protected]>

[XPU] max1100 workflow update for docker and softwares (#7003)

a2425da

1. update intel oneAPI basekit to 2025.0 2. update torch/ipex/oneccl to 2.5 Signed-off-by: Masahiro Tanaka <[email protected]>

cast dtype for allgather

1c1d43a

Signed-off-by: Masahiro Tanaka <[email protected]>

Update A6000 tests transformers version (#7016)

359c85d

Signed-off-by: Logan Adams <[email protected]> Co-authored-by: Stas Bekman <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

Fix ds-chat CI regression (#7015)

735fc2c

Fix #7014 Avoid naming collision on `partition()` --------- Signed-off-by: Olatunji Ruwase <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

[Ulysses tutorial] typos (#7024)

1e7888c

Fix typos Signed-off-by: Masahiro Tanaka <[email protected]>

[ROCm] Enable fp_quantizer on ROCm (#7027)

f1aea5d

This change is required to successfully build fp_quantizer extension on ROCm. --------- Co-authored-by: Logan Adams <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

add gds chinese blog (#7034)

8152824

cc @tjruwase @jomayeri --------- Co-authored-by: root <root@ftqtmec25000000.taxzvufipdhelhupulxcbvr15f.ux.internal.cloudapp.net> Signed-off-by: Masahiro Tanaka <[email protected]>

Add chinese blog for deepspeed windows, and fix format (#7035)

c898ac5

Fix #7029 - Add Chinese blog for deepspeed windows - Fix format in README.md Co-authored-by: Logan Adams <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

AIO on ROCM (#7023)

e3ea926

Adding compile support for AIO library on AMD GPUs. --------- Co-authored-by: Olatunji Ruwase <[email protected]> Co-authored-by: Logan Adams <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

Control trace cache warnings (#7039)

e946615

Make trace cache warnings configurable, and disabled by default. Fix #6985, #4081, #5033, #5006, #5662 --------- Signed-off-by: Olatunji Ruwase <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

Update CUDA compute capability to support Blackwell (#7047)

38e9bf3

Update CUDA compute capability for cross compile according to wiki page. https://en.wikipedia.org/wiki/CUDA#GPUs_supported --------- Signed-off-by: Hongwei <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

Update setup.py handling of ROCm cupy (#7051)

acc6a1e

Signed-off-by: Masahiro Tanaka <[email protected]>

nv-ds-chat breaks with latest transformers (#7052)

bda1430

Signed-off-by: Logan Adams <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

Rename aio_thread_count to intra_op_parallelism (#7056)

c184b16

Propagate API change. Signed-off-by: Olatunji Ruwase <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

loadams and others added 13 commits February 28, 2025 22:53

Update README with info on newest accelerator (#7065)

3f5cd1a

Signed-off-by: Logan Adams <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

Fix TOCTOU issues, switch to fstat (#7067)

877c30e

Signed-off-by: Logan Adams <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

Fix meta load tensor imcompatible issue (#7073)

060aa5a

The partition tensor doesn't need to move to the current device when meta load is used. Signed-off-by: Lai, Yejing <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

Revert "Handle special case of libuv for Windows (#7064)" (#7076)

f99605b

This reverts commit 8577bd2. Fixes: #7072 Signed-off-by: Masahiro Tanaka <[email protected]>

Add DeepseekV3 AutoTP. (#7045)

57805b2

Add deepseekv3 autotp. Signed-off-by: Lai, Yejing <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

Improve inference tutorial docs (#7083)

697050e

Fixes: #7082 --------- Signed-off-by: Logan Adams <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

Pin transformers version on tests that use latest. (#7085)

83c9461

Latest transformers causes failures when cpu-torch-latest test, so we pin it for now to unblock other PRs. --------- Signed-off-by: Logan Adams <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

Update README.md with ICS '23 MoE paper link (#7087)

c6bf7fb

Co-authored-by: Logan Adams <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

Update parallelism for nv-torch-latest/nightly tests due to more GPUs…

2c36865

…/runner (#7086) Signed-off-by: Logan Adams <[email protected]> Co-authored-by: Olatunji Ruwase <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

Remove workflows for very old torch versions (#7090)

f2b89ec

These jobs haven't been run in a long time and were originally used when compatibility with torch <2 was more important. Signed-off-by: Logan Adams <[email protected]> Signed-off-by: Masahiro Tanaka <[email protected]>

tohtana force-pushed the tohtana/support_autocast branch from 453cc16 to f2b89ec Compare February 28, 2025 22:54

tohtana and others added 8 commits February 28, 2025 14:55

Merge branch 'master' into tohtana/support_autocast

965cb2b

clear reduce buffer

981e8e2

Signed-off-by: Masahiro Tanaka <[email protected]>

add config to set lower precision modules

37f77ae

Signed-off-by: Masahiro Tanaka <[email protected]>

fix to use comm dtype in config when autocast is disabled

3083d94

Signed-off-by: Masahiro Tanaka <[email protected]>

Merge branch 'master' into tohtana/support_autocast

d688b75

add tests

aa60eb3

Signed-off-by: Masahiro Tanaka <[email protected]>

sort dtypes

9529830

Signed-off-by: Masahiro Tanaka <[email protected]>

Merge branch 'master' into tohtana/support_autocast

c8056a8

tohtana marked this pull request as ready for review March 5, 2025 23:08

tohtana requested review from tjruwase and loadams as code owners March 5, 2025 23:08

tohtana and others added 5 commits March 6, 2025 01:46

fix for cases where param and param.ds_tensor have different dtypes

c56339c

Signed-off-by: Masahiro Tanaka <[email protected]>

Merge branch 'master' into tohtana/support_autocast

1d6ed6e

fix moe tests

aa10e11

Signed-off-by: Masahiro Tanaka <[email protected]>

fix tests for opt state offloading

26e62e4

Signed-off-by: Masahiro Tanaka <[email protected]>

fix var name

a74fa1e

Signed-off-by: Masahiro Tanaka <[email protected]>
