Variable batch size and LR scheduler #7020

Open · wants to merge 603 commits into base: master
Conversation

@bm-synth (Contributor) commented Feb 8, 2025

Background and rationale

In many use cases, particularly LLMs, one is faced with inputs (sentences) of variable lengths. A common practice is to pack batches by token count rather than by a fixed number of sentences, i.e. to group sentences so that a given metric (e.g. the sum of sequence lengths) adds up to a user-provided value. As an example, in Attention is All You Need, section 5.1:

Sentence pairs were batched together by approximate sequence length. Each training batch contained a set of sentence pairs containing approximately 25000 source tokens and 25000 target tokens.

Dynamic batch sizes have been requested in DeepSpeed issue 1051, DeepSpeed issue 3455, PyTorch Lightning issue 16914, and Hugging Face issue 2647, and are already available in several libraries, e.g. NVIDIA Triton and Meta FairSeq (implementation here).

The immediate use case for this is when one needs to maximize GPU utilization. Moreover, this is particularly relevant for curriculum learning, where a BxTxE (Batch x Time x Embedding)-shaped input should ideally have high B and low T at the early curriculum steps (many short sentences packed together as a batch), and low B and high T at the late steps (few long sentences in the batch). A dynamic size T is already supported by DeepSpeed, e.g. in the documentation for pipeline parallelism's reset_activation_shape():

For curriculum learning that changes the seqlen of each sample, we need to call this whenever the seqlen is going to change.

However, dynamic B is not supported. A dynamic B requires an adequate increase/decrease of the learning rate. This technique has been applied before, and the two most common LR scaling rules are described as:

  1. Linear Scaling Rule: "When the minibatch size is multiplied by k, multiply the learning rate by k", as in Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, Goyal et al.
  2. Square Root Scaling Rule: "when multiplying the batch size by k, multiply the learning rate by √k, to keep the variance in the gradient expectation constant", as in One weird trick for parallelizing convolutional neural networks, A. Krizhevsky.

In practice, the user picks the total token count per batch as the metric that drives batching, instead of batching by sentence count. At runtime, the variable batch size is computed and the LR is adjusted accordingly, based on the reference LR and batch size provided in the config.
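As a rough sketch of the two scaling rules above (scale_lr is the name used later in this PR's overview, but the signature and body here are illustrative assumptions, not the actual implementation):

```
# Sketch only: scale_lr exists in this PR, but this signature/body is illustrative.
def scale_lr(base_batch_size, batch_size, base_lr, method="linear"):
    """Scale the reference LR when the effective batch size changes by a factor k."""
    k = batch_size / base_batch_size
    if method == "linear":   # Goyal et al.: multiply the LR by k
        return base_lr * k
    if method == "sqrt":     # Krizhevsky: multiply the LR by sqrt(k)
        return base_lr * k ** 0.5
    if method is None:       # scaling disabled
        return base_lr
    raise ValueError(f"unknown lr_scaling_method: {method}")
```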

Illustration of dynamic batch size, sequence length and LR

Imagine we picked a limit of 30 tokens per batch and set a reference lr=1e-3 for a train_batch_size=2 (in the DeepSpeed config). The batching algorithm for curriculum learning may pack the data into batches of short sentences (left) at the early stages, and batches of long sentences (right) at the later stages, e.g.:

dynamic_batch_size_and_lr

Above, we collected samples until we filled up the batch with at most 30 tokens. The batch sizes (number of samples) then became 10 and 4 in the left and right examples, respectively. Using the linear scaling rule, the LRs for those batches become 5e-3 and 2e-3.
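As a quick numeric check of this example (the individual sentence lengths are made up; only the 30-token budget, lr=1e-3 and train_batch_size=2 come from the text):

```
max_tokens, base_lr, base_batch_size = 30, 1e-3, 2

early_stage = [3] * 10     # 10 short sentences of 3 tokens -> 30 tokens, B=10
late_stage = [7, 8, 7, 8]  # 4 long sentences               -> 30 tokens, B=4

for seqlens in (early_stage, late_stage):
    assert sum(seqlens) <= max_tokens
    lr = base_lr * len(seqlens) / base_batch_size  # linear scaling rule
    print(len(seqlens), lr)                        # -> 10 0.005, then 4 0.002
```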

Pipeline parallelism

Pipeline parallelism requires the same batch size and the same sequence length across all micro-batches in a batch, as the activation sizes must be fixed between gradient accumulation steps. Between batches, these may change, as long as engine.reset_activation_shape() is called so that the new shapes are communicated on the first gradient accumulation step of the batch. Enforcing a similar BxTxE across batches may lead to smaller micro-batches. As an example, below is an illustration of a 2-node, 2-gradient-accumulation-step (i.e. 4 micro-batches) batching for the same dataset, when preparing data for regular DDP (left) and for the pipeline parallelism use case (right):

dynamic_batch_size_and_lr_microbatching

We can see that the pipeline use case (right) has the same BxTxE shape across all 4 micro-batches in the same batch; in order to respect that, it packs fewer samples per batch when compared to the standard use case (left-hand side).
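As a minimal sketch of what this constraint implies (the helper below, its policy of dropping surplus samples and the pad value are illustrative assumptions, not necessarily this PR's exact behaviour):

```
def equalize_microbatches(microbatches, pad_token_id=0):
    """Force the same B (sample count) and T (sequence length) across all
    micro-batches of a batch, as required by pipeline parallelism."""
    same_b = min(len(mb) for mb in microbatches)                 # same B everywhere
    same_t = max(len(s) for mb in microbatches for s in mb)      # same T everywhere
    return [[s + [pad_token_id] * (same_t - len(s)) for s in mb[:same_b]]
            for mb in microbatches]

# Two micro-batches with different B and T become 2x2 samples of length 4 each.
print(equalize_microbatches([[[1, 2], [3, 4, 5], [6]], [[7, 8, 9, 10], [11]]]))
```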

Attention Head

For an input of size BxTxE, the attention has a shape of TxT when the mask is fixed across samples of the same size, or BxTxT when there is a different mask per sample (as when samples have different sizes, like in the dataset above). This 3D attention matrix can be illustrated for DDP micro-batch 1 (picture above, top-left, 4 sentences) as:

dynamic_batch_size_and_lr_attn_matrix

Note the memory savings: the attention head has a size of BxTxT, i.e. a linear memory dependency on the batch size B and a quadratic memory dependency on the largest sequence length T in the (micro-)batch. Thus, supporting a dynamic T allows for an increase of B.
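For intuition, such a BxTxT mask can be built from per-sample lengths roughly as follows (plain PyTorch, not code from this PR; the four lengths are made up to match the four sentences of the micro-batch above):

```
import torch

def variable_len_attention_mask(seqlens):
    """BxTxT mask: entry (b, i, j) is True when tokens i and j are both
    real (non-padding) tokens of sample b."""
    max_t = max(seqlens)
    valid = torch.arange(max_t)[None, :] < torch.tensor(seqlens)[:, None]  # BxT
    return valid[:, :, None] & valid[:, None, :]                           # BxTxT

print(variable_len_attention_mask([5, 2, 3, 4]).shape)  # torch.Size([4, 5, 5])
```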

PR overview

This PR implements dynamic batching and LR scaling. The required dataloader and LR scheduler can be retrieved by calling get_dataloader_and_lr_scheduler_for_variable_batch_size. A short explanation of that function follows:

  • The logic behind the LR scaling algorithms is in scale_lr;
  • The partitioning of samples into batches is done by batch_by_seqlen (a rough sketch of this packing step follows the list);
  • For pipeline parallelism, all micro-batches in a pipeline pass are required to have the same activation shapes. This is enabled by setting the following parameters to True:
    • required_microbatches_of_same_sizes, which forces the B dimension to be the same across all gradient accumulation steps of all dataloaders in a batch;
    • required_microbatches_of_same_lengths, which forces the T dimension to be the same across all gradient accumulation steps. This works by calling the user-provided sample_padding_fn(sentence, len) that pads a given sentence to the argument length;
    • batch_by_seqlen returns microbatch_sample_ids (the list of sample ids per micro-batch), batch_sizes (the effective batch sizes) and batch_max_seqlens (the longest sequence across all micro-batches in a batch);
  • dataloader_for_variable_batch_size relies on microbatch_sample_ids and will iterate/collate/pad samples for every batch, returning a dataloader that iterates over the final (variable-size) batches;
  • lr_scheduler_for_variable_batch_size relies on batch_sizes to compute the learning rate for each effective batch, taking into account the batch size and LR in the config file, and scaling the LR based on the size of each effective batch and the scaling rule mentioned above (linear, square root, etc.).
    • A special note on the returned lr_scheduler, which will accept either:
      1. a user-provided Optimizer, in which case it will scale the learning rates (in the param groups) at every batch, or
      2. a user-defined LRScheduler, in which case it will first get the learning rate from the scheduler and then scale it accordingly.
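As a rough sketch of the packing step performed by batch_by_seqlen (the function name is from this PR; the signature, return values and greedy strategy below are illustrative assumptions, not the actual implementation):

```
def pack_by_token_budget(seqlens, max_tokens, min_batch_size=1, max_batch_size=10):
    """Greedily group sample ids so each pack stays within max_tokens and
    within [min_batch_size, max_batch_size] samples.
    Assumes every individual seqlen <= max_tokens."""
    packs, current, current_tokens = [], [], 0
    for sample_id, seqlen in enumerate(seqlens):
        if current and (current_tokens + seqlen > max_tokens or len(current) == max_batch_size):
            if len(current) >= min_batch_size:
                packs.append(current)
            current, current_tokens = [], 0
        current.append(sample_id)
        current_tokens += seqlen
    if len(current) >= min_batch_size:
        packs.append(current)
    batch_sizes = [len(p) for p in packs]
    batch_max_seqlens = [max(seqlens[i] for i in p) for p in packs]
    return packs, batch_sizes, batch_max_seqlens
```

Under this sketch, packs plays the role of microbatch_sample_ids described above.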

Example

An example of the use case with and without pipelining is provided in deepspeed/runtime/data_pipeline/data_sampling/variable_batch_size_and_lr_example.py. The example shows an attention head with a variable-sized BxTxT attention per batch, followed by a fixed-size feed-forward network; these are the main blocks of a Large Language Model. The feed-forward (or linear) layer that follows the attention head requires a constant input size, equivalent to the largest sentence in the whole dataset, so the output of the attention must be padded (see feedforward: needs to convert BxTxE to BxMxE by padding extra tokens in the code).
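That padding step (BxTxE to BxMxE, with M being the longest sentence in the dataset) can be sketched as follows; this is illustrative and not the example file's exact code:

```
import torch
import torch.nn.functional as F

def pad_time_dim(x, max_seqlen):
    """Zero-pad a BxTxE activation to BxMxE so the fixed-size
    feed-forward layer that follows can consume it."""
    t = x.shape[1]                               # current (longest-in-batch) T
    return F.pad(x, (0, 0, 0, max_seqlen - t))   # pad the T (time) dimension on the right

x = torch.randn(4, 5, 16)          # B=4, T=5, E=16 (made-up sizes)
print(pad_time_dim(x, 12).shape)   # torch.Size([4, 12, 16])
```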

The output of the variable_batch_size_and_lr_example is the following:

Config

The example file also documents the relevant DeepSpeed config entries with comments:

config = {
  "train_batch_size": 16,
  # `train_micro_batch_size_per_gpu` tells how many sequence packs of `max_tokens` each will be collated together.
  #  I.e. the number of tokens per micro-batch (i.e. per GPU iteration) is `train_micro_batch_size_per_gpu`*`max_tokens`.
  "train_micro_batch_size_per_gpu": 2,
  "data_efficiency": {
    "enabled": True,
    # seed to be applied to all data efficiency modules, including dynamic batching
    "seed": 42,
    "data_sampling": {
      "num_workers": 0, # dataloader num_workers argument
      "pin_memory": False,  # dataloader pin_memory argument
      "dynamic_batching": {
        # enables or disables dynamic batching
        "enabled": True,
        # how many tokens we need to fill a pack of sequences (that will be collated together as a sample)
        "max_tokens": 100,
        # Path where the length of every sequence is read from or written to.
        # Sequence lengths will be loaded from: {metrics_path}/seqlen/seqlen_sample_to_metric.bin and *.idx
        # If the files don't exist, they'll be computed and saved on the first run, and loaded on subsequent runs.
        "metrics_path": "./curriculum_output/",
        # As the batch size increases/decreases, which method to use to scale the LR accordingly?
        # Options: linear, sqrt (square root), or None to disable
        "lr_scaling_method": "linear",
        # how to pick sentences to be packed into samples:
        # - dataloader: by same order as they come in with the dataloader
        # - seqlen: by sequence length (shortest to longest)
        # - random: random order using the seed in config['data_efficiency']['seed']
        "sentence_picking_order": "dataloader",  # "random" / "seqlen" / "dataloader"
        # minimum number of sequences required to reach `max_tokens`. If a sentence pack is smaller, it's discarded.
        "min_batch_size": 1,
        # maximum number of sequences allowed when packing up to `max_tokens`. If a sentence pack is larger, it's discarded.
        "max_batch_size": 10,
        # enable the output of microbatching information about sentence packing
        "verbose": True,
      },
    },
  },
}

Future work

A follow-up PR will enable dynamic batching when calling deepspeed.initialize. I.e. instead of this:

engine, _, _, _ = deepspeed.initialize(config=config, model=model)
dataloader, lr_scheduler, _ = get_dataloader_and_lr_scheduler_for_variable_batch_size_deepspeed(...)
engine.lr_scheduler = lr_scheduler

we'd ideally have this:

engine, _, dataloader, lr_scheduler = deepspeed.initialize(config=config, model=model)

where initialize will internally call get_dataloader_and_lr_scheduler_for_variable_batch_size_deepspeed.

xuedinge233 and others added 30 commits February 8, 2025 23:02
@bm-synth bm-synth marked this pull request as ready for review February 9, 2025 00:12
@bm-synth bm-synth marked this pull request as draft February 9, 2025 15:12
@bm-synth bm-synth marked this pull request as ready for review February 11, 2025 02:22
@loadams (Collaborator) commented Feb 11, 2025

Thanks @bm-synth - can you take a look at the formatting and DCO errors?
