Commit

Update

[ghstack-poisoned]

tianyu-l committed Feb 3, 2025
2 parents fa01d7c + eaea6f6 commit c07b666
Showing 45 changed files with 559 additions and 562 deletions.
46 changes: 0 additions & 46 deletions .github/workflows/integration_test_4gpu.yaml

This file was deleted.

12 changes: 8 additions & 4 deletions .github/workflows/integration_test_8gpu.yaml
@@ -5,8 +5,8 @@ on:
branches: [ main ]
pull_request:
schedule:
# Runs nightly
- cron: '0 0 * * *'
# Runs every 6 hours
- cron: '0 */6 * * *'
concurrency:
group: unit-test${{ github.workflow }}-${{ github.ref == 'refs/heads/main' && github.run_number || github.ref }}
cancel-in-progress: true
@@ -21,7 +21,7 @@ jobs:
with:
runner: linux.g5.48xlarge.nvidia.gpu
gpu-arch-type: cuda
gpu-arch-version: "12.1"
gpu-arch-version: "12.4"
# This image is faster to clone than the default, but it lacks CC needed by triton
# (1m25s vs 2m37s).
docker-image: torchtitan-ubuntu-20.04-clang12
@@ -37,5 +37,9 @@ jobs:
pip config --user set global.progress_bar off
python -m pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu124
# install torchtitan to test the files in ./scripts
python -m pip install -e .
mkdir artifacts-to-be-uploaded
python ./test_runner.py artifacts-to-be-uploaded --ngpu 8
python ./tests/integration_tests.py artifacts-to-be-uploaded --ngpu 8
2 changes: 1 addition & 1 deletion .github/workflows/unit_test_cpu.yaml
@@ -25,4 +25,4 @@ jobs:
pip config --user set global.progress_bar off
pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
pytest test --cov=. --cov-report=xml --durations=20 -vv
pytest tests/unit_tests --cov=. --cov-report=xml --durations=20 -vv
1 change: 1 addition & 0 deletions .gitignore
@@ -5,6 +5,7 @@ __pycache__
build
outputs
dist/*
.vscode

# data
data
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -24,7 +24,7 @@ repos:
files: \.py$
args:
- --license-filepath
- docs/license_header.txt
- assets/license_header.txt

- repo: https://github.com/pycqa/flake8
rev: 34cbf8ef3950f43d09b85e2e45c15ae5717dc37b
8 changes: 4 additions & 4 deletions CONTRIBUTING.md
@@ -62,7 +62,7 @@ It is the contributor’s responsibility to justify the change. The requirements
- If a change is expected to impact computation results, loss converging should be verified via end-to-end training on representable datasets (e.g. Llama 3 models on the C4 dataset). Please refer to the recommended practices in [converging.md](docs/converging.md).

#### Performance
- Memory and WPS / MFU, which are available from logging, should meet expectations.
- Memory and TPS / MFU, which are available from logging, should meet expectations.
- It is worth noting that performance expectations vary from case to case. For example, there are cases when a technique targeting memory reduction may cause throughput regression but still be acceptable (e.g. activation checkpointing). Again, it is the contributor's job to justify the feature, whether by achieving hypothetical performance, or by comparing with existing well-known implementations, etc.
- If necessary, verify the numbers on jobs spanning multiple nodes (e.g. on 64 GPUs). Please reach out to the `torchtitan` team for help if you are resource-constrained.
- When appropriate, one should show profile traces and/or memory snapshots to prove the effectiveness.
@@ -72,9 +72,9 @@ It is the contributor’s responsibility to justify the change. The requirements
When appropriate, one should consider

- Adding CPU/GPU unit/integration tests.
- To add a unit test, put it in the [test](test/) folder and follow the existing test files.
- To add a GPU integration test, create a new `OverrideDefinitions` in [test_runner.py](test_runner.py). It will override the default config to run on the [debug model](train_configs/debug_model.toml).
- To add a unit test, put it in the [tests](tests/) folder and follow the existing test files.
- To add a GPU integration test, create a new `OverrideDefinitions` in [integration_tests.py](tests/integration_tests.py). It will override the default config to run on the [debug model](train_configs/debug_model.toml).
- Updating [README](README.md) and writing a new note in the [docs](docs/) folder on installation and usage, similar to [float8.md](docs/float8.md).
- Updating [performance.md](docs/performance.md) with new performance results.
- Creating GitHub issues for things that cannot be addressed at the moment.
- Writing a post on [PyTorch Dev Discussions](https://dev-discuss.pytorch.org/c/distributed/6) forum and linking to it.
- Writing a post on [PyTorch Forums](https://discuss.pytorch.org/c/distributed/torchtitan/44) and linking to it.
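As a rough illustration of the integration-test entry mentioned in the checklist above — a minimal sketch only; the real `OverrideDefinitions` dataclass lives in `tests/integration_tests.py` and its exact fields may differ from the stand-in used here, and `--training.compile` is just an example override flag:

```python
# Hypothetical sketch of registering a new GPU integration test.
# The real OverrideDefinitions is defined in tests/integration_tests.py;
# this stand-in only mirrors its apparent shape, and the field names are assumptions.
from dataclasses import dataclass, field
from typing import Sequence


@dataclass
class OverrideDefinitions:  # stand-in for the class in tests/integration_tests.py
    override_args: Sequence[Sequence[str]] = field(default_factory=tuple)
    test_descr: str = "default"
    test_name: str = "default"
    ngpu: int = 4


# Each inner list of override_args is one training run; the flags are appended
# to the debug-model config (train_configs/debug_model.toml).
new_test = OverrideDefinitions(
    override_args=[["--training.compile"]],
    test_descr="1D compile",
    test_name="1d_compile",
    ngpu=8,
)
print(new_test)
```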
91 changes: 48 additions & 43 deletions README.md
@@ -1,57 +1,41 @@
[![4 GPU Integration Test](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_4gpu.yaml/badge.svg?branch=main)](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_4gpu.yaml?query=branch%3Amain)
[![8 GPU Integration Test](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_8gpu.yaml/badge.svg?branch=main)](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_8gpu.yaml?query=branch%3Amain)
<div align="center">

# torchtitan

#### A PyTorch native library for large-scale model training

[![integration tests](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_8gpu.yaml/badge.svg?branch=main)](https://github.com/pytorch/torchtitan/actions/workflows/integration_test_8gpu.yaml?query=branch%3Amain)
[![arXiv](https://img.shields.io/badge/arXiv-2410.06511-b31b1b.svg)](https://arxiv.org/abs/2410.06511)
[![docs](https://img.shields.io/badge/docs-latest-blue.svg)](docs/)
[![forum](https://img.shields.io/badge/pytorch-forum-DE3412.svg)](https://discuss.pytorch.org/c/distributed/torchtitan/44)
[![license](https://img.shields.io/badge/license-BSD_3--Clause-lightgrey.svg)](./LICENSE)

</div>

`torchtitan` is currently in a pre-release state and under extensive development. Currently we showcase pre-training **Llama 3.1** LLMs of various sizes from scratch. To use the latest features of `torchtitan`, we recommend using the most recent PyTorch nightly.

`torchtitan` is a proof-of-concept for Large-scale LLM training using native PyTorch. It is (and will continue to be) a repo to showcase PyTorch's latest distributed training features in a clean, minimal codebase. torchtitan is complementary to and not a replacement for any of the great large-scale LLM training codebases such as Megatron, Megablocks, LLM Foundry, Deepspeed, etc. Instead, we hope that the features showcased in torchtitan will be adopted by these codebases quickly. torchtitan is unlikely to ever grow a large community around it.
## Overview

`torchtitan` is a proof-of-concept for large-scale LLM training using native PyTorch. It is (and will continue to be) a repo to showcase PyTorch's latest distributed training features in a clean, minimal codebase. `torchtitan` is complementary to and not a replacement for any of the great large-scale LLM training codebases such as Megatron, MegaBlocks, LLM Foundry, DeepSpeed, etc. Instead, we hope that the features showcased in `torchtitan` will be adopted by these codebases quickly. `torchtitan` is unlikely to ever grow a large community around it.

Our guiding principles when building `torchtitan`:

* Designed to be easy to understand, use and extend for different training purposes.
* Minimal changes to the model code when applying 1D, 2D, or (soon) 3D Parallel.
* Minimal changes to the model code when applying multi-dimensional parallelism.
* Modular components instead of a monolithic codebase.
* Get started in minutes, not hours!

### Intro video - learn more about torchtitan in under 4 mins:
### Intro video - learn more about `torchtitan` in under 4 mins

[![Welcome to torchtitan!](assets/images/titan_play_video.png)](https://youtu.be/ee5DOEqD35I?si=_B94PbVv0V5ZnNKE "Welcome to torchtitan!")

### torchtitan paper on arXiv

[![arXiv](https://img.shields.io/badge/arXiv-2410.06511-b31b1b.svg?style=plastic)](https://arxiv.org/abs/2410.06511)

We provide a detailed look into the parallelisms and optimizations available in `torchtitan`, along with summary advice on when to use various techniques: [TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training](https://arxiv.org/abs/2410.06511).
```
@misc{torchtitan,
title={TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training},
author={Wanchao Liang and Tianyu Liu and Less Wright and Will Constable and Andrew Gu and Chien-Chin Huang and Iris Zhang and Wei Feng and Howard Huang and Junjie Wang and Sanket Purandare and Gokul Nadathur and Stratos Idreos},
year={2024},
eprint={2410.06511},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.06511},
}
```

### Dive into the code

You may want to see how the model is defined or how parallelism techniques are applied. For a guided tour, see these files first:
* [train.py](train.py) - the main training loop and high-level setup code
* [torchtitan/parallelisms/parallelize_llama.py](torchtitan/parallelisms/parallelize_llama.py) - helpers for applying Data Parallel, Tensor Parallel, activation checkpointing, and `torch.compile` to the model
* [torchtitan/parallelisms/pipeline_llama.py](torchtitan/parallelisms/pipeline_llama.py) - helpers for applying Pipeline Parallel to the model
* [torchtitan/checkpoint.py](torchtitan/checkpoint.py) - utils for saving/loading distributed checkpoints
* [torchtitan/float8.py](torchtitan/float8.py) - utils for applying Float8 techniques
* [torchtitan/models/llama/model.py](torchtitan/models/llama/model.py) - the Llama 3.1 model definition

### Key features available

1. Multi-dimensional composable parallelisms
- [FSDP2](docs/fsdp.md) with per-parameter sharding
- [Tensor Parallel](https://pytorch.org/docs/stable/distributed.tensor.parallel.html) (including [async TP](https://discuss.pytorch.org/t/distributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch/209487))
- Pipeline Parallel
- Context Parallel
- [Pipeline Parallel](https://discuss.pytorch.org/t/distributed-w-torchtitan-training-with-zero-bubble-pipeline-parallelism/214420)
- [Context Parallel](https://discuss.pytorch.org/t/distributed-w-torchtitan-breaking-barriers-training-long-context-llms-with-1m-sequence-length-in-pytorch-using-context-parallel/215082)
2. Selective layer and operator activation checkpointing
3. [Distributed checkpointing](https://discuss.pytorch.org/t/distributed-w-torchtitan-optimizing-checkpointing-efficiency-with-pytorch-dcp/211250) (including async checkpointing)
- [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine-tuning
@@ -63,8 +47,22 @@ You may want to see how the model is defined or how parallelism techniques are a
9. Loss, GPU memory, throughput (tokens/sec), and MFU displayed and logged via [Tensorboard or Weights & Biases](/docs/metrics.md)
10. [Debugging tools](docs/debugging.md) including CPU/GPU profiling, memory profiling, Flight Recorder, etc.
11. All options easily configured via [toml files](train_configs/)
12. [Helper scripts](scripts/) to
- convert original Llama 3 checkpoints into the expected DCP format
- estimate FSDP/HSDP memory usage without materializing the model
- run distributed inference with Tensor Parallel

We report our [Performance](docs/performance.md) verified on 64/128 GPUs.
We report [performance](docs/performance.md) on up to 512 GPUs, and verify [loss converging](docs/converging.md) correctness of various techniques.

### Dive into the code

You may want to see how the model is defined or how parallelism techniques are applied. For a guided tour, see these files first:
* [train.py](train.py) - the main training loop and high-level setup code
* [torchtitan/parallelisms/parallelize_llama.py](torchtitan/parallelisms/parallelize_llama.py) - helpers for applying Data Parallel, Tensor Parallel, activation checkpointing, and `torch.compile` to the model
* [torchtitan/parallelisms/pipeline_llama.py](torchtitan/parallelisms/pipeline_llama.py) - helpers for applying Pipeline Parallel to the model
* [torchtitan/checkpoint.py](torchtitan/checkpoint.py) - utils for saving/loading distributed checkpoints
* [torchtitan/float8.py](torchtitan/float8.py) - utils for applying Float8 techniques
* [torchtitan/models/llama/model.py](torchtitan/models/llama/model.py) - the Llama 3.1 model definition


## Installation
@@ -96,7 +94,7 @@ Llama 3 8B model locally on 8 GPUs
CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh
```

## Multi-Node Training
### Multi-Node Training
For training on ParallelCluster/Slurm type configurations, you can use the `multinode_trainer.slurm` file to submit your sbatch job.

To get started adjust the number of nodes and GPUs
@@ -111,15 +109,22 @@ Then start a run where `nnodes` is your total node count, matching the sbatch no
srun torchrun --nnodes 2
```

If your gpu count per node is not 8, adjust:

```--nproc_per_node```
If your gpu count per node is not 8, adjust `--nproc_per_node` in the torchrun command and `#SBATCH --gpus-per-task` in the SBATCH command section.

in the torchrun command and
## Citation

```#SBATCH --gpus-per-task```

in the SBATCH command section.
We provide a detailed look into the parallelisms and optimizations available in `torchtitan`, along with summary advice on when to use various techniques: [TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training](https://arxiv.org/abs/2410.06511).
```
@misc{torchtitan,
title={TorchTitan: One-stop PyTorch native solution for production ready LLM pre-training},
author={Wanchao Liang and Tianyu Liu and Less Wright and Will Constable and Andrew Gu and Chien-Chin Huang and Iris Zhang and Wei Feng and Howard Huang and Junjie Wang and Sanket Purandare and Gokul Nadathur and Stratos Idreos},
year={2024},
eprint={2410.06511},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.06511},
}
```

## License

Binary file removed assets/images/llama3_1_405B_loss_curves.png
Binary file not shown.
Binary file removed assets/images/llama3_loss_curves.png
Binary file not shown.
Binary file added assets/images/loss_curves.png
Binary file not shown.
1 change: 0 additions & 1 deletion assets/images/readme.md

This file was deleted.

File renamed without changes.
File renamed without changes.
34 changes: 0 additions & 34 deletions create_seed_checkpoint.sh

This file was deleted.

17 changes: 14 additions & 3 deletions docs/checkpoint.md
@@ -1,4 +1,4 @@
## How to convert a Llama3 checkpoint for use in torchtitan
## How to convert a Llama 3 checkpoint for use in torchtitan

If you want to continue training from an existing model checkpoint, the checkpoint must be in the DCP format expected by the checkpoint manager.
An example script for converting the original Llama3 checkpoints into the expected DCP format can be found in `scripts/convert_llama_to_dcp.py`.
@@ -9,8 +9,7 @@ python3 scripts/convert_llama_to_dcp.py <input_dir> <output_dir>
```



## How to Convert a torchtitan Checkpoint for Use in torchtune
## How to convert a torchtitan checkpoint for use in torchtune

This guide will walk you through the steps required to convert a checkpoint from torchtitan so that it can be loaded into torchtune.

@@ -66,3 +65,15 @@ python -m torch.distributed.checkpoint.format_utils dcp_to_torch torchtitan/outp
```

That's it. You have now successfully converted a sharded torchtitan checkpoint for use in torchtune.


## How to create a seed checkpoint
Sometimes one needs to create a seed checkpoint to initialize a model from step 0.
E.g. it is hard, if not impossible, for meta initialization on multiple devices to reproduce the initialization on a single device.
A seed checkpoint does initialization of the model on a single CPU, and can be loaded from another job on an arbitrary number of GPUs via DCP resharding.

To create a seed checkpoint, use the same model config as you use for training.
e.g.
```bash
NGPU=1 CONFIG=<path_to_model_config> ./run_llama_train.sh --checkpoint.enable_checkpoint --checkpoint.create_seed_checkpoint --training.data_parallel_replicate_degree 1 --training.data_parallel_shard_degree 1 --training.tensor_parallel_degree 1 --experimental.pipeline_parallel_degree 1 --experimental.context_parallel_degree 1
```
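For context, the DCP resharding mentioned above is what lets a checkpoint written by one job topology be read by another. Below is a minimal, single-process sketch of the save/load call pattern — not torchtitan's actual checkpoint-manager code; the model and folder name are placeholders:

```python
# Minimal single-process sketch of the DCP save/load API that seed checkpoints rely on.
# Not torchtitan's checkpoint-manager code; the model and folder name are placeholders.
import torch
import torch.distributed.checkpoint as dcp

model = torch.nn.Linear(16, 16)
dcp.save({"model": model.state_dict()}, checkpoint_id="seed_checkpoint_demo")

# A later job (possibly on a different number of GPUs) loads the same folder;
# DCP reads only the shards each rank owns and reshards as needed.
restored = torch.nn.Linear(16, 16)
state = {"model": restored.state_dict()}
dcp.load(state, checkpoint_id="seed_checkpoint_demo")
restored.load_state_dict(state["model"])
```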