End-to-End LLM Model Development with Torchtitan and Torchtune #341
SMHP: Remove 14k log lines from efa exporter LCC
Add conda and docker environment setups for 16.pytorch-capu-ddp test case.
Bump dcgm exporter version to correctly capture GPU utilization
NCCL 2.19.4 has performance regression.
Change nccl version to 2.20.3
Update 3.container-train.sbatch
This reverts commit da7a51d.
Typo in the name.
Rename 0.crate-conda-env.sh to 0.create-conda-env.sh
Updating CF template for HyperPod to support second private subnet
smp v2 llama2 training example using fp8
Update 1.conda-train.sbatch
Signed-off-by: Sean Smith <[email protected]>
Validate Json in preflight check
…distributed-training into torchtitan-torchtune
Basic functionality has been implemented. Allow me to iterate on the other PRs...
3.test_cases/torchtune/slurm/tutorials/e2e-llama3-70b-development/README.md
Co-authored-by: Pavel Belevich <[email protected]>
…ent/README.md Co-authored-by: Pavel Belevich <[email protected]>
Co-authored-by: Pavel Belevich <[email protected]>
* Evaluation
* Deployment

for details of each step, refer the [overview documentation](../../README.md).
for details of each step, refer the [overview documentation](../../README.md).
for details of each step, refer to the [overview documentation](../../README.md).
In this step, you will fine-tune the Llama3 model starting from the original checkpoint using the WikiText dataset. This process, known as Full-Parameter Finetuning, updates all the parameters in the original model. The configuration file used for this process is `./tutorials/e2e-llama3-70b-development/full_finetune_distributed.yaml`.

### Memory Consumption Challenges
One of the primary challenges during such training is memory consumption. A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter plus activation memory (6 bytes for parameters in mixed precision training, 8 bytes for AdamW, and 4 bytes for other overheads). For more details on the anatomy, see the [Hugging Face blog post](https://huggingface.co/docs/transformers/model_memory_anatomy) blog post. This means that training a 70B parameter model would require more than 1.12 TB of accelerated memory, which far exceeds the 80 GB capacity of H100 accelerated memory. To address this issue, torchtune integrates PyTorch Fully Sharded Data Parallel (FSDP).
One of the primary challenges during such training is memory consumption. A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter plus activation memory (6 bytes for parameters in mixed precision training, 8 bytes for AdamW, and 4 bytes for other overheads). For more details on the anatomy, see the [Hugging Face blog post](https://huggingface.co/docs/transformers/model_memory_anatomy) blog post. This means that training a 70B parameter model would require more than 1.12 TB of accelerated memory, which far exceeds the 80 GB capacity of H100 accelerated memory. To address this issue, torchtune integrates PyTorch Fully Sharded Data Parallel (FSDP).
One of the primary challenges during such training is memory consumption. A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter (6 bytes for the parameter in mixed precision training, 4 bytes for the gradient, and 8 bytes for the AdamW optimizer states) plus activation memory. For more details on the anatomy, see the [Hugging Face blog post](https://huggingface.co/docs/transformers/model_memory_anatomy). This means that training a 70B parameter model would require more than 1.12 TiB of accelerator memory, which far exceeds the 80 GB capacity of H100 memory. To address this issue, torchtune integrates PyTorch Fully Sharded Data Parallel (FSDP).
How was 1.12 TiB calculated?
70_000_000_000 * 18 = 1_260_000_000_000 bytes
1_260_000_000_000 / 1024 / 1024 / 1024 / 1024 ≈ 1.15 TiB
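The arithmetic in this comment can be reproduced with a short script. This is only a sketch: it takes the 6 + 4 + 8 byte breakdown from the suggested wording in this thread, ignores activation memory, and reports the total in both decimal (TB) and binary (TiB) units, since the thread uses both.

```python
# Rough estimate of training-state memory, excluding activations.
PARAMS = 70_000_000_000  # 70B parameters

# Per-parameter byte breakdown, as given in the suggested wording in this thread.
BYTES_PER_PARAM = {
    "weights (mixed precision)": 6,
    "gradients": 4,
    "AdamW optimizer states": 8,
}


def training_state_bytes(num_params: int) -> int:
    """Total bytes for weights, gradients, and optimizer states."""
    return num_params * sum(BYTES_PER_PARAM.values())


total = training_state_bytes(PARAMS)
tb = total / 1000**4   # decimal terabytes
tib = total / 1024**4  # binary tebibytes

print(f"{total:_} bytes = {tb:.2f} TB = {tib:.2f} TiB")
# prints: 1_260_000_000_000 bytes = 1.26 TB = 1.15 TiB
```

In either unit the result is above 1.12, so "more than 1.12" is not false, but it understates the figure that the 18 bytes per parameter rule gives.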
The memory itself is not accelerated.
### Basic concepts and relevant configuration

**FSDP** is a distributed training feature designed to efficiently handle large model training by sharding model parameters, gradients, and optimizer states across multiple devices. This approach significantly reduces memory consumption and optimizes resource utilization, making it possible to train models that are too large to fit on a single GPU. In `torchtune` users can launch FSDP training job with command `tune run full_finetune_distributed`.
**FSDP** is a distributed training feature designed to efficiently handle large model training by sharding model parameters, gradients, and optimizer states across multiple devices. This approach significantly reduces memory consumption and optimizes resource utilization, making it possible to train models that are too large to fit on a single GPU. In `torchtune` users can launch FSDP training job with command `tune run full_finetune_distributed`.
**FSDP** is a distributed training technique designed to efficiently handle large model training by sharding model parameters, gradients, and optimizer states across multiple devices. This approach significantly reduces memory consumption and optimizes resource utilization, making it possible to train models that are too large to fit on a single GPU. In `torchtune`, users can launch an FSDP training job with the command `tune run full_finetune_distributed`.
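To make the effect of sharding concrete, here is a back-of-the-envelope sketch of how much training state each GPU would hold. It assumes perfectly even sharding of all 18 bytes per parameter, 8 GPUs per node, and 80 GB read as decimal gigabytes; it ignores activations, communication buffers, and fragmentation, so real usage will be higher. The helper name `per_gpu_bytes` is made up for illustration and is not part of torchtune or PyTorch.

```python
# Idealized per-GPU share of training state under full sharding.
# Assumptions (not from the tutorial): perfectly even sharding, no
# activation memory, no communication buffers or fragmentation.
BYTES_PER_PARAM = 18          # weights + gradients + AdamW states
PARAMS = 70_000_000_000       # 70B parameters
H100_BYTES = 80 * 1000**3     # 80 GB, treated as decimal gigabytes


def per_gpu_bytes(num_params: int, world_size: int) -> float:
    """Training-state bytes each GPU holds when sharded across world_size GPUs."""
    return num_params * BYTES_PER_PARAM / world_size


for nodes in (1, 2, 4):
    world_size = nodes * 8    # 8 GPUs per node, matching --nproc_per_node=8
    share = per_gpu_bytes(PARAMS, world_size)
    verdict = "below" if share < H100_BYTES else "above"
    print(f"{nodes} node(s), {world_size} GPUs: "
          f"{share / 1000**3:.2f} GB per GPU ({verdict} 80 GB before activations)")
```

Under these assumptions a single 8-GPU node is well over the per-GPU limit, and two nodes land only just under it with almost no headroom for activations, which is why the node count used by the launch script matters.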
--master_port $RANDOM
--nproc_per_node=8
--nnodes $NNODES
--nnodes=$SLURM_JOB_NUM_NODES
`--nnodes` is specified twice.
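A repeated flag like this is easy to miss by eye. As a sketch of how it could be caught mechanically, the snippet below scans an argument list for long options that occur more than once, treating `--flag value` and `--flag=value` as the same option. The helper `duplicated_flags` is hypothetical and is not part of torchrun or this repository.

```python
from collections import Counter


def duplicated_flags(args: list[str]) -> list[str]:
    """Return long-option names that appear more than once in args."""
    # Normalize "--flag=value" to "--flag" so both spellings are counted together.
    names = [a.split("=", 1)[0] for a in args if a.startswith("--")]
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


# Arguments as they appear in the reviewed snippet.
launcher_args = [
    "--master_port", "$RANDOM",
    "--nproc_per_node=8",
    "--nnodes", "$NNODES",
    "--nnodes=$SLURM_JOB_NUM_NODES",
]

print(duplicated_flags(launcher_args))  # prints: ['--nnodes']
```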
sbatch tutorials/e2e-llama3-70b-development/full_finetune_distributed.sbatch
```

By default, this script launches the FSDP training job with two instances. Once the job has been scheduled, you will see the following outputs in the log file named `logs/full-finetuning*`:
I don't see where two instances are specified by default; I only see `--nnodes 1 --nnodes=1` in the sbatch files.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.