LLM Training

General

The Llama 3 Herd of Models [paper]
TorchScale - A Library for Transformers at (Any) Scale [GitHub]
DLRover: An Automatic Distributed Deep Learning System [GitHub]

2024

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision [paper]
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs [paper]
ByteCheckpoint: A Unified Checkpointing System for LLM Development [paper]