LLM Training

General

The Llama 3 Herd of Models [paper]
TorchScale - A Library for Transformers at (Any) Scale [GitHub]
DLRover: An Automatic Distributed Deep Learning System [GitHub]

2024

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision [paper]
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs [paper]
ByteCheckpoint: A Unified Checkpointing System for LLM Development [paper]