emattia/distributed-learning-templates

Concepts and methods

Gathering data

Continued pretraining

Continued pretraining: Adapt language models to a new language or domain, or simply improve them, by continuing pre-training (causal language modeling) on a new or domain-specific dataset.
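As a rough illustration (not one of this repo's templates), continued pretraining with Hugging Face transformers boils down to running the causal-LM objective over your new corpus. The model name, data file, and hyperparameters below are placeholders:

```python
# Minimal sketch of continued pretraining (causal language modeling).
# Model name, corpus path, and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "meta-llama/Meta-Llama-3-8B"   # hypothetical base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Your new/domain-specific corpus, one document per line
raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_ds = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cpt-out", per_device_train_batch_size=1,
                           gradient_accumulation_steps=16, num_train_epochs=1, bf16=True),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # mlm=False => causal LM
)
trainer.train()
```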

Finetuning

Supervised fine-tuning (SFT): Teach language models to follow instructions, with tips on how to collect and curate your own training dataset.
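For reference, one common shape for a supervised fine-tuning record is the Alpaca-style instruction/input/output triple sketched below; the field names and content are illustrative, not a schema this repo requires:

```python
# One hypothetical SFT training record (Alpaca-style schema).
sft_example = {
    "instruction": "Summarize why LoRA uses less memory than full fine-tuning.",
    "input": "",
    "output": "LoRA freezes the pretrained weights and trains only small low-rank "
              "adapter matrices, so gradients and optimizer state exist for a tiny "
              "fraction of the parameters.",
}
```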

litgpt finetune_full: This method trains all model weight parameters and is the most memory-intensive fine-tuning technique in LitGPT.

Alignment

Reward modeling: Teach language models to distinguish model responses according to human or AI preferences.
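As a hedged sketch of the idea: a reward model is typically trained with a pairwise (Bradley-Terry style) loss so that the preferred response scores higher than the rejected one. The record fields and loss helper below are illustrative assumptions, not this repo's code:

```python
# Pairwise reward-modeling loss: push r(chosen) above r(rejected).
import torch
import torch.nn.functional as F

preference_example = {
    "prompt": "Explain NVLink in one sentence.",
    "chosen": "NVLink is a direct GPU-to-GPU interconnect with much higher bandwidth than PCIe.",
    "rejected": "NVLink is a kind of monitor cable.",
}

def reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Maximize the margin between the chosen and rejected scores
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```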

Rejection sampling: A technique to boost the performance of an SFT model.
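A minimal sketch of the best-of-N flavor of rejection sampling, assuming hypothetical `generate` and `reward` helpers: sample several candidates per prompt, keep the highest-scoring one, and add the kept pairs back into the SFT data.

```python
# Best-of-N rejection sampling sketch; `generate` and `reward` are assumed helpers.
def rejection_sample(prompts, generate, reward, n: int = 8):
    kept = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n)]
        best = max(candidates, key=lambda c: reward(prompt, c))
        kept.append({"prompt": prompt, "response": best})
    return kept  # fine-tune (SFT) on these best-of-N pairs
```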

Direct preference optimization (DPO): A powerful and promising alternative to PPO.
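For intuition, the DPO objective fits in a few lines: unlike PPO it needs no reward model or sampling loop, only summed log-probabilities of the chosen/rejected responses under the policy and a frozen reference model. A sketch of the loss, not a full training loop:

```python
# DPO loss from per-response summed log-probs (policy vs. frozen reference).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    chosen_ratio = policy_chosen_logps - ref_chosen_logps       # log(pi/ref) for chosen
    rejected_ratio = policy_rejected_logps - ref_rejected_logps  # log(pi/ref) for rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```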

Odds Ratio Preference Optimization (ORPO): A technique to fine-tune language models with human preferences, combining SFT and DPO in a single stage.

Parameter-efficient finetuning (PEFT) methods

litgpt finetune_lora: A more memory-efficient alternative to full fine-tuning.

  • litgpt finetune_lora stabilityai/stablelm-base-alpha-3b
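The command above launches LoRA fine-tuning end to end. As a rough sketch of why LoRA saves memory (illustrative code, not LitGPT's implementation): the pretrained weight is frozen and only a small low-rank update is trained.

```python
# LoRA-wrapped linear layer: frozen base weight W plus trainable (alpha/r) * B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                  # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))     # zero init => no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```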

litgpt finetune_adapter: A form of prefix-tuning that prepends a learnable adaption-prompt to the inputs of the attention blocks in an LLM.

litgpt finetune_adapter_v2

Questions

  • What is the relation between fine-tuning and alignment?

  • How can Nvidia NIM be used to connect an LLM to:

    • combinatorial search
    • logic
    • CSP
    • difference between inductive vs. deductive reasoning
    • soundness vs. completeness of inference/reasoning

Patterns

Use a torchtune recipe

Finetune with LitGPT

Finetune with Axolotl

Metaflow templates in this repo

composer?

| Model | Pretrained by | Use case | Workflow Type | Launcher | Trainer library | Hardware | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama-8b | Meta | text generation, chat | fine-tuning | torchtune | torch | ... | ... |
| llama-8b | Meta | text generation, chat | fine-tuning | deepspeed | torch | ... | ... |
| llama-8b | Meta | text generation, chat | fine-tuning-lora | torchtune | torch | ... | ... |
| mistral-8b | Mistral | text generation, chat | fine-tuning | torchtune | torch | ... | ... |
| mistral-8b | Mistral | text generation, chat | fine-tuning | deepspeed | torch | ... | ... |
| mistral-8b | Mistral | text generation, chat | fine-tuning-lora | torchtune | torch | ... | ... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| llama-70b | Meta | text generation, chat | fine-tuning | torchtune | ... | ... | ... |
| llama-70b | Meta | text generation, chat | fine-tuning | deepspeed | ... | ... | ... |
| llama-70b | Meta | text generation, chat | fine-tuning-lora | torchtune | ... | ... | ... |
| mixtral-8x7b | Mistral | text generation, chat | fine-tuning | torchtune | ... | ... | ... |
| mixtral-8x7b | Mistral | text generation, chat | fine-tuning | deepspeed | ... | ... | ... |
| mixtral-8x7b | Mistral | text generation, chat | fine-tuning-lora | torchtune | ... | ... | ... |
| codestral | Mistral | text generation, chat | fine-tuning | torchtune | ... | ... | ... |
| stable-diffusion-xl | Stability AI | text-to-image, text-to-video | fine-tuning | ... | ... | ... | ... |
| stable-diffusion-xl | Stability AI | text-to-image, text-to-video | fine-tuning | ... | ... | ... | ... |
| stable-diffusion-xl | Stability AI | text-to-image, text-to-video | fine-tuning-lora | ... | ... | ... | ... |
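The templates in the table share a launcher pattern: a Metaflow step requests hardware with @resources and shells out to a trainer CLI such as torchtune. A minimal sketch of that pattern follows; the recipe and config names are placeholders, not necessarily the ones used in this repo:

```python
# Hypothetical, simplified launcher flow; the actual templates may differ.
import subprocess
from metaflow import FlowSpec, step, resources

class FinetuneFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.train)

    @resources(gpu=8, memory=256000)  # request the hardware the trainer needs
    @step
    def train(self):
        # Placeholder torchtune invocation; swap in the recipe/config you want.
        subprocess.run(
            ["tune", "run", "full_finetune_distributed",
             "--config", "llama3/8B_full"],
            check=True,
        )
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    FinetuneFlow()
```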

Choosing a hardware setup

A critical developer-experience bottleneck in AI is the cycle of deciding what you want to do, working out what resources it requires, and then finding and using those resources.

Major cloud providers add a layer of indirection by buying GPUs from Nvidia and AMD and repackaging them in VM offerings such as EC2 instances. Even if Robin knows her workflow tasks require GPU cards with ≥24 GB of VRAM, mapping the available GPUs to the cloud VM types that carry them can be difficult. Google has a handy CLI filter and a matching UI table filter to make this easier:

gcloud compute accelerator-types list --filter="nvidia-h100-80gb"

Heuristics

  • A full finetune of a 7B model requires roughly 1-4 cards with ≥24 GB VRAM each.
    • Using PagedAdamW from bitsandbytes reduces the number of cards you'll need.
  • A full finetune of a 70B model requires 8 cards with 80 GB VRAM each.
  • A LoRA finetune of a 7B model requires ≥1 card with ≥24 GB VRAM, or possibly 1-2 cards with ≥16 GB VRAM.
  • A QLoRA finetune of a 7B model requires ≥1 card with ≥16 GB VRAM.
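A back-of-envelope way to derive these numbers counts only weights, gradients, and optimizer state (activations, KV cache, and framework overhead come on top). The byte counts below assume AdamW with fp32 moments and bf16 weights/gradients; they are rules of thumb, not exact figures:

```python
# Rough VRAM estimates in GB for an n-billion-parameter model.
def full_finetune_gb(n_params_b: float) -> float:
    # bf16 weights + bf16 grads + fp32 master weights + 2 fp32 Adam moments
    bytes_per_param = 2 + 2 + 4 + 4 + 4
    return n_params_b * bytes_per_param

def lora_finetune_gb(n_params_b: float, trainable_frac: float = 0.01) -> float:
    frozen = n_params_b * 2                                   # bf16 frozen base weights
    trainable = n_params_b * trainable_frac * (2 + 2 + 4 + 4 + 4)
    return frozen + trainable

print(full_finetune_gb(7))   # ~112 GB -> needs sharding/offload or several large cards
print(lora_finetune_gb(7))   # ~15 GB  -> fits on a single 24 GB card
```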

Data center GPUs

| GPU Type | Architecture | Today's equivalent | Dtypes | VRAM | Memory Bandwidth | Interconnect | Server packaging | AWS | Azure | GCP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| H200 SXM | Hopper | N/A | bf16, fp64, fp32, fp16, fp8, int8 | 141GB | 4.8TB/s | NVLink 900GB/s | 4 or 8 GPUs | N/A | ... | ... |
| H200 NVL | Hopper | N/A | bf16, fp64, fp32, fp16, fp8, int8 | 141GB | 4.8TB/s | NVLink 900GB/s | 1-8 GPUs | N/A | | |
| H100 PCIe | Hopper | N/A | bf16, fp64, fp32, fp16, fp8, int8 | 80GB HBM2e | 2TB/s | NVLink 600GB/s | 1-8 GPUs | N/A | ... | ... |
| H100 SXM | Hopper | N/A | bf16, fp64, fp32, fp16, fp8, int8 | 80GB HBM3 | 3.35TB/s | NVLink 900GB/s | 4 or 8 GPUs | p5.48xlarge | ... | a3-megagpu-8g |
| H100 NVL | Hopper | N/A | bf16, fp64, fp32, fp16, fp8, int8 | 188GB | 7.8TB/s | NVLink 600GB/s | 2-4 GPU pairs | N/A | | |
| A100 80GB PCIe | Ampere | H100 | fp64, fp32, tf32, bf16, fp16, int8 | 80GB HBM2e | 1.935TB/s | NVLink 600GB/s | ... | N/A | NC_A100_v4-series | ... |
| A100 80GB SXM | Ampere | H100 | fp64, fp32, tf32, bf16, fp16, int8 | 80GB HBM2e | 2.039TB/s | NVLink 600GB/s | ... | p4de.24xlarge | | a2-ultragpu-1g, ..., a2-ultragpu-8g |
| A100 40GB PCIe | Ampere | H100 | fp64, fp32, tf32, bf16, fp16, int8 | 40GB HBM2 | 1.555TB/s | NVLink 600GB/s | ... | N/A | ... | ... |
| A100 40GB SXM | Ampere | H100 | fp64, fp32, tf32, bf16, fp16, int8 | 40GB HBM2 | 1.555TB/s | NVLink 600GB/s | ... | p4d.24xlarge | ... | a2-highgpu-1g, ..., a2-highgpu-16g |
| L40s | Ada Lovelace | N/A | fp32, tf32, bf16, fp16, fp8, int4, int8 | 48GB GDDR6 | 864GB/s | PCIe Gen4 x16: 64GB/s | ... | N/A | ... | ... |
| L40 | Ada Lovelace | N/A | fp32, tf32, bf16, fp16, fp8, int4, int8 | 48GB GDDR6 | 864GB/s | PCIe Gen4 x16: 64GB/s | ... | N/A | | |
| L4 | Ada Lovelace | N/A | fp32, fp16, bf16, tf32, int8 | 24GB GDDR6 | 300GB/s | PCIe Gen4: 64GB/s | 1-8 GPUs | g6.xlarge, ..., g6.48xlarge | N/A | g2-standard-4, ..., g2-standard-96 |
| A2 | Ampere | ... | ... | 16GB GDDR6 | 200GB/s | PCIe Gen4 | ... | ... | N/A | |
| A40 | Ampere | L40/L40s | ... | 48GB GDDR6 | 696GB/s | NVLink 112.5GB/s | ... | N/A | | |
| A30 | Ampere | ... | bf16, fp64, fp32, tf32, fp16, int4, int8 | 24GB HBM2 | 933GB/s | NVLink 200GB/s | ... | N/A | | |
| A16 | Ampere | ... | fp32, tf32, fp16, int8 | 4x 16GB GDDR6 | 4x 200GB/s | PCIe Gen4: 64GB/s | ... | N/A | | |
| A10G | Ampere | ... | ... | 24GB GDDR6 | 600GB/s | PCIe Gen4: 64GB/s | ... | g5.xlarge, ..., g5.48xlarge | | |
| A10 | Ampere | ... | bf16, fp32, tf32, fp16, int4, int8 | 24GB GDDR6 | 600GB/s | PCIe Gen4: 64GB/s | ... | N/A | | |
| T4 | Turing | ... | int4, int8 | 16GB GDDR6 | ... | PCIe Gen3: 32GB/s | ... | N/A | ... | nvidia-tesla-t4 |
| RTX A6000 | Ampere | L40/L40s | ... | 48GB GDDR6 | 768GB/s | NVLink 112.5GB/s | ... | ... | ... | ... |
| RTX A5000 | Ampere | L40/L40s | ... | 24GB GDDR6 | 768GB/s | NVLink 112.5GB/s | ... | ... | ... | ... |
| RTX A4000 | Ampere | L40/L40s | ... | 16GB GDDR6 | 448GB/s | PCIe Gen4 | ... | ... | ... | ... |
| Quadro RTX 8000 | Turing | L40/L40s | ... | 48GB GDDR6 | ... | NVLink 100GB/s | ... | ... | ... | ... |
| Quadro RTX 6000 | Turing | L40/L40s | ... | 24GB GDDR6 | ... | NVLink 100GB/s | ... | ... | ... | ... |
| Quadro RTX 5000 | Turing | L40/L40s | ... | 16GB GDDR6 | ... | NVLink 100GB/s | ... | ... | ... | ... |
| Quadro RTX 4000 | Turing | L40/L40s | ... | 8GB GDDR6 | ... | NVLink 100GB/s | ... | ... | ... | ... |

Terms: single node, multiple GPUs

NVLink

  • A wire-based communications protocol first produced by Nvidia in 2014.
  • Direct GPU-to-GPU interconnect that scales multi-GPU input and output (IO) within a server/VM.
    • Direct GPU-to-GPU interconnect leads to better compute utilization rates and less need for @resources(memory=...)
  • "A single NVIDIA Blackwell Tensor Core GPU supports up to 18 NVLink 100 gigabyte-per-second (GB/s) connections for a total bandwidth of 1.8 terabytes per second (TB/s)." - Nvidia

PCIe

  • Peripheral Component Interconnect Express
  • A standard for moving data on a bus at high speeds between graphics cards, SSDs, Ethernet connections, etc.
  • Each device connected to the bus has a dedicated connection to the host, which is faster than shared bus architectures.
  • PCIe devices communicate via "interconnects" or "links", point-to-point communication channels between PCIe ports that send/receive requests and interrupts.
  • Example use: The PCIe bus is used to attach non-volatile memory express (NVMe), a specification for how hardware and software can better use parallelism in modern SSDs. This reduces I/O overhead.

SXM

  • An Nvidia product that connects GPUs by directly socketing them to the motherboard, instead of using PCIe slots to connect them to the motherboard.
  • So far, each DGX system series (Pascal, Volta, Ampere, Hopper, ...) comes with its own SXM socket generation.
  • SXM may have NVLink switches, allowing faster GPU-to-GPU communication.

Terms: multiple nodes, multiple GPUs

NVSwitch

  • Introduced as part of DGX-2.
  • Extends NVLink across nodes.
  • Connects multiple NVLinks to provide all-to-all GPU communication at full NVLink speed within a single rack and between racks.

NIC

  • Network interface card; the hardware layer that distributed/multi-node training runs over.
  • Examples: Nvidia and Microsoft use Infiniband. AWS and GCP have proprietary NICs, the Elastic Fabric Adapter (EFA) and gVNIC, respectively.
  • End user APIs like MPI rely on guarantees of systems at this level.
    • The connection is made via libfabric, a communication API that routes requests from MPI (or other) user-facing programs to providers, such as Infiniband Verbs, SHM, or EFA.

RDMA

  • Transfers data directly from application memory - or GPU VRAM - to the wire, reducing the need for host resources and the latency of message passing.
  • For Nvidia GPUs, GPUDirect and Infiniband are the important examples of RDMA.

Infiniband

  • Mellanox manufactured Infiniband host bus adapters and network switches. In 2019, Nvidia acquired Mellanox, an Israeli-American computer networking company and the last independent supplier of Infiniband hardware.
  • Infiniband is a networking standard consisting of a physical link-layer protocol and the Verbs API, an RDMA implementation.
  • Interconnect bottleneck: when the connections between integrated circuits are slower than the computation that runs within them, so moving data rather than computing on it limits throughput.
  • On AWS, the proprietary Infiniband substitute is Elastic Fabric Adapter (EFA).

GPUDirect

  • Nvidia doesn't market Infiniband much. Instead they emphasize GPUDirect, a family of technologies (GPUDirect RDMA, GPUDirect Storage, peer-to-peer) that let NICs, storage, and other GPUs read and write GPU memory directly, bypassing the host CPU and system memory.

NCCL

  • Nvidia Collective Communications Library: implements the collective operations (all-reduce, all-gather, reduce-scatter, broadcast) used in multi-GPU and multi-node training.
  • Picks the best available transport (NVLink, PCIe, Infiniband, or Ethernet/EFA) and is the default GPU backend for PyTorch distributed, which trainers such as torchtune and DeepSpeed build on.
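As a small illustration, PyTorch's distributed package uses NCCL as its GPU backend. The sketch below, assumed to be launched with torchrun on a single node, runs an all-reduce across the local GPUs:

```python
# Run with: torchrun --nproc_per_node=8 all_reduce_demo.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")   # NCCL picks NVLink/PCIe/IB transports
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    x = torch.ones(1, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)  # sum of all ranks, over NVLink where available
    print(f"rank {dist.get_rank()}: {x.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```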

Here are some resources with heuristics and data that simplify this process:

Where are the big GPUs?

GCP

As of June 2024:

  • Tokyo, Japan, APAC: asia-northeast1-b
  • Jurong West, Singapore: asia-southeast1-b
  • St. Ghislain, Belgium: europe-west1-b
  • Eemshaven, Netherlands: europe-west4-b
  • Tel Aviv, Israel: me-west1-b
  • Ashburn, Virginia: us-east4-b
  • Council Bluffs, Iowa: us-central1-c
  • Columbus, Ohio: us-east5-a
  • The Dalles, Oregon: us-west1-a
  • Las Vegas, Nevada: us-west4-a

Resources
