Sam Foreman
2024-03-12
- Recording: 🎬 Intro to AI Series: Parallel Training Methods for AI
- Slides: 📊 Parallel Training Slides [GitHub]
- We will use distributed data parallel (DDP) training to train an LLM on multiple nodes of Polaris @ ALCF (a minimal sketch of the idea follows below).
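In data-parallel training, each GPU (rank) holds a full copy of the model and works on its own shard of every batch; after each backward pass the gradients are averaged across all ranks, so the copies stay in sync. A self-contained PyTorch sketch of that pattern (illustrative only; the toy model and launcher assumptions here are hypothetical, not the `wordplay` code):

```python
# Minimal DDP sketch: one process per GPU, each with a full model copy;
# gradients are averaged across ranks on every backward pass.
# Illustrative only -- this is not the wordplay training loop.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # Assumes a torchrun-style launcher that sets RANK, WORLD_SIZE,
    # LOCAL_RANK, MASTER_ADDR, and MASTER_PORT in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(256, 256).cuda()      # stand-in for the GPT model
    model = DDP(model, device_ids=[local_rank])   # registers grad all-reduce hooks
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for _ in range(10):
        x = torch.randn(64, 256, device="cuda")   # each rank sees its own data shard
        loss = model(x).square().mean()
        loss.backward()                            # gradients averaged across ranks here
        optimizer.step()
        optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

This is the pattern selected by `train.backend=DDP` in the launch command later in the walkthrough.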
- Launch job:

```bash
$ qsub -A ALCFAITP -q debug -l select=2 -l walltime=01:00:00,filesystems=eagle:home -I
qsub: waiting for job 1779554.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov to start
qsub: job 1779554.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov ready
```
- Load conda:

```bash
$ module load conda/2023-10-04 ; conda activate base
```
- Clone saforem2/wordplay:

```bash
$ git clone https://github.com/saforem2/wordplay
$ cd wordplay
```
- Make + activate virtual-env:

```bash
$ mkdir -p venvs/polaris/2023-10-04
$ python3 -m venv venvs/polaris/2023-10-04 --system-site-packages
$ source venvs/polaris/2023-10-04/bin/activate
```
- Install wordplay:

```bash
(2023-10-04) $ python3 -m pip install -e "."
```
- Install ezpz:

```bash
(2023-10-04) $ git clone https://github.com/saforem2/ezpz
(2023-10-04) $ python3 -m pip install -e "ezpz[dev]"
(2023-10-04) $ source ezpz/src/ezpz/bin/savejobenv > /tmp/savejobenv.log 2>&1  # || exit
(2023-10-04) $ source ezpz/src/ezpz/bin/getjobenv  # || exit
┌──────────────────────────────────────────────────────────────────
│ [Hosts]:
│ • [host:0] - x3006c0s13b1n0.hsn.cm.polaris.alcf.anl.gov
│ • [host:1] - x3006c0s19b0n0.hsn.cm.polaris.alcf.anl.gov
└──────────────────────────────────────────────────────────────────
┌──────────────────────────────────────────────────────────────────
│ [DIST INFO]:
│ • Loading job env from: /home/foremans/.pbsenv
│ • HOSTFILE: /var/spool/pbs/aux/1779643.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
│ • NHOSTS: 2
│ • NGPU_PER_HOST: 4
│ • NGPUS (NHOSTS x NGPU_PER_HOST): 8
│ • WORLD_SIZE: 8
│ • DIST_LAUNCH: mpiexec --verbose --envall -n 8 -ppn 4 --hostfile /var/spool/pbs/aux/1779643.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov
└──────────────────────────────────────────────────────────────────
┌──────────────────────────────────────────────────────────────────
│ [Launch]:
│ • Use: 'launch' (=mpiexec --verbose --envall -n 8 -ppn 4 --hostfile /var/spool/pbs/aux/1779643.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov)
│   to launch job
└──────────────────────────────────────────────────────────────────
```
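The `launch` alias printed above is nothing magical: it is just the `mpiexec` command shown under `DIST_LAUNCH`, with `-n` and `-ppn` derived from the PBS hostfile (NGPUS = NHOSTS x NGPU_PER_HOST = 2 x 4 = 8). A rough Python sketch of that bookkeeping (a hypothetical helper for illustration, not the actual `ezpz` implementation):

```python
# Rough sketch of where NGPUS and the `launch` command come from:
# count the hosts in the PBS hostfile and multiply by GPUs per host.
# Hypothetical helper for illustration -- not the actual ezpz code.
import os


def build_launch_cmd(hostfile: str, gpus_per_host: int = 4) -> str:
    """Build an mpiexec command string from a PBS hostfile."""
    with open(hostfile) as f:
        # unique hosts (a node can be listed more than once in the hostfile)
        hosts = sorted({line.strip() for line in f if line.strip()})
    nhosts = len(hosts)                     # NHOSTS, e.g. 2 on this allocation
    ngpus = nhosts * gpus_per_host          # NGPUS = 2 x 4 = 8 = WORLD_SIZE
    return (
        f"mpiexec --verbose --envall -n {ngpus} -ppn {gpus_per_host} "
        f"--hostfile {hostfile}"
    )


if __name__ == "__main__":
    # PBS sets PBS_NODEFILE to the hostfile shown under [DIST INFO] above
    print(build_launch_cmd(os.environ["PBS_NODEFILE"]))
```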
- Prepare data:

```bash
(2023-10-04) $ python3 data/shakespeare_char/prepare.py
length of dataset in characters: 1,115,394
all the unique characters: !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
vocab size: 65
train has 1,003,854 tokens
val has 111,540 tokens
```
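The prepare script performs character-level tokenization: every unique character in the corpus becomes a token id (hence the vocab size of 65), and the encoded text is split roughly 90/10 into `train.bin` / `val.bin`. A minimal sketch of the idea (illustrative only, not the actual `prepare.py`; the `input.txt` path is assumed):

```python
# Minimal sketch of character-level tokenization with a 90/10 train/val
# split. Illustrative only -- not the actual prepare.py; the input.txt
# path is assumed.
import numpy as np

text = open("data/shakespeare_char/input.txt").read()  # ~1,115,394 characters
chars = sorted(set(text))                               # 65 unique characters
stoi = {ch: i for i, ch in enumerate(chars)}            # char -> token id
itos = {i: ch for ch, i in stoi.items()}                # token id -> char

ids = np.array([stoi[c] for c in text], dtype=np.uint16)
assert "".join(itos[int(i)] for i in ids[:10]) == text[:10]  # round-trip check

n = int(0.9 * len(ids))                                 # 90/10 split
train_ids, val_ids = ids[:n], ids[n:]                   # ~1,003,854 / ~111,540 tokens
train_ids.tofile("data/shakespeare_char/train.bin")
val_ids.tofile("data/shakespeare_char/val.bin")
```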
- Launch Training:

```bash
(2023-10-04) $ cd src/wordplay
(2023-10-04) $ launch python3 __main__.py +experiment=shakespeare data=shakespeare train.backend=DDP train.max_iters=100 train.log_interval=5 train.compile=false
```
Output:

```
Connected to tcp://x3006c0s13b1n0.hsn.cm.polaris.alcf.anl.gov:7919
Found executable /lus/eagle/projects/datascience/foremans/tmp/wordplay/venvs/polaris/2023-10-04/bin/python3
Launching application 52482c6c-599c-4d4a-b57b-34f0ac962249
[2024-03-09 10:38:01][INFO][configs:72] - Setting HF_DATASETS_CACHE to /lus/eagle/projects/datascience/foremans/tmp/wordplay/.cache/huggingface/datasets
[2024-03-09 10:38:03][INFO][distributed_c10d:442] - Added key: store_based_barrier_key:1 to store for rank: 7
[2024-03-09 10:38:03][INFO][distributed_c10d:442] - Added key: store_based_barrier_key:1 to store for rank: 4
[2024-03-09 10:38:03][INFO][distributed_c10d:442] - Added key: store_based_barrier_key:1 to store for rank: 0
[2024-03-09 10:38:03][INFO][distributed_c10d:442] - Added key: store_based_barrier_key:1 to store for rank: 5
[2024-03-09 10:38:03][INFO][distributed_c10d:442] - Added key: store_based_barrier_key:1 to store for rank: 6
[2024-03-09 10:38:03][INFO][distributed_c10d:476] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
[2024-03-09 10:38:03][INFO][distributed_c10d:442] - Added key: store_based_barrier_key:1 to store for rank: 2
[2024-03-09 10:38:03][INFO][distributed_c10d:442] - Added key: store_based_barrier_key:1 to store for rank: 1
[2024-03-09 10:38:03][INFO][distributed_c10d:442] - Added key: store_based_barrier_key:1 to store for rank: 3
[2024-03-09 10:38:03][INFO][distributed_c10d:476] - Rank 7: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
[2024-03-09 10:38:03][INFO][distributed_c10d:476] - Rank 5: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
[2024-03-09 10:38:03][INFO][distributed_c10d:476] - Rank 2: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
[2024-03-09 10:38:03][INFO][distributed_c10d:476] - Rank 6: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
[2024-03-09 10:38:03][INFO][distributed_c10d:476] - Rank 3: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
[2024-03-09 10:38:03][INFO][distributed_c10d:476] - Rank 4: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
[2024-03-09 10:38:03][INFO][distributed_c10d:476] - Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 8 nodes.
[2024-03-09 10:38:03][INFO][dist:290] - [device='cuda'][rank=2/7][local_rank=2/3][node=0/1]
[2024-03-09 10:38:03][INFO][dist:290] - [device='cuda'][rank=4/7][local_rank=0/3][node=0/1]
[2024-03-09 10:38:03][INFO][dist:290] - [device='cuda'][rank=3/7][local_rank=3/3][node=1/1]
[2024-03-09 10:38:03][INFO][dist:290] - [device='cuda'][rank=5/7][local_rank=1/3][node=1/1]
[2024-03-09 10:38:03][INFO][dist:290] - [device='cuda'][rank=7/7][local_rank=3/3][node=1/1]
[2024-03-09 10:38:03][INFO][dist:290] - [device='cuda'][rank=1/7][local_rank=1/3][node=1/1]
[2024-03-09 10:38:03][INFO][dist:290] - [device='cuda'][rank=6/7][local_rank=2/3][node=0/1]
[2024-03-09 10:38:03][INFO][dist:239] - DistInfo={
    "DEVICE": "cuda",
    "DEVICE_ID": "cuda:0",
    "DISTRIBUTED_BACKEND": "nccl",
    "GPUS_PER_NODE": 4,
    "HOSTFILE": "/var/spool/pbs/aux/1779643.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov",
    "HOSTNAME": "x3006c0s13b1n0.hsn.cm.polaris.alcf.anl.gov",
    "HOSTS": "['x3006c0s13b1n0', 'x3006c0s19b0n0']",
    "LOCAL_RANK": 0,
    "MACHINE": "Polaris",
    "NGPUS": 8,
    "NODE_ID": 0,
    "NUM_NODES": 2,
    "RANK": 0,
    "SCHEDULER": "PBS",
    "WORLD_SIZE_IN_USE": 8,
    "WORLD_SIZE_TOTAL": 8
}
[2024-03-09 10:38:03][INFO][dist:605] - [0/8] Using device='cuda' with backend='DDP' + 'nccl' for distributed training.
[2024-03-09 10:38:03][INFO][dist:290] - [device='cuda'][rank=0/7][local_rank=0/3][node=0/1]
[2024-03-09 10:38:03][WARNING][dist:296] - Using [8 / 8] available "cuda" devices !!
[2024-03-09 10:38:03][INFO][configs:308] - Loading val from /lus/eagle/projects/datascience/foremans/tmp/wordplay/data/shakespeare_char/val.bin
[2024-03-09 10:38:03][INFO][configs:308] - Loading train from /lus/eagle/projects/datascience/foremans/tmp/wordplay/data/shakespeare_char/train.bin
[2024-03-09 10:38:03][INFO][configs:283] - Rescaling GAS -> GAS // WORLD_SIZE = 8 // 8
[2024-03-09 10:38:03][INFO][configs:433] - Tokens per iteration: 131,072
[2024-03-09 10:38:04][INFO][trainer:277] - Initializing a new model from scratch
[2024-03-09 10:38:04][INFO][trainer:277] - Initializing a new model from scratch
[2024-03-09 10:38:04][INFO][trainer:277] - Initializing a new model from scratch
[2024-03-09 10:38:04][INFO][trainer:277] - Initializing a new model from scratch
[2024-03-09 10:38:04][INFO][trainer:277] - Initializing a new model from scratch
[2024-03-09 10:38:04][INFO][trainer:277] - Initializing a new model from scratch
[2024-03-09 10:38:04][INFO][trainer:277] - Initializing a new model from scratch
[2024-03-09 10:38:04][INFO][configs:456] - Using self.ptdtype=torch.bfloat16 on self.device_type='cuda'
[2024-03-09 10:38:04][INFO][configs:462] - Initializing a new model from scratch
[2024-03-09 10:38:04][INFO][dist:751] - Setting up wandb from rank: 0
[2024-03-09 10:38:04][INFO][dist:752] - Using: WB PROJECT: WordPlay
[2024-03-09 10:38:04][CRITICAL][trainer:338] - "devid='cuda:2'"
[2024-03-09 10:38:04][CRITICAL][trainer:338] - "devid='cuda:1'"
[2024-03-09 10:38:04][CRITICAL][trainer:338] - "devid='cuda:2'"
[2024-03-09 10:38:04][CRITICAL][trainer:338] - "devid='cuda:3'"
[2024-03-09 10:38:04][CRITICAL][trainer:338] - "devid='cuda:1'"
[2024-03-09 10:38:04][CRITICAL][trainer:338] - "devid='cuda:3'"
[2024-03-09 10:38:04][CRITICAL][trainer:338] - "devid='cuda:0'"
wandb: Currently logged in as: foremans (aurora_gpt). Use `wandb login --relogin` to force relogin
wandb: wandb version 0.16.4 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.16.2
wandb: Run data is saved locally in /lus/eagle/projects/datascience/foremans/tmp/wordplay/src/outputs/runs/shakespeare/pytorch/DDP/2024-03-09/10-38-03/wandb/run-20240309_103805-6b22rdws
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run feasible-sunset-4
wandb: ⭐️ View project at https://wandb.ai/aurora_gpt/WordPlay
wandb: 🚀 View run at https://wandb.ai/aurora_gpt/WordPlay/runs/6b22rdws
[2024-03-09 10:38:06][INFO][dist:782] - W&B RUN: [feasible-sunset-4](https://wandb.ai/aurora_gpt/WordPlay/runs/6b22rdws)
[2024-03-09 10:38:06][INFO][dist:810] - Running on machine='Polaris'
[2024-03-09 10:38:06][WARNING][__main__:87] - {
    "train": {
        "framework": "pytorch",
        "backend": "DDP",
        "device": null,
        "seed": null,
        "port": null,
        "ds_config_path": null,
        "precision": null,
        "ngpus": null,
        "use_wandb": true,
        "eval_interval": 250,
        "log_interval": 5,
        "eval_iters": 200,
        "eval_only": false,
        "always_save_checkpoint": false,
        "init_from": "scratch",
        "wandb_project": "WordPlay",
        "max_iters": 100,
        "warmup_iters": 100,
        "dtype": "bfloat16",
        "compile": false
    },
    "model": {
        "n_layer": 6,
        "n_head": 6,
        "n_embd": 384,
        "batch_size": 64,
        "block_size": 256,
        "activation": "gelu",
        "dropout": 0.2,
        "bias": false,
        "vocab_size": 65
    },
    "data": {
        "dataset": "shakespeare_char",
        "out_dir": "out-shakespeare-char",
        "root_path": null
    },
    "optimizer": {
        "learning_rate": 0.001,
        "weight_decay": 0.1,
        "beta1": 0.9,
        "beta2": 0.99,
        "grad_clip": 1.0,
        "decay_lr": true,
        "lr_decay_iters": 5000,
        "min_lr": 0.0001,
        "gradient_accumulation_steps": 1
    }
}
[2024-03-09 10:38:06][WARNING][__main__:88] - Output dir: /lus/eagle/projects/datascience/foremans/tmp/wordplay/src/outputs/runs/shakespeare/pytorch/DDP/2024-03-09/10-38-03
[2024-03-09 10:38:06][INFO][trainer:277] - Initializing a new model from scratch
[2024-03-09 10:38:06][INFO][model:255] - number of parameters: 10.65M
[2024-03-09 10:38:06][INFO][model:445] - num decayed parameter tensors: 26, with 10,740,096 parameters
[2024-03-09 10:38:06][INFO][model:449] - num non-decayed parameter tensors: 13, with 4,992 parameters
[2024-03-09 10:38:06][INFO][model:465] - using fused AdamW: True
[2024-03-09 10:38:06][CRITICAL][trainer:338] - "devid='cuda:0'"
[2024-03-09 10:38:08][INFO][trainer:700] - Startup time: 5.5545
  0%| | 0/100 [00:00<?, ?it/s]
[2024-03-09 10:38:09][INFO][trainer:700] - Startup time: 5.5563
[2024-03-09 10:38:09][INFO][trainer:700] - Startup time: 5.5572
[2024-03-09 10:38:09][INFO][trainer:700] - Startup time: 6.0207
                                Training Legend
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃    abbr    ┃ desc                                                        ┃
┡━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│    step    │ Current training iteration                                  │
│    loss    │ Loss value                                                  │
│     dt     │ Elapsed time per training step (measured in **ms**)         │
│    dtf     │ Elapsed time per forward step (measured in **ms**)          │
│    dtb     │ Elapsed time per backward step (measured in **ms**)         │
│    sps     │ Samples per second                                          │
│    mtps    │ Tokens per second, measured in MEGA (1 x 10^6) tokens / sec │
│    mfu     │ Model flops utilization                                     │
│ train_loss │ Training loss value                                         │
│  val_loss  │ Validation loss value                                       │
└────────────┴─────────────────────────────────────────────────────────────┘
[2024-03-09 10:38:09][INFO][trainer:700] - Startup time: 6.0228
[2024-03-09 10:38:09][INFO][trainer:700] - Startup time: 6.0243
[2024-03-09 10:38:09][INFO][trainer:700] - Startup time: 6.0259
[2024-03-09 10:38:09][INFO][trainer:700] - Startup time: 5.5711
  1%| | 1/100 [00:03<06:15, 3.79s/it]
[2024-03-09 10:38:12][INFO][distributed:1140] - Reducer buckets have been rebuilt in this iteration.
[2024-03-09 10:38:12][INFO][distributed:1140] - Reducer buckets have been rebuilt in this iteration.
[2024-03-09 10:38:12][INFO][distributed:1140] - Reducer buckets have been rebuilt in this iteration.
[2024-03-09 10:38:12][INFO][distributed:1140] - Reducer buckets have been rebuilt in this iteration.
[2024-03-09 10:38:12][INFO][distributed:1140] - Reducer buckets have been rebuilt in this iteration.
[2024-03-09 10:38:12][INFO][distributed:1140] - Reducer buckets have been rebuilt in this iteration.
[2024-03-09 10:38:12][INFO][distributed:1140] - Reducer buckets have been rebuilt in this iteration.
[2024-03-09 10:38:12][INFO][distributed:1140] - Reducer buckets have been rebuilt in this iteration.
  3%|▎ | 3/100 [00:04<01:31, 1.05it/s]
[2024-03-09 10:38:13][INFO][trainer:768] - step=5 loss=3.6333 dt=84.3107 dtf=4.3360 dtb=78.0501 sps=94.8871 mtps=1.5546 mfu=-100.0000 train_loss=4.2858 val_loss=4.2797
  8%|▊ | 8/100 [00:04<00:23, 3.96it/s]
[2024-03-09 10:38:13][INFO][trainer:768] - step=10 loss=3.2341 dt=88.2915 dtf=4.5337 dtb=81.7871 sps=90.6089 mtps=1.4845 mfu=4.2204 train_loss=4.2858 val_loss=4.2797
 14%|█▍ | 14/100 [00:04<00:10, 8.37it/s]
[2024-03-09 10:38:14][INFO][trainer:768] - step=15 loss=2.9544 dt=76.5423 dtf=4.5539 dtb=69.2066 sps=104.5173 mtps=1.7124 mfu=4.2852 train_loss=4.2858 val_loss=4.2797
 18%|█▊ | 18/100 [00:05<00:08, 9.84it/s]
[2024-03-09 10:38:14][INFO][trainer:768] - step=20 loss=2.7892 dt=110.9556 dtf=4.6069 dtb=103.4925 sps=72.1009 mtps=1.1813 mfu=4.1925 train_loss=4.2858 val_loss=4.2797
 24%|██▍ | 24/100 [00:05<00:06, 12.28it/s]
[2024-03-09 10:38:14][INFO][trainer:768] - step=25 loss=2.6724 dt=58.4101 dtf=4.5098 dtb=51.2084 sps=136.9626 mtps=2.2440 mfu=4.4112 train_loss=4.2858 val_loss=4.2797
 28%|██▊ | 28/100 [00:06<00:05, 12.77it/s]
[2024-03-09 10:38:15][INFO][trainer:768] - step=30 loss=2.6189 dt=97.2098 dtf=4.3142 dtb=90.9276 sps=82.2962 mtps=1.3483 mfu=4.3534 train_loss=4.2858 val_loss=4.2797
 34%|███▍ | 34/100 [00:06<00:05, 11.95it/s]
[2024-03-09 10:38:15][INFO][trainer:768] - step=35 loss=2.5821 dt=98.7089 dtf=4.3972 dtb=91.5855 sps=81.0464 mtps=1.3279 mfu=4.2955 train_loss=4.2858 val_loss=4.2797
 38%|███▊ | 38/100 [00:06<00:05, 12.40it/s]
[2024-03-09 10:38:16][INFO][trainer:768] - step=40 loss=2.5464 dt=59.2544 dtf=4.4211 dtb=52.8164 sps=135.0110 mtps=2.2120 mfu=4.4948 train_loss=4.2858 val_loss=4.2797
 44%|████▍ | 44/100 [00:07<00:04, 12.77it/s]
[2024-03-09 10:38:16][INFO][trainer:768] - step=45 loss=2.4945 dt=104.4158 dtf=4.6090 dtb=97.0525 sps=76.6167 mtps=1.2553 mfu=4.4022 train_loss=4.2858 val_loss=4.2797
 48%|████▊ | 48/100 [00:07<00:04, 12.18it/s]
[2024-03-09 10:38:16][INFO][trainer:768] - step=50 loss=2.4974 dt=111.7754 dtf=4.4986 dtb=105.1647 sps=71.5721 mtps=1.1726 mfu=4.2954 train_loss=4.2858 val_loss=4.2797
 54%|█████▍ | 54/100 [00:08<00:03, 11.94it/s]
[2024-03-09 10:38:17][INFO][trainer:768] - step=55 loss=2.4952 dt=75.8763 dtf=4.5003 dtb=68.6473 sps=105.4348 mtps=1.7274 mfu=4.3569 train_loss=4.2858 val_loss=4.2797
 58%|█████▊ | 58/100 [00:08<00:03, 11.35it/s]
[2024-03-09 10:38:17][INFO][trainer:768] - step=60 loss=2.4836 dt=109.1753 dtf=4.4064 dtb=102.7523 sps=73.2766 mtps=1.2006 mfu=4.2625 train_loss=4.2858 val_loss=4.2797
 64%|██████▍ | 64/100 [00:09<00:03, 9.38it/s]
[2024-03-09 10:38:18][INFO][trainer:768] - step=65 loss=2.4776 dt=118.7971 dtf=4.8723 dtb=111.1970 sps=67.3417 mtps=1.1033 mfu=4.1499 train_loss=4.2858 val_loss=4.2797
 69%|██████▉ | 69/100 [00:09<00:03, 9.22it/s]
[2024-03-09 10:38:18][INFO][trainer:768] - step=70 loss=2.4589 dt=79.8218 dtf=4.4681 dtb=72.6108 sps=100.2232 mtps=1.6421 mfu=4.2018 train_loss=4.2858 val_loss=4.2797
 73%|███████▎ | 73/100 [00:10<00:02, 10.19it/s]
[2024-03-09 10:38:19][INFO][trainer:768] - step=75 loss=2.4409 dt=106.6802 dtf=4.2909 dtb=100.3735 sps=74.9905 mtps=1.2286 mfu=4.1309 train_loss=4.2858 val_loss=4.2797
 79%|███████▉ | 79/100 [00:10<00:02, 10.17it/s]
[2024-03-09 10:38:19][INFO][trainer:768] - step=80 loss=2.4544 dt=114.6547 dtf=4.4237 dtb=107.5242 sps=69.7747 mtps=1.1432 mfu=4.0428 train_loss=4.2858 val_loss=4.2797
 84%|████████▍ | 84/100 [00:11<00:01, 8.56it/s]
[2024-03-09 10:38:20][INFO][trainer:768] - step=85 loss=2.4350 dt=94.4584 dtf=4.3971 dtb=87.4210 sps=84.6934 mtps=1.3876 mfu=4.0330 train_loss=4.2858 val_loss=4.2797
 89%|████████▉ | 89/100 [00:12<00:01, 9.06it/s]
[2024-03-09 10:38:21][INFO][trainer:768] - step=90 loss=2.4754 dt=76.8662 dtf=4.5669 dtb=69.5501 sps=104.0770 mtps=1.7052 mfu=4.1145 train_loss=4.2858 val_loss=4.2797
 93%|█████████▎| 93/100 [00:12<00:00, 12.15it/s]
[2024-03-09 10:38:21][INFO][trainer:768] - step=95 loss=2.4567 dt=79.3622 dtf=4.4828 dtb=72.8414 sps=100.8037 mtps=1.6516 mfu=4.1725 train_loss=4.2858 val_loss=4.2797
 99%|█████████▉| 99/100 [00:12<00:00, 13.33it/s]
[2024-03-09 10:38:21][INFO][trainer:768] - step=100 loss=2.4527 dt=58.1078 dtf=4.5494 dtb=49.4123 sps=137.6752 mtps=2.2557 mfu=4.3966 train_loss=4.2858 val_loss=4.2797
100%|██████████| 100/100 [00:12<00:00, 7.81it/s]
[2024-03-09 10:38:21][INFO][trainer:666] - Saving checkpoint to: /lus/eagle/projects/datascience/foremans/tmp/wordplay/src/outputs/runs/shakespeare/pytorch/DDP/2024-03-09/10-38-03
[2024-03-09 10:38:21][INFO][trainer:667] - Saving model to: /lus/eagle/projects/datascience/foremans/tmp/wordplay/src/outputs/runs/shakespeare/pytorch/DDP/2024-03-09/10-38-03/model.pth
[2024-03-09 10:38:21][INFO][configs:132] - Appending /lus/eagle/projects/datascience/foremans/tmp/wordplay/src/outputs/runs/shakespeare/pytorch/DDP/2024-03-09/10-38-03 to /lus/eagle/projects/datascience/foremans/tmp/wordplay/src/ckpts/checkpoints.log
wandb:
wandb:
wandb: Run history:
wandb: Loss/iter ▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇██
wandb: Loss/lossf █▆▄▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁
wandb: Loss/mfu ▁███████████████████
wandb: Loss/train ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: Loss/val ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: Timing/dt_avg ▄▅▃▇▁▆▆▁▆▇▃▇█▄▇█▅▃▄▁
wandb: Timing/dt_iter ▄▄▃▇▁▆▆▁▆▇▃▇█▄▇█▅▃▃▁
wandb: Timing/dt_tot ▄▅▃▇▁▆▆▁▆▇▃▇█▄▇█▅▃▄▁
wandb: Timing/dtb_avg ▄▅▃▇▁▆▆▁▆▇▃▇█▄▇█▅▃▄▁
wandb: Timing/dtb_tot ▄▅▃▇▁▆▆▁▆▇▃▇█▄▇█▅▃▄▁
wandb: Timing/dtf_avg ▂▄▄▅▄▁▂▃▅▄▄▂█▃▁▃▂▄▃▄
wandb: Timing/dtf_tot ▂▄▄▅▄▁▂▃▅▄▄▂█▃▁▃▂▄▃▄
wandb: Timing/iter ▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇██
wandb: Timing/samples_per_sec ▄▃▅▁█▂▂█▂▁▅▂▁▄▂▁▃▅▄█
wandb: Timing/startup_time ▁
wandb: Timing/tokens_per_sec ▄▃▅▁█▂▂█▂▁▅▂▁▄▂▁▃▅▄█
wandb: Training/iter ▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇██
wandb: Training/loss █▆▄▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁
wandb: Training/loss_tot █▆▄▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁
wandb: Training/lr ▁▁▂▂▂▃▃▄▄▄▅▅▅▆▆▇▇▇██
wandb:
wandb: Run summary:
wandb: Loss/iter 100
wandb: Loss/lossf 2.45266
wandb: Loss/mfu 4.39655
wandb: Loss/train 4.28576
wandb: Loss/val 4.27968
wandb: Timing/dt_avg 0.02698
wandb: Timing/dt_iter 0.05811
wandb: Timing/dt_tot 0.05396
wandb: Timing/dtb_avg 0.04941
wandb: Timing/dtb_tot 0.04941
wandb: Timing/dtf_avg 0.00455
wandb: Timing/dtf_tot 0.00455
wandb: Timing/iter 99
wandb: Timing/samples_per_sec 137.67523
wandb: Timing/startup_time 5.55634
wandb: Timing/tokens_per_sec 2255670.95332
wandb: Training/iter 99
wandb: Training/loss 2.45266
wandb: Training/loss_tot 2.45266
wandb: Training/lr 0.00099
wandb:
wandb: 🚀 View run feasible-sunset-4 at: https://wandb.ai/aurora_gpt/WordPlay/runs/6b22rdws
wandb: ️⚡ View job at https://wandb.ai/aurora_gpt/WordPlay/jobs/QXJ0aWZhY3RDb2xsZWN0aW9uOjE0NzA3MTg5Mw==/version_details/v3
wandb: Synced 5 W&B file(s), 0 media file(s), 1 artifact file(s) and 1 other file(s)
wandb: Find logs at: /lus/eagle/projects/datascience/foremans/tmp/wordplay/src/outputs/runs/shakespeare/pytorch/DDP/2024-03-09/10-38-03/wandb/run-20240309_103805-6b22rdws/logs
Application 52482c6c resources: utime=126s stime=112s maxrss=3541260KB inblock=1704 oublock=506056 minflt=4851205 majflt=0 nvcsw=69557 nivcsw=23617
```
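Two quick sanity checks against the numbers in the log above: `Tokens per iteration: 131,072` is just `batch_size x block_size x gradient_accumulation_steps x WORLD_SIZE`, and the 10,740,096 decayed parameters follow from the model config, assuming a standard nanoGPT-style GPT block with `bias: false` (that architecture is an assumption here, not confirmed against the `wordplay` source):

```python
# Sanity checks on numbers reported in the training log, using values
# from the config dump. The parameter-count formula assumes a standard
# nanoGPT-style GPT block with bias=false (an assumption, not confirmed
# against the wordplay source).
batch_size, block_size, grad_accum, world_size = 64, 256, 1, 8
tokens_per_iter = batch_size * block_size * grad_accum * world_size
print(f"tokens per iteration: {tokens_per_iter:,}")        # 131,072

n_layer, n_embd, vocab_size = 6, 384, 65
attn_weights = 4 * n_embd**2                               # qkv projection + output projection
mlp_weights = 8 * n_embd**2                                 # 4x expansion: up + down projections
embeddings = vocab_size * n_embd + block_size * n_embd      # token + position embeddings
decayed = n_layer * (attn_weights + mlp_weights) + embeddings
print(f"decayed parameters: {decayed:,}")                   # 10,740,096
```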
- [Homework] Submit either:
  - a link to your W&B run, e.g.:
    `wandb: 🚀 View run feasible-sunset-4 at: https://wandb.ai/aurora_gpt/WordPlay/runs/6b22rdws`
  - the path to your logfile