Merge pull request #254 from Modalities/warmstart_infrastructure_switch
Warmstart infrastructure switch
le1nux authored Sep 17, 2024
2 parents dace200 + 9a3ff8c commit 8158de7
Showing 97 changed files with 3,911 additions and 949 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -160,5 +160,5 @@ pyenv*
noteboks/*

tests/tmp/*
*wandb_storage*
.coverage/*
wandb_storage/
21 changes: 13 additions & 8 deletions README.md
@@ -23,7 +23,7 @@ We successfully scaled Modalities up to 2048 GPUs on two HPC centers, namely [Le
Besides its scalability, Modalities allows you to seamlessly integrate new components and features, such as custom attention mechanisms, loss functions, optimizers, or models. We provide a series of tutorials to help you get started with training and evaluating models using Modalities. We achieve this level of extensibility through clear interfaces for each component type (e.g., model, optimizer, etc.) that a component must implement to be registered within Modalities at runtime.

## Getting Started
For training and evaluation of a model, feel free to check out [this](https://github.com/Modalities/modalities/blob/main/examples/getting_started/README.md) getting started tutorial, in which we train a small, 60M-parameter GPT model on a tiny subset of the Redpajama V2 dataset.
For training and evaluation of a model, feel free to check out [this](https://github.com/Modalities/modalities/blob/main/tutorials/getting_started/README.md) getting started tutorial, in which we train a small, 60M-parameter GPT model on a tiny subset of the Redpajama V2 dataset.

## Installation

@@ -108,7 +108,7 @@ Explanation:

* `$(which modalities) run`: This part dynamically finds the path to the Modalities executable and runs it. The `run` command triggers the main process to start the training.

* `--config_file_path configs/pretraining_config.yaml`: The `--config_file_path` argument provides the path to the configuration file for the training job. In the example above, it is given by `configs/pretraining_config.yaml`. A configuration file contains an exhaustive parameterization of all the training components (e.g., dataset, model, optimizer, etc.), making training fully reproducible. An example configuration file can be found [here](examples/getting_started/example_config.yaml), and a complete list of components available in Modalities is provided [here](docs/components/components.md).
* `--config_file_path configs/pretraining_config.yaml`: The `--config_file_path` argument provides the path to the configuration file for the training job. In the example above, it is given by `configs/pretraining_config.yaml`. A configuration file contains an exhaustive parameterization of all the training components (e.g., dataset, model, optimizer, etc.), making training fully reproducible. An example configuration file can be found [here](tutorials/getting_started/example_config.yaml), and a complete list of components available in Modalities is provided [here](docs/components/components.md).
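Putting the two pieces together, a full single-node launch could look like the sketch below. Only `$(which modalities) run --config_file_path ...` is taken from the README excerpt above; the `torchrun` wrapper, GPU count, and device list are illustrative assumptions.

```sh
# Hypothetical single-node launch on 4 GPUs; the torchrun wrapper and flag
# values are assumptions for illustration, not taken from this diff.
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes 1 --nproc_per_node 4 \
  $(which modalities) run --config_file_path configs/pretraining_config.yaml
```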

If you are a VSCode user, you may want to add this to your `launch.json`:
```json
@@ -155,7 +155,7 @@ The `modalities data create_raw_index` command triggers the process of creating

### Raw Training Dataset Tokenization

Tokenization is the process of converting raw text data into a sequence of tokens that can be used as input to the model. Tokenization requires a configuration file that fully describes the tokenization process, making it reproducible. An example tokenization config can be found [here](examples/getting_started/example_dataset_config_train.yaml).
Tokenization is the process of converting raw text data into a sequence of tokens that can be used as input to the model. Tokenization requires a configuration file that fully describes the tokenization process, making it reproducible. An example tokenization config can be found [here](tutorials/getting_started/example_dataset_config_train.yaml).

Example:
```sh
@@ -164,7 +164,7 @@ modalities data pack_encoded_data configs/tokenization_config.yaml

### Inference

For inference on a model checkpoint, we have to pass a configuration file that specifies the full inference setup. An example inference config can be found [here](examples/getting_started/example_text_generation_config.yaml).
For inference on a model checkpoint, we have to pass a configuration file that specifies the full inference setup. An example inference config can be found [here](tutorials/getting_started/example_text_generation_config.yaml).

Example:

@@ -176,14 +176,19 @@ modalities generate_text --config_file_path example_text_generation_config.yaml
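For reference, text generation from a checkpoint boils down to the command shown in the hunk context above, pointed at the inference config:

```sh
# Command taken from the hunk context above; the config path is the example
# file referenced in the Inference section.
modalities generate_text --config_file_path example_text_generation_config.yaml
```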
## Tutorials
Even though Modalities significantly simplifies LLM training, there is still some technical complexity left. We provide a series of tutorials to help you get started with training and evaluating models using Modalities.

- [Getting Started](examples/getting_started/README.md)</br>
- [Modalities in 15mins](tutorials/modalities_in_15_mins/README.md) </br>
Train a dense model with Modalities in 15 minutes

- [Getting Started](tutorials/getting_started/README.md)</br>
Brief overview on how to get started with Modalities by training a small GPT model on a tiny subset of the Redpajama V2 dataset.

- [Library Usage](examples/library_usage/README.md)</br>
- [Warmstart](tutorials/warmstart/README.md) </br>
Continue the training from a checkpoint, e.g., after the training was interrupted or crashed.

- [Library Usage](tutorials/library_usage/README.md)</br>
How to use Modalities as a library and register custom components with Modalities.

- [Modalities in 15mins] </br>
Jupyter notebook will be added soon



## Supported Features
114 changes: 65 additions & 49 deletions config_files/training/config_example_coca.yaml
@@ -4,27 +4,53 @@ settings:
referencing_keys:
sample_key: input_ids
target_key: target_ids
training:
training_log_interval_in_steps: 2
checkpointing_interval_in_steps: 2
evaluation_interval_in_steps: 2
global_num_seen_tokens: 0
activation_checkpointing_modules: []
gradient_acc_steps: 1
local_train_micro_batch_size: 3
sequence_length: 256
prediction_key: logits
cuda_env:
local_rank: ${cuda_env:LOCAL_RANK}
global_rank: ${cuda_env:RANK}
world_size: ${cuda_env:WORLD_SIZE}
paths:
checkpointing_path: data/checkpoints

tokenizer:
component_key: tokenizer
variant_key: gpt2_tokenizer_fast
config:
tokenizer_file: data/tokenizer/tokenizer_gpt2.json
checkpoint_saving_path: data/checkpoints
train_dataset_path: ./data/lorem_ipsum.pbin
intervals:
training_log_interval_in_steps: 2
checkpointing_interval_in_steps: 2
evaluation_interval_in_steps: 2
consistency_enforcement:
enforce_tokens_per_step_consistency: true
enforce_last_step_logged: false
enforce_last_step_evaluated: false
enforce_last_step_checkpointed: false
step_profile:
gradient_accumulation_steps: 1
local_train_micro_batch_size: 1
sequence_length: 256
training_target:
num_target_tokens:
component_key: number_conversion
variant_key: num_tokens_from_num_steps
config:
num_steps: ${settings.training_target.num_target_steps}
num_ranks: ${settings.cuda_env.world_size}
local_micro_batch_size: ${settings.step_profile.local_train_micro_batch_size}
sequence_length: ${settings.step_profile.sequence_length}
gradient_accumulation_steps: ${settings.step_profile.gradient_accumulation_steps}
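# Note (assumption): num_tokens_from_num_steps above presumably resolves to
# num_steps * num_ranks * local_micro_batch_size * sequence_length * gradient_accumulation_steps.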
num_target_steps: # for the batch progress subscriber
component_key: number_conversion
variant_key: num_steps_from_num_samples
config:
num_ranks: ${settings.cuda_env.world_size}
local_micro_batch_size: ${settings.step_profile.local_train_micro_batch_size}
global_num_samples: ${settings.coca_example_settings.train_num_samples}
gradient_accumulation_steps: ${settings.step_profile.gradient_accumulation_steps}
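# Note (assumption): num_steps_from_num_samples above presumably resolves to
# global_num_samples / (num_ranks * local_micro_batch_size * gradient_accumulation_steps).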
training_progress:
global_num_seen_tokens: 0
num_seen_steps: 0
local_num_seen_batches: 0
last_step: -1
coca_example_settings:
train_num_samples: 64
val_num_samples: 32

collate_fn:
component_key: collate_fn
@@ -41,7 +67,7 @@ train_dataset:
component_key: dataset
variant_key: dummy_dataset
config:
num_samples: 64
num_samples: ${settings.coca_example_settings.train_num_samples}
sample_definition:
- sample_key: images
sample_shape: [3, 224, 224]
@@ -54,7 +80,7 @@ val_dataset:
component_key: dataset
variant_key: dummy_dataset
config:
num_samples: 32
num_samples: ${settings.coca_example_settings.val_num_samples}
sample_definition:
- sample_key: images
sample_shape: [3, 224, 224]
@@ -69,23 +95,26 @@ train_dataloader:
config:
num_workers: 2
pin_memory: true
shuffle: false
dataloader_tag: "train"
dataloader_tag: train
skip_num_batches: ${settings.training_progress.local_num_seen_batches}
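# Note (assumption): skip_num_batches presumably lets a warm start resume mid-epoch by
# skipping the batches already consumed locally (tracked via settings.training_progress).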
dataset:
instance_key: train_dataset
pass_type: BY_REFERENCE
batch_sampler:
component_key: batch_sampler
variant_key: default
config:
batch_size: ${settings.training.local_train_micro_batch_size}
batch_size: ${settings.step_profile.local_train_micro_batch_size}
drop_last: true
sampler:
component_key: sampler
variant_key: distributed_sampler
config:
rank: ${settings.cuda_env.global_rank}
num_replicas: ${settings.cuda_env.world_size}
shuffle: true
drop_last: true
seed: 42
dataset:
instance_key: train_dataset
pass_type: BY_REFERENCE
@@ -99,23 +128,25 @@ val_dataloader:
config:
num_workers: 2
pin_memory: true
shuffle: false
dataloader_tag: "val"
dataloader_tag: val
dataset:
instance_key: val_dataset
pass_type: BY_REFERENCE
batch_sampler:
component_key: batch_sampler
variant_key: default
config:
batch_size: ${settings.training.local_train_micro_batch_size}
batch_size: ${settings.step_profile.local_train_micro_batch_size}
drop_last: true

sampler:
component_key: sampler
variant_key: distributed_sampler
config:
rank: ${settings.cuda_env.global_rank}
num_replicas: ${settings.cuda_env.world_size}
shuffle: false
drop_last: true
dataset:
instance_key: train_dataset
pass_type: BY_REFERENCE
@@ -140,22 +171,16 @@ checkpoint_saving:
component_key: checkpoint_saving_execution
variant_key: fsdp
config:
checkpoint_path: ${settings.paths.checkpointing_path}
checkpoint_path: ${settings.paths.checkpoint_saving_path}
global_rank: ${settings.cuda_env.global_rank}
experiment_id: ${settings.experiment_id}
get_num_tokens_from_num_steps_callable:
component_key: number_conversion
variant_key: num_tokens_from_num_steps_callable
config:
num_ranks: ${settings.cuda_env.world_size}
local_micro_batch_size: ${settings.training.local_train_micro_batch_size}
sequence_length: ${settings.training.sequence_length}

loss_fn:
component_key: loss
variant_key: clm_cross_entropy_loss
config:
target_key: ${settings.referencing_keys.target_key}
prediction_key: logits
prediction_key: ${settings.referencing_keys.prediction_key}

wrapped_model:
component_key: model
@@ -169,7 +194,7 @@ wrapped_model:
sharding_strategy: FULL_SHARD
block_names: [TransformerBlock, VisionTransformerBlock]

model:
model:
component_key: model
variant_key: model_initialized
config:
@@ -241,9 +266,10 @@ scheduler:
max_lr: 6e-4
div_factor: 10
final_div_factor: 1
total_steps: 64
total_steps: ${settings.training_target.num_target_steps}
pct_start: 0.01
anneal_strategy: cos
last_epoch: ${settings.training_progress.last_step}
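# Note (assumption): last_epoch presumably lets the (OneCycle-style) schedule resume from
# the recorded step on a warm start instead of restarting from step 0.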

optimizer:
component_key: optimizer
@@ -267,24 +293,14 @@ gradient_clipper:
pass_type: BY_REFERENCE
norm_type: P2_NORM


batch_progress_subscriber:
progress_subscriber:
component_key: progress_subscriber
variant_key: rich
config:
global_rank: ${settings.cuda_env.global_rank}
global_num_seen_steps:
component_key: number_conversion
variant_key: num_steps_from_num_tokens
config:
num_ranks: ${settings.cuda_env.world_size}
local_micro_batch_size: ${settings.training.local_train_micro_batch_size}
global_num_tokens: ${settings.training.global_num_seen_tokens}
sequence_length: ${settings.training.sequence_length}
gradient_acc_steps: ${settings.training.gradient_acc_steps}
train_dataloader:
instance_key: train_dataloader
pass_type: BY_REFERENCE
num_seen_steps: ${settings.training_progress.num_seen_steps}
num_target_steps: ${settings.training_target.num_target_steps}
train_dataloader_tag: ${train_dataloader.config.dataloader_tag}
eval_dataloaders:
instance_key: eval_dataloaders
pass_type: BY_REFERENCE