Document gated model usage, add a check after install to see if user needs to setup gated model access (#327)

* Add gated command to help with requesting access to gated models on Hugging Face

* Update documentation

* Fix readthedocs builds

* Fix readthedocs builds

* Move documentation around

---------

Co-authored-by: pierre.delaunay <[email protected]>
Delaunay and pierre.delaunay authored Jan 13, 2025
1 parent 1331235 commit d1cb39a
Showing 23 changed files with 356 additions and 128 deletions.
13 changes: 13 additions & 0 deletions config/base.yaml
@@ -67,12 +67,14 @@ llama:
group: llm
install_group: torch
max_duration: 3600
url: https://huggingface.co/meta-llama/Llama-2-7b/tree/main
tags:
- nlp
- llm
- inference
- monogpu
- nobatch
- gated

voir:
options:
@@ -541,6 +543,8 @@ _llm:
tags:
- nlp
- llm
- gated

max_duration: 3600
num_machines: 1
inherits: _defaults
@@ -549,6 +553,7 @@

llm-lora-single:
inherits: _llm
url: https://huggingface.co/meta-llama/Llama-3.1-8B
tags:
- monogpu
plan:
@@ -574,8 +579,11 @@ llm-lora-ddp-gpus:
plan:
method: njobs
n: 1

url: https://huggingface.co/meta-llama/Llama-3.1-8B
tags:
- multigpu

argv:
"{milabench_code}/recipes/lora_finetune_distributed.py": true
--config: "{milabench_code}/configs/llama3_8B_lora_single_device.yaml"
@@ -599,6 +607,7 @@ llm-lora-ddp-nodes:
method: njobs
n: 1

url: https://huggingface.co/meta-llama/Llama-3.1-8B
argv:
"{milabench_code}/recipes/lora_finetune_distributed.py": true
--config: "{milabench_code}/configs/llama3_8B_lora_single_device.yaml"
@@ -618,6 +627,7 @@

llm-lora-mp-gpus:
inherits: _llm
url: https://huggingface.co/meta-llama/Llama-3.1-70B
tags:
- multigpu
plan:
@@ -644,6 +654,8 @@ llm-full-mp-gpus:
options:
stop: 30
inherits: _llm

url: https://huggingface.co/meta-llama/Llama-3.1-70B
tags:
- multigpu
plan:
@@ -666,6 +678,7 @@ llm-full-mp-gpus:
device={device_name}: true

llm-full-mp-nodes:
url: https://huggingface.co/meta-llama/Llama-3.1-70B
tags:
- multinode
max_duration: 3600
13 changes: 13 additions & 0 deletions docs/.readthedocs.yaml
@@ -0,0 +1,13 @@
version: 2

build:
os: ubuntu-22.04
tools:
python: "3.11"

sphinx:
configuration: docs/conf.py

python:
install:
- requirements: docs/requirements.txt
File renamed without changes.
49 changes: 49 additions & 0 deletions docs/Contributing/design.rst
@@ -0,0 +1,49 @@
Design
======

Milabench aims to simulate research workloads for benchmarking purposes.

* Performance is measured as throughput (samples / sec).
  For example, for a model like resnet the throughput would be images per second.

* Single GPU workloads are spawned once per GPU to ensure the entire machine is used,
  simulating something similar to a hyperparameter search.
  The performance of the benchmark is the sum of the throughput of all processes.

* Multi GPU workloads

* Multi Nodes


Run
---

* Milabench Manager Process
* Handles messages from benchmark processes
* Saves messages into a file for future analysis

* Benchmark processes
* run using ``voir``
* voir is configured to intercept and send events during the training process
* This allows us to add models from git repositories without modification
* voir sends data through a file descriptor that was created by the milabench main process


What milabench is
-----------------

* Training focused
* milabench shows candid performance numbers
* No optimization beyond batch size scaling is performed
* we want to measure the performance our researchers will see,
  not the performance they could get
* PyTorch centric
* PyTorch has become the de facto library for research
* We are looking for accelerators with good maturity that can support
  this framework with limited code changes


What milabench is not
---------------------

* milabench's goal is not to be a performance showcase for an accelerator.
File renamed without changes.
File renamed without changes.
@@ -1,6 +1,6 @@

Creating a new benchmark
------------------------
Adding a benchmark
==================

To define a new benchmark (let's assume it is called ``ornatebench``),

91 changes: 84 additions & 7 deletions docs/flow.rst → docs/Contributing/overview.rst
@@ -1,5 +1,5 @@
Milabench Overview
------------------
Overview
========

.. code-block:: txt
@@ -230,11 +230,88 @@ Execution Flow
* **run_script**: the script will start to run now
* **finalize**: tearing down

How do I
--------

* I want to run a benchmark without milabench for debugging purposes
* ``milabench dev {benchname}`` will open bash with the benchmark venv sourced
* alternatively: ``source $MILABENCH_BASE/venv/torch/bin/activate``
Execution Plan
--------------

* milabench main process
* gathers metrics from benchmark processes and saves them to a file
* manages the benchmarks (timeouts, etc.)

* if ``per_gpu`` is used, milabench will launch one process per GPU (sets ``CUDA_VISIBLE_DEVICES``)
* each process logs its GPU data
* might spawn a monitor process
* will init pynvml
* the dataloader will also spawn worker processes
* usually not using the GPU

* if ``njobs`` is used, milabench will launch a single process (torchrun)
* torchrun in turn will spawn one process per GPU
* RANK 0 is used for logging
* RANK 0 might spawn a monitor process
* will init pynvml
* the dataloader will also spawn worker processes
* usually not using the GPU

per_gpu
^^^^^^^

``per_gpu``: used for mono-GPU benchmarks; spawns one process per GPU, each running the same benchmark.

.. code-block:: yaml
_torchvision:
inherits: _defaults
definition: ../benchmarks/torchvision
group: torchvision
install_group: torch
plan:
method: per_gpu
Milabench will essentially execute something akin to the following.

.. code-block:: bash
echo "---"
echo "fp16"
echo "===="
time (
CUDA_VISIBLE_DEVICES=0 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
CUDA_VISIBLE_DEVICES=1 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
CUDA_VISIBLE_DEVICES=2 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
CUDA_VISIBLE_DEVICES=3 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
CUDA_VISIBLE_DEVICES=4 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
CUDA_VISIBLE_DEVICES=5 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
CUDA_VISIBLE_DEVICES=6 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
CUDA_VISIBLE_DEVICES=7 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
wait
)
njobs
^^^^^

``njobs``: used to launch a single job that can see all the GPUs.

.. code-block:: yaml
_torchvision_ddp:
inherits: _defaults
definition: ../benchmarks/torchvision_ddp
group: torchvision
install_group: torch
plan:
method: njobs
n: 1
Milabench will essentially execute something akin to the following.

.. code-block:: bash
echo "---"
echo "lightning-gpus"
echo "=============="
time (
$BASE/venv/torch/bin/benchrun --nnodes=1 --rdzv-backend=c10d --rdzv-endpoint=127.0.0.1:29400 --master-addr=127.0.0.1 --master-port=29400 --nproc-per-node=8 --no-python -- python $SRC/milabench/benchmarks/lightning/main.py --epochs 10 --num-workers 8 --loader pytorch --data $BASE/data/FakeImageNet --model resnet152 --batch-size 16 &
wait
)
1 change: 1 addition & 0 deletions docs/process.rst → docs/Contributing/process.rst
@@ -8,6 +8,7 @@ Preparing

* NVIDIA
* AMD
* Intel

2. Create a milabench configuration for your RFP
Milabench comes with a wide variety of benchmarks.
33 changes: 22 additions & 11 deletions docs/recipes.rst → docs/Contributing/recipes.rst
@@ -1,5 +1,5 @@
Running Milabench
=================
Recipes
=======

Base Setup
----------
@@ -35,11 +35,9 @@ The current setup runs on 8xA100 SXM4 80Go.
Note that some benchmarks do require more than 40Go of VRAM.
One bench might be problematic: rwkv, which requires nvcc, but it can be ignored.

Recipes
-------

Increase Runtime
^^^^^^^^^^^^^^^^
----------------

For profiling, it might be useful to run the benchmarks for longer than the default configuration allows.
You can update the yaml file (``config/base.yaml`` or ``config/standard.yaml``) to increase the runtime limits.
@@ -57,7 +55,7 @@ and ``voir.options.stop`` which represent the target number of observations mila
# an observation is usually a batch forward/backward/optimizer.step (i.e. one train step)
One Env
^^^^^^^
-------

If you are using a container with dependencies such as pytorch already installed,
you can force milabench to use a single environment for everything.
@@ -69,17 +67,17 @@ you can force milabench to use a single environment for everything.
milabench run --use-current-env --select bert-fp32
Batch resizer
^^^^^^^^^^^^^
-------------

If the GPU you are using has less VRAM, automatic batch resizing can be enabled with the command below.
Note that this will not impact benchmarks that already use a batch size of one, such as opt-6_7b and possibly opt-1_3b.

.. code-block:: bash
MILABENCH_SIZER_AUTO=True milabench run
MILABENCH_SIZER_AUTO=1 milabench run
Device Select
^^^^^^^^^^^^^
-------------

To run on a subset of GPUs (note that by default milabench will try to use all the GPUs all the time,
which might make a run take a bit longer; reducing the number of visible devices to 2 might make experimentation faster)
@@ -89,7 +87,7 @@ which might make a run take a bit longer, reducing the number of visible devices
CUDA_VISIBLE_DEVICES=0,1,2,3 milabench run
Update Package
^^^^^^^^^^^^^^
--------------

To update pytorch to use a newer version of cuda (milabench creates a separate environment for benchmarks)

@@ -100,7 +98,7 @@ To update pytorch to use a newer version of cuda (milabench creates a separate e
pip install -U torch torchvision torchaudio
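
A minimal sketch of the full sequence, assuming the default venv layout; the CUDA wheel index URL is only an example, substitute the CUDA version you are targeting:

.. code-block:: bash
# activate the shared torch environment, then upgrade against the desired CUDA wheel index
source $MILABENCH_BASE/venv/torch/bin/activate
pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
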
Arguments
^^^^^^^^^
---------

If environment variables are troublesome, the values can also be passed as arguments.

@@ -118,6 +116,18 @@ It holds all the benchmark specific logs and metrics gathered by milabench.
zip -r results.zip results
Run a benchmark without milabench
---------------------------------

.. code-block:: bash
milabench dev {benchname} # will open bash with the benchmark venv sourced
# alternatively
source $MILABENCH_BASE/venv/torch/bin/activate
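
Once inside that environment, a benchmark script can be launched by hand. A minimal sketch reusing the flops command from the ``per_gpu`` example in the overview (paths and flags are illustrative, adjust them to your checkout and base directory):

.. code-block:: bash
# run a single benchmark process on one GPU from inside the sourced venv
CUDA_VISIBLE_DEVICES=0 python $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16
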
Containers
----------

@@ -306,6 +316,7 @@ Example Reports
Issues
------

.. code-block:: txt
> Traceback (most recent call last):
File renamed without changes.
File renamed without changes.
16 changes: 14 additions & 2 deletions docs/usage.rst → docs/GettingStarted/usage.rst
@@ -29,9 +29,21 @@ Before running the benchmarks

2. Set the ``$MILABENCH_CONFIG`` environment variable to the configuration file that represents the benchmark suite you want to run. Normally it should be set to ``config/standard.yaml``.

3. ``milabench install``: Install the individual benchmarks in virtual environments.
3. Set up Hugging Face access (a consolidated example follows this list)

4. ``milabench prepare``: Download the datasets, weights, etc.
1. Request access to gated models

- `Llama-2-7b <https://huggingface.co/meta-llama/Llama-2-7b>`_
- `Llama-3.1-8B <https://huggingface.co/meta-llama/Llama-3.1-8B>`_
- `Llama-3.1-70B <https://huggingface.co/meta-llama/Llama-3.1-70B>`_

2. Create a new `read token <https://huggingface.co/settings/tokens/new?tokenType=read>`_ to download the models

3. Add the token to your environment ``export MILABENCH_HF_TOKEN={your_token}``

4. ``milabench install``: Install the individual benchmarks in virtual environments.

5. ``milabench prepare``: Download the datasets, weights, etc.
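
A consolidated sketch of steps 2 to 5, assuming the ``config/standard.yaml`` suite and that access to the three gated Llama repositories has already been granted; the token value is a placeholder:

.. code-block:: bash
# export the configuration and the Hugging Face read token, then install and prepare
export MILABENCH_CONFIG=config/standard.yaml
export MILABENCH_HF_TOKEN=hf_xxxxxxxx   # read token created at https://huggingface.co/settings/tokens
milabench install                       # install the individual benchmarks in virtual environments
milabench prepare                       # download the datasets, weights, etc.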

If the machine has both NVIDIA/CUDA and AMD/ROCm GPUs, you may have to set the ``MILABENCH_GPU_ARCH`` environment variable as well, to either ``cuda`` or ``rocm``.

4 changes: 4 additions & 0 deletions docs/Welcome/Changelog.rst
@@ -0,0 +1,4 @@
Changelog
=========

TBD