Document gated model usage, add a check after install to see if user needs to setup gated model access (#327)

* Add gated command to help with requesting access to gated models on Hugging Face

* Update documentation

* Fix readthedocs builds

* Fix readthedocs builds

* Move documentation around

---------

Co-authored-by: pierre.delaunay <[email protected]>
Delaunay and pierre.delaunay authored Jan 13, 2025
1 parent 1331235 commit d1cb39a
Showing 23 changed files with 356 additions and 128 deletions.
13 changes: 13 additions & 0 deletions config/base.yaml
@@ -67,12 +67,14 @@ llama:
group: llm
install_group: torch
max_duration: 3600
url: https://huggingface.co/meta-llama/Llama-2-7b/tree/main
tags:
- nlp
- llm
- inference
- monogpu
- nobatch
- gated

voir:
options:
@@ -541,6 +543,8 @@ _llm:
tags:
- nlp
- llm
- gated

max_duration: 3600
num_machines: 1
inherits: _defaults
@@ -549,6 +553,7 @@

llm-lora-single:
inherits: _llm
url: https://huggingface.co/meta-llama/Llama-3.1-8B
tags:
- monogpu
plan:
@@ -574,8 +579,11 @@ llm-lora-ddp-gpus:
plan:
method: njobs
n: 1

url: https://huggingface.co/meta-llama/Llama-3.1-8B
tags:
- multigpu

argv:
"{milabench_code}/recipes/lora_finetune_distributed.py": true
--config: "{milabench_code}/configs/llama3_8B_lora_single_device.yaml"
@@ -599,6 +607,7 @@ llm-lora-ddp-nodes:
method: njobs
n: 1

url: https://huggingface.co/meta-llama/Llama-3.1-8B
argv:
"{milabench_code}/recipes/lora_finetune_distributed.py": true
--config: "{milabench_code}/configs/llama3_8B_lora_single_device.yaml"
@@ -618,6 +627,7 @@

llm-lora-mp-gpus:
inherits: _llm
url: https://huggingface.co/meta-llama/Llama-3.1-70B
tags:
- multigpu
plan:
@@ -644,6 +654,8 @@ llm-full-mp-gpus:
options:
stop: 30
inherits: _llm

url: https://huggingface.co/meta-llama/Llama-3.1-70B
tags:
- multigpu
plan:
@@ -666,6 +678,7 @@ llm-full-mp-gpus:
device={device_name}: true

llm-full-mp-nodes:
url: https://huggingface.co/meta-llama/Llama-3.1-70B
tags:
- multinode
max_duration: 3600
13 changes: 13 additions & 0 deletions docs/.readthedocs.yaml
@@ -0,0 +1,13 @@
version: 2

build:
os: ubuntu-22.04
tools:
python: "3.11"

sphinx:
configuration: docs/conf.py

python:
install:
- requirements: docs/requirements.txt
File renamed without changes.
49 changes: 49 additions & 0 deletions docs/Contributing/design.rst
@@ -0,0 +1,49 @@
Design
======

Milabench aims to simulate research workloads for benchmarking purposes.

* Performance is measured as throughput (samples / sec).
  For example, for a model like resnet the throughput would be images per second.

* Single GPU workloads are spawned once per GPU to ensure the entire machine is used,
  simulating something similar to a hyperparameter search.
  The performance of the benchmark is the sum of the throughput of all processes.

* Multi GPU workloads

* Multi Nodes


Run
---

* Milabench Manager Process
* Handles messages from benchmark processes
* Saves messages into a file for future analysis

* Benchmark processes
* run using ``voir``
* voir is configured to intercept and send events during the training process
* This allows us to add models from git repositories without modification
* voir sends data through a file descriptor that was created by the milabench main process


What milabench is
-----------------

* Training focused
* milabench shows candid performance numbers
* No optimization beyond batch size scaling is performed
* we want to measure the performance our researchers will see,
  not the performance they could get
* PyTorch centric
* PyTorch has become the de facto library for research
* We are looking for accelerators with good maturity that can support
  this framework with limited code changes


What milabench is not
---------------------

* milabench's goal is not to be a performance showcase for an accelerator.
File renamed without changes.
File renamed without changes.
@@ -1,6 +1,6 @@

Creating a new benchmark
------------------------
Adding a benchmark
==================

To define a new benchmark (let's assume it is called ``ornatebench``),

91 changes: 84 additions & 7 deletions docs/flow.rst → docs/Contributing/overview.rst
@@ -1,5 +1,5 @@
Milabench Overview
------------------
Overview
========

.. code-block:: txt
@@ -230,11 +230,88 @@ Execution Flow
* **run_script**: the script will start to run now
* **finalize**: tearing down

How do I
--------

* I want to run a benchmark without milabench for debugging purposes
* ``milabench dev {benchname}`` will open bash with the benchmark venv sourced
* alternatively: ``source $MILABENCH_BASE/venv/torch/bin/activate``
Execution Plan
--------------

* milabench main process
* gathers metrics from benchmark processes and saves them to a file
* manages the benchmarks (timeouts, etc.)

* if ``per_gpu`` is used, milabench will launch one process per GPU (sets ``CUDA_VISIBLE_DEVICES``)
* each process logs its GPU data
* might spawn a monitor process
* will init pynvml
* the dataloader will also spawn worker processes
* usually not using the GPU

* if ``njobs`` is used, milabench will launch a single process (torchrun)
* torchrun in turn will spawn one process per GPU
* RANK 0 is used for logging
* RANK 0 might spawn a monitor process
* will init pynvml
* the dataloader will also spawn worker processes
* usually not using the GPU

per_gpu
^^^^^^^

``per_gpu``: used for mono-GPU benchmarks; spawns one process per GPU, each running the same benchmark.

.. code-block:: yaml
_torchvision:
inherits: _defaults
definition: ../benchmarks/torchvision
group: torchvision
install_group: torch
plan:
method: per_gpu
Milabench will essentially execute something akin to the following.

.. code-block:: bash
echo "---"
echo "fp16"
echo "===="
time (
CUDA_VISIBLE_DEVICES=0 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
CUDA_VISIBLE_DEVICES=1 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
CUDA_VISIBLE_DEVICES=2 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
CUDA_VISIBLE_DEVICES=3 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
CUDA_VISIBLE_DEVICES=4 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
CUDA_VISIBLE_DEVICES=5 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
CUDA_VISIBLE_DEVICES=6 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
CUDA_VISIBLE_DEVICES=7 $SRC/milabench/benchmarks/flops/activator $BASE/venv/torch $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16 &
wait
)
njobs
^^^^^

``njobs``: used to launch a single job that can see all the GPUs.

.. code-block:: yaml
_torchvision_ddp:
inherits: _defaults
definition: ../benchmarks/torchvision_ddp
group: torchvision
install_group: torch
plan:
method: njobs
n: 1
Milabench will essentially execute something akin to the following.

.. code-block:: bash
echo "---"
echo "lightning-gpus"
echo "=============="
time (
$BASE/venv/torch/bin/benchrun --nnodes=1 --rdzv-backend=c10d --rdzv-endpoint=127.0.0.1:29400 --master-addr=127.0.0.1 --master-port=29400 --nproc-per-node=8 --no-python -- python $SRC/milabench/benchmarks/lightning/main.py --epochs 10 --num-workers 8 --loader pytorch --data $BASE/data/FakeImageNet --model resnet152 --batch-size 16 &
wait
)
1 change: 1 addition & 0 deletions docs/process.rst → docs/Contributing/process.rst
@@ -8,6 +8,7 @@ Preparing

* NVIDIA
* AMD
* Intel

2. Create a milabench configuration for your RFP
Milabench comes with a wide variety of benchmarks.
33 changes: 22 additions & 11 deletions docs/recipes.rst → docs/Contributing/recipes.rst
@@ -1,5 +1,5 @@
Running Milabench
=================
Recipes
=======

Base Setup
----------
@@ -35,11 +35,9 @@ The current setup runs on 8xA100 SXM4 80Go.
Note that some benchmarks do require more than 40Go of VRAM.
One bench might be problematic: rwkv, which requires nvcc, but it can be ignored.

Recipes
-------

Increase Runtime
^^^^^^^^^^^^^^^^
----------------

For profiling, it might be useful to run the benchmarks for longer than the default configuration allows.
You can update the yaml file (``config/base.yaml`` or ``config/standard.yaml``) to increase the runtime limits.
@@ -57,7 +55,7 @@ and ``voir.options.stop`` which represent the target number of observations mila
# an observation is usually a batch forward/backward/optimizer.step (i.e. one train step)
One Env
^^^^^^^
-------

If you are using a container with dependencies such as pytorch already installed,
you can force milabench to use a single environment for everything.
@@ -69,17 +67,17 @@ you can force milabench to use a single environment for everything.
milabench run --use-current-env --select bert-fp32
Batch resizer
^^^^^^^^^^^^^
-------------

If the GPU you are using has less VRAM, automatic batch resizing can be enabled with the command below.
Note that this will not impact benchmarks that already use a batch size of one, such as opt-6_7b and possibly opt-1_3b.

.. code-block:: bash
MILABENCH_SIZER_AUTO=True milabench run
MILABENCH_SIZER_AUTO=1 milabench run
Device Select
^^^^^^^^^^^^^
-------------

To run on a subset of GPUs (note that by default milabench will try to use all the GPUs all the time,
which might make a run take a bit longer; reducing the number of visible devices to 2 might make experimentation faster)
@@ -89,7 +87,7 @@ which might make a run take a bit longer, reducing the number of visible devices
CUDA_VISIBLE_DEVICES=0,1,2,3 milabench run
Update Package
^^^^^^^^^^^^^^
--------------

To update pytorch to use a newer version of cuda (milabench creates a separate environment for benchmarks)

@@ -100,7 +98,7 @@ To update pytorch to use a newer version of cuda (milabench creates a separate e
pip install -U torch torchvision torchaudio
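
A minimal sketch of the full sequence, assuming the default venv layout; the CUDA wheel index URL is only an example, substitute the CUDA version you are targeting:

.. code-block:: bash
# activate the shared torch environment, then upgrade against the desired CUDA wheel index
source $MILABENCH_BASE/venv/torch/bin/activate
pip install -U torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
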
Arguments
^^^^^^^^^
---------

If environment variables are troublesome, the values can also be passed as arguments.

@@ -118,6 +116,18 @@ It holds all the benchmark specific logs and metrics gathered by milabench.
zip -r results.zip results
Run a benchmark without milabench
---------------------------------

.. code-block:: bash
milabench dev {benchname} # will open bash with the benchmark venv sourced
# alternatively
source $MILABENCH_BASE/venv/torch/bin/activate
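
Once inside that environment, a benchmark script can be launched by hand. A minimal sketch reusing the flops command from the ``per_gpu`` example in the overview (paths and flags are illustrative, adjust them to your checkout and base directory):

.. code-block:: bash
# run a single benchmark process on one GPU from inside the sourced venv
CUDA_VISIBLE_DEVICES=0 python $SRC/milabench/benchmarks/flops/main.py --number 30 --repeat 90 --m 8192 --n 8192 --dtype fp16
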
Containers
----------

@@ -306,6 +316,7 @@ Example Reports
Issues
------

.. code-block:: txt
> Traceback (most recent call last):
File renamed without changes.
File renamed without changes.
16 changes: 14 additions & 2 deletions docs/usage.rst → docs/GettingStarted/usage.rst
@@ -29,9 +29,21 @@ Before running the benchmarks

2. Set the ``$MILABENCH_CONFIG`` environment variable to the configuration file that represents the benchmark suite you want to run. Normally it should be set to ``config/standard.yaml``.

3. ``milabench install``: Install the individual benchmarks in virtual environments.
3. Set up Hugging Face access (a consolidated example follows this list)

4. ``milabench prepare``: Download the datasets, weights, etc.
1. Request access to gated models

- `Llama-2-7b <https://huggingface.co/meta-llama/Llama-2-7b>`_
- `Llama-3.1-8B <https://huggingface.co/meta-llama/Llama-3.1-8B>`_
- `Llama-3.1-70B <https://huggingface.co/meta-llama/Llama-3.1-70B>`_

2. Create a new `read token <https://huggingface.co/settings/tokens/new?tokenType=read>`_ to download the models

3. Add the token to your environment ``export MILABENCH_HF_TOKEN={your_token}``

4. ``milabench install``: Install the individual benchmarks in virtual environments.

5. ``milabench prepare``: Download the datasets, weights, etc.
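
A consolidated sketch of steps 2 to 5, assuming the ``config/standard.yaml`` suite and that access to the three gated Llama repositories has already been granted; the token value is a placeholder:

.. code-block:: bash
# export the configuration and the Hugging Face read token, then install and prepare
export MILABENCH_CONFIG=config/standard.yaml
export MILABENCH_HF_TOKEN=hf_xxxxxxxx   # read token created at https://huggingface.co/settings/tokens
milabench install                       # install the individual benchmarks in virtual environments
milabench prepare                       # download the datasets, weights, etc.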

If the machine has both NVIDIA/CUDA and AMD/ROCm GPUs, you may have to set the ``MILABENCH_GPU_ARCH`` environment variable as well, to either ``cuda`` or ``rocm``.

4 changes: 4 additions & 0 deletions docs/Welcome/Changelog.rst
@@ -0,0 +1,4 @@
Changelog
=========

TBD