Skip to content

Commit

Permalink
Merge pull request #84 from dusty-nv/20240227-slm
Browse files Browse the repository at this point in the history
20240227 slm
  • Loading branch information
dusty-nv authored Mar 3, 2024
2 parents 3da3749 + 2a53872 commit 2a85813
Show file tree
Hide file tree
Showing 6 changed files with 117 additions and 14 deletions.
32 changes: 22 additions & 10 deletions docs/benchmarks.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,36 +5,48 @@ hide:

# Benchmarks

## LLM
## Large Language Models (LLM)

<iframe width="600" height="371" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubchart?oid=363544488&amp;format=interactive"></iframe>
<iframe width="600" height="371" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubchart?oid=2126319913&format=interactive"></iframe>

For running LLM benchmarks, see the [`MLC`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/mlc) container documentation.

## VLM
## Small Language Models (SLM)

<iframe width="600" height="371" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubchart?oid=1183626625&amp;format=interactive"></iframe>
<iframe width="916" height="507" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubchart?oid=1746097360&format=interactive"></iframe>

For running VLM benchmarks, see the [`MLC`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/mlc) and [`MiniGPT-4`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/minigpt4) container documentation.
Small language models are generally defined as having fewer than 7B parameters *(Llama-7B shown for reference)*
For more data and info about running these models, see the [`SLM`](tutorial_slm.md) tutorial and [`MLC`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/mlc) container documentation.

## ViT
## Vision Language Models (VLM)

<iframe width="600" height="371" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubchart?oid=1923307503&amp;format=interactive"></iframe>
<iframe width="600" height="371" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubchart?oid=642317430&format=interactive"></iframe>

This measures the end-to-end pipeline performance for continuous streaming with [Live Llava](tutorial_live-llava.md).

> <sup>• &nbsp; These are all using [`CLIP ViT-L/14@336px`](https://huggingface.co/openai/clip-vit-large-patch14-336) for the vision encoder.</sup>
> <sup>• &nbsp; Jetson Orin Nano 8GB runs out of memory trying to run Llava-13B.</sup>
> <sup>• &nbsp; The tokens/sec performance is roughly equal to the base LM ([`StableLM-3B`](https://huggingface.co/stabilityai/stablelm-3b-4e1t) for [`Obsidian`](https://huggingface.co/NousResearch/Obsidian-3B-V0.5), Llama for Llava)</sup>

## Vision Transformers (ViT)

<iframe width="600" height="371" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubchart?oid=702230147&format=interactive"></iframe>

VIT performance data from [[1]](https://github.com/mit-han-lab/efficientvit#imagenet) [[2]](https://github.com/NVIDIA-AI-IOT/nanoowl#performance) [[3]](https://github.com/NVIDIA-AI-IOT/nanosam#performance)

## Stable Diffusion

<iframe width="600" height="371" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubchart?oid=1177544659&amp;format=interactive"></iframe>
<iframe width="600" height="371" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubchart?oid=2015943178&format=interactive"></iframe>

## Riva

<iframe width="600" height="371" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubchart?oid=266480884&amp;format=interactive"></iframe>
<iframe width="600" height="371" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubchart?oid=1167153335&format=interactive"></iframe>

For running Riva benchmarks, see [ASR Performance](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-performance.html) and [TTS Performance](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-performance.html).

## Vector Database

<iframe width="600" height="371" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubchart?oid=1160879788&amp;format=interactive"></iframe>
<iframe width="600" height="371" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubchart?oid=891899240&format=interactive"></iframe>

For running vector database benchmarks, see the [`NanoDB`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/vectordb/nanodb) container documentation.
Binary file added docs/images/slm_console.gif
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/images/slm_console_2.gif
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
6 changes: 3 additions & 3 deletions docs/overrides/home.html
Original file line number Diff line number Diff line change
Expand Up @@ -548,17 +548,17 @@ <h2 class="section-title"><a href="benchmarks.html">Benchmarks</a></h2>
</div>
<!-- feature-box-item -->
<div class="col-lg-4 col-sm-6 mb-4">
<a href="benchmarks.html#llm" class="padding-graph bg-white shadow text-center d-block match-height" style="height: 280px;">
<a href="benchmarks.html#large-language-models-llm" class="padding-graph bg-white shadow text-center d-block match-height" style="height: 280px;">
<img src="./images/graph_llm-text-generation.svg" style="width: 100%; height: 100%; object-fit:cover"></img>
</a>
</div>
<div class="col-lg-4 col-sm-6 mb-4">
<a href="benchmarks.html#vlm" class="padding-graph bg-white shadow padding-feature-box-item text-center d-block match-height" style="height: 280px;">
<a href="benchmarks.html#vision-language-models-vlm" class="padding-graph bg-white shadow padding-feature-box-item text-center d-block match-height" style="height: 280px;">
<img src="./images/graph_vlm-text-generation.svg" style="width: 100%; height: 100%; object-fit:cover"></img>
</a>
</div>
<div class="col-lg-4 col-sm-6 mb-4">
<a href="benchmarks.html#vit" class="padding-graph bg-white shadow text-center d-block match-height" style="height: 280px;">
<a href="benchmarks.html#vision-transformers-vit" class="padding-graph bg-white shadow text-center d-block match-height" style="height: 280px;">
<img src="./images/graph_vit-vision-transformers.svg" style="width: 100%; height: 100%; object-fit:cover"></img>
</a>
</div>
Expand Down
90 changes: 90 additions & 0 deletions docs/tutorial_slm.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,90 @@
# Tutorial - Small Language Models (SLM)

Small Language Models (SLMs) represent a growing class of language models that have <7B parameters - for example [StableLM](https://stability.ai/news/stable-lm-3b-sustainable-high-performance-language-models-smart-devices){:target="_blank"}, [Phi-2](https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/){:target="_blank"}, and [Gemma-2B](https://blog.google/technology/developers/gemma-open-models/){:target="_blank"}. Their smaller memory footprint and faster performance make them good candidates for deploying on Jetson Orin Nano. Some are very capable with abilities at a similar level as the larger models, having been trained on high-quality curated datasets.

<img width="900px" src="images/slm_console.gif">

This tutorial shows how to run optimized SLMs with quantization using the [`local_llm`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm){:target="_blank"} container and MLC/TVM backend. You can run these models through tools like [`text-generation-webui`](./tutorial_text-generation.md){:target="_blank"} and llama.cpp as well, just not as fast - and since the focus of SLMs is reduced computational and memory requirements, here we'll use the most optimized path available. Those shown below have been profiled.

## SLM Benchmarks

<iframe width="916" height="507" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubchart?oid=1746097360&format=interactive"></iframe>

<iframe width="1325px" height="350px" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubhtml?gid=921468602&amp;single=true&amp;widget=true&amp;headers=false"></iframe>

> <sup>• &nbsp; The HuggingFace [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard){:target="_blank"} is a collection of multitask benchmarks including reasoning & comprehension, math, coding, history, geography, ect.</sup>
> <sup>• &nbsp; The model's memory footprint includes 4-bit weights and KV cache at full context length (factor in extra for process overhead, library code, ect)</sup>
> <sup>• &nbsp; The `Chat Model` is the instruction-tuned variant for chatting with in the commands below, as opposed to the base completion model.</sup>
Based on user interactions, the recommended models to try are [`stabilityai/stablelm-zephyr-3b`](https://huggingface.co/stabilityai/stablelm-zephyr-3b){:target="_blank"} and [`princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT`](https://huggingface.co/princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT){:target="_blank"}, for having output quality on par with Llama-2-7B and well-optimized neural architectures. These models have also been used as the base for various fine-tunes (for example [`Nous-Capybara-3B-V1.9`](https://huggingface.co/NousResearch/Nous-Capybara-3B-V1.9){:target="_blank"}) and mini VLMs. Others may not be particularly coherent.

## Chatting with SLMs

!!! abstract "What you need"

1. One of the following Jetson devices:

<span class="blobDarkGreen4">Jetson AGX Orin (64GB)</span>
<span class="blobDarkGreen5">Jetson AGX Orin (32GB)</span>
<span class="blobLightGreen3">Jetson Orin NX (16GB)</span>
<span class="blobLightGreen4">Jetson Orin Nano (8GB)</span>
2. Running one of the following versions of [JetPack](https://developer.nvidia.com/embedded/jetpack){:target="_blank"}:

<span class="blobPink2">JetPack 6 (L4T r36.x)</span>

3. Sufficient storage space (preferably with NVMe SSD).

- `22GB` for `local_llm` container image
- Space for models (`>5GB`)

4. Clone and setup [`jetson-containers`](https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md){:target="_blank"}:

```bash
git clone https://github.com/dusty-nv/jetson-containers
cd jetson-containers
sudo apt update; sudo apt install -y python3-pip
pip3 install -r requirements.txt
```
5. If you had previous used [`local_llm`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm){:target="_blank"} container, update it first:

- `sudo docker pull $(./autotag local_llm)`

The [`local_llm.chat`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm#text-chat){:target="_blank"} program will automatically download and quantize models from HuggingFace like those listed in the table above:

```bash
./run.sh $(./autotag local_llm) \
python3 -m local_llm.chat --api=mlc \
--model princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT
```
> <sup>• &nbsp; For models requiring authentication, use `--env HUGGINGFACE_TOKEN=<YOUR-ACCESS-TOKEN>`</sup>
> <sup>• &nbsp; Press <kbd>Ctrl+C</kbd> twice in succession to exit (once will interrupt bot output)</sup>
This will enter into interactive mode where you chat back and forth using the keyboard (entering `reset` will clear the chat history)

<img width="900px" src="images/slm_console_2.gif">

### Automated Prompts

During testing, you can specify prompts on the command-line that will run sequentially:

```bash
./run.sh $(./autotag local_llm) \
python3 -m local_llm.chat --api=mlc \
--model stabilityai/stablelm-zephyr-3b \
--max-new-tokens 512 \
--prompt 'hi, how are you?' \
--prompt 'whats the square root of 900?' \
--prompt 'can I get a recipie for french onion soup?'
```

You can also load JSON files containing prompt sequences, like with [`--prompt /data/prompts/qa.json`](https://github.com/dusty-nv/jetson-containers/blob/master/data/prompts/qa.json){:target="_blank"} (the output of which is below)

### Example Output

<iframe width="1325px" height="650px" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubhtml?gid=1801223941&amp;single=true&amp;widget=true&amp;headers=false"></iframe>

<sup>• &nbsp; The model responses are with 4-bit quantization and are truncated to 256 tokens for brevity.</sup>
<sup>• &nbsp; These chat questions are from [`/data/prompts/qa.json`](https://github.com/dusty-nv/jetson-containers/blob/master/data/prompts/qa.json){:target="_blank"} (found in jetson-containers)</sup>

3 changes: 2 additions & 1 deletion mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -80,6 +80,7 @@ nav:
- Text (LLM):
- text-generation-webui: tutorial_text-generation.md
- llamaspeak: tutorial_llamaspeak.md
- Small LMs (SLM) 🆕: tutorial_slm.md
- Text + Vision (VLM):
- Mini-GPT4: tutorial_minigpt4.md
- LLaVA: tutorial_llava.md
Expand All @@ -99,7 +100,7 @@ nav:
- NanoDB: tutorial_nanodb.md
- Audio:
- AudioCraft: tutorial_audiocraft.md
- Whisper 🆕: tutorial_whisper.md
- Whisper: tutorial_whisper.md
# - Tools:
# - LangChain: tutorial_distillation.md
- Tips:
Expand Down

0 comments on commit 2a85813

Please sign in to comment.