# Tutorial - Small Language Models (SLM)

Small Language Models (SLMs) represent a growing class of language models that have <7B parameters - for example [StableLM](https://stability.ai/news/stable-lm-3b-sustainable-high-performance-language-models-smart-devices){:target="_blank"}, [Phi-2](https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/){:target="_blank"}, and [Gemma-2B](https://blog.google/technology/developers/gemma-open-models/){:target="_blank"}. Their smaller memory footprint and faster performance make them good candidates for deploying on Jetson Orin Nano. Having been trained on high-quality curated datasets, some are highly capable, with abilities approaching those of much larger models.

<img width="900px" src="images/slm_console.gif">

This tutorial shows how to run optimized SLMs with quantization using the [`local_llm`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm){:target="_blank"} container and MLC/TVM backend. You can also run these models through tools like [`text-generation-webui`](./tutorial_text-generation.md){:target="_blank"} and llama.cpp, just not as fast - and since the focus of SLMs is reduced computational and memory requirements, here we'll use the most optimized path available. The models shown below have all been profiled.

## SLM Benchmarks

<iframe width="916" height="507" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubchart?oid=1746097360&format=interactive"></iframe>

<iframe width="1325px" height="350px" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubhtml?gid=921468602&single=true&widget=true&headers=false"></iframe>

> <sup>• The HuggingFace [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard){:target="_blank"} is a collection of multitask benchmarks including reasoning & comprehension, math, coding, history, geography, etc.</sup>
> <sup>• The model's memory footprint includes 4-bit weights and the KV cache at full context length (factor in extra for process overhead, library code, etc.)</sup>
> <sup>• The `Chat Model` is the instruction-tuned variant for chatting with in the commands below, as opposed to the base completion model.</sup>
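
As a rough sanity check on those footprint numbers - illustrative figures only, assuming a ~3B-parameter model with 32 layers, a 2560 hidden size, and a 4096-token context (not values taken from the table):

```bash
# 4-bit weights: ~0.5 bytes per parameter
echo "weights:  $(( 3000000000 / 2 / 1000000 )) MB"   # ~1500 MB

# fp16 KV cache at full context: 2 (K and V) x layers x context x hidden x 2 bytes
echo "KV cache: $(( 2 * 32 * 4096 * 2560 * 2 / 1000000 )) MB"   # ~1342 MB
```

Per the note above, factor in process overhead and library code on top of that.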

Based on user interactions, the recommended models to try are [`stabilityai/stablelm-zephyr-3b`](https://huggingface.co/stabilityai/stablelm-zephyr-3b){:target="_blank"} and [`princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT`](https://huggingface.co/princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT){:target="_blank"}, for having output quality on par with Llama-2-7B and well-optimized neural architectures. These models have also been used as the base for various fine-tunes (for example [`Nous-Capybara-3B-V1.9`](https://huggingface.co/NousResearch/Nous-Capybara-3B-V1.9){:target="_blank"}) and mini VLMs. Others may not be particularly coherent.

## Chatting with SLMs

!!! abstract "What you need"

    1. One of the following Jetson devices:

        <span class="blobDarkGreen4">Jetson AGX Orin (64GB)</span>
        <span class="blobDarkGreen5">Jetson AGX Orin (32GB)</span>
        <span class="blobLightGreen3">Jetson Orin NX (16GB)</span>
        <span class="blobLightGreen4">Jetson Orin Nano (8GB)</span>

    2. Running one of the following versions of [JetPack](https://developer.nvidia.com/embedded/jetpack){:target="_blank"}:

        <span class="blobPink2">JetPack 6 (L4T r36.x)</span>

    3. Sufficient storage space (preferably with NVMe SSD).

        - `22GB` for `local_llm` container image
        - Space for models (`>5GB`)

    4. Clone and setup [`jetson-containers`](https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md){:target="_blank"}:

        ```bash
        git clone https://github.com/dusty-nv/jetson-containers
        cd jetson-containers
        sudo apt update; sudo apt install -y python3-pip
        pip3 install -r requirements.txt
        ```

    5. If you have previously used the [`local_llm`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm){:target="_blank"} container, update it first:

        - `sudo docker pull $(./autotag local_llm)`

The [`local_llm.chat`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm#text-chat){:target="_blank"} program will automatically download and quantize models from HuggingFace like those listed in the table above:

```bash
./run.sh $(./autotag local_llm) \
  python3 -m local_llm.chat --api=mlc \
    --model princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT
```

> <sup>• For models requiring authentication, use `--env HUGGINGFACE_TOKEN=<YOUR-ACCESS-TOKEN>`</sup>
> <sup>• Press <kbd>Ctrl+C</kbd> twice in succession to exit (once will interrupt bot output)</sup>
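
As a concrete sketch of where that flag goes - assuming `run.sh` forwards Docker options like `--env` through to `docker run`, and using [`google/gemma-2b-it`](https://huggingface.co/google/gemma-2b-it){:target="_blank"} as an example of a model whose HuggingFace repo requires accepting a license first:

```bash
# pass a HuggingFace access token into the container for gated models
./run.sh --env HUGGINGFACE_TOKEN=<YOUR-ACCESS-TOKEN> $(./autotag local_llm) \
  python3 -m local_llm.chat --api=mlc \
    --model google/gemma-2b-it
```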

This will drop you into an interactive mode where you chat back and forth using the keyboard (entering `reset` will clear the chat history).

<img width="900px" src="images/slm_console_2.gif">

### Automated Prompts

During testing, you can specify prompts on the command-line that will run sequentially:

```bash
./run.sh $(./autotag local_llm) \
  python3 -m local_llm.chat --api=mlc \
    --model stabilityai/stablelm-zephyr-3b \
    --max-new-tokens 512 \
    --prompt 'hi, how are you?' \
    --prompt "what's the square root of 900?" \
    --prompt 'can I get a recipe for french onion soup?'
```

You can also load JSON files containing prompt sequences, like with [`--prompt /data/prompts/qa.json`](https://github.com/dusty-nv/jetson-containers/blob/master/data/prompts/qa.json){:target="_blank"} (the output of which is shown below).
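
To build your own prompt set, here's a minimal sketch - it assumes the same simple format as `qa.json` (a JSON list of prompt strings), and the filename `my_prompts.json` is just a hypothetical example; `jetson-containers/data` is mounted inside the container at `/data`:

```bash
# write a prompt-sequence file (assumed format: a JSON list of strings)
cat > data/prompts/my_prompts.json <<'EOF'
[
    "what is the capital of France?",
    "how many legs does a spider have?"
]
EOF

# run it the same way as /data/prompts/qa.json
./run.sh $(./autotag local_llm) \
  python3 -m local_llm.chat --api=mlc \
    --model stabilityai/stablelm-zephyr-3b \
    --prompt /data/prompts/my_prompts.json
```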

### Example Output

<iframe width="1325px" height="650px" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubhtml?gid=1801223941&single=true&widget=true&headers=false"></iframe>

<sup>• The model responses were generated with 4-bit quantization, and are truncated to 256 tokens for brevity.</sup>
<sup>• These chat questions are from [`/data/prompts/qa.json`](https://github.com/dusty-nv/jetson-containers/blob/master/data/prompts/qa.json){:target="_blank"} (found in jetson-containers)</sup>