diff --git a/docs/benchmarks.md b/docs/benchmarks.md index 2a4aba0b..765a2806 100644 --- a/docs/benchmarks.md +++ b/docs/benchmarks.md @@ -5,36 +5,48 @@ hide: # Benchmarks -## LLM +## Large Language Models (LLM) - + For running LLM benchmarks, see the [`MLC`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/mlc) container documentation. -## VLM +## Small Language Models (SLM) - + -For running VLM benchmarks, see the [`MLC`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/mlc) and [`MiniGPT-4`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/minigpt4) container documentation. +Small language models are generally defined as having fewer than 7B parameters *(Llama-7B shown for reference)*. +For more data and info about running these models, see the [`SLM`](tutorial_slm.md) tutorial and [`MLC`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/mlc) container documentation. -## ViT +## Vision Language Models (VLM) - + + +This measures the end-to-end pipeline performance for continuous streaming with [Live Llava](tutorial_live-llava.md). + +> •   These are all using [`CLIP ViT-L/14@336px`](https://huggingface.co/openai/clip-vit-large-patch14-336) for the vision encoder. +> •   Jetson Orin Nano 8GB runs out of memory trying to run Llava-13B. 
+> •   The tokens/sec performance is roughly equal to the base LM ([`StableLM-3B`](https://huggingface.co/stabilityai/stablelm-3b-4e1t) for [`Obsidian`](https://huggingface.co/NousResearch/Obsidian-3B-V0.5), Llama for Llava) + + +## Vision Transformers (ViT) + + VIT performance data from [[1]](https://github.com/mit-han-lab/efficientvit#imagenet) [[2]](https://github.com/NVIDIA-AI-IOT/nanoowl#performance) [[3]](https://github.com/NVIDIA-AI-IOT/nanosam#performance) ## Stable Diffusion - + ## Riva - + For running Riva benchmarks, see [ASR Performance](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-performance.html) and [TTS Performance](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-performance.html). ## Vector Database - + For running vector database benchmarks, see the [`NanoDB`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/vectordb/nanodb) container documentation. \ No newline at end of file diff --git a/docs/images/slm_console.gif b/docs/images/slm_console.gif new file mode 100644 index 00000000..db3bb2bb Binary files /dev/null and b/docs/images/slm_console.gif differ diff --git a/docs/images/slm_console_2.gif b/docs/images/slm_console_2.gif new file mode 100644 index 00000000..655023d4 Binary files /dev/null and b/docs/images/slm_console_2.gif differ diff --git a/docs/overrides/home.html b/docs/overrides/home.html index dde69f6b..aa65a4ee 100644 --- a/docs/overrides/home.html +++ b/docs/overrides/home.html @@ -548,17 +548,17 @@

Benchmarks

- +
- +
- +
diff --git a/docs/tutorial_slm.md b/docs/tutorial_slm.md new file mode 100644 index 00000000..713ebc82 --- /dev/null +++ b/docs/tutorial_slm.md @@ -0,0 +1,90 @@ +# Tutorial - Small Language Models (SLM) + +Small Language Models (SLMs) represent a growing class of language models that have <7B parameters - for example [StableLM](https://stability.ai/news/stable-lm-3b-sustainable-high-performance-language-models-smart-devices){:target="_blank"}, [Phi-2](https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/){:target="_blank"}, and [Gemma-2B](https://blog.google/technology/developers/gemma-open-models/){:target="_blank"}. Their smaller memory footprint and faster performance make them good candidates for deploying on Jetson Orin Nano. Having been trained on high-quality curated datasets, some are very capable, with abilities at a level similar to the larger models. + + + +This tutorial shows how to run optimized SLMs with quantization using the [`local_llm`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm){:target="_blank"} container and MLC/TVM backend. You can run these models through tools like [`text-generation-webui`](./tutorial_text-generation.md){:target="_blank"} and llama.cpp as well, just not as fast. Since the focus of SLMs is reduced computational and memory requirements, here we'll use the most optimized path available. The models shown below have been profiled. + +## SLM Benchmarks + + + + + +> •   The HuggingFace [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard){:target="_blank"} is a collection of multitask benchmarks including reasoning & comprehension, math, coding, history, geography, etc. 
+> •   The model's memory footprint includes 4-bit weights and KV cache at full context length (factor in extra for process overhead, library code, etc.) +> •   The `Chat Model` is the instruction-tuned variant used for chatting in the commands below, as opposed to the base completion model. + +Based on user interactions, the recommended models to try are [`stabilityai/stablelm-zephyr-3b`](https://huggingface.co/stabilityai/stablelm-zephyr-3b){:target="_blank"} and [`princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT`](https://huggingface.co/princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT){:target="_blank"}, as they have output quality on par with Llama-2-7B and well-optimized neural architectures. These models have also been used as the base for various fine-tunes (for example [`Nous-Capybara-3B-V1.9`](https://huggingface.co/NousResearch/Nous-Capybara-3B-V1.9){:target="_blank"}) and mini VLMs. Others may not be particularly coherent. + +## Chatting with SLMs + +!!! abstract "What you need" + + 1. One of the following Jetson devices: + + Jetson AGX Orin (64GB) + Jetson AGX Orin (32GB) + Jetson Orin NX (16GB) + Jetson Orin Nano (8GB) + + 2. Running one of the following versions of [JetPack](https://developer.nvidia.com/embedded/jetpack){:target="_blank"}: + + JetPack 6 (L4T r36.x) + + 3. Sufficient storage space (preferably with NVMe SSD). + + - `22GB` for `local_llm` container image + - Space for models (`>5GB`) + + 4. Clone and setup [`jetson-containers`](https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md){:target="_blank"}: + + ```bash + git clone https://github.com/dusty-nv/jetson-containers + cd jetson-containers + sudo apt update; sudo apt install -y python3-pip + pip3 install -r requirements.txt + ``` + + 5. 
If you have previously used the [`local_llm`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm){:target="_blank"} container, update it first: + + - `sudo docker pull $(./autotag local_llm)` +The [`local_llm.chat`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm#text-chat){:target="_blank"} program will automatically download and quantize models from HuggingFace like those listed in the table above: + +```bash +./run.sh $(./autotag local_llm) \ + python3 -m local_llm.chat --api=mlc \ + --model princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT +``` +> •   For models requiring authentication, use `--env HUGGINGFACE_TOKEN=<YOUR-TOKEN>` +> •   Press Ctrl+C twice in succession to exit (pressing it once will interrupt the bot's output) + +This will enter an interactive mode where you chat back and forth using the keyboard (entering `reset` will clear the chat history). + + + +### Automated Prompts + +During testing, you can specify prompts on the command line that will run sequentially: + +```bash +./run.sh $(./autotag local_llm) \ + python3 -m local_llm.chat --api=mlc \ + --model stabilityai/stablelm-zephyr-3b \ + --max-new-tokens 512 \ + --prompt 'hi, how are you?' \ + --prompt "what's the square root of 900?" \ + --prompt 'can I get a recipe for french onion soup?' +``` + +You can also load JSON files containing prompt sequences, like with [`--prompt /data/prompts/qa.json`](https://github.com/dusty-nv/jetson-containers/blob/master/data/prompts/qa.json){:target="_blank"} (the output of which is shown below). + +### Example Output + + + +•   The model responses are generated with 4-bit quantization and truncated to 256 tokens for brevity. 
+•   These chat questions are from [`/data/prompts/qa.json`](https://github.com/dusty-nv/jetson-containers/blob/master/data/prompts/qa.json){:target="_blank"} (found in jetson-containers) + diff --git a/mkdocs.yml b/mkdocs.yml index eabf4dc6..2bdea220 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -80,6 +80,7 @@ nav: - Text (LLM): - text-generation-webui: tutorial_text-generation.md - llamaspeak: tutorial_llamaspeak.md + - Small LMs (SLM) 🆕: tutorial_slm.md - Text + Vision (VLM): - Mini-GPT4: tutorial_minigpt4.md - LLaVA: tutorial_llava.md @@ -99,7 +100,7 @@ nav: - NanoDB: tutorial_nanodb.md - Audio: - AudioCraft: tutorial_audiocraft.md - - Whisper 🆕: tutorial_whisper.md + - Whisper: tutorial_whisper.md # - Tools: # - LangChain: tutorial_distillation.md - Tips:
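The tutorial's note that a model's memory footprint is its 4-bit weights plus the KV cache at full context length can be sanity-checked with a quick back-of-envelope estimate. Below is a minimal sketch, assuming a generic decoder-only architecture with an fp16 KV cache; the layer count, KV-head count, and head dimension are hypothetical illustration values, not numbers from the benchmark table.

```python
# Back-of-envelope SLM memory estimate: 4-bit weights + fp16 KV cache
# at full context length, per the memory-footprint note in tutorial_slm.md.
# The model dimensions used below are assumptions, not measured values.

def slm_memory_gib(n_params, n_layers, n_kv_heads, head_dim, context_len):
    """Estimate resident model memory in GiB (excludes process overhead)."""
    weight_bytes = n_params * 0.5  # 4-bit quantization = 0.5 bytes/param
    # K and V tensors: 2 * layers * kv_heads * head_dim * context, fp16 = 2 bytes
    kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * 2
    return (weight_bytes + kv_cache_bytes) / 1024**3

# Hypothetical ~3B-parameter model with a 4096-token context window:
print(f"{slm_memory_gib(3e9, 32, 32, 80, 4096):.2f} GiB")  # ≈ 2.65 GiB
```

Plugging in a model's actual layer count, KV-head count, head dimension, and context length gives a rough lower bound; as the note above says, factor in extra for process overhead and library code on top of this.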