diff --git a/docs/benchmarks.md b/docs/benchmarks.md
index 2a4aba0b..765a2806 100644
--- a/docs/benchmarks.md
+++ b/docs/benchmarks.md
@@ -5,36 +5,48 @@ hide:
# Benchmarks
-## LLM
+## Large Language Models (LLM)
-
+
For running LLM benchmarks, see the [`MLC`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/mlc) container documentation.
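+
+A benchmark run inside that container uses the standard jetson-containers launcher *(a rough sketch only; the `benchmark.py` script name, model name, and flags here are assumptions, so check the MLC container README for the exact invocation)*:
+
+```bash
+./run.sh $(./autotag mlc) \
+ python3 benchmark.py \
+ --model Llama-2-7b-chat-hf-q4f16_1 \
+ --max-new-tokens 128
+```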
-## VLM
+## Small Language Models (SLM)
-
+
-For running VLM benchmarks, see the [`MLC`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/mlc) and [`MiniGPT-4`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/minigpt4) container documentation.
+Small language models are generally defined as having fewer than 7B parameters *(Llama-7B shown for reference)*.
+For more data and info about running these models, see the [`SLM`](tutorial_slm.md) tutorial and [`MLC`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/mlc) container documentation.
-## ViT
+## Vision Language Models (VLM)
-
+
+
+This measures the end-to-end pipeline performance for continuous streaming with [Live Llava](tutorial_live-llava.md).
+
+> • These are all using [`CLIP ViT-L/14@336px`](https://huggingface.co/openai/clip-vit-large-patch14-336) for the vision encoder.
+> • Jetson Orin Nano 8GB runs out of memory trying to run Llava-13B.
+> • The tokens/sec performance is roughly equal to that of the base LM ([`StableLM-3B`](https://huggingface.co/stabilityai/stablelm-3b-4e1t) for [`Obsidian`](https://huggingface.co/NousResearch/Obsidian-3B-V0.5), Llama for Llava).
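+
+The streaming pipeline these numbers come from is launched roughly like this *(a sketch based on the [Live Llava](tutorial_live-llava.md) tutorial; the model and flags shown are assumptions that may have changed, so follow that tutorial for the current invocation)*:
+
+```bash
+./run.sh $(./autotag local_llm) \
+ python3 -m local_llm.agents.video_query --api=mlc \
+ --model liuhaotian/llava-v1.5-7b \
+ --video-input /dev/video0 \
+ --video-output webrtc://@:8554/output
+```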
+
+
+## Vision Transformers (ViT)
+
+
-VIT performance data from [[1]](https://github.com/mit-han-lab/efficientvit#imagenet) [[2]](https://github.com/NVIDIA-AI-IOT/nanoowl#performance) [[3]](https://github.com/NVIDIA-AI-IOT/nanosam#performance)
+ViT performance data from [[1]](https://github.com/mit-han-lab/efficientvit#imagenet) [[2]](https://github.com/NVIDIA-AI-IOT/nanoowl#performance) [[3]](https://github.com/NVIDIA-AI-IOT/nanosam#performance)
## Stable Diffusion
-
+
## Riva
-
+
For running Riva benchmarks, see [ASR Performance](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/asr/asr-performance.html) and [TTS Performance](https://docs.nvidia.com/deeplearning/riva/user-guide/docs/tts/tts-performance.html).
## Vector Database
-
+
For running vector database benchmarks, see the [`NanoDB`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/vectordb/nanodb) container documentation.
\ No newline at end of file
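+
+For orientation, indexing and serving a dataset with NanoDB follows the same container pattern *(a rough sketch; the dataset path and flags are assumptions here, so check the NanoDB docs for the exact commands)*:
+
+```bash
+./run.sh $(./autotag nanodb) \
+ python3 -m nanodb \
+ --path /data/nanodb/coco/2017 \
+ --server --port=7860
+```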
diff --git a/docs/images/slm_console.gif b/docs/images/slm_console.gif
new file mode 100644
index 00000000..db3bb2bb
Binary files /dev/null and b/docs/images/slm_console.gif differ
diff --git a/docs/images/slm_console_2.gif b/docs/images/slm_console_2.gif
new file mode 100644
index 00000000..655023d4
Binary files /dev/null and b/docs/images/slm_console_2.gif differ
diff --git a/docs/overrides/home.html b/docs/overrides/home.html
index dde69f6b..aa65a4ee 100644
--- a/docs/overrides/home.html
+++ b/docs/overrides/home.html
@@ -548,17 +548,17 @@
diff --git a/docs/tutorial_slm.md b/docs/tutorial_slm.md
new file mode 100644
index 00000000..713ebc82
--- /dev/null
+++ b/docs/tutorial_slm.md
@@ -0,0 +1,90 @@
+# Tutorial - Small Language Models (SLM)
+
+Small Language Models (SLMs) represent a growing class of language models that have <7B parameters, for example [StableLM](https://stability.ai/news/stable-lm-3b-sustainable-high-performance-language-models-smart-devices){:target="_blank"}, [Phi-2](https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/){:target="_blank"}, and [Gemma-2B](https://blog.google/technology/developers/gemma-open-models/){:target="_blank"}. Their smaller memory footprint and faster performance make them good candidates for deploying on Jetson Orin Nano. Having been trained on high-quality curated datasets, some are remarkably capable, with abilities approaching those of much larger models.
+
+
+
+This tutorial shows how to run optimized SLMs with quantization using the [`local_llm`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm){:target="_blank"} container and MLC/TVM backend. You can also run these models through tools like [`text-generation-webui`](./tutorial_text-generation.md){:target="_blank"} and llama.cpp, just not as fast. Since the focus of SLMs is reduced computational and memory requirements, here we'll use the most optimized path available; the models shown below have been profiled with it.
+
+## SLM Benchmarks
+
+
+
+
+
+> • The HuggingFace [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard){:target="_blank"} is a collection of multitask benchmarks including reasoning & comprehension, math, coding, history, geography, etc.
+> • The model's memory footprint includes 4-bit weights and KV cache at full context length (factor in extra for process overhead, library code, etc.)
+> • The `Chat Model` is the instruction-tuned variant used for chatting in the commands below, as opposed to the base completion model.
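+> • As a rough worked example of that footprint math *(assuming 4-bit weights ≈ 0.5 bytes per parameter)*: a 3B-parameter model needs about 1.5GB for weights alone, with the remainder of its reported footprint going to KV cache and overhead.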
+
+Based on user interactions, the recommended models to try are [`stabilityai/stablelm-zephyr-3b`](https://huggingface.co/stabilityai/stablelm-zephyr-3b){:target="_blank"} and [`princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT`](https://huggingface.co/princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT){:target="_blank"}, as they have output quality on par with Llama-2-7B and well-optimized neural architectures. These models have also been used as the base for various fine-tunes (for example [`Nous-Capybara-3B-V1.9`](https://huggingface.co/NousResearch/Nous-Capybara-3B-V1.9){:target="_blank"}) and mini VLMs. Others may not be particularly coherent.
+
+## Chatting with SLMs
+
+!!! abstract "What you need"
+
+ 1. One of the following Jetson devices:
+
+        - Jetson AGX Orin (64GB)
+        - Jetson AGX Orin (32GB)
+        - Jetson Orin NX (16GB)
+        - Jetson Orin Nano (8GB)
+
+ 2. Running one of the following versions of [JetPack](https://developer.nvidia.com/embedded/jetpack){:target="_blank"}:
+
+        - JetPack 6 (L4T r36.x)
+
+ 3. Sufficient storage space (preferably with NVMe SSD).
+
+ - `22GB` for `local_llm` container image
+ - Space for models (`>5GB`)
+
+ 4. Clone and setup [`jetson-containers`](https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md){:target="_blank"}:
+
+ ```bash
+ git clone https://github.com/dusty-nv/jetson-containers
+ cd jetson-containers
+ sudo apt update; sudo apt install -y python3-pip
+ pip3 install -r requirements.txt
+ ```
+
+ 5. If you have previously used the [`local_llm`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm){:target="_blank"} container, update it first:
+
+ - `sudo docker pull $(./autotag local_llm)`
+
+The [`local_llm.chat`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm#text-chat){:target="_blank"} program will automatically download and quantize models from HuggingFace like those listed in the table above:
+
+```bash
+./run.sh $(./autotag local_llm) \
+ python3 -m local_llm.chat --api=mlc \
+ --model princeton-nlp/Sheared-LLaMA-2.7B-ShareGPT
+```
+> • For models requiring authentication, use `--env HUGGINGFACE_TOKEN=`
+> • Press Ctrl+C twice in succession to exit (once will interrupt bot output)
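+
+For instance, gated models like Llama-2 need your account's access token passed through to the container *(the token value below is a placeholder; substitute your own from huggingface.co)*:
+
+```bash
+./run.sh --env HUGGINGFACE_TOKEN=<YOUR_HF_TOKEN> \
+ $(./autotag local_llm) \
+ python3 -m local_llm.chat --api=mlc \
+ --model meta-llama/Llama-2-7b-chat-hf
+```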
+
+The chat program enters an interactive mode where you chat back and forth using the keyboard (entering `reset` will clear the chat history).
+
+
+
+### Automated Prompts
+
+During testing, you can specify prompts on the command-line that will run sequentially:
+
+```bash
+./run.sh $(./autotag local_llm) \
+ python3 -m local_llm.chat --api=mlc \
+ --model stabilityai/stablelm-zephyr-3b \
+ --max-new-tokens 512 \
+ --prompt 'hi, how are you?' \
+ --prompt 'whats the square root of 900?' \
+ --prompt 'can I get a recipe for french onion soup?'
+```
+
+You can also load JSON files containing prompt sequences, like with [`--prompt /data/prompts/qa.json`](https://github.com/dusty-nv/jetson-containers/blob/master/data/prompts/qa.json){:target="_blank"} (the output of which is below)
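+
+For example, to replay that question set against one of the recommended models *(assuming the default jetson-containers mount of its `data/` directory at `/data`)*:
+
+```bash
+./run.sh $(./autotag local_llm) \
+ python3 -m local_llm.chat --api=mlc \
+ --model stabilityai/stablelm-zephyr-3b \
+ --prompt /data/prompts/qa.json
+```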
+
+### Example Output
+
+
+
+> • The model responses are generated with 4-bit quantization, and are truncated to 256 tokens for brevity.
+> • These chat questions are from [`/data/prompts/qa.json`](https://github.com/dusty-nv/jetson-containers/blob/master/data/prompts/qa.json){:target="_blank"} (found in jetson-containers)
+
diff --git a/mkdocs.yml b/mkdocs.yml
index eabf4dc6..2bdea220 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -80,6 +80,7 @@ nav:
- Text (LLM):
- text-generation-webui: tutorial_text-generation.md
- llamaspeak: tutorial_llamaspeak.md
+ - Small LMs (SLM) 🆕: tutorial_slm.md
- Text + Vision (VLM):
- Mini-GPT4: tutorial_minigpt4.md
- LLaVA: tutorial_llava.md
@@ -99,7 +100,7 @@ nav:
- NanoDB: tutorial_nanodb.md
- Audio:
- AudioCraft: tutorial_audiocraft.md
- - Whisper 🆕: tutorial_whisper.md
+ - Whisper: tutorial_whisper.md
# - Tools:
# - LangChain: tutorial_distillation.md
- Tips: