Merge pull request #85 from dusty-nv/20240227-slm
20240303 NanoVLM
dusty-nv authored Mar 3, 2024
2 parents 2a85813 + 3e4ae84 commit 0c1958c
Showing 18 changed files with 275 additions and 236 deletions.
8 changes: 2 additions & 6 deletions docs/benchmarks.md
@@ -22,12 +22,8 @@ For more data and info about running these models, see the [`SLM`](tutorial_slm.

<iframe width="600" height="371" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubchart?oid=642317430&format=interactive"></iframe>

This measures the end-to-end pipeline performance for continuous streaming with [Live Llava](tutorial_live-llava.md).

> <sup>• &nbsp; These are all using [`CLIP ViT-L/14@336px`](https://huggingface.co/openai/clip-vit-large-patch14-336) for the vision encoder.</sup>
> <sup>• &nbsp; Jetson Orin Nano 8GB runs out of memory trying to run Llava-13B.</sup>
> <sup>• &nbsp; The tokens/sec performance is roughly equal to the base LM ([`StableLM-3B`](https://huggingface.co/stabilityai/stablelm-3b-4e1t) for [`Obsidian`](https://huggingface.co/NousResearch/Obsidian-3B-V0.5), Llama for Llava)</sup>
This measures the end-to-end pipeline performance for continuous streaming, such as with [Live Llava](tutorial_live-llava.md).
For more data and info about running these models, see the [`NanoVLM`](tutorial_nano-vlm.md) tutorial and [`local_llm`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm) documentation.

## Vision Transformers (ViT)

15 changes: 8 additions & 7 deletions docs/tutorial_audiocraft.md
@@ -19,14 +19,15 @@ Let's run Meta's [AudioCraft](https://github.com/facebookresearch/audiocraft), t
- `10.7 GB` for `audiocraft` container image
- Space for checkpoints

## Clone and set up `jetson-containers`
4. Clone and set up [`jetson-containers`](https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md){:target="_blank"}:

```bash
git clone https://github.com/dusty-nv/jetson-containers
cd jetson-containers
sudo apt update; sudo apt install -y python3-pip
pip3 install -r requirements.txt
```

```
git clone https://github.com/dusty-nv/jetson-containers
cd jetson-containers
sudo apt update; sudo apt install -y python3-pip
pip3 install -r requirements.txt
```
## How to start

Use the `run.sh` and `autotag` scripts to automatically pull or build a compatible container image.
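For example, a minimal sketch of that pattern for this tutorial's `audiocraft` container, run from the root of the `jetson-containers` checkout:

```bash
# autotag resolves a container image compatible with your JetPack/L4T version
# (pulling or building one if needed), and run.sh launches it with the
# Docker flags needed for GPU access and mounted volumes
./run.sh $(./autotag audiocraft)
```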
30 changes: 14 additions & 16 deletions docs/tutorial_live-llava.md
@@ -2,22 +2,14 @@

!!! abstract "Recommended"

Follow the chat-based [LLaVA tutorial](tutorial_llava.md) first and see the [`local_llm`](https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/local_llm) documentation to familiarize yourself with VLMs and make sure the models are working.
Follow the chat-based [LLaVA](tutorial_llava.md) and [NanoVLM](tutorial_nano-vlm.md) tutorials and see the [`local_llm`](https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/local_llm) documentation to familiarize yourself with VLMs and test the models first.

This multimodal agent runs a vision-language model on a live camera feed or video stream, repeatedly applying the same prompts to it:

<a href="https://youtu.be/X-OXxPiUTuU" target="_blank"><img src="https://raw.githubusercontent.com/dusty-nv/jetson-containers/docs/docs/images/live_llava.gif"></a>

This example uses the popular [LLaVA](https://llava-vl.github.io/) model (based on Llama and [CLIP](https://openai.com/research/clip)), quantized with 4-bit precision for deployment on Jetson Orin. It uses an optimized multimodal pipeline from the [`local_llm`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm) package and the MLC/TVM inferencing runtime, and serves as a building block for creating always-on edge applications that can trigger user-promptable alerts and actions with the flexibility of VLMs.

### Clone and set up `jetson-containers`

```
git clone https://github.com/dusty-nv/jetson-containers
cd jetson-containers
sudo apt update; sudo apt install -y python3-pip
pip3 install -r requirements.txt
```
## Running the Live Llava Demo

!!! abstract "What you need"
@@ -27,21 +19,26 @@ <span class="blobDarkGreen4">Jetson AGX Orin (64GB)</span>
<span class="blobDarkGreen4">Jetson AGX Orin (64GB)</span>
<span class="blobDarkGreen5">Jetson AGX Orin (32GB)</span>
<span class="blobLightGreen3">Jetson Orin NX (16GB)</span>
<span class="blobLightGreen4">Jetson Orin Nano (8GB)</span><span title="Orin Nano 8GB can run Llava-7b, VILA-7b, and Obsidian-3B">⚠️</span>
2. Running one of the following versions of [JetPack](https://developer.nvidia.com/embedded/jetpack):

<span class="blobPink1">JetPack 5 (L4T r35.x)</span>
<span class="blobPink2">JetPack 6 (L4T r36.x)</span>

3. Sufficient storage space (preferably with NVMe SSD).

- `25GB` for `local_llm` container image
- Space for models
- CLIP model : `1.7GB`
- llava-1.5-7b model : `10.5GB`
- `22GB` for `local_llm` container image
- Space for models (`>10GB`)
4. Follow the chat-based [LLaVA tutorial](tutorial_llava.md) first and see the [`local_llm`](https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/local_llm) documentation.
4. Follow the chat-based [LLaVA](tutorial_llava.md) and [NanoVLM](tutorial_nano-vlm.md) tutorials first and see the [`local_llm`](https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/local_llm) documentation.

5. Supported VLM models in [`local_llm`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm#text-chat):

- [`liuhaotian/llava-v1.5-7b`](https://huggingface.co/liuhaotian/llava-v1.5-7b), [`liuhaotian/llava-v1.5-13b`](https://huggingface.co/liuhaotian/llava-v1.5-13b), [`liuhaotian/llava-v1.6-vicuna-7b`](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b), [`liuhaotian/llava-v1.6-vicuna-13b`](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-13b)
- [`Efficient-Large-Model/VILA-7b`](https://huggingface.co/Efficient-Large-Model/VILA-7b), [`Efficient-Large-Model/VILA-13b`](https://huggingface.co/Efficient-Large-Model/VILA-13b)
- [`NousResearch/Obsidian-3B-V0.5`](https://huggingface.co/NousResearch/Obsidian-3B-V0.5)
- [`Llava-7b`](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b), [`VILA-7b`](https://huggingface.co/Efficient-Large-Model/VILA-7b), and [`Obsidian-3B`](https://huggingface.co/NousResearch/Obsidian-3B-V0.5) can be run on Orin Nano 8GB.
The [`VideoQuery`](https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/local_llm/agents/video_query.py) agent applies prompts to an incoming camera or video feed in a closed loop with Llava.

@@ -50,12 +47,13 @@
```bash
$(./autotag local_llm) \
python3 -m local_llm.agents.video_query --api=mlc --verbose \
--model liuhaotian/llava-v1.5-7b \
--max-context-len 768 \
--max-new-tokens 32 \
--video-input /dev/video0 \
--video-output webrtc://@:8554/output \
--prompt "How many fingers am I holding up?"
```
> refer to [Enabling HTTPS/SSL](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm#enabling-httpsssl) to generate self-signed SSL certificates for enabling client-side browser webcams.
> <small>Refer to [Enabling HTTPS/SSL](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm#enabling-httpsssl) to generate self-signed SSL certificates for enabling client-side browser webcams.</small>
This uses [`jetson_utils`](https://github.com/dusty-nv/jetson-utils) for video I/O; for options related to protocols and file formats, see [Camera Streaming and Multimedia](https://github.com/dusty-nv/jetson-inference/blob/master/docs/aux-streaming.md). In the example above, it captures a V4L2 USB webcam connected to the Jetson (`/dev/video0`) and outputs a WebRTC stream that can be viewed from a browser at `https://HOSTNAME:8554`. When HTTPS/SSL is enabled, it can also capture from the browser's webcam.
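As a sketch of those I/O options (the RTSP URL and output path below are hypothetical placeholders, not values from this tutorial), the same agent can read a network camera stream and record its output to a file:

```bash
# inside the local_llm container: query an RTSP network camera instead of a
# USB webcam, and save the annotated output as a video file
# (the stream URL and output path are example placeholders)
python3 -m local_llm.agents.video_query --api=mlc \
    --model liuhaotian/llava-v1.5-7b \
    --video-input rtsp://192.168.1.2:8554/camera \
    --video-output /data/videos/output.mp4 \
    --prompt "Describe what you see."
```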

2 changes: 1 addition & 1 deletion docs/tutorial_llamaspeak.md
@@ -7,7 +7,7 @@ Talk live with Llama using Riva ASR/TTS, and chat about images with Llava!
* [`llamaspeak:v1`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/llamaspeak) - uses text-generation-webui loaders for LLM models (llama.cpp, exllama, AutoGPTQ, Transformers)
* [`llamaspeak:v2`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm) - uses AWQ/MLC from [`local_llm`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm) package, web chat voice agent

llamaspeak v2 has multimodal support for chatting about images with quantized Llava-1.5:
llamaspeak v2 has multimodal support for chatting about images with quantized vision-language models:

<a href="https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm#local_llm" target="_blank"><img src="https://raw.githubusercontent.com/dusty-nv/jetson-containers/docs/docs/images/llamaspeak_llava_clip.gif"></a>
> [Multimodal Voice Chat with LLaVA-1.5 13B on NVIDIA Jetson AGX Orin](https://www.youtube.com/watch?v=9ObzbbBTbcc) (container: [`local_llm`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm))
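As a rough sketch of launching the v2 voice agent (the `web_chat` module name, SSL paths, and model choice here are assumptions rather than values from this page; see the [`local_llm`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm) documentation for the exact invocation and the Riva server it expects to be running):

```bash
# start the web-based voice chat agent with a multimodal Llava model;
# self-signed SSL certs are needed for browser microphone/webcam access
./run.sh --env SSL_KEY=/data/key.pem --env SSL_CERT=/data/cert.pem \
  $(./autotag local_llm) \
  python3 -m local_llm.agents.web_chat --api=mlc --verbose \
    --model liuhaotian/llava-v1.5-13b
```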