Merge pull request #85 from dusty-nv/20240227-slm
20240303 NanoVLM
dusty-nv authored Mar 3, 2024
2 parents 2a85813 + 3e4ae84 commit 0c1958c
Showing 18 changed files with 275 additions and 236 deletions.
8 changes: 2 additions & 6 deletions docs/benchmarks.md
@@ -22,12 +22,8 @@ For more data and info about running these models, see the [`SLM`](tutorial_slm.

<iframe width="600" height="371" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubchart?oid=642317430&format=interactive"></iframe>

This measures the end-to-end pipeline performance for continuous streaming with [Live Llava](tutorial_live-llava.md).

> <sup>• &nbsp; These are all using [`CLIP ViT-L/14@336px`](https://huggingface.co/openai/clip-vit-large-patch14-336) for the vision encoder.</sup>
> <sup>• &nbsp; Jetson Orin Nano 8GB runs out of memory trying to run Llava-13B.</sup>
> <sup>• &nbsp; The tokens/sec performance is roughly equal to the base LM ([`StableLM-3B`](https://huggingface.co/stabilityai/stablelm-3b-4e1t) for [`Obsidian`](https://huggingface.co/NousResearch/Obsidian-3B-V0.5), Llama for Llava)</sup>
This measures the end-to-end pipeline performance for continuous streaming, such as with [Live Llava](tutorial_live-llava.md).
For more data and info about running these models, see the [`NanoVLM`](tutorial_nano-vlm.md) tutorial and [`local_llm`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm) documentation.

## Vision Transformers (ViT)

15 changes: 8 additions & 7 deletions docs/tutorial_audiocraft.md
@@ -19,14 +19,15 @@ Let's run Meta's [AudioCraft](https://github.com/facebookresearch/audiocraft), t
- `10.7 GB` for `audiocraft` container image
- Space for checkpoints

## Clone and set up `jetson-containers`
4. Clone and set up [`jetson-containers`](https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md){:target="_blank"}:

```bash
git clone https://github.com/dusty-nv/jetson-containers
cd jetson-containers
sudo apt update; sudo apt install -y python3-pip
pip3 install -r requirements.txt
```

```
git clone https://github.com/dusty-nv/jetson-containers
cd jetson-containers
sudo apt update; sudo apt install -y python3-pip
pip3 install -r requirements.txt
```
## How to start

Use the `run.sh` and `autotag` scripts to automatically pull or build a compatible container image.
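For example, a minimal sketch of that pattern for this tutorial's `audiocraft` container, run from the root of the `jetson-containers` checkout:

```bash
# autotag resolves a container image compatible with your JetPack/L4T version
# (pulling or building one if needed), and run.sh launches it with the
# Docker flags needed for GPU access and mounted volumes
./run.sh $(./autotag audiocraft)
```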
30 changes: 14 additions & 16 deletions docs/tutorial_live-llava.md
@@ -2,22 +2,14 @@

!!! abstract "Recommended"

Follow the chat-based [LLaVA tutorial](tutorial_llava.md) first and see the [`local_llm`](https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/local_llm) documentation to familiarize yourself with VLMs and make sure the models are working.
Follow the chat-based [LLaVA](tutorial_llava.md) and [NanoVLM](tutorial_nano-vlm.md) tutorials and see the [`local_llm`](https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/local_llm) documentation to familiarize yourself with VLMs and test the models first.

This multimodal agent runs a vision-language model on a live camera feed or video stream, repeatedly applying the same prompts to it:

<a href="https://youtu.be/X-OXxPiUTuU" target="_blank"><img src="https://raw.githubusercontent.com/dusty-nv/jetson-containers/docs/docs/images/live_llava.gif"></a>

This example uses the popular [LLaVA](https://llava-vl.github.io/) model (based on Llama and [CLIP](https://openai.com/research/clip)), quantized with 4-bit precision for deployment on Jetson Orin. It uses an optimized multimodal pipeline from the [`local_llm`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm) package and the MLC/TVM inferencing runtime, and serves as a building block for creating always-on edge applications that can trigger user-promptable alerts and actions with the flexibility of VLMs.

### Clone and set up `jetson-containers`

```
git clone https://github.com/dusty-nv/jetson-containers
cd jetson-containers
sudo apt update; sudo apt install -y python3-pip
pip3 install -r requirements.txt
```
## Running the Live Llava Demo

!!! abstract "What you need"
@@ -27,21 +19,26 @@ <span class="blobDarkGreen4">Jetson AGX Orin (64GB)</span>
<span class="blobDarkGreen4">Jetson AGX Orin (64GB)</span>
<span class="blobDarkGreen5">Jetson AGX Orin (32GB)</span>
<span class="blobLightGreen3">Jetson Orin NX (16GB)</span>
<span class="blobLightGreen4">Jetson Orin Nano (8GB)</span><span title="Orin Nano 8GB can run Llava-7b, VILA-7b, and Obsidian-3B">⚠️</span>
2. Running one of the following versions of [JetPack](https://developer.nvidia.com/embedded/jetpack):

<span class="blobPink1">JetPack 5 (L4T r35.x)</span>
<span class="blobPink2">JetPack 6 (L4T r36.x)</span>

3. Sufficient storage space (preferably with NVMe SSD).

- `25GB` for `local_llm` container image
- Space for models
- CLIP model : `1.7GB`
- llava-1.5-7b model : `10.5GB`
- `22GB` for `local_llm` container image
- Space for models (`>10GB`)
4. Follow the chat-based [LLaVA tutorial](tutorial_llava.md) first and see the [`local_llm`](https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/local_llm) documentation.
4. Follow the chat-based [LLaVA](tutorial_llava.md) and [NanoVLM](tutorial_nano-vlm.md) tutorials first and see the [`local_llm`](https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/local_llm) documentation.

5. Supported VLM models in [`local_llm`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm#text-chat):

- [`liuhaotian/llava-v1.5-7b`](https://huggingface.co/liuhaotian/llava-v1.5-7b), [`liuhaotian/llava-v1.5-13b`](https://huggingface.co/liuhaotian/llava-v1.5-13b), [`liuhaotian/llava-v1.6-vicuna-7b`](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b), [`liuhaotian/llava-v1.6-vicuna-13b`](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-13b)
- [`Efficient-Large-Model/VILA-7b`](https://huggingface.co/Efficient-Large-Model/VILA-7b), [`Efficient-Large-Model/VILA-13b`](https://huggingface.co/Efficient-Large-Model/VILA-13b)
- [`NousResearch/Obsidian-3B-V0.5`](https://huggingface.co/NousResearch/Obsidian-3B-V0.5)
- [`Llava-7b`](https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b), [`VILA-7b`](https://huggingface.co/Efficient-Large-Model/VILA-7b), and [`Obsidian-3B`](https://huggingface.co/NousResearch/Obsidian-3B-V0.5) can be run on Orin Nano 8GB.
The [`VideoQuery`](https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/local_llm/agents/video_query.py) agent applies prompts to an incoming camera or video feed in a closed loop with Llava.

@@ -50,12 +47,13 @@
```bash
$(./autotag local_llm) \
python3 -m local_llm.agents.video_query --api=mlc --verbose \
--model liuhaotian/llava-v1.5-7b \
--max-context-len 768 \
--max-new-tokens 32 \
--video-input /dev/video0 \
--video-output webrtc://@:8554/output \
--prompt "How many fingers am I holding up?"
```
> refer to [Enabling HTTPS/SSL](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm#enabling-httpsssl) to generate self-signed SSL certificates for enabling client-side browser webcams.
> <small>Refer to [Enabling HTTPS/SSL](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm#enabling-httpsssl) to generate self-signed SSL certificates for enabling client-side browser webcams.</small>
This uses [`jetson_utils`](https://github.com/dusty-nv/jetson-utils) for video I/O; for options related to protocols and file formats, see [Camera Streaming and Multimedia](https://github.com/dusty-nv/jetson-inference/blob/master/docs/aux-streaming.md). In the example above, it captures a V4L2 USB webcam connected to the Jetson (`/dev/video0`) and outputs a WebRTC stream that can be viewed from a browser at `https://HOSTNAME:8554`. When HTTPS/SSL is enabled, it can also capture from the browser's webcam.
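As a sketch of those I/O options (the RTSP URL and output path below are hypothetical placeholders, not values from this tutorial), the same agent can read a network camera stream and record its output to a file:

```bash
# inside the local_llm container: query an RTSP network camera instead of a
# USB webcam, and save the annotated output as a video file
# (the stream URL and output path are example placeholders)
python3 -m local_llm.agents.video_query --api=mlc \
    --model liuhaotian/llava-v1.5-7b \
    --video-input rtsp://192.168.1.2:8554/camera \
    --video-output /data/videos/output.mp4 \
    --prompt "Describe what you see."
```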

2 changes: 1 addition & 1 deletion docs/tutorial_llamaspeak.md
@@ -7,7 +7,7 @@ Talk live with Llama using Riva ASR/TTS, and chat about images with Llava!
* [`llamaspeak:v1`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/llamaspeak) - uses text-generation-webui loaders for LLM models (llama.cpp, exllama, AutoGPTQ, Transformers)
* [`llamaspeak:v2`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm) - uses AWQ/MLC from [`local_llm`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm) package, web chat voice agent

llamaspeak v2 has multimodal support for chatting about images with quantized Llava-1.5:
llamaspeak v2 has multimodal support for chatting about images with quantized vision-language models:

<a href="https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm#local_llm" target="_blank"><img src="https://raw.githubusercontent.com/dusty-nv/jetson-containers/docs/docs/images/llamaspeak_llava_clip.gif"></a>
> [Multimodal Voice Chat with LLaVA-1.5 13B on NVIDIA Jetson AGX Orin](https://www.youtube.com/watch?v=9ObzbbBTbcc) (container: [`local_llm`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm))
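As a rough sketch of launching the v2 voice agent (the `web_chat` module name, SSL paths, and model choice here are assumptions rather than values from this page; see the [`local_llm`](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm) documentation for the exact invocation and the Riva server it expects to be running):

```bash
# start the web-based voice chat agent with a multimodal Llava model;
# self-signed SSL certs are needed for browser microphone/webcam access
./run.sh --env SSL_KEY=/data/key.pem --env SSL_CERT=/data/cert.pem \
  $(./autotag local_llm) \
  python3 -m local_llm.agents.web_chat --api=mlc --verbose \
    --model liuhaotian/llava-v1.5-13b
```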