Merge pull request #55 from dusty-nv/20231218-llava
20231218 llava
dusty-nv authored Dec 21, 2023
2 parents 7d32308 + a1abc40 commit 781858c
Showing 2 changed files with 52 additions and 5 deletions.
Binary file added docs/images/tgwui_multimodal_llava_fish.jpg
57 changes: 52 additions & 5 deletions docs/tutorial_llava.md
@@ -1,11 +1,20 @@
# Tutorial - LLaVA

[LLaVA](https://llava-vl.github.io/) is a popular multimodal vision/language model that you can run locally on Jetson to answer questions about image prompts and queries. Llava uses the [CLIP](https://openai.com/research/clip) vision encoder to transform images into a common embedding space that its LLM (which has the same architecture as Llama) can understand alongside text. Below we'll cover a few methods to run Llava on Jetson, using quantization for improved performance.

1. [Chat with Llava using `text-generation-webui`](#1-chat-with-llava-using-text-generation-webui)
2. [Run from the terminal with `llava.serve.cli`](#2-run-from-the-terminal-with-llavaservecli)
3. [Quantized GGUF with llama.cpp](#3-quantized-gguf-with-llamacpp)

![](./images/tgwui_multimodal_llava_spacewalk.png)
| Llava-1.5-13B (Jetson AGX Orin) | Quantization | Tokens/sec | Memory |
|---------------------------------------------------------------------------|:------------:|:----------:|:-------:|
| [`text-generation-webui`](#1-chat-with-llava-using-text-generation-webui) | 4-bit (GPTQ) | 2.3 | 8.8 GB |
| [`llava.serve.cli`](#2-run-from-the-terminal-with-llavaservecli) | FP16 (None) | 4.2 | 27.7 GB |
| [`llama.cpp`](#3-quantized-gguf-with-llamacpp) | 4-bit (Q4_K) | 10.1 | 9.2 GB |

This tutorial uses the latest Llava-1.5, which comes in 7B and 13B variants; the 13B model has significantly better accuracy.

![](./images/tgwui_multimodal_llava_fish.jpg)

### Clone and set up `jetson-containers`

@@ -37,7 +46,7 @@ pip3 install -r requirements.txt
- CLIP model : `1.7GB`
- Llava-v1.5-13B-GPTQ model : `7.25GB`

The [oobabooga](https://github.com/oobabooga/text-generation-webui) chat UI from the [LLM tutorial](tutorial_text-generation.md) has a multimodal extension for Llava, and it supports AutoGPTQ quantization. If you already used text-generation-webui before 12/2023, run `sudo docker pull $(./autotag text-generation-webui)` to update to the latest container.
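Once the model is downloaded (next section), a launch along these lines enables the multimodal pipeline. This is a hedged sketch: the `--extensions multimodal`, `--multimodal-pipeline`, and `--loader autogptq` flags come from the upstream multimodal extension, and the model name and paths are assumptions to adjust to your setup:

```bash
# Hedged sketch: the extension flags are from upstream text-generation-webui;
# the model name and --model-dir path are assumptions, adjust to your download
./run.sh $(./autotag text-generation-webui) \
  python3 /opt/text-generation-webui/server.py --listen \
    --model-dir /data/models/text-generation-webui \
    --model TheBloke_llava-v1.5-13B-GPTQ \
    --extensions multimodal \
    --multimodal-pipeline llava-v1.5-13b \
    --loader autogptq
```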

### Download Model

@@ -108,7 +117,7 @@ This example uses the upstream [Llava codebase](https://github.com/haotian-liu/L
--model-path liuhaotian/llava-v1.5-13b \
--image-file /data/images/hoover.jpg
```
<small>The unquantized 13B model may only run on Jetson AGX Orin 64GB due to its memory requirements.</small>
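If the full-precision weights don't fit in memory on your device, the upstream LLaVA CLI also supports quantizing them at load time with bitsandbytes. Here is a hedged variation of the command above using the `--load-4bit` flag documented in the upstream repo (`--load-8bit` is the 8-bit equivalent):

```bash
# Hedged variation: --load-4bit / --load-8bit are upstream llava.serve.cli flags
# that quantize the weights on load (assumes bitsandbytes is available in the container)
python3 -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-13b \
    --image-file /data/images/hoover.jpg \
    --load-4bit
```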

<!--
@@ -156,4 +165,42 @@ python3 -m llava.serve.model_worker \
--controller http://localhost:10000 --port 40000 \
--worker http://localhost:40000 \
--model-path $(huggingface-downloader liuhaotian/llava-llama-2-13b-chat-lightning-preview)
```
-->

## 3. Quantized GGUF with `llama.cpp`

[llama.cpp](https://github.com/ggerganov/llama.cpp) is one of the faster LLM inference APIs, and it can apply a variety of quantization methods to Llava to reduce its memory usage and runtime. It uses CUDA for LLM inference on the GPU. Pre-quantized versions of Llava-1.5 are available in GGUF format for 4-bit and 5-bit:

* [mys/ggml_llava-v1.5-7b](https://huggingface.co/mys/ggml_llava-v1.5-7b)
* [mys/ggml_llava-v1.5-13b](https://huggingface.co/mys/ggml_llava-v1.5-13b)

```bash
./run.sh --workdir=/opt/llama.cpp/bin $(./autotag llama_cpp:gguf) \
/bin/bash -c './llava-cli \
--model $(huggingface-downloader mys/ggml_llava-v1.5-13b/ggml-model-q4_k.gguf) \
--mmproj $(huggingface-downloader mys/ggml_llava-v1.5-13b/mmproj-model-f16.gguf) \
--n-gpu-layers 999 \
--image /data/images/hoover.jpg \
--prompt "What does the sign say"'
```

| Quantization | Bits | Response | Tokens/sec | Memory |
|--------------|:----:|-------------------------------------|:----------:|:-------:|
| `Q4_K` | 4 | The sign says "Hoover Dam, Exit 9." | 10.17 | 9.2 GB |
| `Q5_K` | 5 | The sign says "Hoover Dam exit 9." | 9.73 | 10.4 GB |
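
To try the 5-bit variant from the table, swap in the Q5_K weights (assuming they are published as `ggml-model-q5_k.gguf` in the same `mys/ggml_llava-v1.5-13b` repository):

```bash
# Assumes the Q5_K file ships alongside the Q4_K file in mys/ggml_llava-v1.5-13b
./run.sh --workdir=/opt/llama.cpp/bin $(./autotag llama_cpp:gguf) \
  /bin/bash -c './llava-cli \
    --model $(huggingface-downloader mys/ggml_llava-v1.5-13b/ggml-model-q5_k.gguf) \
    --mmproj $(huggingface-downloader mys/ggml_llava-v1.5-13b/mmproj-model-f16.gguf) \
    --n-gpu-layers 999 \
    --image /data/images/hoover.jpg \
    --prompt "What does the sign say"'
```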

A lower temperature like 0.1 is recommended for better quality (`--temp 0.1`), and if you omit `--prompt` it will describe the image:

```bash
./run.sh --workdir=/opt/llama.cpp/bin $(./autotag llama_cpp:gguf) \
/bin/bash -c './llava-cli \
--model $(huggingface-downloader mys/ggml_llava-v1.5-13b/ggml-model-q4_k.gguf) \
--mmproj $(huggingface-downloader mys/ggml_llava-v1.5-13b/mmproj-model-f16.gguf) \
--n-gpu-layers 999 \
--image /data/images/lake.jpg'

In this image, a small wooden pier extends out into a calm lake, surrounded by tall trees and mountains. The pier seems to be the only access point to the lake. The serene scene includes a few boats scattered across the water, with one near the pier and the others further away. The overall atmosphere suggests a peaceful and tranquil setting, perfect for relaxation and enjoying nature.
```

You can put your own images in the mounted `jetson-containers/data` directory. The C++ code for llava-cli can be found [here](https://github.com/ggerganov/llama.cpp/tree/master/examples/llava). The llama-cpp-python bindings also [support Llava](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#multi-modal-models), although they are significantly slower when called from Python (possibly due to the image pre/post-processing).
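
For example, assuming you cloned jetson-containers to `~/jetson-containers` (adjust the path to your checkout), you can copy a photo into `data/images/` on the host and reference it under `/data/images/` inside the container, adding the `--temp 0.1` recommendation from above:

```bash
# The host-side clone path is an assumption; data/ on the host is mounted as /data in the container
cp ~/Pictures/my_photo.jpg ~/jetson-containers/data/images/

./run.sh --workdir=/opt/llama.cpp/bin $(./autotag llama_cpp:gguf) \
  /bin/bash -c './llava-cli \
    --model $(huggingface-downloader mys/ggml_llava-v1.5-13b/ggml-model-q4_k.gguf) \
    --mmproj $(huggingface-downloader mys/ggml_llava-v1.5-13b/mmproj-model-f16.gguf) \
    --n-gpu-layers 999 \
    --temp 0.1 \
    --image /data/images/my_photo.jpg \
    --prompt "Describe this image in detail"'
```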
