
Merge pull request #56 from dusty-nv/20231218-llava
20231218 llava
dusty-nv authored Dec 22, 2023
2 parents 781858c + 9e61575 commit 27ad2f7
Showing 1 changed file with 83 additions and 4 deletions.
87 changes: 83 additions & 4 deletions docs/tutorial_llava.md
@@ -4,13 +4,15 @@

1. [Chat with Llava using `text-generation-webui`](#1-chat-with-llava-using-text-generation-webui)
2. [Run from the terminal with `llava.serve.cli`](#2-run-from-the-terminal-with-llavaservecli)
- 3. [Quantized GGUF with llama.cpp](#3-quantized-gguf-with-llamacpp)
+ 3. [Quantized GGUF models with `llama.cpp`](#3-quantized-gguf-models-with-llamacpp)
4. [Optimized Multimodal Pipeline with `local_llm`](#4-optimized-multimodal-pipeline-with-local_llm)

| Llava-1.5-13B (Jetson AGX Orin) | Quantization | Tokens/sec | Memory |
|---------------------------------------------------------------------------|:------------:|:----------:|:-------:|
- | [`text-generation-webui`](#1-chat-with-llava-using-text-generation-webui) | 4-bit (GPTQ) | 2.3 | 8.8 GB |
+ | [`text-generation-webui`](#1-chat-with-llava-using-text-generation-webui) | 4-bit (GPTQ) | 2.3 | 9.7 GB |
| [`llava.serve.cli`](#2-run-from-the-terminal-with-llavaservecli) | FP16 (None) | 4.2 | 27.7 GB |
- | [`llama.cpp`](#3-quantized-gguf-with-llamacpp) | 4-bit (Q4_K) | 10.1 | 9.2 GB |
+ | [`llama.cpp`](#3-quantized-gguf-models-with-llamacpp) | 4-bit (Q4_K) | 10.1 | 9.2 GB |
| [`local_llm`](#4-optimized-multimodal-pipeline-with-local_llm) | 4-bit (MLC) | 21.1 | 8.7 GB |

The latest Llava-1.5 is used in this tutorial. It comes in 7B and 13B variants; the 13B model has significantly better accuracy.

@@ -168,7 +170,7 @@ python3 -m llava.serve.model_worker \
```
-->

- ## 3. Quantized GGUF with `llama.cpp`
+ ## 3. Quantized GGUF models with `llama.cpp`

[llama.cpp](https://github.com/ggerganov/llama.cpp) is one of the faster LLM APIs, and it can apply a variety of quantization methods to Llava to reduce its memory usage and runtime. It uses CUDA for LLM inference on the GPU. There are pre-quantized versions of Llava-1.5 available in GGUF format for 4-bit and 5-bit:

@@ -204,3 +206,80 @@ In this image, a small wooden pier extends out into a calm lake, surrounded by t
```

You can put your own images in the mounted `jetson-containers/data` directory. The C++ code for `llava-cli` can be found [here](https://github.com/ggerganov/llama.cpp/tree/master/examples/llava). The llama-cpp-python bindings also [support Llava](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#multi-modal-models), however they are significantly slower when called from Python (potentially due to the pre/post-processing).
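
If you want to script `llava-cli` yourself, a minimal sketch is shown below. The model and projector paths (and the location of the `llava-cli` binary inside the container) are assumptions — point them at whichever GGUF files you actually downloaded under `jetson-containers/data`:

``` bash
# Minimal sketch of calling llava-cli directly (paths below are assumptions --
# substitute the quantized GGUF model and CLIP projector you downloaded).
MODEL=/data/models/llava-1.5-13b/ggml-model-q4_k.gguf    # quantized language model
MMPROJ=/data/models/llava-1.5-13b/mmproj-model-f16.gguf  # CLIP projector weights

./llava-cli -m $MODEL --mmproj $MMPROJ \
    --image /data/images/lake.jpg \
    -p "Describe the scene." \
    --temp 0.1 \
    -ngl 999   # offload all model layers to the GPU
```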

## 4. Optimized Multimodal Pipeline with `local_llm`

The optimized [local_llm](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm) container using MLC/TVM for quantization and inference provides the highest performance in this tutorial on Jetson. It efficiently manages the CLIP embeddings and KV cache. You can find the Python code for the chat program used in this example [here](https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/local_llm/__main__.py).

``` bash
./run.sh $(./autotag local_llm) \
python3 -m local_llm --api=mlc \
--model liuhaotian/llava-v1.5-13b
```

This starts an interactive console-based chat with Llava. On the first run, the model will automatically be downloaded from HuggingFace and quantized with MLC to W4A16 precision (which can take some time). See [here](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm#text-chat) for the command-line options of the local_llm [`__main__.py`](https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/local_llm/__main__.py).

You'll end up at a `>> PROMPT:` prompt, where you can enter the path or URL of an image file, followed by your question about the image. You can follow up with multiple questions about the same image. Llava-1.5 does not understand multiple images in the same chat, so when changing images, first reset the chat history by entering `clear` or `reset` as the prompt. You can also automate this from the command line:

``` bash
./run.sh $(./autotag local_llm) \
python3 -m local_llm --api=mlc \
--model liuhaotian/llava-v1.5-13b \
--prompt '/data/images/hoover.jpg' \
--prompt 'what does the road sign say?' \
--prompt 'what kind of environment is it?' \
--prompt 'reset' \
--prompt '/data/images/lake.jpg' \
--prompt 'please describe the scene.' \
--prompt 'are there any hazards to be aware of?'
```

**Results** [[hoover.jpg]](https://github.com/dusty-nv/jetson-containers/blob/master/data/images/hoover.jpg) [[lake.jpg]](https://github.com/dusty-nv/jetson-containers/blob/master/data/images/lake.jpg)

```
>> PROMPT: /data/images/hoover.jpg
>> PROMPT: what does the road sign say?
The road sign says "Hoover Dam exit 2".
>> PROMPT: what kind of environment is it?
It is a mountainous environment, with a road going through the mountains.
>> PROMPT: /data/images/lake.jpg
>> PROMPT: please describe the scene.
The image features a wooden pier extending out into a large body of water, possibly a lake. The pier is situated near a forest, creating a serene and peaceful atmosphere. The water appears to be calm, and the pier seems to be the only structure in the area. The scene is captured during the day, with the sunlight illuminating the landscape.
>> PROMPT: are there any hazards to be aware of?
The image does not provide any specific hazards to be aware of. However, it is essential to be cautious while walking on a pier, as it may be slippery or have loose boards. Additionally, one should be mindful of the water depth and currents, as well as any potential wildlife in the area.
```

#### Benchmarks

| Model | Response | Tokens/sec | Memory |
|-----------------|-------------------------------------------|:----------:|:------:|
| `llava-1.5-7b` | The road sign says "Hoover Dam 1/2 Mile." | 42.2 | 6.4 GB |
| `llava-1.5-13b` | The road sign says "Hoover Dam exit 2". | 21.1 | 8.7 GB |
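
The 7B row above corresponds to simply swapping the `--model` argument in the command earlier; for example, to re-run the road-sign query with the smaller checkpoint (a sketch of the substitution, not a separately documented command):

``` bash
# Re-run the road-sign query with the 7B checkpoint -- only the --model
# argument changes from the 13B command above.
./run.sh $(./autotag local_llm) \
  python3 -m local_llm --api=mlc \
    --model liuhaotian/llava-v1.5-7b \
    --prompt '/data/images/hoover.jpg' \
    --prompt 'what does the road sign say?'
```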

#### JSON

Llava-1.5 can also output JSON, which the authors cover in the [paper](https://arxiv.org/abs/2310.03744), and which can be used to programmatically query information about the image:

``` bash
./run.sh $(./autotag local_llm) \
python3 -m local_llm --api=mlc \
--model liuhaotian/llava-v1.5-13b \
--prompt '/data/images/hoover.jpg' \
--prompt 'extract any text from the image as json'
```
```
{
"sign": "Hoover Dam",
"exit": "2",
"distance": "1 1/2 mile"
}
```
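
Because the reply is plain JSON, it can be consumed by ordinary shell tooling. As a small sketch (assuming you've copied or redirected the model's reply into a file such as `hoover.json` — something `local_llm` does not do for you automatically):

``` bash
# Sketch: parse the saved JSON reply with jq (assumes the reply was
# saved to hoover.json -- local_llm doesn't write this file itself).
jq -r '.sign, .distance' hoover.json
# Hoover Dam
# 1 1/2 mile
```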

#### Web UI

To use local_llm with a web UI instead, see the [Voice Chat](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm#voice-chat) section of the documentation:

<a href="https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm#local_llm" target="_blank"><img src="https://raw.githubusercontent.com/dusty-nv/jetson-containers/docs/docs/images/llamaspeak_llava_clip.gif"></a>
