updated llava tutorial with local_llm
dusty-nv committed Dec 22, 2023
1 parent 9e682c3 commit 9e61575
Showing 1 changed file with 15 additions and 6 deletions.
21 changes: 15 additions & 6 deletions docs/tutorial_llava.md
@@ -4,15 +4,15 @@

1. [Chat with Llava using `text-generation-webui`](#1-chat-with-llava-using-text-generation-webui)
2. [Run from the terminal with `llava.serve.cli`](#2-run-from-the-terminal-with-llavaservecli)
-3. [Quantized GGUF with `llama.cpp`](#3-quantized-gguf-with-llamacpp)
+3. [Quantized GGUF models with `llama.cpp`](#3-quantized-gguf-models-with-llamacpp)
4. [Optimized Multimodal Pipeline with `local_llm`](#4-optimized-multimodal-pipeline-with-local_llm)

| Llava-1.5-13B (Jetson AGX Orin) | Quantization | Tokens/sec | Memory |
|---------------------------------------------------------------------------|:------------:|:----------:|:-------:|
-| [`text-generation-webui`](#1-chat-with-llava-using-text-generation-webui) | 4-bit (GPTQ) | 2.3 | 8.8 GB |
+| [`text-generation-webui`](#1-chat-with-llava-using-text-generation-webui) | 4-bit (GPTQ) | 2.3 | 9.7 GB |
| [`llava.serve.cli`](#2-run-from-the-terminal-with-llavaservecli) | FP16 (None) | 4.2 | 27.7 GB |
-| [`llama.cpp`](#3-quantized-gguf-with-llamacpp)                            | 4-bit (Q4_K) | 10.1       | 9.2 GB |
-| [`local_llm`](#4-optimized-multimodal-pipeline-with-local_llm)            | 4-bit (MLC)  | 21.1       | 8.5 GB |
+| [`llama.cpp`](#3-quantized-gguf-models-with-llamacpp)                     | 4-bit (Q4_K) | 10.1       | 9.2 GB |
+| [`local_llm`](#4-optimized-multimodal-pipeline-with-local_llm)            | 4-bit (MLC)  | 21.1       | 8.7 GB |

The latest Llava-1.5 is used in this tutorial. It comes in 7B and 13B variants; the 13B model is significantly more accurate.

@@ -170,7 +170,7 @@ python3 -m llava.serve.model_worker \
```
-->

-## 3. Quantized GGUF with `llama.cpp`
+## 3. Quantized GGUF models with `llama.cpp`

[llama.cpp](https://github.com/ggerganov/llama.cpp) is one of the faster LLM APIs, and it can apply a variety of quantization methods to Llava to reduce its memory usage and runtime. It uses CUDA for LLM inference on the GPU. Pre-quantized versions of Llava-1.5 are available in GGUF format for 4-bit and 5-bit:
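
Once a quantized GGUF checkpoint and its CLIP projector (`mmproj`) are on disk, a typical invocation of llama.cpp's bundled llava example looks like the minimal sketch below (the paths and prompt are placeholders, and the exact binary name and flags can differ between llama.cpp versions):

```bash
# Minimal sketch, assuming the quantized GGUF weights and the
# mmproj (CLIP projector) file have already been downloaded;
# the paths and prompt below are placeholder examples.
# -ngl offloads all model layers to the GPU.
./llava-cli \
    -m models/llava-v1.5-13b/ggml-model-q4_k.gguf \
    --mmproj models/llava-v1.5-13b/mmproj-model-f16.gguf \
    --image /data/images/hoover.jpg \
    -p "What does the sign in this image say?" \
    -ngl 999
```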

@@ -254,6 +254,11 @@ The image does not provide any specific hazards to be aware of. However, it is e

#### Benchmarks

+| Model           | Response                                   | Tokens/sec | Memory |
+|-----------------|--------------------------------------------|:----------:|:------:|
+| `llava-1.5-7b`  | The road sign says "Hoover Dam 1/2 Mile."  | 42.2       | 6.4 GB |
+| `llava-1.5-13b` | The road sign says "Hoover Dam exit 2".    | 21.1       | 8.7 GB |

#### JSON

Llava-1.5 can also output JSON, which the authors cover in the [paper](https://arxiv.org/abs/2310.03744), and which can be used to programmatically query information about the image:
@@ -273,4 +278,8 @@ Llava-1.5 can also output JSON, which the authors cover in the [paper](https://a
}
```
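
Because the reply is plain JSON, it can be handed straight to downstream tooling. A minimal sketch with `jq` is shown below; the `reply.json` file and the field names are hypothetical, since the actual keys depend on the schema requested in the prompt:

```bash
# Hypothetical example: Llava's JSON reply was saved to reply.json
jq -r 'keys[]' reply.json          # list the fields the model emitted
jq -r '.description' reply.json    # extract a single (assumed) field
```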


+#### Web UI
+
+To use `local_llm` with a web UI instead, see the [Voice Chat](https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm#voice-chat) section of the documentation:
+
<a href="https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/local_llm#local_llm" target="_blank"><img src="https://raw.githubusercontent.com/dusty-nv/jetson-containers/docs/docs/images/llamaspeak_llava_clip.gif"></a>
