From 487b4f0c2fe61e8986fbf9d0dc36b6f91fb0c65a Mon Sep 17 00:00:00 2001
From: Dustin Franklin
Date: Wed, 25 Sep 2024 14:07:39 -0400
Subject: [PATCH] added Llama-Vision

---
 docs/llama_vlm.md | 97 +++++++++++++++++++++++++++++++++++++++++++++++
 mkdocs.yml        |  1 +
 2 files changed, 98 insertions(+)
 create mode 100644 docs/llama_vlm.md

diff --git a/docs/llama_vlm.md b/docs/llama_vlm.md
new file mode 100644
index 00000000..c05ddaa0
--- /dev/null
+++ b/docs/llama_vlm.md
@@ -0,0 +1,97 @@
+# Llama 3.2 Vision

The latest additions to Meta's family of foundation LLMs include multimodal vision/language models (VLMs) in 11B and 90B sizes, with high-resolution image inputs (1120x1120), cross-attention between the vision and language layers, and both base completion and instruction-tuned chat variants:

* [`Llama-3.2-11B-Vision`](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision)
* [`Llama-3.2-11B-Vision-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)
* [`Llama-3.2-90B-Vision`](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision)
* [`Llama-3.2-90B-Vision-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-90B-Vision-Instruct)

While quantization and optimization efforts are underway, we have started by running the unquantized 11B model in a container based on HuggingFace Transformers that has been updated with the latest support for Llama-3.2-Vision, giving you a jump start on trying out these exciting new multimodal models - thanks to Meta for continuing to release open Llama models!

!!! abstract "What you need"

    1. One of the following Jetson devices:

        Jetson AGX Orin (64GB)
        Jetson AGX Orin (32GB)

    2. Running one of the following versions of [JetPack](https://developer.nvidia.com/embedded/jetpack):

        JetPack 6 (L4T r36)

    3. Sufficient storage space (preferably with NVMe SSD).

        - `12.8GB` for `llama-vision` container image
        - Space for models (`>25GB`)

    4. Clone and setup [`jetson-containers`](https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md){:target="_blank"}:

        ```bash
        git clone https://github.com/dusty-nv/jetson-containers
        bash jetson-containers/install.sh
        ```

    5. Request access to the gated models [here](https://huggingface.co/meta-llama) with your HuggingFace API key.


## Code Example

Today, Llama-3.2-11B-Vision can be run on Jetson AGX Orin in half precision (FP16/BF16) via HuggingFace Transformers. Here's a simple code example from the model card for using it:

```python
import time
import requests
import torch

from PIL import Image
from transformers import MllamaForConditionalGeneration, AutoProcessor

model_id = "meta-llama/Llama-3.2-11B-Vision"
model = MllamaForConditionalGeneration.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(model_id)

prompt = "<|image|><|begin_of_text|>If I had to write a haiku for this one"
url = "https://llava-vl.github.io/static/images/view.jpg"
raw_image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=raw_image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, do_sample=False, max_new_tokens=32)
print(processor.decode(output[0], skip_special_tokens=True))
```

The model continues the prompt with a completion like this:

```
If I had to write a haiku for this one, it would be:

A dock on a lake.
A mountain in the distance.
A long exposure.
```

Initial testing suggests that Llama-3.2-Vision retains more conversational ability than VLMs typically do after VQA alignment.
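Since loading the unquantized weights takes far longer than generating a short reply, it's worth keeping the model resident in memory and looping over new prompts and images. Below is a minimal sketch of such an interactive loop - it's only illustrative (not the actual implementation shipped in the container) and assumes the `model`, `processor`, `raw_image`, and the `Image`/`requests` imports from the example above are still in scope:

```python
# Illustrative sketch only - reuses `model`, `processor`, `raw_image`, `Image`,
# and `requests` from the previous example so each turn skips the model load.
import time

while True:
    image = input("\nEnter image path/URL (blank to reuse the last image): ").strip()
    if image:
        if image.startswith("http"):
            raw_image = Image.open(requests.get(image, stream=True).raw)
        else:
            raw_image = Image.open(image)

    prompt = input("Enter prompt (blank to exit): ").strip()
    if not prompt:
        break

    # the base (non-Instruct) model expects the image token before the text
    inputs = processor(
        text=f"<|image|><|begin_of_text|>{prompt}",
        images=raw_image,
        return_tensors="pt",
    ).to(model.device)

    start = time.perf_counter()
    output = model.generate(**inputs, do_sample=False, max_new_tokens=32)
    elapsed = time.perf_counter() - start

    num_input = inputs["input_ids"].shape[-1]   # prompt length in tokens
    num_new = output.shape[-1] - num_input      # newly generated tokens

    # decode only the generated continuation, then report rough throughput
    print(processor.decode(output[0][num_input:], skip_special_tokens=True))
    print(f"\ntotal {elapsed:.4f}s ({num_new} tokens, {num_new/elapsed:.2f} tokens/sec)")
```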
The [llama_vision.py](https://github.com/dusty-nv/jetson-containers/blob/master/packages/vlm/llama-vision/llama_vision.py) script provides this kind of interactive completion and image loading, so the model doesn't have to be re-loaded between prompts. It can be launched from the container like this:

```bash
jetson-containers run \
    -e HUGGINGFACE_TOKEN=YOUR_API_KEY \
    $(autotag llama-vision) \
        python3 /opt/llama_vision.py \
            --model "meta-llama/Llama-3.2-11B-Vision" \
            --image "/data/images/hoover.jpg" \
            --prompt "I'm out in the" \
            --max-new-tokens 32 \
            --interactive
```

After processing the initial [image](https://github.com/dusty-nv/jetson-containers/blob/master/data/images/hoover.jpg), it will ask you to submit another prompt or image:

```
total 4.8346s (39 tokens, 8.07 tokens/sec)

Enter prompt or image path/URL:

>>
```

We will update this page and the container as support for the Llama-3.2-Vision architecture is added to quantization APIs like MLC and llama.cpp (GGUF), which will reduce memory usage and latency.

diff --git a/mkdocs.yml b/mkdocs.yml
index 524376f1..666312b8 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -94,6 +94,7 @@ nav:
- LLaVA: tutorial_llava.md
- Live LLaVA: tutorial_live-llava.md
- NanoVLM: tutorial_nano-vlm.md
+ - Llama 3.2 Vision: llama_vlm.md
- Vision Transformers (ViT):
- vit/index.md
- EfficientViT: vit/tutorial_efficientvit.md