Skip to content

Commit

Permalink
added API examples
Browse files Browse the repository at this point in the history
  • Loading branch information
dusty-nv committed Mar 17, 2024
1 parent 14342f1 commit 3c9606b
Show file tree
Hide file tree
Showing 2 changed files with 139 additions and 1 deletion.
137 changes: 137 additions & 0 deletions docs/tutorial_api-examples.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,137 @@
# Tutorial - API Examples

It's good to know the code for generating text with LLM inference, and ancillary things like tokenization, chat templates, and prompting. On this page we give Python examples of running various LLM APIs, and their benchmarks.

!!! abstract "What you need"

1. One of the following Jetson devices:

<span class="blobDarkGreen4">Jetson AGX Orin (64GB)</span>
<span class="blobDarkGreen5">Jetson AGX Orin (32GB)</span>
<span class="blobLightGreen3">Jetson Orin NX (16GB)</span>
<span class="blobLightGreen4">Jetson Orin Nano (8GB)</span><span title="Orin Nano 8GB can run 7B quantized models">⚠️</span>
2. Running one of the following versions of [JetPack](https://developer.nvidia.com/embedded/jetpack):

<span class="blobPink2">JetPack 5 (L4T r35.x)</span>
<span class="blobPink2">JetPack 6 (L4T r36.x)</span>

3. Sufficient storage space (preferably with NVMe SSD).

- `22GB` for `l4t-text-generation` container image
- Space for models (`>10GB`)
## Transformers

The HuggingFace Transformers API is the de-facto API that models are released for, often serving as the reference implementation. It's not terribly fast, but it does have broad model support, and also supports quantization (AutoGPTQ, AWQ). This uses streaming:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from threading import Thread

model_name='meta-llama/Llama-2-7b-chat-hf'
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='cuda')

tokenizer = AutoTokenizer.from_pretrained(model_name)
streamer = TextIteratorStreamer(tokenizer)

prompt = [{'role': 'user', 'content': 'Can I get a recipe for French Onion soup?'}]
inputs = tokenizer.apply_chat_template(
prompt,
add_generation_prompt=True,
return_tensors='pt'
).to(model.device)

Thread(target=lambda: model.generate(inputs, max_new_tokens=256, streamer=streamer)).start()

for text in streamer:
print(text, end='', flush=True)
```

To run this (it can be found [here](https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/transformers/test.py){:target="_blank"}), you can mount a directory containing the script or your jetson-containers directory:

```bash
./run.sh --volume $PWD/packages/llm:/mount --workdir /mount \
$(./autotag l4t-text-generation) \
python3 transformers/test.py
```

We use the `l4t-text-generation` container because it includes the quantization libraries in addition to Transformers, for running the quanztized versions of the models like `TheBloke/Llama-2-7B-Chat-GPTQ`

### Benchmarks

The [`huggingface-benchmark.py`](https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/transformers/huggingface-benchmark.py){:target="_blank"} script will benchmark the models:

```bash
./run.sh --volume $PWD/packages/llm/transformers:/mount --workdir /mount \
$(./autotag l4t-text-generation) \
python3 huggingface-benchmark.py --model meta-llama/Llama-2-7b-chat-hf
```

```
* meta-llama/Llama-2-7b-chat-hf AVG = 20.7077 seconds, 6.2 tokens/sec memory=10173.45 MB
* TheBloke/Llama-2-7B-Chat-GPTQ AVG = 12.3922 seconds, 10.3 tokens/sec memory=7023.36 MB
* TheBloke/Llama-2-7B-Chat-AWQ AVG = 11.4667 seconds, 11.2 tokens/sec memory=4662.34 MB
```

## local_llm

The `local_llm` container uses the optimized MLC/TVM library for inference, like on the [Benchmarks](benchmarks.md) page:

<a href="benchmarks.html"><img width="600px" src="overrides/images/graph_llm-text-generation.svg"/></a>

```python
from local_llm import LocalLM, ChatHistory, ChatTemplates
from termcolor import cprint

# load model
model = LocalLM.from_pretrained(
model='meta-llama/Llama-2-7b-chat-hf',
quant='q4f16_ft',
api='mlc'
)

# create the chat history
chat_history = ChatHistory(model, system_prompt="You are a helpful and friendly AI assistant.")

while True:
# enter the user query from terminal
print('>> ', end='', flush=True)
prompt = input().strip()

# add user prompt and generate chat tokens/embeddings
chat_history.append(role='user', msg=prompt)
embedding, position = chat_history.embed_chat()

# generate bot reply
reply = model.generate(
embedding,
streaming=True,
kv_cache=chat_history.kv_cache,
stop_tokens=chat_history.template.stop,
max_new_tokens=256,
)

# append the output stream to the chat history
bot_reply = chat_history.append(role='bot', text='')

for token in reply:
bot_reply.text += token
cprint(token, color='blue', end='', flush=True)

print('\n')

# save the inter-request KV cache
chat_history.kv_cache = reply.kv_cache
```

This [example](https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/local_llm/chat/example.py){:target="_blank"} keeps an interactive chat running with text being entered from the terminal. You can start it like this:

```python
./run.sh $(./autotag local_llm) \
python3 -m local_llm.chat.example
```

Or for easy editing from the host device, copy the source into your own script and mount it into the container with the `--volume` flag.


3 changes: 2 additions & 1 deletion mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -81,6 +81,7 @@ nav:
- text-generation-webui: tutorial_text-generation.md
- llamaspeak: tutorial_llamaspeak.md
- Small LLM (SLM) 🆕: tutorial_slm.md
- API Examples 🆕: tutorial_api-examples.md
- Text + Vision (VLM):
- Mini-GPT4: tutorial_minigpt4.md
- LLaVA: tutorial_llava.md
Expand All @@ -103,7 +104,7 @@ nav:
- AudioCraft: tutorial_audiocraft.md
- Whisper: tutorial_whisper.md
- Metropolis Microservices:
- First steps: tutorial_mmj.md
- First steps 🆕: tutorial_mmj.md
# - Tools:
# - LangChain: tutorial_distillation.md
- Tips:
Expand Down

0 comments on commit 3c9606b

Please sign in to comment.