Commit 58f5085
updated examples
dusty-nv committed Apr 18, 2024
1 parent 779fedd commit 58f5085
Showing 5 changed files with 50 additions and 21 deletions.
Binary file modified docs/images/nano_llm_docs.jpg
4 changes: 1 addition & 3 deletions docs/overrides/main.html
@@ -2,17 +2,15 @@
 {% extends "base.html" %}

 <!-- Announcement bar -->
-{#
 {% block announce %}
 <style>
 .md-announce a { color: #76b900; text-decoration: underline;}
 .md-announce a:focus { color: hsl(82, 100%, 72%); text-decoration: underline; }
 .md-announce a:hover { color: hsl(82, 100%, 72%); text-decoration: underline;}
 </style>
-<div class="md-announce">View the <a href="research.html#past-meetings">recording</a> of the last Jetson AI Lab Research Group meeting! The next meeting is on 4/17 at 9am PST.</div>
+<div class="md-announce">Meta Llama 3 has been released! See the latest <a href="tutorial_nano-llm.html">examples</a> and <a href="/benchmarks.html">benchmarks</a> on Orin.</div>

 {% endblock %}
-#}

 {% block scripts %}
 <script src="//assets.adobedtm.com/5d4962a43b79/814eb6e9b4e1/launch-4bc07f1e0b0b.min.js"></script>
19 changes: 14 additions & 5 deletions docs/tutorial_api-examples.md
@@ -20,7 +20,14 @@ It's good to know the code for generating text with LLM inference, and ancillary

 - `22GB` for `l4t-text-generation` container image
 - Space for models (`>10GB`)
+
+4. Clone and setup [`jetson-containers`](https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md){:target="_blank"}:
+
+    ```bash
+    git clone https://github.com/dusty-nv/jetson-containers
+    bash jetson-containers/install.sh
+    ```
+
 ## Transformers

 The HuggingFace Transformers API is the de facto API that models are released for, often serving as the reference implementation. It's not terribly fast, but it does have broad model support, and it also supports quantization (AutoGPTQ, AWQ). This uses streaming:
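The streaming example itself sits in the lines GitHub collapses between these hunks. For orientation, here is a minimal sketch of streaming generation with Transformers using `TextIteratorStreamer`; the model name and generation settings are illustrative, not taken from this commit:

```python
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_name = 'meta-llama/Meta-Llama-3-8B-Instruct'  # illustrative; any HF causal LM works here

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='cuda')

inputs = tokenizer("Once upon a time,", return_tensors='pt').to(model.device)

# the streamer yields decoded text chunks as generate() produces tokens
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
Thread(target=model.generate, kwargs=dict(**inputs, streamer=streamer, max_new_tokens=128)).start()

for text in streamer:
    print(text, end='', flush=True)
```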
@@ -80,12 +87,12 @@ The [`NanoLLM`](https://dusty-nv.github.io/NanoLLM) library uses the optimized M

 <a href="benchmarks.html"><iframe width="600" height="371" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubchart?oid=2126319913&amp;format=interactive"></iframe></a>

-```python
+```python title="<a href='https://dusty-nv.github.io/NanoLLM' target='_blank'>NanoLLM Reference Documentation</a>"
 from nano_llm import NanoLLM, ChatHistory, ChatTemplates

 # load model
 model = NanoLLM.from_pretrained(
-    model='meta-llama/Llama-2-7b-chat-hf',
+    model='meta-llama/Meta-Llama-3-8B-Instruct',
     quantization='q4f16_ft',
     api='mlc'
 )
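The collapsed region below holds the multi-turn chat loop (the `while True:` in the next hunk header). A rough sketch of that pattern with the NanoLLM chat API follows; verify the parameter names against the NanoLLM reference documentation:

```python
# sketch of a terminal chat loop with NanoLLM's ChatHistory (check names against the docs)
chat_history = ChatHistory(model, system_prompt='You are a helpful and friendly AI assistant.')

while True:
    # read the user's prompt from the terminal
    prompt = input('>> ').strip()

    # add the user turn and embed the chat so far
    chat_history.append(role='user', msg=prompt)
    embedding, position = chat_history.embed_chat()

    # generate the reply, reusing the chat's KV cache across turns
    reply = model.generate(
        embedding,
        streaming=True,
        kv_cache=chat_history.kv_cache,
        stop_tokens=chat_history.template.stop,
        max_new_tokens=256,
    )

    # stream the output into the chat history as it's generated
    bot_reply = chat_history.append(role='bot', text='')

    for token in reply:
        bot_reply.text += token
        print(token, end='', flush=True)

    print('')
    chat_history.kv_cache = reply.kv_cache  # carry the cache into the next turn
```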
@@ -127,10 +134,12 @@ while True:
 This [example](https://github.com/dusty-nv/NanoLLM/blob/main/nano_llm/chat/example.py){:target="_blank"} keeps an interactive chat running with text being entered from the terminal. You can start it like this:

 ```bash
-jetson-containers run $(autotag nano_llm) \
+jetson-containers run \
+  --env HUGGINGFACE_TOKEN=hf_abc123def \
+  $(autotag nano_llm) \
   python3 -m nano_llm.chat.example
 ```

-Or for easy editing from the host device, copy the source into your own script and mount it into the container with the `--volume` flag.
+Or for easy editing from the host device, copy the source into your own script and mount it into the container with the `--volume` flag. And for authenticated models, request access through HuggingFace (like with [Llama](https://huggingface.co/meta-llama){:target="_blank"}) and substitute your account's API token above.
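For instance, a mounted-script invocation might look like the following, where `my_chat.py` is a hypothetical copy of the example on the host:

```bash
# run your own edited copy of the chat example, mounted from the host (hypothetical path)
jetson-containers run \
  --env HUGGINGFACE_TOKEN=hf_abc123def \
  --volume $(pwd)/my_chat.py:/opt/my_chat.py \
  $(autotag nano_llm) \
  python3 /opt/my_chat.py
```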


46 changes: 34 additions & 12 deletions docs/tutorial_nano-llm.md
@@ -1,19 +1,19 @@
 # NanoLLM - Optimized LLM Inference

-[`NanoLLM`](https://dusty-nv.github.io/NanoLLM) is a lightweight, high-performance library using optimized inferencing APIs for quantized LLMs, multimodality, speech services, vector databases with RAG, and web frontends. It's used to build many of the responsive, low-latency agents featured on this site.
+[`NanoLLM`](https://dusty-nv.github.io/NanoLLM){:target="_blank"} is a lightweight, high-performance library using optimized inferencing APIs for quantized LLMs, multimodality, speech services, vector databases with RAG, and web frontends. It's used to build many of the responsive, low-latency agents featured on this site.

 <a href="https://dusty-nv.github.io/NanoLLM" target="_blank"><img src="./images/nano_llm_docs.jpg" style="max-width: 50%; box-shadow: 2px 2px 4px rgba(0, 0, 0, 0.4);"></img></a>

 It provides <a href="tutorial_api-examples.html#nanollm" target="_blank">similar APIs</a> to HuggingFace, backed by highly-optimized inference libraries and quantization tools:

-```python
+```python title="<a href='https://dusty-nv.github.io/NanoLLM' target='_blank'>NanoLLM Reference Documentation</a>"
 from nano_llm import NanoLLM

 model = NanoLLM.from_pretrained(
-    "meta-llama/Llama-2-7b-hf",    # HuggingFace repo/model name, or path to HF model checkpoint
-    api='mlc',                     # supported APIs are: mlc, awq, hf
-    api_token='hf_abc123def',      # HuggingFace API key for authenticated models ($HUGGINGFACE_TOKEN)
-    quantization='q4f16_ft'        # q4f16_ft, q4f16_1, q8f16_0 for MLC, or path to AWQ weights
+    "meta-llama/Meta-Llama-3-8B-Instruct",  # HuggingFace repo/model name, or path to HF model checkpoint
+    api='mlc',                              # supported APIs are: mlc, awq, hf
+    api_token='hf_abc123def',               # HuggingFace API key for authenticated models ($HUGGINGFACE_TOKEN)
+    quantization='q4f16_ft'                 # q4f16_ft, q4f16_1, q8f16_0 for MLC, or path to AWQ weights
 )

 response = model.generate("Once upon a time,", max_new_tokens=128)
@@ -22,22 +22,44 @@ for token in response:
     print(token, end='', flush=True)
 ```

+## Containers
+
+To test a chat session with Llama from the command-line, install [`jetson-containers`](https://github.com/dusty-nv/jetson-containers){:target="_blank"} and run NanoLLM like this:
+
+```bash
+git clone https://github.com/dusty-nv/jetson-containers
+bash jetson-containers/install.sh
+```
+```bash
+jetson-containers run \
+  --env HUGGINGFACE_TOKEN=hf_abc123def \
+  $(autotag nano_llm) \
+  python3 -m nano_llm.chat --api mlc \
+    --model meta-llama/Meta-Llama-3-8B-Instruct \
+    --prompt "Can you tell me a joke about llamas?"
+```
+
+If you haven't already, request access to the [Llama models](https://huggingface.co/meta-llama){:target="_blank"} on HuggingFace and substitute your account's API token above.
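Since NanoLLM also reads the token from `$HUGGINGFACE_TOKEN` (see the `api_token` comment above), one sketch of forwarding it from the host environment rather than pasting it inline:

```bash
# export once on the host, then forward the variable into the container
export HUGGINGFACE_TOKEN=hf_abc123def   # token from https://huggingface.co/settings/tokens
jetson-containers run \
  --env HUGGINGFACE_TOKEN \
  $(autotag nano_llm) \
  python3 -m nano_llm.chat --api mlc \
    --model meta-llama/Meta-Llama-3-8B-Instruct
```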

 ## Resources

 Here's an index of the various tutorials & examples using NanoLLM on Jetson AI Lab:

 | | |
 | :---------- | :----------------------------------- |
-| **[Benchmarks](./benchmarks.md){:target="_blank"}** | Benchmarking results for LLM, SLM, VLM using MLC/TVM backend |
-| **[API Examples](./tutorial_api-examples.md#nanollm){:target="_blank"}** | Python code examples for completion and multi-turn chat |
-| **[Llamaspeak](./tutorial_llamaspeak.md){:target="_blank"}** | Talk verbally with LLMs using low-latency ASR/TTS speech models |
+| **[Benchmarks](./benchmarks.md){:target="_blank"}** | Benchmarking results for LLM, SLM, VLM using MLC/TVM backend. |
+| **[API Examples](./tutorial_api-examples.md#nanollm){:target="_blank"}** | Python code examples for chat, completion, and multimodal. |
+| **[Documentation](https://dusty-nv.github.io/NanoLLM){:target="_blank"}** | Reference documentation for the NanoLLM model and agent APIs. |
+| **[Llamaspeak](./tutorial_llamaspeak.md){:target="_blank"}** | Talk verbally with LLMs using low-latency ASR/TTS speech models. |
 | **[Small LLM (SLM)](./tutorial_slm.md){:target="_blank"}** | Focus on language models with reduced footprint (7B params and below) |
-| **[Live LLaVA](./tutorial_live-llava.md){:target="_blank"}** | Realtime live-streaming vision/language models on recurring prompts |
-| **[Nano VLM](./tutorial_nano-vlm.md){:target="_blank"}** | Efficient multimodal pipeline with one-shot RAG support |
+| **[Live LLaVA](./tutorial_live-llava.md){:target="_blank"}** | Realtime live-streaming vision/language models on recurring prompts. |
+| **[Nano VLM](./tutorial_nano-vlm.md){:target="_blank"}** | Efficient multimodal pipeline with one-shot image tagging and RAG support. |


 <div><iframe width="500" height="280" src="https://www.youtube.com/embed/UOjqF3YCGkY" style="display: inline-block;" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
 <iframe width="500" height="280" src="https://www.youtube.com/embed/8Eu6zG0eEGY" style="display: inline-block;" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
 </div>

+<div><iframe width="500" height="280" src="https://www.youtube.com/embed/hswNSZTvEFE" style="display: inline-block;" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
+<iframe width="500" height="280" src="https://www.youtube.com/embed/wZq7ynbgRoE" style="display: inline-block;" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
+</div>

2 changes: 1 addition & 1 deletion mkdocs.yml
@@ -82,7 +82,7 @@ nav:
   - Text (LLM):
     - text-generation-webui: tutorial_text-generation.md
     - llamaspeak: tutorial_llamaspeak.md
-    - NanoLLM: tutorial_nano-llm.md
+    - NanoLLM 🆕: tutorial_nano-llm.md
     - Small LLM (SLM): tutorial_slm.md
     - API Examples: tutorial_api-examples.md
   - Text + Vision (VLM):
