Commit 58f5085
updated examples
dusty-nv committed Apr 18, 2024
1 parent 779fedd commit 58f5085
Showing 5 changed files with 50 additions and 21 deletions.
Binary file modified docs/images/nano_llm_docs.jpg
4 changes: 1 addition & 3 deletions docs/overrides/main.html
@@ -2,17 +2,15 @@
 {% extends "base.html" %}

 <!-- Announcement bar -->
-{#
 {% block announce %}
 <style>
 .md-announce a { color: #76b900; text-decoration: underline;}
 .md-announce a:focus { color: hsl(82, 100%, 72%); text-decoration: underline; }
 .md-announce a:hover { color: hsl(82, 100%, 72%); text-decoration: underline;}
 </style>
-<div class="md-announce">View the <a href="research.html#past-meetings">recording</a> of the last Jetson AI Lab Research Group meeting! The next meeting is on 4/17 at 9am PST.</div>
+<div class="md-announce">Meta Llama 3 has been released! See the latest <a href="tutorial_nano-llm.html">examples</a> and <a href="/benchmarks.html">benchmarks</a> on Orin.</div>

 {% endblock %}
-#}

 {% block scripts %}
 <script src="//assets.adobedtm.com/5d4962a43b79/814eb6e9b4e1/launch-4bc07f1e0b0b.min.js"></script>
19 changes: 14 additions & 5 deletions docs/tutorial_api-examples.md
@@ -20,7 +20,14 @@ It's good to know the code for generating text with LLM inference, and ancillary

 - `22GB` for `l4t-text-generation` container image
 - Space for models (`>10GB`)
+
+4. Clone and setup [`jetson-containers`](https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md){:target="_blank"}:
+
+    ```bash
+    git clone https://github.com/dusty-nv/jetson-containers
+    bash jetson-containers/install.sh
+    ```
+
 ## Transformers

 The HuggingFace Transformers API is the de facto API that models are released for, often serving as the reference implementation. It's not terribly fast, but it does have broad model support, and it also supports quantization (AutoGPTQ, AWQ). This uses streaming:
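The streaming example itself sits in the lines GitHub collapses between these hunks. For orientation, here is a minimal sketch of streaming generation with Transformers using `TextIteratorStreamer`; the model name and generation settings are illustrative, not taken from this commit:

```python
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_name = 'meta-llama/Meta-Llama-3-8B-Instruct'  # illustrative; any HF causal LM works here

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map='cuda')

inputs = tokenizer("Once upon a time,", return_tensors='pt').to(model.device)

# the streamer yields decoded text chunks as generate() produces tokens
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
Thread(target=model.generate, kwargs=dict(**inputs, streamer=streamer, max_new_tokens=128)).start()

for text in streamer:
    print(text, end='', flush=True)
```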
@@ -80,12 +87,12 @@ The [`NanoLLM`](https://dusty-nv.github.io/NanoLLM) library uses the optimized M

 <a href="benchmarks.html"><iframe width="600" height="371" seamless frameborder="0" scrolling="no" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vTJ9lFqOIZSfrdnS_0sa2WahzLbpbAbBCTlS049jpOchMCum1hIk-wE_lcNAmLkrZd0OQrI9IkKBfGp/pubchart?oid=2126319913&amp;format=interactive"></iframe></a>

-```python
+```python title="<a href='https://dusty-nv.github.io/NanoLLM' target='_blank'>NanoLLM Reference Documentation</a>"
 from nano_llm import NanoLLM, ChatHistory, ChatTemplates

 # load model
 model = NanoLLM.from_pretrained(
-    model='meta-llama/Llama-2-7b-chat-hf',
+    model='meta-llama/Meta-Llama-3-8B-Instruct',
     quantization='q4f16_ft',
     api='mlc'
 )
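The collapsed region below holds the multi-turn chat loop (the `while True:` in the next hunk header). A rough sketch of that pattern with the NanoLLM chat API follows; verify the parameter names against the NanoLLM reference documentation:

```python
# sketch of a terminal chat loop with NanoLLM's ChatHistory (check names against the docs)
chat_history = ChatHistory(model, system_prompt='You are a helpful and friendly AI assistant.')

while True:
    # read the user's prompt from the terminal
    prompt = input('>> ').strip()

    # add the user turn and embed the chat so far
    chat_history.append(role='user', msg=prompt)
    embedding, position = chat_history.embed_chat()

    # generate the reply, reusing the chat's KV cache across turns
    reply = model.generate(
        embedding,
        streaming=True,
        kv_cache=chat_history.kv_cache,
        stop_tokens=chat_history.template.stop,
        max_new_tokens=256,
    )

    # stream the output into the chat history as it's generated
    bot_reply = chat_history.append(role='bot', text='')

    for token in reply:
        bot_reply.text += token
        print(token, end='', flush=True)

    print('')
    chat_history.kv_cache = reply.kv_cache  # carry the cache into the next turn
```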
@@ -127,10 +134,12 @@ while True:
 This [example](https://github.com/dusty-nv/NanoLLM/blob/main/nano_llm/chat/example.py){:target="_blank"} keeps an interactive chat running with text being entered from the terminal. You can start it like this:

 ```bash
-jetson-containers run $(autotag nano_llm) \
+jetson-containers run \
+  --env HUGGINGFACE_TOKEN=hf_abc123def \
+  $(autotag nano_llm) \
   python3 -m nano_llm.chat.example
 ```

-Or for easy editing from the host device, copy the source into your own script and mount it into the container with the `--volume` flag.
+Or for easy editing from the host device, copy the source into your own script and mount it into the container with the `--volume` flag. And for authenticated models, request access through HuggingFace (like with [Llama](https://huggingface.co/meta-llama){:target="_blank"}) and substitute your account's API token above.
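For instance, a mounted-script invocation might look like the following, where `my_chat.py` is a hypothetical copy of the example on the host:

```bash
# run your own edited copy of the chat example, mounted from the host (hypothetical path)
jetson-containers run \
  --env HUGGINGFACE_TOKEN=hf_abc123def \
  --volume $(pwd)/my_chat.py:/opt/my_chat.py \
  $(autotag nano_llm) \
  python3 /opt/my_chat.py
```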


46 changes: 34 additions & 12 deletions docs/tutorial_nano-llm.md
@@ -1,19 +1,19 @@
 # NanoLLM - Optimized LLM Inference

-[`NanoLLM`](https://dusty-nv.github.io/NanoLLM) is a lightweight, high-performance library using optimized inferencing APIs for quantized LLMs, multimodality, speech services, vector databases with RAG, and web frontends. It's used to build many of the responsive, low-latency agents featured on this site.
+[`NanoLLM`](https://dusty-nv.github.io/NanoLLM){:target="_blank"} is a lightweight, high-performance library using optimized inferencing APIs for quantized LLMs, multimodality, speech services, vector databases with RAG, and web frontends. It's used to build many of the responsive, low-latency agents featured on this site.

 <a href="https://dusty-nv.github.io/NanoLLM" target="_blank"><img src="./images/nano_llm_docs.jpg" style="max-width: 50%; box-shadow: 2px 2px 4px rgba(0, 0, 0, 0.4);"></img></a>

 It provides <a href="tutorial_api-examples.html#nanollm" target="_blank">similar APIs</a> to HuggingFace, backed by highly-optimized inference libraries and quantization tools:

-```python
+```python title="<a href='https://dusty-nv.github.io/NanoLLM' target='_blank'>NanoLLM Reference Documentation</a>"
 from nano_llm import NanoLLM

 model = NanoLLM.from_pretrained(
-    "meta-llama/Llama-2-7b-hf",    # HuggingFace repo/model name, or path to HF model checkpoint
-    api='mlc',                     # supported APIs are: mlc, awq, hf
-    api_token='hf_abc123def',      # HuggingFace API key for authenticated models ($HUGGINGFACE_TOKEN)
-    quantization='q4f16_ft'        # q4f16_ft, q4f16_1, q8f16_0 for MLC, or path to AWQ weights
+    "meta-llama/Meta-Llama-3-8B-Instruct",  # HuggingFace repo/model name, or path to HF model checkpoint
+    api='mlc',                              # supported APIs are: mlc, awq, hf
+    api_token='hf_abc123def',               # HuggingFace API key for authenticated models ($HUGGINGFACE_TOKEN)
+    quantization='q4f16_ft'                 # q4f16_ft, q4f16_1, q8f16_0 for MLC, or path to AWQ weights
 )

 response = model.generate("Once upon a time,", max_new_tokens=128)
@@ -22,22 +22,44 @@ for token in response:
     print(token, end='', flush=True)
 ```

+## Containers
+
+To test a chat session with Llama from the command-line, install [`jetson-containers`](https://github.com/dusty-nv/jetson-containers){:target="_blank"} and run NanoLLM like this:
+
+```bash
+git clone https://github.com/dusty-nv/jetson-containers
+bash jetson-containers/install.sh
+```
+```bash
+jetson-containers run \
+  --env HUGGINGFACE_TOKEN=hf_abc123def \
+  $(autotag nano_llm) \
+  python3 -m nano_llm.chat --api mlc \
+    --model meta-llama/Meta-Llama-3-8B-Instruct \
+    --prompt "Can you tell me a joke about llamas?"
+```
+
+If you haven't already, request access to the [Llama models](https://huggingface.co/meta-llama){:target="_blank"} on HuggingFace and substitute your account's API token above.
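Since NanoLLM also reads the token from `$HUGGINGFACE_TOKEN` (see the `api_token` comment above), one sketch of forwarding it from the host environment rather than pasting it inline:

```bash
# export once on the host, then forward the variable into the container
export HUGGINGFACE_TOKEN=hf_abc123def   # token from https://huggingface.co/settings/tokens
jetson-containers run \
  --env HUGGINGFACE_TOKEN \
  $(autotag nano_llm) \
  python3 -m nano_llm.chat --api mlc \
    --model meta-llama/Meta-Llama-3-8B-Instruct
```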

 ## Resources

 Here's an index of the various tutorials & examples using NanoLLM on Jetson AI Lab:

 | | |
 | :---------- | :----------------------------------- |
-| **[Benchmarks](./benchmarks.md){:target="_blank"}** | Benchmarking results for LLM, SLM, VLM using MLC/TVM backend |
-| **[API Examples](./tutorial_api-examples.md#nanollm){:target="_blank"}** | Python code examples for completion and multi-turn chat |
-| **[Llamaspeak](./tutorial_llamaspeak.md){:target="_blank"}** | Talk verbally with LLMs using low-latency ASR/TTS speech models |
+| **[Benchmarks](./benchmarks.md){:target="_blank"}** | Benchmarking results for LLM, SLM, VLM using MLC/TVM backend. |
+| **[API Examples](./tutorial_api-examples.md#nanollm){:target="_blank"}** | Python code examples for chat, completion, and multimodal. |
+| **[Documentation](https://dusty-nv.github.io/NanoLLM){:target="_blank"}** | Reference documentation for the NanoLLM model and agent APIs. |
+| **[Llamaspeak](./tutorial_llamaspeak.md){:target="_blank"}** | Talk verbally with LLMs using low-latency ASR/TTS speech models. |
 | **[Small LLM (SLM)](./tutorial_slm.md){:target="_blank"}** | Focus on language models with reduced footprint (7B params and below) |
-| **[Live LLaVA](./tutorial_live-llava.md){:target="_blank"}** | Realtime live-streaming vision/language models on recurring prompts |
-| **[Nano VLM](./tutorial_nano-vlm.md){:target="_blank"}** | Efficient multimodal pipeline with one-shot RAG support |
+| **[Live LLaVA](./tutorial_live-llava.md){:target="_blank"}** | Realtime live-streaming vision/language models on recurring prompts. |
+| **[Nano VLM](./tutorial_nano-vlm.md){:target="_blank"}** | Efficient multimodal pipeline with one-shot image tagging and RAG support. |


 <div><iframe width="500" height="280" src="https://www.youtube.com/embed/UOjqF3YCGkY" style="display: inline-block;" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
 <iframe width="500" height="280" src="https://www.youtube.com/embed/8Eu6zG0eEGY" style="display: inline-block;" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
 </div>

+<div><iframe width="500" height="280" src="https://www.youtube.com/embed/hswNSZTvEFE" style="display: inline-block;" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
+<iframe width="500" height="280" src="https://www.youtube.com/embed/wZq7ynbgRoE" style="display: inline-block;" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
+</div>

2 changes: 1 addition & 1 deletion mkdocs.yml
@@ -82,7 +82,7 @@ nav:
   - Text (LLM):
     - text-generation-webui: tutorial_text-generation.md
     - llamaspeak: tutorial_llamaspeak.md
-    - NanoLLM: tutorial_nano-llm.md
+    - NanoLLM 🆕: tutorial_nano-llm.md
     - Small LLM (SLM): tutorial_slm.md
     - API Examples: tutorial_api-examples.md
   - Text + Vision (VLM):
