diff --git a/docs/images/nano_llm_docs.jpg b/docs/images/nano_llm_docs.jpg
new file mode 100644
index 00000000..d024381b
Binary files /dev/null and b/docs/images/nano_llm_docs.jpg differ
diff --git a/docs/images/nano_llm_docs_chat.jpg b/docs/images/nano_llm_docs_chat.jpg
new file mode 100644
index 00000000..634e3618
Binary files /dev/null and b/docs/images/nano_llm_docs_chat.jpg differ
diff --git a/docs/tutorial_api-examples.md b/docs/tutorial_api-examples.md
index 1ae89634..eff789f1 100644
--- a/docs/tutorial_api-examples.md
+++ b/docs/tutorial_api-examples.md
@@ -76,9 +76,9 @@ The [`huggingface-benchmark.py`](https://github.com/dusty-nv/jetson-containers/b
 
 ## NanoLLM
 
-The [`NanoLLM`](https://dusty-nv.github.io/) library uses the optimized MLC/TVM library for inference, like on the [Benchmarks](benchmarks.md) page:
+The [`NanoLLM`](https://dusty-nv.github.io/NanoLLM) library uses the optimized MLC/TVM library for inference, like on the [Benchmarks](benchmarks.md) page:
 
-
+
 
 ```python
 from nano_llm import NanoLLM, ChatHistory, ChatTemplates
diff --git a/docs/tutorial_nano-llm.md b/docs/tutorial_nano-llm.md
new file mode 100644
index 00000000..9fffa5ea
--- /dev/null
+++ b/docs/tutorial_nano-llm.md
@@ -0,0 +1,43 @@
+# NanoLLM - Optimized LLM Inference
+
+[`NanoLLM`](https://dusty-nv.github.io/NanoLLM) is a lightweight, high-performance library using optimized inferencing APIs for quantized LLMs, multimodality, speech services, vector databases with RAG, and web frontends. It's used to build many of the responsive, low-latency agents featured on this site.
+
+
+
+It provides similar APIs to HuggingFace, backed by highly-optimized inference libraries and quantization tools:
+
+```python
+from nano_llm import NanoLLM
+
+model = NanoLLM.from_pretrained(
+    "meta-llama/Llama-2-7b-hf",  # HuggingFace repo/model name, or path to HF model checkpoint
+    api='mlc',                   # supported APIs are: mlc, awq, hf
+    api_token='hf_abc123def',    # HuggingFace API key for authenticated models ($HUGGINGFACE_TOKEN)
+    quantization='q4f16_ft'      # q4f16_ft, q4f16_1, q8f16_0 for MLC, or path to AWQ weights
+)
+
+response = model.generate("Once upon a time,", max_new_tokens=128)
+
+for token in response:
+    print(token, end='', flush=True)
+```
+
+## Resources
+
+Here's an index of the various tutorials & examples using NanoLLM on Jetson AI Lab:
+
+| | |
+| :---------- | :----------------------------------- |
+| **[Benchmarks](./benchmarks.md){:target="_blank"}** | Benchmarking results for LLM, SLM, VLM using MLC/TVM backend |
+| **[API Examples](./tutorial_api-examples.md#nanollm){:target="_blank"}** | Python code examples for completion and multi-turn chat |
+| **[Llamaspeak](./tutorial_llamaspeak.md){:target="_blank"}** | Talk verbally with LLMs using low-latency ASR/TTS speech models |
+| **[Small LLM (SLM)](./tutorial_slm.md){:target="_blank"}** | Focus on language models with reduced footprint (7B params and below) |
+| **[Live LLaVA](./tutorial_live-llava.md){:target="_blank"}** | Realtime live-streaming vision/language models on recurring prompts |
+| **[Nano VLM](./tutorial_nano-vlm.md){:target="_blank"}** | Efficient multimodal pipeline with one-shot RAG support |
+
+
+
+
+
+
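The new tutorial page stops at single-shot completion; the `ChatHistory` class imported in `tutorial_api-examples.md` above supports multi-turn chat as well. A minimal chat-loop sketch along the lines of the API Examples page, assuming its `embed_chat()`/`kv_cache` interface (the chat-tuned model name and generation parameters here are illustrative):

```python
from nano_llm import NanoLLM, ChatHistory

# load the chat-tuned variant with the same MLC backend and quantization as above
model = NanoLLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    api='mlc',
    quantization='q4f16_ft'
)

# ChatHistory tracks the turns and the KV cache between requests
chat_history = ChatHistory(model, system_prompt="You are a helpful and friendly AI assistant.")

while True:
    # read the next user query from the terminal
    prompt = input('>> ').strip()

    # add the user turn and embed the chat so far
    chat_history.append(role='user', msg=prompt)
    embedding, position = chat_history.embed_chat()

    # stream the reply, continuing from the cached context
    reply = model.generate(
        embedding,
        streaming=True,
        kv_cache=chat_history.kv_cache,
        stop_tokens=chat_history.template.stop,
        max_new_tokens=256,
    )

    # accumulate the streamed tokens into the bot's turn
    bot_reply = chat_history.append(role='bot', text='')

    for token in reply:
        bot_reply.text += token
        print(token, end='', flush=True)

    print('')

    # carry the KV cache forward so the next turn only prefills the new tokens
    chat_history.kv_cache = reply.kv_cache
```

Carrying `kv_cache` between turns is what keeps per-turn latency low on Jetson, since only the newly-appended tokens need to be prefilled.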
diff --git a/docs/tutorial_nano-vlm.md b/docs/tutorial_nano-vlm.md
index 5e5f6ebc..37df93ad 100644
--- a/docs/tutorial_nano-vlm.md
+++ b/docs/tutorial_nano-vlm.md
@@ -47,7 +47,7 @@ The optimized [`NanoLLM`](https://dusty-nv.github.io/NanoLLM) library uses MLC/T
 
 ``` bash
 jetson-containers run $(autotag nano_llm) \
-  python3 -m nano_llm --api=mlc \
+  python3 -m nano_llm.chat --api=mlc \
     --model liuhaotian/llava-v1.6-vicuna-7b \
     --max-context-len 768 \
     --max-new-tokens 128
@@ -63,7 +63,7 @@ During testing, you can specify prompts on the command-line that will run sequen
 
 ```
 jetson-containers run $(autotag nano_llm) \
-  python3 -m nano_llm --api=mlc \
+  python3 -m nano_llm.chat --api=mlc \
     --model liuhaotian/llava-v1.6-vicuna-7b \
     --max-context-len 768 \
     --max-new-tokens 128 \
@@ -91,7 +91,7 @@ When prompted, these models can also output in constrained JSON formats (which t
 
 ```
 jetson-containers run $(autotag nano_llm) \
-  python3 -m nano_llm --api=mlc \
+  python3 -m nano_llm.chat --api=mlc \
     --model liuhaotian/llava-v1.5-13b \
     --prompt '/data/images/hoover.jpg' \
     --prompt 'extract any text from the image as json'
diff --git a/mkdocs.yml b/mkdocs.yml
index b3860331..a547de29 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -1,4 +1,4 @@
-site_name: NVIDIA Jetson Generative AI Lab
+site_name: NVIDIA Jetson AI Lab
 site_url:
 site_description: Showcasing generative AI projects that run on Jetson
 copyright:
@@ -82,6 +82,7 @@ nav:
   - Text (LLM):
     - text-generation-webui: tutorial_text-generation.md
     - llamaspeak: tutorial_llamaspeak.md
+    - NanoLLM: tutorial_nano-llm.md
     - Small LLM (SLM): tutorial_slm.md
     - API Examples: tutorial_api-examples.md
   - Text + Vision (VLM):
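The `nano_llm.chat` invocations switched to in `tutorial_nano-vlm.md` above can also be reproduced from Python. A rough multimodal sketch mirroring the JSON-extraction example, assuming `from_pretrained()` accepts a `max_context_len` keyword matching the `--max-context-len` flag and that `ChatHistory.append()` takes an `image` argument as in the vision examples (both are assumptions about the NanoLLM API, not confirmed by this diff):

```python
from nano_llm import NanoLLM, ChatHistory

# load the VLM with the MLC backend, as in the nano_llm.chat commands above
# (max_context_len is assumed to mirror the --max-context-len flag)
model = NanoLLM.from_pretrained(
    "liuhaotian/llava-v1.6-vicuna-7b",
    api='mlc',
    max_context_len=768,
)

chat_history = ChatHistory(model)

# append the image as its own turn, then the question about it
# (the image= keyword is an assumption, following the Live LLaVA examples)
chat_history.append(role='user', image='/data/images/hoover.jpg')
chat_history.append(role='user', msg='extract any text from the image as json')
embedding, position = chat_history.embed_chat()

# generate and stream the JSON reply
reply = model.generate(
    embedding,
    streaming=True,
    kv_cache=chat_history.kv_cache,
    max_new_tokens=128,
)

for token in reply:
    print(token, end='', flush=True)
```

As with the CLI, ordering the image before the text prompt places the vision-encoder embeddings into the context ahead of the question.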