diff --git a/docs/tutorial_api-examples.md b/docs/tutorial_api-examples.md new file mode 100644 index 00000000..0c7b49b7 --- /dev/null +++ b/docs/tutorial_api-examples.md @@ -0,0 +1,137 @@ +# Tutorial - API Examples + +It helps to know the code for generating text with LLM inference, along with ancillary tasks like tokenization, chat templates, and prompting. This page gives Python examples of running various LLM APIs, along with their benchmarks. + +!!! abstract "What you need" + + 1. One of the following Jetson devices: + + Jetson AGX Orin (64GB) + Jetson AGX Orin (32GB) + Jetson Orin NX (16GB) + Jetson Orin Nano (8GB)⚠️ + + 2. Running one of the following versions of [JetPack](https://developer.nvidia.com/embedded/jetpack): + + JetPack 5 (L4T r35.x) + JetPack 6 (L4T r36.x) + + 3. Sufficient storage space (preferably with NVMe SSD). + + - `22GB` for `l4t-text-generation` container image + - Space for models (`>10GB`) + +## Transformers + +The HuggingFace Transformers API is the de facto API that models are released for, and it often serves as the reference implementation. It is not the fastest option, but it has broad model support and also supports quantization (AutoGPTQ, AWQ). 
The example below streams the generated text as it is produced: + +```python +from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer +from threading import Thread + +model_name='meta-llama/Llama-2-7b-chat-hf' +model = AutoModelForCausalLM.from_pretrained(model_name, device_map='cuda') + +tokenizer = AutoTokenizer.from_pretrained(model_name) +streamer = TextIteratorStreamer(tokenizer) + +prompt = [{'role': 'user', 'content': 'Can I get a recipe for French Onion soup?'}] +inputs = tokenizer.apply_chat_template( + prompt, + add_generation_prompt=True, + return_tensors='pt' +).to(model.device) + +Thread(target=lambda: model.generate(inputs, max_new_tokens=256, streamer=streamer)).start() + +for text in streamer: + print(text, end='', flush=True) +``` + +To run this script (available [here](https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/transformers/test.py){:target="_blank"}), you can mount a directory containing the script or your jetson-containers directory: + +```bash +./run.sh --volume $PWD/packages/llm:/mount --workdir /mount \ + $(./autotag l4t-text-generation) \ + python3 transformers/test.py +``` + +We use the `l4t-text-generation` container because it includes the quantization libraries in addition to Transformers, for running quantized versions of the models such as `TheBloke/Llama-2-7B-Chat-GPTQ`. + +### Benchmarks + +The [`huggingface-benchmark.py`](https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/transformers/huggingface-benchmark.py){:target="_blank"} script will benchmark the models: + +```bash +./run.sh --volume $PWD/packages/llm/transformers:/mount --workdir /mount \ + $(./autotag l4t-text-generation) \ + python3 huggingface-benchmark.py --model meta-llama/Llama-2-7b-chat-hf +``` + +``` +* meta-llama/Llama-2-7b-chat-hf AVG = 20.7077 seconds, 6.2 tokens/sec memory=10173.45 MB +* TheBloke/Llama-2-7B-Chat-GPTQ AVG = 12.3922 seconds, 10.3 tokens/sec memory=7023.36 MB +* TheBloke/Llama-2-7B-Chat-AWQ AVG = 11.4667 seconds, 11.2 
tokens/sec memory=4662.34 MB +``` + +## local_llm + +The `local_llm` container uses the optimized MLC/TVM library for inference, as on the [Benchmarks](benchmarks.md) page: + + + +```python +from local_llm import LocalLM, ChatHistory, ChatTemplates +from termcolor import cprint + +# load model +model = LocalLM.from_pretrained( + model='meta-llama/Llama-2-7b-chat-hf', + quant='q4f16_ft', + api='mlc' +) + +# create the chat history +chat_history = ChatHistory(model, system_prompt="You are a helpful and friendly AI assistant.") + +while True: + # enter the user query from terminal + print('>> ', end='', flush=True) + prompt = input().strip() + + # add user prompt and generate chat tokens/embeddings + chat_history.append(role='user', msg=prompt) + embedding, position = chat_history.embed_chat() + + # generate bot reply + reply = model.generate( + embedding, + streaming=True, + kv_cache=chat_history.kv_cache, + stop_tokens=chat_history.template.stop, + max_new_tokens=256, + ) + + # append the output stream to the chat history + bot_reply = chat_history.append(role='bot', text='') + + for token in reply: + bot_reply.text += token + cprint(token, color='blue', end='', flush=True) + + print('\n') + + # save the inter-request KV cache + chat_history.kv_cache = reply.kv_cache +``` + +This [example](https://github.com/dusty-nv/jetson-containers/blob/master/packages/llm/local_llm/chat/example.py){:target="_blank"} keeps an interactive chat running with text entered from the terminal. You can start it like this: + +```bash +./run.sh $(./autotag local_llm) \ + python3 -m local_llm.chat.example +``` + +Or for easy editing from the host device, copy the source into your own script and mount it into the container with the `--volume` flag. 
+ + diff --git a/mkdocs.yml b/mkdocs.yml index b9a57676..45fe0d79 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -81,6 +81,7 @@ nav: - text-generation-webui: tutorial_text-generation.md - llamaspeak: tutorial_llamaspeak.md - Small LLM (SLM) 🆕: tutorial_slm.md + - API Examples 🆕: tutorial_api-examples.md - Text + Vision (VLM): - Mini-GPT4: tutorial_minigpt4.md - LLaVA: tutorial_llava.md @@ -103,7 +104,7 @@ nav: - AudioCraft: tutorial_audiocraft.md - Whisper: tutorial_whisper.md - Metropolis Microservices: - - First steps: tutorial_mmj.md + - First steps 🆕: tutorial_mmj.md # - Tools: # - LangChain: tutorial_distillation.md - Tips: