From 461d61d6bbd2fdcc35f03f99df46666c3cf0eacb Mon Sep 17 00:00:00 2001 From: cccclai Date: Tue, 29 Oct 2024 17:30:26 -0400 Subject: [PATCH] Add readme for other backends Differential Revision: D64997867 Pull Request resolved: https://github.com/pytorch/executorch/pull/6556 --- examples/models/llama/README.md | 7 ++++++- examples/models/llama/UTILS.md | 1 + examples/models/llama/non_cpu_backends.md | 24 +++++++++++++++++++++++ 3 files changed, 31 insertions(+), 1 deletion(-) create mode 100644 examples/models/llama/non_cpu_backends.md diff --git a/examples/models/llama/README.md b/examples/models/llama/README.md index 1ae6796b57..6fc66f6506 100644 --- a/examples/models/llama/README.md +++ b/examples/models/llama/README.md @@ -136,6 +136,8 @@ Llama 3 8B performance was measured on the Samsung Galaxy S22, S24, and OnePlus

+[Please visit this section to try it on non-CPU backends, including CoreML, MPS, Qualcomm HTP, or MediaTek](non_cpu_backends.md).
+
 # Instructions

 ## Tested on
@@ -242,6 +244,9 @@ You can export and run the original Llama 3 8B instruct model.

 Due to the larger vocabulary size of Llama 3, we recommend quantizing the embeddings with `--embedding-quantize 4,32` as shown above to further reduce the model size.

+
+ If you're interested in deploying on non-CPU backends, [please refer to the non-CPU backend section](non_cpu_backends.md).
+
 ## Step 3: Run on your computer to validate

 1. Build executorch with optimized CPU performance as follows. Build options available [here](https://github.com/pytorch/executorch/blob/main/CMakeLists.txt#L59).
@@ -261,7 +266,7 @@ You can export and run the original Llama 3 8B instruct model.
     cmake --build cmake-out -j16 --target install --config Release
     ```

-Note for Mac users: There's a known linking issue with Xcode 15.1. Refer to the session of Common Issues and Mitigations below for solutions.
+Note for Mac users: There's a known linking issue with Xcode 15.1. Refer to the section of Common Issues and Mitigations below for solutions.

 2. Build llama runner.
    ```
diff --git a/examples/models/llama/UTILS.md b/examples/models/llama/UTILS.md
index c2ae26e483..27a7a5832d 100644
--- a/examples/models/llama/UTILS.md
+++ b/examples/models/llama/UTILS.md
@@ -37,6 +37,7 @@ For CoreML, there are 2 additional optional arguments:
 * `--coreml-ios`: Specify the minimum iOS version to deploy (and turn on available optimizations). E.g. `--coreml-ios 18` will turn on [in-place KV cache](https://developer.apple.com/documentation/coreml/mlstate?language=objc) and [fused scaled dot product attention kernel](https://apple.github.io/coremltools/source/coremltools.converters.mil.mil.ops.defs.html#coremltools.converters.mil.mil.ops.defs.iOS18.transformers.scaled_dot_product_attention) (the resulting model will then need at least iOS 18 to run, though)
 * `--coreml-quantize`: Use [quantization tailored for CoreML](https://apple.github.io/coremltools/docs-guides/source/opt-quantization-overview.html). E.g. `--coreml-quantize b4w` will perform per-block 4-bit weight-only quantization in a way tailored for CoreML

+To deploy the large 8B model on the above backends, [please visit this section](non_cpu_backends.md).

 ## Download models from Hugging Face and convert from safetensor format to state dict
diff --git a/examples/models/llama/non_cpu_backends.md b/examples/models/llama/non_cpu_backends.md
new file mode 100644
index 0000000000..1ee594ebd8
--- /dev/null
+++ b/examples/models/llama/non_cpu_backends.md
@@ -0,0 +1,24 @@
+
+# Running Llama 3/3.1 8B on non-CPU backends
+
+### QNN
+Please follow [the instructions](https://pytorch.org/executorch/stable/llm/build-run-llama3-qualcomm-ai-engine-direct-backend.html) to deploy Llama 3 8B to an Android smartphone with Qualcomm SoCs.
+
+### MPS
+Export:
+```
+python -m examples.models.llama2.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --mps --use_sdpa_with_kv_cache -d fp32 -qmode 8da4w -G 32 --embedding-quantize 4,32
+```
+
+After exporting the MPS model .pte file, the [iOS LLAMA](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) app can run the model. `--embedding-quantize 4,32` is an optional argument that quantizes the embeddings to reduce the model size.
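+
+To get a feel for how much `--embedding-quantize 4,32` saves on disk, one option is to export the model twice and compare the resulting `.pte` sizes. This is only a sketch: the `--output_name` values below are illustrative, and the flag is assumed to be available in your version of `export_llama`.
+```
+# Export with embedding quantization (smaller .pte).
+python -m examples.models.llama2.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --mps --use_sdpa_with_kv_cache -d fp32 -qmode 8da4w -G 32 --embedding-quantize 4,32 --output_name llama3_mps_qe_4_32.pte
+# Export without embedding quantization for comparison.
+python -m examples.models.llama2.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --mps --use_sdpa_with_kv_cache -d fp32 -qmode 8da4w -G 32 --output_name llama3_mps.pte
+# Compare file sizes on disk.
+ls -lh llama3_mps_qe_4_32.pte llama3_mps.pte
+```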
+
+### CoreML
+Export:
+```
+python -m examples.models.llama2.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --coreml --coreml-ios 18 --coreml-quantize b4w
+```
+
+After exporting the CoreML model .pte file, please [follow the instructions to build the llama runner](https://github.com/pytorch/executorch/tree/main/examples/models/llama#step-3-run-on-your-computer-to-validate) with the CoreML flags enabled; an example runner invocation is sketched at the end of this page.
+
+### MTK
+Please [follow the instructions](https://github.com/pytorch/executorch/tree/main/examples/mediatek#llama-example-instructions) to deploy Llama 3 8B to an Android phone with a MediaTek chipset.
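+
+Once the llama runner has been built with the corresponding backend enabled (e.g., CoreML or MPS), the exported `.pte` file can be run from the command line. The sketch below is an assumption: the binary location under `cmake-out` and the model/tokenizer file names depend on your ExecuTorch version, build configuration, and export step, so adjust them to your setup.
+```
+# Hypothetical paths -- substitute your build output, exported .pte, and Llama 3 tokenizer.
+cmake-out/examples/models/llama/llama_main \
+    --model_path=llama3_coreml.pte \
+    --tokenizer_path=tokenizer.model \
+    --prompt="Once upon a time"
+```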