From 461d61d6bbd2fdcc35f03f99df46666c3cf0eacb Mon Sep 17 00:00:00 2001 From: cccclai Date: Tue, 29 Oct 2024 17:30:26 -0400 Subject: [PATCH] Add readme for other backends Differential Revision: D64997867 Pull Request resolved: https://github.com/pytorch/executorch/pull/6556 --- examples/models/llama/README.md | 7 ++++++- examples/models/llama/UTILS.md | 1 + examples/models/llama/non_cpu_backends.md | 24 +++++++++++++++++++++++ 3 files changed, 31 insertions(+), 1 deletion(-) create mode 100644 examples/models/llama/non_cpu_backends.md diff --git a/examples/models/llama/README.md b/examples/models/llama/README.md index 1ae6796b57..6fc66f6506 100644 --- a/examples/models/llama/README.md +++ b/examples/models/llama/README.md @@ -136,6 +136,8 @@ Llama 3 8B performance was measured on the Samsung Galaxy S22, S24, and OnePlus

+[Please visit this section to try it on non-CPU backends, including CoreML, MPS, Qualcomm HTP, or MediaTek](non_cpu_backends.md).
+
 # Instructions

 ## Tested on
@@ -242,6 +244,9 @@ You can export and run the original Llama 3 8B instruct model.

 Due to the larger vocabulary size of Llama 3, we recommend quantizing the embeddings with `--embedding-quantize 4,32` as shown above to further reduce the model size.

+
+ If you're interested in deploying on non-CPU backends, [please refer to the non-CPU backend section](non_cpu_backends.md).
+
 ## Step 3: Run on your computer to validate

 1. Build executorch with optimized CPU performance as follows. Build options available [here](https://github.com/pytorch/executorch/blob/main/CMakeLists.txt#L59).
@@ -261,7 +266,7 @@ You can export and run the original Llama 3 8B instruct model.
     cmake --build cmake-out -j16 --target install --config Release
     ```

-Note for Mac users: There's a known linking issue with Xcode 15.1. Refer to the session of Common Issues and Mitigations below for solutions.
+Note for Mac users: There's a known linking issue with Xcode 15.1. Refer to the section of Common Issues and Mitigations below for solutions.

 2. Build llama runner.
    ```
diff --git a/examples/models/llama/UTILS.md b/examples/models/llama/UTILS.md
index c2ae26e483..27a7a5832d 100644
--- a/examples/models/llama/UTILS.md
+++ b/examples/models/llama/UTILS.md
@@ -37,6 +37,7 @@ For CoreML, there are 2 additional optional arguments:
 * `--coreml-ios`: Specify the minimum iOS version to deploy (and turn on available optimizations). E.g. `--coreml-ios 18` will turn on [in-place KV cache](https://developer.apple.com/documentation/coreml/mlstate?language=objc) and [fused scaled dot product attention kernel](https://apple.github.io/coremltools/source/coremltools.converters.mil.mil.ops.defs.html#coremltools.converters.mil.mil.ops.defs.iOS18.transformers.scaled_dot_product_attention) (the resulting model will then need at least iOS 18 to run, though)
 * `--coreml-quantize`: Use [quantization tailored for CoreML](https://apple.github.io/coremltools/docs-guides/source/opt-quantization-overview.html). E.g. `--coreml-quantize b4w` will perform per-block 4-bit weight-only quantization in a way tailored for CoreML

+To deploy the large 8B model on the above backends, [please visit this section](non_cpu_backends.md).

 ## Download models from Hugging Face and convert from safetensor format to state dict
diff --git a/examples/models/llama/non_cpu_backends.md b/examples/models/llama/non_cpu_backends.md
new file mode 100644
index 0000000000..1ee594ebd8
--- /dev/null
+++ b/examples/models/llama/non_cpu_backends.md
@@ -0,0 +1,24 @@
+
+# Running Llama 3/3.1 8B on non-CPU backends
+
+### QNN
+Please follow [the instructions](https://pytorch.org/executorch/stable/llm/build-run-llama3-qualcomm-ai-engine-direct-backend.html) to deploy Llama 3 8B to an Android smartphone with Qualcomm SoCs.
+
+### MPS
+Export:
+```
+python -m examples.models.llama2.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --mps --use_sdpa_with_kv_cache -d fp32 -qmode 8da4w -G 32 --embedding-quantize 4,32
+```
+
+After exporting the MPS model .pte file, the [iOS LLAMA](https://pytorch.org/executorch/main/llm/llama-demo-ios.html) app can run the model. `--embedding-quantize 4,32` is an optional argument that quantizes the embeddings to reduce the model size.
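+
+To get a feel for how much `--embedding-quantize 4,32` saves on disk, one option is to export the model twice and compare the resulting `.pte` sizes. This is only a sketch: the `--output_name` values below are illustrative, and the flag is assumed to be available in your version of `export_llama`.
+```
+# Export with embedding quantization (smaller .pte).
+python -m examples.models.llama2.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --mps --use_sdpa_with_kv_cache -d fp32 -qmode 8da4w -G 32 --embedding-quantize 4,32 --output_name llama3_mps_qe_4_32.pte
+# Export without embedding quantization for comparison.
+python -m examples.models.llama2.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --mps --use_sdpa_with_kv_cache -d fp32 -qmode 8da4w -G 32 --output_name llama3_mps.pte
+# Compare file sizes on disk.
+ls -lh llama3_mps_qe_4_32.pte llama3_mps.pte
+```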
+
+### CoreML
+Export:
+```
+python -m examples.models.llama2.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --coreml --coreml-ios 18 --coreml-quantize b4w
+```
+
+After exporting the CoreML model .pte file, please [follow the instructions to build the llama runner](https://github.com/pytorch/executorch/tree/main/examples/models/llama#step-3-run-on-your-computer-to-validate) with the CoreML flags enabled; an example runner invocation is sketched at the end of this page.
+
+### MTK
+Please [follow the instructions](https://github.com/pytorch/executorch/tree/main/examples/mediatek#llama-example-instructions) to deploy Llama 3 8B to an Android phone with a MediaTek chipset.
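+
+Once the llama runner has been built with the corresponding backend enabled (e.g., CoreML or MPS), the exported `.pte` file can be run from the command line. The sketch below is an assumption: the binary location under `cmake-out` and the model/tokenizer file names depend on your ExecuTorch version, build configuration, and export step, so adjust them to your setup.
+```
+# Hypothetical paths -- substitute your build output, exported .pte, and Llama 3 tokenizer.
+cmake-out/examples/models/llama/llama_main \
+    --model_path=llama3_coreml.pte \
+    --tokenizer_path=tokenizer.model \
+    --prompt="Once upon a time"
+```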