# Android LLM Inference Engine Profiler
Key features:
- Unplugged testing wrapped in the app, with no adb needed, which better simulates real-world unplugged use.
- Temperature is kept below 40 °C before each test; the app automatically waits a while to cool down, to avoid severe CPU/GPU throttling.
- Charge level is kept above 50% to prevent phones from some vendors from automatically activating power-saving mode.
- Supports most phones with Android API level ≥ 30.
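The temperature and charge-level gating above is handled inside the app. For a manual sanity check of a device before a run, the same conditions can be read over adb from `dumpsys battery` (which reports temperature in tenths of a degree Celsius). This is an illustrative sketch, not part of the profiler; `check_ready` is a helper invented here:

```shell
# check_ready: read `dumpsys battery` output on stdin and verify the
# gating conditions used above: temperature < 40.0 °C and level > 50%.
# (dumpsys reports temperature in tenths of a degree Celsius.)
check_ready() {
  awk -F': ' '
    /temperature/ { temp  = $2 }
    /^ *level/    { level = $2 }
    END { exit !(temp + 0 < 400 && level + 0 > 50) }
  '
}

# With a device attached (not needed for normal in-app testing):
# adb shell dumpsys battery | check_ready && echo "cool and charged enough"
```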
Currently Supported Engines:
- MNN (Our Modified Version of MNN-3.0.4) (CPU/OpenCL)
- llama.cpp (Version b4735) (CPU)
- MediaPipe
- MLC-LLM
- ExecuTorch
- mllm
Currently Supported Metrics:
- speed (tok/s)
- capacity consumption (µAh/tok)
- energy consumption (mJ/tok)
- perplexity
- accuracy
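Capacity consumption and energy consumption are related through the battery voltage: 1 µAh of charge is 3.6 mC, and energy = charge × voltage, so mJ/tok ≈ (µAh/tok) × voltage [V] × 3.6. A quick sketch of the conversion; both the 25 µAh/tok figure and the 3.85 V nominal battery voltage are made-up example values, not measurements:

```shell
# Convert capacity consumption (uAh/tok) to energy consumption (mJ/tok):
#   1 uAh = 3.6 mC, energy = charge * voltage
#   => mJ/tok = (uAh/tok) * volts * 3.6
# Both numbers below are illustrative, not measurements.
uah_per_tok=25.0
volts=3.85
awk -v c="$uah_per_tok" -v v="$volts" \
    'BEGIN { printf "%.1f mJ/tok\n", c * v * 3.6 }'   # 346.5 mJ/tok
```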
Currently Supported Models:
- Qwen Series (text-generation)
- Llama Series (text-generation)
- Gemma Series
- Phi-2
Currently Supported Test Modes:
- dataset subset testing from json/jsonl/parquet files (subsets, because full-dataset testing on a phone is costly in time and energy)
- designated/fixed-length input testing
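For the dataset-subset mode, a subset can be as simple as the first N records of a jsonl file. An illustrative sketch; the file names are placeholders, and the stand-in dataset is generated only so the snippet runs anywhere:

```shell
# Create a stand-in jsonl dataset, then keep only the first 3 records
# for on-phone testing (full datasets are too costly in time/energy).
printf '{"id": %d}\n' 1 2 3 4 5 > sample.jsonl   # placeholder dataset
head -n 3 sample.jsonl > subset.jsonl
wc -l < subset.jsonl
```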
The Android demo is located in the `./android2` directory. You need all the submodules, so add `--recursive` to your `git clone`:
```shell
git clone --recursive https://github.com/huangzhengxiang/LLM-Profiler.git
```
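If the repository was already cloned without `--recursive`, the submodules (MNN-Habst, llama.cpp, ...) can still be fetched afterwards with `git submodule update --init --recursive` from inside the checkout. The snippet below wraps that command in a throwaway repo only so it is runnable anywhere:

```shell
# Fetch submodules after a non-recursive clone. In a real checkout:
#   cd LLM-Profiler && git submodule update --init --recursive
# Demonstrated here in a throwaway repo so the snippet runs anywhere:
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" submodule update --init --recursive && echo "submodules initialized"
```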
- Convert your model to mnn/gguf/tflite... format (for model conversion methods, please refer to each format's repository).
- Push your model to the `/data/local/tmp/llm/model` directory:
```shell
# example
adb shell mkdir /data/local/tmp/llm
adb shell mkdir /data/local/tmp/llm/model/
adb push model/qwen2_5-1_5b-int4-mnn/ /data/local/tmp/llm/model/
```
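After pushing, it is worth confirming that the model directory actually landed on the device. A sketch; `model_present` is a tiny helper invented here, and the adb invocation requires a connected device:

```shell
# model_present NAME: succeed iff NAME appears as a line on stdin.
# Intended to be fed the device-side directory listing.
model_present() {
  grep -q "^$1\$"
}

# With a device attached:
# adb shell ls /data/local/tmp/llm/model/ \
#   | model_present qwen2_5-1_5b-int4-mnn && echo "model pushed"
```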
The release version is ready to use at `android2/app/release/app-release.apk` or in GitHub Releases. Install it on your cell phone.
After uploading the model and the APK, install the APK and open the app. Tap 加载模型 (Load Model) first, then record your voice once 模型加载完成 (model loading finished) is shown.
Several LLM inference engines are contained in this app, including MNN-Habst (ours) and llama.cpp.
MNN-Habst is up to date with the MNN master branch at commit 5bd7ffc22a54f6436e387ec2a5cfde7e207feba1 (Version 3.0.4). On top of that, heterogeneity-aware backend selection and tuning (the Habst algorithm) is added in the MNN-Habst repository, which is a submodule.
llama.cpp is added at commit 73e2ed3ce3492d3ed70193dd09ae8aa44779651d (Version b4735), also as a submodule.
Then, open the project in Android Studio and build.
Internal: `Power_Normal`, `Power_High`, `Power_MemoryBound`, `Power_SelectCore` ("normal", "high", "memory", "select").
External additional options: "exhaustive" (requires an additional list of selected core group sizes, e.g., [1, 3, 2, 2] for the 8Gen3, ordered big → small; the results are stored in a local file) and "tune_prefill" (tune the prefill stage).
- multi-turn conversation dataset (ShareGPT-en): https://huggingface.co/datasets/shareAI/ShareGPT-Chinese-English-90k (./sharegpt_jsonl/common_en_70k.jsonl) (input prefill, controlled decode)
- role play (RoleLLM): https://huggingface.co/datasets/ZenMoore/RoleBench (./rolebench-eng/role-generalization/role_specific/test.jsonl) (input prefill, controlled decode)
- math problem QA (math_qa): https://huggingface.co/datasets/allenai/math_qa (input prefill, free talk decode)
- Open Domain QA (truthful_qa): https://huggingface.co/datasets/truthfulqa/truthful_qa (input prefill, free talk decode)
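Because on-phone runs are expensive, a subset that spans the whole file is often more representative than just its first N lines. A stride sample over a jsonl dataset can be done with awk; the file below is a generated stand-in for a real dataset file such as common_en_70k.jsonl:

```shell
# Keep every 100th record of a jsonl dataset, starting at the first.
# demo.jsonl is a generated stand-in for a real dataset file.
seq 1 500 | awk '{ printf "{\"id\": %d}\n", $1 }' > demo.jsonl
awk 'NR % 100 == 1' demo.jsonl > demo_subset.jsonl
wc -l < demo_subset.jsonl   # 5 records: ids 1, 101, 201, 301, 401
```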