This repository provides methods to run Transformers in PyTorch and ONNX with operators dispatched to AIE.
The target models include all Transformer models: generative AI models, LLMs, Stable Diffusion, and the like.
- Extension of Pytorch with custom operators in C++
- Eager mode execution with quantization and an in-place op-replacement strategy (see the sketch after this list)
- Flash Attention v2 for OPT to reduce memory utilization and increase prefill phase performance
- State-of-the-art AWQ for 3-bit and 4-bit quantization
- State-of-the-art SmoothQuant to condition weights for 8-bit quantization
- Dynamic quantization of Transformer models (Generative LLMs, Stable Diffusion, etc.)
- Model analysis with observer insertion
- Layer parameter caching, checkpointing
- Perplexity, MMLU and HumanEval accuracy measurement of LLMs
- Benchmarking LLMs with state-of-the-art methods
- Pytorch -> ONNX using Optimum ORT Quantizer framework and eager execution on ONNX-EP
- Automatic selection of custom compute kernels for optimal prompt/prefill phase latency
- Speculative decoding with the HF pipeline, supported for Llama2, OPT and CodeLlama2-7b
- GGUF model support with llama.cpp framework
- Common C++ backend for Pytorch, ONNX and GGUF frameworks
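As a rough illustration of the eager-mode flow (a minimal sketch using only stock PyTorch APIs; `AieLinearStub` is a hypothetical stand-in, not this repository's operator), dynamic quantization and in-place op replacement look roughly like this:

```python
import torch

# Toy model standing in for a Transformer block's projection layers.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 64),
).eval()

# Dynamic quantization: int8 weights, activations quantized on the fly.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# In-place op replacement: walk the module tree and swap targeted layers for a
# wrapper. The real flow swaps in AIE-backed operators from the C++ extension;
# AieLinearStub here is only a placeholder.
class AieLinearStub(torch.nn.Module):
    def __init__(self, linear: torch.nn.Linear) -> None:
        super().__init__()
        self.inner = linear

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.inner(x)  # an AIE backend would offload this matmul

def replace_linears(module: torch.nn.Module) -> None:
    for name, child in module.named_children():
        if isinstance(child, torch.nn.Linear):
            setattr(module, name, AieLinearStub(child))
        else:
            replace_linears(child)

replace_linears(model)
print(qmodel)
print(model)
```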
The following models are supported on RyzenAI with the 4 quantization recipes described here (a basic model-load example follows the notes below the table).
Model Name | SmoothQuant | AWQ | AWQPlus | PerGroup | Quant Model Size (GB) |
---|---|---|---|---|---|
facebook/opt-125m | ✓ | ✓ | ✓ | ✓ | 0.07 |
facebook/opt-1.3b | ✓ | ✓ | ✓ | ✓ | 0.8 |
facebook/opt-2.7b | ✓ | ✓ | ✓ | ✓ | 1.4 |
facebook/opt-6.7b | ✓ | ✓ | ✓ | ✓ | 3.8 |
facebook/opt-13b | ✓ | ✓ | ✓ | | 7.5 |
llama-2-7b* | ✓ | ✓ | | | 3.9 |
llama-2-7b-chat* | ✓ | ✓ | ✓ | ✓ | 3.9 |
llama-2-13b* | ✓ | | | | 7.2 |
llama-2-13b-chat* | ✓ | ✓ | ✓ | | 7.2 |
Meta-Llama-3-8B-Instruct* | ✓ | ✓ | ✓ | | 4.8 |
Meta-Llama-3-8B* | ✓ | ✓ | ✓ | | 4.8 |
Meta-Llama-3.1-8B* | ✓ | | | | 4.8 |
Meta-Llama-3.2-1B-Early | ✓ | ✓ | ✓ | | 0.3 |
Meta-Llama-3.2-3B-Early | ✓ | | | | 4.8 |
bigscience/bloom-560m | ✓ | ✓ | | | 1.6 |
bigscience/bloom-1b1 | ✓ | ✓ | | | 0.65 |
bigscience/bloom-3b | ✓ | ✓ | | | 1.7 |
bigcode/starcoder | ✓ | ✓ | ✓ | | 8.0 |
code-llama-2-7b* | ✓ | ✓ | ✓ | | 3.9 |
codellama/CodeLlama-7b-hf | ✓ | ✓ | ✓ | | 3.9 |
codellama/CodeLlama-7b-instruct-hf | ✓ | ✓ | ✓ | | 3.9 |
google/gemma-2b** | ✓ | ✓ | ✓ | | 1.2 |
google/gemma-7b** | ✓ | ✓ | ✓ | | 4.0 |
THUDM/chatglm-6b | ✓ | | | | 3.3 |
THUDM/chatglm3-6b | ✓ | ✓ | ✓ | | 4.1 |
Qwen/Qwen-7b | ✓ | ✓ | ✓ | | 4.1 |
Qwen/Qwen1.5-7B | ✓ | ✓ | ✓ | | 4.1 |
Qwen/Qwen1.5-7B-Chat | ✓ | ✓ | ✓ | | tbd |
microsoft/phi-2 | ✓ | | | | tbd |
microsoft/phi-3 | ✓ | | | | tbd |
microsoft/Phi-3.5-mini-instruct | ✓ | ✓ | ✓ | | tbd |
mistralai/Mistral-7B-v0.1 | ✓ | | | | tbd |
TinyLlama-1.1B-Chat-v1.0 | ✓ | | | | tbd |
mamba-1.4b-hf** | ✓ | | | | tbd |
mamba-2.8b-hf** | ✓ | | | | tbd |
📌 Important
* Needs local weights for these models.
** Needs transformers==4.39.1 (`pip install transformers==4.39.1`) and follow the same `run_awq.py` commands.
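As a generic starting point (a plain Hugging Face load of one of the base models above, not this repository's quantization or AIE flow), a supported checkpoint can be pulled and exercised like this:

```python
# Plain Hugging Face load of a supported base model; the repository's own
# scripts apply SmoothQuant/AWQ/AWQPlus/PerGroup quantization on top of this.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

inputs = tokenizer("The capital of France is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```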
The main branch is intended for continuous development. All developers must strictly adhere to the Contribution Guidelines enumerated in the subsequent sections.
- Request the board using ChangeGear. Set up the board using the instructions provided in this link.
- Run the unit test cases to confirm that the driver installation works correctly.
On the PC, install the following dependencies
- Install Anaconda
- Install Visual Studio 2022 Community Edition
- Install Git
- Install AIE driver as described in this link
Open Anaconda Command Prompt or Anaconda PowerShell on the Windows PC and clone the Transformers repo:
git config --global core.longpaths true
git clone --recurse-submodules https://gitenterprise.xilinx.com/VitisAI/transformers.git
cd transformers
Create conda environment:
conda update -n base -c defaults conda -y
conda env create --file=env.yaml
conda activate ryzenai-transformers
build_dependencies.bat
# on PowerShell, run build_dependencies.ps1 instead
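Once the environment is active, a quick sanity check can confirm the core packages resolved (a minimal sketch, assuming env.yaml installs both PyTorch and ONNX Runtime):

```python
# Verify the conda environment provides the core frameworks used by this repo.
import torch
import onnxruntime as ort

print("torch:", torch.__version__)
print("onnxruntime:", ort.__version__)
print("ORT providers:", ort.get_available_providers())
```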
The AWQ model zoo has precomputed scales, clips and zeros for various LLMs, including OPT and Llama. Get the precomputed results:
git lfs install
cd <transformers>\ext
git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo awq_cache
copy <transformers>\models\llm\Qwen1.5-7B-Chat-w4-g128.pt <transformers>\ext\awq_cache\
copy <transformers>\models\llm\Qwen1.5-7B-w4-g128.pt <transformers>\ext\awq_cache\
copy <transformers>\models\llm\Qwen-7b-w4-g128.pt <transformers>\ext\awq_cache\
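To confirm the cache landed in the right place, you can inspect one of the copied files (a minimal sketch, assuming the `.pt` files are plain torch checkpoints of scales/clips/zeros keyed by layer name; run from the repository root):

```python
# Inspect a precomputed AWQ cache entry; the file name matches one of the
# copy commands above.
import torch

cache = torch.load(r"ext\awq_cache\Qwen1.5-7B-w4-g128.pt", map_location="cpu")
print(type(cache))
if isinstance(cache, dict):
    for key in list(cache)[:5]:
        print(key)
```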
On Command Prompt
@REM use any unused drive letter, Z: for example
subst Z: %cd%
@REM switch to the Z: drive
Z:
You can remove the virtual drive with:
On Command Prompt
subst /d Z:
On Anaconda Command Prompt
## For PHX
.\setup_phx.bat
## For STX
.\setup_stx.bat
On Anaconda PowerShell
## For PHX
.\setup_phx.ps1
## For STX
.\setup_stx.ps1
Remember to set up the target environment again if you switch to or from a virtual drive!
pip install ops\cpp --force-reinstall
pip install ops\torch_cpp --force-reinstall
For running onnxruntime apps, please refer to Vitis-AI EP Installation instructions.
To measure MMLU on LLMs, download the data, extract it, rename the folder to "mmlu_data", and place it in the <transformers>/models/llm directory.
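A quick way to confirm the data is in place (a minimal sketch, assuming the standard MMLU layout of dev/val/test subfolders of CSV files):

```python
# Check that the extracted MMLU data sits where the measurement scripts expect
# it; assumes the usual MMLU distribution layout (dev/val/test CSV folders).
from pathlib import Path

mmlu_dir = Path("models/llm/mmlu_data")
for split in ("dev", "val", "test"):
    csvs = sorted((mmlu_dir / split).glob("*.csv"))
    print(f"{split}: {len(csvs)} csv files")
```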
The following figure shows the default execution of PyTorch models on CPU.
This flow uses a PyTorch C++ extension, with the hardware acceleration implemented inside the extension. The C++ backend keeps weights stationary and handles padding, tiling, AIE kernel dispatch, and intermediate accumulation on the CPU. Dynamic quantization from PyTorch is leveraged for this; it will be extended to a higher-accuracy quantizer.
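As a rough numerical illustration of the padding, tiling, and accumulation steps (a minimal sketch in plain PyTorch; in the real flow the per-tile work is dispatched to AIE through the C++ extension):

```python
# Illustrative tiled matmul: pad to tile multiples, compute per-tile products,
# and accumulate partial results — the same shape of work the C++ backend
# performs, with the per-tile matmul offloaded to AIE instead of torch.matmul.
import torch

def tiled_matmul(a: torch.Tensor, b: torch.Tensor, tile: int = 64) -> torch.Tensor:
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    # Pad every dimension up to the next multiple of the tile size.
    pm, pk, pn = (-m % tile), (-k % tile), (-n % tile)
    a_p = torch.nn.functional.pad(a, (0, pk, 0, pm))
    b_p = torch.nn.functional.pad(b, (0, pn, 0, pk))
    out = torch.zeros(m + pm, n + pn, dtype=a.dtype)
    for i in range(0, m + pm, tile):
        for j in range(0, n + pn, tile):
            acc = torch.zeros(tile, tile, dtype=a.dtype)
            for p in range(0, k + pk, tile):
                # On hardware, this tile product would be dispatched to AIE.
                acc += a_p[i:i + tile, p:p + tile] @ b_p[p:p + tile, j:j + tile]
            out[i:i + tile, j:j + tile] = acc  # accumulation stays on CPU
    return out[:m, :n]  # strip the padding

a, b = torch.randn(100, 300), torch.randn(300, 50)
assert torch.allclose(tiled_matmul(a, b), a @ b, atol=1e-4)
```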
- Developers are required to use a fork of this repository to develop features and use it to create pull requests.
- Developers are required to add meaningful commit messages/PR titles.
- Code check-ins must be made to each lower-level submodule first, before the check-ins to the upper-level module are submitted.
- The PR should include the CI details from the submodule to ensure traceability.
The figure below describes the different components of this project for eager mode. At each level, the developer is expected to write unit tests and ensure they work on the board. After all unit tests pass, model-level performance analysis needs to be done.
Refer to LLM README
- All C++ and Python unit tests must pass
- OPT and Llama2 benchmark should not regress
- All models should generate good results with no degradation in performance
You can use pre-commit to run formatting and linting steps.
- After cloning the repository, run `pre-commit install` to let it run the linting steps prior to every commit.
- You can also run it manually with `pre-commit run --from-ref origin/main --to-ref HEAD`.
📌 Note: The repository does not currently pass all the checks, so running `pre-commit run --all-files` will change many files.
We use isort and black to format the Python files.
- Ensure that function annotations are used throughout the implementation.
📌 Note: Ensure RyzenAI is imported after all other imports in `ops\python\*.py`.
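For illustration (a hypothetical file and helper; only the annotations and import ordering are the point):

```python
# ops\python\example_op.py (hypothetical): annotated signatures, with RyzenAI
# imported after every other import, as the note above requires.
import torch
import torch.nn.functional as F

import RyzenAI  # keep this import last


def pad_to_multiple(x: torch.Tensor, multiple: int) -> torch.Tensor:
    """Pad the trailing dimension of x up to the next multiple of `multiple`."""
    extra = -x.shape[-1] % multiple
    return F.pad(x, (0, extra))
```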
We use clang-format to format the C/C++ files.
pre-commit run clang-format -a
# to format all files
.\ci\format_all.bat