Transformers

This repository consists of methods to run Transformers in PyTorch and ONNX with operators dispatch to AIE.

The target models are all Transformer models including Generate AI, LLMs, Stable Diffusion and like.

Features

Extension of Pytorch with custom operators in C++
Eager mode execution with quantization and in-place op replacement strategy
Flash Attention v2 for OPT to reduce memory utilization and increase prefill phase performance
State of the art AWQ for 3-bit and 4-bit quantization
State-of-the-art SmoothQuant to condition weights for 8-bit quantization
Dynamic quantization of Transformer models (Generative LLMs, Stable Diffusion, etc.)
Model analysis with observer insertion
Layer parameter caching, checkpointing
Perplexity, MMLU and HumanEval accuracy measurement of LLMs
Benchmarking LLMs with state-of-the-art methods
Pytorch -> ONNX using Optimum ORT Quantizer framework and eager execution on ONNX-EP
Automatic selection of custom compute kernels for optimal prompt/prefill phase latency
Speculative Decoding with HF pipline supported for Llama2, OPT and CodeLlama2-7b
GGUF model support with llama.cpp framework
Common C++ backend for Pytorch, ONNX and GGUF frameworks

Models supported with Pytorch flow

The following models are supported on RyzenAI with the 4 quantization recipes described in here.

Model Name	SmoothQuant	AWQ	AWQPlus	PerGroup	Quant Model Size
facebook/opt-125m	✓	✓	✓	✓	0.07
facebook/opt-1.3b	✓	✓	✓	✓	0.8
facebook/opt-2.7b	✓	✓	✓	✓	1.4
facebook/opt-6.7b	✓	✓	✓	✓	3.8
facebook/opt-13b		✓	✓	✓	7.5
llama-2-7b*	✓			✓	3.9
llama-2-7b-chat*	✓	✓	✓	✓	3.9
llama-2-13b*				✓	7.2
llama-2-13b-chat*		✓	✓	✓	7.2
Meta-Llama-3-8B-Instruct *		✓	✓	✓	4.8
Meta-Llama-3-8B *		✓	✓	✓	4.8
Meta-Llama-3.1-8B *				✓	4.8
Meta-Llama-3.2-1B-Early		✓	✓	✓	0.3
Meta-Llama-3.2-3B-Early				✓	4.8
bigscience/bloom-560m	✓			✓	1.6
bigscience/bloom-1b1	✓			✓	0.65
bigscience/bloom-3b	✓			✓	1.7
bigcode/starcoder		✓	✓	✓	8.0
code-llama-2-7b*		✓	✓	✓	3.9
codellama/CodeLlama-7b-hf		✓	✓	✓	3.9
codellama/CodeLlama-7b-instruct-hf		✓	✓	✓	3.9
google/gemma-2b **		✓	✓	✓	1.2
google/gemma-7b **		✓	✓	✓	4.0
THUDM/chatglm-6b				✓	3.3
THUDM/chatglm3-6b		✓	✓	✓	4.1
Qwen/Qwen-7b		✓	✓	✓	4.1
Qwen/Qwen1.5-7B		✓	✓	✓	4.1
Qwen/Qwen1.5-7B-Chat		✓	✓	✓	tbd
microsoft/phi-2				✓	tbd
microsoft/phi-3				✓	tbd
microsoft/Phi-3.5-mini-instruct		✓	✓	✓	tbd
mistralai/Mistral-7B-v0.1				✓	tbd
TinyLlama-1.1B-Chat-v1.0				✓	tbd
mamba-1.4b-hf **				✓	tbd
mamba-2.8b-hf **				✓	tbd

📌 Important

* Need local weights for these models.

** Needs transformers==4.39.1 ; pip install transformers==4.39.1 and follow same run_awq.py commands.

Jump to running LLMs in Pytorch - following installations instructions below first

Prequisites

The main branch is intended for continuous development. All developers must strictly adhere to Contribution Guidelines enumerated in subsequent sections.

Request PHX or STX Ryzen-AI PC

Request the board using ChangeGear. Setup the board using the instructions provided in this link.
Run the unit testcases to confirm driver installation works correctly.

Install dependencies

On the PC, install the following dependencies

Install Anaconda
Install Visual Studio 2022 Community Edition
Install Git
Install AIE driver as described in this link

Setup Transformers Env

Step 1: Download repository and setup conda environment

Open Anaconda Command Prompt or Anconda Powershell on Windows PC and clone Transformers repo:

git config --global core.longpaths true
git clone --recurse-submodules https://gitenterprise.xilinx.com/VitisAI/transformers.git
cd transformers

Create conda environment:

conda update -n base -c defaults conda -y
conda env create --file=env.yaml
conda activate ryzenai-transformers
build_dependencies.bat
# build_dependencies.ps1

AWQ Model zoo has precomputed scales, clips and zeros for various LLMs including OPT, Llama. Get the precomputed results:

git lfs install
cd <transformers>\ext
git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo awq_cache
copy <transformers>\models\llm\Qwen1.5-7B-Chat-w4-g128.pt <transformers>\ext\awq_cache\
copy <transformers>\models\llm\Qwen1.5-7B-w4-g128.pt <transformers>\ext\awq_cache\
copy <transformers>\models\llm\Qwen-7b-w4-g128.pt <transformers>\ext\awq_cache\

⚠️ Warning: Windows has a path length limit that you may hit when building the project or installing the wheels, resulting in cryptic errors. To work around it, use a virtual drive to shorten the path the repository is cloned to:

On Command Prompt

@REM use any unused drive letter, Z: for example
subst Z: %cd%
@REM switch to the Z: drive
Z:

You can remove the virtual drive with:

On Command Prompt

subst /d Z:

Step 2: Setup target environment

On Anaconda Command Prompt

## For PHX
.\setup_phx.bat

## For STX
.\setup_stx.bat

On Anaconda PowerShell

## For PHX
.\setup_phx.ps1

## For STX
.\setup_stx.ps1

Remember to setup the target environment again if you switch to or from a virtual drive!

Step 3: Build dependencies

pip install ops\cpp --force-reinstall
pip install ops\torch_cpp --force-reinstall

Step 4: Install ONNX EP for running ONNX based flows

For running onnxruntime apps, please refer to Vitis-AI EP Installation instructions.

Step 5: Verify Installation

Python unit testcases
C++ unit testcases

Step 6: MMLU

To measure MMLU on LLMs, download the data, extract it, rename the folder to "mmlu_data" and place it in <transformers>/models/llm directory

Run LLMs

All LLMs in Pytorch
Speculative Decoding of LLMs in Pytorch
GGUF Models with llama.cpp

Eager Mode Execution Flow

The following figure shows default execution of Pytorch models on CPU.

This flow has a Pytorch C++ extension and the hardware acceleration is within C++ extension. The C++ app has stationary weights, padding, tiling, calling AIE hardware and intermediate accumulation on CPU. Dynamic quantization from PyTorch is leveraged for this. It will be extended to a higher accuracy quantizer.

Code Contribution Guidelines

Developers are required to use a fork of this repository to develop features and use it to create pull requests.
Developers are required to add meaningful commit messages/PR titles.
Code-checkin must happen at every low level submodule first before the check-ins to upper module is submitted.
The PR should have the CI details from submodule to ensure traceability.

The figure below describes different components of this project for eager mode. At each level, developer is expected to write unit-tests, ensure they work on board. After all unit-tests work, model level performance analysis needs to be done.

Hard Requirements

Refer to LLM README

All C++ and Python unit tests must pass
OPT and Llama2 benchmark should not regress
All models should generate good results with no degradation in performance

Pre-Commit

You can use pre-commit to run formatting and linting steps.

After cloning the repository, run pre-commit install to let it run the linting steps prior to every commit.
You can also run it manually with pre-commit run --from-ref origin/main --to-ref HEAD.

📌 Note: The repository does not currently meet all the checks so if you run pre-commit run --all-files, it will change many files.

Python Formatting

We use isort, and black to format the Python files.

Ensure that function annotations are used throughout the implementation.

📌 Note: Ensure RyzenAI is imported at the end of all imports in ops\python\*.py.

C++ Formatting

We use clang-format to format the C/C++ files.

pre-commit run clang-format -a

# to format all files
.\ci\format_all.bat

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Transformers

Features

Models supported with Pytorch flow

Prequisites

Request PHX or STX Ryzen-AI PC

Install dependencies

Setup Transformers Env

Step 1: Download repository and setup conda environment

Step 2: Setup target environment

Step 3: Build dependencies

Step 4: Install ONNX EP for running ONNX based flows

Step 5: Verify Installation

Step 6: MMLU

Run LLMs

Eager Mode Execution Flow

Code Contribution Guidelines

Hard Requirements

Pre-Commit

Python Formatting

C++ Formatting

Files

README.md

Latest commit

History

README.md

File metadata and controls

Transformers

Features

Models supported with Pytorch flow

Prequisites

Request PHX or STX Ryzen-AI PC

Install dependencies

Setup Transformers Env

Step 1: Download repository and setup conda environment

Step 2: Setup target environment

Step 3: Build dependencies

Step 4: Install ONNX EP for running ONNX based flows

Step 5: Verify Installation

Step 6: MMLU

Run LLMs

Eager Mode Execution Flow

Code Contribution Guidelines

Hard Requirements

Pre-Commit

Python Formatting

C++ Formatting