Benchmarking script for large language models

This script provides a unified approach to estimating performance for Large Language Models (LLMs). It is based on pipelines provided by Optimum-Intel and allows you to estimate performance for PyTorch and OpenVINO models using almost the same code and pre-collected models.
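Conceptually, the unified approach relies on Optimum-Intel's drop-in replacements for the transformers classes. A minimal sketch of loading and querying a converted model (the model path below matches the layout produced by convert.py in step 2; the prompt is arbitrary):

from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoTokenizer

# Path to an OpenVINO IR directory, e.g. as produced by convert.py (step 2)
model_dir = "models/llama-2-7b-chat/pytorch/dldt/FP32"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = OVModelForCausalLM.from_pretrained(model_dir)

inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))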

Usage

1. Start a Python virtual environment

python3 -m venv python-env
source python-env/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt

2. Convert a model to OpenVINO IR

The conversion script convert.py prepares models for benchmarking and allows you to reproduce the IRs stored on the shared drive.

Prerequisites: install conversion dependencies using requirements.txt

Usage:

python convert.py --model_id <model_id_or_path> --output_dir <out_dir>

Parameters:

  • --model_id - model id for downloading from huggingface_hub (https://huggingface.co/models) or a path to a directory with a local PyTorch model.
  • --output_dir - output directory for saving the OpenVINO model
  • --precision - (optional, default FP32) precision for model conversion, FP32 or FP16
  • --save_orig - flag for saving the original PyTorch model; it will be placed in the <output_dir>/pytorch subdirectory.
  • --compress_weights - the weight compression option: INT8 - INT8 weights, 4BIT_DEFAULT - 4-bit weight compression with a predefined configuration, INT4_SYM - INT4 compressed weights with symmetric quantization, INT4_ASYM - INT4 compressed weights with asymmetric quantization. You can specify multiple compression options separated by a space.
  • --compress_weights_backends - (optional, default openvino) backends for weight compression; this option has an effect only with --compress_weights. You can specify multiple backends separated by a space.
  • --ratio - Compression ratio between primary and backup precision, e.g. INT4/INT8.
  • --group_size - Size of the group of weights that share the same quantization parameters

Usage example:

python convert.py --model_id meta-llama/Llama-2-7b-chat-hf --output_dir models/llama-2-7b-chat

The result of running the command is the following file structure:

|-llama-2-7b-chat
  |-pytorch
    |-dldt
       |-FP32
          |-openvino_model.xml
          |-openvino_model.bin
          |-config.json
          |-added_tokens.json
          |-tokenizer_config.json
          |-tokenizer.json
          |-tokenizer.model
          |-special_tokens_map.json
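Weight compression can be requested as part of the same command, combining the flags documented above. For example (the ratio and group-size values here are illustrative, not tuned recommendations):

python convert.py --model_id meta-llama/Llama-2-7b-chat-hf --output_dir models/llama-2-7b-chat --precision FP16 --compress_weights INT4_SYM --ratio 0.8 --group_size 128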

3. Benchmarking

Prerequisites: install benchmarking dependencies using requirements.txt

pip install -r requirements.txt

Note: you can specify which openvino version to install through pip install:

# e.g. 
pip install openvino==2023.2.0
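To confirm which version actually ended up in the environment, one quick check (assuming the openvino.runtime module layout of the 2023.x releases):

python -c "from openvino.runtime import get_version; print(get_version())"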

4. Run the following command to test the performance of an LLM

python benchmark.py -m <model> -d <device> -r <report_csv> -f <framework> -p <prompt text> -n <num_iters>
# e.g.
python benchmark.py -m models/llama-2-7b-chat/pytorch/dldt/FP32 -n 2
python benchmark.py -m models/llama-2-7b-chat/pytorch/dldt/FP32 -p "What is openvino?" -n 2
python benchmark.py -m models/llama-2-7b-chat/pytorch/dldt/FP32 -pf prompts/llama-2-7b-chat_l.jsonl -n 2

Parameters:

  • -m - model path
  • -d - inference device (default=cpu)
  • -r - report csv
  • -f - framework (default=ov)
  • -p - interactive prompt text
  • -pf - path of JSONL file including interactive prompts
  • -n - number of benchmarking iterations; if the value is greater than 0, the first iteration is excluded (default=0)
python ./benchmark.py -h # for more information
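The file passed via -pf holds one JSON object per line. A minimal example in the spirit of the bundled prompts/llama-2-7b-chat_l.jsonl (the "prompt" key is an assumption based on those bundled files):

{"prompt": "What is OpenVINO?"}
{"prompt": "Explain the difference between FP16 and INT8 weights."}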

Running torch.compile()

The option --torch_compile_backend uses torch.compile() to speed up the PyTorch code by compiling it into optimized kernels using a selected backend.
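In essence, the flag maps to a torch.compile() call like the one below (a minimal sketch with a toy module standing in for the LLM; the openvino backend is registered by importing openvino.torch, which ships with the openvino package):

import torch
import openvino.torch  # noqa: F401 - registers the "openvino" backend for torch.compile

# Toy stand-in; the benchmark script compiles the loaded LLM instead
model = torch.nn.Linear(8, 8)
compiled = torch.compile(model, backend="openvino")

print(compiled(torch.randn(1, 8)))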

Prerequisites: install benchmarking dependencies using requirements.txt

pip install -r requirements.txt

To run torch.compile() on a CUDA GPU, additionally install the nightly PyTorch version:

pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118

Add the option --torch_compile_backend with the desired backend, pytorch or openvino (default), when running the benchmarking script:

python ./benchmark.py -m models/llama-2-7b-chat/pytorch -d CPU --torch_compile_backend openvino

Note

If you encounter an AttributeError, please check NOTES.md, which provides solutions to known errors.