Distil-Whisper is a distilled variant of OpenAI's Whisper model, proposed in the paper Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling. Compared to Whisper, Distil-Whisper runs several times faster with 50% fewer parameters, while performing to within 1% word error rate (WER) on out-of-distribution evaluation data.
In this tutorial, we consider how to run Distil-Whisper using OpenVINO. We will use the pre-trained model from the Hugging Face Transformers library. To simplify the user experience, the Hugging Face Optimum library is used to convert the model to OpenVINO™ IR format. To further improve the performance of the OpenVINO Distil-Whisper model, INT8 post-training quantization from NNCF is applied.
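For reference, a minimal sketch of the conversion step is shown below, assuming the distil-whisper/distil-large-v2 checkpoint and a CPU target; the notebook's actual code may differ in model choice and device selection.

```python
# Minimal sketch: export Distil-Whisper to OpenVINO IR via Optimum Intel.
# Assumes the distil-whisper/distil-large-v2 checkpoint and a CPU device.
from optimum.intel.openvino import OVModelForSpeechSeq2Seq
from transformers import AutoProcessor

model_id = "distil-whisper/distil-large-v2"

# export=True converts the PyTorch checkpoint to OpenVINO IR on the fly
ov_model = OVModelForSpeechSeq2Seq.from_pretrained(model_id, export=True)
processor = AutoProcessor.from_pretrained(model_id)

# Select the inference device and compile the model before running it
ov_model.to("CPU")
ov_model.compile()
```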
This notebook demonstrates how to perform automatic speech recognition (ASR) using the Distil-Whisper model and OpenVINO.
The tutorial consists of the following steps:
- Download the PyTorch model
- Run PyTorch model inference
- Convert and run the model using the OpenVINO Integration with Hugging Face Optimum
- Compare the performance of the PyTorch and OpenVINO models
- Use the OpenVINO model with Hugging Face pipelines for long-form audio transcription (a sketch follows this list)
- Apply post-training quantization from NNCF (a quantization sketch also follows this list)
- Launch an interactive demo for speech recognition
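As referenced above, the long-form transcription step could look roughly like the following hedged sketch, reusing ov_model and processor from the conversion sketch; the chunk_length_s and batch_size values and the audio path are illustrative.

```python
# Sketch: long-form transcription by chunking the audio with the HF pipeline.
# Reuses ov_model and processor from the conversion sketch above.
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model=ov_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=15,   # split long audio into 15-second chunks
    batch_size=16,       # transcribe chunks in batches
)

result = pipe("long_audio.wav")  # placeholder path to a long audio file
print(result["text"])
```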
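The quantization step, in turn, could look roughly like this sketch, assuming the encoder IR has already been saved to disk and that calibration inputs were captured from a few FP32 inference runs; the IR paths and the collect_encoder_inputs helper are hypothetical placeholders, and the notebook quantizes the encoder and decoder models separately.

```python
# Rough sketch of INT8 post-training quantization of the encoder with NNCF.
# The IR paths and collect_encoder_inputs() are hypothetical placeholders.
import nncf
import openvino as ov

core = ov.Core()
encoder_model = core.read_model("distil-whisper-ov/openvino_encoder_model.xml")

# A small set of encoder input dicts captured while running the FP32 pipeline
calibration_samples = collect_encoder_inputs()

quantized_encoder = nncf.quantize(
    encoder_model,
    nncf.Dataset(calibration_samples),
    model_type=nncf.ModelType.TRANSFORMER,  # transformer-aware quantization scheme
    subset_size=len(calibration_samples),
)
ov.save_model(quantized_encoder, "openvino_encoder_model_int8.xml")
```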
This is a self-contained example that relies solely on its code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to the Installation Guide.