
Commit

Merge remote-tracking branch 'SYSTRAN/master' into refacto/update_faster-whisper_1.1.0
Jeronymous committed Nov 25, 2024
2 parents 8bca59b + 97a4785 commit 3649d91
Showing 35 changed files with 3,754 additions and 509 deletions.
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -7,7 +7,7 @@ Contributions are welcome! Here are some pointers to help you install the library
We recommend installing the module in editable mode with the `dev` extra requirements:

```bash
git clone https://github.com/guillaumekln/faster-whisper.git
git clone https://github.com/SYSTRAN/faster-whisper.git
cd faster-whisper/
pip install -e .[dev]
```
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2023 Guillaume Klein
Copyright (c) 2023 SYSTRAN

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
3 changes: 2 additions & 1 deletion MANIFEST.in
@@ -1,3 +1,4 @@
include faster_whisper/assets/silero_vad.onnx
include faster_whisper/assets/silero_encoder_v5.onnx
include faster_whisper/assets/silero_decoder_v5.onnx
include requirements.txt
include requirements.conversion.txt
133 changes: 76 additions & 57 deletions README.md
@@ -1,4 +1,4 @@
[![CI](https://github.com/guillaumekln/faster-whisper/workflows/CI/badge.svg)](https://github.com/guillaumekln/faster-whisper/actions?query=workflow%3ACI) [![PyPI version](https://badge.fury.io/py/faster-whisper.svg)](https://badge.fury.io/py/faster-whisper)
[![CI](https://github.com/SYSTRAN/faster-whisper/workflows/CI/badge.svg)](https://github.com/SYSTRAN/faster-whisper/actions?query=workflow%3ACI) [![PyPI version](https://badge.fury.io/py/faster-whisper.svg)](https://badge.fury.io/py/faster-whisper)

# Faster Whisper transcription with CTranslate2

@@ -12,58 +12,47 @@ This implementation is up to 4 times faster than [openai/whisper](https://github.com/openai/whisper)

For reference, here's the time and memory usage that are required to transcribe [**13 minutes**](https://www.youtube.com/watch?v=0u7tTptBo9I) of audio using different implementations:

* [openai/whisper](https://github.com/openai/whisper)@[6dea21fd](https://github.com/openai/whisper/commit/6dea21fd7f7253bfe450f1e2512a0fe47ee2d258)
* [whisper.cpp](https://github.com/ggerganov/whisper.cpp)@[3b010f9](https://github.com/ggerganov/whisper.cpp/commit/3b010f9bed9a6068609e9faf52383aea792b0362)
* [faster-whisper](https://github.com/guillaumekln/faster-whisper)@[cce6b53e](https://github.com/guillaumekln/faster-whisper/commit/cce6b53e4554f71172dad188c45f10fb100f6e3e)
* [openai/whisper](https://github.com/openai/whisper)@[v20240930](https://github.com/openai/whisper/tree/v20240930)
* [whisper.cpp](https://github.com/ggerganov/whisper.cpp)@[v1.7.2](https://github.com/ggerganov/whisper.cpp/tree/v1.7.2)
* [transformers](https://github.com/huggingface/transformers)@[v4.46.3](https://github.com/huggingface/transformers/tree/v4.46.3)
* [faster-whisper](https://github.com/SYSTRAN/faster-whisper)@[v1.1.0](https://github.com/SYSTRAN/faster-whisper/tree/v1.1.0)

### Large-v2 model on GPU

| Implementation | Precision | Beam size | Time | Max. GPU memory | Max. CPU memory |
| --- | --- | --- | --- | --- | --- |
| openai/whisper | fp16 | 5 | 4m30s | 11325MB | 9439MB |
| faster-whisper | fp16 | 5 | 54s | 4755MB | 3244MB |
| faster-whisper | int8 | 5 | 59s | 3091MB | 3117MB |

*Executed with CUDA 11.7.1 on a NVIDIA Tesla V100S.*
| Implementation | Precision | Beam size | Time | VRAM Usage |
| --- | --- | --- | --- | --- |
| openai/whisper | fp16 | 5 | 2m23s | 4708MB |
| whisper.cpp (Flash Attention) | fp16 | 5 | 1m05s | 4127MB |
| transformers (SDPA)[^1] | fp16 | 5 | 1m52s | 4960MB |
| faster-whisper | fp16 | 5 | 1m03s | 4525MB |
| faster-whisper (`batch_size=8`) | fp16 | 5 | 17s | 6090MB |
| faster-whisper | int8 | 5 | 59s | 2926MB |
| faster-whisper (`batch_size=8`) | int8 | 5 | 16s | 4500MB |

### Small model on CPU
### distil-whisper-large-v3 model on GPU

| Implementation | Precision | Beam size | Time | Max. memory |
| Implementation | Precision | Beam size | Time | YT Commons WER |
| --- | --- | --- | --- | --- |
| openai/whisper | fp32 | 5 | 10m31s | 3101MB |
| whisper.cpp | fp32 | 5 | 17m42s | 1581MB |
| whisper.cpp | fp16 | 5 | 12m39s | 873MB |
| faster-whisper | fp32 | 5 | 2m44s | 1675MB |
| faster-whisper | int8 | 5 | 2m04s | 995MB |
| transformers (SDPA) (`batch_size=16`) | fp16 | 5 | 46m12s | 14.801 |
| faster-whisper (`batch_size=16`) | fp16 | 5 | 25m50s | 13.527 |

*Executed with 8 threads on an Intel(R) Xeon(R) Gold 6226R.*
*GPU benchmarks were executed with CUDA 12.4 on an NVIDIA RTX 3070 Ti 8GB.*
[^1]: transformers runs out of memory (OOM) for any batch size > 1

### Small model on CPU

### Distil-whisper

| Implementation | Precision | Beam size | Time | Gigaspeech WER |
| Implementation | Precision | Beam size | Time | RAM Usage |
| --- | --- | --- | --- | --- |
| distil-whisper/distil-large-v2 | fp16 | 4 | - | 10.36 |
| [faster-distil-large-v2](https://huggingface.co/Systran/faster-distil-whisper-large-v2) | fp16 | 5 | - | 10.28 |
| distil-whisper/distil-medium.en | fp16 | 4 | - | 11.21 |
| [faster-distil-medium.en](https://huggingface.co/Systran/faster-distil-whisper-medium.en) | fp16 | 5 | - | 11.21 |
| openai/whisper | fp32 | 5 | 6m58s | 2335MB |
| whisper.cpp | fp32 | 5 | 2m05s | 1049MB |
| whisper.cpp (OpenVINO) | fp32 | 5 | 1m45s | 1642MB |
| faster-whisper | fp32 | 5 | 2m37s | 2257MB |
| faster-whisper (`batch_size=8`) | fp32 | 5 | 1m06s | 4230MB |
| faster-whisper | int8 | 5 | 1m42s | 1477MB |
| faster-whisper (`batch_size=8`) | int8 | 5 | 51s | 3608MB |

*Executed with CUDA 11.4 on a NVIDIA 3090.*
*Executed with 8 threads on an Intel Core i7-12700K.*

<details>
<summary>testing details (click to expand)</summary>

For `distil-whisper/distil-large-v2`, the WER is tested with the code sample from [this link](https://huggingface.co/distil-whisper/distil-large-v2#evaluation). For `faster-distil-whisper`, the WER is tested with the following settings:
```python
from faster_whisper import WhisperModel

model_size = "distil-large-v2"
# model_size = "distil-medium.en"
# Run on GPU with FP16
model = WhisperModel(model_size, device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5, language="en")
```
</details>

## Requirements

@@ -75,24 +64,29 @@ Unlike openai-whisper, FFmpeg does **not** need to be installed on the system.

GPU execution requires the following NVIDIA libraries to be installed:

* [cuBLAS for CUDA 11](https://developer.nvidia.com/cublas)
* [cuDNN 8 for CUDA 11](https://developer.nvidia.com/cudnn)
* [cuBLAS for CUDA 12](https://developer.nvidia.com/cublas)
* [cuDNN 9 for CUDA 12](https://developer.nvidia.com/cudnn)

**Note**: The latest versions of `ctranslate2` only support CUDA 12 and cuDNN 9. For CUDA 11 and cuDNN 8, the current workaround is to downgrade to version `3.24.0` of `ctranslate2`; for CUDA 12 and cuDNN 8, downgrade to version `4.4.0` (e.g. `pip install --force-reinstall ctranslate2==4.4.0`, or pin the version in a `requirements.txt`).
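A minimal sketch of the two pins described above (pick the one matching your stack):

```bash
# Pin ctranslate2 to stay compatible with older CUDA/cuDNN stacks (see note above)
pip install --force-reinstall ctranslate2==3.24.0   # CUDA 11 + cuDNN 8
pip install --force-reinstall ctranslate2==4.4.0    # CUDA 12 + cuDNN 8
```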

There are multiple ways to install these libraries. The recommended way is described in the official NVIDIA documentation, but we also suggest other installation methods below.
There are multiple ways to install the NVIDIA libraries mentioned above. The recommended way is described in the official NVIDIA documentation, but we also suggest other installation methods below.

<details>
<summary>Other installation methods (click to expand)</summary>


**Note:** For all these methods below, keep in mind the above note regarding CUDA versions. Depending on your setup, you may need to install the _CUDA 11_ versions of libraries that correspond to the CUDA 12 libraries listed in the instructions below.
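As a hedged example, the CUDA 11 counterparts of the `pip` packages used in the instructions below would be installed like this (verify the package names against your environment):

```bash
# CUDA 11 variants of the cuBLAS/cuDNN wheels referenced below
pip install nvidia-cublas-cu11 nvidia-cudnn-cu11
```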

#### Use Docker

The libraries are installed in this official NVIDIA Docker image: `nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04`.
The libraries (cuBLAS, cuDNN) are installed in this official NVIDIA CUDA Docker image: `nvidia/cuda:12.3.2-cudnn9-runtime-ubuntu22.04`.

#### Install with `pip` (Linux only)

On Linux these libraries can be installed with `pip`. Note that `LD_LIBRARY_PATH` must be set before launching Python.

```bash
pip install nvidia-cublas-cu11 nvidia-cudnn-cu11
pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*

export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
```
Expand All @@ -117,13 +111,13 @@ pip install faster-whisper
### Install the master branch

```bash
pip install --force-reinstall "faster-whisper @ https://github.com/guillaumekln/faster-whisper/archive/refs/heads/master.tar.gz"
pip install --force-reinstall "faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/refs/heads/master.tar.gz"
```

### Install a specific commit

```bash
pip install --force-reinstall "faster-whisper @ https://github.com/guillaumekln/faster-whisper/archive/a4f1cc8f11433e454c3934442b5e1a4ed5e865c3.tar.gz"
pip install --force-reinstall "faster-whisper @ https://github.com/SYSTRAN/faster-whisper/archive/a4f1cc8f11433e454c3934442b5e1a4ed5e865c3.tar.gz"
```

</details>
@@ -159,18 +153,40 @@ for segment in segments:
segments, _ = model.transcribe("audio.mp3")
segments = list(segments) # The transcription will actually run here.
```
### Faster-distil-whisper
For usage of `faster-distil-whisper`, please refer to: https://github.com/guillaumekln/faster-whisper/issues/533

### Batched Transcription
The following code snippet illustrates how to run batched transcription on an example audio file. `BatchedInferencePipeline.transcribe` is a drop-in replacement for `WhisperModel.transcribe`.

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("turbo", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)
segments, info = batched_model.transcribe("audio.mp3", batch_size=16)

for segment in segments:
print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```

### Faster Distil-Whisper

The Distil-Whisper checkpoints are compatible with the Faster-Whisper package. In particular, the latest [distil-large-v3](https://huggingface.co/distil-whisper/distil-large-v3)
checkpoint is intrinsically designed to work with the Faster-Whisper transcription algorithm. The following code snippet
demonstrates how to run inference with distil-large-v3 on a specified audio file:

```python
model_size = "distil-large-v2"
# model_size = "distil-medium.en"
from faster_whisper import WhisperModel

model_size = "distil-large-v3"

model = WhisperModel(model_size, device="cuda", compute_type="float16")
segments, info = model.transcribe("audio.mp3", beam_size=5,
language="en", max_new_tokens=128, condition_on_previous_text=False)
segments, info = model.transcribe("audio.mp3", beam_size=5, language="en", condition_on_previous_text=False)

for segment in segments:
print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```
NOTE: Empirically, `condition_on_previous_text=True` degrades the performance of `faster-distil-whisper` on long audio. Degradation on the first chunk was also observed with `initial_prompt`.

For more information about the distil-large-v3 model, refer to the original [model card](https://huggingface.co/distil-whisper/distil-large-v3).

### Word-level timestamps

@@ -190,7 +206,7 @@ The library integrates the [Silero VAD](https://github.com/snakers4/silero-vad)
segments, _ = model.transcribe("audio.mp3", vad_filter=True)
```

The default behavior is conservative and only removes silence longer than 2 seconds. See the available VAD parameters and default values in the [source code](https://github.com/guillaumekln/faster-whisper/blob/master/faster_whisper/vad.py). They can be customized with the dictionary argument `vad_parameters`:
The default behavior is conservative and only removes silence longer than 2 seconds. See the available VAD parameters and default values in the [source code](https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/vad.py). They can be customized with the dictionary argument `vad_parameters`:

```python
segments, _ = model.transcribe(
Expand All @@ -199,6 +215,7 @@ segments, _ = model.transcribe(
    vad_parameters=dict(min_silence_duration_ms=500),
)
```
The VAD filter is enabled by default for batched transcription.
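A minimal sketch, assuming `BatchedInferencePipeline.transcribe` accepts the same `vad_filter`/`vad_parameters` arguments as `WhisperModel.transcribe` (reusing the `batched_model` from the batched transcription example above):

```python
# Disable the default VAD filter for batched transcription
segments, info = batched_model.transcribe("audio.mp3", batch_size=16, vad_filter=False)

# Or keep it enabled and tune its parameters
segments, info = batched_model.transcribe(
    "audio.mp3",
    batch_size=16,
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500),
)
```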

### Logging

Expand All @@ -213,13 +230,14 @@ logging.getLogger("faster_whisper").setLevel(logging.DEBUG)

### Going further

See more model and transcription options in the [`WhisperModel`](https://github.com/guillaumekln/faster-whisper/blob/master/faster_whisper/transcribe.py) class implementation.
See more model and transcription options in the [`WhisperModel`](https://github.com/SYSTRAN/faster-whisper/blob/master/faster_whisper/transcribe.py) class implementation.
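As a hedged illustration only (the class implementation linked above is authoritative), a few commonly used options look like this:

```python
from faster_whisper import WhisperModel

# Model options shown here (device, compute_type, cpu_threads) are examples, not an exhaustive list
model = WhisperModel("large-v3", device="cpu", compute_type="int8", cpu_threads=4)

# Transcription options: translate instead of transcribe, bias decoding with an
# initial prompt, and request word-level timestamps
segments, info = model.transcribe(
    "audio.mp3",
    task="translate",
    beam_size=5,
    initial_prompt="Technical discussion about speech recognition.",
    word_timestamps=True,
)
```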

## Community integrations

Here is a non-exhaustive list of open-source projects using faster-whisper. Feel free to add your project to the list!


* [faster-whisper-server](https://github.com/fedirz/faster-whisper-server) is an OpenAI-compatible server using `faster-whisper`. It's easily deployable with Docker, works with the OpenAI SDKs/CLI, and supports streaming and live transcription.
* [WhisperX](https://github.com/m-bain/whisperX) is an award-winning Python library that offers speaker diarization and accurate word-level timestamps using wav2vec2 alignment.
* [whisper-ctranslate2](https://github.com/Softcatala/whisper-ctranslate2) is a command line client based on faster-whisper and compatible with the original client from openai/whisper.
* [whisper-diarize](https://github.com/MahmoudAshraf97/whisper-diarization) is a speaker diarization tool that is based on faster-whisper and NVIDIA NeMo.
@@ -269,6 +287,7 @@ model = faster_whisper.WhisperModel("username/whisper-large-v3-ct2")
If you are comparing the performance against other Whisper implementations, you should make sure to run the comparison with similar settings. In particular:

* Verify that the same transcription options are used, especially the same beam size. For example in openai/whisper, `model.transcribe` uses a default beam size of 1 but here we use a default beam size of 5.
* Transcription speed is closely affected by the number of words in the transcript, so ensure that other implementations have a similar WER (Word Error Rate) to this one.
* When running on CPU, make sure to set the same number of threads. Many frameworks will read the environment variable `OMP_NUM_THREADS`, which can be set when running your script:

```bash
Binary file added benchmark/benchmark.m4a
80 changes: 80 additions & 0 deletions benchmark/evaluate_yt_commons.py
@@ -0,0 +1,80 @@
import argparse
import json
import os

from io import BytesIO

from datasets import load_dataset
from jiwer import wer
from pytubefix import YouTube
from pytubefix.exceptions import VideoUnavailable
from tqdm import tqdm
from transformers.models.whisper.english_normalizer import EnglishTextNormalizer

from faster_whisper import BatchedInferencePipeline, WhisperModel, decode_audio


def url_to_audio(row):
    buffer = BytesIO()
    yt = YouTube(row["link"])
    try:
        video = (
            yt.streams.filter(only_audio=True, mime_type="audio/mp4")
            .order_by("bitrate")
            .desc()
            .last()
        )
        video.stream_to_buffer(buffer)
        buffer.seek(0)
        row["audio"] = decode_audio(buffer)
    except VideoUnavailable:
        print(f'Failed to download: {row["link"]}')
        row["audio"] = []
    return row


parser = argparse.ArgumentParser(description="WER benchmark")
parser.add_argument(
"--audio_numb",
type=int,
default=None,
help="Specify the number of validation audio files in the dataset."
" Set to None to retrieve all audio files.",
)
args = parser.parse_args()

with open(os.path.join(os.path.dirname(__file__), "normalizer.json"), "r") as f:
    normalizer = EnglishTextNormalizer(json.load(f))

dataset = load_dataset("mobiuslabsgmbh/youtube-commons-asr-eval", streaming=True).map(
    url_to_audio
)
model = WhisperModel("large-v3", device="cuda")
pipeline = BatchedInferencePipeline(model, device="cuda")


all_transcriptions = []
all_references = []
# iterate over the dataset and run inference
for i, row in tqdm(enumerate(dataset["test"]), desc="Evaluating..."):
if not row["audio"]:
continue
result, info = pipeline.transcribe(
row["audio"][0],
batch_size=8,
word_timestamps=False,
without_timestamps=True,
)

all_transcriptions.append("".join(segment.text for segment in result))
all_references.append(row["text"][0])
if args.audio_numb and i == (args.audio_numb - 1):
break

# normalize predictions and references
all_transcriptions = [normalizer(transcription) for transcription in all_transcriptions]
all_references = [normalizer(reference) for reference in all_references]

# compute the WER metric
word_error_rate = 100 * wer(hypothesis=all_transcriptions, reference=all_references)
print("WER: %.3f" % word_error_rate)