Inconsistent Logging Behavior with WhisperX Model #14

flazouh · 2025-01-07T03:06:30Z

Observation with WhisperX Logs

I've been using WhisperX and noticed something odd about the logs. Some runs print percentage lines (for example 55%) while the model is working, and other runs print none. Here are two example log outputs:

Example 1

No language specified, the model will first detect the language for each audio file (which slows down the inference time).

Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.1.1. To make this upgrade permanent, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../root/.cache/torch/whisperx-vad-segmentation.bin`.

The model was trained with pyannote.audio 0.0.1, but yours is 3.0.1. If you don't revert pyannote.audio to 0.x, something might go wrong.

The model was trained with torch 1.10.0+cu102, but yours is 2.1.0+cu121. If you don't revert torch to 1.x, something might go wrong.

Detected language: en (1.00) in the first 30 seconds of the audio...

Example 2

100.0%
100.0%
100.0%
100.0%
100.0%

No language specified, the model will first detect the language for each audio file (which slows down the inference time).

The model was trained with pyannote.audio 0.0.1, but yours is 3.1.1. If you don't revert pyannote.audio to 0.x, something might go wrong.

The model was trained with torch 1.10.0+cu102, but yours is 2.1.0+cu121. If you don't revert torch to 1.x, something might go wrong.

It took 1876.43 milliseconds to load the model, 2004.21 milliseconds to load the audio, and 10330.61 milliseconds to transcribe it. It also took 12245.40 milliseconds to align the output.

The maximum amount of GPU memory allocated over runtime was 3.61 GB.

Key Observations

- Example 1: no percentage updates appear during transcription.
- Example 2: several 100.0% lines appear before the final results are shown.
- Both examples report that no language was specified and that detection was required, yet their output differs.

Questions and Hypotheses

I'm curious to know:

Why do the logs sometimes show percentages (e.g., 100.0%) and sometimes not?
Could this behavior be related to:
- Differences in configuration?
- Runtime environment variations?
- Dependency versions (e.g., PyTorch, pyannote.audio)?
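To help test the dependency hypothesis, here is a minimal sketch that prints the installed versions of the packages mentioned in the logs, so they can be recorded next to each run. The two examples above already report different pyannote.audio versions (3.0.1 and 3.1.1). The distribution names in `PACKAGES` are my assumption based on the log messages, and `installed_versions` is a hypothetical helper, not part of WhisperX.

```python
from importlib import metadata

# Assumed distribution names, taken from the packages named in the log messages.
PACKAGES = ("whisperx", "torch", "pyannote.audio", "pytorch-lightning")


def installed_versions(packages=PACKAGES):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in each environment before a transcription would show whether the runs with and without percentage lines also differ in dependency versions.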

Steps to Reproduce

To investigate further:

- Run WhisperX on different audio files without specifying a language
- Observe the logs during the language detection and transcription phases
- Compare outputs to identify when percentages are shown versus not
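The comparison step could be sketched roughly as follows, assuming each run's log has been captured as plain text. The helper names (`split_log`, `compare_logs`) are hypothetical. The script only separates percentage-only lines from other messages and lists the messages unique to each run. It does not explain where the percentages come from.

```python
import re

# Matches a line that consists only of a percentage, e.g. "100.0%" or "55%".
PERCENT_LINE = re.compile(r"^\s*\d+(?:\.\d+)?%\s*$")


def split_log(text):
    """Split a captured log into (percentage_lines, other_lines), skipping blanks."""
    percent, other = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        target = percent if PERCENT_LINE.match(line) else other
        target.append(line.strip())
    return percent, other


def compare_logs(log_a, log_b):
    """Summarize percentage-line counts and the messages unique to each log."""
    pct_a, other_a = split_log(log_a)
    pct_b, other_b = split_log(log_b)
    return {
        "percent_lines": (len(pct_a), len(pct_b)),
        "only_in_a": [line for line in other_a if line not in other_b],
        "only_in_b": [line for line in other_b if line not in other_a],
    }
```

Applied to the two examples above, `only_in_a` would include the Lightning checkpoint upgrade message, which appears only in Example 1. The pyannote.audio warnings would show up on both sides because the reported versions differ.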

Any help appreciated @victor-upmeet
