Inconsistent Logging Behavior with WhisperX Model #14

Open
flazouh opened this issue Jan 7, 2025 · 0 comments

Observation with WhisperX Logs

I've been using WhisperX and noticed something odd about the logs. Some runs print progress percentages (e.g., 55%) during processing, while others print none. Here are two examples of the log output:

Example 1

No language specified, the model will first detect the language for each audio file (which slows down the inference time).

Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.1.1. To make this upgrade permanent, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../root/.cache/torch/whisperx-vad-segmentation.bin`.

The model was trained with pyannote.audio 0.0.1, but yours is 3.0.1. If you don't revert pyannote.audio to 0.x, something might go wrong.

The model was trained with torch 1.10.0+cu102, but yours is 2.1.0+cu121. If you don't revert torch to 1.x, something might go wrong.

Detected language: en (1.00) in the first 30 seconds of the audio...

Example 2

100.0%
100.0%
100.0%
100.0%
100.0%

No language specified, the model will first detect the language for each audio file (which slows down the inference time).

The model was trained with pyannote.audio 0.0.1, but yours is 3.1.1. If you don't revert pyannote.audio to 0.x, something might go wrong.

The model was trained with torch 1.10.0+cu102, but yours is 2.1.0+cu121. If you don't revert torch to 1.x, something might go wrong.

It took 1876.43 milliseconds to load the model, 2004.21 milliseconds to load the audio, and 10330.61 milliseconds to transcribe it. It also took 12245.40 milliseconds to align the output.

The maximum amount of GPU memory allocated over runtime was 3.61 GB.

Key Observations

  • Example 1: No percentage updates during the transcription process
  • Example 2: Several lines of 100.0% logs appear before the final results are shown
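  • Both runs report that no language was specified, so both required language detection, yet only Example 2 printed percentage lines
  • The logs report different pyannote.audio versions (3.0.1 in Example 1, 3.1.1 in Example 2), and only Example 1 shows the Lightning checkpoint upgrade message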

Questions and Hypotheses

I'm curious to know:

  1. Why do the logs sometimes show percentages (e.g., 100.0%) and sometimes not?
  2. Could this behavior be related to:
    • Differences in configuration?
    • Runtime environment variations?
    • Dependency versions (e.g., PyTorch, pyannote.audio)?
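To test the dependency-version hypothesis, it may help to record the installed versions alongside each run. The following is a minimal sketch using only the Python standard library; the package names in the default tuple are assumptions about which distributions are relevant and how they are named in the environment, and `collect_versions` is a hypothetical helper, not part of WhisperX.

```python
from importlib import metadata


def collect_versions(packages=("torch", "pyannote.audio", "pytorch-lightning", "whisperx")):
    """Return a dict mapping each distribution name to its installed version.

    The default package names are assumptions; adjust them to the environment.
    A value of None means the distribution was not found.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Print one line per package so the output can be pasted next to a run's logs.
    for name, version in collect_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Saving this output with each run's logs would make it possible to check whether the percentage lines correlate with a particular set of versions.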

Steps to Reproduce

To investigate further:

  1. Run WhisperX on different audio files without specifying a language
  2. Observe the logs during the language detection and transcription phases
  3. Compare outputs to identify when percentages are shown versus not
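The comparison in step 3 could be sketched as a small script that scans captured log text for lines consisting only of a percentage. This assumes the logs have already been saved as plain text; `percentage_lines` and `summarize` are hypothetical helper names, and the pattern only matches standalone lines such as `100.0%` or `55%`, which is an assumption based on the examples above.

```python
import re

# Matches a line that contains nothing but a percentage, e.g. "100.0%" or "55%".
PERCENT_LINE = re.compile(r"^\s*\d{1,3}(?:\.\d+)?%\s*$")


def percentage_lines(log_text):
    """Return the stripped lines of log_text that are standalone percentages."""
    return [line.strip() for line in log_text.splitlines() if PERCENT_LINE.match(line)]


def summarize(log_text):
    """Summarize whether a captured log contains percentage lines, and which ones."""
    hits = percentage_lines(log_text)
    return {"has_percentages": bool(hits), "count": len(hits), "values": hits}
```

Running `summarize` on the captured log of each run gives a quick yes/no per run, which can then be lined up against the audio file, configuration, and environment used.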

Any help appreciated @victor-upmeet
