long form inference #11

eschmidbauer · 2024-10-09T18:21:58Z

Is long-form inference possible with whisper_trt?
I tried inference on a 4m16s audio clip and it appeared to transcribe only the first 30s. Here is my script:

from whisper_trt import load_trt_model

model = load_trt_model("small.en")
result = model.transcribe("test.wav")


jaybdub · 2024-10-16T16:38:16Z

It should be possible, but it seems we'll need to make some modifications to the transcribe function:

Line 162 in 268eff1

if int(mel.shape[2]) > whisper.audio.N_FRAMES:

Currently, it runs on a single 30s window.

John

eschmidbauer · 2024-10-16T17:54:03Z

It would be great to demonstrate long-form inference here, perhaps by using a sliding window.
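
For illustration, here is a minimal sketch of the fixed-window idea. It is not whisper_trt code: `iter_windows` and `transcribe_long` are hypothetical helpers, and the per-window transcription is passed in as a callable so the sketch does not depend on any particular model API. It assumes 16 kHz mono audio and 30s windows, matching the single window mentioned above.

```python
from typing import Callable, Iterator, Sequence

SAMPLE_RATE = 16_000   # assumption: audio is 16 kHz mono, as Whisper models expect
WINDOW_SECONDS = 30    # the single-window length discussed in this thread


def iter_windows(audio: Sequence, window: int, overlap: int = 0) -> Iterator[Sequence]:
    """Yield consecutive slices of `audio`, each at most `window` samples long.

    Hypothetical helper: neighbouring slices share `overlap` samples.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    start = 0
    while start < len(audio):
        yield audio[start:start + window]
        if start + window >= len(audio):
            break  # this slice already reached the end of the audio
        start += step


def transcribe_long(
    audio: Sequence,
    transcribe_window: Callable[[Sequence], str],
    window: int = WINDOW_SECONDS * SAMPLE_RATE,
    overlap: int = 0,
) -> str:
    """Transcribe each window separately and join the non-empty results with spaces."""
    texts = (transcribe_window(chunk).strip() for chunk in iter_windows(audio, window, overlap))
    return " ".join(text for text in texts if text)
```

If `model.transcribe` accepts an array of samples and returns a dict with a `"text"` key (an assumption, not verified here), the callable could be `lambda chunk: model.transcribe(chunk)["text"]`. Note that plain fixed windows can cut a word at a boundary, and a non-zero `overlap` will repeat some text in the joined result, so this is only a starting point and not a full timestamp-based sliding window.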
