LoRA for Whisper speech transcription #483
Description
Based on a large chunk of work from the LLM LoRA example, this PR applies LoRA fine-tuning to the Whisper speech model. All of the relevant changes to run LoRA for Whisper are in a new `lora` directory at `{project-root}/whisper/lora`. The primary training & data-loading scripts are `lora.py` and `utils.py`. The LoRA layer definitions are implemented in `whisper/lora/models/lora.py` and mimic the existing work for LLM LoRA.
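For readers who haven't seen the LLM LoRA example, the layer roughly has the following shape. This is a minimal sketch assuming the same structure as the LLM example; the names (`LoRALinear`, `lora_a`, `lora_b`) and the rank/scale defaults are illustrative, not a verbatim copy of `whisper/lora/models/lora.py`:

```python
# Minimal LoRA linear sketch in MLX, modeled on the LLM LoRA example.
# Names and defaults are assumptions for illustration, not necessarily
# this PR's exact code.
import math

import mlx.core as mx
import mlx.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, input_dims: int, output_dims: int, rank: int = 8):
        super().__init__()
        # Frozen base projection; only lora_a / lora_b are trained.
        self.linear = nn.Linear(input_dims, output_dims, bias=False)
        scale = 1 / math.sqrt(input_dims)
        self.lora_a = mx.random.uniform(
            low=-scale, high=scale, shape=(input_dims, rank)
        )
        self.lora_b = mx.zeros((rank, output_dims))

    def __call__(self, x):
        y = self.linear(x)
        # Low-rank update: (x @ A) @ B, added onto the frozen output.
        z = (x @ self.lora_a) @ self.lora_b
        return y + 2.0 * z
```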
Core changes for Whisper are:

- a new `train()` func to batch up audio & transcription pairs as inputs (see the sketch below)

All other changes are essentially ancillary and keep the whisper-lora example self-contained and easy to run and modify.
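To make the batching idea concrete: pairs of precomputed log-mel spectrograms and their reference transcriptions are grouped into fixed-size batches, with token sequences padded to a common length. This is a hedged sketch; the helper name, the field layout, and the use of `tokenizer.eot` for padding are assumptions, not the exact code in `lora.py`/`utils.py`:

```python
# Hedged sketch of batching (audio, transcription) pairs for Whisper
# fine-tuning. `examples` is assumed to be a list of
# (log_mel_spectrogram, text) tuples; illustrative only.
import numpy as np
import mlx.core as mx


def iterate_batches(examples, tokenizer, batch_size):
    for i in range(0, len(examples), batch_size):
        chunk = examples[i : i + batch_size]
        # Whisper mel inputs share a fixed shape, so they stack cleanly.
        mels = mx.stack([mx.array(mel) for mel, _ in chunk])
        tokens = [tokenizer.encode(text) for _, text in chunk]
        # Pad token sequences to the longest transcription in the batch.
        max_len = max(len(t) for t in tokens)
        padded = np.full((len(tokens), max_len), tokenizer.eot, dtype=np.int32)
        for j, t in enumerate(tokens):
            padded[j, : len(t)] = t
        yield mels, mx.array(padded)
```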
The ancillary changes:

- `whisper/run_transcribe.py`
- The `whisper` inference code, duplicated as a sub-folder under the new `{project-root}/whisper/lora/models` directory. This was primarily because I didn't want to re-use the existing whisper code (`{project-root}/whisper`), with its relative-path imports etc., in my new LoRA code (`{project-root}/whisper/lora`): doing so seemed to work fine in some run configurations but not in others. To keep things simple and easily hackable/flexible, I duplicated the existing `whisper` modeling code as-is into `{project-root}/whisper/lora/models`, so almost the entirety of `{project-root}/whisper/lora/models` should be identical to the existing `{project-root}/whisper/`.
Tests/Runs

I trained on the `mozilla-foundation/common_voice_16_1` dataset or its variants (a hedged data-loading sketch is at the end of this section). The Telugu phrase I used in my docs here roughly means "I have to go to the office on Thursday", but the model transcribed it as "I have to go to office on weekdays.", which is a pretty good translation.
I didn't intend to train a translator, but I was kinda impressed that it did so. I should note that this quirky and delightful instance happened in just one of my training runs; the other times the transcription wasn't great. It's not very reproducible, because I only ran the training for ~1000 iterations on ~38 training examples. I trained on an M1 Max MacBook Pro with 64 GB of memory for about 10-12 minutes, and the validation loss converged to ~50-70 after 1000 iterations.
Some other training runs produced different transcriptions (the inline screenshots are not reproduced here), all of which are definitely better than the original model's output:

বেবববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববববব
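For anyone trying to reproduce these runs, here is a minimal sketch of pulling a Common Voice split with the Hugging Face `datasets` library. The Telugu language code (`te`) and the `audio`/`sentence` field names are assumptions based on the public dataset layout, not necessarily what this PR's `utils.py` does:

```python
# Hedged sketch of loading Common Voice data for fine-tuning; the dataset
# is gated, so you must accept its terms on the Hugging Face Hub first.
from datasets import load_dataset

ds = load_dataset("mozilla-foundation/common_voice_16_1", "te", split="train")

for row in ds:
    waveform = row["audio"]["array"]               # raw audio (numpy array)
    sampling_rate = row["audio"]["sampling_rate"]  # typically 48 kHz here
    text = row["sentence"]                         # reference transcription
    # ...compute a log-mel spectrogram and pair it with `text`...
```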