Training suggestion...? Reducing the LLM's tendency to produce output like "I'm sorry, I'm an AI language model and I don't have the ability to transcribe speech to text" #113
Comments
Is there any mismatch between training and inference?
Hi @ddlBoJack, thanks a lot for replying! One thing I do observe: our testing dataset contains many no-speech or very short audio segments, and the model tends to mark short utterances with <NO_SPEECH>. I am currently not sure whether this is because we do not have enough such audio in our training dataset, whether it is related to a training parameter (might this be related to the projector downsampling rate?), or whether it is a bug in the code base. Any suggestions would be really appreciated. Thanks!
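As a hedged sketch of one mitigation for the no-speech/short-audio issue (the function name, tag string, and threshold below are hypothetical choices for illustration, not taken from the SLAM-LLM code base): explicitly label silent or very short clips with a dedicated target token during data preparation, so the model learns one unambiguous output for no-speech input instead of improvising a chat response.

```python
# Hypothetical data-prep step: map very short / silent clips to an explicit
# <NO_SPEECH> target instead of leaving them with empty or noisy transcripts.
# The tag and the duration threshold are assumptions, not values from the repo.

NO_SPEECH_TAG = "<NO_SPEECH>"
MIN_SPEECH_SECONDS = 0.3  # clips shorter than this are treated as no-speech

def label_target(transcript: str, duration_s: float) -> str:
    """Return the training target for one (audio, transcript) pair."""
    if duration_s < MIN_SPEECH_SECONDS or not transcript.strip():
        return NO_SPEECH_TAG
    return transcript.strip()
```

With enough such examples in the training mix, short utterances should map to the tag rather than to refusal text; the tag can then be stripped or scored separately at evaluation time.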
Hi @billweasley,
Hi @PigeonDan1
It converges really quickly, at least in my setup: roughly 4,000-5,000 steps with batch size 4 before I can see "emerging" behaviour in the loss/accuracy, though longer training seems to help further improve the results. Due to the issue above, I have not gotten production-ready results yet.
Hi @billweasley,
Hello @billweasley, I have a question about point 1. The repetition problem is caused by the history tokens generated by the LLM; if point 1 is used, is it possible the model will generate many <NO_SPEECH> tokens? I haven't tried this yet, I just have a few doubts.
@fclearner Thanks for your question.
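One common mitigation for the repetition concern raised above, independent of how <NO_SPEECH> is labeled, is to forbid the decoder from emitting a token that would repeat an n-gram already present in the history (the same idea behind the `no_repeat_ngram_size` option in Hugging Face `generate`). A minimal pure-Python sketch over token IDs, not code from this repo:

```python
def banned_next_tokens(history: list[int], n: int = 3) -> set[int]:
    """Tokens that would complete an n-gram already present in `history`.

    At each decoding step, any token in the returned set can be masked out
    (e.g. its logit set to -inf) so no n-gram is ever generated twice.
    """
    banned: set[int] = set()
    if len(history) < n - 1:
        return banned
    prefix = tuple(history[-(n - 1):])  # last n-1 generated tokens
    # Scan earlier positions for the same (n-1)-token prefix; the token that
    # followed it before is banned now, since emitting it would repeat the n-gram.
    for i in range(len(history) - n + 1):
        if tuple(history[i:i + n - 1]) == prefix:
            banned.add(history[i + n - 1])
    return banned
```

For example, with history `[1, 2, 3, 1, 2]` and `n=3`, token `3` is banned because `(1, 2, 3)` already occurred. Note this only suppresses the symptom; whether it interacts badly with many legitimate repeated <NO_SPEECH> outputs would need to be checked empirically.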
System Info
PyTorch 2.3.1+cu121
CUDA 12.2
GPU: NVIDIA H100, 2 machines × 8 GPUs, DDP only, FP16
Information
🐛 Describe the bug
Not really a bug...
I tried to follow the instructions to fine-tune the model on my company's in-house data (~24k hours of English data, mostly with the config mentioned in https://arxiv.org/abs/2402.08846).
When decoding, I find three types of errors in the output, for example:
-" I'm sorry, I'm an AI language model and I don't have the ability to transcribe speech to text. However, there are many speech-to-text software and apps available that can help you with that. You can search for "speech-to-text software" or "speech-to-text app" to find some options."
These three issues make the WER pretty high, so I am here seeking advice: did the authors come across the same issues? Does anyone have any suggestions?
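As a stop-gap while the training-side fix is being investigated, one can at least flag refusal-style outputs before scoring WER, so they can be counted or excluded separately. A minimal sketch; the patterns are guesses based on the example output above, not an exhaustive list, and should be extended from your own decoding logs:

```python
import re

# Phrases typical of chat-style refusals leaking from the instruction-tuned LLM.
# Illustrative only; grow this list from real decoding output.
REFUSAL_PATTERNS = [
    r"\bI'?m sorry\b",
    r"\bAI language model\b",
    r"\bspeech-to-text (software|app)\b",
]
_REFUSAL_RE = re.compile("|".join(REFUSAL_PATTERNS), re.IGNORECASE)

def is_refusal(hypothesis: str) -> bool:
    """True if a decoded hypothesis looks like a chat refusal, not a transcript."""
    return bool(_REFUSAL_RE.search(hypothesis))
```

Tracking the refusal rate this way also gives a cheap metric for whether a training change (e.g. adding no-speech examples) is actually reducing the behaviour.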
Encoder: Hubert xtlarge
LLM: Vicuna 7B v1.5
Error logs
N/A
Expected behavior
N/A