>>> nasim.alam086
[November 17, 2019, 7:19pm]
Hi,
A few months ago, I trained a DeepSpeech model on a Hindi-English mixed dataset (mainly Hindi, roughly 80-90%) of 1600 hrs of mono audio. I got WER: 0.20, train loss: 35.36, and validation loss: 48.23, with good transcription results on the test data.
Then I added 300 hrs of new audio (mono, converted from stereo using sox) from a similar environment, though the speech sounds faster because the calls are long (I cut them into audio chunks of 1.5 to 10 sec). I trained DeepSpeech from scratch again with the new 300 hrs included (total: 1600 + 300 = 1900 hrs).
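For context, a minimal sketch of this kind of sox conversion, assuming the 16 kHz / 16-bit mono format that DeepSpeech uses by default (the folder names are placeholders, not the actual paths used here):

```python
# Sketch only: batch-convert stereo call recordings to 16 kHz, 16-bit mono
# with sox. "stereo_calls/" and "mono_16k/" are hypothetical folder names.
import subprocess
from pathlib import Path

SRC = Path("stereo_calls")   # hypothetical input folder with stereo WAVs
DST = Path("mono_16k")       # hypothetical output folder
DST.mkdir(exist_ok=True)

for wav in SRC.glob("*.wav"):
    out = DST / wav.name
    # -r 16000 resamples to 16 kHz, -c 1 downmixes to mono, -b 16 gives
    # 16-bit PCM; sox adds the rate/channel conversions automatically when
    # the output format options differ from the input file.
    subprocess.run(
        ["sox", str(wav), "-r", "16000", "-c", "1", "-b", "16", str(out)],
        check=True,
    )
```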
Now I found that:
1. The model early-stopped at the 10th epoch, while previously (on the 1600 hrs of data) it early-stopped at the 13th epoch.
2. I got WER: 0.31, train loss: 51.1, and validation loss: 64.20.
I tested this model on two kinds of audio: first, fresh audio of the same type as the original 1600 hrs of data, and second, audio from the new 300 hrs. The model gives the same transcription as before for the first kind, but for the second kind (audio from the new 300 hrs) it skips lots of words and the transcription is very poor.
The prediction improves when I give it an audio chunk of roughly 1-4 sec that has been amplitude peak-normalized in Audacity.
So I then peak-normalized my whole training dataset (1900 hrs) and trained from scratch again. This time as well the model early-stopped at the 10th epoch; the train loss went from 51.1 to 50.09 and the validation loss from 64.20 to 63 (minor changes), WER became 0.2982, but the transcriptions did not improve.
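Similarly, a minimal sketch of the peak-normalization step (the same idea as Audacity's Normalize effect), assuming the numpy and soundfile packages; the -3 dBFS target and file names below are placeholders:

```python
# Sketch only: scale each file so its loudest sample hits a fixed peak level,
# similar in spirit to Audacity's Normalize effect.
import numpy as np
import soundfile as sf

TARGET_PEAK_DB = -3.0  # placeholder target; Audacity's default is around -1 dB

def peak_normalize(in_path: str, out_path: str) -> None:
    data, sr = sf.read(in_path)       # float samples in [-1.0, 1.0]
    peak = np.max(np.abs(data))
    if peak == 0:
        sf.write(out_path, data, sr)  # silent file, write through unchanged
        return
    target = 10.0 ** (TARGET_PEAK_DB / 20.0)
    sf.write(out_path, data * (target / peak), sr)

# e.g. peak_normalize("mono_16k/call_001.wav", "normalized/call_001.wav")
```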
My questions are:
1. What is the reason behind the model's behaviour?
2. Why is the model's transcription better on normalized audio chunks? Do I need to give it training audio of the same length?
3. After I trained the model with the new data (300 hrs) converted from stereo to mono, its accuracy degraded. Does stereo-to-mono conversion affect accuracy?
Any help appreciated.
Thanks
[This is an archived DeepSpeech discussion thread from discourse.mozilla.org/t/training-and-validation-loss-increases-and-transcription-worsen-when-added-few-hours-of-new-audio-data-of-same-environment]