>>> nasim.alam086
[November 17, 2019, 7:19pm]
Hi,
A few months ago, I trained a DeepSpeech model on a Hindi-English mixed dataset (mainly Hindi, roughly 80-90%) of 1600 hrs of mono audio. I got WER: 0.20, train loss: 35.36, and validation loss: 48.23, with good transcription results on the test data.
Then I added 300 hrs of new audio (mono, converted from stereo using sox) from a similar environment, though the speech sounds faster because the calls are long (I cut them into audio chunks of 1.5 to 10 sec). I trained DeepSpeech from scratch again with the new 300 hrs included (total: 1600 + 300 = 1900 hrs).
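For context, a minimal sketch of this kind of sox conversion, assuming the 16 kHz / 16-bit mono format that DeepSpeech uses by default (the folder names are placeholders, not the actual paths used here):

```python
# Sketch only: batch-convert stereo call recordings to 16 kHz, 16-bit mono
# with sox. "stereo_calls/" and "mono_16k/" are hypothetical folder names.
import subprocess
from pathlib import Path

SRC = Path("stereo_calls")   # hypothetical input folder with stereo WAVs
DST = Path("mono_16k")       # hypothetical output folder
DST.mkdir(exist_ok=True)

for wav in SRC.glob("*.wav"):
    out = DST / wav.name
    # -r 16000 resamples to 16 kHz, -c 1 downmixes to mono, -b 16 gives
    # 16-bit PCM; sox adds the rate/channel conversions automatically when
    # the output format options differ from the input file.
    subprocess.run(
        ["sox", str(wav), "-r", "16000", "-c", "1", "-b", "16", str(out)],
        check=True,
    )
```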
Now I found that:
1. The model early-stopped at the 10th epoch, while previously (on the 1600 hrs of data) it early-stopped at the 13th epoch.
2. I got WER: 0.31, train loss: 51.1, and validation loss: 64.20.
I tested this model on two kinds of audio: first, fresh audio of the same type as the original 1600 hrs of data, and second, audio from the new 300 hrs. The model gives the same transcription as before for the first kind, but for the second kind (audio from the new 300 hrs) it skips lots of words and the transcription is very poor.
The prediction improves when I give it an audio chunk of roughly 1-4 sec that has been amplitude peak-normalized in Audacity.
So I then peak-normalized my whole training dataset (1900 hrs) and trained from scratch again. This time as well the model early-stopped at the 10th epoch; the train loss went from 51.1 to 50.09 and the validation loss from 64.20 to 63 (minor changes), WER became 0.2982, but the transcriptions did not improve.
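Similarly, a minimal sketch of the peak-normalization step (the same idea as Audacity's Normalize effect), assuming the numpy and soundfile packages; the -3 dBFS target and file names below are placeholders:

```python
# Sketch only: scale each file so its loudest sample hits a fixed peak level,
# similar in spirit to Audacity's Normalize effect.
import numpy as np
import soundfile as sf

TARGET_PEAK_DB = -3.0  # placeholder target; Audacity's default is around -1 dB

def peak_normalize(in_path: str, out_path: str) -> None:
    data, sr = sf.read(in_path)       # float samples in [-1.0, 1.0]
    peak = np.max(np.abs(data))
    if peak == 0:
        sf.write(out_path, data, sr)  # silent file, write through unchanged
        return
    target = 10.0 ** (TARGET_PEAK_DB / 20.0)
    sf.write(out_path, data * (target / peak), sr)

# e.g. peak_normalize("mono_16k/call_001.wav", "normalized/call_001.wav")
```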
My questions are:
1. What is the reason behind the model's behaviour?
2. Why is the model's transcription better on normalized audio chunks? Do I need to give it training audio of the same length?
3. After I trained the model with the new data (300 hrs) converted from stereo to mono, its accuracy degraded. Does stereo-to-mono conversion affect accuracy?
Any help appreciated.
Thanks
[This is an archived DeepSpeech discussion thread from discourse.mozilla.org/t/training-and-validation-loss-increases-and-transcription-worsen-when-added-few-hours-of-new-audio-data-of-same-environment]