Model Parameter adjustments if training own voice #12

Open
Marcophono2 opened this issue Sep 14, 2021 · 0 comments

Comments

@Marcophono2

Hello!

It seems I am lacking a bit of general understanding of the embedding encoder and the synthesizer. Please allow me to post three questions here:

  1. If I only want to use and optimize Angela Merkel's voice, wouldn't it make sense to delete all other voice inputs and leave only Merkel's voice? Or is at least one other (female) voice needed to make it easier for the model to train itself? What I did was delete all male voices and keep Merkel's voice and the one from eva_k. Or would it have been more (time-)efficient to use only the one target voice? And how about the parameter settings? In my case of two voices I changed the model settings to

speakers_per_batch = 2
utterances_per_speaker = 1000

But I have absolutely no idea whether that makes sense. I have a GeForce 2070 with 8 GB of VRAM, and I chose the parameters that way to use nearly 100% of it. But I also could have set

speakers_per_batch = 20
utterances_per_speaker = 100

to use the same amount of memory (see the first sketch after this list).

  2. If I only want to train and optimize my own voice: am I right that I should train the model only with the (only existing) male voice plus my own audio samples, which I record with your cool Wikipedia read&record tool? And the same parameter question as above applies.

  3. Today I used the toolbox for the very first time. My Angela voice had been trained for about 12 hours so far, and the result was impressive! Not perfect of course, 12 hours are not enough, I know, but impressive. What I do not understand: if I enter a German text into the text field and press the synthesize & vocode button again and again, the audio output quality changes every time. Why? I thought the same models (embedding encoder and synthesizer) are used each time, so the same input should give the same output. Why does it change on every run (see the second sketch below)?
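To make question 1 more concrete, here is a small sketch of how I understand the batching in a GE2E-style speaker encoder (the helper function is mine, not from this repo; please correct me if the picture is wrong):

```python
# Sketch: how speakers_per_batch and utterances_per_speaker shape one
# training batch of a GE2E-style speaker encoder. batch_stats is my
# own helper, not a function from this repo.

def batch_stats(speakers_per_batch: int, utterances_per_speaker: int):
    total_utterances = speakers_per_batch * utterances_per_speaker
    # GE2E compares every utterance embedding against the centroid of
    # every OTHER speaker in the batch; these are the negative pairs.
    negative_comparisons = total_utterances * (speakers_per_batch - 1)
    return total_utterances, negative_comparisons

# My current setting: big memory footprint, almost no contrastive signal.
print(batch_stats(2, 1000))   # (2000, 2000)
# The alternative: same total utterances, many more speaker contrasts.
print(batch_stats(20, 100))   # (2000, 38000)
```

If that reading is right, filling the memory with more speakers rather than more utterances per speaker would give the loss far more contrastive pairs for the same footprint.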
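And for question 3, in case it helps to narrow things down, here is what I would try first, assuming the toolbox is PyTorch-based (an assumption on my part). If the output still varies after pinning every seed, the randomness presumably comes from inside the models, e.g. dropout left active at inference or a sampling vocoder:

```python
# Sketch: pinning the random seeds before each synthesis run, assuming
# a PyTorch-based toolbox. If the output still differs between runs,
# the variation must come from inside the models, not from the seeding.
import random

import numpy as np
import torch

def pin_seeds(seed: int = 42) -> None:
    random.seed(seed)                 # Python's own RNG
    np.random.seed(seed)              # NumPy RNG (audio preprocessing)
    torch.manual_seed(seed)           # PyTorch CPU (and default CUDA) RNG
    torch.cuda.manual_seed_all(seed)  # all GPUs, to be explicit

pin_seeds(42)
# ...then run the same text through synthesize & vocode twice and compare.
```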

Best regards
Marc
