Use piper-phonemize to convert text to token IDs #453

Merged 19 commits on Nov 30, 2023

csukuangfj
Collaborator

The wave generated with piper-phonemize for tokenization sounds much better.

For the following text

  The sun shone bleakly in the sky, its meager light struggling 
to penetrate the thick foliage of the  forest. Birds sang their 
songs up in the crowns of the trees, fluttering from one branch to the other.

with the following scripts:

#!/usr/bin/env bash

python3 ./python-api-examples/offline-tts.py \
  --vits-model=./vits-piper-en_US-lessac-medium/en_US-lessac-medium.onnx \
  --vits-data-dir=./build-shared/install/share/espeak-ng-data \
  --output-filename=./with-piper-phonemize.wav \
  "The sun shone bleakly in the sky, its meager light struggling to penetrate the thick foliage of the forest. Birds sang their songs up in the crowns of the trees, fluttering from one branch to the other."

python3 ./python-api-examples/offline-tts.py \
  --vits-model=./vits-piper-en_US-lessac-medium/en_US-lessac-medium.onnx \
  --vits-lexicon=./vits-piper-en_US-lessac-medium/lexicon.txt \
  --vits-tokens=./vits-piper-en_US-lessac-medium/tokens.txt \
  --output-filename=./with-lexicon.wav \
  "The sun shone bleakly in the sky, its meager light struggling to penetrate the thick foliage of the forest. Birds sang their songs up in the crowns of the trees, fluttering from one branch to the other."

You can find the generated waves below.
(I have converted *.wav to *.mov since GitHub does not allow us to upload .wav)

with lexicon: https://github.com/k2-fsa/sherpa-onnx/assets/5284924/39bf86af-4dde-4ce5-9ef1-f2df0a421c75
with piper-phonemize: https://github.com/k2-fsa/sherpa-onnx/assets/5284924/7fcccc17-fa30-4178-8cea-112a760780b4

Notice

piper-phonemize is able to split a long text into sentences.
piper-phonemize adds BOS and EOS to each sentence, so you can hear a pause between sentences.
The frontend code for piper-phonemize is super simple. We don't need a lexicon.txt or a tokens.txt anymore.

There should be no OOVs any longer.

The pronunciation of a word is no longer fixed by a lexicon; instead, it is determined by its surrounding words.
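
The BOS/EOS behaviour described above can be illustrated with a small toy sketch. It does not call piper-phonemize; the symbol table, the `^` and `$` markers, and the function names are all assumptions made up for illustration, and a real model ships its own phoneme-to-ID mapping.

```python
# Toy illustration only: this symbol table is invented, not taken from a real model.
BOS, EOS = "^", "$"  # assumed begin/end-of-sentence markers
TOY_ID_MAP = {BOS: 1, EOS: 2, "h": 3, "ə": 4, "l": 5, "o": 6, "w": 7, "ɜ": 8, "d": 9}


def sentence_to_ids(phonemes, id_map):
    """Wrap one sentence's phoneme IDs in BOS ... EOS."""
    ids = [id_map[BOS]]
    for phoneme in phonemes:
        if phoneme in id_map:  # silently skip symbols the toy table does not know
            ids.append(id_map[phoneme])
    ids.append(id_map[EOS])
    return ids


def sentences_to_ids(sentences, id_map):
    """Convert each already-phonemized sentence separately."""
    return [sentence_to_ids(s, id_map) for s in sentences]


# Two hypothetical phonemized sentences.
toy_sentences = [["h", "ə", "l", "o"], ["w", "ɜ", "l", "d"]]
toy_ids = sentences_to_ids(toy_sentences, TOY_ID_MAP)
print(toy_ids)
```

Because every sentence gets its own BOS and EOS, the model sees explicit sentence boundaries, which is consistent with the pause between sentences noted above.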

TODO

  • Refactor the Lexicon class
  • Support very long text by splitting it and processing the pieces separately
  • Update the meta data for exported models
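
One possible shape for the "very long text" item in the TODO list is sketched here. This is not the implementation in this PR: the regex splitter is a naive stand-in for real sentence segmentation, and `synthesize` is a hypothetical callable that maps one sentence to a list of audio samples.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def synthesize_long_text(text, synthesize):
    """Run a hypothetical per-sentence `synthesize` and concatenate the samples."""
    samples = []
    for sentence in split_sentences(text):
        samples.extend(synthesize(sentence))
    return samples


# Stand-in synthesizer for demonstration: one fake "sample" per sentence.
def fake_synthesize(sentence):
    return [float(len(sentence))]


demo = synthesize_long_text("The sun shone. Birds sang! Done?", fake_synthesize)
print(demo)
```

Processing one sentence at a time keeps each model call short, at the cost of losing any context that spans sentence boundaries.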

CC @anita-smith1 @synesthesiam @MXC48 @rmcpantoja @beqabeqa473

It should fix the following issues:

though there is still some difference (a slight loss in pronunciation compared to the original coqui model)

Why not make an Android port of piper_phonemize and use it in next gen TTS instead of a lexicon? These voices could be used in a screen reader in the future, and it will try to read many words that may not be in that lexicon.

I am not sure you can cover everything.

It might be better to make a condition and use piper_phonemize for piper models.

Yes, this will result in adding espeak-ng, but it will be much better than adding words manually.

Single words seem to be pronounced worse than the same words in phrases.


rmcpantoja commented Nov 28, 2023

That's OK👍🏻

This was referenced Nov 29, 2023
@csukuangfj csukuangfj changed the title WIP: Use piper-phonemize to convert text to token IDs Use piper-phonemize to convert text to token IDs Nov 30, 2023
@csukuangfj csukuangfj merged commit 62dc3c3 into k2-fsa:master Nov 30, 2023
2 of 161 checks passed
@csukuangfj csukuangfj deleted the english-piper-phonemize branch November 30, 2023 15:57