Voice Breaks and Latency in Text-to-Speech Conversion #108

Open
Devloper-RG opened this issue Sep 19, 2024 · 5 comments

Comments

@Devloper-RG

I'm experiencing breaks in the generated voice output, seemingly caused by latency in the text-to-speech (TTS) conversion process. The audio has occasional gaps that disrupt the flow of speech.

Steps I've tried:

- Decreasing block size: This helped reduce some latency in delivering the TTS audio output, but the issue persists.
- Adjusting play_steps_s: I've decreased this parameter to minimize latency. However, setting play_steps_s below 0.5 causes errors, so I've kept it at 0.5 for now.

Any suggestions on how to further reduce the latency and improve the smoothness of the audio output would be greatly appreciated.

@eustlb
Collaborator

eustlb commented Sep 19, 2024

Hey @Devloper-RG,
On what device are you running the pipeline?

@Devloper-RG
Author

@eustlb
I'm running the server on a Google Cloud Platform (GCP) VM with 2 NVIDIA T4 GPUs, and the client is on my local machine.

@eustlb
Collaborator

eustlb commented Sep 19, 2024

I never tried this setup. There are two possibilities for choppy audio:

  1. the connection between the server and the client is not fast enough (see this related issue) → we are working on switching from TCP to UDP to have faster audio packet transfer.
  2. it might be that a T4 is not enough to generate 43 tokens (play_steps_s of 0.5) and run DAC decoding in less than 0.5 seconds (i.e. the duration of the audio chunk) → we are working on enabling more performant torch compile modes (i.e. the ones that capture CUDA graphs: reduce-overhead and max-autotune) with Parler-TTS + streaming, which could make it work on a T4. What you can try here is increasing play_steps_s → you'll increase latency, yet you'll also reduce the number of DAC decoding steps (rough sketch below).
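
To make that budget concrete, here's a rough back-of-the-envelope sketch; the ~86 tokens/s frame rate is only inferred from the 43 tokens per 0.5 s figure above, not read from the codebase:

```python
# Rough real-time budget for streaming Parler-TTS (numbers inferred from above).
FRAME_RATE_HZ = 86   # assumed codec frames per second of audio (43 tokens / 0.5 s)
play_steps_s = 0.5   # duration of each streamed audio chunk

tokens_per_chunk = round(FRAME_RATE_HZ * play_steps_s)  # ~43 tokens

# To avoid audible gaps, generating those tokens AND running DAC decoding must
# finish before the previous chunk is done playing:
print(f"{tokens_per_chunk} tokens + DAC decode must complete in < {play_steps_s} s")

# Larger play_steps_s -> higher first-chunk latency, but a bigger compute budget
# per chunk and fewer DAC decoding calls overall.
```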

What you can do is switch from Parler-TTS to MeloTTS by setting the --tts melo flag. You'll lose text-to-speech generation streaming, which will increase latency, but it also rules out point 2. above as a cause of choppy audio.
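
For example, that would just mean appending the flag to your server command (treat everything except --tts melo as a placeholder for whatever invocation you actually use):
python s2s_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0 --tts melo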

  1. If you still experience choppy audio, reason 1. given above is responsible.
  2. If not, then Parler-TTS is responsible. Increase play_steps_s until you do not experience choppy audio anymore.

Also, can you give me the command you're running?

@andimarafioti
Member

Also, beware that I don't think the code uses multiple GPUs yet, so 2 T4s is the same as 1.
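
If you want to be explicit about which of the two GPUs is used, you can pin one with the standard CUDA environment variable (generic PyTorch/CUDA behaviour, not something specific to this repo), e.g.:
CUDA_VISIBLE_DEVICES=0 python s2s_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0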

@Devloper-RG
Author

@eustlb
I'll implement the solutions you suggested and will update you if they work or if I find the underlying issue. As requested, here are the terminal commands I used:
Client side: python listen_and_play.py --host <IP address>
Server side: python s2s_pipeline.py --recv_host 0.0.0.0 --send_host 0.0.0.0
