Rui Liu, Member, IEEE 1, Kailin Liang 1, De Hu 1, Tao Li 2, Dongchao Yang 3, Haizhou Li, Fellow, IEEE 4,5
1 Inner Mongolia University
2 Northwestern Polytechnical University
3 The Chinese University of Hong Kong, Hong Kong, China
4 Shenzhen Research Institute of Big Data, School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China
5 National University of Singapore, Singapore
Introduction
The cross-speaker emotion transfer (CSEF) task in text-to-speech (TTS) synthesis aims to synthesize speech for a target speaker with the emotion transferred from reference speech recorded by another (source) speaker.
Traditional CSEF works adopt speaker-emotion decoupling strategies and achieve remarkable emotion transfer performance.
However, when the reference speech is contaminated with noise, extracting clean emotion features and decoupling the speaker and emotion features become challenging, which degrades the effectiveness of emotion transfer.
To address the above issues, we propose a novel Noise-robust Cross-Speaker Emotion Transfer TTS model, termed NCE-TTS.
NCE-TTS integrates noise-robust emotion information extraction and noise-robust speaker-emotion disentanglement into a unified framework through two new modules: 1) Knowledge Distillation and 2) Orthogonal Constraint.
The knowledge distillation module learns to recover the emotion features of clean speech directly from noisy speech with a conditional diffusion model. The orthogonal constraint disentangles the deep emotion embedding from the speaker embedding and further enhances the emotion-discriminative ability.
Unlike the traditional cascaded approach of first denoising and then extracting features, we build a unified training framework that achieves better emotion transfer in noisy scenarios.
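The two training objectives can be sketched as follows. This is a minimal NumPy illustration under our own naming assumptions (the function names, shapes, and loss forms are not taken from the paper; the actual model uses a conditional diffusion model and learned encoders):

```python
import numpy as np

def distillation_loss(student_emb, teacher_emb):
    """MSE between the emotion features predicted from noisy speech
    (student) and those extracted from clean speech (teacher)."""
    return np.mean((student_emb - teacher_emb) ** 2)

def orthogonality_loss(emotion_emb, speaker_emb, eps=1e-8):
    """Squared cosine similarity between the emotion and speaker
    embeddings: 0 when they are orthogonal (fully disentangled),
    1 when they are collinear."""
    cos = np.dot(emotion_emb, speaker_emb) / (
        np.linalg.norm(emotion_emb) * np.linalg.norm(speaker_emb) + eps)
    return cos ** 2
```

Minimizing the squared cosine similarity pushes the two embeddings toward orthogonality, which is one common way to realize an orthogonal constraint between representation spaces.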
We conduct extensive experiments on the multi-speaker English emotional speech dataset ESD.
Objective and subjective results demonstrate that the proposed NCE-TTS synthesizes emotionally rich speech while preserving the target speaker's voice in various noisy scenarios, with significant improvements over all advanced baselines.
Overview
The overview of NCE-TTS is shown in the figure below.
In the following, we present samples generated by our proposed method.
1. Comparison with Baselines on Different Emotions
We first show a direct comparison between NCE-TTS and the baselines, using emotional reference audio of different emotion types.
| Content | Speaker reference | Emotion reference | GenerSpeech | Daft-Exprt | NoreSpeech | Vall-E | NCE-TTS |
|---|---|---|---|---|---|---|---|
| He was still in the forest! |  | Angry, clean |  |  |  |  |  |
| He was still in the forest! |  | Happy, clean |  |  |  |  |  |
| He was still in the forest! |  | Neutral, clean |  |  |  |  |  |
| He was still in the forest! |  | Sad, clean |  |  |  |  |  |
| He was still in the forest! |  | Surprise, clean |  |  |  |  |  |
{: .table0}
| Content | Speaker reference | Emotion reference | GenerSpeech | Daft-Exprt | NoreSpeech | Vall-E | NCE-TTS |
|---|---|---|---|---|---|---|---|
| He was still in the forest! |  | Angry, 5 dB |  |  |  |  |  |
| He was still in the forest! |  | Happy, 5 dB |  |  |  |  |  |
| He was still in the forest! |  | Neutral, 5 dB |  |  |  |  |  |
| He was still in the forest! |  | Sad, 5 dB |  |  |  |  |  |
| He was still in the forest! |  | Surprise, 5 dB |  |  |  |  |  |
{: .table1_2}
2. Speaker Parallel Emotion Transfer on Various Noise Conditions
In the following, we demonstrate cases of speaker-parallel transfer, with speech synthesized under different emotion and noise conditions.
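The noise conditions on this page (0 dB, 5 dB, 10 dB) refer to the signal-to-noise ratio at which noise was mixed into the reference speech. A minimal sketch of mixing noise at a target SNR is shown below; `mix_at_snr` is a hypothetical helper for illustration, not the exact pipeline used to prepare these demos:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals
    `snr_db`, then add it to the speech.  Assumes both arrays have
    the same length and sampling rate."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Power ratio 10^(SNR/10); solve for the amplitude scale on the noise.
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```

Lower SNR (e.g. 0 dB, equal speech and noise power) makes the reference harder, which is why the baselines degrade most in those columns.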
3. Speaker Non-Parallel Emotion Transfer on Various Noise Conditions
In the following, we demonstrate cases of speaker-non-parallel transfer, with speech synthesized under different emotion and noise conditions.
| Content | Speaker reference | Emotion reference | GenerSpeech | Daft-Exprt | NoreSpeech | Vall-E | NCE-TTS |
|---|---|---|---|---|---|---|---|
| He was still in the forest! |  | Angry, clean |  |  |  |  |  |
| He was still in the forest! |  | Angry, 0 dB |  |  |  |  |  |
| He was still in the forest! |  | Angry, 5 dB |  |  |  |  |  |
| He was still in the forest! |  | Angry, 10 dB |  |  |  |  |  |
| He was still in the forest! |  | Angry, 0 dB (denoised) |  |  |  |  |  |
| He was still in the forest! |  | Angry, 5 dB (denoised) |  |  |  |  |  |
| He was still in the forest! |  | Angry, 10 dB (denoised) |  |  |  |  |  |
{: .table1_2}
| Content | Speaker reference | Emotion reference | GenerSpeech | Daft-Exprt | NoreSpeech | Vall-E | NCE-TTS |
|---|---|---|---|---|---|---|---|
| He was still in the forest! |  | Happy, clean |  |  |  |  |  |
| He was still in the forest! |  | Happy, 0 dB |  |  |  |  |  |
| He was still in the forest! |  | Happy, 5 dB |  |  |  |  |  |
| He was still in the forest! |  | Happy, 10 dB |  |  |  |  |  |
| He was still in the forest! |  | Happy, 0 dB (denoised) |  |  |  |  |  |
| He was still in the forest! |  | Happy, 5 dB (denoised) |  |  |  |  |  |
| He was still in the forest! |  | Happy, 10 dB (denoised) |  |  |  |  |  |
{: .table1_2}
| Content | Speaker reference | Emotion reference | GenerSpeech | Daft-Exprt | NoreSpeech | Vall-E | NCE-TTS |
|---|---|---|---|---|---|---|---|
| He was still in the forest! |  | Neutral, clean |  |  |  |  |  |
| He was still in the forest! |  | Neutral, 0 dB |  |  |  |  |  |
| He was still in the forest! |  | Neutral, 5 dB |  |  |  |  |  |
| He was still in the forest! |  | Neutral, 10 dB |  |  |  |  |  |
| He was still in the forest! |  | Neutral, 0 dB (denoised) |  |  |  |  |  |
| He was still in the forest! |  | Neutral, 5 dB (denoised) |  |  |  |  |  |
| He was still in the forest! |  | Neutral, 10 dB (denoised) |  |  |  |  |  |
{: .table1_2}
| Content | Speaker reference | Emotion reference | GenerSpeech | Daft-Exprt | NoreSpeech | Vall-E | NCE-TTS |
|---|---|---|---|---|---|---|---|
| He was still in the forest! |  | Sad, clean |  |  |  |  |  |
| He was still in the forest! |  | Sad, 0 dB |  |  |  |  |  |
| He was still in the forest! |  | Sad, 5 dB |  |  |  |  |  |
| He was still in the forest! |  | Sad, 10 dB |  |  |  |  |  |
| He was still in the forest! |  | Sad, 0 dB (denoised) |  |  |  |  |  |
| He was still in the forest! |  | Sad, 5 dB (denoised) |  |  |  |  |  |
| He was still in the forest! |  | Sad, 10 dB (denoised) |  |  |  |  |  |
{: .table1_2}
| Content | Speaker reference | Emotion reference | GenerSpeech | Daft-Exprt | NoreSpeech | Vall-E | NCE-TTS |
|---|---|---|---|---|---|---|---|
| He was still in the forest! |  | Surprise, clean |  |  |  |  |  |
| He was still in the forest! |  | Surprise, 0 dB |  |  |  |  |  |
| He was still in the forest! |  | Surprise, 5 dB |  |  |  |  |  |
| He was still in the forest! |  | Surprise, 10 dB |  |  |  |  |  |
| He was still in the forest! |  | Surprise, 0 dB (denoised) |  |  |  |  |  |
| He was still in the forest! |  | Surprise, 5 dB (denoised) |  |  |  |  |  |
| He was still in the forest! |  | Surprise, 10 dB (denoised) |  |  |  |  |  |
{: .table1_2}
4. Visualization Study
In the following, we demonstrate the trends of the pitch curves of the synthesized audio under different noise and emotion conditions, using the same content text with varying emotion references.
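A pitch (F0) contour like the ones visualized here can be extracted frame by frame from a waveform. The sketch below uses a simple autocorrelation estimator for illustration; it is not the tooling used for these figures (more robust extractors such as pYIN are common in practice), and all names and parameters are our own:

```python
import numpy as np

def pitch_curve(signal, sr, frame_len=1024, hop=256, fmin=60.0, fmax=400.0):
    """Estimate a frame-wise F0 contour via autocorrelation.
    Searches for the strongest periodicity between fmin and fmax Hz."""
    lag_min = int(sr / fmax)          # shortest period considered
    lag_max = int(sr / fmin)          # longest period considered
    f0 = []
    for start in range(0, len(signal) - frame_len, hop):
        frame = signal[start:start + frame_len]
        frame = frame - frame.mean()
        # One-sided autocorrelation of the frame.
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
        f0.append(sr / lag)
    return np.array(f0)
```

Plotting such contours for the same sentence under different emotion references makes the prosodic differences (e.g. higher, more variable pitch for Surprise than for Sad) directly visible.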