Rui Liu, Member, IEEE¹, Kailin Liang¹, De Hu¹, Tao Li², Dongchao Yang³, Haizhou Li, Fellow, IEEE⁴,⁵

1 Inner Mongolia University

2 Northwestern Polytechnical University

3 The Chinese University of Hong Kong, Hong Kong, China

4 Shenzhen Research Institute of Big Data, School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China

5 National University of Singapore, Singapore

Introduction

Cross-speaker emotion transfer (CSEF) in text-to-speech (TTS) synthesis aims to synthesize speech for a target speaker with the emotion transferred from reference speech recorded by another (source) speaker. Previous CSEF work adopted speaker-emotion decoupling strategies and achieved remarkable emotion transfer performance. However, when the reference speech is contaminated with noise, extracting clean emotion features and decoupling the speaker and emotion features become challenging, which degrades emotion transfer. To address these issues, we propose a novel Noise-robust Cross-Speaker Emotion Transfer TTS model, termed NCE-TTS. NCE-TTS integrates noise-robust emotion information extraction and noise-robust speaker-emotion disentanglement into a unified framework through two new modules: 1) Knowledge Distillation and 2) Orthogonal Constraint. Knowledge distillation learns the emotion features of clean speech directly from noisy speech, using a conditional diffusion model. The orthogonal constraint disentangles the deep emotion embedding from the speaker embedding and further enhances the emotion-discriminative ability. Unlike the traditional cascaded approach of first denoising and then extracting features, we build a new training framework that achieves better emotion transfer results in noisy scenarios. We conducted extensive experiments on the multi-speaker English emotional speech dataset ESD. The objective and subjective results demonstrate that NCE-TTS can synthesize emotionally rich speech while preserving the target speaker's voice in various noisy scenarios, with a significant improvement over all advanced baselines.
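
To make the idea of an orthogonal constraint concrete, the sketch below penalizes the squared cosine similarity between paired emotion and speaker embeddings, so the penalty is zero when each pair is orthogonal. This is only an illustration under assumptions: this page does not give the actual loss formulation, and the function names `cosine_similarity` and `orthogonality_penalty` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def orthogonality_penalty(emotion_embs, speaker_embs):
    """Mean squared cosine similarity over paired embeddings.

    Hypothetical stand-in for an orthogonal constraint: the value is 0 when
    every emotion embedding is orthogonal to its speaker embedding and 1
    when every pair is collinear.
    """
    if len(emotion_embs) != len(speaker_embs) or not emotion_embs:
        raise ValueError("need the same non-zero number of embeddings")
    sims = [cosine_similarity(e, s) ** 2 for e, s in zip(emotion_embs, speaker_embs)]
    return sum(sims) / len(sims)


# Toy batch: the first pair is orthogonal, the second pair is collinear.
emotion = [[1.0, 0.0], [1.0, 2.0]]
speaker = [[0.0, 1.0], [2.0, 4.0]]
penalty = orthogonality_penalty(emotion, speaker)
```

In a training setup, a term of this kind would typically be added to the synthesis loss with a weighting factor; the weighting used by NCE-TTS is not stated here.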

Overview

The figure below gives an overview of NCE-TTS.

Below, we present samples generated by our proposed method.

1. Direct Comparison of Speech Synthesized by NCE-TTS and the Baseline Models

We first compare NCE-TTS directly with the baselines, using emotion reference audio of different emotion types.

Content	Speaker reference	Emotion reference	GenerSpeech	Daft-Exprt	NoreSpeech	Vall-E	NCE-TTS
He was still in the forest!		Angry, clean
He was still in the forest!		Happy, clean
He was still in the forest!		Neutral, clean
He was still in the forest!		Sad, clean
He was still in the forest!		Surprise, clean
{: .table0}

Content	Speaker reference	Emotion reference	GenerSpeech	Daft-Exprt	NoreSpeech	Vall-E	NCE-TTS
He was still in the forest!		Angry, 5 dB
He was still in the forest!		Happy, 5 dB
He was still in the forest!		Neutral, 5 dB
He was still in the forest!		Sad, 5 dB
He was still in the forest!		Surprise, 5 dB
{: .table1_2}

2. Speaker Parallel Emotion Transfer under Various Noise Conditions

This section presents Speaker Parallel Transfer cases, with speech synthesized under different emotion and noise conditions.
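
For readers who want to reproduce comparable noisy references, the sketch below mixes a noise signal into clean speech at a chosen level. It assumes that the 0 dB, 5 dB, and 10 dB labels in the tables denote signal-to-noise ratios (SNR), which this page does not state explicitly; the helper names `mix_at_snr` and `measured_snr_db` and the toy signals are illustrative only.

```python
import math
import random


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the speech-to-noise power ratio is `snr_db`."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = mean_power(speech)
    p_noise = mean_power(noise)
    if p_speech == 0.0 or p_noise == 0.0:
        raise ValueError("speech and noise must both have non-zero power")
    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)), solved for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """SNR in dB of `noisy` relative to the clean `speech` it was built from."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))


# Toy stand-ins for real audio: a 220 Hz tone as "speech", Gaussian noise as "noise".
rng = random.Random(0)
sample_rate = 16000
speech = [math.sin(2 * math.pi * 220 * i / sample_rate) for i in range(sample_rate)]
noise = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]

# One noisy copy per condition label used in the tables.
noisy_refs = {snr: mix_at_snr(speech, noise, snr) for snr in (0, 5, 10)}
```

Lower SNR values mean stronger noise, so under this assumption the 0 dB condition is the hardest of the three.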

Content	Speaker reference	Emotion reference	GenerSpeech	Daft-Exprt	NoreSpeech	Vall-E	NCE-TTS
I smell the breath of an English.		Angry, clean
I smell the breath of an English.		Angry, 0 dB
I smell the breath of an English.		Angry, 5 dB
I smell the breath of an English.		Angry, 10 dB
I smell the breath of an English.		Angry, 0 dB (denoised)
I smell the breath of an English.		Angry, 5 dB (denoised)
I smell the breath of an English.		Angry, 10 dB (denoised)
{: .table1_2}

Content	Speaker reference	Emotion reference	GenerSpeech	Daft-Exprt	NoreSpeech	Vall-E	NCE-TTS
I smell the breath of an English.		Happy, clean
I smell the breath of an English.		Happy, 0 dB
I smell the breath of an English.		Happy, 5 dB
I smell the breath of an English.		Happy, 10 dB
I smell the breath of an English.		Happy, 0 dB (denoised)
I smell the breath of an English.		Happy, 5 dB (denoised)
I smell the breath of an English.		Happy, 10 dB (denoised)
{: .table1_2}

Content	Speaker reference	Emotion reference	GenerSpeech	Daft-Exprt	NoreSpeech	Vall-E	NCE-TTS
I smell the breath of an English.		Neutral, clean
I smell the breath of an English.		Neutral, 0 dB
I smell the breath of an English.		Neutral, 5 dB
I smell the breath of an English.		Neutral, 10 dB
I smell the breath of an English.		Neutral, 0 dB (denoised)
I smell the breath of an English.		Neutral, 5 dB (denoised)
I smell the breath of an English.		Neutral, 10 dB (denoised)
{: .table1_2}

Content	Speaker reference	Emotion reference	GenerSpeech	Daft-Exprt	NoreSpeech	Vall-E	NCE-TTS
I smell the breath of an English.		Sad, clean
I smell the breath of an English.		Sad, 0 dB
I smell the breath of an English.		Sad, 5 dB
I smell the breath of an English.		Sad, 10 dB
I smell the breath of an English.		Sad, 0 dB (denoised)
I smell the breath of an English.		Sad, 5 dB (denoised)
I smell the breath of an English.		Sad, 10 dB (denoised)
{: .table1_2}

Content	Speaker reference	Emotion reference	GenerSpeech	Daft-Exprt	NoreSpeech	Vall-E	NCE-TTS
I smell the breath of an English.		Surprise, clean
I smell the breath of an English.		Surprise, 0 dB
I smell the breath of an English.		Surprise, 5 dB
I smell the breath of an English.		Surprise, 10 dB
I smell the breath of an English.		Surprise, 0 dB (denoised)
I smell the breath of an English.		Surprise, 5 dB (denoised)
I smell the breath of an English.		Surprise, 10 dB (denoised)
{: .table1_2}

3. Speaker Non-Parallel Emotion Transfer under Various Noise Conditions

This section presents Speaker Non-Parallel Transfer cases, with speech synthesized under different emotion and noise conditions.

Content	Speaker reference	Emotion reference	GenerSpeech	Daft-Exprt	NoreSpeech	Vall-E	NCE-TTS
He was still in the forest!		Angry, clean
He was still in the forest!		Angry, 0 dB
He was still in the forest!		Angry, 5 dB
He was still in the forest!		Angry, 10 dB
He was still in the forest!		Angry, 0 dB (denoised)
He was still in the forest!		Angry, 5 dB (denoised)
He was still in the forest!		Angry, 10 dB (denoised)
{: .table1_2}

Content	Speaker reference	Emotion reference	GenerSpeech	Daft-Exprt	NoreSpeech	Vall-E	NCE-TTS
He was still in the forest!		Happy, clean
He was still in the forest!		Happy, 0 dB
He was still in the forest!		Happy, 5 dB
He was still in the forest!		Happy, 10 dB
He was still in the forest!		Happy, 0 dB (denoised)
He was still in the forest!		Happy, 5 dB (denoised)
He was still in the forest!		Happy, 10 dB (denoised)
{: .table1_2}

Content	Speaker reference	Emotion reference	GenerSpeech	Daft-Exprt	NoreSpeech	Vall-E	NCE-TTS
He was still in the forest!		Neutral, clean
He was still in the forest!		Neutral, 0 dB
He was still in the forest!		Neutral, 5 dB
He was still in the forest!		Neutral, 10 dB
He was still in the forest!		Neutral, 0 dB (denoised)
He was still in the forest!		Neutral, 5 dB (denoised)
He was still in the forest!		Neutral, 10 dB (denoised)
{: .table1_2}

Content	Speaker reference	Emotion reference	GenerSpeech	Daft-Exprt	NoreSpeech	Vall-E	NCE-TTS
He was still in the forest!		Sad, clean
He was still in the forest!		Sad, 0 dB
He was still in the forest!		Sad, 5 dB
He was still in the forest!		Sad, 10 dB
He was still in the forest!		Sad, 0 dB (denoised)
He was still in the forest!		Sad, 5 dB (denoised)
He was still in the forest!		Sad, 10 dB (denoised)
{: .table1_2}

Content	Speaker reference	Emotion reference	GenerSpeech	Daft-Exprt	NoreSpeech	Vall-E	NCE-TTS
He was still in the forest!		Surprise, clean
He was still in the forest!		Surprise, 0 dB
He was still in the forest!		Surprise, 5 dB
He was still in the forest!		Surprise, 10 dB
He was still in the forest!		Surprise, 0 dB (denoised)
He was still in the forest!		Surprise, 5 dB (denoised)
He was still in the forest!		Surprise, 10 dB (denoised)
{: .table1_2}

4. Visualization Study

The following pitch curves show how the pitch of the synthesized audio varies across noise and emotion conditions, using the same content text and varying emotion references.
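
As a rough illustration of what a pitch curve is, the sketch below estimates a frame-level fundamental frequency (F0) contour with a plain autocorrelation peak search. The pitch extractor actually used for the figures is not stated on this page, so this is a swapped-in textbook method; the function names, the 75 to 400 Hz search range, and the synthetic test tone are all assumptions.

```python
import math


def estimate_f0(frame, sample_rate, f0_min=75.0, f0_max=400.0):
    """Estimate the F0 (Hz) of one frame from its autocorrelation peak.

    Returns 0.0 for silent frames or when no positive peak is found.
    """
    n = len(frame)
    mean = sum(frame) / n
    x = [s - mean for s in frame]  # remove any DC offset
    if sum(s * s for s in x) == 0.0:
        return 0.0
    lag_min = int(sample_rate / f0_max)
    lag_max = min(int(sample_rate / f0_min), n - 1)
    best_lag, best_corr = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        corr = sum(x[i] * x[i + lag] for i in range(n - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag if best_lag else 0.0


def pitch_contour(signal, sample_rate, frame_len=640, hop=320):
    """F0 estimate for each frame of `signal` (frame and hop sizes in samples)."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    return [estimate_f0(signal[s:s + frame_len], sample_rate) for s in starts]


# Synthetic check signal: 0.1 s of a 200 Hz tone sampled at 16 kHz.
sample_rate = 16000
tone = [math.sin(2 * math.pi * 200 * i / sample_rate) for i in range(1600)]
contour = pitch_contour(tone, sample_rate)
```

Plotting such a contour over time for each synthesized sample would give curves comparable in kind to the ones shown in this section, although real speech usually calls for a more robust estimator with voicing detection.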

Content	Clean emotion reference	5 dB emotion reference
I am going to back home.	Angry, clean	Angry, 5 dB
I am going to back home.	Happy, clean	Happy, 5 dB
I am going to back home.	Neutral, clean	Angry, 5 dB
I am going to back home.	Sad, clean	Sad, 5 dB
I am going to back home.	Surprise, clean	Sad, 5 dB
{: .table3}

5. Ablation Study

The following samples come from the ablation experiments.

Content	Speaker reference	Emotion reference	w/o L_ort	w/o L_op	w/o L_kd	NCE-TTS
It's part of my secret.		Angry, clean
It's part of my secret.		Angry, 5 dB
{: .table4}

Content	Speaker reference	Emotion reference	w/o L_ort	w/o L_op	w/o L_kd	NCE-TTS
It's part of my secret.		Happy, clean
It's part of my secret.		Happy, 5 dB
{: .table4}

Content	Speaker reference	Emotion reference	w/o L_ort	w/o L_op	w/o L_kd	NCE-TTS
It's part of my secret.		Neutral, clean
It's part of my secret.		Neutral, 5 dB
{: .table4}

Content	Speaker reference	Emotion reference	w/o L_ort	w/o L_op	w/o L_kd	NCE-TTS
It's part of my secret.		Sad, clean
It's part of my secret.		Sad, 5 dB
{: .table4}

Content	Speaker reference	Emotion reference	w/o L_ort	w/o L_op	w/o L_kd	NCE-TTS
It's part of my secret.		Surprise, clean
It's part of my secret.		Surprise, 5 dB
{: .table4}
