<style> .container-lg { max-width: 1900px; margin-right: auto; margin-left: auto; } .pic_mod { display: block; margin: 0 auto; width: 900px; } .pic_pitch { display: block; margin: 0 auto; width: 260px; } .main-content { max-width: 1900px; margin-right: auto; margin-left: auto; } .page-header { color: #fff; text-align: center; background-color: #efaaff; background-image: linear-gradient(342deg, #ff5d5deb, #3ddbffe3); } .main-content h1, .main-content h2, .main-content h3, .main-content h5, .main-content h6 { margin-top: 2rem; margin-bottom: 1rem; font-weight: normal; color: #2778aa; } .main-content h4 { margin-top: 1rem; margin-bottom: 1rem; font-weight: normal; color: #606c71; } </style>

Rui Liu, Member, IEEE<sup>1</sup>, Kailin Liang<sup>1</sup>, De Hu<sup>1</sup>, Tao Li<sup>2</sup>, Dongchao Yang<sup>3</sup>, Haizhou Li, Fellow, IEEE<sup>4,5</sup>

<sup>1</sup> Inner Mongolia University

<sup>2</sup> Northwestern Polytechnical University

<sup>3</sup> The Chinese University of Hong Kong, Hong Kong, China

<sup>4</sup> Shenzhen Research Institute of Big Data, School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), China

<sup>5</sup> National University of Singapore, Singapore

Introduction

The cross-speaker emotion transfer (CSEF) task in text-to-speech (TTS) synthesis aims to synthesize speech for a target speaker with the emotion transferred from reference speech recorded by another (source) speaker. Previous CSEF work adopted speaker-emotion decoupling strategies and achieved remarkable emotion transfer performance. However, when the reference speech is contaminated with noise, extracting clean emotion features and decoupling the speaker and emotion features become challenging, which degrades emotion transfer. To address these issues, we propose a novel Noise-robust Cross-Speaker Emotion Transfer TTS model, termed NCE-TTS. NCE-TTS integrates noise-robust emotion information extraction and noise-robust speaker-emotion disentanglement into a unified framework through two new modules: 1) Knowledge Distillation and 2) Orthogonal Constraint. The knowledge distillation module uses a conditional diffusion model to learn the emotion features of clean speech directly from noisy speech. The orthogonal constraint disentangles the deep emotion embedding from the speaker embedding and further enhances the emotion-discriminative ability. Unlike the traditional cascaded approach of first denoising and then extracting features, our training framework achieves better emotion transfer results in noisy scenarios. We conducted extensive experiments on the multi-speaker English emotional speech dataset ESD. The objective and subjective results demonstrate that NCE-TTS can synthesize emotionally rich speech while preserving the target speaker's voice in various noisy scenarios, with a significant improvement over all advanced baselines.
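The exact form of the orthogonal constraint is not given on this page. As a rough illustration only, one common way to push two embeddings toward orthogonality is to penalize their squared cosine similarity. The function names below (`cosine_similarity`, `orthogonality_penalty`) and the choice of penalty are assumptions made for this sketch, not the NCE-TTS loss itself.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def orthogonality_penalty(emotion_embs, speaker_embs):
    """Mean squared cosine similarity over a batch of (emotion, speaker) embedding pairs.

    Hypothetical stand-in for an orthogonal constraint: the value is 0 when every
    pair is orthogonal and 1 when every pair is collinear.
    """
    if not emotion_embs or len(emotion_embs) != len(speaker_embs):
        raise ValueError("expected two non-empty batches of equal size")
    squared = [
        cosine_similarity(e, s) ** 2 for e, s in zip(emotion_embs, speaker_embs)
    ]
    return sum(squared) / len(squared)


if __name__ == "__main__":
    # Orthogonal pair: no penalty. Collinear pair: maximum penalty.
    print(orthogonality_penalty([[1.0, 0.0]], [[0.0, 1.0]]))
    print(orthogonality_penalty([[1.0, 0.0]], [[2.0, 0.0]]))
```

Minimizing such a term alongside the synthesis loss would discourage the emotion embedding from carrying speaker information, which is the intuition the paragraph above describes.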

Overview

The following figure gives an overview of NCE-TTS.

Figure: The overview of NCE-TTS

Below, we present samples generated by our proposed method.

<style> .audio-player { width: 200px; } .audio-player2 { width: 150px; } </style>

1. Direct Comparison of Speech Synthesized by NCE-TTS and the Baseline Models

<style> .table0 th:nth-of-type(1) { width: 170px; } .table0 th:nth-of-type(2) { width: 210px; /* background-color: blue; */ } .table0 th:nth-of-type(3) { width: 240px; } .table0 th:nth-of-type(4) { width: 210px; } </style>

We first compare NCE-TTS directly with the baselines, using emotion reference audio of different emotion types.

| Content | Speaker reference | Emotion reference | GenerSpeech | Daft-Exprt | NoreSpeech | Vall-E | NCE-TTS |
|---|---|---|---|---|---|---|---|
| He was still in the forest! | | Angry, clean | | | | | |
| He was still in the forest! | | Happy, clean | | | | | |
| He was still in the forest! | | Neutral, clean | | | | | |
| He was still in the forest! | | Sad, clean | | | | | |
| He was still in the forest! | | Surprise, clean | | | | | |
{: .table0}

| Content | Speaker reference | Emotion reference | GenerSpeech | Daft-Exprt | NoreSpeech | Vall-E | NCE-TTS |
|---|---|---|---|---|---|---|---|
| He was still in the forest! | | Angry, 5 dB | | | | | |
| He was still in the forest! | | Happy, 5 dB | | | | | |
| He was still in the forest! | | Neutral, 5 dB | | | | | |
| He was still in the forest! | | Sad, 5 dB | | | | | |
| He was still in the forest! | | Surprise, 5 dB | | | | | |
{: .table1_2}

2. Speaker Parallel Emotion Transfer on Various Noise Conditions

Below are samples of Speaker Parallel Transfer, synthesized under different emotion and noise conditions.
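The noisy emotion references in this section are labeled by level (0, 5, and 10 dB), presumably the signal-to-noise ratio (SNR). This page does not describe how the noisy references were produced. As a minimal sketch, assuming additive noise scaled to a target SNR, the mixing step could look like the following; `mix_at_snr` and `measured_snr_db` are illustrative names, not part of the NCE-TTS code.

```python
import math
import random


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so that the mixture has the requested SNR in dB."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = mean_power(noise)
    if noise_power == 0.0:
        raise ValueError("noise must not be silent")
    # SNR_dB = 10 * log10(P_speech / P_noise), so solve for the noise power we need.
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, noisy):
    """SNR in dB of `noisy` relative to the clean `speech` it was built from."""
    residual = [y - s for s, y in zip(speech, noisy)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))


if __name__ == "__main__":
    random.seed(0)
    # Toy stand-ins: a 220 Hz tone as "speech" and white Gaussian noise.
    speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]
    noise = [random.gauss(0.0, 1.0) for _ in range(1600)]
    for snr in (0, 5, 10):
        noisy = mix_at_snr(speech, noise, snr)
        print(snr, round(measured_snr_db(speech, noisy), 3))
```

At 0 dB the noise carries as much power as the speech, so it is the hardest of the three conditions for emotion feature extraction.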

<style> .table1_2 th:nth-of-type(1) { width: 170px; } .table1_2 th:nth-of-type(2) { width: 210px; /* background-color: blue; */ } .table1_2 th:nth-of-type(3) { width: 240px; } .table1_2 th:nth-of-type(4) { width: 210px; } </style>
| Content | Speaker reference | Emotion reference | GenerSpeech | Daft-Exprt | NoreSpeech | Vall-E | NCE-TTS |
|---|---|---|---|---|---|---|---|
| I smell the breath of an English. | | Angry, clean | | | | | |
| I smell the breath of an English. | | Angry, 0 dB | | | | | |
| I smell the breath of an English. | | Angry, 5 dB | | | | | |
| I smell the breath of an English. | | Angry, 10 dB | | | | | |
| I smell the breath of an English. | | Angry, 0 dB (denoised) | | | | | |
| I smell the breath of an English. | | Angry, 5 dB (denoised) | | | | | |
| I smell the breath of an English. | | Angry, 10 dB (denoised) | | | | | |
{: .table1_2}

| Content | Speaker reference | Emotion reference | GenerSpeech | Daft-Exprt | NoreSpeech | Vall-E | NCE-TTS |
|---|---|---|---|---|---|---|---|
| I smell the breath of an English. | | Happy, clean | | | | | |
| I smell the breath of an English. | | Happy, 0 dB | | | | | |
| I smell the breath of an English. | | Happy, 5 dB | | | | | |
| I smell the breath of an English. | | Happy, 10 dB | | | | | |
| I smell the breath of an English. | | Happy, 0 dB (denoised) | | | | | |
| I smell the breath of an English. | | Happy, 5 dB (denoised) | | | | | |
| I smell the breath of an English. | | Happy, 10 dB (denoised) | | | | | |
{: .table1_2}

| Content | Speaker reference | Emotion reference | GenerSpeech | Daft-Exprt | NoreSpeech | Vall-E | NCE-TTS |
|---|---|---|---|---|---|---|---|
| I smell the breath of an English. | | Neutral, clean | | | | | |
| I smell the breath of an English. | | Neutral, 0 dB | | | | | |
| I smell the breath of an English. | | Neutral, 5 dB | | | | | |
| I smell the breath of an English. | | Neutral, 10 dB | | | | | |
| I smell the breath of an English. | | Neutral, 0 dB (denoised) | | | | | |
| I smell the breath of an English. | | Neutral, 5 dB (denoised) | | | | | |
| I smell the breath of an English. | | Neutral, 10 dB (denoised) | | | | | |
{: .table1_2}

| Content | Speaker reference | Emotion reference | GenerSpeech | Daft-Exprt | NoreSpeech | Vall-E | NCE-TTS |
|---|---|---|---|---|---|---|---|
| I smell the breath of an English. | | Sad, clean | | | | | |
| I smell the breath of an English. | | Sad, 0 dB | | | | | |
| I smell the breath of an English. | | Sad, 5 dB | | | | | |
| I smell the breath of an English. | | Sad, 10 dB | | | | | |
| I smell the breath of an English. | | Sad, 0 dB (denoised) | | | | | |
| I smell the breath of an English. | | Sad, 5 dB (denoised) | | | | | |
| I smell the breath of an English. | | Sad, 10 dB (denoised) | | | | | |
{: .table1_2}

| Content | Speaker reference | Emotion reference | GenerSpeech | Daft-Exprt | NoreSpeech | Vall-E | NCE-TTS |
|---|---|---|---|---|---|---|---|
| I smell the breath of an English. | | Surprise, clean | | | | | |
| I smell the breath of an English. | | Surprise, 0 dB | | | | | |
| I smell the breath of an English. | | Surprise, 5 dB | | | | | |
| I smell the breath of an English. | | Surprise, 10 dB | | | | | |
| I smell the breath of an English. | | Surprise, 0 dB (denoised) | | | | | |
| I smell the breath of an English. | | Surprise, 5 dB (denoised) | | | | | |
| I smell the breath of an English. | | Surprise, 10 dB (denoised) | | | | | |
{: .table1_2}

3. Speaker Non-Parallel Emotion Transfer on Various Noise Conditions

Below are samples of Speaker Non-Parallel Transfer, synthesized under different emotion and noise conditions.

| Content | Speaker reference | Emotion reference | GenerSpeech | Daft-Exprt | NoreSpeech | Vall-E | NCE-TTS |
|---|---|---|---|---|---|---|---|
| He was still in the forest! | | Angry, clean | | | | | |
| He was still in the forest! | | Angry, 0 dB | | | | | |
| He was still in the forest! | | Angry, 5 dB | | | | | |
| He was still in the forest! | | Angry, 10 dB | | | | | |
| He was still in the forest! | | Angry, 0 dB (denoised) | | | | | |
| He was still in the forest! | | Angry, 5 dB (denoised) | | | | | |
| He was still in the forest! | | Angry, 10 dB (denoised) | | | | | |
{: .table1_2}

| Content | Speaker reference | Emotion reference | GenerSpeech | Daft-Exprt | NoreSpeech | Vall-E | NCE-TTS |
|---|---|---|---|---|---|---|---|
| He was still in the forest! | | Happy, clean | | | | | |
| He was still in the forest! | | Happy, 0 dB | | | | | |
| He was still in the forest! | | Happy, 5 dB | | | | | |
| He was still in the forest! | | Happy, 10 dB | | | | | |
| He was still in the forest! | | Happy, 0 dB (denoised) | | | | | |
| He was still in the forest! | | Happy, 5 dB (denoised) | | | | | |
| He was still in the forest! | | Happy, 10 dB (denoised) | | | | | |
{: .table1_2}

| Content | Speaker reference | Emotion reference | GenerSpeech | Daft-Exprt | NoreSpeech | Vall-E | NCE-TTS |
|---|---|---|---|---|---|---|---|
| He was still in the forest! | | Neutral, clean | | | | | |
| He was still in the forest! | | Neutral, 0 dB | | | | | |
| He was still in the forest! | | Neutral, 5 dB | | | | | |
| He was still in the forest! | | Neutral, 10 dB | | | | | |
| He was still in the forest! | | Neutral, 0 dB (denoised) | | | | | |
| He was still in the forest! | | Neutral, 5 dB (denoised) | | | | | |
| He was still in the forest! | | Neutral, 10 dB (denoised) | | | | | |
{: .table1_2}

| Content | Speaker reference | Emotion reference | GenerSpeech | Daft-Exprt | NoreSpeech | Vall-E | NCE-TTS |
|---|---|---|---|---|---|---|---|
| He was still in the forest! | | Sad, clean | | | | | |
| He was still in the forest! | | Sad, 0 dB | | | | | |
| He was still in the forest! | | Sad, 5 dB | | | | | |
| He was still in the forest! | | Sad, 10 dB | | | | | |
| He was still in the forest! | | Sad, 0 dB (denoised) | | | | | |
| He was still in the forest! | | Sad, 5 dB (denoised) | | | | | |
| He was still in the forest! | | Sad, 10 dB (denoised) | | | | | |
{: .table1_2}

| Content | Speaker reference | Emotion reference | GenerSpeech | Daft-Exprt | NoreSpeech | Vall-E | NCE-TTS |
|---|---|---|---|---|---|---|---|
| He was still in the forest! | | Surprise, clean | | | | | |
| He was still in the forest! | | Surprise, 0 dB | | | | | |
| He was still in the forest! | | Surprise, 5 dB | | | | | |
| He was still in the forest! | | Surprise, 10 dB | | | | | |
| He was still in the forest! | | Surprise, 0 dB (denoised) | | | | | |
| He was still in the forest! | | Surprise, 5 dB (denoised) | | | | | |
| He was still in the forest! | | Surprise, 10 dB (denoised) | | | | | |
{: .table1_2}

4. Visualization Study

Below, we show how the pitch curves of the synthesized audio vary under different noise and emotion conditions, using the same content text with varying emotion references.
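This page does not say which tool produced the pitch curves. As a rough sketch of how such a curve can be obtained, the following estimates the fundamental frequency (F0) of each frame from the peak of its normalized autocorrelation. The function names, the 60 to 400 Hz search range, and the 0.3 voicing threshold are assumptions chosen for illustration; dedicated pitch trackers are considerably more robust.

```python
import math


def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0, voicing_threshold=0.3):
    """Estimate F0 (Hz) of one frame from its autocorrelation peak; 0.0 means unvoiced."""
    n = len(frame)
    mean = sum(frame) / n
    x = [v - mean for v in frame]  # remove DC offset before correlating
    energy = sum(v * v for v in x)
    if energy == 0.0:
        return 0.0
    lag_min = max(1, int(sample_rate / f0_max))
    lag_max = min(int(sample_rate / f0_min), n - 1)
    best_lag, best_corr = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Autocorrelation at this lag, normalized by the frame energy.
        corr = sum(x[i] * x[i + lag] for i in range(n - lag)) / energy
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    if best_lag == 0 or best_corr < voicing_threshold:
        return 0.0
    return sample_rate / best_lag


def pitch_contour(signal, sample_rate, frame_len, hop):
    """F0 estimate for each frame of `signal`, stepping `hop` samples at a time."""
    return [
        estimate_f0(signal[start:start + frame_len], sample_rate)
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


if __name__ == "__main__":
    sr = 8000
    tone = [math.sin(2 * math.pi * 200 * i / sr) for i in range(800)]
    print(pitch_contour(tone, sr, frame_len=400, hop=200))
```

Plotting the contour of each synthesized utterance against time would give curves comparable in spirit to those in the table below: emotions such as Angry or Surprise typically show a wider F0 range than Neutral or Sad.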

<style> .table3 th:nth-of-type(3) { width: 210px; /* background-color: red; */ } .table3 th:nth-of-type(5) { width: 210px; /* background-color: green; */ } </style>
| Content | Speaker reference | Clean emotion reference | Synthesized audio | 5 dB emotion reference | Synthesized audio | Pitch curve |
|---|---|---|---|---|---|---|
| I am going to back home. | | Angry, clean | | Angry, 5 dB | | (pitch curve image) |
| I am going to back home. | | Happy, clean | | Happy, 5 dB | | (pitch curve image) |
| I am going to back home. | | Neutral, clean | | Angry, 5 dB | | (pitch curve image) |
| I am going to back home. | | Sad, clean | | Sad, 5 dB | | (pitch curve image) |
| I am going to back home. | | Surprise, clean | | Sad, 5 dB | | (pitch curve image) |
{: .table3}

5. Ablation Study

Below, we present samples from the ablation experiments.

<style> .table4 th:nth-of-type(2) { width: 210px; } .table4 th:nth-of-type(3) { width: 210px; } .table4 th:nth-of-type(4) { width: 210px; } </style>
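| Content | Speaker reference | Emotion reference | w/o L<sub>ort</sub> | w/o L<sub>op</sub> | w/o L<sub>kd</sub> | NCE-TTS |
|---|---|---|---|---|---|---|
| It's part of my secret. | | Angry, clean | | | | |
| It's part of my secret. | | Angry, 5 dB | | | | |
{: .table4}

| Content | Speaker reference | Emotion reference | w/o L<sub>ort</sub> | w/o L<sub>op</sub> | w/o L<sub>kd</sub> | NCE-TTS |
|---|---|---|---|---|---|---|
| It's part of my secret. | | Happy, clean | | | | |
| It's part of my secret. | | Happy, 5 dB | | | | |
{: .table4}

| Content | Speaker reference | Emotion reference | w/o L<sub>ort</sub> | w/o L<sub>op</sub> | w/o L<sub>kd</sub> | NCE-TTS |
|---|---|---|---|---|---|---|
| It's part of my secret. | | Neutral, clean | | | | |
| It's part of my secret. | | Neutral, 5 dB | | | | |
{: .table4}

| Content | Speaker reference | Emotion reference | w/o L<sub>ort</sub> | w/o L<sub>op</sub> | w/o L<sub>kd</sub> | NCE-TTS |
|---|---|---|---|---|---|---|
| It's part of my secret. | | Sad, clean | | | | |
| It's part of my secret. | | Sad, 5 dB | | | | |
{: .table4}

| Content | Speaker reference | Emotion reference | w/o L<sub>ort</sub> | w/o L<sub>op</sub> | w/o L<sub>kd</sub> | NCE-TTS |
|---|---|---|---|---|---|---|
| It's part of my secret. | | Surprise, clean | | | | |
| It's part of my secret. | | Surprise, 5 dB | | | | |
{: .table4}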