Speech Synthesis Paper

A list of speech synthesis papers (-> more papers <-). You are welcome to recommend more awesome papers 😀.

Repositories collecting awesome speech papers:

awesome-speech-recognition-speech-synthesis-papers (from ponyzhang)
awesome-python-scientific-audio (from Fabian-Robert Stöter)
TTS-papers (from Eren Gölge)
awesome-speech-enhancement (from Vincent Liu)
speech-recognition-papers (from Xingchen Song)
awesome-tts-samples (from Seung-won Park)
awesome-speech-translation (from dqqcasia)
A Survey on Neural Speech Synthesis (from tts-tutorial)

What does '★' mean? I add '★' to papers with more than 50 citations (only in Acoustic Model, Vocoder and TTS towards Stylization). Beginners can read these papers first to get basic knowledge of deep-learning-based TTS models (#1).

Content

TTS Frontend
Acoustic Model
Vocoder
TTS towards Stylization
Voice Conversion
Singing
- Singing Voice Synthesis
- Singing Voice Conversion

TTS Frontend

Pre-trained Text Representations for Improving Front-End Text Processing in Mandarin Text-to-Speech Synthesis (Interspeech 2019)
A unified sequence-to-sequence front-end model for Mandarin text-to-speech synthesis (ICASSP 2020)
A hybrid text normalization system using multi-head self-attention for mandarin (ICASSP 2020)
Unified Mandarin TTS Front-end Based on Distilled BERT Model (2021-01)

Acoustic Model

Vocoder

Autoregressive Model

WaveNet^★: WaveNet: A Generative Model for Raw Audio (2016)
WaveRNN^★: Efficient Neural Audio Synthesis (ICML 2018)
WaveGAN^★: Adversarial Audio Synthesis (ICLR 2019)
LPCNet^★: LPCNet: Improving Neural Speech Synthesis Through Linear Prediction (ICASSP 2019)
Towards achieving robust universal neural vocoding (Interspeech 2019)
GAN-TTS: High Fidelity Speech Synthesis with Adversarial Networks (2019)
MultiBand-WaveRNN: DurIAN: Duration Informed Attention Network For Multimodal Synthesis (2019)
Chunked Autoregressive GAN for Conditional Waveform Synthesis (2021-10)
Improved LPCNet: Neural Speech Synthesis on a Shoestring: Improving the Efficiency of LPCNet (ICASSP 2022)
Bunched LPCNet2: Bunched LPCNet2: Efficient Neural Vocoders Covering Devices from Cloud to Edge (2022-03)

Non-Autoregressive Model

Parallel-WaveNet^★: Parallel WaveNet: Fast High-Fidelity Speech Synthesis (2017)
WaveGlow^★: WaveGlow: A Flow-based Generative Network for Speech Synthesis (2018)
Parallel-WaveGAN^★: Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram (2019)
MelGAN^★: MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis (NeurIPS 2019)
MultiBand-MelGAN: Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech (2020)
VocGAN: VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network (Interspeech 2020)
WaveGrad: WaveGrad: Estimating Gradients for Waveform Generation (2020)
DiffWave: DiffWave: A Versatile Diffusion Model for Audio Synthesis (2020)
HiFi-GAN: HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis (NeurIPS 2020)
Parallel-WaveGAN (New): Parallel waveform synthesis based on generative adversarial networks with voicing-aware conditional discriminators (2020-10)
StyleMelGAN: StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization (ICASSP 2021)
Improved parallel WaveGAN vocoder with perceptually weighted spectrogram loss (SLT 2021)
Fre-GAN: Fre-GAN: Adversarial Frequency-consistent Audio Synthesis (Interspeech 2021)
UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation (2021-07)
iSTFTNet: iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform (ICASSP 2022)
Parallel Synthesis for Autoregressive Speech Generation (2022-04)
Avocodo: Avocodo: Generative Adversarial Network for Artifact-free Vocoder (2022-06)

Others

(Robust vocoder): Towards Robust Neural Vocoding for Speech Generation: A Survey (2019)
(Source-filter model based): Neural source-filter waveform models for statistical parametric speech synthesis (TASLP 2019)
NHV: Neural Homomorphic Vocoder (Interspeech 2020)
Universal MelGAN: Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains (2020)
Binaural Speech Synthesis: Neural Synthesis of Binaural Speech From Mono Audio (ICLR 2021)
Checkerboard artifacts in neural vocoder: Upsampling artifacts in neural audio synthesis (ICASSP 2021)
Universal Vocoder Based on Parallel WaveNet: Universal Neural Vocoding with Parallel WaveNet (ICASSP 2021)
(Comparison of discriminator): GAN Vocoder: Multi-Resolution Discriminator Is All You Need (2021-03)
Vocoder Benchmark: VocBench: A Neural Vocoder Benchmark for Speech Synthesis (2021-12)
BigVGAN (Universal vocoder): BigVGAN: A Universal Neural Vocoder with Large-Scale Training (2022-06)

TTS towards Stylization

Expressive TTS

ReferenceEncoder-Tacotron^★: Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron (ICML 2018)
GST-Tacotron^★: Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis (ICML 2018)
Predicting Expressive Speaking Style From Text In End-To-End Speech Synthesis (2018)
GMVAE-Tacotron2^★: Hierarchical Generative Modeling for Controllable Speech Synthesis (ICLR 2019)
BERT-TTS: Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models (2019)
(Multi-style Decouple): Multi-Reference Neural TTS Stylization with Adversarial Cycle Consistency (2019)
(Multi-style Decouple): Multi-reference Tacotron by Intercross Training for Style Disentangling,Transfer and Control in Speech Synthesis (Interspeech 2019)
Mellotron: Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens (2019)
Robust and fine-grained prosody control of end-to-end speech synthesis (ICASSP 2019)
Flowtron (flow based): Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis (2020)
(local style): Fully-hierarchical fine-grained prosody modeling for interpretable speech synthesis (ICASSP 2020)
Controllable Neural Prosody Synthesis (Interspeech 2020)
GraphSpeech: GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech Synthesis (2020-10)
BERT-TTS: Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-end Speech Synthesis (2020-11)
(Global Emotion Style Control): Controllable Emotion Transfer For End-to-End Speech Synthesis (2020-11)
(Phone Level Style Control): Fine-grained Emotion Strength Transfer, Control and Prediction for Emotional Speech Synthesis (2020-11)
(Phone Level Prosody Modelling): Mixture Density Network for Phone-Level Prosody Modelling in Speech Synthesis (ICASSP 2021)
(Phone Level Prosody Modelling): Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis (ICASSP 2021)
PeriodNet: PeriodNet: A non-autoregressive waveform generation model with a structure separating periodic and aperiodic components (ICASSP 2021)
PnG BERT: PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS (Interspeech 2021)
Towards Multi-Scale Style Control for Expressive Speech Synthesis (2021-04)
Learning Robust Latent Representations for Controllable Speech Synthesis (2021-05)
Diverse and Controllable Speech Synthesis with GMM-Based Phone-Level Prosody Modelling (2021-05)
Improving Performance of Seen and Unseen Speech Style Transfer in End-to-end Neural TTS (2021-06)
(Conversational Speech Synthesis): Controllable Context-aware Conversational Speech Synthesis (Interspeech 2021)
DeepRapper: DeepRapper: Neural Rap Generation with Rhyme and Rhythm Modeling (ACL 2021)
Referee: Referee: Towards reference-free cross-speaker style transfer with low-quality data for expressive speech synthesis (2021)
(Text-Based Insertion TTS): Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration (Interspeech 2021)
On the Interplay Between Sparsity, Naturalness, Intelligibility, and Prosody in Speech Synthesis (2021-10)
Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models (2021-10)
TTS for dubbing: Neural Dubber: Dubbing for Videos According to Scripts (NeurIPS 2021)
Word-Level Style Control for Expressive, Non-attentive Speech Synthesis (SPECOM 2021)
MsEmoTTS: MsEmoTTS: Multi-scale emotion transfer, prediction, and control for emotional speech synthesis (2022-01)
Disentangling Style and Speaker Attributes for TTS Style Transfer (2022-01)
Word-level prosody modeling: Unsupervised word-level prosody tagging for controllable speech synthesis (ICASSP 2022)
ProsoSpeech: ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech (ICASSP 2022)
CampNet (speech editing): CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing (2022-02)
vTTS (visual text): vTTS: visual-text to speech (2022-03)
CopyCat2: CopyCat2: A Single Model for Multi-Speaker TTS and Many-to-Many Fine-Grained Prosody Transfer (Interspeech 2022)
Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech (Interspeech 2022)
Expressive, Variable, and Controllable Duration Modelling in TTS (Interspeech 2022)

MultiSpeaker TTS

Meta-Learning for TTS^★: Sample Efficient Adaptive Text-to-Speech (ICLR 2019)
SV-Tacotron^★: Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (NeurIPS 2018)
Deep Voice V3^★: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning (ICLR 2018)
Zero-Shot Multi-Speaker Text-To-Speech with State-of-the-art Neural Speaker Embeddings (ICASSP 2020)
MultiSpeech: MultiSpeech: Multi-Speaker Text to Speech with Transformer (2020)
SC-WaveRNN: Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions (Interspeech 2020)
MultiSpeaker Dataset: AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines (2020)
Life-long learning for multi-speaker TTS: Continual Speaker Adaptation for Text-to-Speech Synthesis (2021-03)
Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation (ICML 2021)
Effective and Differentiated Use of Control Information for Multi-speaker Speech Synthesis (Interspeech 2021)
Speaker Generation (2021-11)
Meta-Voice: Meta-Voice: Fast few-shot style transfer for expressive voice cloning using meta learning (2021-11)

New Perspective on TTS

PromptTTS: PromptTTS: Controllable Text-to-Speech with Text Descriptions (2022-11)
VALL-E: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (2023-01)
InstructTTS: InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt (2023-01)
Spear-TTS: Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision (2023-02)
FoundationTTS: FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model (2023-03)

Voice Conversion

ASR & TTS Based

(introduce PPG into voice conversion): Phonetic posteriorgrams for many-to-one voice conversion without parallel data training (2016)
A Vocoder-free WaveNet Voice Conversion with Non-Parallel Data (2019)
TTS-Skins: TTS Skins: Speaker Conversion via ASR (2019)
Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations (IEEE/ACM TASLP 2019)
One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization (Interspeech 2019)
Cotatron (combine text information with voice conversion system): Cotatron: Transcription-Guided Speech Encoder for Any-to-Many Voice Conversion without Parallel Data (Interspeech 2020)
(TTS & ASR): Voice Conversion by Cascading Automatic Speech Recognition and Text-to-Speech Synthesis with Prosody Transfer (Interspeech 2020)
FragmentVC (wav2vec): FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention (2020)
Towards Natural and Controllable Cross-Lingual Voice Conversion Based on Neural TTS Model and Phonetic Posteriorgram (ICASSP 2021)
(TTS & ASR): On Prosody Modeling for ASR+TTS based Voice Conversion (2021-07)
Cloning one's voice using very limited data in the wild (2021-10)

VAE & Auto-Encoder Based

VAE-VC (VAE based): Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder (2016)
(Speech representation learning by VQ-VAE): Unsupervised speech representation learning using WaveNet autoencoders (2019)
Blow (Flow based): Blow: a single-scale hyperconditioned flow for non-parallel raw-audio voice conversion (NeurIPS 2019)
AutoVC: AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss (2019)
F0-AutoVC: F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder (ICASSP 2020)
One-Shot Voice Conversion by Vector Quantization (ICASSP 2020)
SpeechSplit (auto-encoder): Unsupervised Speech Decomposition via Triple Information Bottleneck (ICML 2020)
NANSY: Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations (NeurIPS 2021)

GAN Based

CycleGAN-VC V1: Parallel-Data-Free Voice Conversion Using Cycle-Consistent Adversarial Networks (2017)
StarGAN-VC: StarGAN-VC: non-parallel many-to-many Voice Conversion Using Star Generative Adversarial Networks (2018)
CycleGAN-VC V2: CycleGAN-VC2: Improved CycleGAN-based Non-parallel Voice Conversion (2019)
CycleGAN-VC V3: CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion (2020)
MaskCycleGAN-VC: MaskCycleGAN-VC: Learning Non-parallel Voice Conversion with Filling in Frames (ICASSP 2021)

Singing

Singing Voice Synthesis

XiaoIce Band: XiaoIce Band: A Melody and Arrangement Generation Framework for Pop Music (KDD 2018)
Mellotron: Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens (2019)
ByteSing: ByteSing: A Chinese Singing Voice Synthesis System Using Duration Allocated Encoder-Decoder Acoustic Models and WaveRNN Vocoders (2020)
JukeBox: Jukebox: A Generative Model for Music (2020)
XiaoIce Sing: XiaoiceSing: A High-Quality and Integrated Singing Voice Synthesis System (2020)
HiFiSinger: HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis (2020)
Sequence-to-sequence Singing Voice Synthesis with Perceptual Entropy Loss (2020)
Learn2Sing: Learn2Sing: Target Speaker Singing Voice Synthesis by learning from a Singing Teacher (2020-11)
MusicBERT: MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training (ACL 2021)
SingGAN (Singing Voice Vocoder): SingGAN: Generative Adversarial Network For High-Fidelity Singing Voice Generation (AAAI 2022)
Background music generation: Video Background Music Generation with Controllable Music Transformer (ACM Multimedia 2021)
Multi-Singer (Singing Voice Vocoder): Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus (ACM Multimedia 2021)
Rapping-singing voice synthesis: Rapping-Singing Voice Synthesis based on Phoneme-level Prosody Control (SSW 11)
VISinger (VITS for Singing Voice Synthesis): VISinger: Variational Inference with Adversarial Learning for End-to-End Singing Voice Synthesis (2021-10)
Opencpop: Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing Voice Synthesis (2022-01)
Learning the Beauty in Songs: Neural Singing Voice Beautifier (ACL 2022)
Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher (2022-03)
MusicLM: MusicLM: Generating Music From Text (2023-01)
SingSong: SingSong: Generating musical accompaniments from singing (2023-01)

Singing Voice Conversion

A Universal Music Translation Network (2018)
Unsupervised Singing Voice Conversion (Interspeech 2019)
PitchNet: PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network (ICASSP 2020)
DurIAN-SC: DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System (Interspeech 2020)
Speech-to-Singing Conversion based on Boundary Equilibrium GAN (Interspeech 2020)
PPG-based singing voice conversion with adversarial representation learning (2020)
