This is a list of speech tasks and datasets that can provide training data for generative AI (AIGC), AI model training, intelligent speech tool development, and speech applications.
I will keep adding new tasks and datasets to this repo.
You are welcome to open an issue or email me at [email protected] to point out any unlisted tasks and datasets!
Task | Datasets | Input Mode | Output Mode | Modeling Target | Level | Description |
---|---|---|---|---|---|---|
Accent Classification | AccentDB Extended Dataset | Audio | Label | Classification | Acoustic, Language | Accent classification involves the recognition and classification of specific speech accents. The possible answers include American, Australian, Bangla, British, Indian, Malayalam, Odiya, Telugu, or Welsh. The objective is to correctly identify these accents based on the given speech samples, contributing to a system's ability to understand and interact with various speakers. |
Accented Text-to-speech | L2-ARCTIC | Text, Audio | Audio | Generation | Acoustic, Language | Accented text-to-speech (TTS) synthesis aims to synthesize speech with a given foreign accent instead of native speech. |
Acoustic Echo Cancellation | AEC Challenge | Audio | Audio | Regression | Acoustic | Acoustic echo cancellation removes echoes, reverberation, and other unwanted added sounds from a signal that passes through an acoustic space. |
Automatic Speech Recognition | LibriSpeech, Common Voice, VoxPopuli, MLS, Libri-light, AISHELL, GigaSpeech, CoVoST, Libriheavy, TED-LIUM, TIMIT, WenetSpeech | Audio | Text | Classification | Content | Speech recognition, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text, is a capability which enables a program to process human speech into a written format. A minimal decoding sketch appears right after this table. |
DeepFake Detection (Spoof Detection) | ASVspoof 2015, ASVspoof 2017, ASVspoof 2019, ASVspoof 2021, ASVspoof 5, ADD Challenge, In-the-Wild, WaveFake, SingFake | Audio | Binary Label | Binary Classification | Acoustic | Audio deepfake detection is a task that aims to distinguish genuine utterances from fake ones via machine learning techniques. |
Dialogue Act Classification | DailyTalk Dataset | Audio | Label | Classification | Understanding | Dialogue act classification aims to identify the primary purpose or function of an utterance within its dialogue context. The possible answers could be question, inform, directive, or commissive. These identification tasks are important, as dialogue acts are central to understanding human conversation and dialogue-based AI system communication. |
Dialogue Act Pairing | DailyTalk Dataset | Audio, Label | Binary Label | Binary Classification | Understanding | Dialogue act pairing involves assessing the congruence of dialogue acts—that is, whether a response dialogue act is appropriate given a query dialogue act. The objective is to determine whether a given dialogue act pairing is congruent or not. The answer could either be true or false. Being able to accurately judge the appropriateness of dialogue acts is key for a universal speech model to understand and participate in human conversations effectively. |
Dialogue Emotion Classification | DailyTalk Dataset | Audio | Label | Classification | Emotion | Dialogue emotion classification is a task that assesses an AI model's ability to identify the most suitable emotion in a given dialogue extract. The main goal of this task is to correctly identify the communicated emotion in an audio clip. Possible answers include anger, disgust, fear, sadness, happiness, surprise, or no emotion. It is an evaluation of the model's capacity to interpret and distinguish emotions conveyed through speech, accounting both for linguistic content and paralinguistic indicators. |
Dysarthric Speech Assessments | UASpeech, TORGO | Audio | Scalar | Regression | Acoustic | Dysarthric speech assessments regarding speech intelligibility are conducted to check the patient’s status and track the effectiveness of treatments. |
Dysarthric Speech Recognition | UASpeech, TORGO | Audio | Text | Classification | Content | Dysarthric speech recognition aims to transcribe dysarthric speech; dysarthria is a motor speech disorder caused by conditions such as Parkinson’s disease or amyotrophic lateral sclerosis (ALS). |
Emotion Recognition | MELD (Multimodal EmotionLines Dataset), IEMOCAP, CREMA-D, MSP-Podcast, SAVEE, MESD, CMU-MOSEI, MEAD | Audio | Label | Classification | Emotion | Emotion recognition aims to identify the most appropriate emotional category for a given utterance. Recognizing the emotion expressed in an utterance can be quite challenging: while the emotion can sometimes be identified from the linguistic content alone, the more important cues often lie in paralinguistic features such as pitch, rhythm, and other prosodic elements. For a universal speech model, understanding these paralinguistic features is crucial, as they distinguish speech from mere text in a significant manner. |
Emotional TTS | RAVDESS, EMOV-DB, LJSpeech Dataset, IEMOCAP | Text, Label | Audio | Generation | Acoustic, Emotion | Emotional text-to-speech (TTS) aims to synthesize speech with specific emotional types. |
Enhancement Detection | LibriTTS-TestClean | Audio | Binary Label | Binary Classification | Acoustic | Enhancement detection determines whether a given audio clip has been created or modified by a speech enhancement model. The expected answer is either yes or no. The task poses a challenging problem because the speech model must not only process the content of the speech but also detect minute modifications that might indicate enhancement. |
Expressive TTS | Expresso | Text, Label | Audio | Generation | Acoustic, Understanding | Expressive text-to-speech (TTS) aims to synthesize speech with specific reading types or improvised styles. |
HowFarAreYou | 3DSpeaker Dataset, Spatial LibriSpeech | Audio | Scalar | Regression | Acoustic | The HowFarAreYou task aims to determine the distance of the speaker (the sound source) from the microphone, based on the provided audio. The response could be an exact value, such as 0.4m, 2.0m, or 4.0m. Gauging the speaker's distance provides insights into the audio's spatial characteristics, which forms a crucial aspect of auditory scene analysis. |
Instruct TTS | None available | Text | Audio | Generation | Acoustic, Understanding | Instruct text-to-speech (TTS) aims to synthesize speech with varying speaking styles to better reflect human speech patterns, given a certain instruction. |
Intent Classification | Fluent Speech Commands Dataset, SLURP, ATIS, Snips | Audio | Label | Classification | Understanding | Intent classification aims to identify and categorize the actionable intent behind a spoken message. The recognized intents can vary, including activate, bring, change language, deactivate, decrease, or increase. Identifying the intent accurately is pivotal for building reliable speech-based applications and interfaces. We categorize this task into three types: Action, Location, and Object. |
Keyword Spotting | Google Speech Commands V1 Dataset, LibriPhrase | Audio, Text | Binary Label | Binary Classification | Content | Keyword spotting detects predefined keywords or phrases in phone calls or audio recordings. Detected keywords can then be used, for example, to adjust the urgency of a call, train employees, or gauge customer satisfaction. |
Language Identification | VoxForge Dataset, Common Voice, VoxLingua107 | Audio | Label | Classification | Language | Language identification aims to determine the language spoken in a given speech recording, for example German, English, Spanish, Italian, Russian, or French. This is an essential part of speech processing, as it facilitates understanding and translation across different languages. |
Laughter Synthesis | Laughterscape | Audio, Audio | Audio | Generation | Acoustic | Laughter synthesis aims to generate the sound of laughter in the voice of a given speaker. |
Multilingual Speech Recognition | Common Voice, VoxLingua107, MLS, FLEURS, CMU Wilderness, YODAS | Audio | Text | Classification | Content, Language | The task of multilingual speech recognition (MSR) involves developing systems that can accurately transcribe speech data across multiple languages. Unlike traditional speech recognition systems that are designed for a specific language, MSR systems aim to handle diverse languages and dialects. |
MultiSpeaker Detection | LibriSpeech-TestClean Dataset, VCTK Dataset | Audio | Binary Label | Binary Classification | Speaker | MultiSpeaker detection analyzes speech audio to determine whether more than one speaker is present. It is crucial for a universal speech model to detect this, as the presence of multiple speakers can alter the context and understanding of the spoken content. |
Noise Detection | LJSpeech Dataset, VCTK Dataset, Musan Dataset | Audio | Binary Label | Binary Classification | Acoustic | Noise detection aims to identify whether the speech audio is clean or mixed with noise. The expected answer is either yes or no. There are many types of noise, such as music, speech, or Gaussian noise. The task poses a challenging problem because the speech model must not only process the content of the speech but also understand the degradation of speech. |
Noise SNR Level Prediction | VCTK Dataset, Musan Dataset | Audio | Scalar | Regression | Acoustic | Noise SNR level prediction aims to predict the signal-to-noise ratio (SNR) of the speech audio. The expected answer could be zero, five, ten, or fifteen. There are many types of noise, such as music, speech, or Gaussian noise. The task poses a challenging problem because the speech model must not only process the content of the speech but also understand the degree of noise degradation. |
Non-verbal Voice Recognition | CNVVE | Audio | Label | Classification | Content | Non-verbal voice recognition recognizes non-verbal or non-lexical voice expressions, such as humming. |
Offensive Language Identification | OLID | Audio | Label | Classification | Understanding | Offensive language identification aims to identify the type and the target of offensive language in social media posts. |
Overlapping Speech Detection | AMI Meeting Corpus, DIHARD I Challenge Data, DIHARD II Challenge Data, VoxConverse | Audio | Label, Timestamp | Classification | Content, Speaker | Overlapped speech detection (OSD) estimates the onsets and offsets of segments within an audio clip (i.e., an utterance, session, or conversation as a whole) where more than one speaker is speaking simultaneously. |
Reverberation Detection | LJSpeech Dataset, VCTK Dataset, RIRs Noises Dataset | Audio | Binary Label | Binary Classification | Acoustic | Reverberation detection aims to detect whether the speech audio is clean or mixed with room impulse responses (RIRs) and noise, that is to say, reverberation. The expected answer is either clean or noisy. The reverberation may originate from a large, medium, or small room. The task poses a challenging problem because the speech model must not only process the content of the speech but also understand the degradation of speech in reverberant conditions. |
Sarcasm Detection | MUStARD Dataset | Audio | Binary Label | Binary Classification | Understanding | Sarcasm detection aims to recognize the presence of sarcasm or ironic expressions in speech audio. The expected answer is either true or false. The task poses a challenging problem because the speech model must understand higher-level semantic information. |
Slot Filling | SLURP, ATIS, Snips | Audio | Text | Classification | Understanding | The goal of slot filling is to identify from a running dialog different slots, which correspond to different parameters of the user’s query. For instance, when a user queries for nearby restaurants, key slots for location and preferred food are required for a dialog system to retrieve the appropriate information. Thus, the main challenge in the slot-filling task is to extract the target entity. |
Speaker Counting | MUStARD Dataset | Audio | Label | Classification | Speaker | Speaker counting aims to identify the total number of speakers in the speech audio. The expected answer should be one, two, three, four, or five. The task poses a challenging problem because the speech model must distinguish the voice patterns of different speakers. |
Speaker Diarization | CHIME 5, CHIME 6, DIHARD II, LibriCSS, AISHELL-4, VoxConverse | Audio | Label, Timestamp | Classification | Speaker | Speaker diarization is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker. |
Speaker Identification | LibriSpeech-TestClean Dataset, VCTK Dataset, VoxCeleb1, VoxCeleb2, CN-Celeb, AVSpeech, VoxTube | Audio | Label | Classification | Speaker | Speaker recognition deals with the identification of the speaker in an audio stream. |
Speaker Verification | LibriSpeech-TestClean Dataset, VCTK Dataset, VoxCeleb1, VoxCeleb2, CN-Celeb | Audio, Audio | Binary Label | Binary Classification | Speaker | Speaker verification aims to verify whether two given speech audios come from the same speaker. The expected answer is either yes or no. The task poses a challenging problem because the speech model must distinguish the voice patterns of different speakers. |
Speech Edit | LibriTTS, VCTK Dataset, LJSpeech Dataset | Audio, Text | Audio | Generation | Acoustic, Content | Speech edit allows the user to edit recorded speech, e.g., insert missed words, replace mispronounced words, and/or remove unwanted speech or non-speech events, without degrading the quality and naturalness of the edited speech. |
Speech Command Recognition | Google Speech Commands V1 Dataset | Audio | Label | Classification | Content | Speech command recognition aims to identify and comprehend the spoken command presented in the speech. The expected answer should be yes, no, up, down, left, right, on, off, stop, go, zero, one, two, three, four, five, six, seven, eight, nine, bed, bird, cat, dog, happy, house, marvin, sheila, tree, wow, or silence. The task poses a challenging problem because the speech model must understand the content of the speech audio. |
Speech Dereverberation | Reverb-WSJ0, WHAMR!, CHIME 5, CHIME 6 | Audio | Audio | Regression | Acoustic | Speech dereverberation is the process by which the effects of reverberation are removed from sound after the reverberant sound has been picked up by microphones. |
Speech Detection | LJSpeech Dataset, LibriSpeech-TestClean Dataset, LibriSpeech-TestOther Dataset, InaGVAD | Audio | Binary Label | Binary Classification | Content | Speech detection, also known as voice activity detection (VAD), aims to identify whether a given audio clip contains real speech. The expected answer is either yes or no. The task poses a challenging problem because the speech model must understand not only the content of the audio but also the pattern of the human voice. |
Speech Enhancement | VoiceBank+DEMAND, DNS-Challenge, WHAM!, WHAMR! | Audio | Audio | Regression | Acoustic | Speech enhancement aims to improve speech quality, i.e., the intelligibility and/or overall perceptual quality of a degraded speech signal, using audio signal processing techniques. |
Speech Separation | WSJ0-2mix, LibriMix, Real-M, WHAM!, WHAMR!, CHIME 5, CHIME 6, AISHELL-4 | Audio | Audio, Audio | Regression | Speaker | Speech separation is the extraction of multiple speech signals from a mixture. |
Speech Text Matching | LJSpeech Dataset, LibriSpeech-TestClean Dataset, LibriSpeech-TestOther Dataset | Audio, Text | Binary Label | Binary Classification | Content | Speech text matching aims to determine whether the speech and text share the same underlying message. The expected answer is either yes or no. The task poses a challenging problem because the speech model must understand the content of the speech audio. |
Speech-to-speech Translation | CVSS, CoVoST 2 | Audio | Audio | Generation | Language, Content | Speech-to-speech translation consists of translating speech in one language to speech in another language. This can be done with a cascade of automatic speech recognition (ASR), text-to-text machine translation (MT), and text-to-speech (TTS) synthesis sub-systems, which is text-centric. |
Speech-to-text Translation | MuST-C | Audio | Text | Generation | Language, Content | Speech-to-text translation, combining automatic speech recognition (ASR) and machine translation (MT), refers to the process where spoken language is not only converted to text but also translated into another language. |
Speech Quality Assessment | VCC2018, BVCC | Audio | Scalar | Regression | Acoustic | Speech quality assessment estimates the quality of speech, e.g., as a mean opinion score (MOS). |
Spoken Question Answering | Spoken-SQuAD, ODSQA, NMSQA | Audio | Text | Generation | Understanding | Spoken question answering (SQA) aims to find the answer from a spoken document given a question in either text or spoken form. SQA is crucial for personal assistants when answering users’ spoken queries. |
Spoken Term Detection | LJSpeech Dataset, LibriSpeech-TestClean Dataset, LibriSpeech-TestOther Dataset | Audio, Text | Binary Label | Binary Classification | Content | Spoken term detection checks for the existence of a given word in the speech and indicates whether the word is mentioned. The expected answer is either yes or no. The task poses a challenging problem because the speech model must understand the content of the speech audio. |
Stress Detection | MIR-SD Dataset | Audio | Binary Label | Binary Classification | Acoustic | Stress detection aims to determine stress placement in English words. The expected answer should be zero, one, two, three, four, or five. For a universal speech model, understanding these paralinguistic features is crucial, as they distinguish speech from mere text in a significant manner. |
Target Speaker Extraction | WSJ0-2mix, LibriMix, Real-M, WHAM!, WHAMR!, CHIME 5, CHIME 6 | Audio, Audio | Audio | Regression | Speaker | Target speaker extraction aims to segregate the speech of a target speaker from a mixture of interfering speakers with the help of auxiliary information. |
Text-To-Speech Synthesis | LJ Speech, LibriTTS, AISHELL 3, LibriTTS-R, YTTTS | Text | Audio | Generation | Acoustic | Text-to-speech (TTS) synthesis converts normal language text into speech. |
Vocal Sound Classification | VocalSound | Audio | Label | Classification | Acoustic | Vocal sound classification aims at automatic recognition of human vocal sounds such as laughter, sighs, coughs, throat clearing, sneezes, and sniffs. |
Voice Conversion | LibriTTS, VCTK Dataset, ESD | Audio, Audio | Audio | Generation | Acoustic, Speaker | Voice conversion is a technology that modifies the speech of a source speaker and makes their speech sound like that of another target speaker without changing the linguistic information. |
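As referenced in the Automatic Speech Recognition row above, here is a minimal sketch of the Audio → Text pattern using torchaudio's pretrained wav2vec 2.0 pipeline on LibriSpeech test-clean. The `./data` root directory is a placeholder, and the greedy CTC decoder below is only the simplest possible decoding strategy, not the method of any specific system listed here.

```python
import torch
import torchaudio

# Pretrained wav2vec 2.0 ASR pipeline, fine-tuned on LibriSpeech 960h.
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model().eval()

# Download the LibriSpeech test-clean split under ./data (placeholder path).
dataset = torchaudio.datasets.LIBRISPEECH("./data", url="test-clean", download=True)
waveform, sample_rate, transcript, *_ = dataset[0]

# Resample defensively in case the utterance rate differs from the model's.
if sample_rate != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)  # frame-level log-probabilities over characters

# Greedy CTC decoding: best label per frame, collapse repeats, drop blanks.
labels = bundle.get_labels()  # ('-', '|', 'E', 'T', ...); '-' is blank, '|' is space
indices = torch.unique_consecutive(torch.argmax(emissions[0], dim=-1))
hypothesis = "".join(labels[i] for i in indices if labels[i] != "-").replace("|", " ")

print("REF:", transcript)
print("HYP:", hypothesis)
```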
The tasks above can also be grouped by input and output modality:

Input | Output | Tasks
---|---|---|
Audio | Audio | Acoustic Echo Cancellation, Speech Dereverberation, Speech Enhancement, Speech-to-speech Translation |
Audio | Audio, Audio | Speech Separation
Audio | Binary Label | DeepFake Detection (Spoof Detection), Enhancement Detection, MultiSpeaker Detection, Noise Detection, Reverberation Detection, Sarcasm Detection, Speech Detection, Stress Detection
Audio | Label | Accent Classification, Dialogue Act Classification, Dialogue Emotion Classification, Emotion Recognition, Intent Classification, Language Identification, Non-verbal Voice Recognition, Offensive Language Identification, Speaker Counting, Speaker Identification, Speech Command Recognition, Vocal Sound Classification |
Audio | Label, Timestamp | Speaker Diarization, Overlapping Speech Detection |
Audio | Scalar | Dysarthric Speech Assessments, HowFarAreYou, Noise SNR Level Prediction (see the SNR sketch after this table), Speech Quality Assessment
Audio | Text | Automatic Speech Recognition, Dysarthric Speech Recognition, Multilingual Speech Recognition, Slot Filling, Speech-to-text Translation, Spoken Question Answering |
Audio, Audio | Audio | Laughter Synthesis, Target Speaker Extraction, Voice Conversion
Audio, Audio | Binary Label | Speaker Verification |
Audio, Label | Binary Label | Dialogue Act Pairing
Audio, Text | Audio | Accented Text-to-speech, Speech Edit |
Audio, Text | Binary Label | Keyword Spotting, Speech Text Matching, Spoken Term Detection
Text | Audio | Instruct TTS, Text-To-Speech Synthesis |
Text, Label | Audio | Emotional TTS, Expressive TTS |
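The Audio → Scalar rows above often reduce to simple signal arithmetic. For Noise SNR Level Prediction, the target quantity is the signal-to-noise ratio, SNR(dB) = 10 * log10(P_signal / P_noise). Below is a minimal sketch, assuming time-aligned clean and noise signals as NumPy arrays; both function names are illustrative, not from any particular toolkit.

```python
import numpy as np

def snr_db(clean: np.ndarray, noise: np.ndarray) -> float:
    """Signal-to-noise ratio in dB between a clean signal and additive noise."""
    return 10.0 * np.log10(np.mean(clean ** 2) / np.mean(noise ** 2))

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, target_snr_db: float) -> np.ndarray:
    """Scale the noise so the mixture sits at the requested SNR, then add it."""
    gain = np.sqrt(np.mean(clean ** 2) / (np.mean(noise ** 2) * 10.0 ** (target_snr_db / 10.0)))
    return clean + gain * noise
```

Training pairs for the discrete version of the task (zero / five / ten / fifteen dB) can be simulated this way, e.g., by mixing clean VCTK utterances with Musan noise at the target levels.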
Finally, the tasks grouped by the level of information they target:

Level | Tasks
---|---|
Acoustic | Accent Classification, Accented Text-to-speech, Acoustic Echo Cancellation, DeepFake Detection (Spoof Detection), Dysarthric Speech Assessments, Emotional TTS, Enhancement Detection, Expressive TTS, HowFarAreYou, Instruct TTS, Laughter Synthesis, Noise Detection, Noise SNR Level Prediction, Reverberation Detection, Speech Edit, Speech Dereverberation, Speech Enhancement, Speech Quality Assessment, Stress Detection, Text-To-Speech Synthesis, Vocal Sound Classification, Voice Conversion
Content | Automatic Speech Recognition, Dysarthric Speech Recognition, Keyword Spotting, Multilingual Speech Recognition, Non-verbal Voice Recognition, Overlapping Speech Detection, Speech Edit, Speech Command Recognition, Speech Detection, Speech Text Matching, Speech-to-speech Translation, Speech-to-text Translation, Spoken Term Detection
Emotion | Dialogue Emotion Classification, Emotion Recognition, Emotional TTS |
Language | Accent Classification, Accented Text-to-speech, Language Identification, Multilingual Speech Recognition, Speech-to-speech Translation, Speech-to-text Translation |
Speaker | MultiSpeaker Detection, Overlapping Speech Detection, Speaker Counting, Speaker Diarization, Speaker Identification, Speaker Verification (see the scoring sketch after this table), Speech Separation, Target Speaker Extraction, Voice Conversion
Understanding | Dialogue Act Classification, Dialogue Act Pairing, Expressive TTS, Instruct TTS, Intent Classification, Offensive Language Identification, Sarcasm Detection, Slot Filling, Spoken Question Answering |
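As an example of the Speaker-level tasks, speaker verification is commonly scored by comparing fixed-dimensional speaker embeddings (e.g., x-vectors or ECAPA-TDNN embeddings) with cosine similarity against a tuned threshold. A minimal sketch follows; the embedding extractor is not shown, and the threshold value is a placeholder that would normally be tuned on a development set (e.g., at the equal-error-rate operating point on VoxCeleb1 trials).

```python
import numpy as np

def cosine_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings, in [-1, 1]."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def same_speaker(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 0.5) -> bool:
    """Accept the trial as same-speaker when the score clears the threshold."""
    return cosine_score(emb_a, emb_b) >= threshold
```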