Multimodal Speech Emotion Recognition using Audio and Text, IEEE SLT-18, [paper]
tensorflow==1.4 (tested on cuda-8.0, cudnn-6.0)
python==2.7
scikit-learn==0.20.0
nltk==3.3
- IEMOCAP [link] [paper]
- download IEMOCAP data from its original web-page (license agreement is required)
- Get the preprocessed dataset [application link]
- If you want to download the "preprocessed dataset," please obtain a license from the IEMOCAP team first.
- For preprocessing, refer to the code in "./preprocessing"
- We cannot publish the ASR-processed transcriptions due to license issues (commercial API); however, it should be reasonably easy to extract ASR transcripts from the audio signal yourself (we used the Google Cloud Speech API).
- Format of the data for our experiments:
    - MFCC: MFCC features of the audio signal (ex. train_audio_mfcc.npy), [#samples, 750, 39] - (#samples, sequence length (max 7.5s), dims)
    - MFCC-SEQN: valid length of each audio sequence (ex. train_seqN.npy), [#samples]
    - PROSODY: prosody features of the audio signal (ex. train_audio_prosody.npy), [#samples, 35] - (#samples, dims)
    - TRANS: indexed transcription sequence of each sample (ex. train_nlp_trans.npy), [#samples, 128] - (#samples, max sequence length)
    - LABEL: target label of the audio signal (ex. train_label.npy), [#samples]
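As a quick sanity check, the documented shapes can be verified after loading the arrays. The zero arrays below are stand-ins for illustration; with real data you would use `np.load` on the filenames shown above:

```python
import numpy as np

# Stand-in arrays with the documented shapes (with real data, replace
# np.zeros(...) with e.g. np.load("train_audio_mfcc.npy")).
n = 4  # number of samples, arbitrary for this sketch
mfcc    = np.zeros((n, 750, 39), dtype=np.float32)  # MFCC
seq_n   = np.full(n, 750, dtype=np.int32)           # MFCC-SEQN (valid lengths)
prosody = np.zeros((n, 35), dtype=np.float32)       # PROSODY
trans   = np.zeros((n, 128), dtype=np.int32)        # TRANS (word indices)
label   = np.zeros(n, dtype=np.int32)               # LABEL

assert mfcc.shape[1:] == (750, 39)
assert prosody.shape[1] == 35
assert trans.shape[1] == 128
assert seq_n.shape == label.shape == (n,)
print("all shapes consistent")
```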
- This repository contains code for the following models:
Audio Recurrent Encoder (ARE)
Text Recurrent Encoder (TRE)
Multimodal Dual Recurrent Encoder (MDRE)
Multimodal Dual Recurrent Encoder with Attention (MDREA)
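The MDRE fusion step can be sketched in plain numpy: each modality's recurrent encoder produces a final hidden state, the two states are concatenated, and a dense softmax layer predicts the emotion class. The hidden sizes, random weights, and 4-class output below are illustrative assumptions, not the paper's exact hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Final hidden states from the two (hypothetical) recurrent encoders.
batch, h_audio, h_text, n_classes = 2, 128, 128, 4
audio_state = rng.standard_normal((batch, h_audio))  # ARE-style encoder output
text_state = rng.standard_normal((batch, h_text))    # TRE-style encoder output

# MDRE-style fusion: concatenate the states, then a dense softmax classifier.
fused = np.concatenate([audio_state, text_state], axis=1)  # (batch, 256)
W = rng.standard_normal((h_audio + h_text, n_classes)) * 0.01
b = np.zeros(n_classes)
probs = softmax(fused @ W + b)  # (batch, n_classes) class probabilities
print(probs.shape)
```

MDREA additionally weights the audio frames with attention scores derived from the text encoding before pooling, rather than using only the last hidden state.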
- Refer to "reference_script.sh" for how to run each model.
- The final result will be stored in "./TEST_run_result.txt"
- Please cite our paper when you use our code, model, or dataset:
@inproceedings{yoon2018multimodal,
title={Multimodal Speech Emotion Recognition Using Audio and Text},
author={Yoon, Seunghyun and Byun, Seokhyun and Jung, Kyomin},
booktitle={2018 IEEE Spoken Language Technology Workshop (SLT)},
pages={112--118},
year={2018},
organization={IEEE}
}