This repository contains a toolkit for speech translation. It provides a Docker container with a ready-to-use pipeline containing the following components:
- a neural speech recognition system
- a sentence segmentation system
- an attention-based translation system
The speech recognition system processes the audio files and creates a transcription in the source language. Afterwards, the sentence segmentation system adds punctuation and recases the output. Finally, the output is translated by the machine translation system. We provide pipelines to train these models, as well as pre-trained models for all components, for the task of translating English lectures into German.
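The data flow of the cascade can be sketched as a chain of three text filters. This is only an illustration: the functions `recognize`, `segment`, and `translate` are hypothetical toy stand-ins and are not commands provided by this toolkit, where each stage is a trained model.

```shell
# Toy stand-ins for the three stages (hypothetical names, not toolkit commands).
# The second and third stage read text on stdin and write text on stdout.

# Stage 1 stand-in: pretend the audio has been transcribed into
# lowercase, unpunctuated source-language text.
recognize() {
  echo "thank you very much"
}

# Stage 2 stand-in: recase the first letter and append a final period.
segment() {
  awk '{ print toupper(substr($0, 1, 1)) substr($0, 2) "." }'
}

# Stage 3 stand-in: "translate" with a single hard-coded sentence pair.
translate() {
  sed 's/^Thank you very much\.$/Vielen Dank./'
}

# Chain the stages in the order described above.
recognize | segment | translate
```

The point of the sketch is the ordering: the translation stage sees punctuated, recased text, not raw recognizer output.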
The system uses the following software:
- OpenNMT-py
- Moses
- XNMT
- Subword NMT
- Translation error rate
- BEER
- CharacTER
- SCTK
- mwerSegmenter
- NLTK
- LIUM Speaker Diarization
- CTC.ISL
- NMTGMinor
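Requirements:
- Docker with the NVIDIA container runtime (the commands below use `docker build` and `docker run --runtime=nvidia`)

News:
- 2019-09: Recipe for the How2 dataset (https://github.com/srvk/how2-dataset) using the transformer architecture for ASR, MT, and end-to-end SLT.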
Installation:
- Building the Docker image (set CUDAVERSION to 8.0, 9.0, or 9.1)

      git clone https://github.com/isl-mt/SLT.KIT.git
      cd SLT.KIT
      docker build --build-arg CUDA=$CUDAVERSION -t slt.kit -f Dockerfile.ST-Baseline .
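Because only three CUDA versions are listed, it can help to check the value before starting a long build. The following is a minimal sketch, not part of the repository: the helper names `is_supported_cuda` and `build_command` and the fallback value 9.0 are assumptions made for illustration. It prints the build command instead of running it, so it can be tried without Docker.

```shell
# Succeed only for the CUDA versions listed in this README.
is_supported_cuda() {
  case "$1" in
    8.0|9.0|9.1) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the build command for the given CUDA version (dry run only).
# Hypothetical helper; it rejects versions this README does not list.
build_command() {
  if ! is_supported_cuda "$1"; then
    echo "unsupported CUDA version: $1 (expected 8.0, 9.0 or 9.1)" >&2
    return 1
  fi
  echo "docker build --build-arg CUDA=$1 -t slt.kit -f Dockerfile.ST-Baseline ."
}

# Falls back to 9.0 when CUDAVERSION is unset (an arbitrary choice for this sketch).
build_command "${CUDAVERSION:-9.0}"
```

To actually build, run the printed command from inside the cloned `SLT.KIT` directory.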
- Starting the Docker container and setting the source and target language (e.g. English (en) as source and German (de) as target), with gpuid set to the GPU to expose to the container

      docker run -ti --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=$gpuid slt.kit
      export sl=en
      export tl=de
- The general file structure used by all models and systems is described in File structure
- This repository contains different systems that can be used for speech translation:
  - Cascaded systems: systems that combine an ASR, a sentence segmentation/punctuation, and an MT component
    - ctc-tedlium2.smallTED: Combination of the ctc-tedlium2 ASR system and the smallTED system for sentence segmentation and MT
    - ctc-tedlium2.midSize: Combination of the ctc-tedlium2 ASR system and the midSize system for sentence segmentation and MT
  - ASR systems: systems that transcribe the audio
    - ctc-tedlium2: Simple LSTM network trained with the CTC loss that outputs BPE units
    - las-tedlium2: Attention-based ASR system
  - Sentence segmentation/MT
    - ted: System trained on the TED corpus
    - midSize: System trained on the TED and EPPS corpora
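The cascaded system names listed above appear to follow the pattern `<ASR system>.<segmentation/MT system>`. Assuming that convention holds, and that the ASR part contains no dot, a name can be split with standard shell parameter expansion. The function `split_system` is a hypothetical helper for illustration, not part of the toolkit.

```shell
# Split a cascaded system name at its first dot.
# Assumption: names look like <asr>.<segmentation/MT>, as in the list above.
split_system() {
  asr_system="${1%%.*}"   # everything before the first dot
  mt_system="${1#*.}"     # everything after the first dot
}

split_system "ctc-tedlium2.midSize"
echo "ASR: $asr_system"
echo "Segmentation/MT: $mt_system"
```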
- English to German
  - dev2010
  - tst2010
  - tst2013
  - tst2014
  - tst2015
The results reported here were generated by combining the output of the three ASR systems (CTC 300, CTC 10k, and the attention-based ASR system) with ROVER and translating the result with the MT system trained on the TED corpus. BLEU(ci) and TER(ci) are the case-insensitive scores.
| Set | BLEU | TER | BEER | CharacTER | BLEU(ci) | TER(ci) |
|---|---|---|---|---|---|---|
| dev2010 | 13.98 | 71.78 | 45.88 | 78.50 | 15.05 | 69.68 |
| tst2010 | 14.08 | 71.66 | 44.40 | 77.66 | 15.12 | 69.36 |
| tst2013 | 13.73 | 72.81 | 44.02 | 71.45 | 14.61 | 70.78 |
| tst2014 | 13.28 | 74.34 | 42.43 | 78.38 | 14.01 | 72.62 |
Furthermore, results for the MT system can be found here.