This project aims to build a toolkit for speech-to-text tasks (e.g., speech recognition and speech translation). It is powered by NiuTensor.
We now support multilingual speech recognition! Try it at https://asr-demo.niutrans.com!
- Light and fast: all the code is written in C++ and CUDA.
- Independently developed and well-optimized for both CPUs and GPUs.
You first need to prepare a C++ environment (for NiuTrans.ST) and a Python environment (for pre/post-processing).
Recommended environment (a quick version check is sketched below):
- GCC 8.X.X (>= 5.4.0 should work)
- CUDA Toolkit 11.3 with cuDNN (11.3 or 11.7 works, but version 12 is not supported)
- CMake 3.2X.X
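Before building, you can quickly confirm the toolchain versions (this assumes gcc, nvcc, and cmake are already on your PATH):

```bash
gcc --version | head -n 1      # expect 8.x (>= 5.4.0)
nvcc --version | grep release  # expect CUDA 11.3 or 11.7, not 12
cmake --version | head -n 1    # expect 3.2x
```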
cd /PROJECT_DIR/NiuTrans.ST
mkdir build && cd build
cmake ../ -DUSE_CUDA=ON -DUSE_CUDNN=ON -DGPU_ARCH=P -DCUDA_TOOLKIT_ROOT='D:/CUDA Toolkit/v11.7'
You can use nvidia-smi -q | grep "Architecture" to look up the value for -DGPU_ARCH. For more details, please refer to NiuTensor.
make -j$(nproc)
We provide test files in /PROJECT_DIR/test.
/PROJECT_DIR/bin/NiuTrans.ST -config /path/to/test/config.txt
Toolkit (for unifying the audio format; a typical conversion command is sketched below):
- ffmpeg
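For example, to convert an arbitrary input file to mono 16 kHz WAV (the sample rate here is an assumption; match whatever the feature extractor expects):

```bash
ffmpeg -i input.mp3 -ac 1 -ar 16000 output.wav
```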
Python packages (if you prefer to process the audio with Python; an install sketch follows the list):
- torch (https://pytorch.org/get-started/previous-versions/)
- numpy
- webrtcvad
- librosa
- soundfile
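A minimal install might look like the following; exact versions are not pinned in this project's docs, so the unpinned installs below are an assumption:

```bash
pip install numpy webrtcvad librosa soundfile
# install torch separately, following
# https://pytorch.org/get-started/previous-versions/
```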
/PROJECT_DIR/script/run.sh
*If you run into errors that you cannot fix, consider using absolute paths.*
Run run.sh in Git Bash:
cd /PROJECT_DIR/script/
bash run.sh
To use your own audio files, modify the audio paths in /PROJECT_DIR/data/test-clean-head.tsv.
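The column layout of the .tsv is not documented here, so the safest approach is to inspect the sample file and mirror its format:

```bash
# Print a few sample rows to see the expected columns,
# then substitute your own audio paths in the same positions.
head -n 3 /PROJECT_DIR/data/test-clean-head.tsv
```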
${python} ${FeatureExactor} -l ${audio_tsv} -o ${bin_tsv} -p ${bin_dir} -v ${vad_mode} -d 30.0
FeatureExactor will extract features from the raw audio files listed in audio_tsv. The features will be saved as listed in bin_tsv. Use vad_mode to choose the strength of the VAD; '0' means VAD is not used.
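As a concrete sketch (the script name FeatureExactor.py and all paths below are hypothetical; run.sh sets the actual values):

```bash
python /PROJECT_DIR/script/FeatureExactor.py \
    -l /PROJECT_DIR/data/test-clean-head.tsv \
    -o /PROJECT_DIR/data/test-clean-head-bin.tsv \
    -p /PROJECT_DIR/data/bin \
    -v 0 \
    -d 30.0
# -v 0 disables VAD; -d 30.0 is presumably a duration limit in seconds
```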
${NiuTrans} -config ${config}
NiuTrans.ST performs recognition/translation using the settings in the config file (see below).
${python} ${TokenDecoder} -v ${vocab} -t ${token} > ${output_sentence}
TokenDecoder will decode the tokens output by the model into words.
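A concrete sketch (the script name TokenDecoder.py and the paths are hypothetical; run.sh holds the actual values):

```bash
python /PROJECT_DIR/script/TokenDecoder.py \
    -v /PROJECT_DIR/model/vocab \
    -t /PROJECT_DIR/output/tokens.txt \
    > /PROJECT_DIR/output/sentences.txt
```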
config_string="-fbank 80 // fixed at 80
-dev 0 // device id
-beam 2 // beam size (greedy search when set to 1)
-model ${model} // path to converted model file
-tgtvocab ${vocab} // path to converted vocab file
-maxlen 224 // maximum output length
-lenalpha 1.0 // length penalty (alpha)
-sbatch 2 // batch size in sentences
-wbatch 300000 // maximum tokens per batch
-lang en // language; only en and zh are supported (with a fixed prompt)
-input ${input} // path to the list of input feature
-output ${output}" // path to save output tokens
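One way to use this string is to strip the inline // comments, save the options to a file, and pass that file to NiuTrans.ST; this mirrors what run.sh appears to do, though the exact mechanics may differ:

```bash
# Drop the "// ..." comments and write the options to a config file,
# then decode with it.
echo "${config_string}" | sed 's@//.*@@' > config.txt
/PROJECT_DIR/bin/NiuTrans.ST -config config.txt
```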
Here, you can use '-input' to pass the features of multiple audio files, or use '-inputAudio' to pass a single raw audio file (skipping the FeatureExactor step), like:
config_string="-fbank 80 // fixed at 80
-dev 0 // device id
-beam 2 // beam size (greedy search when set to 1)
-model ${model} // path to converted model file
-tgtvocab ${vocab} // path to converted vocab file
-maxlen 224 // maximum output length
-lenalpha 1.0 // length penalty (alpha)
-sbatch 2 // batch size in sentences
-wbatch 300000 // maximum tokens per batch
-lang en // language; only en and zh are supported (with a fixed prompt)
-inputAudio /PROJECT_DIR/data/flac/672-122797-0000.flac // path to a raw audio file
-output ${output}" // path to save output tokens
Current team members are Yuhao Zhang, Xiangnan Ma, Kaiqi Kou, Erfeng He, and Chenghao Gao.
We would like to thank Prof. Tong Xiao and Prof. Jingbo Zhu for their support.