Support adding punctuations to the speech recognition result #761

Merged
merged 8 commits into from
Apr 13, 2024

Collaborator

@csukuangfj csukuangfj commented Apr 12, 2024

Usage

Download a model

Please see
https://github.com/k2-fsa/sherpa-onnx/releases/tag/punctuation-models

mkdir -p /tmp
cd /tmp
wget https://github.com/k2-fsa/sherpa-onnx/releases/download/punctuation-models/sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12.tar.bz2
tar xvf sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12.tar.bz2
rm sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12.tar.bz2

Build sherpa-onnx from source

cd /tmp
git clone https://github.com/k2-fsa/sherpa-onnx
cd sherpa-onnx
mkdir build
cd build
cmake ..
make

Test case 1 (Chinese + English)

cd /tmp
cd sherpa-onnx/build
./bin/sherpa-onnx-offline-punctuation \
  --debug=1 \
  --ct-transformer=/tmp/sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12/model.onnx \
  '这是一个测试你好吗How are you我很好thank you are you ok谢谢你'

The output is

Num threads: 1
Elapsed seconds: 0.003 s
Input text: 这是一个测试你好吗How are you我很好thank you are you ok谢谢你
Output text: 这是一个测试,你好吗?How are you?我很好?thank you,are you ok。谢谢你。

Test case 2 (Chinese only)

cd /tmp
cd sherpa-onnx/build
./bin/sherpa-onnx-offline-punctuation \
  --debug=1 \
  --ct-transformer=/tmp/sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12/model.onnx \
  '我们都是木头人不会说话不会动'

The output is

Num threads: 1
Elapsed seconds: 0.002 s
Input text: 我们都是木头人不会说话不会动
Output text: 我们都是木头人,不会说话不会动。

Test case 3 (English only)

cd /tmp
cd sherpa-onnx/build
./bin/sherpa-onnx-offline-punctuation \
  --debug=1 \
  --ct-transformer=/tmp/sherpa-onnx-punct-ct-transformer-zh-en-vocab272727-2024-04-12/model.onnx \
  'The African blogosphere is rapidly expanding bringing more voices online in the form of commentaries opinions analyses rants and poetry'

The output is

Num threads: 1
Elapsed seconds: 0.002 s
Input text: The African blogosphere is rapidly expanding bringing more voices online in the form of commentaries opinions analyses rants and poetry
Output text: The African blogosphere is rapidly expanding,bringing more voices online in the form of commentaries,opinions,analyses,rants and poetry。
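
Note that the output for English-only input above uses full-width marks (,and 。) with no following space. If ASCII punctuation is preferred, one option is to post-process the output text. The following is only a minimal sketch: `to_ascii_punct` is a hypothetical helper, not part of sherpa-onnx, and it assumes the text is English-only and contains just the three full-width marks shown in the outputs above (,。?).

```shell
# Hypothetical helper (not part of sherpa-onnx): replace the full-width marks
# seen in the outputs above with ASCII punctuation followed by a space, then
# strip any trailing space. Intended for English-only text.
to_ascii_punct() {
  printf '%s\n' "$1" | sed \
    -e 's/,/, /g' \
    -e 's/。/. /g' \
    -e 's/?/? /g' \
    -e 's/ *$//'
}

# Example: apply it to the "Output text" of test case 3.
to_ascii_punct 'The African blogosphere is rapidly expanding,bringing more voices online in the form of commentaries,opinions,analyses,rants and poetry。'
```

For the test case 3 output this prints the same sentence with `, ` after each comma position and a final `.`. Mixed Chinese and English output, as in test case 1, would need a more careful rule, since Chinese text normally keeps the full-width marks.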

@csukuangfj csukuangfj changed the title WIP: Support adding punctuations to the speech recognition result Support adding punctuations to the speech recognition result Apr 13, 2024
@csukuangfj csukuangfj merged commit 329fe1a into k2-fsa:master Apr 13, 2024
212 of 222 checks passed
@csukuangfj csukuangfj deleted the punctuation branch April 13, 2024 04:16

mark95 commented Jun 19, 2024

I'm having a problem understanding how to train this model because it is in Chinese. Is it possible to train this model for other languages? What dataset format do I need to train it? Is there a training script for this model?
