ASR SEAME Recipe #1582

Open · wants to merge 5 commits into master
20 changes: 12 additions & 8 deletions egs/librispeech/ASR/zipformer/export.py
Collaborator:
Hi Amir, thank you for your contribution!

But I'm wondering whether you'd rather use a separate export.py than apply modifications to the one used by librispeech?

Contributor (Author):

Since both are based on the same zipformer architecture, I think they can use the same export script; the difference is only in the parameters. I could also create a separate one if needed.

@@ -29,13 +29,17 @@

(1) Export to torchscript model using torch.jit.script()

- For non-streaming model:

-./zipformer/export.py \
-  --exp-dir ./zipformer/exp \
-  --tokens data/lang_bpe_500/tokens.txt \
-  --epoch 30 \
-  --avg 9 \
+./zipformer_hat_seame/export.py \
+  --exp-dir ./zipformer_hat/exp \
+  --tokens data_seame/lang_bpe_4000/tokens.txt \
+  --epoch 20 \
+  --avg 5 \
+  --num-encoder-layers 2,2,2,2,2,2 \
+  --feedforward-dim 512,768,1024,1024,1024,768 \
+  --encoder-dim 192,256,256,256,256,256 \
+  --encoder-unmasked-dim 192,192,192,192,192,192 \
  --jit 1

It will generate a file `jit_script.pt` in the given `exp_dir`. You can later
@@ -234,7 +238,7 @@ def get_parser():
    parser.add_argument(
        "--tokens",
        type=str,
-        default="data/lang_bpe_500/tokens.txt",
+        default="data_libri/lang_bpe_500/tokens.txt",
        help="Path to the tokens.txt",
    )

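Aside on the docstring change above: once the export command has produced `jit_script.pt` in the chosen `--exp-dir`, it can be loaded back with `torch.jit.load`. A minimal sketch, with a hypothetical path that depends on your `--exp-dir`:

```
import torch

# Hypothetical location; jit_script.pt is written into whatever --exp-dir was used.
model = torch.jit.load("zipformer_hat/exp/jit_script.pt")
model.eval()
```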
23 changes: 23 additions & 0 deletions egs/seame/ASR/README.md
@@ -0,0 +1,23 @@
# Introduction

This recipe includes ASR models (zipformer, zipformer-hat, zipformer-hat-lid) trained and evaluated on the SEAME dataset.
The SEAME corpus contains Singaporean code-switched English and Mandarin speech.

The corpus comes with a predefined training split and two development splits:

- `train` -- a mix of code-switched, Mandarin, and Singaporean English speech
- `dev_sge` -- primarily Singaporean English, though it contains some code-switching
- `dev_man` -- primarily Mandarin, though it also contains some code-switching


[./RESULTS.md](./RESULTS.md) contains the latest results.

# Zipformer-hat

Zipformer trained with the hybrid autoregressive transducer (HAT) loss (<https://arxiv.org/abs/2003.07705>); see <https://github.com/k2-fsa/icefall/pull/1291>.
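
For orientation, a small PyTorch sketch of the HAT output distribution described in the paper above: the blank probability is modelled by a sigmoid gate, and the remaining mass is spread over the non-blank labels by a softmax. This is an illustrative restatement of the factorization, not the code used in this recipe.

```
import torch


def hat_log_probs(logits: torch.Tensor, blank_id: int = 0) -> torch.Tensor:
    """Turn joiner logits into HAT log-probabilities.

    p(blank)      = sigmoid(z_blank)
    p(y != blank) = (1 - p(blank)) * softmax(z_labels)[y]
    """
    blank_logit = logits[..., blank_id : blank_id + 1]
    label_logits = torch.cat(
        [logits[..., :blank_id], logits[..., blank_id + 1 :]], dim=-1
    )
    log_p_blank = torch.nn.functional.logsigmoid(blank_logit)
    log_p_not_blank = torch.nn.functional.logsigmoid(-blank_logit)  # log(1 - p(blank))
    log_p_labels = log_p_not_blank + torch.log_softmax(label_logits, dim=-1)
    # Put the blank back at index `blank_id`.
    return torch.cat(
        [log_p_labels[..., :blank_id], log_p_blank, log_p_labels[..., blank_id:]],
        dim=-1,
    )
```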

# Zipformer-hat-lid

Zipformer-hat with an auxiliary LID encoder and blank sharing for synchronization between ASR and LID, as described here (reference will be shared soon).
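
Since the reference is not public yet, the following is only one possible reading of "blank sharing", written down for illustration and not taken from this PR: the LID stream borrows the ASR blank probability, so both transducers emit blank on exactly the same frames.

```
import torch


def lid_log_probs_with_shared_blank(
    asr_log_probs: torch.Tensor,  # (..., vocab) HAT log-probs, blank assumed at index 0
    lid_logits: torch.Tensor,  # (..., num_lids) raw LID joiner logits
) -> torch.Tensor:
    """Hypothetical sketch: gate the LID labels with the ASR blank probability."""
    log_p_blank = asr_log_probs[..., :1]  # borrowed from the ASR head
    p_blank = log_p_blank.exp().clamp(max=1.0 - 1e-6)
    log_p_not_blank = torch.log1p(-p_blank)
    lid_log_probs = log_p_not_blank + torch.log_softmax(lid_logits, dim=-1)
    return torch.cat([log_p_blank, lid_log_probs], dim=-1)
```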

168 changes: 168 additions & 0 deletions egs/seame/ASR/RESULTS.md
@@ -0,0 +1,168 @@
## Results

### Zipformer

| decoding method      | dev   | test  | comment                                  |
|----------------------|-------|-------|------------------------------------------|
| modified beam search | 21.87 | 29.04 | --epoch 25, --avg 5, --max-duration 500  |
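
The `--avg 5` in the comment column refers to icefall's checkpoint averaging: the weights of the last few epoch checkpoints are averaged before decoding. A rough sketch of plain parameter averaging is shown below; the real logic (including the `--use-averaged-model` variant used in the decoding commands) lives in icefall itself, and the checkpoint paths and the `"model"` key here are assumptions based on the usual icefall checkpoint layout.

```
import torch


def average_checkpoints(paths):
    """Naive parameter averaging over several epoch checkpoints."""
    avg = None
    for p in paths:
        # Assumption: each checkpoint stores its weights under a "model" key.
        state = torch.load(p, map_location="cpu")["model"]
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}


# e.g. --epoch 25 --avg 5 corresponds roughly to averaging epochs 21..25:
# averaged = average_checkpoints(
#     [f"zipformer/exp-asr-seame/epoch-{i}.pt" for i in range(21, 26)]
# )
```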

The training command:

```
export CUDA_VISIBLE_DEVICES="0,1,2,3"

./zipformer/train.py \
--world-size 4 \
--num-epochs 25 \
--start-epoch 1 \
--use-fp16 1 \
--exp-dir zipformer/exp-asr-seame \
--causal 0 \
--num-encoder-layers 2,2,2,2,2,2 \
--feedforward-dim 512,768,1024,1024,1024,768 \
--encoder-dim 192,256,256,256,256,256 \
--encoder-unmasked-dim 192,192,192,192,192,192 \
--prune-range 10 \
--max-duration 500
```

The decoding command:

```
./zipformer/decode.py \
--epoch 25 \
--avg 5 \
--beam-size 10 \
--exp-dir ./zipformer/exp-asr-seame \
--max-duration 800 \
--num-encoder-layers 2,2,2,2,2,2 \
--feedforward-dim 512,768,1024,1024,1024,768 \
--encoder-dim 192,256,256,256,256,256 \
--encoder-unmasked-dim 192,192,192,192,192,192 \
--decoding-method modified_beam_search
```

The pretrained model is available at: <https://huggingface.co/AmirHussein/zipformer-seame>
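
To try the checkpoint without retraining, it can be pulled from the Hugging Face Hub, for example with `huggingface_hub`; the file layout inside the repo is not spelled out here, so check the model card before wiring it into `decode.py`:

```
from huggingface_hub import snapshot_download

# Downloads the repo into the local HF cache and returns the directory path.
model_dir = snapshot_download(repo_id="AmirHussein/zipformer-seame")
print(model_dir)
```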


### Zipformer-HAT

| decoding method      | dev   | test  | comment                                  |
|----------------------|-------|-------|------------------------------------------|
| modified beam search | 22.00 | 29.92 | --epoch 20, --avg 5, --max-duration 500  |


The training command to reproduce this result is given below:

```
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5"

./zipformer_hat/train.py \
--world-size 4 \
--num-epochs 25 \
--start-epoch 1 \
--base-lr 0.045 \
--lr-epochs 6 \
--use-fp16 1 \
--exp-dir zipformer_hat/exp \
--causal 0 \
--num-encoder-layers 2,2,2,2,2,2 \
--feedforward-dim 512,768,1024,1024,1024,768 \
--encoder-dim 192,256,256,256,256,256 \
--encoder-unmasked-dim 192,192,192,192,192,192 \
--prune-range 10 \
--max-duration 500
```

The decoding command is:
```
## modified beam search
./zipformer_hat/decode.py \
--epoch 25 --avg 5 --use-averaged-model True \
--beam-size 10 \
--causal 0 \
--exp-dir zipformer_hat/exp \
--bpe-model data_seame/lang_bpe_4000/bpe.model \
--max-duration 1000 \
--num-encoder-layers 2,2,2,2,2,2 \
--feedforward-dim 512,768,1024,1024,1024,768 \
--encoder-dim 192,256,256,256,256,256 \
--encoder-unmasked-dim 192,192,192,192,192,192 \
--decoding-method modified_beam_search
```

A pre-trained model and decoding logs can be found at <https://huggingface.co/AmirHussein/zipformer-hat-seame>

### Zipformer-HAT-LID

| decoding method      | dev   | test  | comment                                  |
|----------------------|-------|-------|------------------------------------------|
| modified beam search | 20.04 | 26.91 | --epoch 15, --avg 5, --max-duration 500  |

The training command to reproduce this result is given below:

```
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5"

./zipformer_hat_lid/train.py \
--world-size 4 \
--lid True \
--num-epochs 25 \
--start-epoch 1 \
--base-lr 0.045 \
--use-fp16 1 \
--lid-loss-scale 0.3 \
--exp-dir zipformer_hat_lid/exp \
--causal 0 \
--lid-output-layer 3 \
--num-encoder-layers 2,2,2,2,2,2 \
--feedforward-dim 512,768,1024,1024,1024,768 \
--encoder-dim 192,256,256,256,256,256 \
--encoder-unmasked-dim 192,192,192,192,192,192 \
--lids "<en>,<zh>" \
--prune-range 10 \
--freeze-main-model False \
--use-lid-encoder True \
--use-lid-joiner True \
--lid-num-encoder-layers 2,2,2 \
--lid-downsampling-factor 2,4,2 \
--lid-feedforward-dim 256,256,256 \
--lid-num-heads 4,4,4 \
--lid-encoder-dim 256,256,256 \
--lid-encoder-unmasked-dim 128,128,128 \
--lid-cnn-module-kernel 31,15,31 \
--max-duration 500

```

The decoding command is:
```
## modified beam search
python zipformer_hat_lid/decode.py \
--epoch $epoch --avg 5 --use-averaged-model True \
--beam-size 10 \
--lid False \
--lids "<en>,<zh>" \
--exp-dir zipformer_hat_lid/exp \
--bpe-model data_seame/lang_bpe_4000/bpe.model \
--max-duration 800 \
--num-encoder-layers 2,2,2,2,2,2 \
--feedforward-dim 512,768,1024,1024,1024,768 \
--encoder-dim 192,256,256,256,256,256 \
--encoder-unmasked-dim 192,192,192,192,192,192 \
--decoding-method modified_beam_search \
--lid-output-layer 3 \
--use-lid-encoder True \
--use-lid-joiner True \
--lid-num-encoder-layers 2,2,2 \
--lid-downsampling-factor 2,4,2 \
--lid-feedforward-dim 256,256,256 \
--lid-num-heads 4,4,4 \
--lid-encoder-dim 256,256,256 \
--lid-encoder-unmasked-dim 128,128,128 \
--lid-cnn-module-kernel 31,15,31
```

A pre-trained model and decoding logs can be found at <https://huggingface.co/AmirHussein/zipformer-hat-lid-seame>


56 changes: 56 additions & 0 deletions egs/seame/ASR/local/cer.py
@@ -0,0 +1,56 @@
#!/usr/bin/env python3
# Johns Hopkins University (authors: Amir Hussein)


"""
This file cer from icefall decoded "recogs" file:
id [ref] xxx
id [hyp] yxy
"""

import argparse

import jiwer


def get_args():
parser = argparse.ArgumentParser()
parser.add_argument("--dec-file", type=str, help="Decoded icefall recogs file")

return parser


def cer_(file):
    """Parse the recogs file and print the length-weighted (corpus-level) CER."""
    hyp = []
    ref = []
    cer_results = 0
    ref_lens = 0
    with open(file, "r", encoding="utf-8") as dec:
        for line in dec:
            utt_id, target = line.split("\t", 1)
            utt_id = utt_id[0:-2]  # drop the last two characters of the id (unused below)
            target, txt = target.split("=", 1)
            # txt looks like "['word1', 'word2', ...]"; recover the plain text.
            words = txt.strip().strip("[]").split(", ")
            word_list = [word.strip("'") for word in words]
            if target == "ref":
                ref.append(" ".join(word_list))
            elif target == "hyp":
                hyp.append(" ".join(word_list))
    # Weight each utterance's CER by its reference length so the result
    # is a corpus-level CER rather than a mean of per-utterance CERs.
    for h, r in zip(hyp, ref):
        if r:
            cer_results += jiwer.cer(r, h) * len(r)
            ref_lens += len(r)
    print(cer_results / ref_lens)


def main():
    parser = get_args()
    args = parser.parse_args()
    cer_(args.dec_file)


if __name__ == "__main__":
main()
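
Usage note: the script takes one of the `recogs-*.txt` files written by `decode.py`, e.g. (hypothetical path) `python local/cer.py --dec-file zipformer/exp-asr-seame/modified_beam_search/recogs-dev-epoch-25-avg-5.txt`, and prints a single number: the reference-length-weighted (corpus-level) CER.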