Real-Time ASR with CNN-BiLSTM: End-to-End Live Streaming Using Lightning AI ⚡

LuluW8071/Automatic-Speech-Recognition-with-PyTorch

🚀 End-to-End Automatic Speech Recognition

Model

This project builds a small-scale speech recognition system that transcribes audio input into text. Its acoustic model combines 1D convolutional layers (CNN1D) with a bidirectional LSTM (BiLSTM) and is designed for small datasets and faster ASR (Automatic Speech Recognition) training.

💻 Installation

  • Install the CUDA version of PyTorch for training or the CPU version for inference, then install the remaining dependencies:
    pip install -r requirements.txt

🚀 Usage

1. Dataset Conversion Script

Note

  • The dataset conversion script is designed to convert the CommonVoice dataset to the format required for training the speech recognition model.
  • Use the --not-convert flag to skip the conversion step and export only the dataset paths and utterances in JSON format.
py common_voice.py --file_path path/to/validated.tsv --save_json_path converted_clips --percent 20
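
The path-and-utterance export described above can be sketched roughly as follows. This is not the actual common_voice.py. It assumes validated.tsv is tab-separated with `path` and `sentence` columns, and that `--percent` is the share of clips held out for testing. The helper name `split_common_voice` and the JSON keys `key` and `text` are hypothetical.

```python
import csv
import io
import json
import random


def split_common_voice(tsv_text, percent, seed=0):
    """Return (train, test) lists of {"key", "text"} records.

    Hypothetical helper: the record layout is an assumption, not the
    format the repository's script necessarily writes.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = [
        {"key": row["path"], "text": row["sentence"].strip()}
        for row in reader
        if row.get("path") and row.get("sentence")
    ]
    # Shuffle deterministically so the split is reproducible.
    random.Random(seed).shuffle(rows)
    n_test = len(rows) * percent // 100
    return rows[n_test:], rows[:n_test]


# Tiny in-memory stand-in for validated.tsv.
sample = "path\tsentence\n" + "".join(
    f"clip_{i}.mp3\tHello world {i}\n" for i in range(10)
)
train, test = split_common_voice(sample, percent=20)
print(len(train), len(test))
print(json.dumps(test[0]))
```

In a real run the TSV text would come from the file passed as --file_path, and the two lists would be written out with `json.dump`.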

2. Train the Model

Important

Two model choices are provided in train.py: GRU-based and LSTM-based. Uncomment the one you want and comment out the other. The LSTM performs better because it can capture longer contexts.

py train.py --train_json path/to/train.json --valid_json path/to/test.json \
--epochs 100 \
--batch_size 64 \
--lr 2e-4 \
--grad_clip 0.5 \
--accumulate_grad 2 \
--gpus 1 \
--w 8 \
--checkpoint_path path/to/checkpoint.ckpt
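
To make the interaction between some of these flags concrete, here is a minimal sketch. It assumes `--accumulate_grad` is the number of gradient-accumulation steps, `--gpus` is the number of data-parallel devices, and `--grad_clip` is a maximum gradient norm. The README does not confirm any of this, and both helper functions are hypothetical.

```python
import math


def effective_batch_size(batch_size, accumulate_grad, gpus=1):
    """Samples contributing to one optimizer step (assumed semantics)."""
    return batch_size * accumulate_grad * max(gpus, 1)


def clip_by_norm(grads, max_norm):
    """Rescale a flat list of gradients so its L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total == 0 or total <= max_norm:
        return list(grads)
    scale = max_norm / total
    return [g * scale for g in grads]


# Values taken from the command above.
print(effective_batch_size(batch_size=64, accumulate_grad=2, gpus=1))
print(clip_by_norm([3.0, 4.0], max_norm=0.5))
```

Under these assumptions, the command above updates the weights once per 128 samples, and any gradient whose norm exceeds 0.5 is scaled down to that norm.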

3. Export to TorchScript

python freeze_model.py --model_checkpoint path/to/model.ckpt

4. Run Inference

python engine.py --model_file path/to/optimized_model.pt

Experiment Results

This experiment used ~1,000 hours of audio with 670,000 utterances from Common Voice and my own recordings, split 85% for training and 15% for testing. You can download the trained checkpoint and a small 4-gram KenLM model from here.

Model Configuration

model    hidden_size  num_layers  dropout  n_feats  num_classes
Bi-LSTM  512          2           0.1      128      29
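
One plausible reading of `num_classes = 29` is a character vocabulary of 26 lowercase letters, a space, an apostrophe, and a CTC blank. That breakdown is an assumption and is not stated in this README. The sketch below illustrates it; the symbol order and the `_` blank placeholder are hypothetical.

```python
import string

# Configuration values copied from the table above.
CONFIG = {
    "hidden_size": 512,
    "num_layers": 2,
    "dropout": 0.1,
    "n_feats": 128,
    "num_classes": 29,
}

BLANK = "_"  # stand-in symbol for the CTC blank (assumed)
VOCAB = ["'", " "] + list(string.ascii_lowercase) + [BLANK]
CHAR_TO_ID = {ch: i for i, ch in enumerate(VOCAB)}


def text_to_ids(text):
    """Map a transcript to class indices, dropping unsupported characters."""
    return [
        CHAR_TO_ID[ch]
        for ch in text.lower()
        if ch in CHAR_TO_ID and ch != BLANK
    ]


def ids_to_text(ids):
    """Map class indices back to text, skipping the blank symbol."""
    return "".join(VOCAB[i] for i in ids if VOCAB[i] != BLANK)


print(len(VOCAB) == CONFIG["num_classes"])
print(ids_to_text(text_to_ids("Don't stop")))
```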

Training Results

Loss Curve
[Figure: Losses]
Model    Best Epoch  Val Loss  Avg. Greedy WER  Avg. CTC+KenLM WER
Bi-LSTM  61          0.359     28.44%           ~22-23%
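
The "Greedy" column refers to greedy CTC decoding. As a minimal sketch of that general technique (not the repository's decoder): take the highest-scoring class per frame, collapse consecutive repeats, then drop blanks. The function name and the choice of index 0 as the blank are illustrative assumptions.

```python
def greedy_ctc_decode(frame_scores, blank_id):
    """Greedy CTC decoding over per-frame class scores.

    frame_scores: list of per-frame score lists, one score per class.
    Returns the decoded class indices.
    """
    decoded = []
    prev = None
    for scores in frame_scores:
        # Argmax over classes for this frame.
        best = max(range(len(scores)), key=scores.__getitem__)
        # Keep a symbol only if it differs from the previous frame
        # and is not the blank.
        if best != prev and best != blank_id:
            decoded.append(best)
        prev = best
    return decoded


# Five frames, three classes; class 0 plays the role of the blank here.
scores = [
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.9, 0.05, 0.05],
    [0.2, 0.7, 0.1],
    [0.1, 0.1, 0.8],
]
print(greedy_ctc_decode(scores, blank_id=0))  # [1, 1, 2]
```

The blank between the second and fourth frames is what lets the repeated class 1 survive as two separate symbols.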

Note

A 4-gram LibriSpeech KenLM model was used for inference. If you build your own KenLM model, the WER is likely to be even lower.
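
For reference, word error rate (WER) is conventionally the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below implements that standard definition. It is not the repository's evaluation code and may differ from it, for example in text normalization.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the ref words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(
                min(
                    prev[j] + 1,             # deletion
                    cur[j - 1] + 1,          # insertion
                    prev[j - 1] + (r != h),  # substitution or match
                )
            )
        prev = cur
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the): 2 errors / 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.2%}")
```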


📄 License

This project is licensed under the GNU License. See the LICENSE file for details.


This guide should help you set up and use the speech recognition system. If you encounter any issues or have questions, feel free to reach out or submit an issue in the repository.
