Porting OpenAI Whisper speech recognition to edge devices with hardware ML accelerators, enabling always-on live voice transcription. Current work includes Jetson Nano and Coral Edge TPU.
| Part | Price (2023) |
| --- | --- |
| NVIDIA Jetson Nano Developer Kit (4GB) | $149.00 |
| ChanGeek CGS-M1 USB Microphone | $16.99 |
| Noctua NF-A4x10 5V Fan (or similar, recommended) | $13.95 |
| D-Link DWA-181 Wi-Fi Adapter (or similar, optional) | $21.94 |
The `base.en` version of Whisper seems to work best for the Jetson Nano:
- `base` is the largest model size that fits into the 4GB of memory without modification.
- Inference performance with `base` is ~10x real-time in isolation and ~1x real-time while recording concurrently (see the timing sketch below).
- Using the English-only `.en` version further improves WER (<5% on LibriSpeech test-clean).
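As a sanity check of those numbers, you can time a one-off transcription with the openai-whisper Python API. This is a minimal sketch, assuming the `openai-whisper` package is installed and that `sample.wav` stands in for any speech recording you have on hand:

```python
import time

import whisper  # openai-whisper package

# Load the English-only base model, as used on the Jetson Nano.
model = whisper.load_model("base.en")

# "sample.wav" is a placeholder; whisper.load_audio resamples it to 16 kHz mono.
audio = whisper.load_audio("sample.wav")
audio_seconds = len(audio) / 16000

start = time.perf_counter()
result = model.transcribe(audio, language="en")
elapsed = time.perf_counter() - start

# A real-time factor above 1 means faster than real time.
print(f"Transcript: {result['text']!r}")
print(f"Real-time factor: {audio_seconds / elapsed:.1f}x")
```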
Dilemma:
- Whisper and some of its dependencies require Python 3.8.
- The latest supported JetPack release for the Jetson Nano is 4.6.3, which ships Python 3.6.
- There is no easy way to move to Python 3.8 without losing CUDA support for PyTorch (see the quick check below).
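You can see the conflict on a stock JetPack image with a quick check. This is a small sketch, assuming the NVIDIA-provided PyTorch wheel for JetPack is installed:

```python
import sys

import torch

# JetPack 4.6.x ships Python 3.6, but Whisper's dependencies want >= 3.8.
print("Python:", sys.version.split()[0])

# The CUDA-enabled PyTorch wheels for this JetPack release target Python 3.6,
# so a hand-rolled Python 3.8 environment typically reports False here.
print("PyTorch:", torch.__version__, "CUDA available:", torch.cuda.is_available())
```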
Workaround: run inference inside a custom NVIDIA Docker container, which bundles a compatible Python and a CUDA-enabled PyTorch build (see the build step below).

First, follow the developer kit setup instructions, connect the Wi-Fi adapter and the microphone to USB, and ideally install a fan. (Plugging in an Ethernet cable also speeds up the downloads.) Then, get a shell on the Jetson Nano: attach it to your computer via USB and open a serial console, e.g. with `screen` on Linux:

```
screen /dev/ttyUSB0 115200
```

or with PuTTY on Windows. You'll be prompted to log in with the default credentials:

```
login: alex
password: arribada
```
For the demo, the container should already be built; you can skip this step and go straight to launching inference below.
We will use NVIDIA Docker containers to run inference. Get the source code and build the custom container:
```bash
git clone https://github.com/arribada/whisper-edge-demo.git whisper-edge-arribada
bash whisper-edge-arribada/build.sh
```
Launch inference:
```bash
bash whisper-edge-arribada/run.sh
```
You should see console output similar to this:
```
I0317 00:42:23.979984 547488051216 stream.py:75] Loading model "base.en"...
100%|#######################################| 139M/139M [00:30<00:00, 4.71MiB/s]
I0317 00:43:14.232425 547488051216 stream.py:79] Warming model up...
I0317 00:43:55.164070 547488051216 stream.py:86] Starting stream...
I0317 00:44:19.775566 547488051216 stream.py:51]
I0317 00:44:22.046195 547488051216 stream.py:51]
I0317 00:44:49.219501 547488051216 stream.py:51] Start speaking now to see the transcription!
```
Below is a short script you can read aloud to demo the transcription in real time:
> As the sun set, I couldn't help but admire the dolphins jumping out of the water, with seagulls flying overhead.
> It's a beautiful scene, but there's a problem on my mind: bycatch.
> You see, I'm a fisherman, and my family depends on our daily catch.
> But sometimes, our nets unintentionally trap dolphins, whales, and other creatures, instead of the sharks and seals we're targeting.
The demo will highlight the keywords programmed into it in green as they appear in the transcription.
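The keyword list and highlighting logic live in the demo code itself; the sketch below only illustrates the general idea of ANSI green highlighting, with an illustrative keyword list rather than the one actually programmed into the demo:

```python
import re

# Illustrative keywords only; the demo ships its own list.
KEYWORDS = ["bycatch", "dolphins", "whales", "sharks", "seals"]

GREEN = "\033[92m"
RESET = "\033[0m"

def highlight(text: str) -> str:
    """Wrap any keyword occurrences in ANSI green escape codes."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, KEYWORDS)) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{GREEN}{m.group(0)}{RESET}", text)

print(highlight("Our nets unintentionally trap dolphins, whales, and other creatures."))
```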
The `stream.py` script run in the container accepts flags for different configurations (the default flags should work for the demo):
```bash
bash whisper-edge-arribada/run.sh --help
```

```
USAGE: stream.py [flags]

flags:

stream.py:
  --channel_index: The index of the channel to use for transcription.
    (default: '0')
    (an integer)
  --chunk_seconds: The length in seconds of each recorded chunk of audio.
    (default: '10')
    (an integer)
  --input_device: The input device used to record audio.
    (default: 'plughw:2,0')
  --language: The language to use or empty to auto-detect.
    (default: 'en')
  --latency: The latency of the recording stream.
    (default: 'low')
  --model_name: The version of the OpenAI Whisper model to use.
    (default: 'base.en')
  --num_channels: The number of channels of the recorded audio.
    (default: '1')
    (an integer)
  --sample_rate: The sample rate of the recorded audio.
    (default: '16000')
    (an integer)

Try --helpfull to get a list of all flags.
```
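These flags map onto a straightforward record-then-transcribe loop. The sketch below is not the actual stream.py, just a minimal approximation of the same idea using arecord (matching the ALSA device naming above) and the openai-whisper API:

```python
import subprocess

import numpy as np
import whisper

MODEL_NAME = "base.en"
INPUT_DEVICE = "plughw:2,0"  # see `arecord -l` for your device
SAMPLE_RATE = 16000
CHUNK_SECONDS = 10

model = whisper.load_model(MODEL_NAME)

while True:
    # Record one chunk of raw signed 16-bit PCM from the USB microphone.
    pcm = subprocess.run(
        ["arecord", f"--device={INPUT_DEVICE}", "--format=S16_LE",
         f"--rate={SAMPLE_RATE}", "--channels=1",
         f"--duration={CHUNK_SECONDS}", "--file-type=raw", "--quiet"],
        check=True, stdout=subprocess.PIPE,
    ).stdout

    # Convert to the float32 waveform that Whisper expects.
    audio = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0

    text = model.transcribe(audio, language="en")["text"].strip()
    if text:
        print(text)
```

Note that back-to-back arecord calls leave small gaps between chunks, which is one reason a dedicated streaming script is preferable to this sketch.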
To see if the microphone is working properly, use `alsa-utils`:
```bash
sudo apt-get -y install alsa-utils

# Is the USB device connected?
lsusb

# Is the correct recording device selected?
arecord -l

# Is the gain set properly?
alsamixer

# Does a test recording work?
arecord --format=S16_LE --duration=5 --rate=16000 --channels=1 --device=plughw:2,0 test.wav
```
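For a more objective check of the gain, you can inspect the peak level of test.wav. This is a quick sketch, assuming the S16_LE mono format used above and that numpy is available:

```python
import wave

import numpy as np

# Read the raw 16-bit samples from the test recording made with arecord.
with wave.open("test.wav", "rb") as wav:
    pcm = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

peak = np.abs(pcm).max() / 32768.0
print(f"Peak level: {peak:.0%} of full scale")
# Near 0% suggests the mic or gain is off; near 100% suggests clipping.
```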