Skip to content

Latest commit

 

History

History
240 lines (174 loc) · 9.71 KB

README.md

File metadata and controls

240 lines (174 loc) · 9.71 KB

Kaldi Models in Tensorflow Lite

CI

  • This repo contains tensorflow python code defining components in the typical Kaldi pipelines, such as those involving x-vector models for speaker ID and diarization.

  • The components are defined as tensorflow Model or Layer classes using regular tensorflow ops which can then easily be assembled and converted to Tensorflow Lite model format for inference.

  • This project still a work in progress. I hope to add more examples and expand the feature set soon.

Installation

# Doing this in a virtual environment is recommended. 
git clone https://github.com/shahruk10/kaldi-tflite
cd kaldi-tflite && pip install -e .

Implemented Layers

Usage

  • TODO: organize and expand this.

Quickstart

  • To build and initialize an x-vector extractor model using pre-trained weights:
#!/usr/bin/env python3

import librosa
import numpy as np

import tensorflow as tf
import kaldi_tflite as ktf

# Config file specifying feature extracting configs, nnet3 model config and weight paths. 
cfgPath = "data/tflite_models/0008_sitw_v2_1a.yml"

# Building and initializing model; Input = raw audio samples (between +/-32767).
# Output = x-vector with LDA and length normalization applied.
mdl = ktf.models.XvectorExtractorFromConfig(cfgPath, name="wav2xvec")
print(mdl.summary())


# Loading a test audio file; librosa converts samples to be within +/- 1.0; but
# Kaldi and this library expects +/- 32767.0 (i.e. int16 max).
inputFile = "kaldi_tflite/lib/testdata/librispeech_2.wav"
samples, _ = librosa.load(inputFile, sr=None)
samples = np.float32(samples * 32767.0)

# Adding a batch axis.
samples = samples.reshape(1, -1)

# Extract x-vectors.
xvec = mdl(samples)
print(xvec)
  • The sections below show a more detailed step-by-step process of building models in different ways using the layers and functionality implemented in this repo.

Creating a feature extraction model

  • We can use the traditional ways of defining a tensorflow / keras model that extract features from provided raw audio samples using the layers defined in the module. The example uses the Sequential model structure from keras, but we can easily use Functional or sub-classed models for more flexibility.
#!/usr/bin/env python3

import tensorflow as tf
import kaldi_tflite as ktf

# Defining MFCC feature extractor + CMVN. Input = raw audio samples (between +/-32767).
mfcc = tf.keras.models.Sequential([
    tf.keras.layers.Input((None,)),
    ktf.layers.Framing(dynamic_input_shape=True),
    ktf.layers.MFCC(num_mfccs=30, num_mels=30),
    ktf.layers.CMVN(center=True, window=200, norm_vars=False),
], name="wav2mfcc")

print(mfcc.summary())
  • To filter silent frames, we can use the VAD layer in a functional model.

  • WARNING: Using the default layer config, the VAD can be used with a batch size of 1 only, since the number of 'active' frames in different entries in a batch won't necessarily be the same. The workaround for this is setting return_indexes=False in the layer config, which will cause it to output a binary mask instead (active frames = 1, silent frames = 0), which can be used as necessary.

#!/usr/bin/env python3

import tensorflow as tf
import kaldi_tflite as ktf

# Defining MFCC feature extractor + VAD + CMVN. Input = raw audio samples (between +/-32767).

# Input layer.
i = tf.keras.layers.Input((None,))

# DSP layers.
framingLayer = ktf.layers.Framing(dynamic_input_shape=True)
mfccLayer = ktf.layers.MFCC(num_mfccs=30, num_mels=30)
cmvnLayer = ktf.layers.CMVN(center=True, window=200, norm_vars=False)
vadLayer = ktf.layers.VAD()

# Creating frames and computing MFCCs.
x = framingLayer(i)
x = mfccLayer(x)

# Computing VAD, which returns indexes of 'active' frames by default. But it can be made to
# output a mask instead as well. See layer configs.
activeFrames = vadLayer(x)
x = tf.gather_nd(x, activeFrames)

# The `tf.gather_nd` above removes the batch dimension, so we add one back.
x = tf.expand_dims(x, 0)

# Computing CMVN
x = cmvnLayer(x)

mfcc = tf.keras.Model(inputs=i, outputs=x, name="wav2mfcc")

print(mfcc.summary())

Creating a TDNN model

  • Creating a trainable TDNN model for extracting x-vectors following the architecture used in Kaldi recipes like this one:
# Defining x-vector model. Input = MFCC frames with 30 mfcc coefficients.
xvec = tf.keras.models.Sequential([
    tf.keras.layers.Input((None, 30)),
    ktf.layers.TDNN(512, context=[-2, -1, 0, 1, 2], activation="relu", name="tdnn1.affine"),
    ktf.layers.BatchNorm(name="tdnn1.batchnorm"),
    ktf.layers.TDNN(512, context=[-2, 0, 2], activation="relu", name="tdnn2.affine"),
    ktf.layers.BatchNorm(name="tdnn2.batchnorm"),
    ktf.layers.TDNN(512, context=[-3, 0, 3], activation="relu", name="tdnn3.affine"),
    ktf.layers.BatchNorm(name="tdnn3.batchnorm"),
    ktf.layers.TDNN(512, context=[0], activation="relu", name="tdnn4.affine"),
    ktf.layers.BatchNorm(name="tdnn4.batchnorm"),
    ktf.layers.TDNN(1500, context=[0], activation="relu", name="tdnn5.affine"),
    ktf.layers.BatchNorm(name="tdnn5.batchnorm"),
    ktf.layers.StatsPooling(left_context=0, right_context=10000, reduce_time_axis=True),
    ktf.layers.TDNN(512, context=[0], name="tdnn6.affine"),
], name="mfcc2xvec")

print(xvec.summary())

Initializing model weights from a Kaldi nnet3 model file

  • Models created using this library can be initialized using weights from Kaldi nnet3 model files. For the TDNN model in the example above, you can download it from the Kaldi website (SITW Xvector System 1a). Then we can load the nnet3 file and initialize our tensorflow model.
# Loading model weights from kaldi nnet3 file. For x-vector models,
# only TDNN and BatchNorm layers need to be initialized. We match
# the layers with components in the nnet3 model using names.
nnet3Path = "path/to/final.raw"
nnet3 = ktf.io.KaldiNnet3Reader(nnet3Path, True)
for layer in xvec.layers:
    try:
        layer.set_weights(nnet3.getWeights(layer.name))
    except KeyError:
        print(f"'{layer.name}' not found in nnet3 model, skipping")

Combining feature extraction with neural network

  • Feature extraction layers or models can be combined with other tensorflow layers to create a single model.
# Combined feature extraction and x-vector models.
mdl = tf.keras.models.Sequential([
    tf.keras.layers.Input((None,)),
    mfcc,
    xvec,
], name="wav2xvec")

print(mdl.summary())
  • For applications involving long audio streams, it is recommended to keep the feature extraction and neural networks separate as models to facililate processing chunks of frames at a time. The two tensorflow lite models can be made to work together with some glue code to pad and stream frames to each as required.

Saving and Converting to a Tensorflow Lite model

  • To convert the model to a tensorflow lite model, we first need to save the model to disk as a SavedModel. Then we can use the SavedModel2TFlite method to convert and optimize the model for Tensorflow Lite.
# Set the paths where the models will be saved.
savedMdl = "./my-model"
tfliteMdl = "./my-model.tflite"

# If you want to optimize the model for embedded devices, set this to True. This will
# cause the tensorflow lite model to be optimized for size and latency, as well as
# utilize use ARM's Neon extensions for math computations. (This may cause the model
# to actually run *slower* on x86 systems that don't use Neon).
optimize=False
targetDtypes = [ tf.float32 ]

mdl.save(savedMdl)
ktf.models.SavedModel2TFLite(savedMdl, tfliteMdl, optimize=optimize, target_dtypes=targetDtypes)

Related Projects

  • There are a couple of related projects out there that served as very useful reference, especially for the feature extraction layers:

    • kaldifeat: Kaldi feature extraction implemented in NumPy
    • torchaudio: Kaldi feature extraction implemented in PyTorch
    • zig-audio: SPTK feature extraction implemented in Zig

License

Apache License 2.0