Disable tf4micro agc and ns training for mWW #9
StuartIanNaylor started this conversation in Ideas
-
I'm going to tap in @gnumpi to provide his thoughts here.
-
Sounds like a valid point you've come up with. I have to say that I haven't spent much time on improving mWW yet. It's on my todo list; if it's ok, I will ping you when I get there and we can discuss this further. Thanks for reporting your findings here!
-
The XMOS chip, specifically purchased in for speech enhancement, has a more advanced tflite-model-based NS/voice-extraction scheme, but the pipeline still forces the audio through the NS and AGC of TF4micro.
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/experimental/microfrontend
Even if it were of comparable quality to the XMOS solution, it is a duplication of ops that are already done first on the XMOS.
esphome/home-assistant-voice-pe#279 (comment)
The reply of 'no plans' uses a strange chicken-and-egg argument: the XMOS audio is there, being forced through the TF4micro NS and AGC that were trained in, and it seems no one has tried training with those features disabled and then running the resulting model with the frontend features turned off.
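To make that suggestion concrete, here is a minimal sketch of generating training features with the TF4micro frontend's noise reduction neutralised and PCAN gain control (the "AGC" stage) switched off, using the TFLite microfrontend Python op. The window/channel values and the idea of disabling NS/AGC at feature-generation time are my assumptions for illustration, not the current microWakeWord training code.

```python
import tensorflow as tf
from tensorflow.lite.experimental.microfrontend.python.ops import (
    audio_microfrontend_op as frontend_op,
)

def features_without_ns_agc(audio_int16, sample_rate=16000):
    """Spectrogram features with noise reduction neutralised and PCAN (AGC) off."""
    return frontend_op.audio_microfrontend(
        tf.convert_to_tensor(audio_int16, dtype=tf.int16),
        sample_rate=sample_rate,
        window_size=30,            # ms; match whatever the deployed model expects
        window_step=10,            # ms
        num_channels=40,
        min_signal_remaining=1.0,  # keep 100% of the signal -> noise reduction becomes a no-op
        enable_pcan=False,         # disable per-channel gain control (the "AGC" stage)
        enable_log=True,
    )
```

A model trained on these features could then be run on-device with the corresponding frontend stages turned off, which is exactly the experiment that apparently hasn't been tried.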
Capturing the audio from the XMOS algorithms via firmware, either to provide a constant Wyoming broadcast or by enabling the USB Audio Class driver provided by XMOS, should be fairly straightforward.
The professional way would be to run the XMOS tflite and processing algorithms on a dev machine and push a dataset through that processing at far higher rates than the sample-rate limits of the mic hardware allow.
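As a rough illustration of that offline approach, assuming an enhancement model were available as a .tflite file (the file name and the frame-in/frame-out tensor layout below are hypothetical, not the actual XMOS interface), a dev machine could push whole datasets through it far faster than real time:

```python
import numpy as np
import soundfile as sf
import tensorflow as tf

# Hypothetical stand-in for the XMOS enhancement model exported as tflite.
interpreter = tf.lite.Interpreter(model_path="xmos_like_enhancer.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
frame_len = int(inp["shape"][-1])  # assumed: fixed-length audio frames in, frames out

def enhance_file(in_wav, out_wav):
    audio, sr = sf.read(in_wav, dtype="float32")
    frames = []
    for start in range(0, len(audio) - frame_len + 1, frame_len):
        frame = audio[start:start + frame_len].reshape(inp["shape"]).astype(np.float32)
        interpreter.set_tensor(inp["index"], frame)
        interpreter.invoke()
        frames.append(interpreter.get_tensor(out["index"]).reshape(-1))
    sf.write(out_wav, np.concatenate(frames), sr)
```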
A constant Wyoming broadcast would likely be the easiest method, as it just means removing mWW from the already-implemented pipeline.
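As a stand-in sketch of what a constant broadcast could look like, the snippet below simply streams raw 16 kHz / 16-bit mono PCM to any TCP client; the real Wyoming protocol wraps chunks in JSON-framed events, which is deliberately left out here, and capture_post_xmos_frame() is a hypothetical placeholder for however the firmware exposes the processed audio.

```python
import asyncio

CHUNK_SAMPLES = 1024  # samples per broadcast chunk

def capture_post_xmos_frame() -> bytes:
    """Hypothetical: return CHUNK_SAMPLES of 16-bit PCM from the XMOS output."""
    return b"\x00\x00" * CHUNK_SAMPLES

async def handle_client(reader, writer):
    try:
        while True:
            writer.write(capture_post_xmos_frame())
            await writer.drain()
            await asyncio.sleep(CHUNK_SAMPLES / 16000)  # pace roughly at real time
    except ConnectionResetError:
        pass
    finally:
        writer.close()

async def main():
    server = await asyncio.start_server(handle_client, "0.0.0.0", 10700)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```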
This leads into the training: the XMOS was purchased in to provide 3 m far-field voice extraction, yet, blind to what the XMOS algorithms actually do, RIRs recorded at 1.5 m from a single mic are being applied to create the training dataset.
kahrendt/microWakeWord#28 (comment)
I don't think it's possible not to be critical of the current dataset creation, from Piper generating 1,000 samples with little variation, to applying pre-recorded RIRs at a single fixed distance from locations such as forests and massive structures, for a device that is meant to sit at home and provide 3 m far-field in relatively normal-sized rooms.
Supposedly, because the XMOS algorithm is a far-field voice extractor, these RIRs should already be attenuated, but unfortunately the devs and the community are currently totally blind as to what level.
The reason so much research and so many papers put dereverberation and audio clean-up at the start of the pipeline, before processing, is the huge problem that you cannot simply train reverberation handling into a basic classification algorithm fed the mixed, resultant mono audio stream.
OHF-Voice/wake-word-collective#11 (comment)
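As a hedged sketch of what home-appropriate augmentation could look like, the snippet below simulates shoebox rooms and randomizes the source-to-mic distance up to ~3 m instead of reusing fixed-distance outdoor RIRs. The room sizes, absorption and the use of pyroomacoustics are illustrative assumptions, not the existing microWakeWord augmentation code.

```python
import numpy as np
import pyroomacoustics as pra
import soundfile as sf

rng = np.random.default_rng(0)

def reverberate_in_home_room(wav_path, out_path, fs=16000):
    speech, _ = sf.read(wav_path, dtype="float32")
    if speech.ndim > 1:
        speech = speech[:, 0]                       # mono only for this sketch
    # Typical living-room-ish dimensions in metres, lightly randomized.
    dims = rng.uniform([3.0, 3.0, 2.4], [6.0, 5.0, 2.8])
    room = pra.ShoeBox(dims, fs=fs, materials=pra.Material(0.35), max_order=12)
    mic = dims / 2                                  # mic roughly in the middle of the room
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)
    src = mic + direction * rng.uniform(0.5, 3.0)   # speaker 0.5 m to 3 m from the mic
    src = np.clip(src, 0.3, dims - 0.3)             # keep the source inside the walls
    room.add_source(src, signal=speech)
    room.add_microphone(mic)
    room.simulate()
    sf.write(out_path, room.mic_array.signals[0], fs)
```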
There are so many basic 101 errors in the whole approach to the purchased-in hardware and the model training that, if someone comfortable with microcontroller work could provide the output from the XMOS algorithms, at least the community would no longer be blind.
I have tacked some ideas onto kahrendt/microWakeWord#28 (comment), but this blind situation is crazy and surely can be rectified.
The misconceptions about training are huge, especially around the importance of the size and quality of the datasets needed.
https://developers.google.com/machine-learning/crash-course/overfitting/data-characteristics
Even if tf4micro doesn't work with an on-device training scheme, it is shortsighted not to be capturing actual use data; maybe the ops recouped from NS & AGC could support a rolling window that captures the KW as well as the streamed voice command...
https://ai.google.dev/edge/litert/models/ondevice_training
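As a host-side illustration of that rolling-window idea, a small ring buffer keeps a couple of seconds of pre-trigger audio so the wake word itself (not just the command that follows) could be saved when the user has opted in; frame sizes, durations and the opt-in flag are assumptions for illustration.

```python
import collections
import wave

RATE = 16000
FRAME_SAMPLES = 320                       # 20 ms of 16-bit mono audio per frame
PRE_ROLL_SECONDS = 2.0                    # enough to contain the spoken wake word
ring = collections.deque(maxlen=int(PRE_ROLL_SECONDS * RATE / FRAME_SAMPLES))

def on_audio_frame(frame: bytes):
    """Call for every captured frame; keeps only the last PRE_ROLL_SECONDS."""
    ring.append(frame)

def on_wake_word_detected(opted_in: bool, path="wake_capture.wav"):
    """If the user opted in, dump the pre-roll (which contains the KW) to disk."""
    if not opted_in:
        return
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(RATE)
        wav.writeframes(b"".join(ring))
```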
That data, with an opt-in, could be submitted to create gold-standard datasets over time, putting open source on an equal footing with big data.
The current approaches seem far out of line with basic ML concepts: we have no idea about the data we are receiving, and the current dataset is created on total guesswork...
Sadly, when you do point these basic errors out, the threads are closed off without the problems even being approached, in what could be described as closed-source methods and actions.