Disable tf4micro agc and ns training for mWW #9
StuartIanNaylor started this conversation in Ideas
-
I'm going to tap in @gnumpi to provide his thoughts here.
-
Sounds like a valid point you've come up with. I have to say that I haven't spent much time on improving mWW yet. It's on my todo list; if it's ok, I will ping you when I get there and we can discuss this further. Thanks for reporting your findings here!
-
The XMOS chip, specifically purchased in for speech enhancement, has a more advanced tflite-model-based NS/voice-extraction scheme, but the pipeline still forces the audio through the NS and AGC of TF4micro.
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/experimental/microfrontend
Even if it were of comparable quality to the XMOS solution, it is a duplication of ops that are already done first on the XMOS.
esphome/home-assistant-voice-pe#279 (comment)
The reply of 'no plans' uses a strange chicken-and-egg argument: the XMOS audio is there, being forced through the TF4micro NS and AGC that were trained in, and it seems no one has tried training with those features disabled and then running the resulting model with the frontend features turned off.
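To make that suggestion concrete, here is a minimal sketch of generating training features with the TF4micro frontend's noise reduction neutralised and PCAN gain control (the "AGC" stage) switched off, using the TFLite microfrontend Python op. The window/channel values and the idea of disabling NS/AGC at feature-generation time are my assumptions for illustration, not the current microWakeWord training code.

```python
import tensorflow as tf
from tensorflow.lite.experimental.microfrontend.python.ops import (
    audio_microfrontend_op as frontend_op,
)

def features_without_ns_agc(audio_int16, sample_rate=16000):
    """Spectrogram features with noise reduction neutralised and PCAN (AGC) off."""
    return frontend_op.audio_microfrontend(
        tf.convert_to_tensor(audio_int16, dtype=tf.int16),
        sample_rate=sample_rate,
        window_size=30,            # ms; match whatever the deployed model expects
        window_step=10,            # ms
        num_channels=40,
        min_signal_remaining=1.0,  # keep 100% of the signal -> noise reduction becomes a no-op
        enable_pcan=False,         # disable per-channel gain control (the "AGC" stage)
        enable_log=True,
    )
```

A model trained on these features could then be run on-device with the corresponding frontend stages turned off, which is exactly the experiment that apparently hasn't been tried.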
Capturing the audio from the XMOS algorithms via firmware, either to provide a constant Wyoming broadcast or by enabling the USB Audio Class driver provided by XMOS, should be fairly straightforward.
The professional way would be to run the XMOS tflite and processing algorithms on a dev machine and push a dataset through that processing at far higher rates than the sample-rate limits of the mic hardware allow.
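As a rough illustration of that offline approach, assuming an enhancement model were available as a .tflite file (the file name and the frame-in/frame-out tensor layout below are hypothetical, not the actual XMOS interface), a dev machine could push whole datasets through it far faster than real time:

```python
import numpy as np
import soundfile as sf
import tensorflow as tf

# Hypothetical stand-in for the XMOS enhancement model exported as tflite.
interpreter = tf.lite.Interpreter(model_path="xmos_like_enhancer.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
frame_len = int(inp["shape"][-1])  # assumed: fixed-length audio frames in, frames out

def enhance_file(in_wav, out_wav):
    audio, sr = sf.read(in_wav, dtype="float32")
    frames = []
    for start in range(0, len(audio) - frame_len + 1, frame_len):
        frame = audio[start:start + frame_len].reshape(inp["shape"]).astype(np.float32)
        interpreter.set_tensor(inp["index"], frame)
        interpreter.invoke()
        frames.append(interpreter.get_tensor(out["index"]).reshape(-1))
    sf.write(out_wav, np.concatenate(frames), sr)
```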
A constant Wyoming broadcast would likely be the easiest method, as it just means removing mWW from the already-implemented pipeline.
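As a stand-in sketch of what a constant broadcast could look like, the snippet below simply streams raw 16 kHz / 16-bit mono PCM to any TCP client; the real Wyoming protocol wraps chunks in JSON-framed events, which is deliberately left out here, and capture_post_xmos_frame() is a hypothetical placeholder for however the firmware exposes the processed audio.

```python
import asyncio

CHUNK_SAMPLES = 1024  # samples per broadcast chunk

def capture_post_xmos_frame() -> bytes:
    """Hypothetical: return CHUNK_SAMPLES of 16-bit PCM from the XMOS output."""
    return b"\x00\x00" * CHUNK_SAMPLES

async def handle_client(reader, writer):
    try:
        while True:
            writer.write(capture_post_xmos_frame())
            await writer.drain()
            await asyncio.sleep(CHUNK_SAMPLES / 16000)  # pace roughly at real time
    except ConnectionResetError:
        pass
    finally:
        writer.close()

async def main():
    server = await asyncio.start_server(handle_client, "0.0.0.0", 10700)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```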
This leads into the training: the XMOS was purchased in to provide 3 m far-field voice extraction, yet, blind to what the XMOS algorithms actually do, RIRs recorded at 1.5 m from a single mic are being applied to create the training dataset.
kahrendt/microWakeWord#28 (comment)
I don't think it's possible not to be critical of the current dataset creation, from Piper generating 1,000 samples with little variation, to applying pre-recorded RIRs at a single fixed distance from locations such as forests and massive structures, for a device that is meant to sit at home and provide 3 m far-field in relatively normal-sized rooms.
Supposedly, because the XMOS algorithm is a far-field voice extractor, these RIRs should already be attenuated, but unfortunately the devs and the community are currently totally blind as to what level.
The reason so much research and so many papers put dereverberation and audio clean-up at the start of the pipeline, before processing, is the huge problem that you cannot simply train reverberation handling into a basic classification algorithm fed the mixed, resultant mono audio stream.
OHF-Voice/wake-word-collective#11 (comment)
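As a hedged sketch of what home-appropriate augmentation could look like, the snippet below simulates shoebox rooms and randomizes the source-to-mic distance up to ~3 m instead of reusing fixed-distance outdoor RIRs. The room sizes, absorption and the use of pyroomacoustics are illustrative assumptions, not the existing microWakeWord augmentation code.

```python
import numpy as np
import pyroomacoustics as pra
import soundfile as sf

rng = np.random.default_rng(0)

def reverberate_in_home_room(wav_path, out_path, fs=16000):
    speech, _ = sf.read(wav_path, dtype="float32")
    if speech.ndim > 1:
        speech = speech[:, 0]                       # mono only for this sketch
    # Typical living-room-ish dimensions in metres, lightly randomized.
    dims = rng.uniform([3.0, 3.0, 2.4], [6.0, 5.0, 2.8])
    room = pra.ShoeBox(dims, fs=fs, materials=pra.Material(0.35), max_order=12)
    mic = dims / 2                                  # mic roughly in the middle of the room
    direction = rng.normal(size=3)
    direction /= np.linalg.norm(direction)
    src = mic + direction * rng.uniform(0.5, 3.0)   # speaker 0.5 m to 3 m from the mic
    src = np.clip(src, 0.3, dims - 0.3)             # keep the source inside the walls
    room.add_source(src, signal=speech)
    room.add_microphone(mic)
    room.simulate()
    sf.write(out_path, room.mic_array.signals[0], fs)
```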
There are so many basic 101 errors in the whole approach to the purchased-in hardware and the model training that, if someone comfortable with microcontroller work could provide the output from the XMOS algorithms, at least the community would no longer be blind.
I have tacked some ideas onto kahrendt/microWakeWord#28 (comment), but this blind situation is crazy and surely can be rectified.
The misconceptions about training are huge, especially around the importance of the size and quality of the datasets needed.
https://developers.google.com/machine-learning/crash-course/overfitting/data-characteristics
Even if tf4micro doesn't work with an on-device training scheme, it is shortsighted not to be capturing actual use data; maybe the ops recouped from NS & AGC could support a rolling window that captures the KW as well as the streamed voice command...
https://ai.google.dev/edge/litert/models/ondevice_training
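As a host-side illustration of that rolling-window idea, a small ring buffer keeps a couple of seconds of pre-trigger audio so the wake word itself (not just the command that follows) could be saved when the user has opted in; frame sizes, durations and the opt-in flag are assumptions for illustration.

```python
import collections
import wave

RATE = 16000
FRAME_SAMPLES = 320                       # 20 ms of 16-bit mono audio per frame
PRE_ROLL_SECONDS = 2.0                    # enough to contain the spoken wake word
ring = collections.deque(maxlen=int(PRE_ROLL_SECONDS * RATE / FRAME_SAMPLES))

def on_audio_frame(frame: bytes):
    """Call for every captured frame; keeps only the last PRE_ROLL_SECONDS."""
    ring.append(frame)

def on_wake_word_detected(opted_in: bool, path="wake_capture.wav"):
    """If the user opted in, dump the pre-roll (which contains the KW) to disk."""
    if not opted_in:
        return
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(RATE)
        wav.writeframes(b"".join(ring))
```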
That data, with an opt-in, could be submitted to create gold-standard datasets over time, putting open source on an equal footing with big data.
The current approaches seem far out of line with basic ML concepts: we have no idea about the data we are receiving, and the current dataset is created on total guesswork...
Sadly, when you do point these basic errors out, the threads are closed off without the problems even being approached, in what could be described as closed-source methods and actions.