Whisper #1268
Replies: 10 comments 22 replies
-
Hey, kudos, this is a nice achievement! Did you figure out if one of the models is particularly responsible for the slowdown? TBH it may be all of them... Huge models on "non-small" computers were never a priority.
-
Yes, both of them, but the encoder model is especially slow. I thought it wasn't a particularly large model; it's quite possible to run it on a mobile device: https://github.com/ggerganov/whisper.cpp/tree/master/examples/whisper.objc My goal is to make something like that, but with Rust. I've seen commits related to fp16, maybe I should look in that direction?
-
Top 3 operators for the encoder (Intel). Pretty big surprise: Softmax dominates at 50%, not the matrix product. The Softmax implementation in tract is pretty simple, so there is potential for big gains there. The matrix multiplier is at only 26% (it usually clocks in at 85%-95% or so).
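For context, a softmax over an axis is conceptually just a max-shift, an exp() pass and a normalisation, which is why a straightforward implementation spends most of its time in exp(). Here is a minimal Rust sketch of that shape of code (using the ndarray crate; this is only an illustration of the general idea, not tract's actual kernel):

```rust
use ndarray::{Array2, Axis};

/// Numerically stable softmax over the last axis of a 2D array.
/// A plain per-row loop like this is roughly what a "simple" softmax
/// looks like; the exp() calls are where the time goes.
fn softmax(x: &Array2<f32>) -> Array2<f32> {
    let mut out = x.clone();
    for mut row in out.axis_iter_mut(Axis(0)) {
        // Subtract the row max so exp() cannot overflow.
        let max = row.fold(f32::NEG_INFINITY, |a, &b| a.max(b));
        row.mapv_inplace(|v| (v - max).exp());
        let sum: f32 = row.sum();
        row.mapv_inplace(|v| v / sum);
    }
    out
}

fn main() {
    let logits = Array2::from_shape_vec((2, 3), vec![1.0, 2.0, 3.0, 0.5, 0.5, 0.5]).unwrap();
    println!("{:?}", softmax(&logits));
}
```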
-
For the decoder, it's as I expected. The top two operators are the matrix multiplications, and they account for more than 95% of the time.
-
The decoder starts with 12 big products, each of them weighing nearly 4GF. tract runs them at 67GF/s on one core, which feels about right on this PC.
Do you know how the Python runner runs the model? I assume it leverages multiple cores and/or the GPU? Also: on my PC, the chunk-level parallelism in the example only produces 4 batches, so we are running on four threads. My PC has 16 cores, so internal parallelism at the matmul level could bring a 4x speedup.
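Back-of-the-envelope: 12 products at roughly 4GF each at 67GF/s is about 0.7s of matmul per decoder pass on a single core, so the thread count matters a lot. The chunk-level parallelism being discussed looks roughly like the sketch below (rayon, with made-up types and a placeholder transcription function, not the actual example code); with only 4 chunks it can keep at most 4 of the 16 cores busy, which is why threading inside the matmuls is where the remaining ~4x would come from.

```rust
use rayon::prelude::*;

// Hypothetical stand-ins for the real audio chunks and decoding routine.
struct Chunk(Vec<f32>);

fn transcribe_chunk(chunk: &Chunk) -> String {
    // ... run encoder + decoder on one 30s chunk ...
    format!("{} samples", chunk.0.len())
}

fn main() {
    // 2 minutes of audio split into 30s chunks -> only 4 units of work,
    // so this loop keeps at most 4 threads busy even on a 16-core machine.
    let chunks: Vec<Chunk> = (0..4).map(|_| Chunk(vec![0.0; 480_000])).collect();
    let texts: Vec<String> = chunks.par_iter().map(transcribe_chunk).collect();
    println!("{}", texts.join(" "));
}
```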
-
Conclusion for now:
-
Hello there. Revisiting this topic a bit. Would you be able to share what you had to do to derive your model from, I guess, one of the base whisper models?
-
Hello. I have prepared code to export the weights: https://github.com/igor-yusupov/whisper-exporter/tree/main
-
The main trick is that I also generate the mask on the fly, so that its shape matches the tensor it is later used with: https://github.com/igor-yusupov/whisper-exporter/blob/main/src/decoder.py#L57 So far the code only works with the base model. I plan to fix the kv_cache so that different model sizes can be exported; I think tract can take slices from the kv_cache, so we wouldn't have to feed each cache element separately as we do now. I hope my code will help you. Happy New Year!
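In Rust terms, that trick amounts to building the causal mask from the current number of tokens rather than baking a fixed shape into the graph. A minimal sketch with ndarray (names and shapes are illustrative, this is not the exporter's code):

```rust
use ndarray::Array2;

/// Build a causal attention mask whose shape follows the current number of
/// tokens, so it always matches the tensor it is added to.
fn causal_mask(n_tokens: usize) -> Array2<f32> {
    Array2::from_shape_fn((n_tokens, n_tokens), |(i, j)| {
        if j > i { f32::NEG_INFINITY } else { 0.0 }
    })
}

fn main() {
    // At decoding step 3 the mask is 3x3; at step 4 it becomes 4x4.
    println!("{:?}", causal_mask(3));
}
```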
-
Hey @igor-yusupov! I have just merged a very aggressive optimisation of Softmax on main. It is opt-in: you will need to call the transformer explicitly between decluttering and optimising.
The optimisation is a very dirty trick on the exponential function and it's pretty brutal: it introduces up to 7% of noise on the exp() function. But if you feel like giving it a try, I'm curious to know if/how whisper tolerates this noise... On the performance side of things, I think it should give us a nearly 2x speedup on the encoder.
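For readers unfamiliar with the tract pipeline being referred to, the loading flow ends up looking roughly like the sketch below. The exact name and call for the new Softmax transform isn't given in this thread, so that step is only a placeholder comment; the rest uses the standard tract-onnx API.

```rust
use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    // Load the ONNX encoder and lower it to a typed model.
    let mut model = tract_onnx::onnx()
        .model_for_path("encoder.onnx")?
        .into_typed()?;

    // Decluttering first...
    model.declutter()?;

    // ...the new opt-in Softmax transform would be applied to `model` here,
    // between decluttering and optimising (exact call not shown in the thread)...

    // ...then the usual optimisation pass and a runnable plan.
    model.optimize()?;
    let plan = model.into_runnable()?;
    let _ = plan;
    Ok(())
}
```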
-
I have started to develop a crate for the whisper model based on tract: https://github.com/igor-yusupov/rusty-whisper/
So far it's all in pretty raw form, and it doesn't work accurately enough. In general, there is still a lot to be added.
Remember my issue #1202? I was able to change the architecture of the model so that it could run with tract.
The only issue I can't fix right now is the inference speed: it is 3 times slower than in Python. I tried to speed it up using parallel computing, and it became almost comparable to the Python version without parallelism. I opened #1217, and if you can improve it, I'll keep working on the crate.
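For anyone landing here, the crate essentially loads the exported encoder/decoder ONNX files with tract and runs them in a loop. A heavily simplified sketch of the encoder side (file names and input shapes are illustrative, not the crate's actual code):

```rust
use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    // Load the exported encoder (80 mel bins x 3000 frames for a 30s chunk).
    let encoder = tract_onnx::onnx()
        .model_for_path("encoder.onnx")?
        .into_optimized()?
        .into_runnable()?;

    // Dummy mel spectrogram input; the real crate computes it from audio.
    let mel: Tensor = tract_ndarray::Array3::<f32>::zeros((1, 80, 3000)).into();
    let audio_features = encoder.run(tvec!(mel.into()))?;

    // The decoder would then be run token by token against these features.
    println!("encoder output shape: {:?}", audio_features[0].shape());
    Ok(())
}
```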