Whisper #1268
Replies: 10 comments 22 replies
-
Hey, kudos, this is a nice achievement! Did you figure out if one of the models is particularly responsible for the slowdown? TBH it may be all of them... Huge models on "non-small" computers were never a priority.
-
Yes, both of them, but the encoder model is especially slow. I thought it wasn't a particularly large model; it's quite possible to run it on a mobile device: https://github.com/ggerganov/whisper.cpp/tree/master/examples/whisper.objc My goal is to make something like that, but with Rust. I've seen commits related to fp16, maybe I should look in that direction?
-
Top 3 operators for the encoder (Intel). Pretty big surprise: Softmax dominates at 50%, not the matrix product. The Softmax implementation in tract is pretty simple, so there is potential for big gains there. The matrix multiplier is at only 26% (it usually clocks in at 85%-95% or so).
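For context, a softmax over an axis is conceptually just a max-shift, an exp() pass and a normalisation, which is why a straightforward implementation spends most of its time in exp(). Here is a minimal Rust sketch of that shape of code (using the ndarray crate; this is only an illustration of the general idea, not tract's actual kernel):

```rust
use ndarray::{Array2, Axis};

/// Numerically stable softmax over the last axis of a 2D array.
/// A plain per-row loop like this is roughly what a "simple" softmax
/// looks like; the exp() calls are where the time goes.
fn softmax(x: &Array2<f32>) -> Array2<f32> {
    let mut out = x.clone();
    for mut row in out.axis_iter_mut(Axis(0)) {
        // Subtract the row max so exp() cannot overflow.
        let max = row.fold(f32::NEG_INFINITY, |a, &b| a.max(b));
        row.mapv_inplace(|v| (v - max).exp());
        let sum: f32 = row.sum();
        row.mapv_inplace(|v| v / sum);
    }
    out
}

fn main() {
    let logits = Array2::from_shape_vec((2, 3), vec![1.0, 2.0, 3.0, 0.5, 0.5, 0.5]).unwrap();
    println!("{:?}", softmax(&logits));
}
```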
-
For the decoder, it's as I expected. The top two operators are the matrix multiplications, and they account for more than 95% of the time.
-
The decoder starts with 12 big products, each of them weighing nearly 4GF. tract runs them at 67GF/s on one core, which feels about right on this PC.
Do you know how the Python runner runs the model? I assume it leverages multiple cores and/or the GPU? Also: on my PC, the chunk-level parallelism in the example only produces 4 batches, so we are running on four threads. My PC has 16 cores, so internal parallelism at the matmul level could bring a 4x speedup.
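Back-of-the-envelope: 12 products at roughly 4GF each at 67GF/s is about 0.7s of matmul per decoder pass on a single core, so the thread count matters a lot. The chunk-level parallelism being discussed looks roughly like the sketch below (rayon, with made-up types and a placeholder transcription function, not the actual example code); with only 4 chunks it can keep at most 4 of the 16 cores busy, which is why threading inside the matmuls is where the remaining ~4x would come from.

```rust
use rayon::prelude::*;

// Hypothetical stand-ins for the real audio chunks and decoding routine.
struct Chunk(Vec<f32>);

fn transcribe_chunk(chunk: &Chunk) -> String {
    // ... run encoder + decoder on one 30s chunk ...
    format!("{} samples", chunk.0.len())
}

fn main() {
    // 2 minutes of audio split into 30s chunks -> only 4 units of work,
    // so this loop keeps at most 4 threads busy even on a 16-core machine.
    let chunks: Vec<Chunk> = (0..4).map(|_| Chunk(vec![0.0; 480_000])).collect();
    let texts: Vec<String> = chunks.par_iter().map(transcribe_chunk).collect();
    println!("{}", texts.join(" "));
}
```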
-
Conclusion for now:
-
Hello there. Revisiting this topic a bit. Would you be able to share what you had to do to derive your model from, I guess, one of the base whisper models?
-
Hello. I have prepared code to export the weights: https://github.com/igor-yusupov/whisper-exporter/tree/main
-
The main trick is that I also generate the mask on the fly, so that its shape matches the tensor it is later used with: https://github.com/igor-yusupov/whisper-exporter/blob/main/src/decoder.py#L57 So far the code only works with the base model. I plan to fix the kv_cache so that different model sizes can be exported; I think tract can take slices from the kv_cache, so we wouldn't have to feed each cache element separately as we do now. I hope my code will help you. Happy New Year!
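In Rust terms, that trick amounts to building the causal mask from the current number of tokens rather than baking a fixed shape into the graph. A minimal sketch with ndarray (names and shapes are illustrative, this is not the exporter's code):

```rust
use ndarray::Array2;

/// Build a causal attention mask whose shape follows the current number of
/// tokens, so it always matches the tensor it is added to.
fn causal_mask(n_tokens: usize) -> Array2<f32> {
    Array2::from_shape_fn((n_tokens, n_tokens), |(i, j)| {
        if j > i { f32::NEG_INFINITY } else { 0.0 }
    })
}

fn main() {
    // At decoding step 3 the mask is 3x3; at step 4 it becomes 4x4.
    println!("{:?}", causal_mask(3));
}
```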
-
Hey @igor-yusupov! I have just merged a very aggressive optimisation of Softmax on main. It is opt-in: you will need to call the transformer explicitly between decluttering and optimising.
The optimisation is a very dirty trick on the exponential function and it's pretty brutal: it introduces up to 7% of noise on the exp() function. But if you feel like giving it a try, I'm curious to know if/how whisper tolerates this noise... On the performance side of things, I think it should give us a nearly 2x speedup on the encoder.
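For readers unfamiliar with the tract pipeline being referred to, the loading flow ends up looking roughly like the sketch below. The exact name and call for the new Softmax transform isn't given in this thread, so that step is only a placeholder comment; the rest uses the standard tract-onnx API.

```rust
use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    // Load the ONNX encoder and lower it to a typed model.
    let mut model = tract_onnx::onnx()
        .model_for_path("encoder.onnx")?
        .into_typed()?;

    // Decluttering first...
    model.declutter()?;

    // ...the new opt-in Softmax transform would be applied to `model` here,
    // between decluttering and optimising (exact call not shown in the thread)...

    // ...then the usual optimisation pass and a runnable plan.
    model.optimize()?;
    let plan = model.into_runnable()?;
    let _ = plan;
    Ok(())
}
```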
-
I have started to develop a crate for the whisper model based on tract: https://github.com/igor-yusupov/rusty-whisper/
So far it's all in pretty raw form, and it doesn't work accurately enough. In general, there is still a lot to be added.
Remember my issue #1202? I was able to change the architecture of the model so that it could run with tract.
The only issue I can't fix right now is the inference speed: it is 3 times slower than in Python. I tried to speed it up using parallel computing, and it became almost comparable to the Python version without parallelism. I opened #1217, and if you can improve it, I'll keep working on the crate.
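For anyone landing here, the crate essentially loads the exported encoder/decoder ONNX files with tract and runs them in a loop. A heavily simplified sketch of the encoder side (file names and input shapes are illustrative, not the crate's actual code):

```rust
use tract_onnx::prelude::*;

fn main() -> TractResult<()> {
    // Load the exported encoder (80 mel bins x 3000 frames for a 30s chunk).
    let encoder = tract_onnx::onnx()
        .model_for_path("encoder.onnx")?
        .into_optimized()?
        .into_runnable()?;

    // Dummy mel spectrogram input; the real crate computes it from audio.
    let mel: Tensor = tract_ndarray::Array3::<f32>::zeros((1, 80, 3000)).into();
    let audio_features = encoder.run(tvec!(mel.into()))?;

    // The decoder would then be run token by token against these features.
    println!("encoder output shape: {:?}", audio_features[0].shape());
    Ok(())
}
```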