I'm doing 45ms 512x512 4 step LCM generations. #97

aifartist · 2023-11-30T06:09:27Z

aifartist
Nov 30, 2023

I retired as a performance architect from MSFT last year and playing with SD full time.
I may have the fastest pipelines anyone has. I have a 4090 and i9-13900K.
Using every trick I know I can gen 512x512 images in 45ms or just today hit 215ms for sdxl 1024x1024 4 step LCM images. This latest 215 number is 15% faster than torch.compile using stable-fast. I've been working with the author of stable-fast and he seems to have the fastest compiler tech. Instead of some annoying 5+ minute torch.compile his does the work in under 30 seconds.

The reason I'm interested in what you have here is that I've always wanted to see how fast 8 bit quantization might be. Even if quality is slightly off(?) I been doing real time videos(camera->sd15 LCM->video display. I can look like Biden, Tom Cruise, Emma Watson at over 15fps. Probably faster now as I wrote this demo back when LCM first arrived.

Does 8 bit yet work for an NVidia 4090?
I'd love to test it.

Amin456789 · 2023-11-30T07:42:48Z

Amin456789
Nov 30, 2023

u should wait for the upcoming update, it seems @FSSRepo managed to make cuda devices [nvidia gpus like urs] to work with quantized models u can see the progress in TAESD topic in pull requests section here

6 replies

Amin456789 Nov 30, 2023

that is great! SDHybrid will be amazing. about the 8bit thing, well i am no coder but if u work with the guys in here everything is possible!

aifartist Nov 30, 2023
Author

I'd like to help. Last night I was on with the stable-fast author and showed him a 15% improvement which his stuff compared with the already fast torch.compile(). 1024x1024 good sdxl images in .214ms!!! Nearly 5 1024x1024 images per second!

Amin456789 Nov 30, 2023

whoa nice! u should definitely bring ur optimizations in here too! it will be a blast!

aifartist Dec 3, 2023
Author

@FSSRepo @Amin456789
I just did my first twitter post last night with a demo of 70+ images per second.
https://twitter.com/Dan50412374/status/1731215092728148331
This morning using batching I hit 167 images per second.

Still eager to try the 8 bit stuff when it is working.

Amin456789 Dec 3, 2023

nice!

Green-Sky · 2023-11-30T11:37:12Z

Green-Sky
Nov 30, 2023

or just today hit 215ms for sdxl 1024x1024 4 step LCM images.

have you tried sdxl-turbo?

1 reply

aifartist Nov 30, 2023
Author

Yes, here is my post about sdxl-turbo. My reddit alias is just random:
https://www.reddit.com/r/StableDiffusion/comments/186z1xx/sdxlturbo_and_32_millisecond_usable_1_step_images/

FSSRepo · 2023-11-30T13:56:47Z

FSSRepo
Nov 30, 2023

@aifartist It's incredible that this project has reached your hands. I'm interested in what you propose, and if 8-bit quantization works in CUDA. The only thing is, don't expect exactly the same results in terms of accuracy, as an image generated with the fp16 model doesn't differ much from the one generated by fp32. However, there is a difference between fp16 and q8_0; the results vary as you lower the quantization level.

q5_0	q5_1	q8_0	f16

It is expected that a RTX 4090 would have such significant performance given the bandwidth and the number of compute blocks that can be executed simultaneously. Additionally, PyTorch benefits from highly optimized convolution algorithms. Here, using im2col and multiplying the matrix by kernels is too inefficient due to the high memory consumption of im2col. I am considering implementing Winograd, but I have not yet managed to fully understand how it works as I struggle with interpreting technical papers with equations into actual code.

0 replies

aifartist · 2023-11-30T19:59:34Z

aifartist
Nov 30, 2023
Author

@aifartist It's incredible that this project has reached your hands. I'm interested in what you propose, and if 8-bit quantization works in CUDA. The only thing is, don't expect exactly the same results in terms of accuracy, as an image generated with the fp16 model doesn't differ much from the one generated by fp32. However, there is a difference between fp16 and q8_0; the results vary as you lower the quantization level.

It is expected that a RTX 4090 would have such significant performance given the bandwidth and the number of compute blocks that can be executed simultaneously. Additionally, PyTorch benefits from highly optimized convolution algorithms. Here, using im2col and multiplying the matrix by kernels is too inefficient due to the high memory consumption of im2col. I am considering implementing Winograd, but I have not yet managed to fully understand how it works as I struggle with interpreting technical papers with equations into actual code.

To give you the short version of my idea I'm refining. Normal LDM SD gen's get good quality. LCM is perhaps 10X faster but there is some degrading of the quality. If I do the FIRST 50% of inferencing with LDM(10 steps of 20) and then take the latent and pass it to LCM for the LAST 50% of denoising, steps 4, 5, and 6 of a 6 step gen. I get nearly double the LDM perf and somehow I get even better images than either pure LDM or LCM. Thus I ask what happens if I mix the speed of 8 bit with other fp16 schedulers for only a small part of the denoising to improve the quality. The technique does work once one finds the right param's as I have shown.

Left image is pure LDM 20 steps, Right image is pure LCM at 6 steps. The middle image is the result of my hybrid approach. I wrote the tool you see to have realtime feedback as I move the sliders.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

I'm doing 45ms 512x512 4 step LCM generations. #97

{{title}}

Replies: 4 comments 7 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

I'm doing 45ms 512x512 4 step LCM generations. #97

aifartist Nov 30, 2023

Replies: 4 comments · 7 replies

Amin456789 Nov 30, 2023

Amin456789 Nov 30, 2023

aifartist Nov 30, 2023 Author

Amin456789 Nov 30, 2023

aifartist Dec 3, 2023 Author

Amin456789 Dec 3, 2023

Green-Sky Nov 30, 2023

aifartist Nov 30, 2023 Author

FSSRepo Nov 30, 2023

aifartist Nov 30, 2023 Author

aifartist
Nov 30, 2023

Replies: 4 comments 7 replies

Amin456789
Nov 30, 2023

aifartist Nov 30, 2023
Author

aifartist Dec 3, 2023
Author

Green-Sky
Nov 30, 2023

aifartist Nov 30, 2023
Author

FSSRepo
Nov 30, 2023

aifartist
Nov 30, 2023
Author