Replies: 4 comments 7 replies
-
You should wait for the upcoming update. It seems @FSSRepo managed to get CUDA devices (NVIDIA GPUs like yours) working with quantized models; you can see the progress in the TAESD topic in the pull requests section here.
-
Have you tried SDXL-Turbo?
-
@aifartist It's incredible that this project has reached your hands. I'm interested in what you propose, and yes, 8-bit quantization works in CUDA. The only thing is, don't expect exactly the same results in terms of accuracy: an image generated with the fp16 model doesn't differ much from one generated with fp32, but there is a noticeable difference between fp16 and q8_0, and the results diverge further as you lower the quantization level.
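To make that gap concrete, here is a toy sketch of ggml-style q8_0 block quantization (blocks of 32 weights sharing one fp16 scale). The block size and rounding follow ggml's q8_0 layout, but the code itself is illustrative, not stable-diffusion.cpp's implementation:

```python
import numpy as np

def quantize_q8_0(w):
    """Symmetric 8-bit quantization in blocks of 32, one scale per block."""
    blocks = w.reshape(-1, 32)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    q = np.round(blocks / np.where(scales == 0, 1, scales)).astype(np.int8)
    return q, scales.astype(np.float16)  # ggml stores the scale as fp16

def dequantize_q8_0(q, scales):
    return (q.astype(np.float32) * scales.astype(np.float32)).ravel()

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_q8_0(w)
err = np.abs(dequantize_q8_0(q, s) - w).max()
print(f"max abs error: {err:.5f}")  # small but nonzero; grows at q5/q4 etc.
```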
It is expected that an RTX 4090 has such strong performance given its memory bandwidth and the number of compute blocks that can execute simultaneously. Additionally, PyTorch benefits from highly optimized convolution algorithms. Here, lowering the convolution with im2col and then multiplying the resulting matrix by the kernels is quite inefficient due to im2col's high memory consumption. I am considering implementing Winograd, but I have not yet managed to fully understand how it works, as I struggle to turn the equations in technical papers into actual code.
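For reference, a minimal NumPy illustration of the im2col approach and why its memory cost is high (each input pixel gets duplicated roughly K*K times); this is toy code, not the ggml implementation:

```python
import numpy as np

def im2col_conv2d(x, w, stride=1, pad=1):
    """x: (C, H, W) input, w: (F, C, K, K) filters -> (F, H_out, W_out)."""
    C, H, W = x.shape
    F, _, K, _ = w.shape
    x = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    H_out = (H + 2 * pad - K) // stride + 1
    W_out = (W + 2 * pad - K) // stride + 1
    # The im2col buffer: (C*K*K, H_out*W_out) -- ~K*K times the input size.
    cols = np.empty((C * K * K, H_out * W_out), dtype=x.dtype)
    for i in range(H_out):
        for j in range(W_out):
            patch = x[:, i*stride:i*stride+K, j*stride:j*stride+K]
            cols[:, i * W_out + j] = patch.ravel()
    # The convolution becomes one big matrix multiply.
    out = w.reshape(F, -1) @ cols
    return out.reshape(F, H_out, W_out)

x = np.random.randn(320, 64, 64).astype(np.float32)   # a typical UNet feature map
w = np.random.randn(320, 320, 3, 3).astype(np.float32)
print(im2col_conv2d(x, w).shape)  # (320, 64, 64)
# im2col buffer here: (320*9) x (64*64) floats ~= 47 MB vs ~5 MB for the input.
```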
-
To give you the short version of the idea I'm refining: normal LDM SD generations get good quality. LCM is perhaps 10x faster, but there is some degradation in quality. If I do the FIRST 50% of the inferencing with LDM (10 steps of 20), then take the latent and pass it to LCM for the LAST 50% of the denoising (steps 4, 5, and 6 of a 6-step generation), I get nearly double the LDM performance, and somehow the images are even better than either pure LDM or pure LCM. So I'm asking what happens if I mix the speed of 8-bit with other fp16 schedulers for only a small part of the denoising to improve quality. The technique does work once you find the right parameters, as I have shown: the left image is pure LDM at 20 steps, the right image is pure LCM at 6 steps, and the middle image is the result of my hybrid approach. I wrote the tool you see to get realtime feedback as I move the sliders.
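In case it helps to see the shape of such a hybrid, here is a minimal sketch using diffusers, assuming an SD1.5 checkpoint with an LCM-LoRA loaded for the second phase. The model IDs, prompt, split point, and step counts are illustrative assumptions, not the poster's actual tool:

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler, LCMScheduler

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to(device)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")

prompt_embeds, negative_embeds = pipe.encode_prompt(
    "portrait photo, studio lighting", device, 1, True
)
latents = torch.randn(
    (1, pipe.unet.config.in_channels, 64, 64),
    device=device, dtype=torch.float16,
    generator=torch.Generator(device).manual_seed(0),
)

@torch.no_grad()
def denoise(scheduler, latents, timesteps, guidance_scale):
    # Classifier-free guidance only when guidance_scale > 1 (LCM skips it).
    for t in timesteps:
        if guidance_scale > 1:
            inp = scheduler.scale_model_input(torch.cat([latents] * 2), t)
            embeds = torch.cat([negative_embeds, prompt_embeds])
            noise = pipe.unet(inp, t, encoder_hidden_states=embeds).sample
            uncond, cond = noise.chunk(2)
            noise = uncond + guidance_scale * (cond - uncond)
        else:
            inp = scheduler.scale_model_input(latents, t)
            noise = pipe.unet(inp, t, encoder_hidden_states=prompt_embeds).sample
        latents = scheduler.step(noise, t, latents).prev_sample
    return latents

# Phase 1: the first 10 of 20 DDIM ("plain LDM") steps, LoRA disabled.
pipe.disable_lora()
ddim = DDIMScheduler.from_config(pipe.scheduler.config)
ddim.set_timesteps(20, device=device)
latents = denoise(ddim, latents * ddim.init_noise_sigma, ddim.timesteps[:10], 7.5)

# Phase 2: hand the latent to LCM and finish with its low-noise tail.
pipe.enable_lora()
lcm = LCMScheduler.from_config(pipe.scheduler.config)
lcm.set_timesteps(6, device=device)
tail = [t for t in lcm.timesteps if t < ddim.timesteps[9]]
latents = denoise(lcm, latents, tail, 1.0)

image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
```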
-
I retired as a performance architect from MSFT last year and have been playing with SD full time.
I may have the fastest pipelines anyone has. I have a 4090 and an i9-13900K.
Using every trick I know, I can generate 512x512 images in 45ms, and just today hit 215ms for SDXL 1024x1024 4-step LCM images. That 215ms number is 15% faster than torch.compile, using stable-fast. I've been working with the author of stable-fast, and he seems to have the fastest compiler tech: instead of an annoying 5+ minute torch.compile, his does the work in under 30 seconds.
The reason I'm interested in what you have here is that I've always wanted to see how fast 8-bit quantization might be. Even if the quality is slightly off(?), I've been doing realtime video (camera -> SD1.5 LCM -> video display). I can look like Biden, Tom Cruise, or Emma Watson at over 15fps, and probably faster now, as I wrote this demo back when LCM first arrived.
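That camera loop might look roughly like the following sketch (OpenCV plus diffusers with an LCM-LoRA); the prompt, model IDs, and strength are assumptions, not the poster's actual demo:

```python
import cv2
import numpy as np
import torch
from diffusers import AutoPipelineForImage2Image, LCMScheduler
from PIL import Image

pipe = AutoPipelineForImage2Image.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")  # few-step sampling

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    src = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).resize((512, 512))
    # diffusers img2img runs int(steps * strength) UNet steps: 4 here.
    out = pipe(
        "a photo of Tom Cruise", image=src,
        num_inference_steps=8, strength=0.5, guidance_scale=1.0,
    ).images[0]
    cv2.imshow("sd-live", cv2.cvtColor(np.array(out), cv2.COLOR_RGB2BGR))
    if cv2.waitKey(1) == 27:  # Esc quits
        break
cap.release()
```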
Does 8-bit quantization work yet on an NVIDIA 4090?
I'd love to test it.