Next release date? #183

al42and · 2024-09-04T11:18:29Z

al42and
Sep 4, 2024

Hi! Do you have a timeline for the next VkFFT release? Would like to see how it matches up with GROMACS release timings to decide which version to bundle. 1377057 would be nice to have :)

DTolm · 2024-09-04T11:46:07Z

DTolm
Sep 4, 2024
Maintainer

Hello, I think I will merge the current state of the develop branch (with some changes to the test suite and not the library) this month as all the other features require more time to implement. Have you tested this optimization to be useful for GROMACS? It only affects big systems and I am not sure if it has much impact on MI300 (due to it having L3 cache).


al42and · 2024-09-04T12:21:30Z

al42and
Sep 4, 2024
Author

> I think I will merge the current state of the develop branch (with some changes to the test suite and not the library) this month as all the other features require more time to implement.

Great!

> Have you tested this optimization to be useful for GROMACS? It only affects big systems and I am not sure if it has much impact on MI300 (due to it having L3 cache).

A quick benchmark on MI250X showed a speed-up of up to ~7% in FFT time (primarily for small systems 🤔; I will need to look more into that). I don't have an MI300 to try.


DTolm · 2024-09-04T13:06:03Z

DTolm
Sep 4, 2024
Maintainer

The improvement may also be due to another commit, daf09d3, where the radix kernels were slightly optimized. By default, the padding mentioned in the AMD commit is enabled for systems where the product of all dimension sizes is more than 2097152.
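
As a rough illustration of that threshold, the check amounts to comparing the product of the grid dimensions against 2097152 (2^21). The sketch below is not VkFFT code: the helper name `exceeds_padding_threshold` is made up for illustration, and it assumes the grid sizes quoted for the GROMACS benchmarks later in this thread are the sizes VkFFT actually sees.

```python
from math import prod

# Threshold quoted above: padding is on by default when the product of
# all dimension sizes is larger than this value (2097152 == 2**21).
PADDING_THRESHOLD = 2_097_152


def exceeds_padding_threshold(dims):
    """Return True if the product of all dimension sizes is above the threshold.

    Hypothetical helper for illustration only; it does not call into VkFFT.
    """
    return prod(dims) > PADDING_THRESHOLD


# Grid sizes mentioned later in this thread (12k, 48k, 192k, 768k atoms).
grids = [(44, 44, 44), (96, 96, 44), (192, 96, 96), (192, 192, 192)]
for dims in grids:
    print(dims, prod(dims), exceeds_padding_threshold(dims))
```

Under this reading, only the 192x192x192 grid (7077888 points) would be above the default threshold.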


DTolm · 2024-09-23T14:55:19Z

DTolm
Sep 23, 2024
Maintainer

Hello,

I have implemented a new register assignment logic and a set of optimizations to generated kernels that improved performance quite a bit (especially on Nvidia). I think it may be interesting to run GROMACS with VkFFT on Nvidia hardware to compare these optimizations to cuFFT. Sorry for the delay with the release, I wanted these improvements to be in it.

Best regards,
Dmitrii

8 replies

DTolm Oct 10, 2024
Maintainer

From profiling, I now understand that the kernel times in the table are also averaged per axis, so if we assume that one FFT takes 3x the kernel time, we get a bandwidth of ~500 GB/s on the 4070 Ti, which seems correct. Sorry for the confusion.
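
The estimate above can be written out as simple arithmetic. The sketch below is only an illustration under stated assumptions: each of the three axis passes reads and writes the whole grid once, elements are single-precision complex (8 bytes), and times are in microseconds. The function name and the byte-count model are assumptions and are not taken from VkFFT or from the benchmark table.

```python
def fft_bandwidth_gb_s(dims, fft_time_us, bytes_per_element=8, passes=3):
    """Estimate the achieved memory bandwidth of one 3D FFT, in GB/s.

    Assumption (not from the thread): every axis pass reads and writes
    the full grid once, so one FFT moves passes * 2 * N * bytes_per_element.
    """
    elements = 1
    for n in dims:
        elements *= n
    bytes_moved = passes * 2 * elements * bytes_per_element  # read + write per pass
    return bytes_moved / (fft_time_us * 1e-6) / 1e9


# Hypothetical numbers, chosen only to show the effect of the 3x factor.
kernel_time_us = 100.0                 # average time of ONE axis kernel
fft_time_us = 3 * kernel_time_us       # one FFT = three axis kernels
print(fft_bandwidth_gb_s((128, 128, 128), fft_time_us))
```

Using the per-axis kernel time as if it were the time of the whole FFT would overestimate the bandwidth by a factor of three in this model.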

As for the daf09d3, there was an incorrect comparison of vendorID that was written as an assignment. So VkFFT used the AMD profile for Nvidia GPUs when deciding how many threads per kernel to use. This resulted in bigger coalescing in strided accesses which turns out to be beneficial in your use case. The mistake was introduced in this update and fixed in the next one. Replacing the contents of vkFFT\vkFFT_PlanManagement\vkFFT_HostFunctions\vkFFT_AxisBlockSplitter.h file with the contents of the attached file should disable the optimization which I think is responsible for the performance drop, so if it actually improves the performance I will rework it for the next release.
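
The class of bug described here (an assignment written where a comparison was intended) is easiest to hit in C, where `if (x = y)` compiles. As a hedged Python analogue, the walrus operator reproduces the same effect. The names below (`VENDOR_AMD`, `pick_profile_buggy`, `pick_profile_fixed`) are illustrative and are not VkFFT identifiers; the numeric values are the PCI vendor IDs of AMD and NVIDIA.

```python
VENDOR_AMD = 0x1002     # PCI vendor ID of AMD
VENDOR_NVIDIA = 0x10DE  # PCI vendor ID of NVIDIA


def pick_profile_buggy(vendor_id):
    # Bug: ":=" assigns VENDOR_AMD to vendor_id and the result is truthy,
    # so the AMD branch is taken for every GPU.
    if (vendor_id := VENDOR_AMD):
        return "amd"
    return "other"


def pick_profile_fixed(vendor_id):
    # Intended logic: compare, do not assign.
    if vendor_id == VENDOR_AMD:
        return "amd"
    return "other"


print(pick_profile_buggy(VENDOR_NVIDIA))
print(pick_profile_fixed(VENDOR_NVIDIA))
```

With the buggy variant an NVIDIA vendor ID still selects the AMD branch, which mirrors the behavior described above.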

I also wanted to ask, are any of the parameters like aimThreads or coalescedMemory changed in GROMACS?
vkFFT_AxisBlockSplitter.txt

al42and Oct 10, 2024
Author

> As for the daf09d3, there was an incorrect comparison of vendorID that was written as an assignment.

Lucky that #166 wasn't merged :)

> Replacing the contents of vkFFT\vkFFT_PlanManagement\vkFFT_HostFunctions\vkFFT_AxisBlockSplitter.h file with the contents of the attached file should disable the optimization which I think is responsible for the performance drop, so if it actually improves the performance I will rework it for the next release.

Looks about right 👍

develop is 539be29, fix is the same code with your patch added, kernel times (averaged over 3 dims and fwd/bwd):

size (katoms)	develop	fix
12	2.06	2.06
24	2.79	2.65
48	4.87	3.81
96	8.11	6.21
192	14.4	11.7
384	31.3	27.6
768	72.3	53.3
1536	223.0	191.9
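
To make the comparison easier to read, the relative change can be computed directly from the table. This is a small sketch that only re-uses the numbers above; the variable and function names are illustrative.

```python
# (size in katoms, develop, fix) copied from the table above
rows = [
    (12, 2.06, 2.06),
    (24, 2.79, 2.65),
    (48, 4.87, 3.81),
    (96, 8.11, 6.21),
    (192, 14.4, 11.7),
    (384, 31.3, 27.6),
    (768, 72.3, 53.3),
    (1536, 223.0, 191.9),
]


def time_reduction(develop, fix):
    """Fraction of kernel time saved by the fix relative to develop."""
    return 1.0 - fix / develop


reductions = {size: time_reduction(dev, fix) for size, dev, fix in rows}
for size, r in reductions.items():
    print(f"{size:>5} katoms: {100 * r:5.1f}% less kernel time")
```

The largest reduction comes out at roughly 26% for the 768 katoms system, while the 12 katoms system is unchanged.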

> I also wanted to ask, are any of the parameters like aimThreads or coalescedMemory changed in GROMACS?

aimThreads (and not updated in a while 😞): https://gitlab.com/gromacs/gromacs/blob/919c0e1085190bd8cfb4f2077ffcd03ea4337422/src/gromacs/fft/gpu_3dfft_sycl_vkfft.cpp#L167-194

DTolm Oct 10, 2024
Maintainer

> As for the daf09d3, there was an incorrect comparison of vendorID that was written as an assignment.

Well, the assignment bug was caught straight away, but it is still interesting that an optimization that was beneficial in synthetic tests was actually detrimental in real applications. You learn something new every day!

> develop is 539be29, fix is the same code with your patch added, kernel times (averaged over 3 dims and fwd/bwd):

Nice to see that to get a 25% boost I had to remove code, not add :)

> aimThreads (and not updated in a while 😞):

Is it still better when it is set to 64? I have been optimizing the parameters for 128, so it might be worth checking.

al42and Oct 11, 2024
Author

> Is it still better when it is set to 64? I have been optimizing the parameters for 128, so it might be worth checking.

At least for serialized kernel on MI250X, ROCm 6.0.0, 64 seems a bit faster than the default of 128 for mid-sized systems, but it is the opposite for smaller systems (same averaging over all dimensions and 10k fwd+bwd roundtrips; three markers == three independent runs on the same GPU):

I will check later how it behaves when run in parallel with other kernels.

al42and Oct 18, 2024
Author

End-to-end performance, where FFT can overlap on the same GPU with other kernels. More is better.

Different ROCm version from the plot above; also added VkFFT 1.3.1.

Data from 20 runs for 12k (44x44x44) and 48k (96x96x44), and 10 runs for 192k (192x96x96) and 768k (192x192x192).

This is roughly consistent with serialized kernel time measurements (ROCm 6; same data as above, but with 1.3.1 included): 1.3.1 is faster for smaller transforms, 1.3.5 for larger ones:
