(Rust binding) Repeated invocation of EltwiseFMAModAVX512 (with different data) in loop has unexpected performance regression #143
Comments
Hello @Janmajayamall. Unfortunately, I no longer have machines that can run HEXL at full capability (using AVX512). I can tell you that modular reduction works differently depending on the BitShift variable. Look at the functions in fma_mod that depend on BitShift here: https://github.com/intel/hexl/blob/development/hexl/eltwise/eltwise-fma-mod-avx512.cpp The BitShift definition happens here: https://github.com/intel/hexl/blob/development/hexl/eltwise/eltwise-fma-mod.cpp Would you see the same behavior using logq = 48 or 46? Just curious. Regards,
Hi @Janmajayamall, You mentioned that you are trying to use the Intel Advanced Vector Extensions 512 Integer Fused Multiply Add (AVX512-IFMA52) instructions. These were introduced in the 3rd Gen Intel® Xeon® Scalable Processors (and onwards), so checking which CPU manufacturer and type you are using will be important. The AVX512-IFMA52 should only be used for primes below 50–52 bits, assuming it suffices for your computation. For more information on how HEXL uses the AVX512-IFMA52, please refer to: and https://arxiv.org/pdf/2103.16400.pdf Regards,
@Janmajayamall |
Yeah, it behaves the same for logq=48, and I can confirm the same for logq=46. I don't suspect that this is due to calling the code from Rust (but I will still compare by implementing the same thing in C++). If I understand correctly, the line here sets the BitShift value to 52 and uses IFMA, right?
I am using a C3 machine on GCP (4th Gen Intel Xeon Scalable processor) that supports AVX512-IFMA. I don't think there are additional configs I need to enable for HEXL, or am I missing something? I am curious whether you have any ideas about what could cause this. Thanks!
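For readers following the BitShift question, here is a minimal Rust sketch of the kind of dispatch being discussed. It is not HEXL's code: the `Kernel` enum, `select_kernel`, and the exact cutoff of 2^50 are assumptions made for illustration, based only on the statements in this thread that IFMA is meant for primes below roughly 50 to 52 bits and that the IFMA path uses BitShift = 52. The linked eltwise-fma-mod.cpp is the authority on the real condition.

```rust
/// Which AVX512 code path a modulus would be routed to in this sketch.
/// (Hypothetical type, not part of HEXL or the Rust bindings.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kernel {
    /// 52-bit integer FMA path (AVX512-IFMA52).
    Ifma52,
    /// 64-bit path (AVX512-DQ).
    Dq64,
}

impl Kernel {
    /// The BitShift value associated with each path in this sketch.
    fn bit_shift(self) -> u32 {
        match self {
            Kernel::Ifma52 => 52,
            Kernel::Dq64 => 64,
        }
    }
}

/// ASSUMPTION: the IFMA path is taken only for moduli strictly below 2^50.
const IFMA_MODULUS_LIMIT: u64 = 1u64 << 50;

/// Picks a kernel from the modulus and whether the CPU reports IFMA support.
fn select_kernel(modulus: u64, has_ifma: bool) -> Kernel {
    if has_ifma && modulus < IFMA_MODULUS_LIMIT {
        Kernel::Ifma52
    } else {
        Kernel::Dq64
    }
}

fn main() {
    // Illustrative moduli by bit length only; primality is irrelevant here.
    let q50: u64 = (1u64 << 50) - 27;
    let q60: u64 = (1u64 << 60) - 93;
    for &(name, q) in [("50-bit", q50), ("60-bit", q60)].iter() {
        let kernel = select_kernel(q, true);
        println!("{} modulus -> {:?} (BitShift = {})", name, kernel, kernel.bit_shift());
    }
}
```

Under these assumptions a 46-, 48-, or 50-bit modulus all land on the same IFMA path, which would be consistent with the three values of logq behaving alike.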
The 4th Gen Intel Xeon Scalable processor does support AVX512-IFMA instructions. But, just in case, assuming you are using Linux, can you check with the command "lscpu"? As for how to make use of HEXL in an FHE library, I would suggest studying the integration of HEXL with MS SEAL and/or with OpenFHE.
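The same check can also be done from inside the Rust program, which rules out a mismatch between the host and the environment the benchmark actually runs in. The sketch below uses the standard library's `is_x86_feature_detected!` macro and, on Linux, also scans `/proc/cpuinfo` (the same flag list that "lscpu" reports). The helper name `cpuinfo_has_flag` is made up for this example. Note that this only reports what the CPU supports; whether HEXL was built with its IFMA code path enabled is a separate question.

```rust
/// Returns true if `flag` appears as a whole word on any "flags" line of
/// /proc/cpuinfo-style text. (Hypothetical helper for this example.)
fn cpuinfo_has_flag(cpuinfo: &str, flag: &str) -> bool {
    cpuinfo
        .lines()
        .filter(|line| line.starts_with("flags"))
        .any(|line| {
            line.split(':')
                .nth(1)
                .map_or(false, |flags| flags.split_whitespace().any(|f| f == flag))
        })
}

fn main() {
    // Runtime detection; only compiled on x86 targets.
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        println!(
            "avx512ifma (runtime): {}",
            std::arch::is_x86_feature_detected!("avx512ifma")
        );
        println!(
            "avx512dq   (runtime): {}",
            std::arch::is_x86_feature_detected!("avx512dq")
        );
    }

    // Cross-check against the kernel's view on Linux; skipped elsewhere.
    if let Ok(text) = std::fs::read_to_string("/proc/cpuinfo") {
        println!(
            "avx512ifma (cpuinfo): {}",
            cpuinfo_has_flag(&text, "avx512ifma")
        );
    }
}
```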
I am writing Rust bindings for HEXL here. I have added support for NTT operations and some elwise operations. However, I am running into issues with elwise operations when `prime` (i.e. `q`) is set to 50 bits. To see what's wrong, you can clone the repository and run `cargo bench modulus/elwise_fma_mod`. This runs the benches inside benches/modulus.rs with the prefix `elwise_fma_mod`, which use `EltwiseFMAModAVX512` internally, and produces output like the following. I have reduced the output to only the necessary items: bench name and time.
The bench `modulus/elwise_fma_mod_2d/*` benches this function. The function simply takes two 2-dimensional (row-major) matrices `r0` and `r1` and a scalar, and calls `elwise_fma_mod` row-wise. `elwise_fma_mod` internally calls `EltwiseFMAModAVX512` here. `n` is the row size, fixed at 32768. `logq` is the number of bits in the prime and `mod_size` is the number of rows in the matrix. For example, `modulus/elwise_fma_mod_2d/n=32768/logq=60/mod_size=1` calls `elwise_fma_mod` once (since it has only 1 row) with a 60-bit prime and vector size 32768, and `modulus/elwise_fma_mod_2d/n=32768/logq=60/mod_size=3` calls `elwise_fma_mod` three times, once for each of 3 different rows (since mod_size is 3), with the rest of the parameters unchanged. Hence we should expect the time for `modulus/elwise_fma_mod_2d/n=32768/logq=60/mod_size=3` to be around 3x that of `modulus/elwise_fma_mod_2d/n=32768/logq=60/mod_size=1`. Indeed it is. The same holds for the other benches with n=32768 and logq=60 and mod_size=5 / 15.

But things behave differently when logq is set to 50 bits (i.e. when EltwiseFMAModAVX512 uses IFMA instead of DQ).
`modulus/elwise_fma_mod_2d/n=32768/logq=50/mod_size=3` takes 3x as long as `modulus/elwise_fma_mod_2d/n=32768/logq=50/mod_size=1`, as expected, but the same pattern does not hold when mod_size is either 5 or 15 (for mod_size=5 it should be around 50µs but is 81µs, and for mod_size=15 it should be 145µs but is 287µs). I have tried other `mod_size`s and it gets worse as `mod_size` increases, that is, as the number of rows increases.

I am unable to detect what causes this for 50-bit primes. Do you have any pointers? Or is this expected with IFMA?
Thanks!