Failing tests on AMPERE80 with gcc-13 and cuda-12.6 #73

Open
DerNils-git opened this issue Jan 20, 2025 · 0 comments

I updated the compiler toolchain used to build HeFFTe (commit c61c772).
With the new toolchain, tests using the CUDA backend fail on an NVIDIA A100.
The FFTW CPU backend seems to work fine with the same compilers.

I use the following toolchain:

  • g++ (GCC) 13.1.0
  • cuda_12.6.r12.6/compiler.34714021_0
  • mpicxx: Open MPI 4.1.7 (Language: C++)
  • OS: SUSE Linux Enterprise Server 15 SP5

Container Image:
https://mpcdf.pages.mpcdf.de/ci-module-image/latest.html
gitlab-registry.mpcdf.mpg.de/mpcdf/ci-module-image/gcc-cuda:latest

The code was compiled using cmake/3.30 with the following preset:

  {
      "name": "heffte-gpu-non-cuda-aware",
      "displayName": "GCC GPU HeFFTe",
      "description": "Build HeFFTe using CUDA backend",
      "cacheVariables": {
        "CMAKE_BUILD_TYPE": "Debug",
        "CMAKE_CXX_COMPILER": "g++",
        "CMAKE_C_COMPILER": "gcc",
        "CMAKE_CXX_EXTENSIONS":"Off",
        "Heffte_ENABLE_CUDA": "On",
        "Heffte_DISABLE_GPU_AWARE_MPI": "On"
      }
  }

The tests fail with the following output:

$> ctest

Test project /u/nilsch/codes/heffte/build
      Start  1: unit_tests_nompi
 1/25 Test  #1: unit_tests_nompi .................Subprocess aborted***Exception:   0.63 sec

--------------------------------------------------------------------------------
                                 Non-MPI Tests
--------------------------------------------------------------------------------

                                             prime factorize              pass
                                                process grid              pass
                                               split pencils              pass
                                                 cpu scaling              pass
     float                                       gpu::vector              pass
    double                                       gpu::vector              pass
  ccomplex                                       gpu::vector              pass
  zcomplex                                       gpu::vector              pass
Values: 1;3 error magnitude: 2
Values: (1,-11);(0.75188,-8.27068) error magnitude: 2.74058
                                                 gpu scaling              pass
    double                               stock one-dimension              pass
  zcomplex                               stock one-dimension              pass
     float                           stock one-dimension r2c              pass
    double                           stock one-dimension r2c              pass
     float                   stock-cos-type-II one-dimension              pass
     float                   stock-sin-type-II one-dimension              pass
    double                   stock-cos-type-II one-dimension              pass
    double                   stock-sin-type-II one-dimension              pass
Values: (0,0);(3,0) error magnitude: 3
terminate called after throwing an instance of 'std::runtime_error'
  what():    test cufft one-dimension in file: /u/nilsch/codes/heffte/test/test_units_nompi.cpp line: 254

...

      Start  6: heffte_fft3d_np1
 6/25 Test  #6: heffte_fft3d_np1 .................***Failed    2.88 sec
Values: 0.707649;0 error magnitude: 0.707649
terminate called after throwing an instance of 'std::runtime_error'
  what():  mpi rank = 0  test -np 1  test heffte::fft3d in file: /u/nilsch/codes/heffte/test/test_fft3d.h line: 474
[ravg1078:116771] *** Process received signal ***
[ravg1078:116771] Signal: Aborted (6)
[ravg1078:116771] Signal code:  (-6)
[ravg1078:116771] [ 0] /lib64/libpthread.so.0(+0x16910)[0x14db4580b910]
[ravg1078:116771] [ 1] /lib64/libc.so.6(gsignal+0x10d)[0x14db341f3d2b]
[ravg1078:116771] [ 2] /lib64/libc.so.6(abort+0x177)[0x14db341f53e5]
[ravg1078:116771] [ 3] /mpcdf/soft/SLE_15/packages/x86_64/gcc/13.1.0/lib64/libstdc++.so.6(+0xa8377)[0x14db34448377]
[ravg1078:116771] [ 4] /mpcdf/soft/SLE_15/packages/x86_64/gcc/13.1.0/lib64/libstdc++.so.6(+0xb7b3c)[0x14db34457b3c]
[ravg1078:116771] [ 5] /mpcdf/soft/SLE_15/packages/x86_64/gcc/13.1.0/lib64/libstdc++.so.6(+0xb7ba7)[0x14db34457ba7]
[ravg1078:116771] [ 6] /mpcdf/soft/SLE_15/packages/x86_64/gcc/13.1.0/lib64/libstdc++.so.6(+0xb7e07)[0x14db34457e07]
[ravg1078:116771] [ 7] /u/nilsch/codes/heffte/build/test/test_fft3d_np1[0x42940e]
[ravg1078:116771] [ 8] /u/nilsch/codes/heffte/build/test/test_fft3d_np1[0x427271]
[ravg1078:116771] [ 9] /u/nilsch/codes/heffte/build/test/test_fft3d_np1[0x424cc3]
[ravg1078:116771] [10] /u/nilsch/codes/heffte/build/test/test_fft3d_np1[0x424cf2]
[ravg1078:116771] [11] /lib64/libc.so.6(__libc_start_main+0xef)[0x14db341de24d]
[ravg1078:116771] [12] /u/nilsch/codes/heffte/build/test/test_fft3d_np1[0x42417a]
[ravg1078:116771] *** End of error message ***
------------------------------------------------------

...

The following tests FAILED:
          1 - unit_tests_nompi (Subprocess aborted)
          6 - heffte_fft3d_np1 (Failed)
         15 - heffte_fft3d_r2c_np1 (Failed)
         21 - test_cos_np1 (Failed)
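
A note on reading the `Values:` lines in the output above: each line prints two values separated by `;`, followed by an error magnitude. The printed magnitudes are consistent with the absolute difference of the two values, for example |1 - 3| = 2 and |(0,0) - (3,0)| = 3. The following is a minimal sketch of that check, assuming this is how the values are compared. The helper `error_magnitude` is hypothetical and is not taken from the HeFFTe sources.

```cpp
#include <cmath>
#include <complex>

// Hypothetical helper (an assumption, not HeFFTe code): the distance between
// two values, which matches the "error magnitude" numbers printed in the log.
// Intended for floating-point and std::complex types.
template <typename T>
double error_magnitude(T const &a, T const &b) {
    // std::abs gives |x| for real types and the modulus for std::complex.
    return static_cast<double>(std::abs(a - b));
}
```

Called with `std::complex<double>(1, -11)` and `std::complex<double>(0.75188, -8.27068)`, the helper returns roughly 2.7406, which agrees with the printed 2.74058 up to the rounding of the logged values.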

I removed some of the output to improve readability.
I would be grateful for any help. Thank you in advance.
