[BUG] A cuda error pops up sometimes when running cagra with ivf_pq build_algo on a large dataset of extreme values #337
The CUDA error appears to be caused by an out-of-bounds GPU memory access. To fix the error, it appears the call stack needs to handle cases where "fewer than k items are probed" in the ivf_pq algorithm. The following is the memory-checking log:

```
tests/test_zzz_to_investigate_bug.py using ivf_pq::index_params nrows 20000, dim 200, n_lists 141, pq_dim 56
========= Invalid __global__ read of size 4 bytes
========= at 0x430 in /cuvs/cpp/src/neighbors/ivf_flat/ivf_flat_build.cuh:162:void cuvs::neighbors::ivf_flat::detail::build_index_kernel<float, long, unsigned int, (bool)1>(const T3 *, const T1 *, const T2 *, T1 **, T2 **, unsigned int *, T2, unsigned int, unsigned int, T2)
========= by thread (0,0,0) in block (2002,0,0)
========= Address 0x7fe841fffce0 is out of bounds
========= and is 800 bytes before the nearest allocation at 0x7fe842000000 of size 16,000,000 bytes
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame: [0x2e5a32]
========= in /lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame: [0x15bc4]
========= in /miniconda3/envs/cuvs_125/lib/libcudart.so.12
========= Host Frame:cudaLaunchKernel_ptsz [0x54aa1]
========= in /miniconda3/envs/cuvs_125/lib/libcudart.so.12
========= Host Frame:/cuvs/cpp/src/neighbors/refine/detail/../../ivf_flat/ivf_flat_build.cuh:509:void cuvs::neighbors::ivf_flat::detail::fill_refinement_index<float, long>(raft::resources const&, cuvs::neighbors::ivf_flat::index<float, long>*, float const*, long const*, long, unsigned int) [0x156bf72]
========= in /miniconda3/envs/cuvs_125/lib/libcuvs.so
========= Host Frame:/cuvs/cpp/src/neighbors/refine/detail/refine_device_float_float.cu:45:cuvs::neighbors::refine(raft::resources const&, std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >, std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >, std::experimental::mdspan<long const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<long const>, (raft::memory_type)2> >, std::experimental::mdspan<long, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<long>, (raft::memory_type)2> >, std::experimental::mdspan<float, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float>, (raft::memory_type)2> >, cuvsDistanceType) [0x1569441]
========= in /miniconda3/envs/cuvs_125/lib/libcuvs.so
========= Host Frame:/cuvs/cpp/src/neighbors/detail/cagra/cagra_build.cuh:308:void cuvs::neighbors::cagra::detail::build_knn_graph<float, unsigned int, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >(raft::resources const&, std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >, std::experimental::mdspan<unsigned int, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<unsigned int>, (raft::memory_type)0> >, cuvs::neighbors::cagra::graph_build_params::ivf_pq_params) [0xcea567]
```
Describe the bug
A CUDA error intermittently occurs when running CAGRA with the ivf_pq build_algo on a large dataset of extreme values.
Steps/Code to reproduce bug
Run the following code multiple times (e.g. five times). It intermittently fails with a CUDA error.
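The original snippet was not preserved on this page, so the following is only a minimal sketch of a reproduction. It assumes the cuvs Python API (`cagra.IndexParams` with `build_algo="ivf_pq"`, and `cagra.build`, which is named `cagra.build_index` in some earlier releases) and takes the dataset shape from the sanitizer log above; the extreme-value fill is an assumption based on the issue title.

```python
import cupy as cp
from cuvs.neighbors import cagra

# Hypothetical repro sketch; shape matches the sanitizer log (nrows=20000,
# dim=200). The extreme-value fill is an assumption from the issue title.
n_rows, dim = 20000, 200
dataset = cp.random.uniform(-1e38, 1e38, size=(n_rows, dim)).astype(cp.float32)

# Force the IVF-PQ-based kNN-graph construction path seen in the backtrace.
params = cagra.IndexParams(build_algo="ivf_pq")

# The failure is intermittent, so build repeatedly.
for _ in range(5):
    index = cagra.build(params, dataset)  # cagra.build_index in older releases
```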
Expected behavior
The build completes consistently without CUDA errors across repeated runs.
Environment details:
```
conda create -n rapids-24.10 -c rapidsai-nightly -c conda-forge -c nvidia cuml=24.10 cuvs=24.10 python=3.11 'cuda-version>=12.0,<=12.5'
```