[BUG] A CUDA error sometimes pops up when running cagra with the ivf_pq build_algo on a large dataset with extreme values #337

Open
lijinf2 opened this issue Sep 19, 2024 · 2 comments · May be fixed by #460
lijinf2 commented Sep 19, 2024

Describe the bug
A CUDA error pops up intermittently when running cagra with the ivf_pq build_algo.

Steps/Code to reproduce bug
Run the following code multiple times (e.g. 5 runs). It sometimes fails with a CUDA error.

def generate_dataset_with_sklearn(generator_config, enable_complexity = True):
    import numpy as np
    from sklearn.datasets import make_blobs

    n_samples = generator_config["n_samples"]
    n_features = generator_config["n_features"]

    X, y = make_blobs(
        n_samples=generator_config["n_samples"], 
        n_features=generator_config["n_features"], 
        centers=generator_config["centers"], 
        random_state=generator_config["random_state"]
    )

    if enable_complexity:
        # Add extreme values
        extreme_samples = int(n_samples * 0.1)  # 10% of samples will be extreme
        for i in range(extreme_samples):
            feature = np.random.randint(1, n_features)
            X[i, feature] = np.random.choice([-1000000, 1000000]) * np.random.random()
    return X, y

generator_config = {
    "n_samples": 100000,
    "n_features": 200,
    "centers": 50,
    "random_state": 0,
}
X, _ = generate_dataset_with_sklearn(generator_config)

import cupy as cp
gpu_X = cp.array(X, dtype="float32")

from cuvs.neighbors import cagra
index_params = {
    "build_algo": "ivf_pq",
    "intermediate_graph_degree": 128,
    "graph_degree": 64, 
}
index = cagra.build(cagra.IndexParams(**index_params), gpu_X)
A failing run produces the following output and traceback:

using ivf_pq::index_params nrows 100000, dim 200, n_lits 316, pq_dim 56
[I] [15:15:44.222379] optimizing graph
Traceback (most recent call last):
  File "/home/jinfengl/project/spark-rapids-ml/python/reproducer.py", line 40, in <module>
    index = cagra.build(cagra.IndexParams(**index_params), gpu_X)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "resources.pyx", line 110, in cuvs.common.resources.auto_sync_resources.wrapper
  File "cagra.pyx", line 294, in cuvs.neighbors.cagra.cagra.build
  File "cagra.pyx", line 295, in cuvs.neighbors.cagra.cagra.build
  File "exceptions.pyx", line 37, in cuvs.common.exceptions.check_cuvs
cuvs.common.exceptions.CuvsException: CUDA error encountered at: file=~/miniconda3/envs/rapids-24.10/include/raft/core/interruptible.hpp line=303: 

Expected behavior
The build completes successfully and consistently across runs.

Environment details (please complete the following information):

  • Ubuntu 20.04, Python 3.11, CUDA 12.2, NVIDIA A5000 GPU, RAPIDS 24.10
  • Installed by copying the conda install command from the RAPIDS release selector with both cuML and cuVS selected:
conda create -n rapids-24.10 -c rapidsai-nightly -c conda-forge -c nvidia  \
    cuml=24.10 cuvs=24.10 python=3.11 'cuda-version>=12.0,<=12.5'
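A possible interim workaround (an untested suggestion, not part of the original report): building the CAGRA graph with build_algo="nn_descent" sidesteps the ivf_pq build path that triggers the error, with the other parameters left unchanged. Whether it avoids the error on this particular dataset has not been verified.

from cuvs.neighbors import cagra

# Same gpu_X as in the reproducer above, built with nn_descent instead of ivf_pq.
workaround_params = cagra.IndexParams(
    build_algo="nn_descent",
    intermediate_graph_degree=128,
    graph_degree=64,
)
index = cagra.build(workaround_params, gpu_X)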
lijinf2 added the bug (Something isn't working) label on Sep 19, 2024
lijinf2 commented Nov 6, 2024

The CUDA error appears to be caused by an out-of-bounds GPU memory access on the variable source_vecs here. The pointer is advanced as source_vecs += source_ixs[i] * dim, but a printout showed that source_ixs[i] held the value LONG_MAX (9,223,372,036,854,775,807). The LONG_MAX value came from the ivf_pq stage of the cagra build probing fewer than intermediate_graph_degree items, so the remaining neighbor indices were padded with LONG_MAX before this error occurred.

To fix the error, it appears the call stack needs to handle the case where fewer than k items are probed in the ivf_pq algorithm.
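To make the failure mode concrete, here is a minimal host-side sketch in plain NumPy (not cuVS internals; the names dataset and source_ixs and the explicit bounds check are illustrative assumptions) showing how a LONG_MAX padding value turns into an out-of-bounds offset, and the kind of guard that would avoid it:

import numpy as np

dim = 200
n_rows = 100000
dataset = np.zeros((n_rows, dim), dtype=np.float32)  # stands in for the device dataset

LONG_MAX = np.iinfo(np.int64).max  # 9,223,372,036,854,775,807
# Candidate list padded with LONG_MAX because fewer than
# intermediate_graph_degree neighbors were probed.
source_ixs = np.array([0, 42, LONG_MAX], dtype=np.int64)

for ix in source_ixs:
    # Without a check like this, computing source_vecs + ix * dim
    # addresses memory far outside the dataset allocation.
    if ix < 0 or ix >= n_rows:
        continue  # skip padded / invalid candidates
    vec = dataset[ix]  # safe only after the bounds check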

The following is the memory checking log:

tests/test_zzz_to_investigate_bug.py using ivf_pq::index_params nrows 20000, dim 200, n_lits 141, pq_dim 56
========= Invalid __global__ read of size 4 bytes 
=========     at 0x430 in /cuvs/cpp/src/neighbors/ivf_flat/ivf_flat_build.cuh:162:void cuvs::neighbors::ivf_flat::detail::build_index_kernel<float, long, unsigned int, (bool)1>(const T3 *, const T1 *, const T2 *, T1 **, T2 **, unsigned int *, T2, unsigned int, unsigned int, T2)
=========     by thread (0,0,0) in block (2002,0,0)
=========     Address 0x7fe841fffce0 is out of bounds
=========     and is 800 bytes before the nearest allocation at 0x7fe842000000 of size 16,000,000 bytes 
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2e5a32]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x15bc4]
=========                in /miniconda3/envs/cuvs_125/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel_ptsz [0x54aa1]
=========                in /miniconda3/envs/cuvs_125/lib/libcudart.so.12
=========     Host Frame:/cuvs/cpp/src/neighbors/refine/detail/../../ivf_flat/ivf_flat_build.cuh:509:void cuvs::neighbors::ivf_flat::detail::fill_refinement_index<float, long>(raft::resources const&, cuvs::neighbors::ivf_flat::index<float, long>*, float const*, long const*, long, unsigned int) [0x156bf72] 
=========                in /miniconda3/envs/cuvs_125/lib/libcuvs.so
=========     Host Frame:/cuvs/cpp/src/neighbors/refine/detail/refine_device_float_float.cu:45:cuvs::neighbors::refine(raft::resources const&, std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >, std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >, std::experimental::mdspan<long const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<long const>, (raft::memory_type)2> >, std::experimental::mdspan<long, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<long>, (raft::memory_type)2> >, std::experimental::mdspan<float, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float>, (raft::memory_type)2> >, cuvsDistanceType) [0x1569441]
=========                in /miniconda3/envs/cuvs_125/lib/libcuvs.so
=========     Host Frame:/cuvs/cpp/src/neighbors/detail/cagra/cagra_build.cuh:308:void cuvs::neighbors::cagra::detail::build_knn_graph<float, unsigned int, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >(raft::resources const&, std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >, std::experimental::mdspan<unsigned int, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<unsigned int>, (raft::memory_type)0> >, cuvs::neighbors::cagra::graph_build_params::ivf_pq_params) [0xcea567]

achirkin (Contributor) commented:

@lijinf2 thank you for the small reproducer and detailed information! Tracking progress on this in #460
