[BUG] A CUDA error sometimes pops up when running cagra with the ivf_pq build_algo on a large dataset with extreme values #337

Open
lijinf2 opened this issue Sep 19, 2024 · 2 comments · May be fixed by #460
lijinf2 commented Sep 19, 2024

Describe the bug
A CUDA error pops up intermittently when running cagra with the ivf_pq build_algo.

Steps/Code to reproduce bug
Run the following code multiple times (e.g. 5 runs). It sometimes fails with a CUDA error.

def generate_dataset_with_sklearn(generator_config, enable_complexity = True):
    import numpy as np
    from sklearn.datasets import make_blobs

    n_samples = generator_config["n_samples"]
    n_features = generator_config["n_features"]

    X, y = make_blobs(
        n_samples=generator_config["n_samples"], 
        n_features=generator_config["n_features"], 
        centers=generator_config["centers"], 
        random_state=generator_config["random_state"]
    )

    if enable_complexity:
        # Add extreme values
        extreme_samples = int(n_samples * 0.1)  # 10% of samples will be extreme
        for i in range(extreme_samples):
            feature = np.random.randint(1, n_features)
            X[i, feature] = np.random.choice([-1000000, 1000000]) * np.random.random()
    return X, y

generator_config = {
    "n_samples": 100000,
    "n_features": 200,
    "centers": 50,
    "random_state": 0,
}
X, _ = generate_dataset_with_sklearn(generator_config)

import cupy as cp
gpu_X = cp.array(X, dtype="float32")

from cuvs.neighbors import cagra
index_params = {
    "build_algo": "ivf_pq",
    "intermediate_graph_degree": 128,
    "graph_degree": 64, 
}
index = cagra.build(cagra.IndexParams(**index_params), gpu_X)
A failing run produces the following output and traceback:

using ivf_pq::index_params nrows 100000, dim 200, n_lits 316, pq_dim 56
[I] [15:15:44.222379] optimizing graph
Traceback (most recent call last):
  File "/home/jinfengl/project/spark-rapids-ml/python/reproducer.py", line 40, in <module>
    index = cagra.build(cagra.IndexParams(**index_params), gpu_X)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "resources.pyx", line 110, in cuvs.common.resources.auto_sync_resources.wrapper
  File "cagra.pyx", line 294, in cuvs.neighbors.cagra.cagra.build
  File "cagra.pyx", line 295, in cuvs.neighbors.cagra.cagra.build
  File "exceptions.pyx", line 37, in cuvs.common.exceptions.check_cuvs
cuvs.common.exceptions.CuvsException: CUDA error encountered at: file=~/miniconda3/envs/rapids-24.10/include/raft/core/interruptible.hpp line=303: 

Expected behavior
The build completes successfully and consistently across runs.

Environment details (please complete the following information):

  • Ubuntu 20.04, Python 3.11, CUDA 12.2, NVIDIA A5000 GPU, RAPIDS 24.10
  • Installed by copying the conda install command from the RAPIDS release selector with both cuML and cuVS selected:
conda create -n rapids-24.10 -c rapidsai-nightly -c conda-forge -c nvidia  \
    cuml=24.10 cuvs=24.10 python=3.11 'cuda-version>=12.0,<=12.5'
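A possible interim workaround (an untested suggestion, not part of the original report): building the CAGRA graph with build_algo="nn_descent" sidesteps the ivf_pq build path that triggers the error, with the other parameters left unchanged. Whether it avoids the error on this particular dataset has not been verified.

from cuvs.neighbors import cagra

# Same gpu_X as in the reproducer above, built with nn_descent instead of ivf_pq.
workaround_params = cagra.IndexParams(
    build_algo="nn_descent",
    intermediate_graph_degree=128,
    graph_degree=64,
)
index = cagra.build(workaround_params, gpu_X)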
lijinf2 added the bug (Something isn't working) label on Sep 19, 2024
lijinf2 commented Nov 6, 2024

The CUDA error appears to be caused by an out-of-bounds GPU memory access on the variable source_vecs here. The pointer is advanced as source_vecs += source_ixs[i] * dim, but a printout showed that source_ixs[i] held the value LONG_MAX (9,223,372,036,854,775,807). The LONG_MAX value came from the ivf_pq stage of the cagra build probing fewer than intermediate_graph_degree items, so the remaining neighbor indices were padded with LONG_MAX before this error occurred.

To fix the error, it appears the call stack needs to handle the case where fewer than k items are probed in the ivf_pq algorithm.
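To make the failure mode concrete, here is a minimal host-side sketch in plain NumPy (not cuVS internals; the names dataset and source_ixs and the explicit bounds check are illustrative assumptions) showing how a LONG_MAX padding value turns into an out-of-bounds offset, and the kind of guard that would avoid it:

import numpy as np

dim = 200
n_rows = 100000
dataset = np.zeros((n_rows, dim), dtype=np.float32)  # stands in for the device dataset

LONG_MAX = np.iinfo(np.int64).max  # 9,223,372,036,854,775,807
# Candidate list padded with LONG_MAX because fewer than
# intermediate_graph_degree neighbors were probed.
source_ixs = np.array([0, 42, LONG_MAX], dtype=np.int64)

for ix in source_ixs:
    # Without a check like this, computing source_vecs + ix * dim
    # addresses memory far outside the dataset allocation.
    if ix < 0 or ix >= n_rows:
        continue  # skip padded / invalid candidates
    vec = dataset[ix]  # safe only after the bounds check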

The following is the memory checking log:

tests/test_zzz_to_investigate_bug.py using ivf_pq::index_params nrows 20000, dim 200, n_lits 141, pq_dim 56
========= Invalid __global__ read of size 4 bytes 
=========     at 0x430 in /cuvs/cpp/src/neighbors/ivf_flat/ivf_flat_build.cuh:162:void cuvs::neighbors::ivf_flat::detail::build_index_kernel<float, long, unsigned int, (bool)1>(const T3 *, const T1 *, const T2 *, T1 **, T2 **, unsigned int *, T2, unsigned int, unsigned int, T2)
=========     by thread (0,0,0) in block (2002,0,0)
=========     Address 0x7fe841fffce0 is out of bounds
=========     and is 800 bytes before the nearest allocation at 0x7fe842000000 of size 16,000,000 bytes 
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2e5a32]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x15bc4]
=========                in /miniconda3/envs/cuvs_125/lib/libcudart.so.12
=========     Host Frame:cudaLaunchKernel_ptsz [0x54aa1]
=========                in /miniconda3/envs/cuvs_125/lib/libcudart.so.12
=========     Host Frame:/cuvs/cpp/src/neighbors/refine/detail/../../ivf_flat/ivf_flat_build.cuh:509:void cuvs::neighbors::ivf_flat::detail::fill_refinement_index<float, long>(raft::resources const&, cuvs::neighbors::ivf_flat::index<float, long>*, float const*, long const*, long, unsigned int) [0x156bf72] 
=========                in /miniconda3/envs/cuvs_125/lib/libcuvs.so
=========     Host Frame:/cuvs/cpp/src/neighbors/refine/detail/refine_device_float_float.cu:45:cuvs::neighbors::refine(raft::resources const&, std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >, std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >, std::experimental::mdspan<long const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<long const>, (raft::memory_type)2> >, std::experimental::mdspan<long, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<long>, (raft::memory_type)2> >, std::experimental::mdspan<float, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float>, (raft::memory_type)2> >, cuvsDistanceType) [0x1569441]
=========                in /miniconda3/envs/cuvs_125/lib/libcuvs.so
=========     Host Frame:/cuvs/cpp/src/neighbors/detail/cagra/cagra_build.cuh:308:void cuvs::neighbors::cagra::detail::build_knn_graph<float, unsigned int, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >(raft::resources const&, std::experimental::mdspan<float const, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<float const>, (raft::memory_type)2> >, std::experimental::mdspan<unsigned int, std::experimental::extents<long, 18446744073709551615ul, 18446744073709551615ul>, std::experimental::layout_right, raft::host_device_accessor<std::experimental::default_accessor<unsigned int>, (raft::memory_type)0> >, cuvs::neighbors::cagra::graph_build_params::ivf_pq_params) [0xcea567]

achirkin (Contributor) commented:

@lijinf2 thank you for the small reproducer and detailed information! Tracking progress on this in #460
