Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revise to cuvs ivf_pq, add cosine for ivf_pq and support long_max indices for all ann algorithms. #757

Merged
merged 3 commits into from
Nov 6, 2024

Conversation

lijinf2
Copy link
Collaborator

@lijinf2 lijinf2 commented Oct 14, 2024

No description provided.

@lijinf2
Copy link
Collaborator Author

lijinf2 commented Oct 14, 2024

build

@lijinf2 lijinf2 changed the title Revise to cuvs ivf_pq, add cosine for ivf_pq and support long_max when less than k items probed Revise to cuvs ivf_pq, add cosine for ivf_pq and support long_max indices for all ann algorithms. Oct 14, 2024
@lijinf2
Copy link
Collaborator Author

lijinf2 commented Oct 15, 2024

build

python/src/spark_rapids_ml/knn.py Outdated Show resolved Hide resolved
python/tests/utils.py Outdated Show resolved Hide resolved
python/src/spark_rapids_ml/knn.py Show resolved Hide resolved
(2) ivfpq has become unstable in 24.10. It does not get passed with algoParam {"nlist" : 10, "nprobe" : 2, "M": 2, "n_bits": 4} in ci where test_ivfflat is run beforehand. avg_recall shows large variance, depending on the quantization accuracy. This can be fixed by increasing nlist, nprobe, M, and n_bits. Note ivf_pq is non-deterministic, and it seems due to kmeans initialization leveraging runtime values of GPU memory.

(3) If M is is too small (e.g. 2), the returned distances can be very different from the ground distances.
Spark rapids ml may give lower recall than cuvs sg because it aggregates local topk candidates by the returned distances.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought cuml has a 'refine distances' api that can be used to fill in the exact distances.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that seems to be working for SG only. Do we want to support that or use that in our test case?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we invoke SG classes locally, so can we do the refine in each worker? Seems important to accurately pick the top k from the partial results.

Copy link
Collaborator Author

@lijinf2 lijinf2 Oct 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, should be able to call the refine in mapInPandas and after cuvs PQ. A few questions:

(Q1) The refine API returns the exact distances, together with the refined topk indices. The API does not return the approximate (PQ) distances to the refined topk indices. That means, our "ivf_pq" algorithm will be returning exact distances, which is different from the behavior of cuvs "ivf_pq".

(Q2) Is the "ivf_pq + refine" equivalent to the ivf_pq build_algo of cagra? We have already supported the ivf_pq option of cagra.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the refine API for ivf_pq to return exact distances.

python/tests/test_approximate_nearest_neighbors.py Outdated Show resolved Hide resolved
@lijinf2
Copy link
Collaborator Author

lijinf2 commented Oct 16, 2024

build

2 similar comments
@lijinf2
Copy link
Collaborator Author

lijinf2 commented Oct 16, 2024

build

@lijinf2
Copy link
Collaborator Author

lijinf2 commented Oct 16, 2024

build

@lijinf2
Copy link
Collaborator Author

lijinf2 commented Oct 16, 2024

build

2 similar comments
@lijinf2
Copy link
Collaborator Author

lijinf2 commented Oct 16, 2024

build

@lijinf2
Copy link
Collaborator Author

lijinf2 commented Oct 17, 2024

build

@lijinf2
Copy link
Collaborator Author

lijinf2 commented Oct 24, 2024

build

@rishic3
Copy link
Collaborator

rishic3 commented Oct 25, 2024

For the Cagra "compression" param, is there a way to create and pass a CompressionParams object in Python?

cuml_alg_params["algorithm"] != "brute"
): # brute links to CPUNearestNeighborsModel of benchmark.bench_nearest_neighbors
if cuml_alg_params["algorithm"] == "cagra":
check_fn = self._cal_cagra_params_and_check
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be helpful to check if metric == "sqeuclidean" at initialization if given a cagra algo, rather than waiting till transform to catch - or just set it by default if given cagra?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the checking seems meaningful only when both metric and algorithm have been set. This may always be the case in initialization. For example, the application code could set metric in init but set algorithm through setter.

python/src/spark_rapids_ml/knn.py Show resolved Hide resolved
python/src/spark_rapids_ml/knn.py Show resolved Hide resolved
top1_ind = indices[:, 0]
rest_indices = indices[:, 1:]
rest_distances = distances[:, 1:]
rest_distances[rest_indices == top1_ind[:, cp.newaxis]] = float(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So refine might return the top-1 index in other locations? What is the reason for that if so?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is observed refine might return the top-1 index in other locations, if less than k items probed for a query. This can happen when the input candidate indices already contain many long_max (or top-1 index in other locations).

Refine improves the indices and fills exact distances, but does not guarantee every query to probe at least k items.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it return a permutation of the original top-k' indices for k' ( < k ) being the number of checked neighbors? Why would there be duplicates?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I follow you accurately. Yes, it is observed cuvs repeats the top-1 nn instead of permutating top k'.
At our test case test_return_fewer_k[float32-ivfpq-array], ivfpq + refine returns indices [2, 3, 2, 2] and distances [0., 0., 'inf', 'inf'] for data point of id 2 (and for data point of id 3 too since the two data points are identical).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just added into the test case comparison with sg cuvs when less than k items probed.

ivfflat: ensured output is the same as sg.
ivfpq + refine: ensured the same behavior as ivfflat, and listening to the update of next release of sg.

python/tests/utils.py Outdated Show resolved Hide resolved
@lijinf2
Copy link
Collaborator Author

lijinf2 commented Oct 26, 2024

For the Cagra "compression" param, is there a way to create and pass a CompressionParams object in Python?

Yeah, it should work as long as a compression object is created and passed into algoParams as {"compression":compression_obj}.
Did not add that to the PR. Seems it is not yet listed as one of the cuvs python APIs with example code.

@lijinf2
Copy link
Collaborator Author

lijinf2 commented Oct 26, 2024

build

@lijinf2
Copy link
Collaborator Author

lijinf2 commented Nov 5, 2024

build

support derived class and cuvs ivf_pq

Signed-off-by: Jinfeng <[email protected]>

add testing cosine for ivf_pq

replace cuml ivfpq with cuvs ivf_pq

fix less than k items probed and support long label dtype in create spark dataframe

normalize dataset to unit norms for inner_product distances to avoid mg failure

increase ivf_pq quantization to make its recall more stable

remove normalization as it transform the dataset that leads to lower recall

add case when less than k items are probed
improve test case for fewer k items probed

fix bug relates to CPUNN

revise per comments

fix create_pyspark_dataframe to get it works for cp arrays as input

fix bug on label of create_pyspark_dataframe

fix bug tested in CPUNearestNeighbors model

add refine to the knn.py for ivfpq

in progress for checkout

add debug info

get ivf_pq cosine passed by increasing dataset std to make it separable

get ivf_pq working after using refine

remove unnecessary test for refine

get refine work for less than k itmes probed

replace df.withColumn with df.select to fix slowdown for df that was initialized with wide pd.DataFrame

revise comment to make it more clear
… k items probed

listening for future updates to consolidate behaviors of ivfflat, ivfpq and refine
@lijinf2
Copy link
Collaborator Author

lijinf2 commented Nov 6, 2024

build

Copy link
Collaborator

@eordentlich eordentlich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@lijinf2 lijinf2 merged commit 2d814c4 into NVIDIA:branch-24.10 Nov 6, 2024
2 checks passed
@lijinf2 lijinf2 deleted the pyworker_reuse branch November 6, 2024 06:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants