Revise to cuvs ivf_pq, add cosine for ivf_pq and support long_max indices for all ann algorithms. #757

lijinf2 · 2024-10-14T20:02:10Z

No description provided.

lijinf2 · 2024-10-14T20:04:41Z

build

lijinf2 · 2024-10-15T00:09:16Z

build

python/src/spark_rapids_ml/knn.py

python/tests/utils.py

python/src/spark_rapids_ml/knn.py

eordentlich · 2024-10-15T22:02:16Z

python/tests/test_approximate_nearest_neighbors.py

+    (2) ivfpq has become unstable in 24.10. It does not get passed with algoParam {"nlist" : 10, "nprobe" : 2, "M": 2, "n_bits": 4} in ci where test_ivfflat is run beforehand. avg_recall shows large variance, depending on the quantization accuracy. This can be fixed by increasing nlist, nprobe, M, and n_bits. Note ivf_pq is non-deterministic, and it seems due to kmeans initialization leveraging runtime values of GPU memory.
+
+    (3) If M is is too small (e.g. 2), the returned distances can be very different from the ground distances.
+    Spark rapids ml may give lower recall than cuvs sg because it aggregates local topk candidates by the returned distances.


I thought cuml has a 'refine distances' api that can be used to fill in the exact distances.

Yeah, that seems to be working for SG only. Do we want to support that or use that in our test case?

But we invoke SG classes locally, so can we do the refine in each worker? Seems important to accurately pick the top k from the partial results.

Yeah, should be able to call the refine in mapInPandas and after cuvs PQ. A few questions:

(Q1) The refine API returns the exact distances, together with the refined topk indices. The API does not return the approximate (PQ) distances to the refined topk indices. That means, our "ivf_pq" algorithm will be returning exact distances, which is different from the behavior of cuvs "ivf_pq".

(Q2) Is the "ivf_pq + refine" equivalent to the ivf_pq build_algo of cagra? We have already supported the ivf_pq option of cagra.

Added the refine API for ivf_pq to return exact distances.

python/tests/test_approximate_nearest_neighbors.py

lijinf2 · 2024-10-16T01:05:21Z

build

lijinf2 · 2024-10-16T20:33:22Z

build

lijinf2 · 2024-10-16T20:36:32Z

build

lijinf2 · 2024-10-16T20:58:36Z

build

lijinf2 · 2024-10-16T22:02:36Z

build

lijinf2 · 2024-10-17T00:06:25Z

build

lijinf2 · 2024-10-24T18:50:06Z

build

rishic3 · 2024-10-25T19:40:10Z

For the Cagra "compression" param, is there a way to create and pass a CompressionParams object in Python?

rishic3 · 2024-10-25T20:20:54Z

python/src/spark_rapids_ml/knn.py

+            cuml_alg_params["algorithm"] != "brute"
+        ):  # brute links to CPUNearestNeighborsModel of benchmark.bench_nearest_neighbors
+            if cuml_alg_params["algorithm"] == "cagra":
+                check_fn = self._cal_cagra_params_and_check


Would it be helpful to check if metric == "sqeuclidean" at initialization if given a cagra algo, rather than waiting till transform to catch - or just set it by default if given cagra?

the checking seems meaningful only when both metric and algorithm have been set. This may always be the case in initialization. For example, the application code could set metric in init but set algorithm through setter.

python/src/spark_rapids_ml/knn.py

eordentlich · 2024-10-25T21:14:41Z

python/src/spark_rapids_ml/knn.py

+                    top1_ind = indices[:, 0]
+                    rest_indices = indices[:, 1:]
+                    rest_distances = distances[:, 1:]
+                    rest_distances[rest_indices == top1_ind[:, cp.newaxis]] = float(


So refine might return the top-1 index in other locations? What is the reason for that if so?

Yes, it is observed refine might return the top-1 index in other locations, if less than k items probed for a query. This can happen when the input candidate indices already contain many long_max (or top-1 index in other locations).

Refine improves the indices and fills exact distances, but does not guarantee every query to probe at least k items.

Shouldn't it return a permutation of the original top-k' indices for k' ( < k ) being the number of checked neighbors? Why would there be duplicates?

Not sure if I follow you accurately. Yes, it is observed cuvs repeats the top-1 nn instead of permutating top k'.
At our test case test_return_fewer_k[float32-ivfpq-array], ivfpq + refine returns indices [2, 3, 2, 2] and distances [0., 0., 'inf', 'inf'] for data point of id 2 (and for data point of id 3 too since the two data points are identical).

Just added into the test case comparison with sg cuvs when less than k items probed.

ivfflat: ensured output is the same as sg.
ivfpq + refine: ensured the same behavior as ivfflat, and listening to the update of next release of sg.

python/tests/utils.py

python/tests/test_approximate_nearest_neighbors.py

lijinf2 · 2024-10-26T00:34:45Z

For the Cagra "compression" param, is there a way to create and pass a CompressionParams object in Python?

Yeah, it should work as long as a compression object is created and passed into algoParams as {"compression":compression_obj}.
Did not add that to the PR. Seems it is not yet listed as one of the cuvs python APIs with example code.

lijinf2 · 2024-10-26T00:54:15Z

build

lijinf2 · 2024-11-05T20:15:14Z

build

support derived class and cuvs ivf_pq Signed-off-by: Jinfeng <[email protected]> add testing cosine for ivf_pq replace cuml ivfpq with cuvs ivf_pq fix less than k items probed and support long label dtype in create spark dataframe normalize dataset to unit norms for inner_product distances to avoid mg failure increase ivf_pq quantization to make its recall more stable remove normalization as it transform the dataset that leads to lower recall add case when less than k items are probed

improve test case for fewer k items probed fix bug relates to CPUNN revise per comments fix create_pyspark_dataframe to get it works for cp arrays as input fix bug on label of create_pyspark_dataframe fix bug tested in CPUNearestNeighbors model add refine to the knn.py for ivfpq in progress for checkout add debug info get ivf_pq cosine passed by increasing dataset std to make it separable get ivf_pq working after using refine remove unnecessary test for refine get refine work for less than k itmes probed replace df.withColumn with df.select to fix slowdown for df that was initialized with wide pd.DataFrame revise comment to make it more clear

… k items probed listening for future updates to consolidate behaviors of ivfflat, ivfpq and refine

lijinf2 · 2024-11-06T01:56:27Z

build

eordentlich

👍

lijinf2 changed the title ~~Revise to cuvs ivf_pq, add cosine for ivf_pq and support long_max when less than k items probed~~ Revise to cuvs ivf_pq, add cosine for ivf_pq and support long_max indices for all ann algorithms. Oct 14, 2024

eordentlich reviewed Oct 15, 2024

View reviewed changes

lijinf2 force-pushed the pyworker_reuse branch from 3662f01 to db2a2e9 Compare October 16, 2024 20:57

lijinf2 force-pushed the pyworker_reuse branch from 4115a6f to 787653f Compare October 24, 2024 18:49

rishic3 reviewed Oct 25, 2024

View reviewed changes

eordentlich reviewed Oct 25, 2024

View reviewed changes

lijinf2 force-pushed the pyworker_reuse branch from 6b268ca to b9bac35 Compare November 5, 2024 20:14

lijinf2 added 3 commits November 6, 2024 01:55

ensure spark returns are consistent with cuvs when handling less than…

8f290f2

… k items probed listening for future updates to consolidate behaviors of ivfflat, ivfpq and refine

lijinf2 force-pushed the pyworker_reuse branch from b9bac35 to 8f290f2 Compare November 6, 2024 01:55

eordentlich approved these changes Nov 6, 2024

View reviewed changes

lijinf2 merged commit 2d814c4 into NVIDIA:branch-24.10 Nov 6, 2024
2 checks passed

lijinf2 deleted the pyworker_reuse branch November 6, 2024 06:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revise to cuvs ivf_pq, add cosine for ivf_pq and support long_max indices for all ann algorithms. #757

Revise to cuvs ivf_pq, add cosine for ivf_pq and support long_max indices for all ann algorithms. #757

lijinf2 commented Oct 14, 2024

lijinf2 commented Oct 14, 2024

lijinf2 commented Oct 15, 2024

eordentlich Oct 15, 2024

lijinf2 Oct 16, 2024

eordentlich Oct 16, 2024

lijinf2 Oct 17, 2024 •

edited

Loading

lijinf2 Oct 24, 2024

lijinf2 commented Oct 16, 2024

lijinf2 commented Oct 16, 2024

lijinf2 commented Oct 16, 2024

lijinf2 commented Oct 16, 2024

lijinf2 commented Oct 16, 2024

lijinf2 commented Oct 17, 2024

lijinf2 commented Oct 24, 2024

rishic3 commented Oct 25, 2024

rishic3 Oct 25, 2024

lijinf2 Oct 26, 2024

eordentlich Oct 25, 2024

lijinf2 Oct 26, 2024

eordentlich Oct 26, 2024

lijinf2 Oct 26, 2024

lijinf2 Nov 5, 2024

lijinf2 commented Oct 26, 2024

lijinf2 commented Oct 26, 2024

lijinf2 commented Nov 5, 2024

lijinf2 commented Nov 6, 2024

eordentlich left a comment

Revise to cuvs ivf_pq, add cosine for ivf_pq and support long_max indices for all ann algorithms. #757

Revise to cuvs ivf_pq, add cosine for ivf_pq and support long_max indices for all ann algorithms. #757

Conversation

lijinf2 commented Oct 14, 2024

lijinf2 commented Oct 14, 2024

lijinf2 commented Oct 15, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lijinf2 Oct 17, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lijinf2 commented Oct 16, 2024

lijinf2 commented Oct 16, 2024

lijinf2 commented Oct 16, 2024

lijinf2 commented Oct 16, 2024

lijinf2 commented Oct 16, 2024

lijinf2 commented Oct 17, 2024

lijinf2 commented Oct 24, 2024

rishic3 commented Oct 25, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

lijinf2 commented Oct 26, 2024

lijinf2 commented Oct 26, 2024

lijinf2 commented Nov 5, 2024

lijinf2 commented Nov 6, 2024

eordentlich left a comment

Choose a reason for hiding this comment

lijinf2 Oct 17, 2024 •

edited

Loading