-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revise to cuvs ivf_pq, add cosine for ivf_pq and support long_max indices for all ann algorithms. #757
Conversation
build |
build |
(2) ivfpq has become unstable in 24.10. It does not get passed with algoParam {"nlist" : 10, "nprobe" : 2, "M": 2, "n_bits": 4} in ci where test_ivfflat is run beforehand. avg_recall shows large variance, depending on the quantization accuracy. This can be fixed by increasing nlist, nprobe, M, and n_bits. Note ivf_pq is non-deterministic, and it seems due to kmeans initialization leveraging runtime values of GPU memory. | ||
|
||
(3) If M is is too small (e.g. 2), the returned distances can be very different from the ground distances. | ||
Spark rapids ml may give lower recall than cuvs sg because it aggregates local topk candidates by the returned distances. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought cuml has a 'refine distances' api that can be used to fill in the exact distances.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that seems to be working for SG only. Do we want to support that or use that in our test case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But we invoke SG classes locally, so can we do the refine in each worker? Seems important to accurately pick the top k from the partial results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, should be able to call the refine in mapInPandas and after cuvs PQ. A few questions:
(Q1) The refine API returns the exact distances, together with the refined topk indices. The API does not return the approximate (PQ) distances to the refined topk indices. That means, our "ivf_pq" algorithm will be returning exact distances, which is different from the behavior of cuvs "ivf_pq".
(Q2) Is the "ivf_pq + refine" equivalent to the ivf_pq build_algo of cagra? We have already supported the ivf_pq option of cagra.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the refine API for ivf_pq to return exact distances.
build |
2 similar comments
build |
build |
3662f01
to
db2a2e9
Compare
build |
2 similar comments
build |
build |
4115a6f
to
787653f
Compare
build |
For the Cagra "compression" param, is there a way to create and pass a CompressionParams object in Python? |
cuml_alg_params["algorithm"] != "brute" | ||
): # brute links to CPUNearestNeighborsModel of benchmark.bench_nearest_neighbors | ||
if cuml_alg_params["algorithm"] == "cagra": | ||
check_fn = self._cal_cagra_params_and_check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be helpful to check if metric == "sqeuclidean" at initialization if given a cagra algo, rather than waiting till transform to catch - or just set it by default if given cagra?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the checking seems meaningful only when both metric and algorithm have been set. This may always be the case in initialization. For example, the application code could set metric in init but set algorithm through setter.
top1_ind = indices[:, 0] | ||
rest_indices = indices[:, 1:] | ||
rest_distances = distances[:, 1:] | ||
rest_distances[rest_indices == top1_ind[:, cp.newaxis]] = float( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So refine might return the top-1 index in other locations? What is the reason for that if so?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is observed refine might return the top-1 index in other locations, if less than k items probed for a query. This can happen when the input candidate indices already contain many long_max (or top-1 index in other locations).
Refine improves the indices and fills exact distances, but does not guarantee every query to probe at least k items.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't it return a permutation of the original top-k' indices for k' ( < k ) being the number of checked neighbors? Why would there be duplicates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if I follow you accurately. Yes, it is observed cuvs repeats the top-1 nn instead of permutating top k'.
At our test case test_return_fewer_k[float32-ivfpq-array], ivfpq + refine returns indices [2, 3, 2, 2] and distances [0., 0., 'inf', 'inf'] for data point of id 2 (and for data point of id 3 too since the two data points are identical).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just added into the test case comparison with sg cuvs when less than k items probed.
ivfflat: ensured output is the same as sg.
ivfpq + refine: ensured the same behavior as ivfflat, and listening to the update of next release of sg.
Yeah, it should work as long as a compression object is created and passed into algoParams as {"compression":compression_obj}. |
build |
6b268ca
to
b9bac35
Compare
build |
support derived class and cuvs ivf_pq Signed-off-by: Jinfeng <[email protected]> add testing cosine for ivf_pq replace cuml ivfpq with cuvs ivf_pq fix less than k items probed and support long label dtype in create spark dataframe normalize dataset to unit norms for inner_product distances to avoid mg failure increase ivf_pq quantization to make its recall more stable remove normalization as it transform the dataset that leads to lower recall add case when less than k items are probed
improve test case for fewer k items probed fix bug relates to CPUNN revise per comments fix create_pyspark_dataframe to get it works for cp arrays as input fix bug on label of create_pyspark_dataframe fix bug tested in CPUNearestNeighbors model add refine to the knn.py for ivfpq in progress for checkout add debug info get ivf_pq cosine passed by increasing dataset std to make it separable get ivf_pq working after using refine remove unnecessary test for refine get refine work for less than k itmes probed replace df.withColumn with df.select to fix slowdown for df that was initialized with wide pd.DataFrame revise comment to make it more clear
… k items probed listening for future updates to consolidate behaviors of ivfflat, ivfpq and refine
b9bac35
to
8f290f2
Compare
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
No description provided.