-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Moving MG functions into unified API + raft::device_resources_snmg
as device resource type for MG functions
#454
base: branch-25.02
Are you sure you want to change the base?
Moving MG functions into unified API + raft::device_resources_snmg
as device resource type for MG functions
#454
Conversation
cpp/include/cuvs/neighbors/mg.hpp
Outdated
* @param[in] index_params configure the index building | ||
* @param[in] index_dataset a row-major matrix on host [n_rows, dim] | ||
* | ||
* @return the constructed IVF-Flat MG index | ||
*/ | ||
auto build(const raft::device_resources& handle, | ||
auto build(const raft::device_resources_snmg& clique, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the main goals with having the "snmg" resources object match the single-gpu object is that we wanted to be able to remove this additional MG. We should now be able to accept device_resources
and then check to see if a nccl clique has been set on it (which would imply that it's a multi-gpu resources object and not a single-gpu object). The whole goal with doing this was to consolidate code paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The NCCL clique is not set as a resource anymore, but we should still be able to implement the dispatching by checking the dynamic type of the device_resources
. The real question then is, do we truly want dispatching on both the regular API (cuvs::neighbors::build
) and the mg
namespace (cuvs::neighbors::mg::build
)? It kind of make sense that a user providing a device_resources_snmg
instance to the regular API (cuvs::neighbors::build
) would want things to be deployed on multiple GPUs. However, the reverse is not necessarily true. A user who explicitly chose the mg
namespace, but did not provide the adequate device_resources_snmg
would fallback to single GPU, potentially unintentionally. Is this what we want?
I propose that we implement the dispatching mechanism solely on the regular API (cuvs::neighbors::build
) in a dedicated follow-up PR? This also allows the MG doc to explicitly avert users that they should use an adequate device_resources_snmg
to use the MG API. What do you think?
Not sure why CI is still failing. The RAFT PR was merged quite awhile ago but I'm still seeing errors like this:
|
This fell back to old 24.12 RAFT packages for unclear reasons:
https://github.com/rapidsai/cuvs/actions/runs/12818869804/job/35749392606?pr=454#step:8:521 |
Oh fooey! Thanks for noticing that @bdice, I just realized the darn target branch was never updated! |
6ae1c70
to
b0c7ab9
Compare
raft::device_resources_snmg
as device resource type for MG functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is nearly there. Just a couple very minor things at this point.
Note- cuVS is not header-only and we shouldn't be putting constants in include files (that further couples cuVS to our exact build policies and makes it harder to navigate the build).
Also, we should not be introducing ANY dependencies in the public header files. That's why we're using raft::resources as a facade so that we don't couple device_resources or device_resources_snmg to our public APIs. In other words, the user can use those objects if they want for convenience, but none of the other APIs in cuVS (nor in RAFT) use those objects because raft::resource is a generic container for these types of objects. That way the user can couple themselves to the dependencie sbut it doesn't force the dependencies on all other users. Does that make sense?
@@ -87,7 +86,7 @@ class cuvs_mg_cagra : public algo<T>, public algo_gpu { | |||
std::unique_ptr<algo<T>> copy() override; | |||
|
|||
private: | |||
raft::device_resources handle_; | |||
raft::device_resources_snmg clique_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
device_resources_snmg
is a type of raft::resource
. The pattern here is to accept the raft::resources
instance and call the free function inside raft::resource::nccl_resource
underneath to get the clique out of the resource. This way the APIs don't need to look any different.
cpp/include/cuvs/neighbors/cagra.hpp
Outdated
* | ||
* @return the constructed CAGRA MG index | ||
*/ | ||
auto build(const raft::device_resources_snmg& clique, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't couple the standard headers to nccl. This is no good. Instead, please just accept a raft::device_resources
here but provide a function in raft::resource::nccl_resource
namespace to introspect the resources for a nccl clique (otherwise fail). The snmg build params should be anough of a difference
} \ | ||
\ | ||
template <> \ | ||
cuvs::neighbors::mg::index<cagra::index<T, IdxT>, T, IdxT> distribute<T, IdxT>( \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we were removing the mg namespace completely from the public APIs.
/ok to test |
raft::device_resources_snmg
as device resource type for MG functionsraft::comms::build_comms_nccl_only
solving [FEA] Remove dependency ofbuild_comms_nccl_only
on UCP raft#2465raft::device_resources_snmg
is introduced in this PR : rapidsai/raft#2549