-
Notifications
You must be signed in to change notification settings - Fork 55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Features/unique sort distributed #749
Draft
ClaudiaComito
wants to merge
110
commits into
main
Choose a base branch
from
features/unique-sort-distributed
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…for both sort() and unique()
This was
linked to
issues
Mar 24, 2021
This was referenced Apr 1, 2022
Open
This was referenced Aug 19, 2023
ClaudiaComito
removed request for
Markus-Goetz,
Cdebus,
lucaspataro,
mtar and
TheSlimvReal
September 4, 2023 09:01
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Description
This PR introduces major changes in the
ht.unique()
implementation, fixing some bugs/inconsistencies along the way (see below).Changes proposed:
Distributed
unique
requires two passes:The current (v0.5.1) implementation solves step 2. by running
torch.unique
again on the gathered local unique elements. This might turn into a memory bottleneck for very large data.The main implementation change in this PR is that, in the distributed case,
ht.unique
now recycles the "pivot sorting" implementation (seeht.sort()
,manipulations._pivot_sorting()
) to perform anAlltoallv
-based sorted unique operation that doesn't require "gathering".The main user-side changes are as follows:
ht.unique
now, like numpy, always returns the SORTED unique elements."sparse" vs. "dense" unique. If the collective size of the local uniques (from step 1) above) is smaller than the size of the local data, then
ht.unique
gathers everything and runs the operation locally. In this case, the unique elements array will havesplit=None
. Otherwise, distributedunique
via_pivot_sorting()
(Alltoallv
) returning a distributed DNDarray.inverse indices are now a DNDarray and distributed like the input data. Note that inverse indices are used to recreate the original data shape from the unique elements. However, the sorted unique elements corresponding to a given inverse index might be on a different process. Eventually,
setitem
should be able to deal with this, at the momentunique[inverse]
requires aunique.resplit_(None)
first.As an aside:
gethalo
to work on imbalanced DNDarrayscreate_lshape_map
to only require communication for imbalanced DNDarraystest_qr
that has been popping up on and off for ages.factories.array
behaviour whencopy=False
now closer tonp.array
(https://numpy.org/doc/stable/reference/generated/numpy.array.html).Issue/s resolved: #363, #564, #621
Type of change
ht.unique()
always returns the sorted unique elements, kwargsorted
has been removedunique.resplit_(None)
might be required before applying inverse indicesfactories.array(copy=False)
does not copy slices of original data unless absolutely necessary (dtype, order etc.)Due Diligence
Does this change modify the behaviour of other functions? If so, which?
ht.unique()
to return a non-distributed DNDarray may fail in some cases.