
Features/unique sort distributed #749

Draft
wants to merge 110 commits into base: main

Conversation

@ClaudiaComito (Contributor) commented Mar 24, 2021

Description

This PR introduces major changes to the ht.unique() implementation, fixing some bugs and inconsistencies along the way (see below).

Changes proposed:

Distributed unique requires two passes:

  1. find local sorted unique elements,
  2. find global sorted unique elements.

The current (v0.5.1) implementation solves step 2 by running torch.unique again on the gathered local unique elements. This can become a memory bottleneck for very large data.
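The gather-based two-pass approach can be sketched in plain NumPy. The chunks and values below are purely illustrative, standing in for the per-process torch.unique calls on a split DNDarray:

```python
import numpy as np

# Hypothetical per-process chunks of a split DNDarray (illustrative only).
local_chunks = [np.array([3, 1, 3, 2]), np.array([2, 5, 1]), np.array([4, 4, 5])]

# Step 1: each process finds its local sorted unique elements.
local_uniques = [np.unique(chunk) for chunk in local_chunks]

# Step 2 (v0.5.1 approach): gather all local uniques and run unique again.
# Simple, but the gathered array can become a memory bottleneck at scale.
gathered = np.concatenate(local_uniques)
global_unique = np.unique(gathered)
print(global_unique)  # [1 2 3 4 5]
```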

The main implementation change in this PR is that, in the distributed case, ht.unique now reuses the "pivot sorting" implementation (see ht.sort(), manipulations._pivot_sorting()) to perform an Alltoallv-based sorted unique operation that does not require gathering the data on one process.
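The pivot idea can be illustrated without MPI: the routing step below simulates the Alltoallv exchange, and the pivots are chosen by hand rather than sampled as in the real implementation. Each "process" deduplicates only the value range it owns, so no single process ever holds the full data:

```python
import numpy as np

# Local sorted uniques from step 1 on three hypothetical processes.
local_uniques = [np.array([1, 2, 3]), np.array([1, 2, 5]), np.array([4, 5])]

# Pivots partition the value range across processes (hand-picked here):
# process 0 owns values <= 2, process 1 owns (2, 4], process 2 owns > 4.
pivots = [2, 4]

def owner(v):
    for p, piv in enumerate(pivots):
        if v <= piv:
            return p
    return len(pivots)

# Simulated Alltoallv: route each local unique value to its owning process.
buckets = [[] for _ in range(3)]
for lu in local_uniques:
    for v in lu:
        buckets[owner(v)].append(v)

# Each process deduplicates only its own bucket; concatenating the
# per-process results is globally sorted and unique, with no full gather.
result_chunks = [np.unique(b) for b in buckets]
global_unique = np.concatenate(result_chunks)
print(global_unique)  # [1 2 3 4 5]
```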

The main user-side changes are as follows:

  • Like NumPy, ht.unique now always returns the SORTED unique elements.

  • "sparse" vs. "dense" unique: if the collective size of the local uniques (from step 1 above) is smaller than the size of the local data, ht.unique gathers everything and runs the operation locally; in this case, the array of unique elements has split=None. Otherwise, unique runs distributed via _pivot_sorting() (Alltoallv) and returns a distributed DNDarray.

  • Inverse indices are now a DNDarray, distributed like the input data. Note that inverse indices are used to recreate the original data shape from the unique elements; however, the sorted unique element corresponding to a given inverse index might reside on a different process. Eventually, setitem should be able to deal with this; at the moment, unique[inverse] requires a unique.resplit_(None) first.
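The inverse-index semantics this PR aligns with are those of np.unique, shown below in plain NumPy (in Heat, the same reconstruction currently needs a unique.resplit_(None) first, as noted above):

```python
import numpy as np

a = np.array([3, 1, 3, 2, 1])
u, inv = np.unique(a, return_inverse=True)
print(u)    # [1 2 3]
print(inv)  # [2 0 2 1 0]

# The inverse indices recreate the original data from the unique elements.
assert np.array_equal(u[inv], a)
```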

As an aside:

  • get gethalo to work on imbalanced DNDarrays
  • make create_lshape_map require communication only for imbalanced DNDarrays
  • resolve a race condition in test_qr that has been popping up on and off for ages
  • ADDED 24 NOV 2021: factories.array behaviour when copy=False now closer to np.array (https://numpy.org/doc/stable/reference/generated/numpy.array.html).
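The np.array behaviour being matched can be demonstrated directly. Note that the example sticks to cases whose semantics are stable across NumPy versions (NumPy 2.0 changed copy=False to raise when a copy is unavoidable):

```python
import numpy as np

a = np.arange(4)

# No dtype or order change needed: np.array(copy=False) returns a view
# of the same buffer rather than a copy.
b = np.array(a, copy=False)
assert np.shares_memory(a, b)

# A dtype conversion makes a copy unavoidable.
c = a.astype(np.float32)
assert not np.shares_memory(a, c)
```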

Issue/s resolved: #363, #564, #621

Type of change

  • Breaking change (fix or feature that would cause existing functionality to not work as expected):
    • ht.unique() always returns the sorted unique elements; the kwarg sorted has been removed
    • inverse indices are no longer torch tensors, they're now DNDarrays and distributed like the input
    • unique.resplit_(None) might be required before applying inverse indices
    • NEW: factories.array(copy=False) does not copy slices of the original data unless absolutely necessary (e.g. dtype or order changes)

Due Diligence

  • All split configurations tested
  • Multiple dtypes tested in relevant functions
  • Documentation updated (if needed)
  • Updated changelog.md under the title "Pending Additions"

Does this change modify the behaviour of other functions? If so, which?

  • the option to leave the unique elements "unsorted" is no longer available.
  • operations expecting inverse indices to be a local torch tensor will fail.
  • operations expecting ht.unique() to return a non-distributed DNDarray may fail in some cases.

@mtar mtar added High priority, urgent and removed High priority, urgent labels Dec 13, 2021
Development

Successfully merging this pull request may close these issues.

  • Unique() inconsistencies
  • Add clarification to documentation of unique()
  • vectorized sorting