
Features/unique sort distributed #749

Draft
wants to merge 110 commits into base: main

Conversation

@ClaudiaComito (Contributor) commented Mar 24, 2021

Description

This PR introduces major changes to the ht.unique() implementation, fixing some bugs and inconsistencies along the way (see below).

Changes proposed:

Distributed unique requires two passes:

  1. find local sorted unique elements,
  2. find global sorted unique elements.

The current (v0.5.1) implementation solves step 2 by running torch.unique again on the gathered local unique elements. This can become a memory bottleneck for very large data.
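The gather-based two-pass approach can be sketched in plain NumPy. The chunks and values below are purely illustrative, standing in for the per-process torch.unique calls on a split DNDarray:

```python
import numpy as np

# Hypothetical per-process chunks of a split DNDarray (illustrative only).
local_chunks = [np.array([3, 1, 3, 2]), np.array([2, 5, 1]), np.array([4, 4, 5])]

# Step 1: each process finds its local sorted unique elements.
local_uniques = [np.unique(chunk) for chunk in local_chunks]

# Step 2 (v0.5.1 approach): gather all local uniques and run unique again.
# Simple, but the gathered array can become a memory bottleneck at scale.
gathered = np.concatenate(local_uniques)
global_unique = np.unique(gathered)
print(global_unique)  # [1 2 3 4 5]
```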

The main implementation change in this PR is that, in the distributed case, ht.unique now reuses the "pivot sorting" implementation (see ht.sort(), manipulations._pivot_sorting()) to perform an Alltoallv-based sorted unique operation that does not require gathering the data on one process.
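The pivot idea can be illustrated without MPI: the routing step below simulates the Alltoallv exchange, and the pivots are chosen by hand rather than sampled as in the real implementation. Each "process" deduplicates only the value range it owns, so no single process ever holds the full data:

```python
import numpy as np

# Local sorted uniques from step 1 on three hypothetical processes.
local_uniques = [np.array([1, 2, 3]), np.array([1, 2, 5]), np.array([4, 5])]

# Pivots partition the value range across processes (hand-picked here):
# process 0 owns values <= 2, process 1 owns (2, 4], process 2 owns > 4.
pivots = [2, 4]

def owner(v):
    for p, piv in enumerate(pivots):
        if v <= piv:
            return p
    return len(pivots)

# Simulated Alltoallv: route each local unique value to its owning process.
buckets = [[] for _ in range(3)]
for lu in local_uniques:
    for v in lu:
        buckets[owner(v)].append(v)

# Each process deduplicates only its own bucket; concatenating the
# per-process results is globally sorted and unique, with no full gather.
result_chunks = [np.unique(b) for b in buckets]
global_unique = np.concatenate(result_chunks)
print(global_unique)  # [1 2 3 4 5]
```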

The main user-side changes are as follows:

  • Like NumPy, ht.unique now always returns the SORTED unique elements.

  • "sparse" vs. "dense" unique: if the collective size of the local uniques (from step 1 above) is smaller than the size of the local data, ht.unique gathers everything and runs the operation locally; in this case, the array of unique elements has split=None. Otherwise, unique runs distributed via _pivot_sorting() (Alltoallv) and returns a distributed DNDarray.

  • Inverse indices are now a DNDarray, distributed like the input data. Note that inverse indices are used to recreate the original data shape from the unique elements; however, the sorted unique element corresponding to a given inverse index might reside on a different process. Eventually, setitem should be able to deal with this; at the moment, unique[inverse] requires a unique.resplit_(None) first.
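The inverse-index semantics this PR aligns with are those of np.unique, shown below in plain NumPy (in Heat, the same reconstruction currently needs a unique.resplit_(None) first, as noted above):

```python
import numpy as np

a = np.array([3, 1, 3, 2, 1])
u, inv = np.unique(a, return_inverse=True)
print(u)    # [1 2 3]
print(inv)  # [2 0 2 1 0]

# The inverse indices recreate the original data from the unique elements.
assert np.array_equal(u[inv], a)
```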

As an aside:

  • get gethalo to work on imbalanced DNDarrays
  • make create_lshape_map require communication only for imbalanced DNDarrays
  • resolve a race condition in test_qr that has been popping up on and off for ages
  • ADDED 24 NOV 2021: factories.array behaviour when copy=False now closer to np.array (https://numpy.org/doc/stable/reference/generated/numpy.array.html).
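The np.array behaviour being matched can be demonstrated directly. Note that the example sticks to cases whose semantics are stable across NumPy versions (NumPy 2.0 changed copy=False to raise when a copy is unavoidable):

```python
import numpy as np

a = np.arange(4)

# No dtype or order change needed: np.array(copy=False) returns a view
# of the same buffer rather than a copy.
b = np.array(a, copy=False)
assert np.shares_memory(a, b)

# A dtype conversion makes a copy unavoidable.
c = a.astype(np.float32)
assert not np.shares_memory(a, c)
```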

Issue/s resolved: #363, #564, #621

Type of change

  • Breaking change (fix or feature that would cause existing functionality to not work as expected):
    • ht.unique() always returns the sorted unique elements; the kwarg sorted has been removed
    • inverse indices are no longer torch tensors, they're now DNDarrays and distributed like the input
    • unique.resplit_(None) might be required before applying inverse indices
    • NEW: factories.array(copy=False) does not copy slices of the original data unless absolutely necessary (e.g. dtype or order changes)

Due Diligence

  • All split configurations tested
  • Multiple dtypes tested in relevant functions
  • Documentation updated (if needed)
  • Updated changelog.md under the title "Pending Additions"

Does this change modify the behaviour of other functions? If so, which?

  • the option to leave the unique elements "unsorted" is no longer available.
  • operations expecting inverse indices to be a local torch tensor will fail.
  • operations expecting ht.unique() to return a non-distributed DNDarray may fail in some cases.

@mtar mtar added High priority, urgent and removed High priority, urgent labels Dec 13, 2021
Development

Successfully merging this pull request may close these issues.

  • Unique() inconsistencies
  • Add clarification to documentation of unique()
  • vectorized sorting