Normalize UMAP interface as part of a standard algorithms interface strategy #300

lmeyerov · 2022-01-14T19:20:40Z

lmeyerov
Jan 14, 2022
Maintainer

UMAP is unlikely to be a one-off algorithm integration, so its interface will set a precedent for later ones and seems worth a bit of thought.

It is interesting in a few ways:

It has an unsupervised / auto mode as well as a supervised mode
There are multiple competing implementations with differing args and values
We can assume scaling, featurization, etc. happen as separate calls, and maybe even label propagation for generalizing UMAP to graph nodes, graph edges, etc.
We want external UMAP and dimensionality reduction / embedding libs to be able to target our plotting API from their side. What API should that be?

Conventions

Some we should probably follow:

The supervised convention nowadays follows the scikit-learn style (model(), fit(), ...), with params and outputs at each step
In unsupervised mode, we can write to default node target cols (x, y), and users can override them
The PyData multi-engine convention seems to be: framework-level and standard params as normal args, then an engine selection arg plus passthrough kwargs for per-engine options
Dataframes: We support multiple DF types (pandas, arrow, cudf, dask, dask_cudf) as inputs
... and maybe pytorch (for dgl) now too? Or later, when dgl comes in?
We can assume ._nodes and ._edges as input and, once added, whatever .featurize() and friends provide ahead of time
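
To make the multi-engine convention concrete, here is a minimal sketch of standard params plus an engine selection arg plus passthrough options. Everything in it is hypothetical: `reduce_dims`, `register_engine`, the `ENGINES` registry, and the stand-in `truncate_engine` are illustration names, not existing graphistry, umap_learn, or cuml APIs, and the "auto" rule (first registered engine) is just an assumption.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical registry: engine name -> callable(rows, n_components, **engine_opts).
# Real entries would wrap umap_learn or cuml; a stand-in keeps the sketch runnable.
Reducer = Callable[..., List[List[float]]]
ENGINES: Dict[str, Reducer] = {}


def register_engine(name: str, fn: Reducer) -> None:
    """Make an engine selectable by name."""
    ENGINES[name] = fn


def reduce_dims(
    rows: List[List[float]],
    n_components: int = 2,                        # standard, framework-level param
    engine: str = "auto",                         # engine selection arg
    extra_opts: Optional[Dict[str, Any]] = None,  # per-engine passthrough options
) -> List[List[float]]:
    if not ENGINES:
        raise RuntimeError("no engines registered")
    if engine == "auto":
        # Assumption: "auto" simply picks the first registered engine
        engine = next(iter(ENGINES))
    if engine not in ENGINES:
        raise ValueError(f"unknown engine {engine!r}; have {sorted(ENGINES)}")
    # Standard params are passed explicitly; engine-specific ones pass through untouched
    return ENGINES[engine](rows, n_components, **(extra_opts or {}))


def truncate_engine(rows, n_components, scale=1.0):
    """Stand-in 'engine': keeps the first n_components values of each row, scaled."""
    return [[v * scale for v in row[:n_components]] for row in rows]


register_engine("truncate", truncate_engine)
```

The point of the split is that `n_components` means the same thing for every engine, while `scale` is only understood by one of them and never leaks into the shared signature.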

Some questions:

What is UMAP on a graph? Node UMAP, edge UMAP, ...?
Which columns does UMAP look at? How would the user pass those in, how could they use a helper call like .featurize() to generate them, and what internal convention do we have for those?
What presentation controls do we want? Ex: whether to show feature attributes or just the original ones, and how to control which similarity edges to show (k-neighbor, up-to-k-neighbors within some nearness threshold, ...)
How do we ensure the design is remote-friendly, such as for saving/restoring models and running bigger jobs remotely?
What stable interface do we expose to umap_learn and cuml.umap?
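
One possible answer to the stable-interface question is a small structural type that any scikit-style reducer satisfies. The sketch below is an assumption, not a decided design: `DimReducer`, `embed`, and the stand-in `MeanCenter` class are hypothetical names. As far as I know, the UMAP classes in both umap_learn and cuml already provide `fit()` and `transform()`, which is why the protocol asks for nothing more.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class DimReducer(Protocol):
    """Hypothetical minimal contract: scikit-style fit() then transform()."""

    def fit(self, X: Any, y: Any = None) -> "DimReducer": ...

    def transform(self, X: Any) -> Any: ...


class MeanCenter:
    """Stand-in reducer so the sketch runs without umap_learn or cuml installed."""

    def fit(self, X, y=None):
        n = len(X)
        # Column-wise means of the training rows
        self.means_ = [sum(col) / n for col in zip(*X)]
        return self

    def transform(self, X):
        return [[v - m for v, m in zip(row, self.means_)] for row in X]


def embed(reducer: DimReducer, X, y=None):
    """Fit on X (optionally supervised via y) and return the transformed rows."""
    # runtime_checkable only verifies that the methods exist, not their signatures
    if not isinstance(reducer, DimReducer):
        raise TypeError("reducer must provide fit() and transform()")
    return reducer.fit(X, y).transform(X)
```

Because the check is structural, an external library would not need to import anything from us to be compatible; it only needs the two methods.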

Samples

Some sample program ideas:

Automatic table -> similarity graph:

g = graphistry.nodes(df, 'id')
# any graphistry setup: pandas, cudf, ...

g.umap().plot()
# unsupervised
# uses all numeric cols of df to compute k-nn similarity graph via UMAP
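
For the "all numeric cols" default, a minimal pandas sketch could look like the following. The helper name `numeric_feature_cols` is hypothetical, and excluding the node id column is my assumption (an id is usually numeric but not a feature); other DF engines would need their own equivalent.

```python
import pandas as pd


def numeric_feature_cols(df: pd.DataFrame, node_col: str) -> list:
    """Numeric columns of df, excluding the node id column (assumed not a feature)."""
    numeric = df.select_dtypes(include="number").columns
    return [c for c in numeric if c != node_col]


# Toy table: 'id' is the node column, 'name' is non-numeric and gets skipped
df = pd.DataFrame({
    "id": [0, 1, 2],
    "age": [31, 45, 27],
    "score": [0.2, 0.9, 0.5],
    "name": ["a", "b", "c"],
})
features = numeric_feature_cols(df, "id")
```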

Supervised:

graphistry.nodes(df, 'id').umap(...).fit(...).plot()

With managed node features:

g = graphistry.nodes(df, 'id').edges(df2, 'src', 'dst')
g2 = g.featurize(nodes=..., edges=...).label_prop(...)
g2.umap().plot()

Edge UMAP instead of node UMAP:

g = graphistry.nodes(df, 'id').edges(df2, 'src', 'dst')
g2 = g.linegraph() # so we always do umap on nodes, this is the trick to flip to edges
g2.umap().plot()
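
To spell out the linegraph trick: each original edge becomes a node, and two of those nodes are linked when the original edges share an endpoint. The sketch below uses plain tuples rather than dataframes and treats the graph as undirected; `line_graph` is a hypothetical standalone helper, not the proposed g.linegraph() method.

```python
from collections import defaultdict
from itertools import combinations
from typing import List, Tuple


def line_graph(edges: List[Tuple[str, str]]) -> List[Tuple[int, int]]:
    """Undirected line graph: node i stands for edges[i];
    i and j are linked when edges[i] and edges[j] share an endpoint."""
    incident = defaultdict(list)
    for i, (src, dst) in enumerate(edges):
        for endpoint in {src, dst}:  # a set, so self-loops are counted once
            incident[endpoint].append(i)
    links = set()
    for edge_ids in incident.values():
        # Every pair of edges touching the same endpoint becomes a link
        links.update(combinations(edge_ids, 2))
    return sorted(links)
```

Edge attributes would then ride along as node attributes of the new graph, so node UMAP on the result is effectively edge UMAP on the original.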

Presentation:

g.umap(k=20, threshold_stddev=2).plot()
# show up to 20 similar neighbors per node, and only those within 2 std deviations away

g.umap(edges=False, output_prefix="umap_").plot()
# do not show edges, and put labels on cols umap_x/umap_y
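
The k and threshold_stddev controls above can be read in more than one way. The sketch below assumes one reading: drop candidate edges whose distance exceeds the global mean plus threshold_stddev standard deviations, then keep at most k nearest per source node. `filter_similarity_edges` is a hypothetical helper, and the cutoff rule is my assumption rather than a settled semantics.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple

Edge = Tuple[str, str, float]  # (src, dst, distance)


def filter_similarity_edges(
    edges: List[Edge], k: int = 20, threshold_stddev: float = 2.0
) -> List[Edge]:
    """Keep up to k nearest neighbors per source, within a global distance cutoff."""
    if not edges:
        return []
    # Assumed reading: cutoff = mean + threshold_stddev * stddev over all candidates
    dists = [d for _, _, d in edges]
    cutoff = mean(dists) + threshold_stddev * pstdev(dists)
    by_src: Dict[str, List[Edge]] = {}
    for e in edges:
        if e[2] <= cutoff:
            by_src.setdefault(e[0], []).append(e)
    kept: List[Edge] = []
    for src_edges in by_src.values():
        # Nearest first, then truncate to k
        kept.extend(sorted(src_edges, key=lambda e: e[2])[:k])
    return kept
```

A per-node threshold (stddev computed over each node's own neighbors) would be an equally plausible reading and would behave differently in dense versus sparse regions.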

Passthrough:

g.umap(engine="umap_learn", extra_opts={...}).plot()

@silkspace
