UMAP is unlikely to be a one-off algorithm integration, so its design will set a precedent for later ones and is worth some thought.
It is interesting in a few ways:
unsupervised / auto mode + supervised
multiple competing implementations with differing args & values
we can assume scaling, featurization, etc. as separate calls, maybe even label prop for generalizing umap to graph nodes / graph edges / etc!
we want external umap and dimensionality reduction / embedding libs to be able to target our plotting API on their side, what API?
Conventions
Some we should probably follow
Supervised convention nowadays follows the scikit style (model(), fit(), ...), with params and outputs at each step
In unsupervised, we can make a default node target col (x, y), and users can override
The PyData multi-engine convention seems to be: framework-level and standard params as normal, plus an engine-selection arg and passthrough kwargs for per-engine options
Dataframes: We support multiple DF types (pandas, arrow, cudf, dask, dask_cudf) as inputs
... and maybe pytorch (for dgl) now too? or later when dgl comes in?
We can assume ._nodes and ._edges as input and, once added, whatever .featurize() and friends provide ahead of time
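To make the engine-selection plus passthrough-kwargs convention concrete, here is a minimal sketch. Everything in it is an assumption for discussion: `embed`, `ENGINES`, `engine_kwargs`, and the toy engine are hypothetical names, not an existing graphistry API. The only real library calls are the lazy imports of `umap.UMAP` (umap-learn) and `cuml.manifold.UMAP`, which run only when that engine is selected.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence

Rows = Sequence[Sequence[float]]


def _umap_learn_factory(**kwargs: Any) -> Any:
    # Lazy import so the CPU engine is only required when selected.
    from umap import UMAP
    return UMAP(**kwargs)


def _cuml_factory(**kwargs: Any) -> Any:
    # Lazy import so a GPU environment is only required when selected.
    from cuml.manifold import UMAP
    return UMAP(**kwargs)


class FirstColumnsEngine:
    """Toy stand-in engine (hypothetical): keeps the first n_components columns."""

    def __init__(self, n_components: int = 2, **extra: Any) -> None:
        self.n_components = n_components
        self.extra = extra  # per-engine kwargs arrive here untouched

    def fit_transform(self, rows: Rows) -> List[List[float]]:
        return [list(row[: self.n_components]) for row in rows]


# Hypothetical registry: engine name -> factory returning a scikit-style model.
ENGINES: Dict[str, Callable[..., Any]] = {
    "umap_learn": _umap_learn_factory,
    "cuml": _cuml_factory,
    "toy": FirstColumnsEngine,
}


def embed(
    rows: Rows,
    engine: str = "umap_learn",
    n_components: int = 2,
    engine_kwargs: Optional[Dict[str, Any]] = None,
) -> Any:
    """Framework-level params are explicit; everything else passes through."""
    if engine not in ENGINES:
        raise ValueError(f"unknown engine {engine!r}; choose from {sorted(ENGINES)}")
    model = ENGINES[engine](n_components=n_components, **(engine_kwargs or {}))
    return model.fit_transform(rows)
```

With this shape, `embed(rows, engine="cuml", engine_kwargs={"n_neighbors": 15})` would forward `n_neighbors` to the selected engine, while `n_components` stays a shared, framework-level parameter. An external library could also register its own factory in the registry, which is one possible answer to the question of what they would target.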
Some questions:
What is UMAP on a graph? nodes umap, edges umap, ...
What columns does umap look at? How would the user pass those in, and how could they use a helper call like .featurize() to generate them, and what internal convention do we have for those?
What presentation controls do we want? Ex: Whether to show feature attributes or just original, how to control which similarity edges to show (k-neighbor, or up-to-k-neighbors-on-some-nearness-threshold, ...)
How to ensure the design will be remote-friendly, such as for potentially saving/restoring models, and doing bigger jobs remotely?
What stable interface do we expose to umap_learn and cuml.umap?
Samples
Some sample program ideas:
Automatic table -> similarity graph:
g = graphistry.nodes(df, 'id')  # any graphistry setup: pandas, cudf, ...
g.umap().plot()  # unsupervised
# uses all numeric cols of df to compute k-nn similarity graph via UMAP
Edge UMAP instead of node UMAP:
g = graphistry.nodes(df, 'id').edges(df2, 'src', 'dst')
g2 = g.linegraph()  # we always do umap on nodes, so this is the trick to flip to edges
g2.umap().plot()
Presentation:
g.umap(k=20, threshold_stddev=2).plot()
# show up to 20 similar neighbors per node, and only those within 2 std deviations away
g.umap(edges=False, output_prefix="umap_").plot()
# do not show edges, and put labels on cols umap_x/umap_y
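One way the `k` and `threshold_stddev` controls could behave is sketched below. This is only an interpretation, not a settled design: it assumes "within 2 std deviations" means an edge is kept when its distance is at most the mean plus `threshold_stddev` standard deviations of all candidate k-nearest-neighbor distances. The function name `similarity_edges` is hypothetical, and the brute-force neighbor search is for illustration only.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Edge = Tuple[int, int, float]  # (source index, neighbor index, distance)


def similarity_edges(
    points: Sequence[Sequence[float]],
    k: int = 20,
    threshold_stddev: float = 2.0,
) -> List[Edge]:
    """Keep up to k nearest neighbors per point, then drop unusually long edges."""
    candidates: List[Edge] = []
    for i, p in enumerate(points):
        # Brute-force neighbor search; fine for a sketch, too slow for big data.
        neighbors = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        candidates.extend((i, j, d) for d, j in neighbors[:k])

    if not candidates:
        return []

    distances = [d for _, _, d in candidates]
    # Assumed cutoff: mean + threshold_stddev * population std of candidate distances.
    cutoff = mean(distances) + threshold_stddev * pstdev(distances)
    return [edge for edge in candidates if edge[2] <= cutoff]
```

Because the cutoff is applied after the per-node top-k step, a node can end up with fewer than `k` edges, or none at all if it is an outlier, which matches the "up to 20 similar neighbors" wording above.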
Further sample ideas: supervised UMAP, UMAP with managed node features, and passthrough of per-engine arguments.