
[RFC] Rethinking of ML Preproc in PyTorch Ecosystem #515

Open

wenleix opened this issue Mar 10, 2023 · 0 comments

wenleix (Contributor) commented Mar 10, 2023

TL;DR: Is nn.Module all you need for last-mile preproc?

TorchArrow set out to rethink data preparation pipelines for AI. After iterating on real product workload launches, we believe now is the right time to revisit the future strategy and direction.

The ML data world can be categorized into two parts: (1) Dataset Preparation (mostly offline; much of it is also known as feature engineering) and (2) Last-mile Preproc (which interacts and iterates more closely with model authoring). The boundary can be blurry: during new model iteration, more work tends to be treated as Last-mile Preproc, but over time it may stabilize and graduate into Dataset Preparation.

Dataset Preparation is a natural fit for DataFrames (and can potentially be unified with feature engineering). For last-mile preproc, however, nn.Module together with a Dict[str, Tensor] or TensorDict-style structure seems to be the more natural interface for ML engineers, as sketched below.
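As a concrete (hypothetical) illustration of what last-mile preproc as a plain nn.Module over Dict[str, Tensor] could look like, here is a minimal sketch; the feature names and transforms are made up for illustration and are not part of this RFC:

```python
from typing import Dict

import torch
from torch import nn


class LastMilePreproc(nn.Module):
    """Hypothetical last-mile preproc: Dict[str, Tensor] in, Dict[str, Tensor] out.

    Because it is just an nn.Module with typed inputs/outputs, it can be
    scripted/packaged and shipped alongside the model.
    """

    def __init__(self, age_boundaries: torch.Tensor, num_hash_buckets: int):
        super().__init__()
        self.register_buffer("age_boundaries", age_boundaries)
        self.num_hash_buckets = num_hash_buckets

    def forward(self, feats: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        out: Dict[str, torch.Tensor] = {}
        # Bucketize a dense float feature into integer bucket ids.
        out["age_bucket"] = torch.bucketize(feats["age"], self.age_boundaries)
        # Fold a large sparse id space into a fixed number of buckets.
        out["item_id"] = feats["item_id"] % self.num_hash_buckets
        return out


preproc = LastMilePreproc(torch.tensor([18.0, 30.0, 50.0]), num_hash_buckets=1000)
batch = {"age": torch.tensor([22.0, 41.0]), "item_id": torch.tensor([123_456, 98_765])}
print(preproc(batch))  # {"age_bucket": ..., "item_id": ...}
```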

One potential approach (request for comment) is to use DataFrames for dataset preparation and nn.Module for last-mile preproc authoring. We could then implement a unified executor that supports both, executing against the Velox runtime as well as the PyTorch runtime (e.g., a packaged and serialized nn.Module) for preproc materialization. The Apache Arrow memory layout allows smooth interop between the Data and AI worlds; a rough end-to-end sketch follows.
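To make the proposed split concrete, here is a rough sketch of how the two stages could fit together: dataset preparation expressed with a TorchArrow DataFrame, followed by a bridge into the Dict[str, Tensor] batch that a last-mile preproc nn.Module (such as the sketch above) would consume. The column names, derived feature, and the Python-level copy are illustrative assumptions only, not part of the RFC; an actual unified executor would rely on the shared Arrow memory layout rather than Python copies.

```python
import torch
import torcharrow as ta

# Stage 1: dataset preparation as a DataFrame program (illustrative columns).
df = ta.dataframe({
    "age": [22.0, 41.0, 35.0],
    "item_id": [123_456, 98_765, 555],
    "clicks": [3, 0, 7],
})
df["engagement"] = df["clicks"] + 1  # example derived feature

# Stage 2: bridge the prepared columns into a Dict[str, Tensor] batch for
# last-mile preproc authored as an nn.Module (e.g. the sketch above).
# NOTE: the Python-level copy below is only for illustration; the point of
# the Arrow memory layout is to make this handoff far more efficient.
batch = {
    "age": torch.tensor(list(df["age"])),
    "item_id": torch.tensor(list(df["item_id"])),
    "engagement": torch.tensor(list(df["engagement"])),
}
print(batch)
```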

See the following doc for details and discussion:

https://docs.google.com/document/d/1RHQDCAqLCAt9EkbtaUrd5ETjrbe_7HoV-9Mfxlohq4c/
