TL;DR: Is nn.Module all you need for last-mile preproc?
TorchArrow set out to rethink data preparation pipelines for AI. After iterating on real product workload launches, we believe now is the right time to rethink the future strategy and direction.
The ML data world can be categorized into two parts: (1) Dataset Preparation (more offline; much of it is also known as feature engineering) and (2) Last-mile Preproc (which interacts and iterates more closely with model authoring). The boundary can be vague: during new model iteration, more of the pipeline is treated as last-mile preproc, but parts of it may later stabilize and graduate into Dataset Preparation.
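To make the split concrete, here is a small sketch of stage (1); the columns and aggregation are invented, and pandas stands in for any DataFrame engine with a similar API:

```python
import numpy as np
import pandas as pd

# Raw event log, as it might land from an upstream data pipeline.
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "clicks": [3, 5, 2],
})

# (1) Dataset Preparation / feature engineering: relational-style
# aggregations and derived columns, typically materialized offline.
features = events.groupby("user_id", as_index=False)["clicks"].sum()
features["log_clicks"] = np.log1p(features["clicks"])
```

Stage (2) would pick up `features` and apply the model-coupled transforms; see the nn.Module sketch below.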
Dataset Preparation is a natural fit for DataFrames (and can potentially be unified with feature engineering). For last-mile preproc, however, nn.Module together with a Dict[Tensor]- or TensorDict-flavored structure seems to be the more natural interface for ML engineers.
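As a rough sketch of how last-mile preproc could be authored as an nn.Module over a dict of tensors (the feature names, shapes, and transforms below are hypothetical, purely for illustration):

```python
from typing import Dict

import torch
from torch import nn


class LastMilePreproc(nn.Module):
    """Hypothetical last-mile preproc: normalize dense features, clamp ids."""

    def __init__(self, num_dense: int, max_id: int):
        super().__init__()
        # Buffers serialize with the module, so the preproc ships with its stats.
        self.register_buffer("mean", torch.zeros(num_dense))
        self.register_buffer("std", torch.ones(num_dense))
        self.max_id = max_id

    def forward(self, batch: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
        out = dict(batch)
        out["dense"] = (batch["dense"] - self.mean) / self.std
        out["sparse_ids"] = batch["sparse_ids"].clamp(max=self.max_id)
        return out


preproc = LastMilePreproc(num_dense=4, max_id=9)
batch = {
    "dense": torch.randn(2, 4),
    "sparse_ids": torch.randint(0, 20, (2, 3)),
}
out = preproc(batch)
```

Because it is a plain nn.Module, it composes with the rest of model authoring and can be serialized and shipped alongside the model.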
One potential approach (request for comment) is to use DataFrames for dataset preparation and nn.Module for last-mile preproc authoring, with a unified executor that supports both: it would run both the Velox runtime and the PyTorch runtime (e.g., a packaged and serialized nn.Module) for preproc materialization. The Apache Arrow memory layout allows smooth interop between the Data and AI sides.
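A minimal sketch of the boundary such a unified executor would manage, assuming null-free numeric columns (the helper and the scripted function are illustrative only, not a proposed API):

```python
from typing import Dict

import pyarrow as pa
import torch


def arrow_batch_to_tensors(batch: pa.RecordBatch) -> Dict[str, torch.Tensor]:
    # Flat numeric Arrow buffers and torch tensors agree on memory layout,
    # so the hand-off can share the (read-only) buffer instead of copying.
    return {
        name: torch.from_numpy(col.to_numpy(zero_copy_only=True))
        for name, col in zip(batch.schema.names, batch.columns)
    }


@torch.jit.script
def last_mile(batch: Dict[str, torch.Tensor]) -> Dict[str, torch.Tensor]:
    # Stand-in for a serialized last-mile preproc nn.Module.
    return {"dense": batch["dense"] * 2.0}


# Stand-in for a Velox/TorchArrow dataset-preparation result.
record_batch = pa.RecordBatch.from_pydict({"dense": [0.5, 1.5]})
out = last_mile(arrow_batch_to_tensors(record_batch))
```

The executor would alternate between the two runtimes at this boundary, keeping data in Arrow-compatible buffers so neither side pays a conversion cost.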
See the following doc for details and discussion.
https://docs.google.com/document/d/1RHQDCAqLCAt9EkbtaUrd5ETjrbe_7HoV-9Mfxlohq4c/