This repository has been archived by the owner on Nov 1, 2024. It is now read-only.

TorchArrow Beta Release v0.1.0

@bearzx released this 13 Jul 00:01
· 2 commits to release/0.1.0 since this release

We are excited to release the very first Beta version of TorchArrow! TorchArrow is a machine learning preprocessing library for batch data that provides a performant, easy-to-use, Pandas-style API for model development.

Highlights

TorchArrow provides a Python DataFrame that supports extensible UDFs through Velox, with the following features:

  • Seamless handoff to PyTorch or other model authoring frameworks, including Tensor collation and easy integration with PyTorch DataLoader and DataPipes
  • Zero-copy access for external readers via the Arrow in-memory columnar format
  • Support for multiple execution runtimes:
    • High-performance CPU backend via Velox
    • (Future Work) GPU backend via libcudf
  • High-performance C++ UDF support with vectorization
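
As a rough illustration of the Pandas-style DataFrame described above, the following is a minimal sketch rather than an official example. It assumes TorchArrow is installed and that `ta.dataframe` accepts a dict of Python lists, as in the project's introductory tutorial. The helper name `build_example_frame` and the column names are hypothetical.

```python
def build_example_frame():
    """Build a small TorchArrow DataFrame from plain Python lists.

    The import is done lazily so this module can still be loaded
    in environments where TorchArrow is not installed.
    """
    import torcharrow as ta  # assumption: installed via `pip install torcharrow`

    # Column names and values are made up for illustration only.
    return ta.dataframe(
        {
            "user_id": [1, 2, 3],
            "score": [0.5, 0.25, 0.75],
        }
    )
```

In an environment where TorchArrow is available, calling `build_example_frame()` should return a three-row DataFrame whose `columns` attribute lists the two column names.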

Installation

This release can be installed from PyPI: pip install torcharrow.
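
To confirm that the install succeeded, one option is to look up the installed distribution's version using only the Python standard library. This is a small sketch, not part of TorchArrow itself. The helper name `installed_version` is hypothetical, and it requires Python 3.8 or later for `importlib.metadata`.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "torcharrow") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("torcharrow is not installed; try: pip install torcharrow")
    else:
        print(f"torcharrow {version} is installed")
```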

Documentation

You can find the API documentation here.

This 10-minute tutorial provides a short introduction to TorchArrow, and you can also try it in this Colab.

Examples

You can find an example of a TorchRec-based training loop that uses TorchArrow's on-the-fly preprocessing here. More examples are coming soon!

Future Plans

We hope to continue expanding the library, hardening the API, and gathering feedback to enable future releases. Stay tuned!

Beta Usage Note

TorchArrow is currently in the Beta stage and does not have a stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you have suggestions on the API or use cases you'd like to see covered, please open a GitHub issue. We'd love to hear your thoughts and feedback.