We are excited to release the first Beta version of TorchArrow! TorchArrow is a machine learning preprocessing library for batch data. It provides a performant, easy-to-use, Pandas-style API for model development.
Highlights
TorchArrow provides a Python DataFrame that allows extensible UDFs with Velox, with the following features:
- Seamless handoff to PyTorch or other model authoring tools, including Tensor collation and easy integration with PyTorch DataLoader and DataPipes
- Zero copy for external readers via Arrow in-memory columnar format
- Support for multiple execution runtimes
- High-performance C++ UDF support with vectorization
Installation
In this release, we support installation via PyPI: pip install torcharrow
Documentation
You can find the API documentation here.
This 10-minute tutorial provides a short introduction to TorchArrow, and you can also try it in this Colab.
Examples
You can find an example of integrating a TorchRec-based training loop with TorchArrow's on-the-fly preprocessing here. More examples are coming soon!
Future Plans
We hope to continue expanding the library, hardening the API, and gathering feedback to enable future releases. Stay tuned!
Beta Usage Note
TorchArrow is currently in the Beta stage and does not have a stable release. The API may change based on user feedback or performance. We are committed to bringing this library to a stable release, but future changes may not be completely backward compatible. If you have suggestions on the API or use cases you'd like to see covered, please open a GitHub issue. We'd love to hear your thoughts and feedback.