dato is an open-source library that provides a rapid, declarative ecosystem for reproducible data science in Python. dato accomplishes this by (1) enabling piping with >> and (2) unifying common data science libraries under a common syntax.
df >> GroupBy('country') >> Sum >> Hist('revenue', col='age')
Dato has four major components:
dato.base.Pipeable: Decorator that enables piping with >>.
dato.process: Sub-module with pipe-compatible pandas operations.
dato.plot: Sub-module with pipe-compatible plotting operations, following a consistent pandas-inspired syntax with seaborn-esque extended functionality.
dato.ml: (in development) Simplifies and standardizes syntax across popular ML libraries.
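
To give a feel for how a decorator can make a function usable with >>, here is a minimal sketch built on Python's reflected right-shift hook, __rrshift__. It is not dato's actual implementation: the class below is a hypothetical stand-in that only borrows the name Pipeable, and the helpers where and total are invented for illustration. It works on plain lists of dicts so that it runs without pandas.

```python
class Pipeable:
    """Hypothetical pipe-enabling decorator (an illustration, not dato's code)."""

    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __call__(self, *args, **kwargs):
        # Calling a stage with arguments returns a new, configured stage.
        return Pipeable(self.func, *args, **kwargs)

    def __rrshift__(self, left):
        # `left >> stage` falls back to this method when `left` does not
        # handle `>>` itself; the left operand becomes the first argument.
        return self.func(left, *self.args, **self.kwargs)


@Pipeable
def where(rows, key, value):
    # Keep only the rows whose `key` equals `value`.
    return [row for row in rows if row[key] == value]


@Pipeable
def total(rows, key="revenue"):
    # Sum one field across all rows.
    return sum(row[key] for row in rows)


rows = [
    {"country": "US", "revenue": 10},
    {"country": "FR", "revenue": 5},
    {"country": "US", "revenue": 7},
]

# A stage can be used bare (total) or configured (where("country", "US")).
result = rows >> where("country", "US") >> total
```

Because >> is left-associative, the expression reads top to bottom in the order the operations run: filter first, then sum. A real implementation would also need to account for left operands that define their own >> behaviour.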
pip install dato
Although piping has some downsides as a general programming paradigm (in particular, it can obscure where an error occurred and makes step-by-step debugging harder), we argue that these downsides are outweighed by the concision and maintainability it lends to data workflows. In development environments that carry hidden state (such as Jupyter or R Markdown), reproducibility is difficult to achieve consistently, because a result can depend on which cells were run and in what order. Piping mitigates this danger by (1) enforcing a consistent order of operations, and (2) disallowing hidden state. Consequently, the piping paradigm is naturally reproducible, production-ready, and stable as soon as it is written, properties that are of paramount importance in data work.
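
The hidden-state problem can be illustrated with a small, self-contained sketch. It does not use dato: the Pipe class, the to_cents and total stages, and the "cell" function are all hypothetical names chosen for this example. The first half mimics a notebook cell that mutates shared data and is accidentally run twice; the second half expresses the same computation as a single piped expression.

```python
# Notebook style: a "cell" that rewrites shared data in place.
state = {"prices": [10, 20, 30]}  # prices in dollars


def cell_convert_to_cents():
    # Hidden state: the outcome depends on how many times this cell has run.
    state["prices"] = [p * 100 for p in state["prices"]]


cell_convert_to_cents()
cell_convert_to_cents()  # accidental re-run silently converts a second time
stateful_total = sum(state["prices"])


# Pipeline style: a minimal, hypothetical pipe wrapper (not dato's API).
class Pipe:
    def __init__(self, func):
        self.func = func

    def __rrshift__(self, left):
        # `left >> stage` applies the wrapped function to `left`.
        return self.func(left)


to_cents = Pipe(lambda prices: [p * 100 for p in prices])
total = Pipe(sum)

raw_prices = [10, 20, 30]

# The whole computation is one expression with no intermediate variables,
# so evaluating it again gives the same answer.
first_run = raw_prices >> to_cents >> total
second_run = raw_prices >> to_cents >> total
```

Here stateful_total ends up a factor of 100 too large after the repeated cell, while first_run and second_run agree, because the piped version never overwrites its input.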