-
Notifications
You must be signed in to change notification settings - Fork 1
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
π Comparison/Motivation section added to docs (#15)
- Loading branch information
Showing
9 changed files
with
358 additions
and
224 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
Package,Installation Size (MB) | ||
Wimsey,26 | ||
Soda Core,50 | ||
Soda (Pandas/Dask),305 | ||
Great Expectations,372 |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,226 +1,17 @@ | ||
# Wimsey π | ||
|
||
A fast, lightweight and easy data contract library. | ||
|
||
Wimsey is a lightweight, flexible and fully open-source data contract library. It's designed to let you: | ||
|
||
- **Bring your own dataframe library**: Wimsey is built on top of [Narwhals](https://github.com/narwhals-dev/narwhals) so your tests are carried out natively in your own dataframe library (including Pandas, Polars, Dask, CuDF, Rapids, Arrow and Modin) | ||
- **Bring your own contract format**: Write contracts in yaml, json or python - whichever you prefer! | ||
- **Ultra Lightweight**: Built for fast imports and minimal overwhead with only two dependencies ([Narwhals](https://github.com/narwhals-dev/narwhals) and [FSSpec](https://github.com/fsspec/filesystem_spec)) | ||
- **Simple, easy API**: Low mental overheads with two simple functions for testing dataframes, and a simple dataclass for results. | ||
|
||
Ideally, all data would be usable when you recieve it, but you probably already have figured that's not always the case. That's where data contracts come in. | ||
|
||
A data contract is an expression of what *should* be true of some data, such as that it should 'only have columns x and y' or 'the values of column a should never exceed 1'. Wimsey is a library built to run these contracts on a dataframe during python runtime. | ||
|
||
Wimsey is built on top of the awesome [Narwhals](https://github.com/narwhals-dev/narwhals) and natively supports any dataframes that Narwhal's does. At the time of writing, that includes Polars, Pandas, Arrow, Dask, Rapids and Modin. | ||
|
||
|
||
## Run through | ||
|
||
As an example, let's work through a simple example, and imagine we recieve "top-6-sleuths.csv" daily, over sftp. It's meant to look something like this: | ||
|
||
| first_name | last_name | rating | cases_solved | | ||
|-------------|-----------|---------|--------------| | ||
| Peter | Wimsey | 9 | 11 | | ||
| Jane | Marple | 9 | 12 | | ||
| Father | Brown | 7 | 53 | | ||
| Hercule | Poirot | 10 | 33 | | ||
| Beatrice | Bradley | 8 | 66 | | ||
|
||
It's meant to contain the top 5 sleuths, only sometimes, it has the wrong number of entries; othertimes first names are missing; and whilst ratings *should* be out of 10, sometimes they are over that. To make things worse every now and then, someone puts "lots" into `cases_solved` meaning it's no longer a number, and that causes all kinds of trouble. | ||
|
||
### Writing Tests | ||
|
||
We can convert those concerns we just mentioned into four tests to carry out on our dataset: | ||
|
||
- The row count should be 5 | ||
- `first_name` should never be null | ||
- `rating` should be be less than 10 | ||
- `cases_solved` should be a number | ||
|
||
We can test for a lot more than that, but that works for our example. Our first move is to write this out as a "contract". This can be a yaml or json file, or alternatively, we can code it directly into python. | ||
|
||
=== "sleuth-checks.yaml" | ||
```yaml | ||
- test: row_count_should | ||
be_exactly: 5 | ||
- column: first_name | ||
test: null_percentage_should | ||
be_exactly: 0 | ||
- column: rating | ||
test: max_should | ||
be_less_than_or_equal_to: 10 | ||
- column: cases_solved | ||
test: type_should | ||
be_one_of: | ||
- int64 | ||
- float64 | ||
``` | ||
> Note you'll need `pyyaml` installed to support reading yaml | ||
|
||
=== "sleuth-checks.json" | ||
```json | ||
[ | ||
{ | ||
"test": "row_count_should", | ||
"be_exactly": 5 | ||
}, | ||
{ | ||
"column": "first_name", | ||
"test": "null_percentage_should", | ||
"be_exactly": 0 | ||
}, | ||
{ | ||
"column": "rating", | ||
"test": "max_should", | ||
"be_less_than_or_equal_to": 10 | ||
}, | ||
{ | ||
"column": "cases_solved", | ||
"test": "type_should", | ||
"be_one_of": ["int64", "float64"] | ||
] | ||
``` | ||
=== "sleuth_checks.py" | ||
```python | ||
from wimsey import tests | ||
|
||
checks = [ | ||
tests.row_count_should(be_exactly=5), | ||
tests.null_percentage_should(column="first_name", be_exactly=0), | ||
tests.max_should(column="rating", be_less_than_or_equal_to=10), | ||
tests.type_should(column="cases_solved", be_one_of=["int64", "float64]), | ||
] | ||
``` | ||
|
||
See [Possible Tests](possible_tests.md) for a full catalogue of runnable tests and their configurations. | ||
|
||
### Executing Tests | ||
|
||
Now that we've written out tests, we just need to actually *run* them on the actual data. There's two functions `wimsey` gives you to carry out checks: `validate` and `test`. These both carry out checks in the same way, but behave slightly differently based on the results. | ||
|
||
- `test` will return a `FinalResult` type of object. It's a dataclasses containing a `success` boolean, alongside further details on the individual tests in a `results` lists. | ||
- `validate` will run the checks and then just return the initial dataframe assuming everything passed. If any tests failed, it'll stop execution and throw a `DataValidationException`. | ||
|
||
These are designed to cover a couple different use cases, `test` will provide more details if you want to dig into problems in a dataset, whilst `validate` is helpful if you just want to use `wimsey` as a "guard" to catch bad data from being processed. | ||
|
||
We'll cover `test` first, it's called the same regardless of what type your dataframe is: | ||
|
||
=== "using sleuth-checks.yaml" | ||
```python | ||
from wimsey import test | ||
|
||
result = test(df, contract="sleuth-checks.yaml") | ||
if result.success: | ||
print("Everything is as expected! π") | ||
else: | ||
print("Uh-oh, something's up! π¬") | ||
print([i for i in result.results if not i.success]) | ||
``` | ||
> Note you'll need `pyyaml` installed to support reading yaml | ||
|
||
=== "using sleuth-checks.json" | ||
```python | ||
from wimsey import test | ||
|
||
result = test(df, contract="sleuth-checks.json") | ||
if result.success: | ||
print("Everything is as expected! π") | ||
else: | ||
print("Uh-oh, something's up! π¬") | ||
print([i for i in result.results if not i.success]) | ||
``` | ||
=== "using sleuth_checks.py" | ||
```python | ||
from wimsey import test | ||
from sleuth_checks import checks | ||
|
||
result = test(df, contract=checks) | ||
if result.success: | ||
print("Everything is as expected! π") | ||
else: | ||
print("Uh-oh, something's up! π¬") | ||
print([i for i in result.results if not i.success]) | ||
``` | ||
|
||
Wimsey uses [fsspec](https://pypi.org/project/fsspec/) under the hood, so configs can be from any filesystem supported by fsspec (such as S3, SSH, Azure, Google Cloud etc) - use the fsspec prefix and pass in the appropriate storage options using `test`'s `storage_options` keyword. See fsspec documentation for more details on this. | ||
|
||
Validate, will run tests in the exact same way as `test`, but simply raises an error if data fails expectations. This, in conjunction with Wimsey's compatibility with multiple dataframe types can make it a convenient tool for providing guarantees in a data pipeline. | ||
|
||
=== "pandas" | ||
```python | ||
import pandas as pd | ||
from wimsey import validate | ||
|
||
from settings import sleuth_storage_options | ||
|
||
top_sleuth: str = ( | ||
pd.read_csv( | ||
"sshfs://sleuthwatch/top-5-sleuths.csv", | ||
storage_options=sleuth_storage_options, | ||
) | ||
.pipe(validate, "sleuth-checks.json") # <- this is the wimsey bit | ||
.assign(name=lambda df: df["first_name"] + df["last_name"]) | ||
.sort_values("rating", ascending=False) | ||
["name"][0] | ||
) | ||
|
||
print(f"{top_sleuth} is the best sleuth!") | ||
``` | ||
|
||
=== "polars" | ||
```python | ||
import polars as pl | ||
from wimsey import validate | ||
|
||
from settings import sleuth_storage_options | ||
|
||
top_sleuth: str = ( | ||
pl.read_csv( | ||
"sshfs://sleuthwatch/top-5-sleuths.csv", | ||
storage_options=storage_options, | ||
) | ||
.pipe(validate, "sleuth-checks.json") # <- this is the wimsey bit | ||
.with_columns(name=pl.col("first_name") + " " + pl.col("last_name")) | ||
.sort("rating", descending=True) | ||
.select("name") | ||
.to_series()[0] | ||
) | ||
|
||
print(f"{top_sleuth} is the best sleuth!") | ||
``` | ||
|
||
=== "dask" | ||
```python | ||
import dask.dataframe as dd | ||
from wimsey import validate | ||
|
||
from settings import sleuth_storage_options | ||
|
||
top_sleuth: str = ( | ||
dd.read_csv( | ||
"sshfs://sleuthwatch/top-5-sleuths.csv", | ||
storage_options=sleuth_storage_options, | ||
) | ||
.pipe(validate, "sleuth-checks.json") # <- this is the wimsey bit | ||
.assign(name=lambda df: df["first_name"] + " " + df["last_name"]) | ||
.sort_values("rating", ascending=False) | ||
["name"] | ||
.compute()[0] | ||
) | ||
|
||
print(f"{top_sleuth} is the best sleuth!") | ||
``` | ||
|
||
=== "pyarrow" | ||
```python | ||
from pyarrow import compute, csv | ||
from wimsey import validate | ||
|
||
from settings import download_sleuth_file | ||
|
||
download_sleuth_file(to="local-5-sleuths.csv") | ||
df = csv.read_csv("local-5-sleuths.csv") | ||
validate(df, "sleuth-checks.json") # <- this is the wimsey bit | ||
name = compute.binary_join_element_wise(df["first_name"], df["last_name"], " ") | ||
df = df.append("name", name).sort_by("rating") | ||
top_sleuth = str(df["name"][-1]) | ||
|
||
print(f"{top_sleuth} is the best sleuth!") | ||
``` | ||
|
||
And that's it, to keep things simple `validate` and `test` are the only public-intended functions in Wimsey, aside from test creation, which is covered further in the *possible tests* section. | ||
If you're looking to get a quick feel for Wimsey, check out the [quick start documentation](quick-start.md) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
Wimsey is designed to be a data contracts library a lot like Soda or Great Expectations. Rather than aiming to provide new functionality, it's primary motivation is to be as lightweight as possible, and, by focusing on dataframes, allow data tests to be evaluated natively and efficiently. | ||
|
||
It's probably a good fit for you if: | ||
|
||
- β You're working with dataframes in python (via Pandas, Polars, Dask, Modin, etc) | ||
- β You want to carry out data testing with minimal overheads | ||
- β You want to minimise your overall dependencies | ||
- β Have an existing metadata format that you want to integrate tests into | ||
|
||
It might not work for you if: | ||
|
||
- β You're wanting to test SQL data without ingesting into python | ||
- β You want a data contracts solution that also provides a business user facing GUI | ||
|
||
## How small is Wimsey? | ||
|
||
The answer is *very*. To give you a picture of comparison to alternative tools by size, here's a comparions of virtual environment sizes based on libaries + their dependencies*. | ||
|
||
|
||
```vegalite | ||
{ | ||
"description": "A simple bar chart with embedded data.", | ||
"data": {"url" : "assets/raw-size.csv"}, | ||
"mark": {"type": "bar", "tooltip": true}, | ||
"encoding": { | ||
"x": {"field": "Package", "type": "nominal", "axis": {"labelAngle": 0}}, | ||
"y": {"field": "Installation Size (MB)", "type": "quantitative"} | ||
} | ||
} | ||
``` | ||
|
||
It's worth bearing in mind that some of these dependencies might be ones you already need to have installed. | ||
|
||
\* Note that soda is a little unusual here, since `soda-core` is very small (around 2x Wimsey's size), but also requires additional components to work with different data types. | ||
|
||
## How fast is Wimsey? | ||
|
||
That's a very big *it depends*. Wimsey executes tests *in your own dataframe library* so performance will match your library of choice, if you're using Modin or Dask, Wimsey will be able to operate over large distributed datasets, if you're using Polars, Wimsey will be blazingly fast. | ||
|
||
Narwhals [operates natively on dataframes with minimal overhead](https://narwhals-dev.github.io/narwhals/overhead/) so you should expect to see performant operations. Additionally, if you were previously needing to convert, or sample data into another format, you'll no longer need to carry this step out, saving you more runtime. |
Oops, something went wrong.