Commit 5869bd2

πŸ“– Comparison/Motivation section added to docs (#15)

benrutter authored Nov 7, 2024
1 parent 31e80e8 commit 5869bd2
Showing 9 changed files with 358 additions and 224 deletions.
5 changes: 5 additions & 0 deletions docs/assets/raw-size.csv
@@ -0,0 +1,5 @@
Package,Installation Size (MB)
Wimsey,26
Soda Core,50
Soda (Pandas/Dask),305
Great Expectations,372
6 changes: 3 additions & 3 deletions docs/generate_possible_tests.py
@@ -27,8 +27,8 @@ def generate_doc(name: str, doc_string: str, annotations: dict) -> str:
examples["not_be"] = "string"
annotations.pop("return")
dict_eg = {k: examples[k] for k in annotations}
yaml_eg = yaml.dump(dict_eg)
json_eg = json.dumps(dict_eg, indent=2)
yaml_eg = yaml.dump({"test": name} | dict_eg)
json_eg = json.dumps({"test": name} | dict_eg, indent=2)
python_eg = f"""
from wimsey import test
from wimsey.tests import {name}
@@ -76,5 +76,5 @@ def generate_doc(name: str, doc_string: str, annotations: dict) -> str:
test_generator.__annotations__,
)
file_doc += doc
with open("docs/possible_tests.md", "wt") as file:
with open("docs/possible-tests.md", "wt") as file:
file.write(file_doc)
225 changes: 8 additions & 217 deletions docs/index.md
@@ -1,226 +1,17 @@
# Wimsey πŸ”

A fast, lightweight and easy data contract library.

Wimsey is a lightweight, flexible and fully open-source data contract library. It's designed to let you:

- **Bring your own dataframe library**: Wimsey is built on top of [Narwhals](https://github.com/narwhals-dev/narwhals) so your tests are carried out natively in your own dataframe library (including Pandas, Polars, Dask, CuDF, Rapids, Arrow and Modin)
- **Bring your own contract format**: Write contracts in yaml, json or python - whichever you prefer!
- **Ultra Lightweight**: Built for fast imports and minimal overhead, with only two dependencies ([Narwhals](https://github.com/narwhals-dev/narwhals) and [FSSpec](https://github.com/fsspec/filesystem_spec))
- **Simple, easy API**: Low mental overhead, with two simple functions for testing dataframes and a simple dataclass for results.

Ideally, all data would be usable when you receive it, but you've probably already figured out that's not always the case. That's where data contracts come in.

A data contract is an expression of what *should* be true of some data, such as 'it should only have columns x and y' or 'the values of column a should never exceed 1'. Wimsey is a library built to run these contracts against a dataframe at Python runtime.
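For instance, that second rule could be written as a one-test contract in yaml, using the `max_should` test covered in full later on:

```yaml
- column: a
  test: max_should
  be_less_than_or_equal_to: 1
```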

Wimsey is built on top of the awesome [Narwhals](https://github.com/narwhals-dev/narwhals) and natively supports any dataframe that Narwhals does. At the time of writing, that includes Polars, Pandas, Arrow, Dask, Rapids and Modin.


## Run through

Let's work through a simple example: imagine we receive "top-5-sleuths.csv" daily over SFTP. It's meant to look something like this:

| first_name | last_name | rating | cases_solved |
|-------------|-----------|---------|--------------|
| Peter | Wimsey | 9 | 11 |
| Jane | Marple | 9 | 12 |
| Father | Brown | 7 | 53 |
| Hercule | Poirot | 10 | 33 |
| Beatrice | Bradley | 8 | 66 |

It's meant to contain the top 5 sleuths, only sometimes it has the wrong number of entries; other times first names are missing; and whilst ratings *should* be out of 10, sometimes they exceed that. To make things worse, every now and then someone puts "lots" into `cases_solved`, meaning it's no longer a number, and that causes all kinds of trouble.

### Writing Tests

We can convert those concerns we just mentioned into four tests to carry out on our dataset:

- The row count should be 5
- `first_name` should never be null
- `rating` should be less than or equal to 10
- `cases_solved` should be a number

We can test for a lot more than that, but this works for our example. Our first move is to write these out as a "contract". This can be a yaml or json file, or alternatively, we can write it directly in python.

=== "sleuth-checks.yaml"
```yaml
- test: row_count_should
  be_exactly: 5
- column: first_name
  test: null_percentage_should
  be_exactly: 0
- column: rating
  test: max_should
  be_less_than_or_equal_to: 10
- column: cases_solved
  test: type_should
  be_one_of:
    - int64
    - float64
```
> Note you'll need `pyyaml` installed to support reading yaml

=== "sleuth-checks.json"
```json
[
  {
    "test": "row_count_should",
    "be_exactly": 5
  },
  {
    "column": "first_name",
    "test": "null_percentage_should",
    "be_exactly": 0
  },
  {
    "column": "rating",
    "test": "max_should",
    "be_less_than_or_equal_to": 10
  },
  {
    "column": "cases_solved",
    "test": "type_should",
    "be_one_of": ["int64", "float64"]
  }
]
```
=== "sleuth_checks.py"
```python
from wimsey import tests

checks = [
    tests.row_count_should(be_exactly=5),
    tests.null_percentage_should(column="first_name", be_exactly=0),
    tests.max_should(column="rating", be_less_than_or_equal_to=10),
    tests.type_should(column="cases_solved", be_one_of=["int64", "float64"]),
]
```

See [Possible Tests](possible-tests.md) for a full catalogue of runnable tests and their configurations.

### Executing Tests

Now that we've written our tests, we just need to actually *run* them on the data. There are two functions `wimsey` gives you to carry out checks: `validate` and `test`. Both carry out checks in the same way, but behave differently based on the results.

- `test` will return a `FinalResult` object: a dataclass containing a `success` boolean, alongside further details on the individual tests in a `results` list.
- `validate` will run the checks and return the initial dataframe if everything passes. If any test fails, it'll stop execution and throw a `DataValidationException`.

These are designed to cover a couple of different use cases: `test` provides more detail if you want to dig into problems in a dataset, whilst `validate` is helpful if you just want to use `wimsey` as a "guard" to stop bad data from being processed.

We'll cover `test` first; it's called the same way regardless of your dataframe type:

=== "using sleuth-checks.yaml"
```python
from wimsey import test

result = test(df, contract="sleuth-checks.yaml")
if result.success:
    print("Everything is as expected! πŸ™Œ")
else:
    print("Uh-oh, something's up! 😬")
    print([i for i in result.results if not i.success])
```
> Note you'll need `pyyaml` installed to support reading yaml

=== "using sleuth-checks.json"
```python
from wimsey import test

result = test(df, contract="sleuth-checks.json")
if result.success:
    print("Everything is as expected! πŸ™Œ")
else:
    print("Uh-oh, something's up! 😬")
    print([i for i in result.results if not i.success])
```
=== "using sleuth_checks.py"
```python
from wimsey import test
from sleuth_checks import checks

result = test(df, contract=checks)
if result.success:
    print("Everything is as expected! πŸ™Œ")
else:
    print("Uh-oh, something's up! 😬")
    print([i for i in result.results if not i.success])
```

Wimsey uses [fsspec](https://pypi.org/project/fsspec/) under the hood, so contracts can be read from any filesystem fsspec supports (such as S3, SSH, Azure or Google Cloud): use the relevant fsspec prefix and pass the appropriate storage options via `test`'s `storage_options` keyword. See the fsspec documentation for more details.
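For instance, reading a contract straight from S3 might look like the sketch below — the bucket and path are made up, and the credential keywords are the standard `s3fs` ones:

```python
from wimsey import test

result = test(
    df,
    contract="s3://example-bucket/contracts/sleuth-checks.yaml",  # hypothetical bucket/path
    storage_options={"key": "<access-key>", "secret": "<secret-key>"},  # s3fs credentials
)
```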

`validate` will run tests in the exact same way as `test`, but simply raises an error if data fails expectations. This, in conjunction with Wimsey's compatibility with multiple dataframe types, makes it a convenient tool for providing guarantees in a data pipeline.

=== "pandas"
```python
import pandas as pd
from wimsey import validate

from settings import sleuth_storage_options

top_sleuth: str = (
    pd.read_csv(
        "sshfs://sleuthwatch/top-5-sleuths.csv",
        storage_options=sleuth_storage_options,
    )
    .pipe(validate, "sleuth-checks.json")  # <- this is the wimsey bit
    .assign(name=lambda df: df["first_name"] + " " + df["last_name"])
    .sort_values("rating", ascending=False)
    ["name"]
    .iloc[0]  # positional access, since sorting keeps the original index labels
)

print(f"{top_sleuth} is the best sleuth!")
```

=== "polars"
```python
import polars as pl
from wimsey import validate

from settings import sleuth_storage_options

top_sleuth: str = (
    pl.read_csv(
        "sshfs://sleuthwatch/top-5-sleuths.csv",
        storage_options=sleuth_storage_options,
    )
    .pipe(validate, "sleuth-checks.json")  # <- this is the wimsey bit
    .with_columns(name=pl.col("first_name") + " " + pl.col("last_name"))
    .sort("rating", descending=True)
    .select("name")
    .to_series()[0]
)

print(f"{top_sleuth} is the best sleuth!")
```

=== "dask"
```python
import dask.dataframe as dd
from wimsey import validate

from settings import sleuth_storage_options

top_sleuth: str = (
    dd.read_csv(
        "sshfs://sleuthwatch/top-5-sleuths.csv",
        storage_options=sleuth_storage_options,
    )
    .pipe(validate, "sleuth-checks.json")  # <- this is the wimsey bit
    .assign(name=lambda df: df["first_name"] + " " + df["last_name"])
    .sort_values("rating", ascending=False)
    ["name"]
    .compute()
    .iloc[0]  # positional access, since sorting keeps the original index labels
)

print(f"{top_sleuth} is the best sleuth!")
```

=== "pyarrow"
```python
from pyarrow import compute, csv
from wimsey import validate

from settings import download_sleuth_file

download_sleuth_file(to="local-5-sleuths.csv")
df = csv.read_csv("local-5-sleuths.csv")
validate(df, "sleuth-checks.json") # <- this is the wimsey bit
name = compute.binary_join_element_wise(df["first_name"], df["last_name"], " ")
df = df.append_column("name", name).sort_by("rating")
top_sleuth = str(df["name"][-1])

print(f"{top_sleuth} is the best sleuth!")
```
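If you'd rather quarantine bad data than let the exception halt your pipeline, you can catch it. A minimal sketch — the `DataValidationException` import path and quarantine location here are assumptions:

```python
import pandas as pd
from wimsey import validate, DataValidationException  # import path is an assumption

df = pd.read_csv("local-5-sleuths.csv")
try:
    validate(df, "sleuth-checks.json")
except DataValidationException:
    # Set the bad file aside for inspection instead of processing it
    df.to_csv("quarantine/local-5-sleuths.csv", index=False)
    raise
```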

And that's it! To keep things simple, `validate` and `test` are the only public-facing functions in Wimsey, aside from test creation, which is covered further in the *possible tests* section.
If you're looking to get a quick feel for Wimsey, check out the [quick start documentation](quick-start.md).
40 changes: 40 additions & 0 deletions docs/motivation.md
@@ -0,0 +1,40 @@
Wimsey is designed to be a data contracts library a lot like Soda or Great Expectations. Rather than aiming to provide new functionality, its primary motivation is to be as lightweight as possible and, by focusing on dataframes, to allow data tests to be evaluated natively and efficiently.

It's probably a good fit for you if:

- βœ… You're working with dataframes in python (via Pandas, Polars, Dask, Modin, etc)
- βœ… You want to carry out data testing with minimal overheads
- βœ… You want to minimise your overall dependencies
- βœ… You have an existing metadata format that you want to integrate tests into

It might not work for you if:

- ❌ You want to test SQL data without ingesting it into python
- ❌ You want a data contracts solution that also provides a business user facing GUI

## How small is Wimsey?

The answer is *very*. To give you a sense of how it compares to alternative tools by size, here's a comparison of virtual environment sizes based on libraries plus their dependencies*.


```vegalite
{
  "description": "A simple bar chart with embedded data.",
  "data": {"url": "assets/raw-size.csv"},
  "mark": {"type": "bar", "tooltip": true},
  "encoding": {
    "x": {"field": "Package", "type": "nominal", "axis": {"labelAngle": 0}},
    "y": {"field": "Installation Size (MB)", "type": "quantitative"}
  }
}
```

It's worth bearing in mind that some of these dependencies might be ones you already need to have installed.

\* Note that Soda is a little unusual here, since `soda-core` is very small (around 2x Wimsey's size), but requires additional components to work with different data types.

## How fast is Wimsey?

That's a very big *it depends*. Wimsey executes tests *in your own dataframe library*, so performance will match your library of choice: if you're using Modin or Dask, Wimsey will be able to operate over large distributed datasets; if you're using Polars, Wimsey will be blazingly fast.

Narwhals [operates natively on dataframes with minimal overhead](https://narwhals-dev.github.io/narwhals/overhead/), so you should expect performant operations. Additionally, if you previously needed to convert or sample data into another format, you'll no longer need to carry out that step, saving you further runtime.