This project has been archived. Feel free to check out Easy Expectations instead.
Want to integrate data quality checks into your pipeline but daunted by the learning curve associated with Great Expectations? Easy G.E (pronounced ee-zee-jee-ee) comes to the rescue. It is designed for the common use cases (in-memory data and files), enabling you to set up tests in minutes and saving you from navigating through Great Expectations' rapidly changing documentation.
If you prefer video tutorials, check out this low-budget 5-minute tutorial.
Note: Easy G.E has been tested exclusively on Python versions >= 3.10.
Installation is via pip:
pip install easy-ge # for gcs use pip install easy-ge[google]
- Create Your Expectation Suite: This comprises the tests that will be run on your table and/or columns. Below is an example of the contents of an expectation suite JSON file:
{
"data_asset_type": null,
"expectation_suite_name": "yellow_expectations",
"expectations": [
{
"expectation_type": "expect_column_max_to_be_between",
"kwargs": {
"column": "DOLocationID",
"max_value": 265,
"min_value": 1,
"mostly": 0.9
},
"meta": {}
}
]
}
Check here for a more extensive example, here for a tutorial, and visit the Expectations Gallery for a full list of available Expectations (and their options) to use for your own tests.
- Create Your Configuration File:
# You can use the ${VAR} syntax anywhere in the file to replace them with the corresponding runtime Python environment variable values.
Source: # data/file origin
Name: test # Assigned name for the data source
Processor: Pandas # Can be "SparkDF" or "Pandas"
Properties:
File:
FilePath: "tests/test_configs/sample_file.csv"
Backend: # where artifacts/docs will be stored
ExpectationSuiteName: yellow #Expectation suite JSON file name (without .json)
GCS:
Project: ${PROJECT}
Bucket: ${BUCKET}
Report:
NamingRegex: "%Y%m%d%H%M-yellow" # Report naming format
Outputs:
GenerateDocs: True
The above is an example. You can find different examples of configurations in this directory.
- Position the Expectation Suite: Place the suite file where you want your docs to be stored. For instance, if you want your docs in a GCS bucket, put the expectation suite in an expectations folder within that bucket. This bucket has to be in the same location as your backend.
- Import the easy_validation Function:
import os
from easy_ge import easy_validation
# Authenticate if using a cloud backend.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "gcp_key.json"
# Populate a variable declared in the manifest that doesn't exist in the original environment.
os.environ["PROJECT"] = "Stellarismusv5"
if __name__ == '__main__':
    results = easy_validation("config.yaml")
- You're done! Check the chosen backend for your docs. If you've enabled SaveSummaryTableAsCSV, a CSV file with test statistics is saved as well. This table gives a quick overview of potential issues, including problematic rows, counts of undesired values, and the indices of the affected rows, so you can inspect them more closely.
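As a rough illustration of two conventions used in the configuration file above, the sketch below expands ${VAR} placeholders from the environment and renders a report name from a pattern such as "%Y%m%d%H%M-yellow". It uses only the Python standard library and is not Easy G.E's actual implementation: the helper names expand_placeholders and render_report_name are hypothetical, and treating NamingRegex as a strftime pattern is an assumption based on its format.

```python
import os
from datetime import datetime


def expand_placeholders(text: str) -> str:
    """Replace ${VAR} with the value of the environment variable VAR.

    os.path.expandvars leaves placeholders for unset variables unchanged.
    """
    return os.path.expandvars(text)


def render_report_name(pattern: str, when: datetime) -> str:
    """Fill strftime directives (%Y, %m, ...) in the pattern; other text is kept."""
    return when.strftime(pattern)


if __name__ == "__main__":
    # Variable names are case-sensitive on Linux and macOS:
    # ${PROJECT} is filled from PROJECT, not from project.
    os.environ["PROJECT"] = "my-project"  # placeholder value
    print(expand_placeholders("Project: ${PROJECT}"))
    print(render_report_name("%Y%m%d%H%M-yellow", datetime(2024, 1, 2, 3, 4)))
```

Setting the variables before calling easy_validation, as in the import step, matters because a placeholder that is still unset when the file is read has nothing to be replaced with.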
To validate your files on the fly, you can use the Docker image. This works only for file sources, as InMemory cannot be used. AWS and GCP require credentials.
docker pull elsayed91/easy_ge:python3.10
# The key.json mount and GCS_CREDENTIALS_FILE are only needed for GCS.
# S3_ACCESS_KEY and S3_SECRET_KEY are only needed for S3.
docker run -v /path/to/config.yaml:/app/config.yaml \
  -v /path/to/key.json:/app/key.json \
  -e GCS_CREDENTIALS_FILE="key.json" \
  -e S3_ACCESS_KEY="your_aws_access_key" \
  -e S3_SECRET_KEY="your_aws_secret_key" \
  elsayed91/easy_ge --config /app/config.yaml
Spark has been tested and is functional. However, PySpark is not included as a dependency, because the Spark and PySpark versions must align perfectly. To use the SparkDF option, make sure your environment has both Spark and PySpark installed.
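Before selecting the SparkDF processor, it may help to confirm that PySpark is installed and that its version matches your Spark installation. The following is a minimal sketch using only the standard library. The helper names are hypothetical, the Spark version is assumed to be supplied by you (for example, read off from spark-submit --version), and the exact-match rule simply mirrors the "align perfectly" wording above.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def versions_align(spark_version: str, pyspark_version: Optional[str]) -> bool:
    """True only when PySpark is installed and both version strings are identical."""
    return pyspark_version is not None and spark_version.strip() == pyspark_version.strip()


if __name__ == "__main__":
    spark_version = "3.5.1"  # placeholder: replace with your cluster's Spark version
    pyspark_version = installed_version("pyspark")
    if pyspark_version is None:
        print("PySpark is not installed; install it before using SparkDF.")
    elif not versions_align(spark_version, pyspark_version):
        print(f"Version mismatch: Spark {spark_version} vs PySpark {pyspark_version}")
    else:
        print(f"Spark and PySpark both at {spark_version}")
```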
Enhancements on the radar:
- Integration with Azure.
- Profiler Support.
- Support for custom Expectations.
- Greater flexibility in Backend Setup: While the current design simplifies processes by using the same backend for all stores, providing options for customization in the future is on the agenda.
- Defining Expectations as YAML and/or in the same config file.
If you encounter problems running the package due to jsonschema, installing the package in a virtual environment should resolve the issue. Dependency conflicts with jsonschema may occur when the package is installed outside a virtual environment.
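A quick way to check both conditions from Python is sketched below. It relies on the standard comparison of sys.prefix with sys.base_prefix, which detects venv and recent virtualenv environments but not every environment manager (conda, for example, works differently). The function names are hypothetical.

```python
import sys
from importlib import metadata
from typing import Optional


def in_virtual_environment() -> bool:
    """True when the interpreter runs inside a venv or virtualenv environment."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def jsonschema_version() -> Optional[str]:
    """Return the installed jsonschema version, or None if it is not installed."""
    try:
        return metadata.version("jsonschema")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Inside a virtual environment:", in_virtual_environment())
    print("Installed jsonschema version:", jsonschema_version())
```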
- Easy G.E is an indirect product of the substantial efforts by the Great Expectations team.
- ChatGPT for helping with crafting some of the unit tests.
Should you face any issues or have inquiries, please open an issue.
Contributions are welcome! The package was designed to support extensibility and easy integration of new functionality. If you wish to add a new field to the YAML configuration, simply adjust the schema.json file to include the new field and explore how you can use its value in the expectation_manager class or the run_validation function.
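The first half of that workflow might look like the sketch below. It assumes schema.json is a standard JSON Schema document with a top-level properties object. The field name MaxRows, the stand-in schema, and the helper add_optional_field are all hypothetical, and the real schema layout may differ, so check the file before adapting this.

```python
import json


def add_optional_field(schema: dict, name: str, json_type: str, description: str) -> dict:
    """Return a copy of the schema with one extra optional top-level property."""
    updated = json.loads(json.dumps(schema))  # simple deep copy for JSON-compatible data
    updated.setdefault("properties", {})[name] = {
        "type": json_type,
        "description": description,
    }
    return updated


if __name__ == "__main__":
    # Stand-in for the contents of schema.json (hypothetical, heavily shortened).
    schema = {"type": "object", "properties": {"Source": {"type": "object"}}}
    extended = add_optional_field(schema, "MaxRows", "integer", "Hypothetical row limit.")
    print(json.dumps(extended, indent=2))
```

Because the new property is not added to any required list, existing configuration files would remain valid under the extended schema.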