Utilizing dltHub, dbt, + Dagster as a framework for developing data products with software engineering best practices.
While the short-term goal is to learn these tools, the greater goal is to understand and flesh out what the full development and deployment cycle can look like for orchestrating a data platform and deploying custom pipelines. The transformation layer already has a great process with dbt: local development, testing, versioning/branching, CI/CD, code review, separation of dev and prod, project structure/cohesion, etc. But how can we apply that to the entire data platform, and especially to the 10-20% of ingestion jobs that cannot be done in a managed tool like Airbyte and/or are best done using a custom solution?
- Orchestrated ingest, transformation, and downstream dependencies (ML/Analytics) with Dagster - #2, #6
  - Developed in a dev environment and materialized in the `dagster dev` server
  - Configured resources / credentials in a root `.env` file
  - Current Dagster folder structure (dependencies managed by uv) - #15
    - One code location: `dagster_proj/`
      - Assets: `dagster_proj/assets/`
      - Resources: `dagster_proj/resources/__init__.py`
      - Jobs: `dagster_proj/jobs/__init__.py`
      - Schedules: `dagster_proj/schedules/__init__.py`
      - Utils: `dagster_proj/utils/__init__.py`
      - Definitions: `dagster_proj/__init__.py`
    - The structure is experimental and based on the DagsterU courses
- Built a dltHub EL pipeline via the `RESTAPIConfig` class in `dagster_proj/assets/dlt/activities.py`
- Built a dbt-core project to transform the activities data in `analytics_dbt/models`
- Created an Sklearn ML pipeline to predict energy expenditure for a given cycling activity
  - WIP, but the general flow of preprocessing, building the ML model, training, testing/evaluation, and prediction can be found in `dagster_proj/assets/ml_analytics/energy_prediction.py`
  - This is a downstream dependency of a dbt asset materialized in DuckDB
- Created a Plotly analytics dashboard + an ML results related visualization - #14
  - In `dagster_proj/assets/ml_analytics/weekly_totals.py`
- Deployed this project to Dagster+
  - CI/CD w/ branching deployments for every PR
  - Separated execution environments - #13
    - dev (DuckDB)
    - branch (Snowflake)
    - prod (Snowflake)
- Configured pre-commits / CI checks and added unit tests - #16
- Beef up the ML pipeline with `dagster-mlflow` for experiment tracking, model versioning, better model observability, etc.
- Utilize Snowflake Cloning/dbt Slim CI for CI
- Implement partitions/backfilling with dlt/Dagster
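The dev / branch / prod split listed above generally comes down to picking a warehouse based on an environment variable. The following is a minimal sketch of that idea, not the repo's actual resource code (which lives in `dagster_proj/resources/__init__.py` and may look quite different). It assumes the `DAGSTER_ENVIRONMENT` and `DUCKDB_DATABASE` variables from the local `.env`; the `warehouse_config` function and the shape of the returned dictionary are hypothetical.

```python
import os
from typing import Dict, Mapping, Optional


def warehouse_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Pick warehouse settings for the current deployment (hypothetical helper)."""
    # Fall back to the real process environment when no mapping is passed in.
    env = os.environ if env is None else env
    name = env.get("DAGSTER_ENVIRONMENT", "dev")

    if name == "dev":
        # Local development runs against a DuckDB file.
        return {
            "engine": "duckdb",
            "database": env.get("DUCKDB_DATABASE", "data/dev/strava.duckdb"),
        }
    if name in ("branch", "prod"):
        # Branch and prod deployments both target Snowflake.
        return {"engine": "snowflake", "deployment": name}

    raise ValueError(f"Unknown DAGSTER_ENVIRONMENT: {name!r}")
```

Passing the environment in as a mapping keeps the selection logic easy to unit test without touching real credentials.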
For local development only:
- Clone this repo locally
- Create a `.env` file at the root of the directory:
# these are the config values for local dev and will change in branch/prod deployment
DBT_TARGET=dev
DAGSTER_ENVIRONMENT=dev
DUCKDB_DATABASE=data/dev/strava.duckdb
#strava
CLIENT_ID=
CLIENT_SECRET=
REFRESH_TOKEN=
- Download `uv` and run `uv sync`
- Build the Python package in developer mode via `uv pip install -e ".[dev]"`
- Run the dagster daemon locally via `dagster dev`
- Materialize the pipeline!
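Before running `dagster dev`, it can help to confirm that the `.env` file from the steps above has every value filled in, since the Strava credentials are left blank in the template. The sketch below is one way to do that with only the standard library. The `parse_env` and `missing_keys` helpers are hypothetical and not part of this repo, and the parser only handles simple `KEY=value` lines.

```python
from typing import Dict, List

# Keys taken from the .env template in the setup steps.
REQUIRED_KEYS = [
    "DBT_TARGET",
    "DAGSTER_ENVIRONMENT",
    "DUCKDB_DATABASE",
    "CLIENT_ID",
    "CLIENT_SECRET",
    "REFRESH_TOKEN",
]


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values: Dict[str, str]) -> List[str]:
    """Return required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

For example, `missing_keys(parse_env(open(".env").read()))` should return an empty list once the Strava credentials have been added.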
Additional Notes:
- The `refresh_token` in the Strava UI produces an `access_token` that is limited in scope. Please follow these Strava Dev Docs to generate the proper `refresh_token`, which will then produce an `access_token` with the proper scopes.
- If you want to run the dbt project locally, outside of Dagster, you need to add a `DBT_PROFILES_DIR` environment variable to the `.env` file and export it
  - For example, my local env var is: `DBT_PROFILES_DIR=/Users/jairusmartinez/Desktop/dlt-strava/analytics_dbt`
  - Yours will be: `DBT_PROFILES_DIR=/PATH_TO_YOUR_CLONED_REPO_DIR/analytics_dbt`
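To make the first note more concrete, here is a rough sketch of how a `refresh_token` is exchanged for an `access_token`. It is not how this repo authenticates (the dlt pipeline handles that); it only illustrates the exchange. The endpoint and form fields follow Strava's documented OAuth refresh flow as I understand it, so verify them against the Strava Dev Docs. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed token endpoint; confirm against the Strava Dev Docs.
STRAVA_TOKEN_URL = "https://www.strava.com/oauth/token"


def build_refresh_request(
    client_id: str, client_secret: str, refresh_token: str
) -> urllib.request.Request:
    """Build (but do not send) the POST that trades a refresh token for an access token."""
    payload = urllib.parse.urlencode(
        {
            "client_id": client_id,
            "client_secret": client_secret,
            "grant_type": "refresh_token",
            "refresh_token": refresh_token,
        }
    ).encode("utf-8")
    return urllib.request.Request(STRAVA_TOKEN_URL, data=payload, method="POST")


def fetch_access_token(request: urllib.request.Request) -> str:
    """Send the request and return the access token from the JSON response."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["access_token"]
```

The scopes attached to the returned `access_token` are determined by how the `refresh_token` was originally authorized, which is why the token from the Strava UI is not enough on its own.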