The aim of this project is to create a modern end-to-end ETL platform for Big Data analysis: a hands-on experience with the latest library versions in a fully dockerized environment for NYC Yellow Taxi Trips analytics. With Airflow and PySpark at its core, you'll explore the power of large-scale data processing using DAGs. Choose between GCP and AWS for cloud solutions and manage your infrastructure with Terraform. Enjoy the FastAPI web application for easy access to trip analytics.

The system fetches trip data from a specified URL and performs fundamental daily analytics, including passenger count, distance traveled, and maximum trip distance. It then uploads the computed results to the cloud storage of the chosen provider, transfers the data from storage to a designated database, and exposes the insights through an API endpoint.
- The ETL platform is an open-source, flexible playground for Big Data analysis based on the classic standalone Yellow Taxi Trip Data.
- Fully dockerized via Docker Compose with the latest library versions (Python 3.10+).
- Harnesses the power of Airflow and PySpark for efficient processing of large datasets.
- Offers Google Cloud Platform (Google Cloud Storage, BigQuery) or Amazon Web Services (S3, Redshift) cloud solutions, based on user preference.
- Cloud infrastructure management handled through Terraform.
- Includes a user-friendly FastAPI web application behind Traefik, enabling easy access to trip analytics.
- Uses Poetry for dependency management.
- Provides basic pre-commit hooks, Ruff formatting, and Checkov checks for security and compliance issues.
- Clone the ETL platform project.
- Check that the following are installed (or install them); a quick version check is sketched just below:
  - Docker and Docker Compose.
  - Terraform to manage cloud infrastructure.
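A minimal way to verify the tooling is available (any reasonably recent versions should work):

```bash
docker --version
docker-compose --version
terraform -version
```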
- Get and place credentials for the clouds you plan to use (AWS, GCP, or both):
  - Google Cloud Platform:
    - Create a project.
    - Create a service account.
    - Add roles and policies for Google Cloud Storage (e.g. `StorageAdmin`) and BigQuery (e.g. `BigQueryAdmin`).
    - Download the `.json` file with credentials and place it under the `credentials/gcp` folder (see the sketch after this list).
  - Amazon Web Services:
    - Add roles for the selected user.
    - Create an `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` in the specified `AWS_REGION` (the default for the ETL platform is `us-east-1`) and store them in a safe place.
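A minimal sketch of placing the credentials, assuming the GCP service-account key was downloaded to `~/Downloads` (use whatever filename you gave the key):

```bash
# GCP: put the downloaded service-account key under credentials/gcp
mkdir -p credentials/gcp
mv ~/Downloads/<filename>.json credentials/gcp/

# AWS: no file is needed here; keep the access keys at hand for the
# Terraform and .env steps below, and never commit them to the repository
```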
- Apply the cloud infrastructure via Terraform from the root folder:

```
cd terraform/<cloud-provider>
terraform init
terraform plan
terraform apply -auto-approve
```
For the AWS cloud provider you have to pass `aws_access_key_id`, `aws_secret_access_key`, and `redshift_master_password`. As an output you will get `redshift_cluster_endpoint`, from which the `redshift-host` can be extracted and passed to the `.env` file below.
For the GCP cloud provider you have to pass `credentials_path=../../credentials/gcp/<filename>.json` and `project_name`.
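A minimal sketch of passing these variables with `-var` flags (values are placeholders, and the provider folder names `aws` and `gcp` are assumptions; a `terraform.tfvars` file works just as well):

```bash
# AWS (run inside terraform/aws)
terraform apply -auto-approve \
  -var="aws_access_key_id=<access-key-id>" \
  -var="aws_secret_access_key=<secret-access-key>" \
  -var="redshift_master_password=<redshift-master-password>"

# GCP (run inside terraform/gcp)
terraform apply -auto-approve \
  -var="credentials_path=../../credentials/gcp/<filename>.json" \
  -var="project_name=<project-name>"
```

If the variables are not passed on the command line, Terraform will prompt for them interactively.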
If the previous steps are done successfully, you will see the `Apply complete!` message from Terraform, and the cloud infrastructure is ready to use!
- Update the `.env` file under the `build` folder with actual credentials:

```
# Gcp
GCP_CREDENTIALS_PATH=/opt/airflow/credentials/gcp/<filename>.json
GCP_PROJECT_NAME=<project-name>

# Aws
AWS_ACCESS_KEY_ID=<access-key-id>
AWS_SECRET_ACCESS_KEY=<secret-access-key-id>
REDSHIFT_HOST=<redshift-host>
REDSHIFT_MASTER_PASSWORD=<redshift-master-password>
```
- Bring the project up via docker-compose from the `build` folder:

```
docker-compose up --build
```
- Check that the build finished successfully by opening the Airflow UI at http://localhost:8080 (provide `_AIRFLOW_WWW_USER_USERNAME` and `_AIRFLOW_WWW_USER_PASSWORD` from `.env` to authenticate). A quick container check is sketched just below.
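If the UI is not reachable, a minimal way to inspect the stack (service names depend on `build/docker-compose.yml`, so filter the logs accordingly):

```bash
cd build
docker-compose ps              # all services should be Up / healthy
docker-compose logs --tail=50  # or: docker-compose logs <service-name>
```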
- Your environment will contain a DAG called `ETL_trip_data`, which you can trigger with default parameters (and the selected cloud provider) from the UI, or via the REST API as sketched below.
- The DAG graph can be found under http://localhost:8080/dags/ETL_trip_data/grid?tab=graph
- The FastAPI web application docs can be found under http://localhost:8009/docs
- Spark interface: http://localhost:8082
- Traefik monitoring: http://localhost:8085
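A minimal sketch of triggering the DAG from the command line, assuming the Airflow stable REST API with basic authentication is enabled in this deployment (use the `_AIRFLOW_WWW_USER_*` credentials from `.env`):

```bash
curl -X POST "http://localhost:8080/api/v1/dags/ETL_trip_data/dagRuns" \
  -u "<airflow-username>:<airflow-password>" \
  -H "Content-Type: application/json" \
  -d '{"conf": {}}'
```

An empty `conf` starts a run with the default parameters; the run then appears in the grid view linked above.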