The aim of this project is to create a modern end-to-end ETL platform for Big Data analysis: a hands-on experience with the latest library versions in a fully dockerized environment for NYC Yellow Taxi Trips analytics. With Airflow and PySpark at its core, you'll explore the power of large-scale data processing using DAGs. Choose between GCP and AWS for cloud solutions and manage your infrastructure with Terraform. Enjoy the FastAPI web application for easy access to trip analytics.

The system fetches trip data from a specified URL and performs fundamental daily analytics, including passenger count, distance traveled, and maximum trip distance. It then uploads the computed results to the cloud storage of the chosen provider, transfers the data from storage to a designated database, and exposes the insights through an API endpoint.
- The ETL platform is an open-source, flexible playground for Big Data analysis based on the classic standalone Yellow Taxi Trip Data.
- Fully dockerized via Docker Compose with the latest library versions (Python 3.10+).
- Harnesses the power of Airflow and PySpark for efficient processing of large datasets.
- Offers Google Cloud Platform (Google Cloud Storage, BigQuery) or Amazon Web Services (S3, Redshift) cloud solutions, based on user preference.
- Cloud infrastructure management handled through Terraform.
- Includes a user-friendly FastAPI web application behind Traefik, enabling easy access to trip analytics.
- Uses Poetry for dependency management.
- Provides basic pre-commit hooks, Ruff formatting, and Checkov checks for security and compliance issues.
- Clone the ETL platform project.
- Check that the following are installed (or install them); a quick version check is sketched just below:
  - Docker and Docker Compose.
  - Terraform to manage cloud infrastructure.
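A minimal way to verify the tooling is available (any reasonably recent versions should work):

```bash
docker --version
docker-compose --version
terraform -version
```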
- Get and place credentials for the clouds you plan to use (AWS, GCP, or both):
  - Google Cloud Platform:
    - Create a project.
    - Create a service account.
    - Add roles and policies for Google Cloud Storage (e.g. `StorageAdmin`) and BigQuery (e.g. `BigQueryAdmin`).
    - Download the `.json` file with credentials and place it under the `credentials/gcp` folder (see the sketch after this list).
  - Amazon Web Services:
    - Add roles for the selected user.
    - Create an `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` in the specified `AWS_REGION` (the default for the ETL platform is `us-east-1`) and store them in a safe place.
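A minimal sketch of placing the credentials, assuming the GCP service-account key was downloaded to `~/Downloads` (use whatever filename you gave the key):

```bash
# GCP: put the downloaded service-account key under credentials/gcp
mkdir -p credentials/gcp
mv ~/Downloads/<filename>.json credentials/gcp/

# AWS: no file is needed here; keep the access keys at hand for the
# Terraform and .env steps below, and never commit them to the repository
```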
- Apply the cloud infrastructure via Terraform from the root folder:

```
cd terraform/<cloud-provider>
terraform init
terraform plan
terraform apply -auto-approve
```
For the AWS cloud provider you have to pass `aws_access_key_id`, `aws_secret_access_key`, and `redshift_master_password`. As an output you will get `redshift_cluster_endpoint`, from which the `redshift-host` can be extracted and passed to the `.env` file below.
For the GCP cloud provider you have to pass `credentials_path=../../credentials/gcp/<filename>.json` and `project_name`.
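A minimal sketch of passing these variables with `-var` flags (values are placeholders, and the provider folder names `aws` and `gcp` are assumptions; a `terraform.tfvars` file works just as well):

```bash
# AWS (run inside terraform/aws)
terraform apply -auto-approve \
  -var="aws_access_key_id=<access-key-id>" \
  -var="aws_secret_access_key=<secret-access-key>" \
  -var="redshift_master_password=<redshift-master-password>"

# GCP (run inside terraform/gcp)
terraform apply -auto-approve \
  -var="credentials_path=../../credentials/gcp/<filename>.json" \
  -var="project_name=<project-name>"
```

If the variables are not passed on the command line, Terraform will prompt for them interactively.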
If the previous steps are done successfully, you will see the `Apply complete!` message from Terraform, and the cloud infrastructure is ready to use!
- Update the `.env` file under the `build` folder with actual credentials:

```
# Gcp
GCP_CREDENTIALS_PATH=/opt/airflow/credentials/gcp/<filename>.json
GCP_PROJECT_NAME=<project-name>

# Aws
AWS_ACCESS_KEY_ID=<access-key-id>
AWS_SECRET_ACCESS_KEY=<secret-access-key-id>
REDSHIFT_HOST=<redshift-host>
REDSHIFT_MASTER_PASSWORD=<redshift-master-password>
```
- Bring the project up via docker-compose from the `build` folder:

```
docker-compose up --build
```
- Check that the build finished successfully by opening the Airflow UI at http://localhost:8080 (provide `_AIRFLOW_WWW_USER_USERNAME` and `_AIRFLOW_WWW_USER_PASSWORD` from `.env` to authenticate). A quick container check is sketched just below.
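If the UI is not reachable, a minimal way to inspect the stack (service names depend on `build/docker-compose.yml`, so filter the logs accordingly):

```bash
cd build
docker-compose ps              # all services should be Up / healthy
docker-compose logs --tail=50  # or: docker-compose logs <service-name>
```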
- Your environment will contain a DAG called `ETL_trip_data`, which you can trigger with default parameters (and the selected cloud provider) from the UI, or via the REST API as sketched below.
- The DAG graph can be found under http://localhost:8080/dags/ETL_trip_data/grid?tab=graph
- The FastAPI web application docs can be found under http://localhost:8009/docs
- Spark interface: http://localhost:8082
- Traefik monitoring: http://localhost:8085
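A minimal sketch of triggering the DAG from the command line, assuming the Airflow stable REST API with basic authentication is enabled in this deployment (use the `_AIRFLOW_WWW_USER_*` credentials from `.env`):

```bash
curl -X POST "http://localhost:8080/api/v1/dags/ETL_trip_data/dagRuns" \
  -u "<airflow-username>:<airflow-password>" \
  -H "Content-Type: application/json" \
  -d '{"conf": {}}'
```

An empty `conf` starts a run with the default parameters; the run then appears in the grid view linked above.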