f1-pipeline: Data infrastructure for Formula 1 analytics

f1-pipeline establishes the infrastructure for reliable, flexible and performant Formula 1 analytics.

Initial Approach

Based on the OpenF1 API and the Ergast API, the initial approach is to build a data pipeline that collects, processes and stores Formula 1 data in a structured way, enabling performant analytics and Machine Learning pipelines.

Both batch data (e.g. historical data, already available) and streaming data (e.g. live event data, work in progress) will be considered.
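
As a rough illustration of the collection step for historical (batch) data, the sketch below builds an OpenF1 request URL and writes the untouched response into a partitioned folder of a raw data lake. It is not the repository's code: the base URL, the `laps` endpoint, the `session_key` parameter, the folder layout and all function names are assumptions made for the example.

```python
import json
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL of the public OpenF1 API.
OPENF1_BASE_URL = "https://api.openf1.org/v1"


def build_url(endpoint: str, **params) -> str:
    """Build a request URL; parameters are sorted so the URL is deterministic."""
    query = urlencode(sorted(params.items()))
    return f"{OPENF1_BASE_URL}/{endpoint}" + (f"?{query}" if query else "")


def fetch_json(url: str) -> list:
    """Download and decode one JSON response (performs a network call)."""
    with urlopen(url, timeout=30) as response:
        return json.load(response)


def store_raw(records: list, lake_root: Path, endpoint: str, session_key: int) -> Path:
    """Write the response as-is under <lake_root>/<endpoint>/session_key=<key>/ (hypothetical layout)."""
    target = lake_root / endpoint / f"session_key={session_key}" / "data.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(records))
    return target
```

A batch task could then call `store_raw(fetch_json(build_url("laps", session_key=key)), lake_root, "laps", key)` for each session it needs to backfill. Keeping the raw response unmodified means later transformations can be re-run without hitting the API again.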

Architecture: Simple Lambda Architecture

Batch Layer (ETL)

Extract: Airflow DAGs to fetch data from the OpenF1 API and store it in a raw data lake.
Transform: Spark jobs to clean and transform raw data into schema-compliant data.
Load: Ingestion of the transformed data into a Clickhouse data warehouse.
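
To make "schema-compliant" concrete, here is a minimal sketch of the kind of cleaning rules the Transform step could apply to raw lap records. It is written in plain Python so that it runs without a Spark cluster; in a Spark job the same rules would roughly correspond to column casts, a filter on the key columns and `dropDuplicates`. The `LapRow` type, the function name and the field names (`session_key`, `driver_number`, `lap_number`, `lap_duration`, `date_start`) are assumptions for the example, not the project's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class LapRow:
    """Hypothetical target schema for one lap."""
    session_key: int
    driver_number: int
    lap_number: int
    lap_duration_s: Optional[float]
    started_at: Optional[datetime]


def to_lap_rows(raw_records: Iterable[dict]) -> Iterator[LapRow]:
    """Cast fields to the target types, drop unusable records and duplicates."""
    seen = set()
    for rec in raw_records:
        try:
            key = (int(rec["session_key"]), int(rec["driver_number"]), int(rec["lap_number"]))
        except (KeyError, TypeError, ValueError):
            continue  # a record without a complete key cannot be loaded
        if key in seen:
            continue  # keep only the first occurrence of each lap
        seen.add(key)
        duration = rec.get("lap_duration")
        started = rec.get("date_start")
        yield LapRow(
            *key,
            lap_duration_s=float(duration) if duration is not None else None,
            started_at=datetime.fromisoformat(started) if started else None,
        )
```

Missing measurements are kept as `None` rather than dropped, so the warehouse can still count the lap; only records without a usable key are discarded.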

Speed Layer (Real-time) (Work in Progress)

Data Warehouse (Clickhouse)

Why Clickhouse?
- Columnar Database
- Fast
- Scalable
- SQL
- Open Source
- Built for OLAP workloads, with insert throughput high enough for near-real-time ingestion

Kimball Methodology: Schema
- Extra tables not mentioned in the schema:
  - RealTimeFact: Real-time data from the speed layer. Because Clickhouse handles high-throughput inserts and low-latency analytical queries, a separate store (e.g. Redis or another NoSQL system) should not be needed for real-time capabilities.
  - WeatherDT: Extra table to store weather data by the minute. Given the high cardinality of the data, weather referenced through a foreign key should be aggregated by day (DailyWeatherDT). However, for some specific use cases, it might be useful to have the data by the minute.
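
The roll-up from minute-level weather to one row per day can be sketched as follows. This is only an illustration of the idea in plain Python: the input field names (`date`, `air_temperature`, `track_temperature`, `rainfall`), the chosen aggregates and the output column names are assumptions, not the actual DailyWeatherDT definition.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def aggregate_daily(minute_rows):
    """Collapse minute-level weather samples into one summary row per calendar day."""
    by_day = defaultdict(list)
    for row in minute_rows:
        day = datetime.fromisoformat(row["date"]).date()
        by_day[day].append(row)

    daily = []
    for day in sorted(by_day):
        rows = by_day[day]
        daily.append({
            "weather_date": day,  # low-cardinality key a fact table can reference
            "avg_air_temperature": mean(r["air_temperature"] for r in rows),
            "max_track_temperature": max(r["track_temperature"] for r in rows),
            "any_rainfall": any(r["rainfall"] for r in rows),
            "samples": len(rows),
        })
    return daily
```

With one row per day, the date becomes a small dimension key, while the minute-level table stays available for analyses that need finer granularity.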

Outcomes

To explore the data and run analytics yourself, the outputs folder contains a sample of the data stored in the Clickhouse data warehouse. If you want to run queries directly in Clickhouse, send me a message and I can provide you with the credentials.

By José Braz

