f1-pipeline establish the infrastructure for reliable, flexible and performant Formula 1 analytics.
Based on the OpenF1 API and the Ergast API, the initial approach is to build a data pipeline that collects, processes and stores Formula 1 data in a structured way to enable perfomannt analytics and Machine Learning pipelines.
Both batch (e.g. historical data (already available)) and streaming data (e.g. live event data (work in progress)) will be considered.
- Extract: Airflow DAGs to fetch data from the OpenF1 API and store it in a raw data lake.
- Transform: Spark jobs to clean and transform raw data into schema-complient data.
- Load: ingestion into a Clickhouse Data-Warehouse.
Data Warehouse (Clickhouse)
-
Why Clickhouse?
- Columnar Database
- Fast
- Scalable
- SQL
- Open Source
- Performs for both OLAP and OLTP workloads
-
Kimball Methodology: Schema
-
Extra tables not mentioned in the schema:
-
RealTimeFact: Real-time data from the speed layer. Given Clickhouse's OLTP optimizations, there's no need to resort to a different storage engine (e.g. Redis or NoSQL in general) for real-time capabilities.
-
WeatherDT: Extra table to store weather data by the minute. Given the high cardinality of the data, in order to reference weather data though foreign key, data should be aggregated by the day (DailyWeatherDT). However, for some specific use cases, it might be useful to have the data by the minute.
-
-
In order to play with the data and run analytics yourself, the folder outputs
contains a sample of the data stored in the Clickhouse Data-Warehouse. If you want to run queries directly in Clickhouse, send me a message and I can provide you with the credentials.
By José Braz