This project is a data pipeline designed to retrieve hourly measurements of air pollutants from the AirNow API provided by the US Environmental Protection Agency. The data warehouse model is structured to contain hourly measurements of four key air pollutants: ozone, PM10, PM2.5, and NO2. The structured data model facilitates historical analysis of air quality trends over time. By storing hourly measurements, users can analyze patterns, seasonal variations, and long-term trends in air pollutant concentrations. In addition to a date dimension, the data warehouse contains dimensions for the monitoring sites that send data to the EPA as well as the reporting areas that those sites cover, enabling analyses of air quality by geographic location. The inclusion of latitude and longitude in these dimensions also supports map-based visualizations and reporting applications.
Access to comprehensive air quality data empowers policymakers, researchers, and environmental advocates to make data-driven decisions. Whether it's assessing the effectiveness of pollution control measures, identifying areas of concern, or evaluating the impact of environmental policies, this data pipeline supports informed decision-making processes.
The ETL pipeline begins with data ingestion from the AirNow API. The raw responses are converted to the Parquet format, whose columnar compression keeps storage costs low while maintaining data integrity, and the resulting files are then securely stored in an Amazon S3 bucket.
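To make the extraction step concrete, here is a minimal sketch of what it could look like using the AirNow data query endpoint. The endpoint parameters, bucket name, and key layout are assumptions for illustration rather than the project's exact values, and a real run needs a valid AirNow API key.

```python
import io
from datetime import date

import boto3
import pandas as pd
import requests

# Illustrative values -- the real endpoint parameters, bucket, and key layout may differ.
AIRNOW_DATA_URL = "https://www.airnowapi.org/aq/data/"   # assumed endpoint
API_KEY = "YOUR_AIRNOW_API_KEY"                          # placeholder
BUCKET = "airnow-hourly-raw"                             # hypothetical bucket name


def extract_day_to_s3(day: date) -> str:
    """Fetch a day's hourly observations and store them in S3 as Parquet."""
    params = {
        "startDate": f"{day:%Y-%m-%d}T00",
        "endDate": f"{day:%Y-%m-%d}T23",
        "parameters": "OZONE,PM25,PM10,NO2",
        "BBOX": "-171.791110,18.910360,-66.964660,71.357764",  # example bounding box
        "dataType": "B",                                        # assumed: AQI + concentrations
        "format": "application/json",
        "API_KEY": API_KEY,
    }
    response = requests.get(AIRNOW_DATA_URL, params=params, timeout=60)
    response.raise_for_status()

    # JSON -> DataFrame -> in-memory Parquet file.
    df = pd.DataFrame(response.json())
    buffer = io.BytesIO()
    df.to_parquet(buffer, index=False)

    # Date-partitioned key, so re-running the same day overwrites the same object.
    key = f"raw/{day:%Y/%m/%d}/observations.parquet"
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=buffer.getvalue())
    return key


if __name__ == "__main__":
    extract_day_to_s3(date(2024, 1, 1))
```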
Once stored in Amazon S3, the data is loaded into a data warehouse, using both the MotherDuck and Snowflake platforms, where it undergoes further transformation and modeling. I chose MotherDuck because it is a cloud data warehouse that is currently free to use, which lets me keep my data there for longer, and Snowflake because it offers a free 30-day trial. Inside the warehouse, the data is structured into a cohesive data model using dbt (data build tool), preparing it for seamless analysis and reporting.
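For the MotherDuck side, the load step can be sketched with the duckdb Python client roughly as follows. The database, schema, table, and bucket names are placeholders, and it assumes a MOTHERDUCK_TOKEN environment variable plus S3 credentials are already configured.

```python
import duckdb

# The "md:" prefix opens a MotherDuck database; this assumes a
# MOTHERDUCK_TOKEN environment variable is already set.
con = duckdb.connect("md:air_quality")

# httpfs lets DuckDB read Parquet straight from S3.
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute("SET s3_region = 'us-east-1';")  # assumed region
# S3 credentials (s3_access_key_id / s3_secret_access_key) are assumed
# to be configured in the same way.

# Recreate the raw landing table from the date-partitioned Parquet files.
con.execute("CREATE SCHEMA IF NOT EXISTS raw;")
con.execute("""
    CREATE OR REPLACE TABLE raw.airnow_observations AS
    SELECT * FROM read_parquet('s3://airnow-hourly-raw/raw/*/*/*/*.parquet');
""")
```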
Airflow serves as the backbone of the ETL pipeline, orchestrating and scheduling each step. Hosted within Docker containers, Airflow ensures reliable execution of daily data retrieval, storage, and loading processes.
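To give a sense of the orchestration, below is a stripped-down version of what the daily DAG could look like using the TaskFlow API. The DAG id, task names, and placeholder bodies are illustrative; the real DAG also contains the dbt task group described further below.

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=["air_quality"],
)
def airnow_pipeline():
    @task
    def extract_to_s3(ds=None):
        # Placeholder body: call the AirNow API for the logical date and write
        # Parquet to S3 (see the extraction sketch above); return the S3 key.
        ...

    @task
    def load_to_warehouse(s3_key):
        # Placeholder body: load the new Parquet file into MotherDuck / Snowflake.
        ...

    load_to_warehouse(extract_to_s3())


airnow_pipeline()
```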
Designed with scalability and automation in mind, the pipeline architecture can handle growing volumes of data over time. Because each run is idempotent, reprocessing a given day yields the same result, so incremental updates stay consistent. This gives the project a robust foundation for scalable, automated, and reliable data processing.
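Concretely, the idempotency mostly comes from keying everything to the logical date: a re-run overwrites the same S3 object (see the key layout in the extraction sketch) and replaces the same slice of the warehouse table. A hedged sketch of that delete-and-insert pattern, reusing the assumed names from the earlier snippets:

```python
import duckdb


def load_day(con: duckdb.DuckDBPyConnection, ds: str) -> None:
    """Reload a single day's observations; running it twice leaves the table unchanged."""
    year, month, day = ds.split("-")
    path = f"s3://airnow-hourly-raw/raw/{year}/{month}/{day}/*.parquet"

    # Delete the day's rows first, then insert the fresh extract, so a retry
    # or backfill never duplicates data. "utc_date" is an assumed column name.
    con.execute("DELETE FROM raw.airnow_observations WHERE utc_date = ?;", [ds])
    con.execute(f"INSERT INTO raw.airnow_observations SELECT * FROM read_parquet('{path}');")
```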
Thanks to the Cosmos library, I was able to orchestrate the dbt transformations from within Airflow as their own task group.
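The snippet below is a rough sketch of how that task group can be declared with Cosmos; the dbt project path, profile name, and profiles.yml location are assumptions, not the project's actual values.

```python
from datetime import datetime

from airflow import DAG
from cosmos import DbtTaskGroup, ProfileConfig, ProjectConfig

with DAG(
    dag_id="airnow_pipeline",
    schedule="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    # ... extract and load tasks as sketched earlier ...

    # Cosmos parses the dbt project and renders each model as its own Airflow task.
    dbt_transformations = DbtTaskGroup(
        group_id="dbt_transformations",
        project_config=ProjectConfig("/usr/local/airflow/dbt/air_quality"),  # assumed path
        profile_config=ProfileConfig(
            profile_name="air_quality",                                      # assumed profile
            target_name="prod",
            profiles_yml_filepath="/usr/local/airflow/dbt/profiles.yml",     # assumed path
        ),
    )
```

Because each dbt model becomes its own Airflow task, model-level retries, logs, and dependencies show up directly in the Airflow UI. Below is a picture of the Airflow DAG: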
The Fact_AQObs table serves as the central repository for air quality measurements of the key pollutants: ozone, PM10, PM2.5, and NO2. Each entry in this table represents an hourly measurement, making it suitable for granular temporal analysis.
- Date Dimension: Enables analysis of temporal trends by providing date-related attributes.
- Reporting Area Dimension: Facilitates the examination of geographical trends by associating air quality measurements with specific reporting areas.
- Monitoring Site Dimension: Offers insights into air quality variations at specific monitoring sites.
In addition to the core dimensions, a bridge table links monitoring sites to their respective reporting areas. This normalization step enhances data integrity and makes querying and analysis more efficient; the sketch below shows how the tables join together.
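To make the shape of the model concrete, a query along these lines rolls the fact table up by date and reporting area through the bridge table. The table and column names are guesses at the naming convention, not the actual DDL.

```python
import duckdb

con = duckdb.connect("md:air_quality")  # assumes MOTHERDUCK_TOKEN is set

# Average ozone AQI per reporting area and day.
# Table and column names below are illustrative, not the real schema.
daily_ozone = con.execute("""
    SELECT
        d.date_day,
        ra.reporting_area_name,
        AVG(f.aqi) AS avg_ozone_aqi
    FROM analytics.fact_aqobs                 AS f
    JOIN analytics.dim_date                   AS d  ON f.date_key = d.date_key
    JOIN analytics.dim_monitoring_site        AS ms ON f.site_key = ms.site_key
    JOIN analytics.bridge_site_reporting_area AS b  ON ms.site_key = b.site_key
    JOIN analytics.dim_reporting_area         AS ra ON b.reporting_area_key = ra.reporting_area_key
    WHERE f.parameter_name = 'OZONE'
    GROUP BY 1, 2
    ORDER BY 1, 2
""").df()
print(daily_ozone.head())
```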
Aside from the main "analytics" schema in my data warehouse, I have also created some views in a "reporting" schema that pre-join the models into a shape that is ready for visualization.
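As an illustration of what such a view might look like (whether it lives as a dbt model or a hand-written view), the sketch below exposes the latitude and longitude from the dimensions alongside the AQI values, so a BI tool only needs a single SELECT. Names are again assumptions, and the Snowflake version would be analogous.

```python
import duckdb

con = duckdb.connect("md:air_quality")  # assumes MOTHERDUCK_TOKEN is set

con.execute("CREATE SCHEMA IF NOT EXISTS reporting;")
con.execute("""
    CREATE OR REPLACE VIEW reporting.aqi_map AS
    SELECT
        d.date_day,
        ra.reporting_area_name,
        ra.latitude,
        ra.longitude,
        f.parameter_name,
        f.aqi
    FROM analytics.fact_aqobs                 AS f
    JOIN analytics.dim_date                   AS d  ON f.date_key = d.date_key
    JOIN analytics.dim_monitoring_site        AS ms ON f.site_key = ms.site_key
    JOIN analytics.bridge_site_reporting_area AS b  ON ms.site_key = b.site_key
    JOIN analytics.dim_reporting_area         AS ra ON b.reporting_area_key = ra.reporting_area_key;
""")
```

Below is a screenshot of a dashboard, hosted in Snowflake, that I created showing a world map of the AQI parameters by date.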