This repository contains the artefacts requested as part of the application process for a certain data role
The reports below are the primary artefacts
- Top 10 Locations By Day
- Top 10 Locations By Month
- Most Decline Due to Lockdowns
- Most Growth Last Year
- High Level Architecture
- Artefacts
- CICD - GitHub Actions
- IAC - CDK
- Tests & QA Issues Encountered
- Data Model
- Source Data
A user can interact with this product in three ways
- Viewing the notebook reports found here - Glue Notebooks
- Pushing a change to the repository, which triggers a GitHub Action that deploys with CDK, runs all the Glue jobs and runs the DBT tests
- Exploring the data through Athena's SQL-compliant interface
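The Athena route can also be scripted. The sketch below submits a query through an Athena client passed in by the caller (for example `boto3.client("athena")`). The database name and the `athena-results/` prefix are assumptions, as this README does not state them; the bucket and table names are the ones mentioned in this README.

```python
from typing import Any

# Assumption: the Glue database name is not given in this README.
DATABASE = "pedestrian_analysis"  # hypothetical
# The bucket name comes from this README; the "athena-results/" prefix is hypothetical.
OUTPUT_LOCATION = "s3://pedestrian-analysis-working-bucket/athena-results/"


def submit_query(athena: Any, sql: str) -> str:
    """Submit a SQL statement to Athena and return its query execution ID."""
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )
    return response["QueryExecutionId"]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   query_id = submit_query(
#       boto3.client("athena"),
#       "SELECT * FROM report_top_10_locations_by_day LIMIT 10",
#   )
```

Passing the client in as an argument keeps the function easy to exercise with a stand-in object when no AWS credentials are available.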
The Jupyter Notebooks are generated using Glue Notebook servers
You can find the Jupyter Notebooks below
- Top 10 Locations By Day
- Top 10 Locations By Month
- Most Decline Due to Lockdowns
- Most Growth Last Year
The Glue scripts are located in Glue Job Scripts and are loaded into Glue with CDK. They use Spark and Python to read data from either the Glue catalog/S3 or the City of Melbourne API
- This data gets loaded into another glue table which can be queried using Athena
- The GitHub Action associated with this repo runs DBT tests using Athena
All Glue jobs are run as part of the GitHub Action in order to allow for testing
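Running every Glue job from a CI step could look roughly like the sketch below. It is not the code used by this repository's workflow: the function names are hypothetical, and the Glue client (for example `boto3.client("glue")`) is passed in by the caller. Jobs are run one after another, and the first job that does not succeed stops the run.

```python
import time
from typing import Any, Dict, Iterable

# States in which a Glue job run is still in progress; anything else is treated as final.
IN_PROGRESS_STATES = {"STARTING", "RUNNING", "STOPPING", "WAITING"}


def run_glue_job(glue: Any, job_name: str, poll_seconds: float = 30.0) -> str:
    """Start one Glue job, poll until it leaves the in-progress states, return the final state."""
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state not in IN_PROGRESS_STATES:
            return state
        time.sleep(poll_seconds)


def run_all(glue: Any, job_names: Iterable[str], poll_seconds: float = 30.0) -> Dict[str, str]:
    """Run the jobs sequentially and raise on the first one that does not succeed."""
    results: Dict[str, str] = {}
    for name in job_names:
        state = run_glue_job(glue, name, poll_seconds)
        results[name] = state
        if state != "SUCCEEDED":
            raise RuntimeError(f"Glue job {name!r} finished in state {state}")
    return results


# Usage (requires boto3 and AWS credentials; the job names here are placeholders):
#   import boto3
#   run_all(boto3.client("glue"), ["job_a", "job_b"])
```

Raising on the first failure means the CI step fails before the DBT tests run against stale or partial tables.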
Assets are stored in S3 in the bucket "pedestrian-analysis-working-bucket"
Assets are managed in the Glue/Hive Metadata Catalog
DBT is used for easily organising, reusing and running tests
- The DBT tests will automatically be run when changes are pushed to this repository
- Reusable/Generic Tests are located in pa_dbt/models/schema.yml
- Specific Tests are located in DBT Tests
- You can see some of the tests being run here as an example - Example
A GitHub Action is triggered after every push to the main branch of this repository
- The workflow is defined in /.github/workflows/deploy-cdk.yml
- The workflow installs all of the packages/dependencies, deploys with CDK, runs all of the Glue jobs and then runs all of the DBT tests
CDK is used to deploy assets to AWS and allows us to do so programmatically with Python. The specification can be found in pedestrian_analysis/pedestrian_analysis_stack.py
You can view some of the tests that have been run previously on the data using DBT here
https://github.com/hugh-nguyen/pedestrian-analysis/actions/runs/4844324305/jobs/8632539622
Issues found are below
- The first case shows that the location_type column of sensor_reference_data has two outlier values, "Indoor Blix" and "Outdoor Blix"
- Location name is often null in report_top_10_locations_by_day because many sensor_ids are missing from the reference data
- Direction_1, Direction_2 and installation_date have unexpected null values
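The three kinds of issue above correspond to an accepted-values check, a missing-reference check and a not-null check. The sketch below shows the same logic in plain Python on made-up sample rows. It is not the DBT test code from this repository: the accepted set `{"Indoor", "Outdoor"}` is an assumption (only the two "Blix" outliers come from this README), and the helper names, the lower-case column names and the sample values are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Set

Row = Dict[str, Any]

# Assumption: the expected location types. Only the "Blix" outliers are documented above.
ACCEPTED_LOCATION_TYPES = {"Indoor", "Outdoor"}


def unexpected_values(rows: Iterable[Row], column: str, accepted: Set[Any]) -> List[Any]:
    """Return the sorted distinct non-null values of `column` that are not in `accepted`."""
    found = {row.get(column) for row in rows}
    return sorted(v for v in found if v is not None and v not in accepted)


def null_count(rows: Iterable[Row], column: str) -> int:
    """Count rows where `column` is missing or None."""
    return sum(1 for row in rows if row.get(column) is None)


def missing_reference_ids(
    count_rows: Iterable[Row], reference_rows: Iterable[Row], key: str = "sensor_id"
) -> List[Any]:
    """Return sorted ids that appear in the counts but not in the reference data."""
    known = {row[key] for row in reference_rows}
    return sorted({row[key] for row in count_rows} - known)


# Made-up sample rows, for illustration only.
reference = [
    {"sensor_id": 1, "location_type": "Outdoor", "direction_1": "North"},
    {"sensor_id": 2, "location_type": "Outdoor Blix", "direction_1": None},
    {"sensor_id": 3, "location_type": "Indoor Blix", "direction_1": "East"},
]
counts = [{"sensor_id": 1}, {"sensor_id": 2}, {"sensor_id": 99}]

print(unexpected_values(reference, "location_type", ACCEPTED_LOCATION_TYPES))
print(null_count(reference, "direction_1"))
print(missing_reference_ids(counts, reference))
```

A sensor_id reported by `missing_reference_ids` is one that would produce a null location name once the counts are joined to the reference data.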
There are six data assets
https://data.melbourne.vic.gov.au/explore/dataset/pedestrian-counting-system-sensor-locations/information/
https://melbournetestbed.opendatasoft.com/explore/dataset/pedestrian-counting-system-monthly-counts-per-hour/information/