Project: Pedestrian Analysis

Description

This repository contains the artefacts requested as part of the application process for a certain data role

The reports below are the primary artefacts

High Level Architecture

A user can interact with this product in three ways

  • Viewing the notebook reports found here - Glue Notebooks
  • Pushing a change to the repository, which triggers a GitHub Action that deploys the CDK stack, runs all the Glue jobs and runs the DBT tests
  • Exploring the data through Athena's SQL interface

Glue Jupyter Notebooks

The Jupyter Notebooks are generated using Glue Notebook servers

You can find the Jupyter Notebooks below

Glue ETL Scripts

The Glue scripts are located in Glue Job Scripts and are loaded into Glue with CDK. They use Spark and Python to read data from either the Glue catalog/S3 or the City of Melbourne API

  • This data gets loaded into another Glue table, which can be queried using Athena
  • The GitHub Action associated with this repo runs DBT tests using Athena

All Glue jobs are run as part of the GitHub Action in order to allow for testing
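
As a rough illustration of how a workflow step might run the Glue jobs and wait for them, here is a minimal sketch. It is not taken from this repository's workflow. It assumes a boto3 Glue client is passed in (`start_job_run` and `get_job_run` are real boto3 Glue calls), and the function names `run_glue_job` and `run_all_jobs` are hypothetical.

```python
import time

# Job run states treated as terminal in this sketch.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_glue_job(glue_client, job_name, poll_seconds=30, sleep=time.sleep):
    """Start one Glue job and block until it reaches a terminal state."""
    run_id = glue_client.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue_client.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)


def run_all_jobs(glue_client, job_names, **kwargs):
    """Run jobs one after another and fail fast on the first non-success."""
    for name in job_names:
        state = run_glue_job(glue_client, name, **kwargs)
        if state != "SUCCEEDED":
            raise RuntimeError(f"Glue job {name!r} finished with state {state}")
```

With real credentials this could be called as `run_all_jobs(boto3.client("glue"), job_names)`, where `job_names` is whatever list of Glue job names the CDK stack created. Raising on the first failure makes the CI step exit non-zero, so later steps such as the DBT tests do not run against stale tables.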

Parquet Tables in S3

Assets are stored in S3 in the bucket "pedestrian-analysis-working-bucket"

  • Data is loaded here by the Glue Jobs or Notebooks

Glue Catalog Tables

Assets are managed in the Glue/Hive Metadata Catalog

  • This catalog makes access to this data via Glue, Athena and other platforms significantly easier
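
To illustrate what catalog-backed access through Athena can look like from Python, here is a minimal sketch. It assumes a boto3 Athena client is passed in (`start_query_execution`, `get_query_execution` and `get_query_results` are real boto3 Athena calls). The function name `run_athena_query` is hypothetical, and the caller supplies an S3 output location for query results.

```python
import time


def run_athena_query(athena_client, sql, database, output_location,
                     poll_seconds=2, sleep=time.sleep):
    """Run a SQL statement in Athena and return the first page of rows."""
    query_id = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until Athena reports that the query has finished.
    while True:
        status = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        sleep(poll_seconds)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    # Only the first page of results is read here; larger result sets
    # would need pagination. NULL cells come back as None.
    result = athena_client.get_query_results(QueryExecutionId=query_id)
    rows = result["ResultSet"]["Rows"]
    return [[col.get("VarCharValue") for col in row["Data"]] for row in rows]
```

For a `SELECT` statement the first returned row holds the column names. A call might look like `run_athena_query(boto3.client("athena"), "SELECT * FROM raw.sensor_counts LIMIT 10", "raw", output_location)`, assuming `raw` is the Glue database name implied by the table names in the Data Model section.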

Testing in DBT Athena

DBT is used to organise, reuse and run tests

  • The DBT tests will automatically be run when changes are pushed to this repository
  • Reusable/Generic Tests are located in pa_dbt/models/schema.yml
  • Specific Tests are located in DBT Tests
  • You can see some of the tests being run here as an example - Example

CICD

A GitHub Action is triggered after every push to the main branch of this repository

  • The workflow is defined in /.github/workflows/deploy-cdk.yml
  • The workflow installs all of the packages/dependencies, deploys the CDK stack, runs all of the Glue jobs and runs all of the DBT tests

IAC - CDK

CDK is used to deploy assets to AWS programmatically with Python. The specification can be found in pedestrian_analysis/pedestrian_analysis_stack.py

  • The CDK stack deploys IAM roles, Glue databases, some S3 files and Glue jobs

Tests & QA Issues Encountered

You can view some of the tests that have been run previously on the data using DBT here

https://github.com/hugh-nguyen/pedestrian-analysis/actions/runs/4844324305/jobs/8632539622

Issues found are below

  • The first case shows that location_type in sensor_reference_data has two outlier values, "Indoor Blix" and "Outdoor Blix"
  • Location name is often null in report_top_10_locations_by_day because many sensor_ids are missing from the reference data
  • Direction_1, Direction_2 and installation_date have unexpected null values
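
The three kinds of issue above correspond to three common data checks: accepted values, not-null, and referential integrity. The sketch below restates them in plain Python for illustration only. It is not how DBT implements its tests, the helper names are hypothetical, the sample rows are made up, and the accepted set `{"Indoor", "Outdoor"}` is an assumption about the expected values.

```python
def not_null_failures(rows, column):
    """Return rows where `column` is missing or None."""
    return [r for r in rows if r.get(column) is None]


def accepted_values_failures(rows, column, accepted):
    """Return rows whose non-null `column` value is outside `accepted`."""
    return [r for r in rows
            if r.get(column) is not None and r[column] not in accepted]


def orphan_keys(child_rows, parent_rows, key):
    """Return sorted key values in child_rows with no match in parent_rows."""
    known = {r[key] for r in parent_rows}
    return sorted({r[key] for r in child_rows if r[key] not in known})


# Illustrative rows only; these are not taken from the real dataset.
reference = [
    {"sensor_id": 1, "location_type": "Outdoor", "installation_date": "2020-01-01"},
    {"sensor_id": 2, "location_type": "Outdoor Blix", "installation_date": None},
]
counts = [{"sensor_id": 1}, {"sensor_id": 2}, {"sensor_id": 99}]

print(accepted_values_failures(reference, "location_type", {"Indoor", "Outdoor"}))
print(not_null_failures(reference, "installation_date"))
print(orphan_keys(counts, reference, "sensor_id"))
```

The last check mirrors the null location names: a count row whose sensor_id has no match in the reference data has nothing to join to, so any name column taken from the reference table comes back null.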

Data Model

There are six data assets

  • raw.sensor_counts
  • raw.sensor_reference_data
  • report.location_declines_due_to_lockdown

Source Data

  • https://data.melbourne.vic.gov.au/explore/dataset/pedestrian-counting-system-sensor-locations/information/
  • https://melbournetestbed.opendatasoft.com/explore/dataset/pedestrian-counting-system-monthly-counts-per-hour/information/
