Project: Pedestrian Analysis

Description

This repository contains the artefacts requested as part of the application process for a certain data role

The reports below are the primary artefacts

High Level Architecture

A user can interact with this product in three ways

Viewing the notebook reports found here - Glue Notebooks
Pushing a change to the repository, which triggers a GitHub Action that deploys the CDK stack, runs all the Glue jobs and runs the DBT tests
- Example test run
Exploring the data through Athena's SQL-compliant interface

Glue Jupyter Notebooks

The Jupyter Notebooks are generated using Glue Notebook servers

https://docs.aws.amazon.com/glue/latest/dg/console-notebooks.html

You can find the Jupyter Notebooks below

Glue ETL Scripts

The Glue scripts are located in Glue Job Scripts and are loaded into Glue with CDK. The scripts use Spark and Python to read data from either the Glue catalog/S3 or the City of Melbourne API

This data gets loaded into another glue table which can be queried using Athena
The GitHub Action associated with this repo runs DBT tests using Athena

All Glue jobs are run as part of the GitHub Action in order to allow for testing
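
As a rough illustration of how a script might run every Glue job and wait for each to finish, here is a minimal sketch. It is not the contents of run-glue.py. It assumes a boto3 Glue client is passed in (for example `boto3.client("glue")`), and the job names in the usage note are hypothetical.

```python
import time

# States in which a Glue job run has stopped progressing (Glue JobRun API).
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR", "EXPIRED"}


def run_job_and_wait(glue, job_name, poll_seconds=30, sleep=time.sleep):
    """Start one Glue job and block until it reaches a terminal state."""
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    while True:
        run = glue.get_job_run(JobName=job_name, RunId=run_id)["JobRun"]
        state = run["JobRunState"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)


def run_all_jobs(glue, job_names, **kwargs):
    """Run jobs one after another and stop at the first non-success."""
    results = {}
    for name in job_names:
        state = run_job_and_wait(glue, name, **kwargs)
        results[name] = state
        if state != "SUCCEEDED":
            raise RuntimeError(f"Glue job {name!r} finished in state {state}")
    return results
```

With AWS credentials available, a call might look like `run_all_jobs(boto3.client("glue"), ["load_sensor_counts"])`, where `load_sensor_counts` is a placeholder job name. Raising on the first failure lets the CI run fail before the DBT tests query stale tables.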

Parquet Tables in S3

Assets are stored in S3 in the bucket "pedestrian-analysis-working-bucket"

Data is loaded into here using the Glue Jobs or Notebooks

Glue Catalog Tables

Assets are managed in the Glue/Hive Metadata Catalog

This catalog makes access to this data via Glue, Athena and other platforms significantly easier
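
To illustrate querying a catalog table through Athena from Python, here is a minimal sketch. It assumes a boto3 Athena client is passed in (for example `boto3.client("athena")`). The `athena-results/` output prefix in the usage note is an assumption, not something defined by this repository.

```python
import time


def run_athena_query(athena, sql, database, output_location,
                     poll_seconds=1.0, sleep=time.sleep):
    """Run a query in Athena and return result rows as lists of strings."""
    query_id = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until the query reaches a final state.
    while True:
        execution = athena.get_query_execution(QueryExecutionId=query_id)
        status = execution["QueryExecution"]["Status"]
        if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        sleep(poll_seconds)

    if status["State"] != "SUCCEEDED":
        reason = status.get("StateChangeReason", "")
        raise RuntimeError(f"Athena query ended in state {status['State']}: {reason}")

    # Only the first page of results is fetched here; for SELECT queries
    # the first row holds the column headers.
    result = athena.get_query_results(QueryExecutionId=query_id)
    rows = result["ResultSet"]["Rows"]
    return [[cell.get("VarCharValue") for cell in row["Data"]] for row in rows]
```

A call might look like `run_athena_query(boto3.client("athena"), "SELECT location_type, COUNT(*) FROM sensor_reference_data GROUP BY location_type", "raw", "s3://pedestrian-analysis-working-bucket/athena-results/")`.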

Testing in DBT Athena

DBT is used for easily organising, reusing and running tests

The DBT tests will automatically be run when changes are pushed to this repository
Reusable/Generic Tests are located in pa_dbt/models/schema.yml
Specific Tests are located in DBT Tests
You can see some of the tests being run here as an example - Example

CICD

A GitHub Action is triggered after every push to the main branch of this repository

The workflow is defined in /.github/workflows/deploy-cdk.yml
The workflow installs all of the packages/dependencies, deploys the CDK stack, runs all of the Glue jobs and runs all the DBT tests
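
The order of the workflow steps matters, because each step depends on the one before it. Here is a minimal sketch of that fail-fast sequence in Python. The exact commands and flags are assumptions for illustration, based on file names in the repository root; the real steps live in the workflow file.

```python
import subprocess

# Assumed commands, in the order the workflow is described as running them.
PIPELINE_STEPS = [
    ["pip", "install", "-r", "requirements.txt"],     # install dependencies
    ["cdk", "deploy", "--require-approval", "never"],  # deploy the CDK stack
    ["python", "run-glue.py"],                         # run the Glue jobs
    ["dbt", "test", "--project-dir", "pa_dbt"],        # run the DBT tests
]


def run_pipeline(steps=PIPELINE_STEPS, run=subprocess.run):
    """Run each step in order; check=True aborts on the first failure."""
    completed = []
    for cmd in steps:
        run(cmd, check=True)
        completed.append(cmd[0])
    return completed
```

Because `subprocess.run(..., check=True)` raises `CalledProcessError` on a non-zero exit code, a failed deploy stops the pipeline before any Glue job or test runs.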

IAC - CDK

CDK is used to deploy assets to AWS and allows us to do so programmatically with Python. The specification can be found in pedestrian_analysis/pedestrian_analysis_stack.py

The CDK stack deploys IAM roles, Glue databases, some S3 files and Glue jobs

Tests & QA Issues Encountered

You can view some of the tests that have been run previously on the data using DBT here

https://github.com/hugh-nguyen/pedestrian-analysis/actions/runs/4844324305/jobs/8632539622

Issues found are below

The first case shows us the location_type of sensor_reference_data has two outlier values "Indoor Blix" and "Outdoor Blix"
Location name is often null in report_top_10_locations_by_day, and this is because a lot of sensor_ids are missing from the reference data
Direction_1, Direction_2 and installation_date have unexpected null values
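
The issues above are the kind that DBT's built-in `not_null` and `accepted_values` generic tests surface. Here is a minimal pure-Python sketch of the same two checks. The sample rows and the accepted set `{"Indoor", "Outdoor"}` are made up for illustration and are not taken from the real data or from schema.yml.

```python
def not_null_failures(rows, column):
    """Rows where `column` is missing or None (what a not_null test flags)."""
    return [row for row in rows if row.get(column) is None]


def accepted_values_failures(rows, column, accepted):
    """Distinct non-null values of `column` that fall outside `accepted`."""
    return sorted({
        row[column] for row in rows
        if row.get(column) is not None and row[column] not in accepted
    })


# Illustrative rows only; the real table is raw.sensor_reference_data.
sample_rows = [
    {"sensor_id": 1, "location_type": "Outdoor", "installation_date": "2020-01-01"},
    {"sensor_id": 2, "location_type": "Indoor Blix", "installation_date": None},
    {"sensor_id": 3, "location_type": "Outdoor Blix", "installation_date": "2021-06-15"},
]

outliers = accepted_values_failures(sample_rows, "location_type", {"Indoor", "Outdoor"})
missing_dates = not_null_failures(sample_rows, "installation_date")
```

As in DBT, null values pass the accepted-values check and are only caught by the separate not-null check, so the two tests are usually declared together on a column.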

Data Model

There are six data assets

raw.sensor_counts
raw.sensor_reference_data
report.location_declines_due_to_lockdown

Source Data

https://data.melbourne.vic.gov.au/explore/dataset/pedestrian-counting-system-sensor-locations/information/
https://melbournetestbed.opendatasoft.com/explore/dataset/pedestrian-counting-system-monthly-counts-per-hour/information/
