I recently finished my MSc in Data Science at Heriot-Watt University. As an international student, I was keen to connect with the vibrant city of Edinburgh on a deeper level, so I undertook a data engineering project that explores the wealth of information on the r/Edinburgh subreddit using the Reddit API. The project is both an opportunity to apply my data engineering skills and a window onto the pulse of the local community and what is happening around me.

This project provides a data pipeline that extracts, transforms, and loads (ETL) data from the Reddit API (r/Edinburgh subreddit) into an Amazon Redshift data warehouse. The pipeline uses Apache Airflow, Celery, PostgreSQL, Amazon S3, AWS Glue, and Amazon Athena, and the results are visualized in a Tableau dashboard.
- Overview
- Architecture
- Technologies
- API Data Description
- Project Structure
- Prerequisites
- Usage
- Dashboard
- Improvements
- Reference
- License
The pipeline is designed to:
- Extract data from the r/Edinburgh subreddit using the Reddit API.
- Store the raw data in an S3 bucket from Airflow.
- Transform the data using AWS Glue and Amazon Athena.
- Load the transformed data into Amazon Redshift for analytics and querying.
- Visualize the data in Tableau.
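To make the orchestration concrete, here is a minimal sketch of how such a flow could look as an Airflow DAG. It is illustrative only: the DAG id, task names, and callables (`extract_reddit`, `upload_to_s3`, `trigger_glue_and_load`) are hypothetical placeholders, not the DAG defined in this repository, and it assumes Airflow 2.x.

```python
# Illustrative sketch of the ETL flow as an Airflow 2.x DAG.
# Task names and callables are placeholders, not the repo's actual code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_reddit(**context):
    """Pull recent r/Edinburgh posts via the Reddit API (e.g. with PRAW)."""
    ...


def upload_to_s3(**context):
    """Write the raw extract to the S3 data lake."""
    ...


def trigger_glue_and_load(**context):
    """Kick off the Glue/Athena transformation and the Redshift load."""
    ...


with DAG(
    dag_id="reddit_edinburgh_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_reddit", python_callable=extract_reddit)
    to_s3 = PythonOperator(task_id="upload_to_s3", python_callable=upload_to_s3)
    transform_load = PythonOperator(
        task_id="transform_and_load", python_callable=trigger_glue_and_load
    )

    extract >> to_s3 >> transform_load
```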
- Cloud: AWS
- Infrastructure: Terraform
- Orchestration: Airflow
- Data lake: Amazon S3
- Data transformation: Amazon Athena
- Data warehouse: Amazon Redshift
- Data visualization: Tableau
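As a small illustration of how raw extracts land in the S3 data lake before Glue and Athena pick them up, the snippet below shows one possible upload step with boto3. The bucket name and key layout are assumptions for the sketch, not the project's actual configuration.

```python
# Illustrative only: write a raw CSV extract into a "raw" prefix in S3.
# Bucket name and key layout are placeholders, not the project's settings.
import boto3


def upload_raw_extract(local_path: str, run_date: str) -> None:
    s3 = boto3.client("s3")
    bucket = "edinburgh-reddit-data-lake"     # placeholder bucket name
    key = f"raw/reddit_posts_{run_date}.csv"  # date-stamped object key
    s3.upload_file(local_path, bucket, key)


# Example: upload_raw_extract("/tmp/reddit_posts_2024-01-01.csv", "2024-01-01")
```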
| Column | Description |
|---|---|
| id | Unique identifier for the Reddit post |
| title | Title of the Reddit post |
| score | Score associated with the post |
| num_comments | Number of comments on the post |
| author | Username of the post author |
| created_utc | UTC timestamp when the post was created |
| url | URL associated with the post |
| over_18 | Indicates whether the post contains mature content (18+) |
| edited | Timestamp indicating when the post was last edited |
| spoiler | Indicates if the post contains spoiler content |
| stickied | Indicates if the post is stickied at the top of the subreddit |
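To show how these columns map onto what the Reddit API returns, here is a hedged sketch using PRAW. The credentials, user agent, and post limit are placeholders, and the repository's extraction code may differ in the details; all field names, however, are standard PRAW submission attributes.

```python
# Sketch: pull recent posts from r/Edinburgh and keep only the fields
# described in the table above. Credentials below are placeholders.
import praw
import pandas as pd

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="edinburgh-etl-demo",
)

FIELDS = [
    "id", "title", "score", "num_comments", "author",
    "created_utc", "url", "over_18", "edited", "spoiler", "stickied",
]

records = []
for post in reddit.subreddit("Edinburgh").new(limit=100):
    row = {field: getattr(post, field) for field in FIELDS}
    row["author"] = str(row["author"])  # Redditor object -> username string
    records.append(row)

df = pd.DataFrame(records, columns=FIELDS)
print(df.head())
```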
- Access to an AWS account with appropriate permissions for S3, Glue, Athena, and Redshift.
- Reddit API credentials.
- Docker installed.
- Python 3.9 or higher.
- Tableau installed.
- Basic knowledge of Python programming.
- Familiarity with data engineering concepts.
- Basic knowledge of the command line.
- Clone the repository.

  ```bash
  git clone https://github.com/zabull1/EdinburghReddit_e2e.git
  ```

- Create a virtual environment.

  ```bash
  python3 -m venv venv
  ```

- Activate the virtual environment.

  ```bash
  source venv/bin/activate
  ```

- Install the dependencies.

  ```bash
  pip install -r requirements.txt
  ```

- Rename the example configuration file and add your credentials to it (a hedged way to sanity-check the credentials is sketched after these steps).

  ```bash
  mv config/config.conf.example config/config.conf
  ```

- Start the containers.

  ```bash
  docker-compose up airflow-init
  docker-compose up -d
  ```

- Launch the Airflow web UI.

  ```bash
  open http://localhost:8080
  ```
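If you want to confirm that the configuration file is readable and the Reddit credentials work before bringing up the containers, something like the snippet below does the job. Note that the section and key names (`[reddit]`, `client_id`, `client_secret`) are illustrative guesses; match them to whatever `config/config.conf.example` actually contains.

```python
# Hedged sketch: check that config/config.conf parses and that the Reddit
# credentials are accepted, before starting the Airflow containers.
# Section and key names are guesses; align them with the example file.
import configparser

import praw

parser = configparser.ConfigParser()
parser.read("config/config.conf")

reddit = praw.Reddit(
    client_id=parser["reddit"]["client_id"],
    client_secret=parser["reddit"]["client_secret"],
    user_agent="edinburgh-etl-config-check",
)

# A read-only call: fetch one post to confirm the credentials are accepted.
print(next(reddit.subreddit("Edinburgh").hot(limit=1)).title)
```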
- Unit testing
- Infrastructure as code
- Tableau dashboard
This project is licensed under the MIT License - see the LICENSE file for details.