
Edinburgh Reddit E2E Data Pipeline Project

I recently finished my MSc in Data Science at Heriot-Watt University. As an international student, I was keen to connect with the vibrant city of Edinburgh on a deeper level. To that end, I undertook a data engineering project that explores the wealth of information on the r/Edinburgh subreddit using the Reddit API. The project not only lets me apply my data engineering skills but also offers a unique window on the pulse of the local community and what is currently happening around me.

This project provides a data pipeline that extracts, transforms, and loads (ETL) data from the Reddit API (r/Edinburgh subreddit) into an Amazon Redshift data warehouse. The pipeline uses Apache Airflow, Celery, PostgreSQL, Amazon S3, AWS Glue, and Amazon Athena, and the data is finally visualized in a Tableau dashboard.
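
The project's DAG code is not reproduced in this README. As a rough sketch of how the orchestration could be wired in Airflow (the DAG id, schedule, task ids, and placeholder callables below are assumptions, not the repository's actual code):

     # Sketch only: DAG id, schedule, and task callables are assumptions.
     from datetime import datetime

     from airflow import DAG
     from airflow.operators.python import PythonOperator


     def extract_reddit_posts():
         """Placeholder: pull posts from the Reddit API (see the sketch under Overview)."""


     def upload_raw_to_s3():
         """Placeholder: write the raw JSON to the S3 data lake."""


     with DAG(
         dag_id="edinburgh_reddit_etl",
         start_date=datetime(2024, 1, 1),
         schedule_interval="@daily",
         catchup=False,
     ) as dag:
         extract = PythonOperator(
             task_id="extract_reddit_posts",
             python_callable=extract_reddit_posts,
         )
         upload = PythonOperator(
             task_id="upload_raw_to_s3",
             python_callable=upload_raw_to_s3,
         )

         extract >> upload  # extract first, then land the raw data in S3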

Table of Contents

  • Overview
  • Architecture
  • Technologies
  • API Data Description
  • Prerequisites
  • Usage
  • Dashboard
  • Improvements
  • License

Overview

The pipeline is designed to:

  1. Extract data from the r/Edinburgh subreddit using the Reddit API.
  2. Store the raw data in an S3 bucket from Airflow (a sketch of these first two steps follows this list).
  3. Transform the data using AWS Glue and Amazon Athena.
  4. Load the transformed data into Amazon Redshift for analytics and querying.
  5. Visualize the data in Tableau.
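
As a minimal sketch of steps 1 and 2 (not the repository's actual extraction code), posts could be pulled with PRAW and the raw JSON written to S3 with boto3; the credentials, bucket name, and key layout below are placeholders:

     # Sketch only: credentials, bucket name, and key layout are placeholders.
     import json
     from datetime import datetime, timezone

     import boto3
     import praw

     reddit = praw.Reddit(
         client_id="YOUR_CLIENT_ID",
         client_secret="YOUR_CLIENT_SECRET",
         user_agent="EdinburghReddit_e2e",
     )

     # Collect the fields described under "API Data Description" for the latest posts.
     posts = []
     for submission in reddit.subreddit("Edinburgh").new(limit=100):
         posts.append({
             "id": submission.id,
             "title": submission.title,
             "score": submission.score,
             "num_comments": submission.num_comments,
             "author": str(submission.author),
             "created_utc": submission.created_utc,
             "url": submission.url,
             "over_18": submission.over_18,
             "edited": submission.edited,
             "spoiler": submission.spoiler,
             "stickied": submission.stickied,
         })

     # Land the raw data in the S3 data lake, keyed by extraction date.
     s3 = boto3.client("s3")
     key = f"raw/reddit_edinburgh_{datetime.now(timezone.utc):%Y%m%d}.json"
     s3.put_object(Bucket="edinburgh-reddit-raw", Key=key, Body=json.dumps(posts))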

Architecture

Technologies

  • Cloud: AWS
  • Infrastructure: Terraform
  • Orchestration: Airflow
  • Data lake: Amazon S3
  • Data transformation: AWS Glue, Amazon Athena
  • Data warehouse: Amazon Redshift
  • Data visualization: Tableau

API Data Description

The following columns are collected for each post:

  • id: Unique identifier for the Reddit post
  • title: Title of the Reddit post
  • score: Score associated with the post
  • num_comments: Number of comments on the post
  • author: Username of the post author
  • created_utc: UTC timestamp when the post was created
  • url: URL associated with the post
  • over_18: Indicates whether the post contains mature content (18+)
  • edited: Timestamp indicating when the post was last edited
  • spoiler: Indicates if the post contains spoiler content
  • stickied: Indicates if the post is stickied at the top of the subreddit
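
For context, here is a minimal sketch of how the transformation step could be run against these columns with Athena via boto3. The database, table, and bucket names, and the query itself, are assumptions rather than the project's actual Glue/Athena job:

     # Sketch only: one way the Athena transformation (step 3) could be expressed.
     # The database, table, and bucket names are placeholders, not the project's.
     import boto3

     athena = boto3.client("athena", region_name="eu-west-2")

     # CTAS query: clean the raw posts and write them back to S3 as Parquet.
     query = """
     CREATE TABLE reddit_db.edinburgh_posts_clean
     WITH (format = 'PARQUET',
           external_location = 's3://edinburgh-reddit-transformed/posts/') AS
     SELECT
         id,
         title,
         score,
         num_comments,
         author,
         from_unixtime(created_utc) AS created_at,  -- epoch seconds to timestamp
         url,
         over_18,
         spoiler,
         stickied
     FROM reddit_db.edinburgh_posts_raw
     """

     athena.start_query_execution(
         QueryString=query,
         QueryExecutionContext={"Database": "reddit_db"},
         ResultConfiguration={"OutputLocation": "s3://edinburgh-reddit-athena-results/"},
     )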

Prerequisites

  • Access to AWS Account with appropriate permissions for S3, Glue, Athena, and Redshift.
  • Reddit API credentials.
  • Docker installation
  • Python 3.9 or higher
  • Tableau installation
  • Basic knowledge of Python programming
  • Familiarity with data engineering concepts
  • Basic knowledge of the command line

Usage

  1. Clone the repository.

     git clone https://github.com/zabull1/EdinburghReddit_e2e.git
  2. Create a virtual environment.

     python3 -m venv venv
  3. Activate the virtual environment.

     source venv/bin/activate
  4. Install the dependencies.

     pip install -r requirements.txt
  5. Rename the example configuration file and add your Reddit and AWS credentials to it (an example layout is sketched after these steps).

     mv config/config.conf.example config/config.conf
  6. Start the containers.

     docker-compose up airflow-init
     docker-compose up -d 
  7. Launch the Airflow web UI.

     open http://localhost:8080
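
The exact layout of config/config.conf is not reproduced here. As a sketch of how the credentials from step 5 might be organised and read in the pipeline code, assuming hypothetical section and key names:

     # Sketch only: the section and key names below are assumptions about config.conf.
     import configparser

     config = configparser.ConfigParser()
     config.read("config/config.conf")

     reddit_client_id = config["reddit"]["client_id"]
     reddit_client_secret = config["reddit"]["client_secret"]
     aws_access_key_id = config["aws"]["access_key_id"]
     aws_secret_access_key = config["aws"]["secret_access_key"]
     s3_bucket = config["aws"]["s3_bucket_name"]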
    

Dashboard

Improvements

    1. Unit testing
    2. Infrastructure as code
    3. Tableau dashboard

License

This project is licensed under the MIT License - see the LICENSE file for details.
