
Pinterest Data Pipeline

Description

AWS-hosted end-to-end ETL data pipeline inspired by Pinterest's experiment processing pipeline. Pinterest crunches billions of data points every day to decide how to provide more value to their users.

The pipeline is developed using a Lambda architecture. The batch data is ingested using AWS API Gateway and AWS MSK and then stored in an AWS S3 bucket. The batch data is then read from the S3 bucket into Databricks where it is processed using Apache Spark.

The streaming data is read near real-time from AWS Kinesis using Spark Structured Streaming in Databricks and stored in Databricks Delta Tables for long term storage.

Figure: End-to-end ETL data pipeline AWS cloud architecture, integrated with Databricks and showing IAM and permission access points.

Aim

Build an end-to-end data ETL pipeline using AWS-hosted cloud technologies, integrated with Databricks for processing and long-term storage.

The pipeline facilitates social media analytics on both streaming data in real time and data at rest, supporting sentiment analysis, trending categories and user engagement.

Achievement Outcomes πŸ“–

Amazon EC2 & Apache Kafka

  • Configuring AWS EC2 instances
  • Installation and configuration of Kafka on EC2 client machine
  • Creating topics with Kafka on EC2 instance
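
In the project the topics are created with the Kafka CLI on the EC2 client machine. For illustration only, below is a minimal kafka-python sketch of the same step; the bootstrap server address and topic names are placeholders, and an MSK cluster using IAM authentication would need additional SASL configuration not shown here.

    # Hypothetical sketch - the project itself uses the Kafka CLI on the EC2 client.
    from kafka.admin import KafkaAdminClient, NewTopic

    # Placeholder bootstrap server address.
    admin = KafkaAdminClient(bootstrap_servers="<bootstrap-server>:9098")

    # One topic per Pinterest table (topic names are placeholders).
    topics = [
        NewTopic(name="<user_id>.pin", num_partitions=1, replication_factor=2),
        NewTopic(name="<user_id>.geo", num_partitions=1, replication_factor=2),
        NewTopic(name="<user_id>.user", num_partitions=1, replication_factor=2),
    ]
    admin.create_topics(new_topics=topics)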

Amazon S3 & AWS Identity and Access Management (IAM)

  • Configuring S3 bucket and MSK Connect
  • Using the Confluent Amazon S3 Sink Connector to connect Kafka topics to the S3 bucket
  • Creation of customised plugins and configuring MSK connector with IAM roles
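
The connector configuration itself lives in MSK Connect; as a hedged sketch, the kind of Confluent S3 Sink Connector properties involved might look like the Python dict below (bucket name, topic pattern and sizes are placeholders, not the project's exact values).

    # Hypothetical Confluent S3 Sink Connector properties supplied to MSK Connect.
    connector_config = {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "1",
        "topics.regex": "<user_id>.*",            # placeholder topic pattern
        "s3.region": "us-east-1",                 # placeholder region
        "s3.bucket.name": "<bucket_name>",        # placeholder destination bucket
        "flush.size": "1",                        # records written per S3 object
        "storage.class": "io.confluent.connect.storage.s3.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "key.converter": "org.apache.kafka.connect.storage.StringConverter",
        "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    }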

Amazon API Gateway

  • Creating resource for proxy integration for REST API gateway
  • Deployment of API
  • Creation of API stages for deployment
  • Configuration of Kafka REST proxy API on EC2 client machine
  • Installation of Confluent package
  • Configuration of kafka-rest.properties to perform IAM authentication
  • Starting REST proxy on EC2 client machine
  • Sending data to the S3 bucket using the API invoke URL
    • Formatting data as JSON messages for API processing
    • Sending data from the three Pinterest tables to their corresponding Kafka topics
  • Creating and configuring resources
  • Creating and configuring GET, POST, DELETE and PUT methods with integration requests
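
As a hedged example of the JSON message format described above, a record can be sent to a Kafka topic through the REST proxy with a POST request; the invoke URL and topic name below are placeholders.

    import json
    import requests

    invoke_url = "https://<api_id>.execute-api.<region>.amazonaws.com/<stage>"  # placeholder
    topic = "<user_id>.pin"                                                     # placeholder

    # The Kafka REST proxy (v2) expects records wrapped in a "records" envelope.
    payload = json.dumps({"records": [{"value": {"index": 1, "title": "example pin"}}]})
    headers = {"Content-Type": "application/vnd.kafka.json.v2+json"}

    response = requests.post(f"{invoke_url}/topics/{topic}", headers=headers, data=payload)
    print(response.status_code)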

Databricks

  • Creating Databricks workspace
  • Creating an access key and secret access key in AWS for full S3 access
  • Loading in credential file to Databricks
  • Mounting S3 bucket in Databricks
  • Reading in .json format data into Databricks dataframes
  • Reading and writing streaming data from Kinesis Data Streams
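
A minimal sketch of the mount-and-read steps inside a Databricks notebook (where spark and dbutils are available) is shown below; the keys, bucket, mount point and topic path are placeholders.

    import urllib.parse

    # Placeholder credentials - in practice these come from the uploaded credential file.
    access_key = "<AWS_ACCESS_KEY_ID>"
    secret_key = urllib.parse.quote("<AWS_SECRET_ACCESS_KEY>", safe="")

    # Mount the S3 bucket (placeholder names).
    dbutils.fs.mount(
        source=f"s3n://{access_key}:{secret_key}@<bucket_name>",
        mount_point="/mnt/pinterest_data",
    )

    # Read the JSON objects written by the S3 sink connector into a dataframe.
    df_pin = (
        spark.read.format("json")
        .option("inferSchema", "true")
        .load("/mnt/pinterest_data/topics/<user_id>.pin/partition=0/*.json")
    )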

Apache Spark

  • Cleaning data using PySpark
  • Dataframe joins
  • Querying and aggregations of data
    • group by
    • classifying values into groups
    • alias
    • sorting data
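
A hedged PySpark sketch of these operations is given below; the dataframes (df_pin, df_user) and column names are hypothetical stand-ins, not the project's exact schema.

    from pyspark.sql import functions as F

    # Hypothetical cleaned dataframes: df_pin (posts) and df_user (user details).
    df_joined = df_pin.join(df_user, on="ind", how="inner")

    # Classify ages into groups, then count posts per category within each group.
    df_result = (
        df_joined
        .withColumn(
            "age_group",
            F.when(F.col("age") < 25, "18-24")
             .when(F.col("age") < 36, "25-35")
             .when(F.col("age") < 51, "36-50")
             .otherwise("50+"),
        )
        .groupBy("age_group", "category")
        .agg(F.count("*").alias("category_count"))
        .orderBy(F.desc("category_count"))
    )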

Apache Airflow

  • Creating DAG to trigger Databricks notebook to run on schedule
  • Uploading DAG to S3 bucket
  • Using AWS MWAA environment to access Airflow UI
  • Triggering DAG successfully
  • Ensuring Databricks notebook is compatible with DAG and Airflow workflow
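
A minimal sketch of such a DAG, assuming a Databricks connection is configured in the MWAA environment; the notebook path, cluster id and DAG id are placeholders.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

    # Placeholder notebook path.
    notebook_task = {"notebook_path": "/Users/<user>/pinterest_batch_notebook"}

    with DAG(
        dag_id="pinterest_databricks_dag",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        run_notebook = DatabricksSubmitRunOperator(
            task_id="run_databricks_notebook",
            databricks_conn_id="databricks_default",
            existing_cluster_id="<cluster_id>",
            notebook_task=notebook_task,
        )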

Amazon Kinesis

  • Creating data streams in AWS Kinesis console
  • Configuring REST API with Kinesis proxy integration
    • List streams
    • Create, describe and delete streams
    • Add records to streams
  • Sending data to Kinesis Data Streams with REST APIs
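
A hedged example of adding a record through the Kinesis proxy API is shown below; the invoke URL, stream name and the /streams/{stream-name}/record resource layout are assumptions based on the common Kinesis proxy pattern and may differ from the deployed API.

    import json
    import requests

    invoke_url = "https://<api_id>.execute-api.<region>.amazonaws.com/<stage>"  # placeholder
    stream_name = "streaming-<user_id>-pin"                                     # placeholder

    payload = json.dumps({
        "StreamName": stream_name,
        "Data": {"index": 1, "title": "example pin"},   # record payload
        "PartitionKey": "partition-1",
    })
    headers = {"Content-Type": "application/json"}

    response = requests.put(
        f"{invoke_url}/streams/{stream_name}/record", headers=headers, data=payload
    )
    print(response.status_code)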

Configuration Troubleshooting

  • Troubleshooting the configuration and set-up of AWS services and users via the AWS CLI
    • Including IAM permissions, MSK Connect plugin and connector configuration, EC2 instance connection issues and API Gateway configuration.
  • Troubleshooting Databricks connection issues, credentials configuration and Delta data formatting issues.

Prerequisites πŸ”§

  • AWS account with appropriate permissions for EC2, S3, Amazon MSK, MSK Connect, API Gateway, MWAA and Kinesis
  • AWS CLI installed and configured with AWS account credentials
  • Databricks account, with AWS credentials and API tokens

Best configured on Linux, macOS, or Windows with WSL.

AWS Configuration Example βš™οΈ

Batch Processing:

Amazon EC2

EC2 Key-Pair information for "-key-pair.pem" ‡️

EC2 SSH Connection instructions and command ‡️

AWS Identity and Access Management (IAM)

Edit IAM trust policy for EC2 access-role ‡️
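
As a hedged sketch, the same trust-policy edit can also be made programmatically with boto3; the account id, user name and role name below are placeholders.

    import json

    import boto3

    # Placeholder identifiers.
    role_name = "<user_id>-ec2-access-role"
    trusted_principal = "arn:aws:iam::<account_id>:user/<aws_iam_user_name>"

    # Trust policy allowing the IAM user to assume the EC2 access role.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": trusted_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }

    iam = boto3.client("iam")
    iam.update_assume_role_policy(RoleName=role_name, PolicyDocument=json.dumps(trust_policy))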


Stream Processing:

Amazon Kinesis

Creating Kinesis Data Streams in AWS ‡️
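
The streams can also be created programmatically; a minimal boto3 sketch with placeholder stream names and region:

    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")  # placeholder region

    # One stream per Pinterest table (stream names are placeholders).
    for suffix in ("pin", "geo", "user"):
        kinesis.create_stream(StreamName=f"streaming-<user_id>-{suffix}", ShardCount=1)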

Amazon API Gateway

API Kinesis proxy integration, resources and method configuration ‡️


Example Use πŸ“Ž

Batch Processing:

  1. Connect to the configured AWS EC2 client machine and start the Kafka REST proxy

    example-key-pair.pem located in Credentials directory

    # Make sure access to key-pair.pem is not public
    chmod 400 <aws_iam_user_name>-key-pair.pem
    
    # Connect to EC2 client machine
    ssh -i "<aws_iam_user_name>-key-pair.pem" ec2-user@<ec2-instance-public-dns>
    
    # Navigate to 'confluent/bin' directory
    cd confluent-7.2.0/bin
    # Start Kafka REST API
    ./kafka-rest-start /home/ec2-user/confluent-7.2.0/etc/kafka-rest/kafka-rest.properties
  2. Run python script to ingest data to S3 bucket with REST API requests

    credentials_template.yaml located in Credentials directory. Ensure AWS RDS source credentials are altered

    Pinterest_data_ingestion.py
    
    # Run function to initiate pinterest data ingestion to S3 buckets
    Pinterest_data_kafka_data_batch()  
  3. Airflow DAG to trigger Databricks Notebook on a specified schedule (e.g. Daily) in AWS MWAA Airflow UI

    # Airflow DAG script, loaded into S3 bucket and connected to AWS MWAA environment
    124714cdee67_dag.py

    Airflow UI

  4. Mount S3 bucket to configured Databricks workspace and read in batch data for processing with Databricks Notebook

    # Airflow will trigger Databricks Notebook below, that will mount S3 bucket and run data transformations
    Reading, cleaning and querying Pinterest Data from mounted S3 bucket using Sparks.ipynb
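
Tying steps 1 and 2 together, below is a simplified, hypothetical sketch of the kind of ingestion loop Pinterest_data_ingestion.py performs; the credential keys, table, topic and invoke URL are placeholders, and this is not the project's actual script.

    import json

    import requests
    import sqlalchemy
    import yaml

    # Load RDS source credentials (placeholder key names).
    with open("Credentials/credentials_template.yaml") as f:
        creds = yaml.safe_load(f)

    engine = sqlalchemy.create_engine(
        f"mysql+pymysql://{creds['RDS_USER']}:{creds['RDS_PASSWORD']}"
        f"@{creds['RDS_HOST']}:{creds['RDS_PORT']}/{creds['RDS_DATABASE']}"
    )

    invoke_url = "https://<api_id>.execute-api.<region>.amazonaws.com/<stage>"  # placeholder
    headers = {"Content-Type": "application/vnd.kafka.json.v2+json"}

    # Read rows from the (placeholder) Pinterest table and post each to its Kafka topic.
    with engine.connect() as conn:
        rows = conn.execute(sqlalchemy.text("SELECT * FROM pinterest_data LIMIT 10"))
        for row in rows:
            record = dict(row._mapping)
            payload = json.dumps({"records": [{"value": record}]}, default=str)
            requests.post(f"{invoke_url}/topics/<user_id>.pin", headers=headers, data=payload)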

Stream Processing:

  1. Run python script to stream data to Kinesis Data Streams with configured REST API

    credentials_template.yaml located in Credentials directory. Ensure AWS RDS source credentials are altered

    Pinterest_data_ingestion.py
    
    # Run function to initiate pinterest data ingestion to Databricks with AWS Kinesis
    Pinterest_data_kinesis_data_streams()
  2. Read in streaming data to Databricks with Databricks Notebook, and clean and write stream data to Delta tables.

    # Databricks Notebook
     Reading and cleaning data from Kinesis Data Stream.ipynb
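
A hedged sketch of the notebook's core pattern, assuming the Databricks Kinesis connector (where spark is available); the stream name, credentials, schema and table name are placeholders.

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Hypothetical schema for the JSON payload carried by the stream.
    pin_schema = StructType([
        StructField("index", IntegerType()),
        StructField("title", StringType()),
        StructField("category", StringType()),
    ])

    # Read the Kinesis stream (placeholder stream name, region and credentials).
    raw_stream = (
        spark.readStream.format("kinesis")
        .option("streamName", "streaming-<user_id>-pin")
        .option("region", "us-east-1")
        .option("initialPosition", "earliest")
        .option("awsAccessKey", "<AWS_ACCESS_KEY_ID>")
        .option("awsSecretKey", "<AWS_SECRET_ACCESS_KEY>")
        .load()
    )

    # The payload arrives in the binary 'data' column; decode and parse it.
    df_pin_stream = (
        raw_stream
        .selectExpr("CAST(data AS STRING) AS json_data")
        .select(F.from_json("json_data", pin_schema).alias("data"))
        .select("data.*")
    )

    # Write the cleaned stream to a Delta table for long-term storage.
    (
        df_pin_stream.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", "/tmp/kinesis/_checkpoints/pin")
        .toTable("<user_id>_pin_table")
    )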

Technologies used in this project

Amazon AWS - AWS account and AWS services

Amazon EC2 - EC2 instance and client machine

Apache Kafka - Kafka and Confluent for topic creation, producing and consuming data, and connecting to S3 buckets

Amazon MSK & MSK Connect - MSK cluster creation and connecting Kafka topics to S3 buckets

AWS Identity and Access Management (IAM) - IAM access roles

Amazon S3 - Simple storage service for objects

Amazon API Gateway - Creating REST APIs with proxy integrations, child resources and methods

Apache Airflow - Orchestrating workflows on a schedule with DAGs

Databricks - Mounting AWS S3 buckets for reading in batch data, reading and writing streaming data, data processing and long-term storage in Delta tables

Apache Spark - Spark Structured Streaming and data processing in Databricks

Amazon Kinesis - Creating Kinesis Data Streams and streaming data through REST APIs

File Structure πŸ“‚

πŸ“‚ pinterest-data-pipeline819
