
Pinterest Data Pipeline

Description

AWS-hosted end-to-end ETL data pipeline inspired by Pinterest's experiment processing pipeline. Pinterest crunches billions of data points every day to decide how to provide more value to their users.

The pipeline is developed using a Lambda architecture. The batch data is ingested using AWS API Gateway and AWS MSK and then stored in an AWS S3 bucket. The batch data is then read from the S3 bucket into Databricks where it is processed using Apache Spark.

The streaming data is read near real-time from AWS Kinesis using Spark Structured Streaming in Databricks and stored in Databricks Delta Tables for long term storage.

Figure: End-to-end ETL data pipeline AWS cloud architecture, integrated with Databricks and showing IAM and permission access points.

Aim

Build an end-to-end data ETL pipeline using AWS-hosted cloud technologies, integrated with Databricks for processing and long-term storage.

The pipeline facilitates social media analytics on both streaming data in real time and data at rest, supporting sentiment analysis, trending categories and user engagement.

Achievement Outcomes πŸ“–

Amazon EC2 & Apache Kafka

  • Configuring AWS EC2 instances
  • Installation and configuration of Kafka on EC2 client machine
  • Creating topics with Kafka on EC2 instance
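
In the project the topics are created with the Kafka CLI on the EC2 client machine. For illustration only, below is a minimal kafka-python sketch of the same step; the bootstrap server address and topic names are placeholders, and an MSK cluster using IAM authentication would need additional SASL configuration not shown here.

    # Hypothetical sketch - the project itself uses the Kafka CLI on the EC2 client.
    from kafka.admin import KafkaAdminClient, NewTopic

    # Placeholder bootstrap server address.
    admin = KafkaAdminClient(bootstrap_servers="<bootstrap-server>:9098")

    # One topic per Pinterest table (topic names are placeholders).
    topics = [
        NewTopic(name="<user_id>.pin", num_partitions=1, replication_factor=2),
        NewTopic(name="<user_id>.geo", num_partitions=1, replication_factor=2),
        NewTopic(name="<user_id>.user", num_partitions=1, replication_factor=2),
    ]
    admin.create_topics(new_topics=topics)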

Amazon S3 & AWS Identity and Access Management (IAM)

  • Configuring S3 bucket and MSK Connect
  • Using the Confluent Amazon S3 Sink Connector to connect Kafka topics to the S3 bucket
  • Creation of customised plugins and configuring MSK connector with IAM roles
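
The connector configuration itself lives in MSK Connect; as a hedged sketch, the kind of Confluent S3 Sink Connector properties involved might look like the Python dict below (bucket name, topic pattern and sizes are placeholders, not the project's exact values).

    # Hypothetical Confluent S3 Sink Connector properties supplied to MSK Connect.
    connector_config = {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "1",
        "topics.regex": "<user_id>.*",            # placeholder topic pattern
        "s3.region": "us-east-1",                 # placeholder region
        "s3.bucket.name": "<bucket_name>",        # placeholder destination bucket
        "flush.size": "1",                        # records written per S3 object
        "storage.class": "io.confluent.connect.storage.s3.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "key.converter": "org.apache.kafka.connect.storage.StringConverter",
        "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    }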

Amazon API Gateway

  • Creating resource for proxy integration for REST API gateway
  • Deployment of API
  • Creation of API stages for deployment
  • Configuration of Kafka REST proxy API on EC2 client machine
  • Installation of Confluent package
  • Configuration of kafka-rest.properties to perform IAM authentication
  • Starting REST proxy on EC2 client machine
  • Sending data to the S3 bucket using the API invoke URL
    • Formatting data as JSON messages for API processing
    • Sending data from the three Pinterest tables to their corresponding Kafka topics
  • Creating and configuring resources
  • Creating and configuring GET, POST, DELETE and PUT methods with integration requests
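
As a hedged example of the JSON message format described above, a record can be sent to a Kafka topic through the REST proxy with a POST request; the invoke URL and topic name below are placeholders.

    import json
    import requests

    invoke_url = "https://<api_id>.execute-api.<region>.amazonaws.com/<stage>"  # placeholder
    topic = "<user_id>.pin"                                                     # placeholder

    # The Kafka REST proxy (v2) expects records wrapped in a "records" envelope.
    payload = json.dumps({"records": [{"value": {"index": 1, "title": "example pin"}}]})
    headers = {"Content-Type": "application/vnd.kafka.json.v2+json"}

    response = requests.post(f"{invoke_url}/topics/{topic}", headers=headers, data=payload)
    print(response.status_code)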

Databricks

  • Creating Databricks workspace
  • Creating an access key and secret access key in AWS for full S3 access
  • Loading in credential file to Databricks
  • Mounting S3 bucket in Databricks
  • Reading in .json format data into Databricks dataframes
  • Reading and writing streaming data from Kinesis Data Streams
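
A minimal sketch of the mount-and-read steps inside a Databricks notebook (where spark and dbutils are available) is shown below; the keys, bucket, mount point and topic path are placeholders.

    import urllib.parse

    # Placeholder credentials - in practice these come from the uploaded credential file.
    access_key = "<AWS_ACCESS_KEY_ID>"
    secret_key = urllib.parse.quote("<AWS_SECRET_ACCESS_KEY>", safe="")

    # Mount the S3 bucket (placeholder names).
    dbutils.fs.mount(
        source=f"s3n://{access_key}:{secret_key}@<bucket_name>",
        mount_point="/mnt/pinterest_data",
    )

    # Read the JSON objects written by the S3 sink connector into a dataframe.
    df_pin = (
        spark.read.format("json")
        .option("inferSchema", "true")
        .load("/mnt/pinterest_data/topics/<user_id>.pin/partition=0/*.json")
    )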

Apache Spark

  • Cleaning data using PySpark
  • Dataframe joins
  • Querying and aggregations of data
    • group by
    • classifying values into groups
    • alias
    • sorting data
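
A hedged PySpark sketch of these operations is given below; the dataframes (df_pin, df_user) and column names are hypothetical stand-ins, not the project's exact schema.

    from pyspark.sql import functions as F

    # Hypothetical cleaned dataframes: df_pin (posts) and df_user (user details).
    df_joined = df_pin.join(df_user, on="ind", how="inner")

    # Classify ages into groups, then count posts per category within each group.
    df_result = (
        df_joined
        .withColumn(
            "age_group",
            F.when(F.col("age") < 25, "18-24")
             .when(F.col("age") < 36, "25-35")
             .when(F.col("age") < 51, "36-50")
             .otherwise("50+"),
        )
        .groupBy("age_group", "category")
        .agg(F.count("*").alias("category_count"))
        .orderBy(F.desc("category_count"))
    )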

Apache Airflow

  • Creating DAG to trigger Databricks notebook to run on schedule
  • Uploading DAG to S3 bucket
  • Using AWS MWAA environment to access Airflow UI
  • Triggering DAG successfully
  • Ensuring Databricks notebook is compatible with DAG and Airflow workflow
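
A minimal sketch of such a DAG, assuming a Databricks connection is configured in the MWAA environment; the notebook path, cluster id and DAG id are placeholders.

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

    # Placeholder notebook path.
    notebook_task = {"notebook_path": "/Users/<user>/pinterest_batch_notebook"}

    with DAG(
        dag_id="pinterest_databricks_dag",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        run_notebook = DatabricksSubmitRunOperator(
            task_id="run_databricks_notebook",
            databricks_conn_id="databricks_default",
            existing_cluster_id="<cluster_id>",
            notebook_task=notebook_task,
        )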

Amazon Kinesis

  • Creating data streams in AWS Kinesis console
  • Configuring REST API with Kinesis proxy integration
    • List streams
    • Create, describe and delete streams
    • Add records to streams
  • Sending data to Kinesis Data Streams with REST APIs
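
A hedged example of adding a record through the Kinesis proxy API is shown below; the invoke URL, stream name and the /streams/{stream-name}/record resource layout are assumptions based on the common Kinesis proxy pattern and may differ from the deployed API.

    import json
    import requests

    invoke_url = "https://<api_id>.execute-api.<region>.amazonaws.com/<stage>"  # placeholder
    stream_name = "streaming-<user_id>-pin"                                     # placeholder

    payload = json.dumps({
        "StreamName": stream_name,
        "Data": {"index": 1, "title": "example pin"},   # record payload
        "PartitionKey": "partition-1",
    })
    headers = {"Content-Type": "application/json"}

    response = requests.put(
        f"{invoke_url}/streams/{stream_name}/record", headers=headers, data=payload
    )
    print(response.status_code)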

Configuration Troubleshooting

  • Troubleshooting the configuration and set-up of AWS services and users via the AWS CLI
    • Including IAM permissions, MSK Connect plugin and connector configuration, EC2 instance connection issues and API Gateway configuration.
  • Troubleshooting Databricks connection issues, credentials configuration and Delta data formatting issues.

Prerequisites πŸ”§

  • AWS account with appropriate permissions for EC2, S3, Amazon MSK, MSK Connect, API Gateway, MWAA and Kinesis
  • AWS CLI installed and configured with AWS account credentials
  • Databricks account, with AWS credentials and API tokens

Best configured on Linux, macOS, or Windows with WSL.

AWS Configuration Example βš™οΈ

Batch Processing:

Amazon EC2

EC2 Key-Pair information for "-key-pair.pem" ‡️

EC2 SSH Connection instructions and command ‡️

AWS Identity and Access Management (IAM)

Edit IAM trust policy for EC2 access-role ‡️
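
As a hedged sketch, the same trust-policy edit can also be made programmatically with boto3; the account id, user name and role name below are placeholders.

    import json

    import boto3

    # Placeholder identifiers.
    role_name = "<user_id>-ec2-access-role"
    trusted_principal = "arn:aws:iam::<account_id>:user/<aws_iam_user_name>"

    # Trust policy allowing the IAM user to assume the EC2 access role.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": trusted_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }

    iam = boto3.client("iam")
    iam.update_assume_role_policy(RoleName=role_name, PolicyDocument=json.dumps(trust_policy))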


Stream Processing:

Amazon Kinesis

Creating Kinesis Data Streams in AWS ‡️
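
The streams can also be created programmatically; a minimal boto3 sketch with placeholder stream names and region:

    import boto3

    kinesis = boto3.client("kinesis", region_name="us-east-1")  # placeholder region

    # One stream per Pinterest table (stream names are placeholders).
    for suffix in ("pin", "geo", "user"):
        kinesis.create_stream(StreamName=f"streaming-<user_id>-{suffix}", ShardCount=1)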

Amazon API Gateway

API Kinesis proxy integration, resources and method configuration ‡️


Example Use πŸ“Ž

Batch Processing:

  1. Connect to the configured AWS EC2 client machine and start the Kafka REST proxy

    example-key-pair.pem located in Credentials directory

    # Make sure access to key-pair.pem is not public
    chmod 400 <aws_iam_user_name>-key-pair.pem
    
    # Connect to EC2 client machine
    ssh -i "<aws_iam_user_name>-key-pair.pem" ec2-user@<ec2-instance-public-dns>
    
    # Navigate to 'confluent/bin' directory
    cd confluent-7.2.0/bin
    # Start Kafka REST API
    ./kafka-rest-start /home/ec2-user/confluent-7.2.0/etc/kafka-rest/kafka-rest.properties
  2. Run python script to ingest data to S3 bucket with REST API requests

    credentials_template.yaml located in Credentials directory. Ensure AWS RDS source credentials are altered

    Pinterest_data_ingestion.py
    
    # Run function to initiate pinterest data ingestion to S3 buckets
    Pinterest_data_kafka_data_batch()  
  3. Airflow DAG to trigger Databricks Notebook on a specified schedule (e.g. Daily) in AWS MWAA Airflow UI

    # Airflow DAG script, loaded into S3 bucket and connected to AWS MWAA environment
    124714cdee67_dag.py

    Airflow UI

  4. Mount S3 bucket to configured Databricks workspace and read in batch data for processing with Databricks Notebook

    # Airflow will trigger Databricks Notebook below, that will mount S3 bucket and run data transformations
    Reading, cleaning and querying Pinterest Data from mounted S3 bucket using Sparks.ipynb
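
Tying steps 1 and 2 together, below is a simplified, hypothetical sketch of the kind of ingestion loop Pinterest_data_ingestion.py performs; the credential keys, table, topic and invoke URL are placeholders, and this is not the project's actual script.

    import json

    import requests
    import sqlalchemy
    import yaml

    # Load RDS source credentials (placeholder key names).
    with open("Credentials/credentials_template.yaml") as f:
        creds = yaml.safe_load(f)

    engine = sqlalchemy.create_engine(
        f"mysql+pymysql://{creds['RDS_USER']}:{creds['RDS_PASSWORD']}"
        f"@{creds['RDS_HOST']}:{creds['RDS_PORT']}/{creds['RDS_DATABASE']}"
    )

    invoke_url = "https://<api_id>.execute-api.<region>.amazonaws.com/<stage>"  # placeholder
    headers = {"Content-Type": "application/vnd.kafka.json.v2+json"}

    # Read rows from the (placeholder) Pinterest table and post each to its Kafka topic.
    with engine.connect() as conn:
        rows = conn.execute(sqlalchemy.text("SELECT * FROM pinterest_data LIMIT 10"))
        for row in rows:
            record = dict(row._mapping)
            payload = json.dumps({"records": [{"value": record}]}, default=str)
            requests.post(f"{invoke_url}/topics/<user_id>.pin", headers=headers, data=payload)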

Stream Processing:

  1. Run python script to stream data to Kinesis Data Streams with configured REST API

    credentials_template.yaml located in Credentials directory. Ensure AWS RDS source credentials are altered

    Pinterest_data_ingestion.py
    
    # Run function to initiate pinterest data ingestion to Databricks with AWS Kinesis
    Pinterest_data_kinesis_data_streams()
  2. Read in streaming data to Databricks with Databricks Notebook, and clean and write stream data to Delta tables.

    # Databricks Notebook
     Reading and cleaning data from Kinesis Data Stream.ipynb
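
A hedged sketch of the notebook's core pattern, assuming the Databricks Kinesis connector (where spark is available); the stream name, credentials, schema and table name are placeholders.

    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    # Hypothetical schema for the JSON payload carried by the stream.
    pin_schema = StructType([
        StructField("index", IntegerType()),
        StructField("title", StringType()),
        StructField("category", StringType()),
    ])

    # Read the Kinesis stream (placeholder stream name, region and credentials).
    raw_stream = (
        spark.readStream.format("kinesis")
        .option("streamName", "streaming-<user_id>-pin")
        .option("region", "us-east-1")
        .option("initialPosition", "earliest")
        .option("awsAccessKey", "<AWS_ACCESS_KEY_ID>")
        .option("awsSecretKey", "<AWS_SECRET_ACCESS_KEY>")
        .load()
    )

    # The payload arrives in the binary 'data' column; decode and parse it.
    df_pin_stream = (
        raw_stream
        .selectExpr("CAST(data AS STRING) AS json_data")
        .select(F.from_json("json_data", pin_schema).alias("data"))
        .select("data.*")
    )

    # Write the cleaned stream to a Delta table for long-term storage.
    (
        df_pin_stream.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", "/tmp/kinesis/_checkpoints/pin")
        .toTable("<user_id>_pin_table")
    )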

Technologies used in this project

Amazon AWS - AWS account and AWS services

Amazon EC2 - EC2 instance and client machine

Apache Kafka - Kafka and Confluent for topic creation, producing and consuming data, and connecting to S3 buckets

Amazon MSK & MSK Connect - MSK cluster creation and connecting Kafka topics to S3 buckets

AWS Identity and Access Management (IAM) - IAM access roles

Amazon S3 - Simple storage service for objects

Amazon API Gateway - Creating REST APIs with proxy integrations, child resources and methods

Apache Airflow - Orchestrating workflows on a schedule with DAGs

Databricks - Mounting AWS S3 buckets for reading in batch data, reading and writing streaming data, data processing and long-term storage in Delta tables

Apache Spark - Spark Structured Streaming and data processing in Databricks

Amazon Kinesis - Creating Kinesis Data Streams and streaming data through REST APIs

File Structure πŸ“‚

πŸ“‚ pinterest-data-pipeline819
