ETL Logic for sample data using Spark and AWS resources

ETL logic is written in Spark to transform a given data set stored in S3, and queries on the transformed data are run using AWS Redshift. The data sets are in JSON format, so all the raw JSON data is first uploaded to an S3 source bucket. A Spark job is then executed on EMR; it fetches the source data from the S3 source bucket and performs the necessary transformations as per the problem statement. The transformed data is partitioned and stored in Parquet format in an S3 destination bucket. Finally, these files are accessed from AWS Redshift by running SQL queries on the transformed, processed data.
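
As one way to picture the pipeline, here is a minimal PySpark sketch of the EMR job described above. The bucket names, the `timestamp` field, and the date-based partition column are illustrative assumptions, not details taken from this repository; the actual transformations depend on the problem statement.

```python
# etl_job.py -- a minimal sketch of the EMR Spark step; bucket names, the
# "timestamp" field, and the date-based partitioning are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

SOURCE = "s3://example-etl-source-bucket/raw/"           # hypothetical source bucket
DEST = "s3://example-etl-destination-bucket/processed/"  # hypothetical destination bucket

spark = SparkSession.builder.appName("json-to-parquet-etl").getOrCreate()

# Fetch the raw JSON data set uploaded to the S3 source bucket.
raw = spark.read.json(SOURCE)

# Example transformation: derive a date column to partition the output on.
transformed = raw.withColumn("event_date", F.to_date(F.col("timestamp")))

# Store the transformed data, partitioned and in Parquet format,
# in the S3 destination bucket.
transformed.write.mode("overwrite").partitionBy("event_date").parquet(DEST)

spark.stop()
```

On EMR, a script like this would typically be submitted as a cluster step via `spark-submit etl_job.py`.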

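The final query step can be sketched in the same spirit. One common way to read Parquet files in S3 from Redshift is through a Redshift Spectrum external table; the snippet below assumes such a table (here called `spectrum.transformed_events`) has already been defined over the destination bucket, and the endpoint, credentials, and column names are placeholders.

```python
# query_redshift.py -- a hedged sketch of querying the transformed data from
# AWS Redshift via psycopg2; connection details and table name are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="dev",
    user="awsuser",
    password="REDACTED",  # placeholder
)
with conn, conn.cursor() as cur:
    # Run a SQL query over the partitioned Parquet files in the destination
    # bucket, exposed to Redshift as an external (Spectrum) table.
    cur.execute(
        """
        SELECT event_date, COUNT(*) AS rows_per_day
        FROM spectrum.transformed_events
        GROUP BY event_date
        ORDER BY event_date
        """
    )
    for row in cur.fetchall():
        print(row)
conn.close()
```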