A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3, in a directory of JSON logs on user activity in the app, as well as a directory with JSON metadata on the songs in the app.
As the data engineer, we are tasked with building an ETL pipeline that extracts data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for the analytics team to continue finding insights into what songs their users are listening to.
- Python, SQL
- Jupyter Notebook
- AWS Services (Redshift, S3, EC2, IAM, VPC, Boto3, CLI)
Create resources on AWS using the AWS Management Console to support the Redshift data warehouse (a boto3 sketch of the same setup follows below):
- create a `myRedshiftRole` IAM role with the `AmazonS3ReadOnlyAccess` permission policy attached
- create a `redshift_security_group` security group that authorizes Redshift cluster access (with the default VPC)
- create an IAM user with the two permission policies below attached, and create and save the `Access key` and `Secret access key`
  - `AmazonRedshiftFullAccess`
  - `AmazonS3ReadOnlyAccess`
- create the `redshift-cluster-1` cluster that attaches the `myRedshiftRole` role and the `redshift_security_group` security group

NOTE: Make sure to delete the cluster each time you finish working to avoid large, unexpected costs.
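
The console steps above can also be scripted. The sketch below uses boto3 with the same resource names; the region, node type, database name, and credentials are placeholder assumptions, and the security-group step is left to the console.

```python
import json
import boto3

ROLE_NAME = "myRedshiftRole"
CLUSTER_ID = "redshift-cluster-1"
REGION = "us-west-2"  # assumed region

iam = boto3.client("iam", region_name=REGION)
redshift = boto3.client("redshift", region_name=REGION)

# IAM role that Redshift can assume, with read-only access to S3.
iam.create_role(
    RoleName=ROLE_NAME,
    AssumeRolePolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }),
)
iam.attach_role_policy(
    RoleName=ROLE_NAME,
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)
role_arn = iam.get_role(RoleName=ROLE_NAME)["Role"]["Arn"]

# Cluster that attaches the role; node type and size are placeholder choices.
redshift.create_cluster(
    ClusterIdentifier=CLUSTER_ID,
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=2,
    DBName="dev",
    MasterUsername="awsuser",
    MasterUserPassword="<choose-a-password>",
    IamRoles=[role_arn],
)

# When you finish working, delete the cluster to avoid unexpected costs:
# redshift.delete_cluster(ClusterIdentifier=CLUSTER_ID, SkipFinalClusterSnapshot=True)
```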
- Use the AWS CLI S3 commands to list bucket objects:

```
aws configure
AWS Access Key ID: {KEY}
AWS Secret Access Key: {SECRET}

aws s3 ls s3://udacity-dend/log_data/2018/11/
aws s3 ls s3://udacity-dend/song_data/A/A/A/
```

- Download sample data to local to explore the data (check data types, etc.); a short inspection sketch follows below:

```
aws s3 cp s3://udacity-dend/song_data/A/A/A/TRAAAAK128F9318786.json sample_data/TRAAAAK128F9318786.json
aws s3 cp s3://udacity-dend/log_data/2018/11/2018-11-30-events.json sample_data/2018-11-30-events.json
aws s3 cp s3://udacity-dend/log_json_path.json sample_data/log_json_path.json
```
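
Once the samples are local, a quick pandas sketch (pandas is assumed to be installed; it is not needed elsewhere in the project) shows the column names and data types used to choose the staging-table schemas below.

```python
import pandas as pd

# Log events are newline-delimited JSON: one event per line.
events = pd.read_json("sample_data/2018-11-30-events.json", lines=True)
print(events.dtypes)
print(events[["userId", "page", "ts"]].head())

# Each song file holds a single one-line JSON object, so lines=True yields one row.
song = pd.read_json("sample_data/TRAAAAK128F9318786.json", lines=True)
print(song.dtypes)
```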
- etl_test.ipynb - test AWS configurations and the ETL process, including validation and example analytical queries
- sql_queries.py - a collection of SQL queries for `create_tables.py` and `etl.py`
- Run create_tables.py to create the Staging, Fact, and Dimension table schemas, using the command `python3 create_tables.py` (a sketch of the script's shape follows below):
  - `drop_tables` - drop each table if it exists
  - `create_tables` - create the tables
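
A rough sketch of how `create_tables.py` is typically structured: read the cluster connection settings from a config file and run the drop/create statements from `sql_queries.py`. The `dwh.cfg` file name, its section/key names, and the `drop_table_queries`/`create_table_queries` list names are assumptions about the project layout.

```python
import configparser
import psycopg2
from sql_queries import create_table_queries, drop_table_queries

# Read cluster connection settings (file and key names are assumptions).
config = configparser.ConfigParser()
config.read("dwh.cfg")

conn = psycopg2.connect(
    host=config["CLUSTER"]["HOST"],
    dbname=config["CLUSTER"]["DB_NAME"],
    user=config["CLUSTER"]["DB_USER"],
    password=config["CLUSTER"]["DB_PASSWORD"],
    port=config["CLUSTER"]["DB_PORT"],
)
cur = conn.cursor()

# Drop everything first so the script is re-runnable, then create the tables.
for query in drop_table_queries + create_table_queries:
    cur.execute(query)
    conn.commit()

conn.close()
```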
- Run etl.py to complete the ETL process, using the command `python3 etl.py`:
  - `load_staging_tables` - load/copy raw data from the S3 buckets into the Redshift staging tables (see the COPY sketch below)
    - reference: Using the COPY command to load from Amazon S3
  - `insert_tables` - transform the staging tables into star-schema fact & dimension tables for song play analysis (an example insert query follows the schema listing below)
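
The staging COPY statements referenced above generally look like the sketch below (the IAM role ARN is a placeholder): the log data uses the `log_json_path.json` JSONPaths file, while the song data can use `'auto'` because the `staging_songs` columns mirror the JSON field names.

```python
# Illustrative COPY statements for sql_queries.py; paths match the bucket listed
# above, and the IAM role ARN is a placeholder.
staging_events_copy = """
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE 'arn:aws:iam::<account-id>:role/myRedshiftRole'
    REGION 'us-west-2'
    FORMAT AS JSON 's3://udacity-dend/log_json_path.json';
"""

staging_songs_copy = """
    COPY staging_songs
    FROM 's3://udacity-dend/song_data'
    IAM_ROLE 'arn:aws:iam::<account-id>:role/myRedshiftRole'
    REGION 'us-west-2'
    FORMAT AS JSON 'auto';
"""
```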
staging_events
- artist
- auth
- firstName
- gender
- itemInSession
- lastName
- length
- level
- location
- method
- page
- registration
- sessionId
- song
- status
- ts
- userAgent
- userId
staging_songs
- artist_id
- artist_latitude
- artist_location
- artist_longitude
- artist_name
- duration
- num_songs
- song_id
- title
- year
songplays
- songplay_id PRIMARY KEY
- start_time
- user_id
- level
- song_id
- artist_id
- session_id
- duration
- user_agent
users
- user_id PRIMARY KEY
- first_name
- last_name
- gender
- level
songs
- song_id PRIMARY KEY
- title
- artist_id
- year
- duration
artists
- artist_id PRIMARY KEY
- name
- location
- latitude
- longitude
time
- start_time PRIMARY KEY
- hour
- day
- week
- month
- year
- weekday
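
Using the columns listed above, the `insert_tables` step for the fact table can be sketched as below. The join on song title and artist name, the `NextSong` page filter, and the epoch-millisecond conversion of `ts` are assumptions about how the staging data lines up, not the only way to write it.

```python
# Sketch of the songplays insert in sql_queries.py (songplay_id is generated
# by the IDENTITY column, so it is omitted from the column list).
songplay_table_insert = """
    INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                           session_id, duration, user_agent)
    SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second' AS start_time,
           e.userId,
           e.level,
           s.song_id,
           s.artist_id,
           e.sessionId,
           s.duration,
           e.userAgent
    FROM staging_events e
    JOIN staging_songs s
      ON e.song = s.title AND e.artist = s.artist_name
    WHERE e.page = 'NextSong';
"""
```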
NOTE:
- The `SERIAL` command in Postgres is not supported in Redshift. The equivalent in Redshift is `IDENTITY(0,1)`, which you can read more about in the Redshift CREATE TABLE docs.
- Amazon Redshift automatically assigns a `NOT NULL` condition to columns defined as `PRIMARY KEY`. You don't need to specify it separately. This can simplify the table create scripts (see the DDL sketch below).
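
For example, a `songplays` DDL that applies both notes (an `IDENTITY(0,1)` surrogate key, no explicit `NOT NULL` on the primary key) might look like the sketch below; the column types are assumptions based on the sample data.

```python
# Illustrative CREATE TABLE for the fact table; types are assumptions.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0,1) PRIMARY KEY,
        start_time  TIMESTAMP,
        user_id     INT,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        duration    FLOAT,
        user_agent  VARCHAR
    );
"""
```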
- create a virtual environment: `python3 -m venv udacity-dend`
- activate the virtual env: `source udacity-dend/bin/activate`
- install JupyterLab and ipykernel: `pip install jupyterlab`, `pip install ipykernel`
  - Validate that the install has succeeded by running `jupyter-lab` from your command line. A new tab should open in your browser, with the JupyterLab application running.
- install useful Python packages in this virtual env: `pip install boto3`, `pip install psycopg2`
- register the new virtual env with Jupyter so that you can use it within JupyterLab: `python3 -m ipykernel install --user --name=udacity-dend`
- a quick sanity check of the new environment is sketched below
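
As an optional check that the packages above work end to end, here is a small sketch that lists a few source files with boto3 and opens a Redshift connection with psycopg2. The bucket prefix comes from the CLI commands earlier; the cluster endpoint, database name, user, and password are placeholders for your own settings.

```python
import boto3
import psycopg2

# List a few of the source log files (same prefix as the CLI example above).
s3 = boto3.resource("s3", region_name="us-west-2")
bucket = s3.Bucket("udacity-dend")
for obj in list(bucket.objects.filter(Prefix="log_data/2018/11/"))[:3]:
    print(obj.key)

# Open and close a connection to the cluster; replace the placeholders with
# your own cluster endpoint and credentials.
conn = psycopg2.connect(
    host="<cluster-endpoint>",
    dbname="dev",
    user="awsuser",
    password="<password>",
    port=5432,
)
print("Redshift connection OK")
conn.close()
```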