Project 2: Cloud Data Warehouse

Project Introduction

A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3: a directory of JSON logs of user activity in the app, and a directory of JSON metadata about the songs in the app.

As the data engineer, I am tasked with building an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of dimensional tables so the analytics team can continue finding insights into what songs their users are listening to.

Tools used

  • Python, SQL
  • Jupyter Notebook
  • AWS Services (Redshift, S3, EC2, IAM, VPC, Boto3, CLI)

AWS Configuration

Create resources on AWS using the AWS Management Console to support the Redshift data warehouse (a boto3 sketch of the same steps follows the note below).

1. Create an IAM Role

  • create a myRedshiftRole IAM role with the AmazonS3ReadOnlyAccess permission policy attached

2. Create Security Group for Redshift

  • create a redshift_security_group security group that authorizes access to the Redshift cluster (in the default VPC)

3. Create an IAM User for Redshift

  • create an IAM user with the two permission policies below attached, then create and save the Access key ID and Secret access key
    • AmazonRedshiftFullAccess
    • AmazonS3ReadOnlyAccess

4. Launch a Redshift Cluster

  • create the redshift-cluster-1 cluster with the myRedshiftRole role and the redshift_security_group security group attached

NOTE: Make sure to delete the cluster each time you finish working to avoid large, unexpected costs.
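The same resources can also be created programmatically with boto3 rather than through the console. Below is a minimal sketch using the role, security group, and cluster names from the steps above; the region, node type, cluster size, database name, master credentials, and security group ID are placeholders or assumptions, not project-mandated values.

    import json
    import boto3

    REGION = 'us-west-2'  # assumption; use your own region
    iam = boto3.client('iam', region_name=REGION)
    redshift = boto3.client('redshift', region_name=REGION)

    # 1. IAM role that Redshift can assume, with read-only access to S3
    iam.create_role(
        RoleName='myRedshiftRole',
        AssumeRolePolicyDocument=json.dumps({
            'Version': '2012-10-17',
            'Statement': [{
                'Effect': 'Allow',
                'Principal': {'Service': 'redshift.amazonaws.com'},
                'Action': 'sts:AssumeRole',
            }],
        }),
    )
    iam.attach_role_policy(
        RoleName='myRedshiftRole',
        PolicyArn='arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess',
    )
    role_arn = iam.get_role(RoleName='myRedshiftRole')['Role']['Arn']

    # 4. Launch the cluster with the role and security group attached
    redshift.create_cluster(
        ClusterIdentifier='redshift-cluster-1',
        ClusterType='multi-node',              # assumption
        NodeType='dc2.large',                  # assumption
        NumberOfNodes=4,                       # assumption
        DBName='dev',                          # assumption
        MasterUsername='awsuser',              # assumption
        MasterUserPassword='<your-password>',  # placeholder
        IamRoles=[role_arn],
        VpcSecurityGroupIds=['<redshift_security_group id>'],  # placeholder (sg-...)
    )

    # When finished working, delete the cluster to avoid unexpected costs:
    # redshift.delete_cluster(ClusterIdentifier='redshift-cluster-1',
    #                         SkipFinalClusterSnapshot=True)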

Project Data Exploration

aws configure
AWS Access Key ID: {KEY}
AWS Secret Access Key: {SECRET}

aws s3 ls s3://udacity-dend/log_data/2018/11/
aws s3 ls s3://udacity-dend/song_data/A/A/A/

  • Download sample data locally to explore it (check data types, etc.); see the sketch after these copy commands
aws s3 cp s3://udacity-dend/song_data/A/A/A/TRAAAAK128F9318786.json sample_data/TRAAAAK128F9318786.json

aws s3 cp s3://udacity-dend/log_data/2018/11/2018-11-30-events.json sample_data/2018-11-30-events.json

aws s3 cp s3://udacity-dend/log_json_path.json sample_data/log_json_path.json
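Once the sample files are downloaded, a short script can confirm field names and types before writing the staging table DDL. A minimal sketch, assuming pandas is installed and using the sample_data/ paths above (the event logs are in JSON Lines format, one object per line):

    import json
    import pandas as pd

    # Song metadata: one JSON object per file
    with open('sample_data/TRAAAAK128F9318786.json') as f:
        song = json.load(f)
    print(sorted(song.keys()))  # artist_id, duration, num_songs, song_id, title, year, ...

    # Event logs: JSON Lines format, one object per line
    events = pd.read_json('sample_data/2018-11-30-events.json', lines=True)
    print(events.dtypes)                                    # column types inform the staging DDL
    print(events[['page', 'song', 'artist', 'ts']].head())  # spot-check a few columns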

ETL Pipeline

  • etl_test.ipynb - tests the AWS configuration and the ETL process, including validation and example analytical queries
  • sql_queries.py - a collection of SQL queries used by create_tables.py and etl.py
  1. Run create_tables.py to create the staging, fact, and dimension table schemas, using the command python3 create_tables.py
    • drop_tables - drop each table if it exists
    • create_tables - create the tables
  2. Run etl.py to complete the ETL process, using the command python3 etl.py (a sketch of the underlying SQL follows this list)
    • load_staging_tables - copy the raw JSON data from S3 into the staging tables
    • insert_tables - transform the staging tables into star-schema fact and dimension tables for song play analysis
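A sketch of what those two etl.py steps amount to in sql_queries.py terms. This is illustrative, not the exact project code: the IAM role ARN and region are placeholders, and in practice these values would typically be read from a configuration file rather than hard-coded.

    # load_staging_tables: COPY raw JSON from S3 into the staging tables
    staging_events_copy = """
        COPY staging_events FROM 's3://udacity-dend/log_data'
        IAM_ROLE '<myRedshiftRole-ARN>'
        REGION 'us-west-2'
        FORMAT AS JSON 's3://udacity-dend/log_json_path.json';
    """

    staging_songs_copy = """
        COPY staging_songs FROM 's3://udacity-dend/song_data'
        IAM_ROLE '<myRedshiftRole-ARN>'
        REGION 'us-west-2'
        FORMAT AS JSON 'auto';
    """

    # insert_tables: transform staging rows into the songplays fact table
    songplay_table_insert = """
        INSERT INTO songplays (start_time, user_id, level, song_id,
                               artist_id, session_id, duration, user_agent)
        SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
               e.userId, e.level, s.song_id, s.artist_id,
               e.sessionId, s.duration, e.userAgent
        FROM staging_events e
        JOIN staging_songs s
          ON e.song = s.title AND e.artist = s.artist_name
        WHERE e.page = 'NextSong';
    """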

Database Schema for Song Play Analysis

  • Staging Tables

staging_events 
          - artist          
          - auth
          - firstName
          - gender
          - itemInSession
          - lastName
          - length
          - level
          - location
          - method
          - page
          - registration
          - sessionId
          - song
          - status
          - ts
          - userAgent
          - userId

staging_songs 
          - artist_id        
          - artist_latitude
          - artist_location
          - artist_longitude
          - artist_name
          - duration
          - num_songs
          - song_id
          - title
          - year
  • Fact Table

songplays 
          - songplay_id       PRIMARY KEY
          - start_time
          - user_id
          - level
          - song_id
          - artist_id
          - session_id
          - duration
          - user_agent
  • Dimension Tables

users 
          - user_id           PRIMARY KEY
          - first_name
          - last_name
          - gender
          - level

songs 
          - song_id           PRIMARY KEY
          - title
          - artist_id
          - year
          - duration

artists 
          - artist_id         PRIMARY KEY
          - name
          - location
          - latitude
          - longitude

time 
          - start_time        PRIMARY KEY
          - hour
          - day
          - week
          - month
          - year
          - weekday
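With the star schema in place, analytical queries like those in etl_test.ipynb become straightforward. A minimal sketch of one such query against this schema (the cluster endpoint, database name, user, and password are placeholders, and 5439 is assumed as the cluster port):

    import psycopg2

    # Placeholder connection details; use your own cluster endpoint and credentials
    conn = psycopg2.connect(
        host='<redshift-cluster-1 endpoint>',
        dbname='dev', user='awsuser', password='<your-password>', port=5439,
    )
    cur = conn.cursor()

    # Top 10 most-played songs: join the fact table to the songs dimension
    cur.execute("""
        SELECT s.title, COUNT(*) AS plays
        FROM songplays sp
        JOIN songs s ON sp.song_id = s.song_id
        GROUP BY s.title
        ORDER BY plays DESC
        LIMIT 10;
    """)
    for title, plays in cur.fetchall():
        print(title, plays)

    conn.close()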

NOTE:

  • The SERIAL type in Postgres is not supported in Redshift. The equivalent in Redshift is IDENTITY(0,1), which you can read more about in the Redshift CREATE TABLE docs.
  • Amazon Redshift automatically assigns a NOT NULL condition to columns defined as PRIMARY KEY. You don't need to specify it separately. This can simplify the table create scripts.
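Put together, the two notes above suggest DDL along the following lines for the songplays fact table. This is a sketch of what the CREATE statement in sql_queries.py might look like, not the exact project code; the column types are assumptions.

    songplay_table_create = """
        CREATE TABLE IF NOT EXISTS songplays (
            songplay_id  INT IDENTITY(0,1) PRIMARY KEY,  -- IDENTITY(0,1) replaces Postgres SERIAL;
                                                         -- PRIMARY KEY already implies NOT NULL
            start_time   TIMESTAMP,
            user_id      INT,
            level        VARCHAR,
            song_id      VARCHAR,
            artist_id    VARCHAR,
            session_id   INT,
            duration     FLOAT,
            user_agent   VARCHAR
        );
    """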

Setup & Run Jupyter Notebooks in VS Code w/ Virtual Env & Kernels

  • create a virtual environment

    python3 -m venv udacity-dend 
    
  • activate the virtual env

    source udacity-dend/bin/activate
    
  • Installation

    pip install jupyterlab
    
    pip install ipykernel
    

    Validate that the install has succeeded by running jupyter-lab from your command line. A new tab should open in your browser, with the JupyterLab application running.

    • install useful Python packages in this virtual env
    pip install boto3
    pip install psycopg2
    
  • register the new virtual env with Jupyter so that you can use it within JupyterLab

    python3 -m ipykernel install --user --name=udacity-dend
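
Once the kernel is registered, select it in VS Code or JupyterLab and run a quick sanity check from a notebook cell, for example confirming that boto3 can reach the project bucket. This sketch assumes credentials were already set with aws configure and that the bucket is in us-west-2.

    import boto3

    # Quick check from a notebook cell: can boto3 see the project data in S3?
    s3 = boto3.resource('s3', region_name='us-west-2')
    bucket = s3.Bucket('udacity-dend')
    for obj in list(bucket.objects.filter(Prefix='song_data/A/A/A/'))[:5]:
        print(obj.key)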