- Model user activity data (JSON) to create a database and ETL pipeline in Postgres for a music streaming app startup called Sparkify
- Define Fact and Dimension tables for a STAR schema optimized for queries on song play analysis
- Insert data into new tables
Sparkify's Analytics team will be able to query the newly designed relational database with ease to understand user behaviors such as what songs users are listening to.
- PostgreSQL
- Python
- Jupyter Notebook
|____data
| |____log_data # simulated app user activity logs
| |____song_data # metadata about a song and the artist of that song
|
|____notebook
| |____etl.ipynb # ETL processes to develop ETL builder
| |____test.ipynb # to test ETL builder
|
|____script
| |____create_tables.py # script to create the Sparkify databases and the Fact & Dimension tables
| |____etl.py # ETL pipeline builder
| |____sql_queries.py # ETL pipeline builder help - DDL queries & Find Song DQL query
songplays
- songplay_id PRIMARY KEY
- start_time
- user_id
- level
- song_id
- artist_id
- session_id
- duration
- user_agent
users
- user_id PRIMARY KEY
- first_name
- last_name
- gender
- level
songs
- song_id PRIMARY KEY
- title
- artist_id
- year
- duration
artists
- artist_id PRIMARY KEY
- name
- location
- latitude
- longitude
time
- start_time PRIMARY KEY
- hour
- day
- week
- month
- year
- weekday
- Run
create_tables.py
in the terminal to create/reset databases/tables using commandpython create_tables.py
- Run
etl.py
in the terminal to complete the ETL process using commandpython etl.py
- process the
song_data
dataset to create thesongs
andartists
dimensional tables, and insert record into the tables - process the
log_data
dataset to to create thetime
andusers
dimensional tables, as well as thesongplays
fact table; and insert records into the tables - run
test.ipynb
to confirm the creation of the tables with the correct columns, and to confirm that the records were successfully inserted into each table
NOTE: You will not be able to run
test.ipynb
,etl.ipynb
, oretl.py
until you have runcreate_tables.py
at least once to create the sparkifydb database, which these other files connect to.