Attempt to predict which musical artist will be played on Thursdays on 89.3 The Current (http://thecurrent.org).
Create a Python virtual environment and install dependencies
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt
The Current posts their playlist by hour or by day to https://www.thecurrent.org/playlist/the-current/2021-01-01.
The code is broken into the major components:
- playlist.py - Retrieves HTML and parses the song information from https://www.thecurrent.org/playlist/the-current/.
- html_tasks.py - Tasks to orchestrates HTML retrieval using playlist.py
- csv_tasks.py - Tasks to parse out data from HTML retrieved from html_tasks.py
A simple example of usage. The following code will retrieve the songs played for a specific hour.
import playlist
playlist.get_hour(2016, 1, 24, 0)
In order to prevent DoS'ing The Current website, a Luigi-based pipeline exists which downloads the HTML for a given day, parses the HTML for the articles, and save the results off as a CSV.
The following command will retrieve the HTML, save it to the output/html
directory, parse the HTML for articles, and save the results as a CSV in the output/csv
directory. Execute the pipeline like song
Retrieve and save a single hour HTML
export PYTHONPATH=.
luigi --module html_tasks SaveHourHtmlToLocal --date=2016-01-01 --hour=0 --local-scheduler
Retrieve and save a full day of HTML by hour (24 hours)
export PYTHONPATH=.
luigi --module html_tasks SaveHourHtmlToLocal --date=2016-01-01 --local-scheduler
Hourly can be...alot. This allows retrieval for a full day
export PYTHONPATH=.
luigi --module html_tasks SaveDayHtmlToLocal --date=2016-01-01 --local-scheduler
Retrieve and save an entire month of HTML
export PYTHONPATH=.
luigi --module html_tasks SaveMonthHtmlToLocal --year=2016 --month=1 --local-scheduler
Retrieve and save an entire year of HTML
export PYTHONPATH=.
luigi --module html_tasks SaveYearHtmlToLocal --year=2016 --local-scheduler
CSV generation requires the HTML has been saved down. If the HTML hasn't been retrieved yet the CSV task will retrieve the HTML and then create the file(s).
Convert a single day's HTML into the song CSV
export PYTHONPATH=.
luigi --module --module csv_tasks ConvertDayHtmlToCsv --date=2016-01-01 --local-scheduler --workers=10
field | Type | Description |
---|---|---|
id | string | A unique identifier of the hashed datetime, artist, and title |
datetime | datetime | The time the song was played at |
artist | string | The name of the artist who played the song |
title | string | The title of the song |
year | integer | The year the song was played on the Current |
month | integer | The month the song was played on the Current |
day | integer | The day the song was played on the Current |
day_of_week | string | The friendly name of the day of week the song was played |
week | integer | The week of the year the song was played |
hour | integer | The hour block the song was played (0-23) |
In VSCode change the Jupyter Notebook File Root
setting to
${workspaceFolder}