sstoikov/piki-music-dataset

Piki Music Dataset

We present the Piki Music dataset with the goal of enabling researchers and practitioners from the RecSys community to mitigate the noisy feedback and self-selection biases inherent in the data collected by existing music platforms. These biases are likely to have a significant impact on the fairness, transparency and quality of recommendation systems. Much has already been written about recommendation algorithms and evaluation metrics, and we hope this dataset helps the community focus on the impact of data collection mechanisms.

Noisy feedback biases arise in implicit data sets collected by streaming apps. Such apps collect user actions without recording the user's context and without users knowing that they are being surveyed. Consequently, a song stream from a recommended playlist may be falsely interpreted as an indication that the song was enjoyed, when in fact it was played in the background. A skipped song may be falsely interpreted as an indication that the song was disliked, when in fact the user may simply not be in the mood for the song in their present context.

Self-selection biases arise in explicit data sets collected by apps that ask users to give ratings. Since rating is optional, the users most incentivized to rate are users who are very happy or very unhappy about their experience with the rated item.

The Piki Music dataset currently consists of 8,896 anonymized users, 246,450 anonymized songs and 1,762,502 ratings, and data collection is still ongoing. The Piki Music app is available for download here.

The columns of the dataset are as follows:

• timestamp: a datetime variable

• user_id: an anonymized user id

• song_id: an anonymized song id

• liked: this is the feedback indicator, 2 if the song is superliked, 1 if the song is liked, or 0 if the song is disliked. Superliked songs are saved to a playlist.

• personalized: this is 2 if the song is hyper-personalized (by an artist that the user has already superliked), 1 if the song was recommended based on their previous choices or 0 if the song was selected randomly. The effect of algorithmic recommendations on the ratings is studied in "Interface Design to Mitigate Inflation in Recommender Systems".

• spotify_popularity: this is the popularity of the song's artist, a value between 0 and 100, with 100 being the most popular. It is published by Spotify for each artist through their publicly available API.

• treatment group: Before January 3rd 2021, users could rate a song as soon as the music video was launched; this is treatment -1. After January 3rd 2021, the dislike button is enabled after 3 seconds, the like button after 6 seconds and the superlike button after 12 seconds; this is treatment 0. Between August 19 and December 5, 2022, a randomized controlled trial was performed on 3 treatment groups: the like button was enabled after 3 seconds for group 1, after 6 seconds for group 2 and after 9 seconds for group 3. After December 5, 2022, the dislike button is enabled after 6 seconds, the like button after 8 seconds and the superlike button after 12 seconds. The effect of these interface changes across treatment groups is studied in "Interface Design to Mitigate Inflation in Recommender Systems".
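As a quick illustration of how the `liked` and `personalized` columns described above can be combined, here is a minimal sketch using only the Python standard library. It assumes the CSV header uses the column names listed above and that both columns are numerically coded. The rows in `SAMPLE_CSV` are invented for illustration and are not taken from the dataset, and the helper names `read_ratings` and `positive_rate_by_personalized` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Toy rows invented for illustration only; they are NOT taken from the dataset.
SAMPLE_CSV = """timestamp,user_id,song_id,liked,personalized,spotify_popularity
2021-02-01 10:00:00,1,10,2,1,55
2021-02-01 10:01:00,1,11,0,0,12
2021-02-01 10:02:00,2,10,1,0,55
2021-02-01 10:03:00,2,12,0,1,80
2021-02-01 10:04:00,3,11,1,2,12
"""


def read_ratings(handle):
    """Parse CSV rows and convert the two coded columns to int.

    Assumption: `liked` and `personalized` are stored as numbers
    (possibly written as floats such as "1.0").
    """
    rows = []
    for row in csv.DictReader(handle):
        row["liked"] = int(float(row["liked"]))
        row["personalized"] = int(float(row["personalized"]))
        rows.append(row)
    return rows


def positive_rate_by_personalized(rows):
    """Share of ratings that are liked or superliked (liked >= 1),
    computed separately for each value of `personalized`."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in rows:
        level = row["personalized"]
        totals[level] += 1
        if row["liked"] >= 1:
            positives[level] += 1
    return {level: positives[level] / totals[level] for level in totals}


rows = read_ratings(io.StringIO(SAMPLE_CSV))
rates = positive_rate_by_personalized(rows)
for level in sorted(rates):
    print(f"personalized={level}: positive rate {rates[level]:.2f}")
```

To try this on the released file, the in-memory sample can be replaced with `open("data/piki_dataset.csv", newline="")`, provided the file's header matches the column names assumed here.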

In this repo, we release the dataset (data/piki_dataset.csv), a data exploration Jupyter notebook (Piki Music Dataset.ipynb) and the code for conducting the experiments in "Evaluating Music Recommendations with Binary Feedback for Multiple Stakeholders".

If you're interested in a very quick intro, here is a 7 min video intro to the paper.

To get familiar with the dataset and reproduce the results in the paper, install the dependencies and then run the Python script:

python evaluate_stakeholders.py

We show that a matrix factorization algorithm trained on binary feedback (likes and dislikes) performs significantly better than one trained only on likes, for stakeholders such as consumers, well-known artists and lesser-known artists.
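To make the distinction between the two training signals concrete, the sketch below builds both kinds of training examples from `(user_id, song_id, liked)` triples. This is only one plausible reading: the exact preprocessing used in the paper is not described here, so the mapping of superlikes and likes to a positive label, and the function names `binary_feedback_examples` and `likes_only_examples`, are assumptions. The toy triples are invented for illustration.

```python
def binary_feedback_examples(ratings):
    """Keep every rating: label 1 for liked or superliked, 0 for disliked.

    Assumption: superlikes (2) and likes (1) are both treated as positive.
    """
    return [(user, song, 1 if liked >= 1 else 0) for user, song, liked in ratings]


def likes_only_examples(ratings):
    """Keep only positive ratings, discarding dislikes entirely."""
    return [(user, song, 1) for user, song, liked in ratings if liked >= 1]


# Toy (user_id, song_id, liked) triples invented for illustration only.
toy_ratings = [(1, 10, 2), (1, 11, 0), (2, 10, 1), (2, 12, 0)]

binary = binary_feedback_examples(toy_ratings)
likes = likes_only_examples(toy_ratings)

print(len(binary), "binary-feedback examples;", len(likes), "likes-only examples")
```

In the likes-only set the two disliked songs disappear, so a model trained on it cannot tell a disliked song from one the user never saw. The binary set keeps that information as explicit negative labels.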

If you are interested in using the dataset for your research, please cite our paper. Contact @sstoikov if you have any feedback or questions about the dataset.
