MxM genre distribution is too skewed #7

Open
eldrin opened this issue Jan 3, 2020 · 8 comments

Comments

@eldrin
Collaborator

eldrin commented Jan 3, 2020

This plot explains everything

mxm_toyd_genre_dist

What should we do?

  1. Run a much simpler test as a proof of concept (Rock vs. Pop?)
  2. Use another genre mapping (e.g., the MSD map)
  3. Consider a different task
@eldrin eldrin self-assigned this Jan 3, 2020
@eldrin eldrin added invalid This doesn't seem right question Further information is requested labels Jan 3, 2020
@eldrin
Collaborator Author

eldrin commented Jan 3, 2020

Also, the high number of missing genres (UNKNOWN) is another problem.

@andrew0302

andrew0302 commented Jan 6, 2020 via email

@eldrin
Collaborator Author

eldrin commented Jan 7, 2020

As for Muziekweb, we can probably use the mapping they sent me long ago, which links Spotify IDs to their internal IDs and to their old genre classification (a bit different from the current top-level genre categorization). But I guess we need to sign some documents again for this project. We also need to cross-match MSD entries to genres using the Spotify ID as a bridge, which will introduce some noise.

But anyway, I can pull up the distributions for a quick look and post them here soon.
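
A minimal sketch of the Spotify-ID bridge described above, using toy dictionaries. All IDs, the two mapping tables, and the helper name `bridge_genres` are hypothetical; the real Muziekweb map may be structured differently.

```python
# Hypothetical toy records; none of these IDs come from the real datasets.
msd_to_spotify = {          # MSD track id -> Spotify id (None if no match)
    "TR_A": "sp_1",
    "TR_B": "sp_2",
    "TR_C": "sp_3",
    "TR_D": None,
}
spotify_to_mw_genre = {     # Spotify id -> Muziekweb (old) genre label
    "sp_1": "Rock",
    "sp_2": "Pop",
    # "sp_3" is absent from the Muziekweb map
}


def bridge_genres(msd_to_spotify, spotify_to_genre):
    """Attach a genre to each MSD track via its Spotify id; drop unmatched tracks."""
    matched = {}
    for msd_id, spotify_id in msd_to_spotify.items():
        genre = spotify_to_genre.get(spotify_id)
        if genre is not None:
            matched[msd_id] = genre
    return matched


matched = bridge_genres(msd_to_spotify, spotify_to_mw_genre)
coverage = len(matched) / len(msd_to_spotify)  # share of MSD tracks that survive
print(matched, coverage)
```

Every hop in the chain can lose tracks, so reporting the coverage after each stage makes the filtering explicit.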

@eldrin
Collaborator Author

eldrin commented Jan 7, 2020

MSD - Muziekweb genre mapping

msd_mw_genre_hist

  1. The number of samples drops from 332,674 to 103,073. I expect some bias is introduced into the data by the Dutch music market (preference? demand?).
  2. It will be filtered further, down to about 75% of that (~75,000), due to missing data in the MxM database and non-English entries.
  3. The genre distribution is shown in the plot above: the top 30 genres account for ~65% of the data, and the other ~35% maps to ~2.6k more granular genres.
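
The "top 30 genres cover ~65%" figure above is a top-k coverage number. A minimal sketch of how it could be computed, assuming one genre label per track; the toy labels and the function name `top_k_coverage` are illustrative, not the real data or code.

```python
from collections import Counter


def top_k_coverage(labels, k):
    """Fraction of samples whose genre is among the k most frequent genres."""
    counts = Counter(labels)
    top = counts.most_common(k)
    return sum(n for _, n in top) / len(labels)


# Hypothetical toy labels, not the real Muziekweb data.
labels = ["Rock"] * 6 + ["Pop"] * 3 + ["Jazz"] * 1
print(top_k_coverage(labels, k=2))  # share covered by the two largest genres
```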

MSD - Tagtraum Genre set

msd_tagtraum_genre_hist

  1. The number of samples drops from 332,674 to 109,274. The filtering bias comes from the platform the researchers chose for pulling the information used in the genre mapping. (Note that Electronic is the 2nd largest genre here, unlike in the other corpora.)
  2. Again, it will be filtered further, down to about 75% of that (~75,000), due to missing data in the MxM database and non-English entries.
  3. The genre distribution is shown in the plot above.

Thoughts

  • Well, the different corpora consistently say that the landscape of Western commercial music (before the 2010s) is mostly Rock (~60-70%) and Pop (~20-30%), with a huge overlap between genres anyway.
  • Regardless, in terms of noisiness, MSD-Tagtraum is the least bad one.
  • From a coverage point of view, on the other hand, MxM is going to be the better source anyway, since they have the full catalog (with the noise we already observed above, of course).
  • The current Muziekweb data is the worst option, having both noise and coverage issues (plus the potential instability of multiple mapping stages).
  • Another thing to consider is that neither the mild skewness nor the scale is the most critical factor for this study.
  • In conclusion, my ranking is:
    MSD/Tagtraum > MxM > Muziekweb

What do you think?

@andrew0302

andrew0302 commented Jan 7, 2020 via email

@eldrin
Collaborator Author

eldrin commented Jan 7, 2020

1.1. Yeah, the MxM genre set has a huge missing chunk (UNKNOWN). But they do have more than 3M entries that we can access, so coverage of the entire music catalog might still be better with MxM.

1.2. Pop showing up everywhere in the MW catalog seems to come from the way they encode their genre hierarchy in the genre field. The entries appear to be composed as top_genre, mid_genre, leaf_genre, but I'm not completely sure because of the random-looking entries at the bottom.

1.3. Stratifying gives the model the power to explain each genre equally well. But it will be much less accurate for the majority of songs in the market than a model that learned from the skewed distribution (which, in turn, will be relatively inaccurate on the niche genres for sure). Those two approaches are just different perspectives with different purposes.

The former aims to understand the effect of the features when predicting each genre equally accurately, to learn what the true effect would be if we want to (or can) treat each genre as independent and equally distributed. The latter aims for a model that maximizes predictive power under the real-world genre distribution.
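
To make the two perspectives concrete, here is a rough sketch that builds both kinds of training sample from the same skewed toy set: one with equal-sized genres, one that keeps the source proportions. The track names, counts, and both helper functions are hypothetical and only meant as an illustration.

```python
import random
from collections import Counter, defaultdict


def group_by_genre(items):
    """Group (track, genre) pairs by genre."""
    by_genre = defaultdict(list)
    for track, genre in items:
        by_genre[genre].append((track, genre))
    return by_genre


def balanced_sample(items, seed=0):
    """Downsample every genre to the size of the rarest one (equal class sizes)."""
    rng = random.Random(seed)
    by_genre = group_by_genre(items)
    n = min(len(group) for group in by_genre.values())
    sample = []
    for group in by_genre.values():
        sample.extend(rng.sample(group, n))
    return sample


def proportional_sample(items, fraction, seed=0):
    """Sample the same fraction from every genre, preserving source proportions."""
    rng = random.Random(seed)
    sample = []
    for group in group_by_genre(items).values():
        sample.extend(rng.sample(group, round(len(group) * fraction)))
    return sample


# Hypothetical skewed toy set: 60 Rock, 30 Pop, 10 Jazz tracks.
items = (
    [(f"rock_{i}", "Rock") for i in range(60)]
    + [(f"pop_{i}", "Pop") for i in range(30)]
    + [(f"jazz_{i}", "Jazz") for i in range(10)]
)
print(Counter(g for _, g in balanced_sample(items)))
print(Counter(g for _, g in proportional_sample(items, 0.5)))
```

The balanced sample throws away most of the majority genre, which is the price of treating genres as equally distributed.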

What would be the situation/context we want to simulate? I think it must depend on both the narrative and the top-level research question we have. What do you think?

2.1. We don't need the crossover with MxM, nor with the user-listening data. So yes, the starting point of 330k may not be the correct one. But the scale will still be bounded by the size of the intersection between the genre source (e.g., MSD/T) and MxM.

For instance, MSD/T has already shrunk to ~200k, not even close to 1M. MW's scale is around 3M, but it will lose some links in the Spotify cross-mapping, which is probably irrelevant at the moment since MW has the least clean genre set. MxM, although it is skewed and has a big chunk of UNKNOWN tracks, does not suffer from such a bound, since we will have the full catalog in the end.

But again, I don't think scale is a critical issue for the research design we are going to have, since we already have a decent scale anyway (tens of thousands of tracks in any case). Actually, it CAN become a critical issue if some of the ML models we are going to use take ages to train on a large dataset. (Note that we have to run the model over and over to tune hyperparameters automatically and to cover random data splits.)
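
A back-of-the-envelope sketch of why training time matters here: the number of model fits grows as (hyperparameter configurations) × (random splits). The search space and split count below are made-up placeholders, not settings decided for this project.

```python
from itertools import product

# Hypothetical search space and repetition count, for illustration only.
grid = {"learning_rate": [0.01, 0.1], "depth": [2, 4, 8]}
n_splits = 5  # number of random train/test splits

# Enumerate every combination of hyperparameter values.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
n_fits = len(configs) * n_splits
print(n_fits)  # total number of times the model has to be trained
```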

@andrew0302

andrew0302 commented Jan 8, 2020 via email

@eldrin
Collaborator Author

eldrin commented Jan 8, 2020

1. I feel it's hard to say what the most common practice is for music genre classification in the MIR community, because the task is in a phase of gradual death through the collective efforts of reviewers (including myself 😂). But if we define music genre classification as a real-world task for industry partners, then yes, I think stratifying the genre distribution makes more sense. (I just realized that stratifying IS keeping the source distribution, whatever it is.)

2. We can impute those, but I think the gain from having imputed labels is not that substantial. They would be somewhat erroneous predictions from a model trained on the rest of the data, which brings another set of validation issues to the table, and a model that includes those entries in its training set will not benefit much from them. There is another family of techniques that handles unlabeled observations in a much better way, at least in theory (semi-supervised learning), but prior work has shown that they do not help much (minor gains, around 0-5%). Bringing in that technique would also complicate the story. I think we can even drop those entries, given that we will have a healthy amount of coverage anyway. The remaining question is then whether we are happy with the MxM-minus-UNKNOWN genre distribution over the other options (say, MSD/T).
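
A small sketch of the "drop UNKNOWN" option: remove the unlabeled tracks, then look at the renormalized genre shares and at how much data is retained. The toy labels and the function name `genre_distribution` are hypothetical.

```python
from collections import Counter


def genre_distribution(labels, drop=("UNKNOWN",)):
    """Return (genre -> share among kept labels, fraction of labels kept)."""
    kept = [g for g in labels if g not in drop]
    counts = Counter(kept)
    shares = {genre: n / len(kept) for genre, n in counts.most_common()}
    return shares, len(kept) / len(labels)


# Hypothetical toy labels, not the real MxM data.
labels = ["Rock"] * 5 + ["Pop"] * 3 + ["UNKNOWN"] * 2
shares, retained = genre_distribution(labels)
print(shares, retained)
```

Comparing `shares` against the MSD/T distribution would be one way to answer the question above.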


2 participants