MxM genre distribution is too skewed #7

eldrin · 2020-01-03T17:24:16Z

This plot explains everything

what should we do?

Run a much simpler test as a proof of concept (Rock vs. Pop?)
Use another genre mapping (e.g. the MSD map)
Consider a different task

eldrin · 2020-01-03T17:25:36Z

Also, the high number of missing genres (UNKNOWN) is another problem.

andrew0302 · 2020-01-06T13:19:07Z

yeah I agree. I think we may have to either use the MSD mapping, or perhaps use Muziekweb and somehow only use the higher-level genre mappings. What do you think?

eldrin · 2020-01-07T00:20:48Z

As for Muziekweb, we can probably use the map they sent me long ago, which links Spotify ids to their internal ids and to their old genre classification (a bit different from the current top-level genre categorization). But I guess we need to sign some documents again for this project. We would also need to cross-match MSD entries to genres using the Spotify id as a bridge, which will introduce some noise.

But aaaaany way, I can pull up the distributions for a quick look and put them here soon.
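The Spotify-id bridging mentioned above can be sketched roughly as follows. This is only an illustration with made-up toy tables: the ids, the dictionary names, and the `bridge_genres` helper are all hypothetical, since the actual Muziekweb map and its format are not shown in this thread.

```python
# Hypothetical toy tables; the real files and id formats are not given in this thread.
msd_to_spotify = {          # MSD track id -> Spotify id
    "TR0001": "sp_a",
    "TR0002": "sp_b",
    "TR0003": "sp_c",
}
spotify_to_mw_genre = {     # Spotify id -> Muziekweb genre label
    "sp_a": "Rock",
    "sp_b": "Pop",
    # "sp_c" has no Muziekweb entry, so TR0003 drops out of the result
}


def bridge_genres(msd_to_spotify, spotify_to_mw_genre):
    """Map MSD ids to genres via the Spotify id; unmatched tracks are dropped."""
    matched = {}
    for msd_id, spotify_id in msd_to_spotify.items():
        genre = spotify_to_mw_genre.get(spotify_id)
        if genre is not None:
            matched[msd_id] = genre
    return matched


msd_genres = bridge_genres(msd_to_spotify, spotify_to_mw_genre)
retained = len(msd_genres) / len(msd_to_spotify)
print(msd_genres, f"retained {retained:.0%}")
```

Every extra mapping stage can only lose tracks or add mismatches, which is the noise and coverage concern raised here.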

eldrin · 2020-01-07T14:00:32Z

MSD - Muziekweb genre mapping

[image: msd_mw_genre_hist] <https://user-images.githubusercontent.com/5870216/71899014-1312b700-315b-11ea-800f-bb7165850e90.png>

1. The number of samples is filtered down from 332,674 to 103,073. I expect some bias was introduced into the data by the Dutch music market (~preference? demand?).
2. It will be further filtered down to about 75% of that (~75,000) due to missing data in the MxM database and non-English entries.
3. The genre distribution looks like this (top 30 genres, accounting for ~65% of the data; the other 35% is mapped to more granular genres, ~2.6k of them).
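For reference, the two numbers quoted in this list (how many samples survive the filtering, and how much of the data the top-k genres account for) can be computed along these lines. A minimal sketch: the label list is toy data and `top_k_coverage` is a hypothetical helper, not code from this project. Only the two sample counts are taken from the comment.

```python
from collections import Counter


def top_k_coverage(genre_labels, k):
    """Return the k most frequent genres and the share of samples they cover."""
    counts = Counter(genre_labels)
    top = counts.most_common(k)
    covered = sum(n for _, n in top)
    return top, covered / len(genre_labels)


# Share of the original samples that survive the genre mapping
# (counts quoted in the comment above).
retained_share = 103_073 / 332_674
print(f"retained after mapping: {retained_share:.1%}")

# Toy labels standing in for the real genre column.
labels = ["Rock"] * 6 + ["Pop"] * 3 + ["Jazz"] * 1
top, share = top_k_coverage(labels, k=2)
print(top, f"top-2 coverage: {share:.0%}")
```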

MSD - Tagtraum Genre set

[image: msd_tagtraum_genre_hist] <https://user-images.githubusercontent.com/5870216/71899035-29b90e00-315b-11ea-9f89-6c63c3d4f876.png>

1. The number of samples is filtered down from 332,674 to 109,274. The filtering bias here comes from the platform the researchers chose to pull the genre-mapping information from (look at Electronic, which is 2nd, unlike in the other corpora).
2. Again, it will be further filtered down to about 75% of that (~75,000) due to missing data in the MxM database and non-English entries.
3. The genre distribution looks like this.

Thoughts

- Well, the different corpora consistently say that the world map of western commercial music (before the 2010s) is mostly Rock (~60-70%) and Pop (~20-30%), with a huge overlap between genres anyway.
- Regardless, in terms of noisiness, MSD-Tagtraum is the least bad one.
- From a coverage point of view, on the other hand, MxM is going to be the better source anyway, since they have the full catalog (of course, with the noise we already observed up there).
- The current Muziekweb data is the worst option, having both the noisiness and the coverage issue (and the potential instability of multiple mapping stages).
- Another thing we should consider here is that neither the mild skewness nor the scale is the most critical factor for this study.
- In conclusion, my ranking is: MSD/Tagtraum > MxM > Muziekweb

What do you think?

andrew0302 · 2020-01-07T14:27:19Z

1_ Honestly, no option looks *great*. I agree the MSD/Tagtraum looks better, but I'm scratching my head as to why there's so little pop and so much rock. The MxM might look like it has better coverage, but a huge chunk is an UNKNOWN genre, which is the same as missing, no? Looking at the Muziekweb, there seems to be a "pop" tag with almost everything, any reason as to why?

Anyway. Considering how much missing data there will be, I'm wondering whether we should just pull out a random sample of tracks from the MSD/T classification, such that we have an even number of tracks for each major genre. The MSD/T does look the cleanest, provided we ignore the "World" category, which is essentially meaningless.

2_ Do we *have* to go by the MxM crossover for this task? Over the vacation, I was wondering if we should use different datasets for the tasks after all. Not sure if this is a stupid idea, but if we go for genre classification by lyrics, we can ignore user listening history and have a bigger dataset, right? On the other hand, it's a step away from the style of analysis in the LOTR paper, and it will take up more space. I would prefer it all in one dataset, but that might be asking too much.

eldrin · 2020-01-07T15:18:41Z

1.1. Yeah. The MxM genre set has a huge missing chunk (UNKNOWN). But they do have more than 3M entries that we can access, so coverage of the entire music catalog might still be better with MxM.

1.2. The pop everywhere in the MW catalog seems to come from the way they put their genre hierarchy on the genre line. The entries seem to be composed as top_genre, mid_genre, leaf_genre, but I'm not completely sure because of the random-looking entries at the bottom.

1.3. Stratifying gives the model the power to explain each genre equally. But it will be much less accurate for the majority of songs in the market than a model that learned from the skewed distribution (which, in turn, will be relatively inaccurate on the niche genres for sure). The two approaches are just different perspectives, or serve different purposes.

The former is for understanding the effect of the features on predicting each genre equally accurately, to learn what the true effect would be if we want to, or can, treat each genre as independent and equally distributed. The latter aims for a model that maximizes predictive power under the real-world genre distribution.
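The two options contrasted above (an even number of tracks per genre versus keeping the skewed source distribution) can be sketched as two sampling routines. This is an illustration on toy data only: the function names, the track ids, and the 60/30/10 genre split are assumptions made for the example, not the project's actual pipeline.

```python
import random
from collections import Counter, defaultdict


def group_by_genre(tracks):
    """Group (track_id, genre) pairs into {genre: [track_id, ...]}."""
    groups = defaultdict(list)
    for track_id, genre in tracks:
        groups[genre].append(track_id)
    return groups


def balanced_sample(tracks, per_genre, seed=0):
    """Draw the same number of tracks from every genre (fewer if a genre is small)."""
    rng = random.Random(seed)
    sample = []
    for genre, ids in group_by_genre(tracks).items():
        chosen = rng.sample(ids, min(per_genre, len(ids)))
        sample.extend((t, genre) for t in chosen)
    return sample


def proportional_sample(tracks, fraction, seed=0):
    """Draw the same fraction from every genre, keeping the source proportions."""
    rng = random.Random(seed)
    sample = []
    for genre, ids in group_by_genre(tracks).items():
        n = max(1, round(fraction * len(ids)))
        sample.extend((t, genre) for t in rng.sample(ids, n))
    return sample


# Toy catalog with a skewed genre distribution (hypothetical numbers).
tracks = (
    [(f"rock{i}", "Rock") for i in range(60)]
    + [(f"pop{i}", "Pop") for i in range(30)]
    + [(f"jazz{i}", "Jazz") for i in range(10)]
)

balanced = Counter(g for _, g in balanced_sample(tracks, per_genre=10))
proportional = Counter(g for _, g in proportional_sample(tracks, fraction=0.5))
print("balanced:", dict(balanced))
print("proportional:", dict(proportional))
```

A model trained on the balanced sample sees every genre equally often, while one trained on the proportional sample sees mostly Rock and Pop, which mirrors the trade-off described in 1.3.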

What would be the situation/context we want to simulate? I think it must depend on both the narrative and the top-level research question we have. What do you think?

2.1. We don't need to use the MxM crossover, nor the user-listening data. So yeah, the starting point of 330k may not be the correct one. But it will still be bounded by the size of the intersection between the genre source (i.e. MSD/T) and MxM.

For instance, MSD/T has already shrunk to ~200k, not even close to 1M. MW's scale is around 3M, but it will lose some links in the Spotify cross-mapping, which is probably irrelevant at the moment since MW has the least clean genre set. MxM, although it's skewed and has a big chunk of UNKNOWN tracks, does not suffer from such a bound, since we will have the full catalog in the end.

But again, I don't think scale is a critical issue for the research design we are going to have, since we already have a decent scale anyway (tens of thousands in any case). Actually, it CAN become a critical issue if some of the ML models we are going to use take ages because of the large scale. (Note that we have to run the models over and over to tune hyperparameters automatically and to cover random data splits.)
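To make the cost argument concrete, here is a back-of-the-envelope sketch of how repeated runs multiply. All the numbers (50 hyperparameter configurations, 10 random splits, 5 folds, 2 minutes per fit) are hypothetical placeholders, not measurements from this project.

```python
def total_model_fits(n_hyperparameter_configs, n_random_splits, n_folds=1):
    """Number of training runs when every config is fit on every split and fold."""
    return n_hyperparameter_configs * n_random_splits * n_folds


def estimated_hours(n_fits, minutes_per_fit):
    """Total wall-clock time in hours, assuming the fits run one after another."""
    return n_fits * minutes_per_fit / 60


# Hypothetical budget: 50 configs x 10 random splits x 5 folds.
fits = total_model_fits(n_hyperparameter_configs=50, n_random_splits=10, n_folds=5)
hours = estimated_hours(fits, minutes_per_fit=2)
print(f"{fits} fits, roughly {hours:.0f} hours if each fit takes 2 minutes")
```

Since the time per fit usually grows with the dataset size, a larger catalog makes this budget grow accordingly.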

andrew0302 · 2020-01-08T15:35:16Z

1_ I see what you're saying. Well, what we want to do is demonstrate the viability of using lyrics for MIR tasks. What is the most typical application in MIR work? I'm guessing that the latter option, which maxes out prediction for real world cases, is the most common one - if that's the case, then I think my solution isn't a good one. Also, if that's the case then we might want to go with whatever has the most coverage. Maybe we can check, percentage-wise, the proportion of UNKNOWN in the MxM vs. the coverage in MSD/T? If one appears to have more coverage than the other, I think it's defensible to go with that one, keeping in mind that this is a proof-of-concept paper.

In this case then, the narrative would be that lyrics have potential utility in MIR work. We demonstrate this by tackling 3 MIR tasks using lyrical features; one of the tasks is genre classification. As real-world genre classification is the most common scenario, we chose the vocabulary with the most coverage of our given dataset - MSD/T. You know this field much better than me. Would that work?

2_ The other thing I'm thinking is whether we can justify some sort of imputation for the UNKNOWNs. Tell me what you think of this: if we can find a way to examine agreement between different classifications and MxM, maybe we can see if the classification with greatest overlap can help us determine the genres of the UNKNOWN items, or at least some of them. I'm not sure how well it would work because there are a lot of items in UNKNOWN, and I don't know if there's a good objective way to quantify the agreement. But it's a thought.
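Both checks raised here, the share of non-UNKNOWN labels and an objective agreement score between two genre sources, can be sketched as below. Cohen's kappa is one standard chance-corrected agreement measure and is swapped in purely as an illustration; nobody in this thread proposed it. The sketch also assumes both sources have already been mapped onto a shared genre vocabulary, and the track ids and labels are toy data.

```python
from collections import Counter


def known_fraction(labels, unknown="UNKNOWN"):
    """Share of tracks whose genre is not the UNKNOWN placeholder."""
    known = sum(1 for genre in labels.values() if genre != unknown)
    return known / len(labels)


def cohens_kappa(labels_a, labels_b, unknown="UNKNOWN"):
    """Chance-corrected agreement on tracks with a known genre in both sources."""
    shared = [
        t for t in labels_a
        if t in labels_b and labels_a[t] != unknown and labels_b[t] != unknown
    ]
    if not shared:
        raise ValueError("no tracks with a known genre in both sources")
    n = len(shared)
    observed = sum(labels_a[t] == labels_b[t] for t in shared) / n
    freq_a = Counter(labels_a[t] for t in shared)
    freq_b = Counter(labels_b[t] for t in shared)
    # Agreement expected by chance, given each source's own genre frequencies.
    expected = sum(freq_a[g] * freq_b[g] for g in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels (hypothetical), assumed to use the same genre names in both sources.
mxm = {"t1": "Rock", "t2": "Rock", "t3": "Pop", "t4": "UNKNOWN", "t5": "Pop"}
msdt = {"t1": "Rock", "t2": "Pop", "t3": "Pop", "t4": "Rock", "t5": "Pop"}

mxm_known = known_fraction(mxm)
kappa = cohens_kappa(mxm, msdt)
print(f"MxM known share: {mxm_known:.0%}, kappa vs. MSD/T: {kappa:.2f}")
```

A kappa near 1 would suggest the second source is a plausible proxy for the UNKNOWN items, while a value near 0 means the two sources agree no more than chance would predict.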

eldrin · 2020-01-08T23:29:57Z

1. I feel it's kind of hard to say what the most common practice is for music genre classification in the MIR community, because the task is in a phase of gradual death through the collective efforts of reviewers (including myself 😂). But if we define music genre classification as a real-world task for industry partners, then yes, I think stratifying the genre distribution makes more sense. (I just realized that stratifying IS keeping the source distribution, whatever it is.)

2. We can impute those, but I don't think the gain from imputed labels is that substantial. They will be somewhat erroneous predictions from a model trained on the rest of the data, which brings another set of validation issues to the table, and a model that includes those entries in its training set will not benefit much from them. There is another family of techniques that handles such unlabeled observations much better, at least in theory (semi-supervised learning), but studies have shown they don't help much (minor gains of around 0~5%). Bringing in that technique would also complicate the story. I think we can even dump them, given that we are going to have a healthy amount of coverage anyway. But then the remaining question is whether we are happy with the MxM-minus-UNKNOWN genre distribution over the other options (say, MSD/T).

eldrin self-assigned this Jan 3, 2020

eldrin added the invalid and question labels Jan 3, 2020
