MxM genre distribution is too skewed #7
Also, the high number of missing genres (UNKNOWN) is another problem. |
Yeah, I agree. I think we may have to either use the MSD mapping, or perhaps
use Muziekweb and somehow only use the higher-level genre mappings. What do
you think?
…On Fri, Jan 3, 2020 at 6:25 PM Jaehun Kim ***@***.***> wrote:
Also, the high number of missing genres (UNKNOWN) is another problem.
--
Andrew. M. Demetriou
--
twitter <https://twitter.com/a_m_demetriou>
researchgate
<https://www.researchgate.net/profile/Andrew_Demetriou/contributions>
linkedin <https://www.linkedin.com/in/andrew-demetriou-5169667/>
|
As for Muziekweb, we can probably use the map they sent me long ago, which maps Spotify IDs to their internal IDs and to their old genre classification (a bit different from the current top-level genre categorization). But I guess we would need to sign some documents again for this project. We would also need to cross-match MSD entries to genres using the Spotify ID as a bridge, which will add some noise. Anyway, I can pull up the distribution for a quick look and post it here soon. |
MSD - Muziekweb genre mapping
MSD - Tagtraum Genre set
Thoughts
What do you think? |
1_
Honestly, no option looks *great*. I agree the MSD/Tagtraum looks better,
but I'm scratching my head as to why there's so little pop and so much
rock. The MxM might look like it has better coverage, but a huge chunk is
an UNKNOWN genre, which is the same as missing, no? Looking at the
muziekweb, there seems to be a "pop" tag with almost everything, any reason
as to why?
Anyway. Considering how much missing data there will be, I'm wondering whether we
should just pull out a random sample of tracks from the MSD/T
classification, such that we have an even number of tracks for each major
genre. The MSD/T does look the cleanest, provided we ignore the
"World" category, which is essentially meaningless.
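A minimal sketch of the even-per-genre sample suggested above, assuming tracks are available as (track_id, genre) pairs. The function name `balanced_sample`, the quota, and the toy data are illustrative assumptions, not part of the actual pipeline.

```python
import random
from collections import defaultdict


def balanced_sample(tracks, per_genre, exclude=("World",), seed=0):
    """Draw the same number of tracks from every genre.

    tracks: iterable of (track_id, genre) pairs (hypothetical input format).
    Genres listed in `exclude`, or with fewer than `per_genre` tracks, are dropped.
    """
    rng = random.Random(seed)
    by_genre = defaultdict(list)
    for track_id, genre in tracks:
        if genre not in exclude:
            by_genre[genre].append(track_id)

    sample = {}
    for genre, ids in by_genre.items():
        if len(ids) >= per_genre:  # skip genres too small to fill the quota
            sample[genre] = rng.sample(ids, per_genre)
    return sample


# Toy data with a skew similar in spirit to the thread (mostly Rock).
toy = (
    [(f"rock{i}", "Rock") for i in range(60)]
    + [(f"pop{i}", "Pop") for i in range(25)]
    + [(f"jazz{i}", "Jazz") for i in range(10)]
    + [(f"world{i}", "World") for i in range(5)]
)
picked = balanced_sample(toy, per_genre=10)
print({genre: len(ids) for genre, ids in picked.items()})
```

The quota is capped by the smallest genre that is kept, so with a heavily skewed source most of the Rock tracks would go unused.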
2_
Do we *have* to go by the MxM crossover for this task? Over the vacation, I
was wondering if we should use different datasets for the tasks after all.
Not sure if this is a stupid idea, but if we go for genre classification by
lyrics, we can ignore user listening history and have a bigger dataset,
right? On the other hand, it's a step away from the style of analysis in
the LOTR paper, and it will take up more space. I would prefer it all in
one dataset, but that might be asking too much.
…On Tue, Jan 7, 2020 at 3:00 PM Jaehun Kim ***@***.***> wrote:
MSD - Muziekweb genre mapping
[image: msd_mw_genre_hist]
<https://user-images.githubusercontent.com/5870216/71899014-1312b700-315b-11ea-800f-bb7165850e90.png>
1. The number of samples is filtered down from 332,674 to 103,073. I
expect some bias is introduced into the data from the Dutch music
market (~preference? demand?)
2. It will be further filtered down to about 75% of that (~75,000) due
to missing data in the MxM database and non-English entries
3. The genre distribution looks like this (top 30 genres, accounting for
~65% of the data, meaning the other 35% of the data maps to more granular
genres ~2.6k)
MSD - Tagtraum Genre set
[image: msd_tagtraum_genre_hist]
<https://user-images.githubusercontent.com/5870216/71899035-29b90e00-315b-11ea-9f89-6c63c3d4f876.png>
1. The number of samples is filtered down from 332,674 to 109,274.
The filtering bias comes from the platform the researchers chose to pull
the relevant info from for the genre mapping.
2. Again, it will be further filtered down to about 75% of that
(~75,000) due to missing data in the MxM database and non-English
entries
3. The genre distribution looks like this.
Thoughts
- well, the different corpora are actually consistently saying that the world
map of commercial music is mostly Rock (~60-70%) and Pop (~20-30%),
with a huge overlap between genres anyway.
- Regardless, in terms of noisiness, MSD-Tagtraum is the least bad
one.
- From a coverage point of view, on the other hand, MxM is going to be
the better source anyway, since they have the full catalog (of course, with
the noise we already observed up there)
- The current Muziekweb data is the worst option, having both the noisiness
and the coverage issue (and the potential instability of multiple mapping
stages)
- Another thing we should consider here is that neither the mild
skewness nor the scale is the most critical factor for this study.
- In conclusion, my ranking is:
MSD/Tagtraum > MxM > Muziekweb
What do you think?
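A quick arithmetic check of the counts quoted above, as a sketch only. The three counts and the "about 75%" rate are taken from the comment; the variable names are made up for illustration.

```python
# Counts quoted in the comment above.
msd_start = 332_674
after_genre_match = {"Muziekweb": 103_073, "Tagtraum": 109_274}

# "About 75%" is the rough share expected to survive the MxM / non-English
# filter; it is treated here as an approximate figure, not a measured one.
keep_rate = 0.75

summary = {}
for source, n in after_genre_match.items():
    summary[source] = {
        "matched_share": n / msd_start,            # share of the 332,674 tracks with a genre
        "expected_final": round(n * keep_rate),    # rough size after the second filter
    }
    print(
        f"{source}: {summary[source]['matched_share']:.1%} matched, "
        f"~{summary[source]['expected_final']:,} after the 75% filter"
    )
```

Both genre sources keep roughly a third of the starting tracks, and the projected final sizes land slightly above the ~75,000 quoted, which fits 75% being a rough estimate.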
|
1.1. Yeah, the MxM genre set has a huge missing chunk (UNKNOWN). But they do have more than 3M entries that we can access, so coverage of the entire music catalog might still be better with MxM.
1.2. "Pop" everywhere in the MW catalog seems to come from the way they put their genre hierarchy on the genre line. It seems to be composed as top_genre, mid_genre, leaf_genre, but I'm not completely sure because of the random-looking entries at the bottom.
1.3. Stratifying gives the model the power to explain each genre equally. But it will be much less accurate for the majority of songs in the market than a model that learned from the skewed distribution (which, in turn, will be relatively inaccurate on the niche genres for sure). Those two ways are just different perspectives with different purposes. The former is about understanding the effect of the features on predicting each genre equally accurately, to learn what the true effect would be if we want to, or can, treat each genre as independent and equally distributed. The latter wants a model that maximizes prediction power for the real-world genre distribution. Which situation/context do we want to simulate? I think it must depend on both the narrative and the top-level research question we have. What do you think?
2.1. We don't need to use the MxM crossover, nor user listening. So yeah, the starting point of 330k may not be the correct one. But it will still be bounded by the size of the intersection between the genre source (i.e. MSD/T) and MxM. For instance, MSD/T has already shrunk down to ~200k, not even close to 1M. MW's scale is around 3M, but it will lose some links due to the Spotify cross-mapping, which is probably irrelevant at the moment since MW has the least clean genre set. MxM, although it's skewed and has a big chunk of UNKNOWN tracks, does not suffer from such a bound, since we will have the full catalog in the end.
But again, I don't think scale is a critical issue for the research design we are going to have, since we already have a decent scale anyway (tens of thousands in any case). Actually, it CAN become a critical issue if some of the ML models we are going to use take ages to run because of the large scale. (Note that we need to run the model over and over to tune hyperparameters automatically and to cover random data splits.) |
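A rough sketch of the repeated random splits mentioned in 2.1, for the case where the skewed source distribution is kept. It assumes genre labels come as a plain list; `stratified_split`, the 20% test share, and the toy labels are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split track indices so each genre keeps its share in train and test."""
    rng = random.Random(seed)
    by_genre = defaultdict(list)
    for idx, genre in enumerate(labels):
        by_genre[genre].append(idx)

    train, test = [], []
    for idxs in by_genre.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return train, test


# Toy labels with a Rock-heavy skew; real labels would come from MSD/T or MxM.
labels = ["Rock"] * 70 + ["Pop"] * 25 + ["Jazz"] * 5

# One split per seed, so every tuning run sees a different random partition.
splits = [stratified_split(labels, seed=s) for s in range(5)]
for train, test in splits:
    print(len(train), len(test), Counter(labels[i] for i in test))
```

Each split preserves the genre proportions of the source, which corresponds to the "real-world distribution" perspective. The equal-per-genre perspective would need a balanced sample instead.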
1_
I see what you're saying. Well, what we want to do is demonstrate the
viability of using lyrics for MIR tasks. What is the most typical
application in MIR work? I'm guessing that the latter option, which maxes
out prediction for real world cases is the most common one - if that's the
case, then I think my solution isn't a good one. Also, if that's the case
then we might want to go with whatever has the most coverage. Maybe we can
check, percentage-wise, the proportion of UNKNOWN in the MxM vs. the
coverage in MSD/T? If one appears to have more coverage than the other, I
think it's defensible to go with that one, keeping in mind that this is a
proof-of-concept paper.
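A small sketch of the percentage check proposed above, assuming each genre source can be loaded as a dict from track id to genre. The ids, the toy labels, and the `coverage` helper are hypothetical. The key assumption is that both sources are measured against the same base set of tracks, so a track missing from one mapping counts the same as UNKNOWN.

```python
def coverage(track_ids, genre_of, missing=("UNKNOWN", "", None)):
    """Fraction of tracks in `track_ids` with a usable genre in `genre_of`."""
    track_ids = list(track_ids)
    if not track_ids:
        return 0.0
    # Tracks absent from the mapping return None and count as missing.
    labeled = sum(1 for t in track_ids if genre_of.get(t) not in missing)
    return labeled / len(track_ids)


# Toy example: five tracks in the lyrics dataset, two genre sources.
base_tracks = ["t1", "t2", "t3", "t4", "t5"]
mxm_genres = {"t1": "Rock", "t2": "UNKNOWN", "t3": "Pop", "t4": "UNKNOWN", "t5": "Rock"}
msdt_genres = {"t1": "Rock", "t3": "Rock", "t5": "Jazz"}  # t2 and t4 not in the mapping

mxm_cov = coverage(base_tracks, mxm_genres)
msdt_cov = coverage(base_tracks, msdt_genres)
print(f"MxM: {mxm_cov:.0%} labeled, MSD/T: {msdt_cov:.0%} labeled")
```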
In this case then, the narrative would be that lyrics have potential
utility in MIR work. We demonstrate this by tackling 3 MIR tasks using
lyrical features; one of the tasks is genre classification. As real-world
genre classification is the most common scenario, we chose the vocabulary
with the most coverage of our given dataset - MSD/T.
You know this field much better than me. Would that work?
2_
The other thing I'm thinking is whether we can justify some sort of
imputation for the UNKNOWNs. Tell me what you think of this: if we can find
a way to examine agreement between different classifications and MxM, maybe
we can see if the classification with greatest overlap can help us
determine the genres of the UNKNOWN items, or at least some of them.
I'm not sure how well it would work because there are a lot of items in
UNKNOWN, and I don't know if there's a good objective way to quantify the
agreement. But it's a thought.
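One possible way to quantify that agreement, sketched here with raw agreement and Cohen's kappa over the tracks labeled in both sources. This assumes both genre vocabularies have already been mapped to a common set of labels, which is a non-trivial step not shown. The function name and the toy labels are illustrative.

```python
from collections import Counter


def agreement(labels_a, labels_b, missing="UNKNOWN"):
    """Return (n_shared, raw agreement, Cohen's kappa) for two label dicts.

    Only tracks present in both dicts with a non-missing label are compared.
    """
    shared = [
        t for t in labels_a
        if t in labels_b and labels_a[t] != missing and labels_b[t] != missing
    ]
    n = len(shared)
    if n == 0:
        return 0, float("nan"), float("nan")

    observed = sum(labels_a[t] == labels_b[t] for t in shared) / n
    count_a = Counter(labels_a[t] for t in shared)
    count_b = Counter(labels_b[t] for t in shared)
    # Agreement expected by chance, from each source's label frequencies.
    expected = sum(count_a[g] * count_b[g] for g in count_a) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else float("nan")
    return n, observed, kappa


mxm = {"t1": "Rock", "t2": "Pop", "t3": "UNKNOWN", "t4": "Rock", "t5": "Jazz"}
msdt = {"t1": "Rock", "t2": "Rock", "t3": "Pop", "t4": "Rock", "t5": "Jazz", "t6": "Pop"}

n_shared, raw, kappa = agreement(mxm, msdt)
print(n_shared, raw, round(kappa, 3))
```

Kappa corrects for the agreement expected by chance, which matters here because a Rock-heavy distribution inflates raw agreement on its own.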
…On Tue, Jan 7, 2020 at 4:18 PM Jaehun Kim ***@***.***> wrote:
*1.1.* yeah. MxM genre set has a huge missing chunk (UNKNOWN). But they
do have more than 3M entries that we can access. So coverage for the entire
music catalog might still be better with MxM.
*1.2.* "pop" everywhere in the MW catalog seems to come from the way they put their
genre hierarchy on the genre line. It seems to be composed as top_genre,
mid_genre, leaf_genre, but I'm not completely sure because of the
random-looking entries at the bottom.
*1.3.* Stratifying gives the model the power to explain each genre
equally. But it will be much less accurate for the majority of songs in
the market than a model that learned from the skewed distribution
(which, in turn, will be relatively inaccurate on the niche genres for sure).
Those two ways are just different perspectives with different purposes.
The former is about understanding the effect of the features on predicting each
genre equally accurately, to learn what the true effect would be if we want
to, or can, treat each genre as independent and equally distributed. The latter
wants a model that maximizes prediction power for the real-world
genre distribution.
What would be the situation/context we want to simulate? I think it must
depend on both the narrative and the top-level research question we have.
What do you think?
*2.1.* We don't need to use the MxM crossover, nor user listening. So
yeah, the starting point of 330k may not be the correct one. But it will still
be bounded by the size of the intersection between the genre source
(i.e. MSD/T) and MxM.
For instance, MSD/T has already shrunk down to ~200k, not
even close to 1M. MW's scale is around 3M, but it will lose some links due
to the Spotify cross-mapping, which is probably irrelevant at the
moment since MW has the least clean genre set. MxM, although it's
skewed and has a big chunk of UNKNOWN tracks, does not suffer from such a bound,
since we will have the full catalog in the end.
But again, I don't think scale is a critical issue for
the research design we are going to have, since we already have a decent
scale anyway (tens of thousands in any case). Actually,
it CAN become a critical issue if some of the ML models we are going to use
take ages to run because of the large scale. (Note that we need to
run the model over and over to tune hyperparameters automatically and
to cover random data splits.)
|
1. I feel it's kind of hard to say what the most common practice is for music genre classification in the MIR community, because it's in a phase of gradual death through the collective efforts of reviewers (including myself 😂). But if we define music genre classification as a real-world task for industry partners, then yes, I think stratifying the genre distribution makes more sense. (I just realized that stratifying IS keeping whatever the source distribution is.)
2. We can impute those, but I think the gain from having imputed labels is not that substantial. They will be somewhat erroneous predictions from a model trained on the rest of the data, which brings another set of validation issues to the table, and a model that includes those entries in its training set will not benefit much from them. There is another set of techniques for those |
This plot explains everything
what should we do?
Rock vs Pop
?)