Non-English lyrics should be filtered out properly. For now (at least for the MVP project), we avoid advanced machine learning algorithms and keep the pipeline as shallow as possible, so that we can confidently validate every step, even though each step carries a bit more error.
The logic is fairly simple:
count the non-English words in each lyric (based on the nltk corpus)
based on the distribution of non-English word counts, select a threshold (i.e. a percentile)
filter out entries whose number of non-English words exceeds the threshold
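The three steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the regex tokenizer, the nearest-rank percentile, and the default of 90 are all assumptions. The English vocabulary is passed in as a plain set so the sketch runs on its own; in practice it would presumably be built from the nltk word list.

```python
import math
import re


def count_non_english(lyric, english_vocab):
    """Count tokens in `lyric` that are not in `english_vocab`.

    Tokenization is a simple assumption: lowercase runs of letters,
    so "don't" becomes "don" and "t".
    """
    tokens = re.findall(r"[^\W\d_]+", lyric.lower())
    return sum(1 for tok in tokens if tok not in english_vocab)


def percentile_threshold(counts, percentile):
    """Nearest-rank percentile of `counts`, with `percentile` in (0, 100]."""
    ordered = sorted(counts)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]


def filter_lyrics(lyrics, english_vocab, percentile=90):
    """Drop lyrics whose non-English word count exceeds the threshold."""
    counts = [count_non_english(text, english_vocab) for text in lyrics]
    threshold = percentile_threshold(counts, percentile)
    kept = [text for text, c in zip(lyrics, counts) if c <= threshold]
    return kept, threshold


# Toy vocabulary for illustration only; a real run would use a full
# English word list (for example, one built from the nltk corpus).
toy_vocab = {"i", "love", "you", "baby", "the", "night", "is", "young"}
toy_lyrics = [
    "I love you baby",
    "the night is young",
    "love is the night",
    "te quiero mucho mi amor corazon",
]
kept, threshold = filter_lyrics(toy_lyrics, toy_vocab, percentile=75)
```

Because the threshold is a percentile of the observed counts, the fraction of lyrics removed is fixed in advance by the chosen percentile, which is exactly why that choice needs validating.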
But we still need a way to validate the choice of threshold!
One way to tackle this issue is to use topic modeling as an anomaly detector. Namely, we can learn k topics without any filtering, pick the m topics that appear to be non-English clusters, then filter out the lyrics belonging to them. This procedure can be repeated iteratively, but a single cycle already seems to be enough.
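A rough sketch of the "pick m topics, then filter" step is below. It assumes a topic model has already been fitted elsewhere and has produced, for each topic, a list of top words, and for each lyric, a dominant topic id. Ranking topics automatically by the share of their top words missing from an English vocabulary is an assumption made here for illustration; the topics could just as well be picked by hand. All names are hypothetical.

```python
def non_english_ratio(words, english_vocab):
    """Fraction of `words` that are not in `english_vocab`."""
    if not words:
        return 0.0
    misses = sum(1 for w in words if w.lower() not in english_vocab)
    return misses / len(words)


def pick_non_english_topics(topic_top_words, english_vocab, m):
    """Return the ids of the m topics whose top words look least English.

    `topic_top_words` maps a topic id to that topic's top words.
    """
    ranked = sorted(
        topic_top_words,
        key=lambda t: non_english_ratio(topic_top_words[t], english_vocab),
        reverse=True,
    )
    return set(ranked[:m])


def filter_by_topic(lyrics, dominant_topic, flagged_topics):
    """Keep only lyrics whose dominant topic was not flagged."""
    return [
        text
        for text, topic in zip(lyrics, dominant_topic)
        if topic not in flagged_topics
    ]


# Toy inputs standing in for the output of a fitted topic model.
toy_vocab = {"love", "night", "baby"}
toy_topics = {
    0: ["love", "night", "baby"],
    1: ["amor", "corazon", "love"],
    2: ["liebe", "nacht", "herz"],
}
flagged = pick_non_english_topics(toy_topics, toy_vocab, m=1)
remaining = filter_by_topic(["a", "b", "c"], [0, 2, 1], flagged)
```

Running more than one cycle would mean refitting the topic model on `remaining` and repeating the same two calls.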
The effectiveness of this approach can be (roughly) measured by the total number of non-English words, as identified by nltk, that get filtered out.
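As a hedged sketch of that measurement: given the lyrics before and after filtering, count how many non-English words were removed, and what share of all non-English words that represents. The function names and the simple tokenizer are assumptions, and the vocabulary set stands in for the nltk word list.

```python
import re


def count_non_english(text, english_vocab):
    """Count letter-only tokens in `text` that are not in `english_vocab`."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return sum(1 for tok in tokens if tok not in english_vocab)


def removed_non_english(all_lyrics, kept_lyrics, english_vocab):
    """Return (removed_count, removed_share) of non-English words.

    `kept_lyrics` is assumed to be the subset of `all_lyrics` that
    survived filtering.
    """
    total = sum(count_non_english(t, english_vocab) for t in all_lyrics)
    kept = sum(count_non_english(t, english_vocab) for t in kept_lyrics)
    removed = total - kept
    share = removed / total if total else 0.0
    return removed, share


# Toy data for illustration only.
toy_vocab = {"love", "you"}
before = ["love you", "te amo", "love amor"]
after = ["love you", "love amor"]
removed, share = removed_non_english(before, after, toy_vocab)
```

This number alone says nothing about English lyrics that were removed by mistake, so it is only a rough signal, as noted above.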