English entry filtering #1

eldrin · 2019-11-16T12:27:09Z

Non-English lyrics should be filtered out properly. For now (at least for the MVP project), we avoid advanced machine learning algorithms and keep the pipeline as shallow as possible, so that we can confidently validate every step, even though each step then carries a bit more error.

The logic is fairly simple:

count the non-English words in each lyric (based on an nltk corpus)
based on the distribution of non-English word counts, select a threshold (i.e. a percentile)
filter out entries whose number of non-English words exceeds the threshold

But we still need to validate the choice of threshold!
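
As a rough illustration, the three steps could be sketched as below. This is only a sketch under assumptions: `ENGLISH_VOCAB` is a tiny hypothetical stand-in for the nltk-based English word list, and the function names, the tokenizer, and the nearest-rank percentile are illustrative choices, not the project's actual code.

```python
import math
import re

# Hypothetical stand-in vocabulary (assumption); in the pipeline this would
# be built from an nltk corpus instead.
ENGLISH_VOCAB = {
    "i", "love", "you", "the", "night", "is", "young", "we", "are", "free",
}


def count_non_english(lyric, vocab):
    """Count tokens of a lyric that are not in the English vocabulary."""
    # Match runs of letters from any script, so non-Latin words are counted too.
    tokens = re.findall(r"[^\W\d_]+", lyric.lower())
    return sum(1 for token in tokens if token not in vocab)


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, with q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def filter_english(lyrics, vocab, q=90):
    """Keep lyrics whose non-English word count does not exceed the threshold."""
    counts = [count_non_english(lyric, vocab) for lyric in lyrics]
    threshold = percentile(counts, q)
    kept = [lyric for lyric, c in zip(lyrics, counts) if c <= threshold]
    return kept, threshold
```

The percentile `q` is exactly the knob that still needs validation: a lower `q` removes more entries, including English lyrics that happen to contain many out-of-vocabulary words.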


eldrin · 2019-11-18T10:37:36Z

One way to tackle this issue is to use topic modeling as an anomaly detector. Namely, we can learn k topics without any filtering, pick the m topics that appear to be non-English clusters, then filter out the entries belonging to them. This procedure can be repeated iteratively, but a single cycle already seems to work okay.

This approach can be (roughly) evaluated by the total number of non-English words, as identified with nltk, that end up being filtered out.
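
A minimal sketch of the filtering and of that rough evaluation, assuming the topic model has already been fitted elsewhere. Everything here is hypothetical naming: `doc_topics` is assumed to hold the dominant topic index of each entry, `flagged_topics` the m topics picked by hand as non-English, and `non_english_counts` the per-entry counts obtained with nltk.

```python
def filter_by_topic(doc_topics, flagged_topics):
    """Return indices of entries whose dominant topic was not flagged."""
    return [i for i, topic in enumerate(doc_topics) if topic not in flagged_topics]


def removed_non_english_share(non_english_counts, kept_indices):
    """Share of all non-English words that sat in the removed entries."""
    total = sum(non_english_counts)
    if total == 0:
        return 0.0  # nothing to remove, so nothing was missed
    kept = set(kept_indices)
    removed = sum(c for i, c in enumerate(non_english_counts) if i not in kept)
    return removed / total


# Toy example: topic 1 is assumed to have been flagged as non-English.
doc_topics = [0, 1, 0, 2, 1]
non_english_counts = [1, 12, 0, 3, 8]
kept = filter_by_topic(doc_topics, flagged_topics={1})
share = removed_non_english_share(non_english_counts, kept)
```

A share close to 1 suggests that the flagged topics capture most of the non-English text; it says nothing about how many English entries were removed along with them, so it is only a partial check.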

eldrin · 2019-11-18T15:50:04Z

There still seem to be erroneous entries. If there is enough room to check this, the current pipeline should be evaluated as well.

eldrin added the "help wanted" label (Extra attention is needed) Nov 16, 2019

eldrin added the "minor_validation" label (validation issue but less impactful) Nov 16, 2019
