English entry filtering #1

Open
eldrin opened this issue Nov 16, 2019 · 2 comments
Labels
help wanted (Extra attention is needed), minor_validation (validation issue but less impactful)

Comments


eldrin commented Nov 16, 2019

Non-English lyrics should be filtered out properly. Currently (at least for the MVP project), we avoid advanced machine learning algorithms and keep the pipeline as shallow as possible, so that we can confidently validate every step, even though each step then carries a bit more error.

The logic is fairly simple:

  1. count the non-English words in each lyric (based on the nltk corpus)
  2. based on the distribution of non-English word counts, select a threshold (i.e. a percentile)
  3. filter out entries whose number of non-English words exceeds the threshold
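
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the tokenizer regex, the nearest-rank percentile, and the default `pct=90` are all assumptions, and `english_vocab` stands in for a word set built from an nltk corpus (for example `set(w.lower() for w in nltk.corpus.words.words())`).

```python
import math
import re


def count_non_english(lyric, english_vocab):
    """Step 1: count tokens in `lyric` that are not in `english_vocab`."""
    # Hypothetical tokenizer: runs of letters only, lowercased.
    tokens = re.findall(r"[^\W\d_]+", lyric.lower())
    return sum(1 for tok in tokens if tok not in english_vocab)


def percentile_threshold(counts, pct):
    """Step 2: nearest-rank percentile of the count distribution."""
    ordered = sorted(counts)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def filter_english(lyrics, english_vocab, pct=90):
    """Step 3: keep entries whose count does not exceed the threshold."""
    counts = [count_non_english(text, english_vocab) for text in lyrics]
    threshold = percentile_threshold(counts, pct)
    kept = [text for text, c in zip(lyrics, counts) if c <= threshold]
    return kept, threshold
```

The percentile `pct` is exactly the knob that still needs validation: a lower value removes more entries, including English lyrics with slang or names that are missing from the vocabulary.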

But we still need to validate the choice of threshold!

eldrin added the help wanted (Extra attention is needed) label Nov 16, 2019
eldrin added the minor_validation (validation issue but less impactful) label Nov 16, 2019

eldrin commented Nov 18, 2019

One way to tackle this issue is to use topic modeling as an anomaly detector. Namely, we can learn k topics without any filtering, pick the m topics that look like non-English clusters, then filter out the entries belonging to them. This procedure can be repeated iteratively, but a single cycle already seems to be okay.

This approach can be (roughly) evaluated by the total number of non-English words, as flagged by nltk, in the entries it filters out.
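
A rough sketch of the filtering and evaluation steps, assuming the topic model has already been fitted elsewhere. Everything here is hypothetical naming: `doc_topics` is assumed to hold the dominant topic index of each lyric, `flagged_topics` the m hand-picked non-English topics, and `english_vocab` a word set built from an nltk corpus.

```python
import re


def split_by_topic(lyrics, doc_topics, flagged_topics):
    """Split lyrics into (kept, removed) by their dominant topic."""
    flagged = set(flagged_topics)
    kept, removed = [], []
    for text, topic in zip(lyrics, doc_topics):
        (removed if topic in flagged else kept).append(text)
    return kept, removed


def total_non_english(lyrics, english_vocab):
    """Total number of tokens across `lyrics` missing from `english_vocab`."""
    total = 0
    for text in lyrics:
        # Hypothetical tokenizer: runs of letters only, lowercased.
        tokens = re.findall(r"[^\W\d_]+", text.lower())
        total += sum(1 for tok in tokens if tok not in english_vocab)
    return total
```

Comparing `total_non_english(removed, vocab)` against `total_non_english(kept, vocab)` gives a crude sense of how much non-English material the flagged topics captured and how much slipped through.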


eldrin commented Nov 18, 2019

There still seem to be erroneous entries. If there is enough time to check them, the current pipeline should be evaluated as well.
