Custom stopwords file during classifier initialization #125

Closed
ibnesayeed opened this issue Jan 14, 2017 · 2 comments · Fixed by #129

Comments

@ibnesayeed
Contributor

Currently, the library ships with a list of stopwords in various languages. PR #73 adds the ability to specify more directories to look for stopwords. This means one can only add more stopwords but can't overwrite the existing ones, except perhaps by setting the value of Hasher.STOPWORDS. However, a single stopwords list does not suit all situations: special-purpose collections can need different stopwords within the same language, and in some cases stopwords are not desired at all.

The current implementation also strongly relies on the name of the file being the language code.

In practice, one classifier instance is tied to only one language, and each classifier may want to use its own stopwords. It would be nice to be able to pass an array of stopwords or an arbitrary file path during the initialization of the classifier, overwriting the value of Hasher.STOPWORDS[@language]. I should be able to make a PR for this if we decide to go for it.
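
To make the proposal concrete, here is a rough sketch of the behavior I have in mind. It is not the library's actual code: `SketchHasher`, `SketchClassifier`, the `stopwords:` keyword, and the `custom_stopwords` helper are all hypothetical names, and downcasing the words is my own assumption. Only the idea of a `STOPWORDS` hash keyed by language code comes from the current implementation.

```ruby
require 'set'

# Hypothetical stand-in for the library's Hasher; only the STOPWORDS
# hash keyed by language code mirrors what is described above.
module SketchHasher
  STOPWORDS = Hash.new { |hash, language| hash[language] = Set.new }

  # Accepts an array of words or a path to a file of whitespace-separated
  # words. An empty array clears the list for that language; a path that
  # is not an existing file leaves the current list unchanged.
  def self.custom_stopwords(language, stopwords)
    words =
      if stopwords.respond_to?(:to_ary)
        stopwords.to_ary
      elsif File.file?(stopwords.to_s)
        File.read(stopwords.to_s).split
      else
        return STOPWORDS[language]
      end
    # Downcasing is an assumption so that lookups can be case-insensitive.
    STOPWORDS[language] = Set.new(words.map { |word| word.to_s.downcase })
  end
end

# Hypothetical classifier that shows only the initialization path.
class SketchClassifier
  def initialize(language: 'en', stopwords: nil)
    @language = language
    SketchHasher.custom_stopwords(@language, stopwords) unless stopwords.nil?
  end

  # The stopwords currently in effect for this classifier's language.
  def stopwords
    SketchHasher::STOPWORDS[@language]
  end
end
```

With something like this, `SketchClassifier.new(stopwords: [])` would turn stopwords off entirely, and `SketchClassifier.new(stopwords: '/path/to/file')` would load a collection-specific list. Note that the list is still shared per language in this sketch, so two classifiers for the same language would see the same stopwords.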

@Ch4s3
Member

Ch4s3 commented Jan 14, 2017

I could probably make that happen! I'll take a look tomorrow!

@ibnesayeed
Contributor Author

PR #129 should take care of it. However, we need code review and some test cases before we merge it.

Ch4s3 pushed a commit that referenced this issue Jan 18, 2017
* Ability to add custom stopwords at classifier initialization

* Downcased custom test stopwords

* Documented and improved custom stopwords handling

* Added test cases for custom stopwords and empty trainings, #125 and #130

* Added documentation for auto-categorization and custom stopwords