How should the classifier behave against empty strings? #130
Comments
Another oddity arises when untrain is called more often than train for a category: some counts will go negative.
Yeah, this has been sort of a longstanding issue. I'm not sure what the best way to handle an empty string would be.
I have already given my recommendation in the second paragraph of the first post. Without this being resolved, writing tests for #129 would be tricky.
Ahh sorry, I think I skimmed this one on my phone. Yeah, a length check is probably the right way to go. I wonder if we should log anything if a user does that.
Logging is fine, but it should be at the info level, not a warning or an error, because stopword filtering can leave the string empty without the user intending or knowing it.
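To illustrate the info-level suggestion, here is a rough sketch using Ruby's standard `Logger`. The helper name `empty_after_filtering?` and its message text are hypothetical and not part of the library; the point is only that the skip is reported with `info`, so it stays out of the way when the log level is set to warnings or above.

```ruby
require 'logger'

# Hypothetical helper (not part of the library): reports whether the
# word hash is empty and, if so, notes the skip at the info level.
def empty_after_filtering?(words, logger, action)
  return false unless words.empty?

  # Info, not warn or error: stopword filtering can legitimately
  # leave nothing behind without the caller having done anything wrong.
  logger.info("Skipping #{action}: no words left after stopword filtering")
  true
end
```

A caller would run this check first and return early from training, untraining, or classification when it returns true.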
It occurs to me that We need a
In what context? In general Ruby code
That's what I mean. the string Now, this is only an issue if we check for string emptiness before calling the
I would just check the length of
Yeah, I guess I was just thinking of a case where you're iteratively training over a large csv or something and it hits a bunch of blank data and wastes time trying to hash that stuff. But your solution is a lot simpler, so let's do that. I may also add a simple
We can check the emptiness of the supplied string before calling the
Also, making such decisions should be done in the done in
closed by #132
Currently the Bayes classifier accepts an empty string for training, untraining, and classification. Strings that contain nothing but stopwords behave the same way. This means we are essentially messing up the training count while no real training is happening.
I think we should check the length of `word_hash`, and if it is zero, we should just skip the training and untraining methods. If the same is the case when the classify method is called, then it should return `nil`, as the score for each category should be `Infinity` for empty strings. I found this out while I was working on #125.
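A minimal sketch of the proposed guard, written as a toy class rather than the library's actual code. `TinyBayes`, its stopword list, tokenizer, and scoring are all assumptions made for illustration; only the emptiness check on the word hash reflects the proposal above.

```ruby
# Hypothetical toy classifier, NOT the library's real implementation.
class TinyBayes
  # Illustrative stopword list (an assumption, not the library's list).
  STOPWORDS = %w[a an and the of is to].freeze

  attr_reader :training_counts

  def initialize(*categories)
    @word_counts = categories.to_h { |c| [c, Hash.new(0)] }
    @training_counts = categories.to_h { |c| [c, 0] }
  end

  def train(category, text)
    words = word_hash(text)
    return if words.empty? # nothing to learn, leave all counts untouched

    @training_counts[category] += 1
    words.each { |word, n| @word_counts[category][word] += n }
  end

  def untrain(category, text)
    words = word_hash(text)
    return if words.empty? # nothing to unlearn either

    @training_counts[category] -= 1
    words.each { |word, n| @word_counts[category][word] -= n }
  end

  # Returns the best-scoring category, or nil when no words survive filtering.
  def classify(text)
    words = word_hash(text)
    return nil if words.empty?

    vocab = @word_counts.values.flat_map(&:keys).uniq.size
    scores = @word_counts.map do |category, counts|
      # Add-one smoothing; the extra +1 keeps the denominator non-zero
      # even before any training has happened.
      denom = (counts.values.sum + vocab + 1).to_f
      [category, words.sum { |word, n| n * Math.log((counts[word] + 1) / denom) }]
    end
    scores.max_by { |_, score| score }.first
  end

  private

  # Lowercases, tokenizes, drops stopwords, and counts what is left.
  def word_hash(text)
    text.downcase.scan(/[a-z0-9']+/)
        .reject { |w| STOPWORDS.include?(w) }
        .each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }
  end
end
```

With this guard, `train(:spam, "")` and `train(:spam, "the of and")` leave `training_counts` unchanged, and `classify("")` returns `nil` instead of picking an arbitrary category.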
I can make a PR if this sounds like a sensible option.