tfidf(s): remove normalization, improve docstring #76

jbesomi · 2020-07-13T14:29:02Z

The tfidf function, under-the-hoods makes use of the sklearn.feature_extraction.text.TfidfVectorizer.

By default, the TfidfVectorizer return a L2-normalized TF-IDF vector. In Texthero, we would like to avoid this hidden behavior, to let the user have more granular control over each action.

This task consists of removing (and testing) the normalization, as well as make it more clear in the docstring which tfidf-formula is used. For extra information, the chapter text feature extraction of the scikit-learn documentation might be useful.

Task:

Remove L2-normalization.
Test the new function (using the mathematical formula for each value)
Improve the docstring making it clear which TF-IDF function is used

Implementation:

There are probably two ways of solve that:

By using TfidfVectorizer and de-l2-normalizing the output
Using pure NumPy. This would be nicer from a code perspective but it might be tricky as working with NumPy vectors is not always trivial. As the result should be a Sparse Matrix (see Support "Pandas Series Representation" #43 ) this mean we will have to deal with sparse matrix.

Extra:
After having implemented #43, we will add two new functions, L1 and L2, to normalize any "Pandas Representation Series"

The text was updated successfully, but these errors were encountered:

henrifroese · 2020-07-15T11:03:07Z

Working on this with @mk2510 🐐

henrifroese · 2020-07-15T15:56:00Z

Should the new tfidf function already return such a Representation Series?

Working on it, we currently implemented it so that it calculates the Representation Series correctly (using some of your code from #43 ) and then use a function representation_series_to_flat_series to make the output the same as tfidf currently produces. Thus, when #43 will be fully implemented later on, in tfidf only one line has to be changed to get the Representation Series output. But we could also already return the Representation Series output now.

jbesomi · 2020-07-15T16:47:43Z

Working on it, we currently implemented it so that it calculates the Representation Series correctly (using some of your code from #43 ) and then use a function representation_series_to_flat_series to make the output the same as tfidf currently produces. Thus, when #43 will be fully implemented later on, in tfidf only one line has to be changed to get the Representation Series output. But we could also already return the Representation Series output now.

representation_series_to_flat_series: that's sounds great! (this function will be super useful anyway for users to move from the two representation). By the way, we might want to figure out two different XSeries names to differentiate between the two representations. (Also, for instance, pca will still return a Pandas Series where each cell is a list, right?)

Should the new tfidf function already return such a Representation Series?

What's your opinion? We might also want to do both tasks 1 and 2 in #84 before releasing a new version. This is fine as well. In this case, we can already return the "new" Pandas Representation but this means that many tests will fail ...

... have a look at the unit tests to check for the "new" Pandas Representation under #43 (it took me a long while to implement that, it might spare you a bit of time too!)

Docstring now includes formula/explaination. Normalization disabled. Representation Series is already being handled (although output is still like before). Function representation_series_to_flat_series added. Co-authored-by: Maximilian Krahn <[email protected]>

Docstring now includes formula/explaination. Normalization disabled (the option "normalization=None" was "hidden" in the sklearn code, so that turned out to be an easy fix). Representation Series is already being handled (although output is still like before, using representation_series_to_flat_series). Function representation_series_to_flat_series added. Unit tests are changed accordingly, also one with the explicit calculation using the formula. Co-authored-by: Maximilian Krahn <[email protected]>

henrifroese · 2020-07-15T21:21:35Z

We just opened a PR at #97 .

By the way, we might want to figure out two different XSeries names to differentiate between the two representations

Yes, that is very important to discuss, and probably should be handled later on in #85 2.

What's your opinion? We might also want to do both tasks 1 and 2 in #84 before releasing a new version.

I think it makes sense to still return a "old" output at the moment, so that's what we did currently. It probably makes sense to overhaul all the representation functions at the same time in #85 2. As the new tfidf already calculates with the representation series and just gives the old output, it will be easy to change.

jbesomi mentioned this issue Jul 14, 2020

👩‍💻 API next steps: checklist #85

Open

17 tasks

jbesomi changed the title ~~Imrpviving: tfidf(s) remove normalization, improve docstring~~ Improving tfidf(s) remove normalization, improve docstring Jul 14, 2020

jbesomi pinned this issue Jul 14, 2020

jbesomi changed the title ~~Improving tfidf(s) remove normalization, improve docstring~~ tfidf(s): remove normalization, improve docstring Jul 14, 2020

jbesomi added the enhancement New feature or request label Jul 15, 2020

jbesomi unpinned this issue Jul 15, 2020

jbesomi closed this as completed in 1d4d5a0 Jul 17, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tfidf(s): remove normalization, improve docstring #76

tfidf(s): remove normalization, improve docstring #76

jbesomi commented Jul 13, 2020 •

edited

Loading

henrifroese commented Jul 15, 2020

henrifroese commented Jul 15, 2020

jbesomi commented Jul 15, 2020

henrifroese commented Jul 15, 2020

tfidf(s): remove normalization, improve docstring #76

tfidf(s): remove normalization, improve docstring #76

Comments

jbesomi commented Jul 13, 2020 • edited Loading

henrifroese commented Jul 15, 2020

henrifroese commented Jul 15, 2020

jbesomi commented Jul 15, 2020

henrifroese commented Jul 15, 2020

jbesomi commented Jul 13, 2020 •

edited

Loading