Change representation_series to DataFrame #156

mk2510 · 2020-08-21T08:47:52Z

All functions that previously dealt with the representation series now handle only the DataFrame instead. 🚀
Removed all functions like flatten, as they are no longer needed.
Adapted docstrings and tests.

-> further stuff to do:

add those examples into the tutorials, readme, getting started

support MultiIndex as function parameter; returns MultiIndex where Representation was returned
* missing: correct test
Co-authored-by: Henri Froese <[email protected]>

* missing: adapting tests for new types
Co-authored-by: Henri Froese <[email protected]>

henrifroese · 2020-08-21T14:14:30Z

Will review soon

henrifroese

Overall: looks great; nice that we're close to getting this done 🚀 ! General comments that should be addressed:

macOS build is failing in Travis. From the log we can see that this is due to the DocumentTermDF not being printed the same way in macOS consoles. We probably don't want to leave this untested, so either (a) look at somehow passing a #doctest: +some_command_to_solve_this or (b) skip it with #doctest: +SKIP and add a unittest instead where we manually compare the series (probably much easier)
in general: for dimensionality reduction and clustering, as far as I can see we are not testing this at all. Of course we haven't tested it before, but this is probably the best time to add at least one unittest for all the functions (we're skipping all the doctests at the moment). Should be relatively quick to implement this in test_representation.py
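Option (b) above could look roughly like the following sketch. `term_counts` is a hypothetical stand-in, not a texthero function; the point is only the pattern: the doctest stays in the docstring as documentation but is skipped, and a unittest compares the returned DataFrame with `pd.testing.assert_frame_equal`, which does not depend on how a console prints the frame.

```python
import unittest

import pandas as pd


def term_counts(s: pd.Series) -> pd.DataFrame:
    """
    Hypothetical stand-in for a representation function.

    Examples
    --------
    >>> s = pd.Series(["a b", "b c"])
    >>> term_counts(s)  # doctest: +SKIP
       a  b  c
    0  1  1  0
    1  0  1  1
    """
    tokens = s.str.split()
    # Sorted vocabulary so the column order is deterministic.
    vocabulary = sorted({term for doc in tokens for term in doc})
    rows = [[doc.count(term) for term in vocabulary] for doc in tokens]
    return pd.DataFrame(rows, index=s.index, columns=vocabulary)


class TestTermCounts(unittest.TestCase):
    def test_term_counts(self):
        s = pd.Series(["a b", "b c"])
        expected = pd.DataFrame(
            [[1, 1, 0], [0, 1, 1]], index=s.index, columns=["a", "b", "c"]
        )
        # Compares values, index, columns and dtypes, not the printed output.
        pd.testing.assert_frame_equal(term_counts(s), expected)
```

Because the comparison happens on the objects themselves, the test should behave the same on Linux and macOS.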

henrifroese · 2020-08-21T15:08:35Z

I'll address everything I can myself right now (doctests, unittests, ...)

henrifroese · 2020-08-21T16:51:08Z

Was able to address most small comments myself, will do the rest later with @mk2510

henrifroese · 2020-08-21T17:50:43Z

We have now addressed the remaining issues. We're skipping doctests in representation.py and implemented unittests for every representation function in test_representation.py instead.

From our side, this can be merged now @jbesomi 🙏 🚀 🐳 🤞

jbesomi · 2020-09-03T18:56:26Z

Looks great (even too hard to understand, at least quickly).

The main question (and sorry for the late review; will catch up faster now): why do we return a MultiIndex sparse DataFrame? Why not simply a (sparse) DataFrame? This should simplify things a bit and is probably the most natural type users expect (i.e. we wrap the scipy sparse matrix in a DF)
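For illustration, wrapping a scipy sparse matrix in a plain sparse DataFrame can be sketched with pandas' `DataFrame.sparse.from_spmatrix`. The document labels, terms and values below are made up, and this is not necessarily how texthero builds its output.

```python
import pandas as pd
from scipy import sparse

# Made-up document-term matrix: 2 documents, 3 terms.
doc_labels = ["doc_0", "doc_1"]
terms = ["apple", "banana", "cherry"]
matrix = sparse.csr_matrix([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]])

# Each column becomes a pandas SparseArray; zeros are not stored.
df = pd.DataFrame.sparse.from_spmatrix(matrix, index=doc_labels, columns=terms)

# The .sparse accessor gives access to sparse-specific helpers,
# e.g. the fraction of stored values and a round trip back to scipy.
density = df.sparse.density
back_to_scipy = df.sparse.to_coo()
```

The result has ordinary single-level columns, so it can be indexed like any other DataFrame (for example `df.loc["doc_0", "cherry"]`).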

jbesomi · 2020-09-08T09:18:22Z

As far as I know, we will not have to deal with "RepresentationSeries" anymore, as there are no functions that return such an object. tfidf and company return a DataFrame that we can convert into a VectorSeries. Is that right?

Then, for instance, normalize does not need to handle the RepresentationSeries case, only VectorSeries and DataFrame. It seems to me that _check_is_valid_DataFrame is not useful as it is; instead, we can move the code inside the normalize function to detect whether input_matrix is a VectorSeries (which needs a validity check) or simply a DataFrame.

There are still many parts of the code that mention "DocumentTermDF"; this might not be necessary, right? I.e. pca can be applied to any DataFrame, not only document-term ...

Plus, you need to fix minor issues, such as docstring lines longer than 75 characters (e.g. line 917), and pca and nmf not having the same summary sentence (one is missing "on the given input.") [I haven't checked the others]

☝️ @mk2510 next time, please only set "ready for review" once you have double-checked the whole PR and are sure about all changes. It saves time for both you and me :) 👍

mk2510 · 2020-09-09T20:52:30Z

However, we use _check_is_valid_DataFrame in multiple different functions, like pca, nmf, etc. It is necessary to convert the given data into the right function input format. This works differently for VectorSeries and for DataFrames, hence we use this function quite often. If it were only used in normalize, I would totally agree with you that it would be better to move it inside the function.
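To make the idea of a shared conversion helper concrete, here is a minimal sketch. `to_matrix` and `l2_normalize` are hypothetical names, not texthero's actual functions, and the sketch assumes a VectorSeries is a Series whose cells are equal-length lists of numbers; it only handles dense data.

```python
import numpy as np
import pandas as pd


def to_matrix(data):
    """Hypothetical helper: return a dense 2D float array for either input."""
    if isinstance(data, pd.DataFrame):
        return data.to_numpy(dtype=float)
    if isinstance(data, pd.Series):
        # Assumed VectorSeries layout: one equal-length list per cell.
        return np.array(list(data), dtype=float)
    raise TypeError("expected a pandas DataFrame or a Series of vectors")


def l2_normalize(data):
    """Hypothetical caller: row-wise L2 normalization, same type in and out."""
    matrix = to_matrix(data)
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows unchanged
    normalized = matrix / norms
    if isinstance(data, pd.DataFrame):
        return pd.DataFrame(normalized, index=data.index, columns=data.columns)
    return pd.Series(normalized.tolist(), index=data.index)


df_result = l2_normalize(pd.DataFrame([[3.0, 4.0], [0.0, 0.0]]))
series_result = l2_normalize(pd.Series([[3.0, 4.0], [0.0, 2.0]]))
```

Any other function that needs a matrix (a pca or nmf wrapper, say) could call the same helper, which is the argument for keeping the conversion in one place.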

> There are still many parts of the code that mention "DocumentTermDF"; this might not be necessary, right? I.e. pca can be applied to any DataFrame, not only document-term ...

I totally missed those 🤦 but now all unnecessary Document Term mentions should be gone.

> pca and nmf not having the same summary sentence (one is missing "on the given input.") [I haven't checked the others]

Those summary sentences should now be the same and the lengths also under 76 everywhere 🤞

> As far as I know, we will not have to deal with "RepresentationSeries" anymore, as there are no functions that return such an object. tfidf and company return a DataFrame that we can convert into a VectorSeries. Is that right?

That is absolutely right. The RepresentationSeries will be removed in the next PR, where we worked on the hero types: #157 🏎️

henrifroese

Just went through everything once more. ~~Will fix~~ Fixed the very small stuff I found. It's now ready to merge in my opinion

jbesomi · 2020-09-12T09:39:56Z

Looks almost perfect 😍

I just noticed that we don't have a strict rule for how we define the default value in the docstring. I believe we can stick to:

max_features : int, optional, default=None

Can you please make sure in all functions we write it the same way?
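As an illustration of that convention, a docstring written this way might look like the sketch below. `top_terms` is a hypothetical function made up for this example; only the `max_features : int, optional, default=None` line is taken from the comment above.

```python
from collections import Counter

import pandas as pd


def top_terms(s, max_features=None):
    """
    Return the most frequent whitespace-separated terms in s.

    Parameters
    ----------
    s : pandas.Series
        Series of strings.

    max_features : int, optional, default=None
        Maximum number of terms to return.
        If None, all terms are returned.
    """
    counts = Counter(term for doc in s for term in doc.split())
    # most_common(None) returns every term, ordered by frequency.
    return [term for term, _ in counts.most_common(max_features)]
```

Writing the default in the same position on every parameter line also makes it easy to search the code base for entries that deviate from the pattern.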

henrifroese · 2020-09-12T09:57:29Z

> Can you please make sure in all functions we write it the same way?

Yes, I'll do that and add the "British/American English" and "number of default components" to CONTRIBUTING.md and change it in all files later.

EDIT: now decided to already do this in representation as this representation version is so different from the master.

henrifroese · 2020-09-12T10:10:06Z

Just incorporated the suggested changes from the review 🌩️

henrifroese · 2020-09-12T10:15:26Z

As we can see, the DF doctest fails in macOS, so I'll skip it again

jbesomi · 2020-09-12T10:17:27Z

Ok, see here

jbesomi · 2020-09-12T11:11:04Z

Let's go! 🎉 🎉 Good job.

mk2510 and others added 8 commits August 18, 2020 22:06

added MultiIndex DF support

fa342a9

suport MultiIndex as function parameter returns MultiIndex, where Representation was returned * missing: correct test Co-authored-by: Henri Froese <[email protected]>

beginning with tests

59a9f8c

implemented correct sparse support

19c52de

*missing: test adopting for new types Co-authored-by: Henri Froese <[email protected]>

Merge branch 'master_upstream' into change_representation_to_multicolumn

66e566c

added back list() and rm .tolist()

41f55a8

rm .tolist() and added list()

217611a

Adopted the test to the new dataframes

6a3b56d

wrong format

b8ff561

vercel bot deployed to Preview August 21, 2020 08:47 View deployment

mk2510 requested a review from henrifroese August 21, 2020 08:48

mk2510 linked an issue Aug 21, 2020 that may be closed by this pull request

Support "Pandas Series Representation" #43

Closed

henrifroese requested changes Aug 21, 2020

Address most review comments.

e3af2f9

vercel bot deployed to Preview August 21, 2020 16:49 View deployment

Add more unittests for representation

77ad80e

vercel bot deployed to Preview August 21, 2020 17:45 View deployment

henrifroese marked this pull request as ready for review August 21, 2020 17:50

henrifroese approved these changes Aug 21, 2020

henrifroese mentioned this pull request Aug 22, 2020

RepresentationSeries: pca, nmf, tsne #140

Closed

henrifroese added the enhancement New feature or request label Aug 22, 2020

henrifroese mentioned this pull request Aug 22, 2020

HeroTypes in Representation; DataFrame in _types #157

Merged

This was referenced Aug 22, 2020

Add pandas __setitem__ support for DocumentTermDF #158

Closed

Fix Term_Frequency #165

Merged

This was referenced Aug 26, 2020

Implement filter_extremes #169

Open

👩‍💻 API next steps: checklist #85

Open

jbesomi marked this pull request as draft September 8, 2020 09:04

jbesomi requested a review from henrifroese September 8, 2020 09:05

This was referenced Sep 8, 2020

Topic Modelling and Visualization #163

Open

Add ClusterSeries to Hero Series Types #170

Open

edited docstrings and DocumentTermDF

3ba2ebc

vercel bot deployed to Preview September 9, 2020 20:43 View deployment

uniform docstring

111ced6

vercel bot temporarily deployed to Preview September 9, 2020 20:46 Inactive

formatting done

b3823e2

vercel bot deployed to Preview September 9, 2020 20:47 View deployment

mk2510 marked this pull request as ready for review September 9, 2020 20:55

henrifroese reviewed Sep 12, 2020

Fix small stuff from review.

efcf8c0

vercel bot deployed to Preview September 12, 2020 08:58 View deployment

jbesomi reviewed Sep 12, 2020

incorporate suggested changes

6e0c831

vercel bot deployed to Preview September 12, 2020 10:08 View deployment

re-skip doctest as it fails on macOS

3f8b734

vercel bot deployed to Preview September 12, 2020 10:17 View deployment

jbesomi merged commit 72f351e into jbesomi:master Sep 12, 2020

This was referenced Sep 12, 2020

Issues with _types #180

Closed

Redo / Improve our Doctests #184

Open
