Add pandas __setitem__ support for DocumentTermDF #158

mk2510 · 2020-08-22T15:48:22Z

As discussed, this PR makes small changes to pd.DataFrame.__setitem__ to allow users to do e.g. df["tfidf"] = hero.tfidf(df["text"]). See the extensive documentation we added in _helper.py. We also add tests for this 🕵️ 🕵️‍♂️ 🕵️‍♀️

NOTE: the commit and line counts are only this high because this PR builds on #156.

mk2510 · 2020-09-05T15:20:41Z

@jbesomi we have now also adapted the set_item method to handle the changes in PR #156. When the method receives a DataFrame with more than one column, it converts the columns into a pandas MultiIndex, with the set key on the upper level and the original column names on the lower level, just as a user would naturally expect pandas to work. 🐼
To demonstrate, when a DataFrame with multiple columns is inserted into an existing one:

>>> df1 = pd.DataFrame(["Text 1", "Text 2"], columns=["Test"])
>>> df2 = pd.DataFrame([[3, 5], [8, 4]], columns=["term 1", "term 2"],)
>>> df1["count"] = df2
>>> df1
     Test  count       
          term 1 term 2
0  Text 1      3      5
1  Text 2      8      4

When just a single-column DataFrame is inserted, it behaves as it used to:

>>> df1 = pd.DataFrame(["Text 1", "Text 2"], columns=["Test"])
>>> df2 = pd.DataFrame([3, 5], columns=["count"])
>>> df1["here"] = df2
>>> df1
     Test  here
0  Text 1     3
1  Text 2     5
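
For comparison, the same nested layout can be built with stock pandas and no patching. This is only a sketch, not the code from this PR: `insert_under_key` is a hypothetical helper, and it returns a new frame instead of assigning in place.

```python
import pandas as pd


def insert_under_key(df, key, value):
    """Hypothetical helper: return a copy of `df` with `value` placed under `key`.

    A single-column `value` becomes an ordinary column named `key`.
    A multi-column `value` is nested under `key` via a two-level MultiIndex.
    Assumes `df` has flat or two-level columns.
    """
    if value.shape[1] == 1:
        out = df.copy()
        out[key] = value.iloc[:, 0]
        return out

    left = df.copy()
    if not isinstance(left.columns, pd.MultiIndex):
        # Lift flat column names to two levels so both sides line up.
        left.columns = pd.MultiIndex.from_tuples([(c, "") for c in left.columns])

    right = value.copy()
    # Upper level: the key. Lower level: the original column names.
    right.columns = pd.MultiIndex.from_product([[key], value.columns])

    return pd.concat([left, right], axis=1)


df1 = pd.DataFrame(["Text 1", "Text 2"], columns=["Test"])
df2 = pd.DataFrame([[3, 5], [8, 4]], columns=["term 1", "term 2"])
nested = insert_under_key(df1, "count", df2)

single = insert_under_key(df1, "here", pd.DataFrame([3, 5], columns=["count"]))
```

Selecting `nested["count"]` then gives back a frame with the columns `term 1` and `term 2`.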

jbesomi · 2020-09-08T11:33:19Z

Magic. 🥇

I will review this once #156 is merged so that we can experiment a bit with the output DataFrame(s).

Does your solution correctly handle the insertion of a sparse pandas DataFrame?

Just so you know, my only concern is that I don't fully understand why the pandas API does not allow such an insertion.

mk2510 · 2020-09-09T21:10:30Z

> Just so you know, my only concern is that I don't fully understand why the pandas API does not allow such an insertion.

There is no particular reason why the pandas API does not support it. We opened an issue at pandas about it. Our use case is now much easier, since there is only a single column level.

> Does your solution correctly handle the insertion of a sparse pandas DataFrame?

It works so far. As soon as the other two are merged, I will mark this PR as ready for review.

henrifroese

We just went through everything again and added additional tests for sparseness. We believe this is ready to be reviewed/merged now 🐙
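
A check of the kind such sparseness tests might make can be sketched as follows. This is an illustration with stock pandas, not the tests from this PR: `nest_under_key` is a hypothetical helper, and the frames are made up for the example.

```python
import pandas as pd


def nest_under_key(value, key):
    """Hypothetical helper: put all columns of `value` under the upper level `key`."""
    out = value.copy()
    out.columns = pd.MultiIndex.from_product([[key], value.columns])
    return out


# A text column, lifted to two column levels so it can sit next to nested columns.
texts = pd.DataFrame({"text": ["Text 1", "Text 2"]})
texts.columns = pd.MultiIndex.from_tuples([("text", "")])

# A small document-term frame whose columns are stored sparsely.
counts = pd.DataFrame(
    {
        "term 1": pd.arrays.SparseArray([3, 0], fill_value=0),
        "term 2": pd.arrays.SparseArray([0, 4], fill_value=0),
    }
)

combined = pd.concat([texts, nest_under_key(counts, "count")], axis=1)

# Every column under "count" should still carry a sparse dtype after the insertion.
still_sparse = all(
    isinstance(dtype, pd.SparseDtype) for dtype in combined["count"].dtypes
)
```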

jbesomi

Hmm, less complex than expected 🧐
Looks great!

What if we add this for now as an extension that users can test? We can insert this into a separate file texthero.beta.helper or similar and explain in a blog article how to "activate that" (from texthero.beta import helper ...) and what it allows us to do.

jbesomi · 2020-09-14T12:18:14Z

texthero/_helper.py

+    - we don't destroy any pandas functionality as currently calling
+      `__setitem__` with a Multiindexed value is just not possible, so
+      our changes to Pandas do not break any Pandas functionality for
+      the users. We're only _expanding_ the functinoality


functinoality spelling mistake

jbesomi · 2020-09-14T12:18:53Z

texthero/_helper.py

+2     0  0  1    1   1  1
+```
+
+That's a DataFrame. Great! Of course, users can


we can be a bit more succinct here (~~That's a DataFrame. Great! Of course,~~)

mk2510 · 2020-09-14T15:33:23Z

I also discussed this with @henrifroese. As discussed in the meeting, we decided not to merge it. In our view, if this feature is hidden in a beta version, most users will probably ignore it. Since it is also not performance-optimized for huge DataFrames, we decided that the tradeoff is not worth it, and we will close this PR.

jbesomi · 2020-09-14T15:38:04Z

Ok. I'm sorry that we couldn't merge it, as it would have been a great feature for users. Still, it would probably have been worse to introduce something that makes the user experience less pleasant. The fact that assigning a sparse DataFrame takes so long because of the large number of columns is what makes me think twice before making it part of the master branch. Let's hope that with Pandas v.2 we will be able to do something similar in a more efficient way 🙏🏻
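
One way to measure how insertion cost grows with the number of columns is a small timing harness like the one below. It is a sketch under assumptions: all function names are hypothetical, the data is random, and it compares two generic strategies (per-column insertion versus a single `pd.concat`) rather than reproducing the code in this PR. No timings are quoted because they depend on the machine and the pandas version.

```python
import time

import numpy as np
import pandas as pd


def make_sparse_frame(n_rows, n_cols, seed=0):
    """Build a random document-term-like frame with sparse integer columns."""
    rng = np.random.default_rng(seed)
    mask = rng.integers(0, 2, size=(n_rows, n_cols))
    dense = mask * rng.integers(1, 5, size=(n_rows, n_cols))
    return pd.DataFrame(
        {
            f"term {i}": pd.arrays.SparseArray(dense[:, i], fill_value=0)
            for i in range(n_cols)
        }
    )


def _lift_columns(df):
    """Return a copy of `df` whose flat columns are lifted to two levels."""
    out = df.copy()
    out.columns = pd.MultiIndex.from_tuples([(c, "") for c in out.columns])
    return out


def insert_columnwise(df, key, value):
    """Insert the columns of `value` one at a time under `key`."""
    out = _lift_columns(df)
    for column in value.columns:
        out[(key, column)] = value[column]
    return out


def insert_concat(df, key, value):
    """Insert all columns of `value` under `key` with a single concat."""
    right = value.copy()
    right.columns = pd.MultiIndex.from_product([[key], value.columns])
    return pd.concat([_lift_columns(df), right], axis=1)


def time_call(func, *args):
    """Return the wall-clock seconds taken by one call of `func(*args)`."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start
```

Calling `time_call(insert_columnwise, docs, "count", terms)` and `time_call(insert_concat, docs, "count", terms)` for increasing column counts gives a rough picture of how each strategy scales.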

mk2510 and others added 13 commits August 18, 2020 22:06

added MultiIndex DF support

fa342a9

suport MultiIndex as function parameter returns MultiIndex, where Representation was returned * missing: correct test Co-authored-by: Henri Froese <[email protected]>

beginning with tests

59a9f8c

implemented correct sparse support

19c52de

*missing: test adopting for new types Co-authored-by: Henri Froese <[email protected]>

Merge branch 'master_upstream' into change_representation_to_multicolumn

66e566c

added back list() and rm .tolist()

41f55a8

rm .tolist() and added list()

217611a

Adopted the test to the new dataframes

6a3b56d

wrong format

b8ff561

Address most review comments.

e3af2f9

Add more unittests for representation

77ad80e

implemented setItem

c6ca37f

*missing: unitTest Co-authored-by: Henri Froese <[email protected]>

formated files

8731ea7

Add tests for custom pandas setitem method

5c4db2f

vercel bot deployed to Preview August 22, 2020 15:48 View deployment

henrifroese added the enhancement New feature or request label Aug 23, 2020

henrifroese mentioned this pull request Aug 28, 2020

👩‍💻 API next steps: checklist #85

Open

17 tasks

jbesomi mentioned this pull request Sep 4, 2020

Change representation_series to DataFrame #156

Merged

mk2510 added 5 commits September 4, 2020 17:04

implemented the suggested changes

e2768b5

fixed messy docstring

b09f624

fix black issues

508c361

fix formatting

75e955f

Merge branch 'change_representation_to_multicolumn' into adapt_pandas…

fc15dc7

…_insert_concat

vercel bot deployed to Preview September 5, 2020 14:45 View deployment

apdated set_item to the new requierements

7bf3583

vercel bot deployed to Preview September 5, 2020 15:08 View deployment

mk2510 marked this pull request as draft September 9, 2020 21:12

Merge remote-tracking branch 'upstream/master' into adapt_pandas_inse…

0a0aaf6

…rt_concat

vercel bot deployed to Preview September 12, 2020 12:55 View deployment

Add tests for sparseness.

f44739e

Co-authored-by: Maximilian Krahn <[email protected]>

vercel bot deployed to Preview September 12, 2020 13:05 View deployment

henrifroese reviewed Sep 12, 2020

henrifroese marked this pull request as ready for review September 12, 2020 13:11

jbesomi reviewed Sep 14, 2020

mk2510 added the wontfix This will not be worked on label Sep 14, 2020

mk2510 closed this Sep 14, 2020
