Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pandas __setitem__ support for DocumentTermDF #158

Closed

Conversation

mk2510
Copy link
Collaborator

@mk2510 mk2510 commented Aug 22, 2020

As discussed, this PR makes small changes to pd.DataFrame.__setitem__ to allow users to do e.g. df["tfidf"] = hero.tfidf(df["text"]). See the extensive documentation we added in _helper.py. We also add tests for this 🕵️ 🕵️‍♂️ 🕵️‍♀️

NOTE: only so many commits/lines as this builds on #156

@mk2510
Copy link
Collaborator Author

mk2510 commented Sep 5, 2020

@jbesomi we now have also adapted the set_item method to handle the changes in PR #156 when the set item method receives a data frame, which has more than one column it will convert the columns into a pandas multi-index with the original column names on the lower level and the set key on the upper level like the user would naturally expect how pandas work. 🐼
so to demonstrate the functioning, when a DataFrame with multiple columns is inserted into an existing one:

>>> df1 = pd.DataFrame(["Text 1", "Text 2"], columns=["Test"])
>>> df2 = pd.DataFrame([[3, 5], [8, 4]], columns=["term 1", "term 2"],)
>>> df1["count"] = df2
>>> df1
     Test  count       
          term 1 term 2
0  Text 1      3      5
1  Text 2      8      4

when just a single column DataFrame is inserted, it will behave, as it used to:

>>> df1 = pd.DataFrame(["Text 1", "Text 2"], columns=["Test"])
>>> df2 = pd.DataFrame([3, 5], columns=["count"])
>>> df1["here"] = df2
>>> df1
     Test  here
0  Text 1     3
1  Text 2     5

@jbesomi
Copy link
Owner

jbesomi commented Sep 8, 2020

Magic. 🥇

I will review this once #156 is merged so that we can test a bit around with the output DataFrame(s).

Does your solution handle correctly the insertion of sparse Pandas DF?

For you to know, my only concern is that I don't fully understand why the Pandas API does not allow for such insertion.

@mk2510
Copy link
Collaborator Author

mk2510 commented Sep 9, 2020

For you to know, my only concern is that I don't fully understand why the Pandas API does not allow for such insertion.

There is no reason, that the pandas API does not support it. This issue we opened at pandas about it. Now our usecase is much easier, with a single column level.

Does your solution handle correctly the insertion of sparse Pandas DF?

It works so far. As soon, as the other two are merged, I will move this PR to ready to review

@mk2510 mk2510 marked this pull request as draft September 9, 2020 21:12
Co-authored-by: Maximilian Krahn <[email protected]>
Copy link
Collaborator

@henrifroese henrifroese left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just went through everything again and added additional tests for sparseness. We believe this is ready to be reviewed/merged now :octocat: 🐙

@henrifroese henrifroese marked this pull request as ready for review September 12, 2020 13:11
Copy link
Owner

@jbesomi jbesomi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, less complex than expected 🧐
Looks great!

What if we add this for now as an extension that users can test? We can insert this into a separate file texthero.beta.helper or similar and explain in a blog article how to "activate that" (from texthero.beta import helper ...) and what it allows us to do.

- we don't destroy any pandas functionality as currently calling
`__setitem__` with a Multiindexed value is just not possible, so
our changes to Pandas do not break any Pandas functionality for
the users. We're only _expanding_ the functinoality
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

functinoality spelling mistake

2 0 0 1 1 1 1
```

That's a DataFrame. Great! Of course, users can
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can be a bit more succint here (That's a DataFrame. Great! Of course,)

@mk2510 mk2510 added the wontfix This will not be worked on label Sep 14, 2020
@mk2510
Copy link
Collaborator Author

mk2510 commented Sep 14, 2020

I also discussed this issue with @henrifroese. we decided that we will not merge it, as discussed in the meeting. In our view when hiding this feature in a beta version, most users will probably ignore this. As this is performance-wise not optimized for huge DataFrames we decided that the tradeoff is not worth it and we will close this PR

@mk2510 mk2510 closed this Sep 14, 2020
@jbesomi
Copy link
Owner

jbesomi commented Sep 14, 2020

Ok. I'm sorry that we couldn't merge it, as it would have been a great feature for users, yet, it would have been probably worse to introduce something that would have made the user experience less pleasant. The fact that assigning a sparse DataFrame takes so long because of the large number of columns is what makes me think twice before making it part of the master branch. Let's hope that with Pandas v.2 we will be able to do something similar in a more efficient way 🙏🏻

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request wontfix This will not be worked on
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants