Add pandas __setitem__ support for DocumentTermDF #158
Closed: mk2510 wants to merge 21 commits into jbesomi:master from SummerOfCode-NoHate:adapt_pandas_insert_concat
fa342a9 added MultiIndex DF support (mk2510)
59a9f8c beginning with tests (henrifroese)
19c52de implemented correct sparse support (mk2510)
66e566c Merge branch 'master_upstream' into change_representation_to_multicolumn (mk2510)
41f55a8 added back list() and rm .tolist() (mk2510)
217611a rm .tolist() and added list() (mk2510)
6a3b56d Adopted the test to the new dataframes (mk2510)
b8ff561 wrong format (mk2510)
e3af2f9 Address most review comments. (henrifroese)
77ad80e Add more unittests for representation (henrifroese)
c6ca37f implemented setItem (mk2510)
8731ea7 formated files (mk2510)
5c4db2f Add tests for custom pandas setitem method (mk2510)
e2768b5 implemented the suggested changes (mk2510)
b09f624 fixed messy docstring (mk2510)
508c361 fix black issues (mk2510)
75e955f fix formatting (mk2510)
fc15dc7 Merge branch 'change_representation_to_multicolumn' into adapt_pandas… (mk2510)
7bf3583 apdated set_item to the new requierements (mk2510)
0a0aaf6 Merge remote-tracking branch 'upstream/master' into adapt_pandas_inse… (henrifroese)
f44739e Add tests for sparseness. (henrifroese)
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -184,3 +184,4 @@ dmypy.json | |
# Cython debug symbols | ||
cython_debug/ | ||
docs/source/api | ||
.vscode/launch.json |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,3 +16,5 @@ | |
from .nlp import * | ||
|
||
from . import stopwords | ||
|
||
from . import _helper |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -71,3 +71,166 @@ def wrapper(*args, **kwargs): | |
return wrapper | ||
|
||
return decorator | ||
|
||
|
||
""" | ||
Pandas Integration of DocumentTermDF | ||
|
||
It's really important that users can seamlessly integrate texthero's function | ||
output with their code. Let's assume a user has their documents in a DataFrame | ||
`df` with a column `text` that looks like this: | ||
|
||
``` | ||
>>> df = pd.DataFrame(["Text of doc 1", "Text of doc 2", "Text of doc 3"], columns=["text"]) | ||
>>> df | ||
text | ||
0 Text of doc 1 | ||
1 Text of doc 2 | ||
2 Text of doc 3 | ||
|
||
``` | ||
|
||
Let's look at an example output that `hero.count` could | ||
return with the DocumentTermDF: | ||
|
||
``` | ||
>>> hero.count(df["text"]) | ||
count | ||
1 2 3 Text doc of | ||
0 1 0 0 1 1 1 | ||
1 0 1 0 1 1 1 | ||
2 0 0 1 1 1 1 | ||
``` | ||
|
||
That's a DataFrame. Great! Of course, users can | ||
just store this somewhere as e.g. `df_count = hero.count(df["text"])`, | ||
and that works great. Accessing is then also as always: to get the | ||
count values, they can just do `df_count.values` and have the count matrix | ||
right there! | ||
|
||
However, what we see really often is users wanting to do this: | ||
`df["count"] = hero.count(df["text"])`. This sadly does not work out | ||
of the box. The reason is that this subcolumn type is implemented | ||
internally through a _MultiIndex in the columns_. So we have | ||
|
||
``` | ||
>>> df.columns | ||
Index(['text'], dtype='object') | ||
>>> hero.count(df["text"]).columns | ||
MultiIndex([('count', '1'), | ||
('count', '2'), | ||
('count', '3'), | ||
('count', 'Text'), | ||
('count', 'doc'), | ||
('count', 'of')], | ||
) | ||
|
||
``` | ||
|
||
Pandas _cannot_ automatically combine these. Calling | ||
`df["count"] = hero.count(df["text"])` is internally | ||
`pd.DataFrame.__setitem__(self=df, key="count", value=hero.count(df["text"]))`. | ||
We override this method so that if _self_ is not multiindexed yet | ||
and _value_ is multiindexed, we transform _self_ (so `df` here) to | ||
be multiindexed and can then easily integrate our column-multiindexed output from texthero. | ||
See the implementation below for details. | ||
|
||
Additionally, we support this for pd.concat in a similar way; again, see the | ||
implementation below for details. | ||
|
||
Advantages / Why does this work? | ||
|
||
- we don't destroy any pandas functionality as currently calling | ||
`__setitem__` with a Multiindexed value is just not possible, so | ||
our changes to Pandas do not break any Pandas functionality for | ||
the users. We're only _expanding_ the functinoality | ||
Review comment: functinoality spelling mistake |
||
|
||
- after multiindexing, users can still access their | ||
"normal" columns like before; e.g. `df["text"]` will | ||
behave the same way as before even though it is now internally | ||
multiindexed as `MultiIndex([('text', ''), ('count', '1'), | ||
('count', '2'), | ||
('count', '3'), | ||
('count', 'Text'), | ||
('count', 'doc'), | ||
('count', 'of')], | ||
)`. | ||
|
||
Disadvantage: | ||
|
||
- poor performance, so we discourage users from using it, but we still want to support it | ||
""" | ||
|
||
# Store the original __setitem__ function as _original__setitem__ | ||
_pd_original__setitem__ = pd.DataFrame.__setitem__ | ||
pd.DataFrame._original__setitem__ = _pd_original__setitem__ | ||
|
||
|
||
# Define a new __setitem__ function that will replace pd.DataFrame.__setitem__ | ||
def _hero__setitem__(self, key, value): | ||
""" | ||
Called when doing self["key"] = value. | ||
E.g. df["count"] = hero.count(df["text"]) is internally doing | ||
pd.DataFrame.__setitem__(self=df, key="count", value=hero.count(df["text"])). | ||
|
||
So self is df, key is the new column's name, value is | ||
what we want to put into the new column. | ||
|
||
What we do: | ||
|
||
1. If user calls __setitem__ with value being multiindexed, e.g. | ||
df["count"] = hero.count(df["text"]), | ||
so __setitem__(self=df, key="count", value=hero.count(df["text"])) | ||
|
||
2. we make self multiindexed if it isn't already | ||
-> e.g. column "text" internally becomes multiindexed | ||
to ("text", "") but users do _not_ notice this. | ||
This is a very quick operation that does not need | ||
to look at the df's values, we just reassign | ||
self.columns | ||
|
||
3. we change value's columns so the first level is named `key` | ||
-> e.g. a user might do df["haha"] = hero.count(df["text"]), | ||
so just doing df[hero.count(df["text"]).columns] = hero.count(df["text"]) | ||
would give them a new column that is named like texthero's output, | ||
e.g. "count" instead of "haha". So we internally rename the | ||
value columns (e.g. [('haha', '1'), | ||
('haha', '2'), | ||
('haha', '3'), | ||
('haha', 'Text'), | ||
('haha', 'doc'), | ||
('haha', 'of')]) | ||
|
||
4. we do self[value.columns] = value as that's exactly the command | ||
that correctly integrates the multiindexed `value` into `self` | ||
|
||
""" | ||
|
||
# 1. | ||
if ( | ||
isinstance(value, pd.DataFrame) | ||
and len(value.columns) > 1 | ||
and isinstance(key, str) | ||
): | ||
|
||
# 2. | ||
if not isinstance(self.columns, pd.MultiIndex): | ||
self.columns = pd.MultiIndex.from_tuples( | ||
[(col_name, "") for col_name in self.columns.values] | ||
) | ||
|
||
# 3. | ||
value.columns = pd.MultiIndex.from_tuples( | ||
[(key, subcol_name) for subcol_name in value.columns.values] | ||
) | ||
|
||
# 4. | ||
self[value.columns] = value | ||
|
||
else: | ||
|
||
self._original__setitem__(key, value) | ||
|
||
|
||
# Replace __setitem__ with our custom function | ||
pd.DataFrame.__setitem__ = _hero__setitem__ |
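The four steps in the docstring can be tried out without patching pandas. The following is a minimal sketch under stated assumptions, not the PR's implementation: `set_multicolumn` is a hypothetical helper name, `counts` is hand-built toy data standing in for the `hero.count` output, and step 4 uses `pd.concat` along the columns instead of `self[value.columns] = value`.

```python
import pandas as pd

# Toy data; `counts` imitates the column-MultiIndexed output shown in the docstring.
df = pd.DataFrame({"text": ["Text of doc 1", "Text of doc 2"]})
counts = pd.DataFrame(
    [[1, 0, 1], [0, 1, 1]],
    columns=pd.MultiIndex.from_tuples(
        [("count", "1"), ("count", "2"), ("count", "doc")]
    ),
)


def set_multicolumn(frame, key, value):
    """Hypothetical helper: return a copy of `frame` with `value` stored under `key`."""
    out = frame.copy()

    # Step 2: lift flat columns to two levels, e.g. "text" -> ("text", "").
    if not isinstance(out.columns, pd.MultiIndex):
        out.columns = pd.MultiIndex.from_tuples([(col, "") for col in out.columns])

    # Step 3: rename the first level to `key`, working on a copy of `value`.
    if isinstance(value.columns, pd.MultiIndex):
        subcols = value.columns.get_level_values(-1)
    else:
        subcols = value.columns
    renamed = value.copy()
    renamed.columns = pd.MultiIndex.from_tuples([(key, sub) for sub in subcols])

    # Step 4 (assumption: concat instead of item assignment): attach the block.
    return pd.concat([out, renamed], axis=1)


result = set_multicolumn(df, "haha", counts)
print(result.columns.tolist())
```

Two choices here differ from the patched method in the diff. The sketch copies `value` before renaming, so the caller's DataFrame keeps its original column labels, whereas the diff assigns to `value.columns` directly. It also takes only the last column level as the subcolumn name, which assumes the incoming value has the two-level layout shown in the docstring.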
Review comment: we can be a bit more succinct here ("That's a DataFrame. Great! Of course,")