Kind of Pandas Series #60
Comments
File _helper.py implements Series types for the library. Would implement jbesomi#60.
I just opened a first draft PR in #69 as a step towards implementing this. I'll copy the docstring of the new file _helper.py here:

Hero Series Types
There are different kinds of Pandas Series used in the library, depending on use. These are the implemented types:
You could now do this:
* The decorators (@...) make Python check whether the input is valid.
* The typing helps users understand the code more easily.

Note that users can of course still simply
Example: user has standard pd.Series s and wants to clean the text.
Concerning performance, a user might often have a Series s on which
(NOTE: this could lead to problems later on, if e.g. a user
The classes are lightweight subclasses of pd.Series and serve 2 purposes:
More Examples
These are just some simple examples. As you can see, this makes it easy for the vast majority of functions to check for the correct type through the decorators. It also makes the library easier to use, as users immediately know what kind of Series a function takes as input and returns as output.
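To make the decorator idea above concrete, here is a minimal sketch. The names `TextSeries`, `TokenSeries`, `input_series`, `lowercase` and `tokenize` are hypothetical and chosen for illustration only; the sketch does not claim to match what `_helper.py` actually contains.

```python
import functools

import pandas as pd


class TextSeries(pd.Series):
    """Hypothetical marker type: every cell is a string."""

    @staticmethod
    def check(s: pd.Series) -> bool:
        return all(isinstance(cell, str) for cell in s)


class TokenSeries(pd.Series):
    """Hypothetical marker type: every cell is a list of tokens.

    Used here only as a return annotation, so it has no check.
    """


def input_series(expected):
    """Decorator factory (assumed name): validate the first argument."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(s, *args, **kwargs):
            # Fail early with a precise message instead of a deep pandas error.
            if not isinstance(s, pd.Series) or not expected.check(s):
                raise TypeError(
                    f"{func.__name__} expects a {expected.__name__} as input"
                )
            return func(s, *args, **kwargs)

        return wrapper

    return decorator


@input_series(TextSeries)
def lowercase(s: TextSeries) -> TextSeries:
    return s.str.lower()


@input_series(TextSeries)
def tokenize(s: TextSeries) -> TokenSeries:
    # Whitespace split only; a real tokenizer would do more.
    return s.str.split()
```

In this sketch a plain `pd.Series` of strings is accepted as-is, so users do not have to convert their data to a special class first; the marker classes only carry the check and document the expected kind in the signature.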
Now incorporates suggested changes. Input checking done with pd.api.types.is_string_dtype. Not a permanent solution, will be improved by jbesomi#60 etc. Co-authored-by: Maximilian Krahn
New pull request from jbesomi#46 as we had some Git problems. Input checking done with pd.api.types.is_string_dtype. Not a permanent solution, will be improved by jbesomi#60 etc. Co-authored-by: Maximilian Krahn
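As a rough illustration of the interim check mentioned in these commits, the sketch below wraps `pd.api.types.is_string_dtype` in a helper. The helper name `check_text_input` and the error message are assumptions, not code taken from the pull request.

```python
import pandas as pd
from pandas.api.types import is_string_dtype


def check_text_input(s: pd.Series) -> None:
    """Hypothetical helper: reject input whose dtype is clearly not text."""
    if not is_string_dtype(s):
        raise TypeError(
            f"expected a Pandas Series of strings, got dtype {s.dtype}"
        )
```

The check looks at the dtype rather than at each cell. Depending on the pandas version, an object-dtype Series that holds non-string values may still pass, which is one reason to treat this as a stopgap only.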
…..). (#69)
* First implementation of custom Series types in. File _helper.py implements Series types for the library. Would implement #60.
* Format code with black.
* Re-implement and overhaul Series types. Co-authored-by: Maximilian Krahn
* remove `wrapt` import/dependency
* really remove wrapt dependency
* Implement suggested changes - rename "DocumentRepresentationSeries" to "RepresentationSeries"
Co-authored-by: Henri Froese
Co-authored-by: Maximilian Krahn
Motivation
Having a unified view and a clear idea of the expected Pandas Series input is useful for both users and developers.
Receiving precise and correct errors is very valuable for users, as it makes debugging easy and pleasant. We can summarize three kinds of Pandas Series that a Texthero function can receive as input (or return as output):
Types
* Text Series: every cell has some text
* Tokenized Series: every cell has a list of tokens
* Representation Series: every cell holds float values. This will be improved soon (see issue Support "Pandas Series Representation" #43)

In the best scenario, every Texthero function receives as input a Pandas Series of one of these three kinds. Testing that a given Pandas Series is of the expected kind is therefore useful.
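A minimal sketch of such a test, classifying a Series into one of the three kinds listed above. The function name `series_kind` and the returned labels are hypothetical, and the sketch assumes that a Representation Series stores a list of floats in each cell, which may change with issue #43.

```python
import pandas as pd


def _is_text(cell) -> bool:
    return isinstance(cell, str)


def _is_tokens(cell) -> bool:
    return isinstance(cell, list) and all(isinstance(t, str) for t in cell)


def _is_floats(cell) -> bool:
    # Assumption: a representation cell is a list of floats.
    return isinstance(cell, list) and all(isinstance(x, float) for x in cell)


def series_kind(s: pd.Series) -> str:
    """Return 'text', 'tokenized' or 'representation', else raise TypeError.

    Note: an empty Series, or cells that are empty lists, are ambiguous
    and fall into the first kind whose check passes.
    """
    checks = {
        "text": _is_text,
        "tokenized": _is_tokens,
        "representation": _is_floats,
    }
    for kind, check in checks.items():
        if all(check(cell) for cell in s):
            return kind
    raise TypeError(
        "Series is neither a Text, Tokenized nor Representation Series"
    )
```

Inspecting every cell costs a full pass over the data, so on large inputs a check like this could be limited to a sample of cells at the price of weaker guarantees.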
Go further
* preprocess.py: almost all functions (with the exception of tokenize) take a Pandas Text Series as input and return a Pandas Text Series.
* representation.py: the input is (or will be, see TokenSeries as input to every representation function #44) a Tokenized Pandas Series and the output will be a Representation Pandas Series.
* nlp.py: the input is a Text Pandas Series, whereas the output is TODO.
* visualization.py: TODO.

It would be great to have a unified and clear view of all this:
_helper.py)
Extra
Unfortunately, there are more variants of Pandas Series (output of named_entities, output of pca, ...); there is still some design work to do there ...
Work in progress ...