Kind of Pandas Series #60

jbesomi · 2020-07-10T09:48:53Z

Motivation

Having a unified view and a clear idea of the expected Pandas Series input is useful for both users and developers.

Receiving precise and correct errors is very valuable for users, as it permits easy and pleasant debugging. We can summarize three kinds of Pandas Series that a Texthero function can receive as input (or produce as output):

Types

"Pandas Text Series" --> every cell has some text
"Pandas Tokenized Series" --> every cell has a list of tokens
"Pandas Representation Series" --> every cell is a representation of a text (a list of float values). This will be improved soon (see issue Support "Pandas Series Representation" #43)

In the best scenario, every Texthero function receives as input a Pandas Series of one of these three kinds. Testing that the given Pandas Series is of the expected type is therefore useful.
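As a rough illustration of what such tests could look like, here is a minimal sketch using plain pandas. The function names (is_text_series, is_token_series, is_representation_series) are hypothetical and are not taken from Texthero; the checks simply walk over every cell, so they are O(n) in the length of the Series.

```python
import pandas as pd


def is_text_series(s):
    """Hypothetical check: s is a pd.Series and every cell is a string."""
    return isinstance(s, pd.Series) and all(isinstance(x, str) for x in s)


def is_token_series(s):
    """Hypothetical check: every cell is a list of strings."""
    return isinstance(s, pd.Series) and all(
        isinstance(x, list) and all(isinstance(t, str) for t in x) for x in s
    )


def is_representation_series(s):
    """Hypothetical check: every cell is a list of floats."""
    return isinstance(s, pd.Series) and all(
        isinstance(x, list) and all(isinstance(v, float) for v in x) for x in s
    )


text = pd.Series(["Test", "another text"])
tokens = pd.Series([["word1", "word2"], ["word3"]])
vectors = pd.Series([[0.25, 0.75], [0.5, 0.5]])
```

Note that with these simple definitions an empty Series, or a cell holding an empty list, passes more than one check, which is one of the design questions a real implementation would have to settle.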

Go further

preprocess.py: almost all functions (with the exception of tokenize) take as input a Pandas Text Series and return a Pandas Text Series.
representation.py: the input will be a Tokenized Pandas Series (see TokenSeries as input to every representation function #44) and the output will be a Representation Pandas Series.
nlp.py: the input is a Text Pandas Series, whereas the output is TODO.
visualization.py: TODO.

It would be great to have a unified and clear view of all this:

Every function should check for the right type (we will need to define the "check" function, probably in a new file, something like _helper.py)
Once everything is in place and defined, add to the website (documentation) a clear document that explains all this. It will be so easy to use Texthero then!
New ideas

Extra

Unfortunately, there are more variants of Pandas Series (output of named_entities, output of pca, ...); there is still some design work to do there ...

Work in progress ...

henrifroese · 2020-07-11T21:32:50Z

I just opened a first draft PR in #69 as a step towards implementing this. I'll copy the docstring of the new file _helper.py here:

Hero Series Types

There are different kinds of Pandas Series used in the library, depending on use.
For example, the functions in preprocessing.py usually take as input a Series
where every cell is a string, and return as output a Series where every cell
is a string. To make handling the different types easier (and most importantly
intuitive for users), this file implements the types as subclasses of Pandas
Series and defines functions to check the types.

These are the implemented types:

TextSeries: cells are text (i.e. strings), e.g. "Test"
TokenSeries: cells are lists of tokens (i.e. lists of strings), e.g. ["word1", "word2"]
RepresentationSeries: cells are vector representations of text (see issue Support "Pandas Series Representation" #43), e.g. [0.25, 0.75]

You could now do this:

@OutputSeries(RepresentationSeries)
@InputSeries(TokenSeries)
def tfidf(s: TokenSeries) -> RepresentationSeries:
    ...

The decorators (@...) make Python check whether the input is valid
and transform the output into the correct type,
which leads to simpler code and exception handling (no need to write
"if not is_text_series(s): raise ..." in every function) and easy
modification/expansion later on. An appropriate error is raised automatically
if the input pandas Series does not have a list of words in every cell (as the function expects a TokenSeries).
Users do not have to use
the custom types like TokenSeries themselves! They can just use
a normal Pandas Series, and they can immediately see from
the function header that their input should look / behave like
a TokenSeries, and that their output will be a RepresentationSeries.
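
To make the mechanism concrete, here is a minimal sketch of how an input-checking decorator of this kind could be written. It is not the code from PR #69: the check method, the error message, and the example function count_tokens are assumptions made for illustration, and the output conversion done by OutputSeries is left out.

```python
import functools

import pandas as pd


class TokenSeries(pd.Series):
    """Hypothetical stand-in type: every cell is a list of strings."""

    @staticmethod
    def check(s):
        # Full O(n) property check on an ordinary pd.Series.
        return isinstance(s, pd.Series) and all(
            isinstance(x, list) and all(isinstance(t, str) for t in x) for x in s
        )


def InputSeries(series_type):
    """Sketch of a decorator factory that validates the first argument."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(s, *args, **kwargs):
            # Accept real instances directly, otherwise fall back to the
            # property check so plain pd.Series objects still work.
            if not isinstance(s, series_type) and not series_type.check(s):
                raise TypeError(
                    f"{func.__name__} expects input that behaves like a "
                    f"{series_type.__name__}"
                )
            return func(s, *args, **kwargs)

        return wrapper

    return decorator


@InputSeries(TokenSeries)
def count_tokens(s):
    # Hypothetical example function: number of tokens per cell.
    return s.map(len)
```

With this sketch, count_tokens(pd.Series([["a", "b"], ["c"]])) runs normally, while count_tokens(pd.Series(["a b"])) raises a TypeError before the function body is reached.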

The typing helps users understand the code more easily,
as they'll be able to see immediately from the documentation
on what types of Series a function operates. This is much more
expressive and clearer than e.g. "tfidf(s: pd.Series) -> pd.Series".

Note that users can of course still simply
use ordinary pd.Series objects.
The functions will then just check if the Series could be
e.g. a TextSeries (so it checks the properties) to give maximum flexibility.
The custom types are subclasses of pd.Series anyway. Thus,
the types enable better documentation and expressiveness
of the code and do not mean that a user really has to pass
e.g. a TextSeries; what they pass just has to have the properties
of one.

Example: user has standard pd.Series s and wants to clean the text.
Calling hero.clean(s), the clean function will check whether s
could be a TextSeries. If yes, it proceeds with the cleaning
and returns a TextSeries. If no, an error is thrown with
a good explanation.

Concerning performance, a user might often have a Series s on which
different operations will be performed. The behaviour will be as follows:

s = pd.Series("test")
s = hero.remove_punctuation(s)
# hero.remove_punctuation first checked if s can be a TextSeries.
# That is the case, so the function was applied as usual.
# The output was then transformed to a TextSeries, without
# the user noticing. If now something like this is done:
s = hero.remove_diacritics(s)
# the remove_diacritics function will immediately notice
# that s is a TextSeries, so the check is O(1) through isinstance.

(NOTE: this could lead to problems later on, if e.g. a user
changes s after remove_punctuation: the library then still
treats it as a TextSeries even though the user might have
applied functions from e.g. a different library such that s does not
fulfill the "TextSeries" properties anymore. The error messages
would then not be as good.)
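
The trade-off in this note can be reproduced with a small sketch. The TextSeries class and both helper functions below are hypothetical stand-ins rather than the library's code; the point is only that an isinstance fast path keeps answering "yes" after the cells have been changed.

```python
import pandas as pd


class TextSeries(pd.Series):
    """Hypothetical marker subclass: every cell should be a string."""


def has_text_properties(s):
    # Full O(n) check of the cell contents.
    return all(isinstance(x, str) for x in s)


def could_be_text_series(s):
    # O(1) fast path: trust the type if it is already a TextSeries.
    if isinstance(s, TextSeries):
        return True
    return isinstance(s, pd.Series) and has_text_properties(s)


s = TextSeries(["test", "text"], dtype=object)
s.iloc[0] = 1.0  # the user modifies a cell outside the library

fast = could_be_text_series(s)  # the fast path still trusts the type
full = has_text_properties(s)   # the full check sees the float
```

Here fast is True while full is False, so a function relying on the fast path would only fail later, inside its own logic, with a less helpful error.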

The classes are lightweight subclasses of pd.Series and serve 2 purposes:

Good documentation for users through docstring.
Function(s) to check if a pd.Series has the required properties.

More Examples

import pandas as pd
from texthero._helper import *   # Bad style

@OutputSeries(TextSeries)
@InputSeries(TextSeries)
def do_nothing(s: TextSeries) -> TextSeries:
    return s
t = do_nothing(pd.Series("test"))
t
# 0    test
# dtype: object
type(t)
# TextSeries

do_nothing(pd.Series([1.0]))  # not a TextSeries
# raises TypeError(...)  (error message is good; too long so left out here)
do_nothing("test")  # not a TextSeries
# raises TypeError(...)

These are just some simple examples. As you can see, this makes it easy for the vast majority of functions to implement checking the correct type through the decorators. It also makes it easier for users to use the library as they immediately know what kind of Series they will give as input / receive as output.

Now incorporates suggested changes. Input checking done with pd.api.types.is_string_dtype. Not a permanent solution, will be improved by jbesomi#60 etc. Co-authored-by: Maximilian Krahn

New pull request from jbesomi#46 as we had some Git problems. Input checking done with pd.api.types.is_string_dtype. Not a permanent solution, will be improved by jbesomi#60 etc. Co-authored-by: Maximilian Krahn

…..). (#69)
* First implementation of custom Series types in. File _helper.py implements Series types for the library. Would implement #60 .
* Format code with black.
* Re-implement and overhaul Series types. Co-authored-by: Maximilian Krahn
* remove `wrapt` import/dependency
* really remove wrapt dependency
* Implement suggested changes - rename "DocumentRepresentationSeries" to "RepresentationSeries"
Co-authored-by: Henri Froese
Co-authored-by: Maximilian Krahn

henrifroese · 2020-08-24T16:37:10Z

See #138 #139 #69

jbesomi added the discussion, documentation, and enhancement labels Jul 10, 2020

This was referenced Jul 10, 2020

Check if Series consists of strings only, instead of casting to unicode #55

Closed

Add count_sentences function to nlp.py #51

Merged

henrifroese added a commit to SummerOfCode-NoHate/texthero that referenced this issue Jul 11, 2020

First implementation of custom Series types in.

b01115a

File _helper.py implements Series types for the library. Would implement jbesomi#60 .

henrifroese mentioned this issue Jul 11, 2020

Checking for Series Types (Representation Series, Tokenized Series, ...). #69

Merged

henrifroese mentioned this issue Jul 12, 2020

Implement Automated Readability Index, Closes #20 ; new PR; Waiting until Checking for NaNs is implemented. #74

Draft

This was referenced Jul 14, 2020

👩‍💻 API next steps: checklist #85

Open

Adding the missing arguments to wordcloud func. Closes #77 #96

Merged

This was referenced Jul 24, 2020

How to provide multilingual support #84

Open

Add _check_tokenized function #125

Closed

henrifroese closed this as completed Aug 24, 2020
