Kind of Pandas Series #60

Closed
jbesomi opened this issue Jul 10, 2020 · 2 comments
Labels
discussion To discuss new improvements documentation Improvements or additions to documentation enhancement New feature or request

Comments

@jbesomi
Owner

jbesomi commented Jul 10, 2020

Motivation

Having a unified view and a clear idea of the expected Pandas Series input is useful for both users and developers.

Receiving precise and correct errors is very valuable for users, as it makes debugging easy and pleasant. We can summarize three kinds of Pandas Series that a Texthero function can receive as input (or return as output):

Types

  • "Pandas Text Series" --> every cell has some text
  • "Pandas Tokenized Series" --> every cell has a list of tokens
  • "Pandas Representation Series" --> every cell is a representation of a text (i.e. a list of float values). This will be improved soon (see issue Support "Pandas Series Representation" #43)

In the best scenario, every Texthero function receives as input a Pandas Series of one of these three kinds. Testing that the given Pandas Series is of the expected type is therefore useful.
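As a rough illustration of what such a test could look like, here is a minimal sketch of three predicate functions, one per kind of Series. The function names are hypothetical (they are not taken from Texthero), and the checks simply inspect every cell, so they are O(n) in the length of the Series.

```python
import pandas as pd


def is_text_series(s) -> bool:
    """True if s is a Pandas Series in which every cell is a string."""
    return isinstance(s, pd.Series) and all(isinstance(cell, str) for cell in s)


def is_token_series(s) -> bool:
    """True if s is a Pandas Series in which every cell is a list of strings."""
    return isinstance(s, pd.Series) and all(
        isinstance(cell, list) and all(isinstance(tok, str) for tok in cell)
        for cell in s
    )


def is_representation_series(s) -> bool:
    """True if s is a Pandas Series in which every cell is a list of floats."""
    return isinstance(s, pd.Series) and all(
        isinstance(cell, list) and all(isinstance(x, float) for x in cell)
        for cell in s
    )


# Note: an empty Series (or a cell holding an empty list) passes more than
# one check, so a real implementation would have to decide how to treat it.
text = pd.Series(["Test", "Another text"])
tokens = pd.Series([["word1", "word2"]])
vectors = pd.Series([[0.25, 0.75]])
```

With these helpers, `is_text_series(text)`, `is_token_series(tokens)` and `is_representation_series(vectors)` all hold, while a plain string such as `"test"` is rejected because it is not a Series at all.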

Go further

  • preprocessing.py: almost all functions (with the exception of tokenize) take a Pandas Text Series as input and return a Pandas Text Series.
  • representation.py: the input will be a Pandas Tokenized Series (see issue TokenSeries as input to every representation function #44) and the output will be a Pandas Representation Series.
  • nlp.py: the input is a Pandas Text Series, whereas the output is TODO.
  • visualization.py: TODO.

It would be great to have a unified and clear view of all this:

  1. Every function should check for the right type (we will need to define the "check" function, probably in a new file, something like _helper.py)
  2. Once everything is in place and defined, add to the website (documentation) a clear document that explains all this. It will be so easy to use Texthero then!
  3. New ideas

Extra

Unfortunately, there are more variants of Pandas Series (output of named_entities, output of pca, ...), so there is still some design work to do there ...

Work in progress ...

@jbesomi jbesomi added discussion To discuss new improvements documentation Improvements or additions to documentation enhancement New feature or request labels Jul 10, 2020
henrifroese added a commit to SummerOfCode-NoHate/texthero that referenced this issue Jul 11, 2020
File _helper.py implements Series types for the library.

Would implement jbesomi#60 .
@henrifroese
Collaborator

I just opened a first draft PR in #69 as a step towards implementing this. I'll copy the docstring of the new file _helper.py here:

Hero Series Types

There are different kinds of Pandas Series used in the library, depending on use.
For example, the functions in preprocessing.py usually take as input a Series
where every cell is a string, and return as output a Series where every cell
is a string. To make handling the different types easier (and most importantly
intuitive for users), this file implements the types as subclasses of Pandas
Series and defines functions to check the types.

These are the implemented types:

  • TextSeries: cells are text (i.e. strings), e.g. "Test"
  • TokenSeries: cells are lists of tokens (i.e. lists of strings), e.g. ["word1", "word2"]
  • RepresentationSeries: cells are vector representations of text (see issue Support "Pandas Series Representation" #43), e.g. [0.25, 0.75]

You could now do this:

@OutputSeries(RepresentationSeries)
@InputSeries(TokenSeries)
def tfidf(s: TokenSeries) -> RepresentationSeries:
    ...

The decorators (@...) make Python check whether the input is valid
and transform the output into the correct type,
which leads to simpler code and exception handling (no need to write
"if not is_text_series(s): raise ..." in every function) and easy
modification/expansion later on. The correct error is raised automatically
if the input Pandas Series does not have a list of words in every cell (as a TokenSeries is expected).
Users do not have to use
the custom types like TokenSeries themselves! They can just use
a normal Pandas Series, and they can immediately see from
the function header that their input should look / behave like
a TokenSeries, and that their output will be a RepresentationSeries.
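To make the mechanism above more concrete, here is a minimal sketch of how such decorators might be written. It is not the implementation from the PR: the `check` static method, the error text and the toy `token_lengths` function are assumptions made for illustration; only the names `InputSeries`, `OutputSeries`, `TokenSeries` and `RepresentationSeries` come from the docstring.

```python
import functools

import pandas as pd


class TokenSeries(pd.Series):
    """Every cell is a list of tokens (strings)."""

    @staticmethod
    def check(s) -> bool:
        # Hypothetical property check: a Series whose cells are lists of str.
        return isinstance(s, pd.Series) and all(
            isinstance(cell, list) and all(isinstance(t, str) for t in cell)
            for cell in s
        )


class RepresentationSeries(pd.Series):
    """Every cell is a list of floats representing a text."""


def InputSeries(series_type):
    """Reject the first argument unless it is, or behaves like, series_type."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(s, *args, **kwargs):
            if not isinstance(s, series_type) and not series_type.check(s):
                raise TypeError(
                    f"{func.__name__} expects a Pandas Series that behaves "
                    f"like a {series_type.__name__}."
                )
            return func(s, *args, **kwargs)
        return wrapper
    return decorator


def OutputSeries(series_type):
    """Wrap the returned Series in series_type."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return series_type(func(*args, **kwargs))
        return wrapper
    return decorator


@OutputSeries(RepresentationSeries)
@InputSeries(TokenSeries)
def token_lengths(s: TokenSeries) -> RepresentationSeries:
    # Toy stand-in for a real representation function such as tfidf.
    return s.apply(lambda tokens: [float(len(t)) for t in tokens])


out = token_lengths(pd.Series([["hello", "world"], ["hi"]]))
```

Here the caller passes an ordinary pd.Series; the input decorator validates it and the output decorator returns a RepresentationSeries. Passing `pd.Series(["plain text"])` instead raises a TypeError before the function body runs.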

The typing helps users understand the code more easily,
as they can see immediately from the documentation
which type of Series a function operates on. This is much more
expressive and clearer than e.g. "tfidf(s: pd.Series) -> pd.Series".

Note that users can of course still simply
use ordinary pd.Series objects.
The functions will then just check if the Series could be
e.g. a TextSeries (so it checks the properties) to give maximum flexibility.
The custom types are subclasses of pd.Series anyway. Thus,
the types enable better documentation and expressiveness
of the code and do not mean that a user really has to pass
e.g. a TextSeries; what they pass just has to have the properties
of one.

Example: user has standard pd.Series s and wants to clean the text.
Calling hero.clean(s), the clean function will check whether s
could be a TextSeries. If yes, it proceeds with the cleaning
and returns a TextSeries. If no, an error is thrown with
a good explanation.
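What "a good explanation" could mean in practice is sketched below. The helper name `require_text_series` and the wording of the messages are hypothetical; the idea is only that the error names the calling function and points at the first offending cell.

```python
import pandas as pd


def require_text_series(s, func_name: str) -> None:
    """Raise a descriptive TypeError unless every cell of s is a string."""
    if not isinstance(s, pd.Series):
        raise TypeError(
            f"{func_name} expects a Pandas Series where every cell is a "
            f"string, but received an object of type {type(s).__name__}."
        )
    for index, cell in s.items():
        if not isinstance(cell, str):
            raise TypeError(
                f"{func_name} expects a Pandas Series where every cell is a "
                f"string, but the cell at index {index!r} has type "
                f"{type(cell).__name__}."
            )


def error_message(s) -> str:
    """Return the message produced for s, or an empty string if s is valid."""
    try:
        require_text_series(s, "clean")
    except TypeError as err:
        return str(err)
    return ""


msg_bad_cell = error_message(pd.Series(["ok", 1.0]))
msg_not_series = error_message("test")
msg_valid = error_message(pd.Series(["ok", "also ok"]))
```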

Concerning performance, a user might often have a Series s on which
different operations will be performed. The behaviour will be as follows:

import pandas as pd
import texthero as hero

s = pd.Series("test")
s = hero.remove_punctuation(s)
# hero.remove_punctuation first checked if s can be a TextSeries.
# That is the case, so the function was applied as usual.
# The output was then transformed to a TextSeries, without
# the user noticing. If now something like this is done:
s = hero.remove_diacritics(s)
# the remove_diacritics function will immediately notice
# that s is a TextSeries, so the check is O(1) through isinstance.

(NOTE: this could lead to problems later on, if e.g. a user
changes s after remove_punctuation, then the library still
treats it as a TextSeries even though the user might have
applied functions from e.g. a different library such that s does not
fulfill the "TextSeries" properties anymore. The error messages
would then be not as good.)
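The fast path and the caveat in the note can be shown with a small sketch. The class and function names are stand-ins, and `dtype=object` is set explicitly (an assumption made so that a non-string value can be written into the Series for the demonstration).

```python
import pandas as pd


class TextSeries(pd.Series):
    """Stand-in for the library's TextSeries: every cell should be a string."""


def could_be_text_series(s) -> bool:
    # Full property check: looks at every cell, so it is O(n).
    return isinstance(s, pd.Series) and all(isinstance(c, str) for c in s)


def accepts(s) -> bool:
    # Fast path: an existing TextSeries is trusted without re-checking cells.
    return isinstance(s, TextSeries) or could_be_text_series(s)


s = TextSeries(["test", "more text"], dtype=object)
valid_before = could_be_text_series(s)

# The user changes s in place, outside the library.
s.iloc[0] = 42

still_text_series = isinstance(s, TextSeries)  # the type tag is unchanged
valid_after = could_be_text_series(s)          # the property no longer holds
accepted_after = accepts(s)                    # the fast path still says yes
```

After the in-place change, `accepts(s)` still succeeds although the full check fails, which is exactly the situation in which a later error would surface deeper in the function with a less helpful message.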

The classes are lightweight subclasses of pd.Series and serve 2 purposes:

  1. Good documentation for users through docstring.
  2. Function(s) to check if a pd.Series has the required properties.

More Examples

import pandas as pd
from texthero._helper import InputSeries, OutputSeries, TextSeries

@OutputSeries(TextSeries)
@InputSeries(TextSeries)
def do_nothing(s: TextSeries) -> TextSeries:
    return s
t = do_nothing(pd.Series("test"))
t
# 0    test
# dtype: object
type(t)
# TextSeries

do_nothing(pd.Series([1.0]))  # not a TextSeries
# raises TypeError(...)  (error message is good; too long so left out here)
do_nothing("test")  # not a TextSeries
# raises TypeError(...)

These are just some simple examples. As you can see, this makes it easy for the vast majority of functions to implement checking the correct type through the decorators. It also makes it easier for users to use the library as they immediately know what kind of Series they will give as input / receive as output.

henrifroese added a commit to SummerOfCode-NoHate/texthero that referenced this issue Jul 12, 2020
Now incorporates suggested changes.

Input checking done with pd.api.types.is_string_dtype. Not a
permanent solution, will be improved by jbesomi#60 etc.

Co-authored-by: Maximilian Krahn
henrifroese added a commit to SummerOfCode-NoHate/texthero that referenced this issue Jul 12, 2020
New pull request from jbesomi#46 as we had some Git problems.

Input checking done with pd.api.types.is_string_dtype. Not a
permanent solution, will be improved by jbesomi#60 etc.

Co-authored-by: Maximilian Krahn
jbesomi pushed a commit that referenced this issue Aug 5, 2020
…..). (#69)

* First implementation of custom Series types in.

File _helper.py implements Series types for the library.

Would implement #60 .

* Format code with black.

* Re-implement and overhaul Series types

Co-authored-by: Maximilian Krahn

* remove `wrapt` import/dependency

* really remove wrapt dependency

* Implement suggested changes

- rename "DocumentRepresentationSeries" to "RepresentationSeries"

Co-authored-by: Henri Froese
Co-authored-by: Maximilian Krahn
@henrifroese
Collaborator

See #138 #139 #69
