Check if Series consists of strings only, instead of casting to unicode #55

Closed
henrifroese opened this issue Jul 9, 2020 · 3 comments

@henrifroese
Collaborator

Currently, some functions do not check if the Series they get as input really consists of strings only, and they give unexpected results, e.g. if there are missing values.

Example:

import texthero as hero
import pandas as pd
import numpy as np

s = pd.Series(["Test", np.nan])
hero.noun_chunks(s)
>>0                   []
>>1    [(nan, NP, 0, 3)]

This could be fixed by no longer using s.astype('unicode'), which silently converts missing values to strings (e.g. np.nan -> "nan"). Instead, a function should check whether the Series consists of strings only. Something along the lines of

def _check_series_strings(s):
    # Reject any element that is not a str (np.nan, for example, is a float).
    if not s.map(type).eq(str).all():
        raise TypeError("Non-string values in series. Use hero.drop_no_content(s) to drop those values.")
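To make the idea concrete, here is a minimal, self-contained sketch of such a check. The name `check_is_text_series` and the error message are hypothetical, not part of texthero's API. The Series are built with `dtype=object` so the behaviour does not depend on how a given pandas version infers string dtypes.

```python
import numpy as np
import pandas as pd


def check_is_text_series(s: pd.Series) -> None:
    """Raise TypeError unless every element of `s` is a Python str.

    Hypothetical helper for illustration; not texthero's actual API.
    """
    # Map each element to its type and compare against str element-wise.
    is_str = s.map(type).eq(str)
    if not is_str.all():
        # Report the index labels of the offending values.
        bad_labels = s.index[(~is_str).to_numpy()].tolist()
        raise TypeError(f"Non-string values at index labels {bad_labels}.")


clean = pd.Series(["Test", "More text"], dtype=object)
dirty = pd.Series(["Test", np.nan], dtype=object)

check_is_text_series(clean)  # all elements are str, so nothing is raised

try:
    check_is_text_series(dirty)
    rejected = False
except TypeError as err:
    rejected = True
    message = str(err)
```

Checking `type(x) is str` per element is stricter than inspecting the Series dtype: an object-dtype Series can hold a mix of strings and floats such as np.nan, and only an element-wise check catches that.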
@jbesomi
Owner

jbesomi commented Jul 10, 2020

Amazing!

We might want a different name for this function. If we agree on names for the kinds of pandas Series defined in #60, we could call it _check_is_text_series or something similar.

Can you work on this?

@henrifroese
Collaborator Author

Yes, I'll think it over, leave some comments on #60, and work on this one.

@jbesomi added the enhancement (New feature or request) label Jul 11, 2020
@henrifroese
Collaborator Author

See #60 and related PRs
