Check if Series consists of strings only, instead of casting to unicode #55

henrifroese · 2020-07-09T17:57:17Z

Currently, some functions do not check whether their input Series really consists of strings only, so they give unexpected results, e.g. when there are missing values.

Example:

import texthero as hero
import pandas as pd
import numpy as np

s = pd.Series(["Test", np.nan])
hero.noun_chunks(s)
>>0                   []
>>1    [(nan, NP, 0, 3)]

This could be fixed by no longer using s.astype('unicode'), which e.g. converts np.nan -> "nan". Instead, a function should check whether the Series consists of strings only. Something along the lines of

def _check_series_strings(s):
    # Raise if any value is not a Python str (e.g. np.nan, which is a float).
    if not s.map(type).eq(str).all():
        raise TypeError("Non-string values in series. Use hero.drop_no_content(s) to drop those values.")
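
To make the idea concrete, here is a rough, self-contained sketch of such a check. The name check_is_text_series and the error message are placeholders of mine, not existing texthero API, and I use isinstance rather than comparing types, so str subclasses would also pass. The Series are built with dtype=object so that the values stay plain Python objects.

```python
import numpy as np
import pandas as pd


def check_is_text_series(s: pd.Series) -> None:
    """Raise TypeError unless every value in `s` is a Python str.

    Hypothetical helper for illustration only, not part of texthero.
    """
    is_str = s.map(lambda value: isinstance(value, str))
    if not is_str.all():
        # Report where the offending values sit so they are easy to find.
        bad_positions = [i for i, ok in enumerate(is_str) if not ok]
        raise TypeError(f"Non-string values at positions {bad_positions}.")


clean = pd.Series(["Test", "More text"], dtype=object)
dirty = pd.Series(["Test", np.nan], dtype=object)

# A Series of strings passes silently.
check_is_text_series(clean)

# A Series with a missing value is rejected instead of being cast.
try:
    check_is_text_series(dirty)
    rejected = False
except TypeError as err:
    rejected = True
    message = str(err)

# Why casting is risky: the string form of a missing value is ordinary text.
nan_as_text = str(np.nan)
```

Failing loudly like this leaves it to the caller to decide whether to drop or fill missing values, instead of letting the text "nan" flow into later steps unnoticed.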


jbesomi · 2020-07-10T10:04:32Z

Amazing!

We might want a different name for this function. Once we agree on names for the kinds of pandas Series defined in #60, we could call it _check_is_text_series or something like that.

Can you work on this?

henrifroese · 2020-07-10T16:26:29Z

Yes, I'll think about / make some comments at #60 and work on this one.

henrifroese · 2020-08-24T16:35:08Z

See #60 and related PRs

henrifroese mentioned this issue Jul 9, 2020

Add count_sentences function to nlp.py #51

Merged

jbesomi assigned henrifroese Jul 11, 2020

jbesomi added the enhancement label Jul 11, 2020

jbesomi mentioned this issue Jul 14, 2020

👩‍💻 API next steps: checklist #85

Open

17 tasks

henrifroese closed this as completed Aug 24, 2020
