Add count_sentences function to nlp.py #51

henrifroese · 2020-07-09T10:59:05Z

Also add tests for the function to test_nlp.py
The function is implemented with spaCy, similarly to the other functions in the nlp module.
In response to #46

Also add tests for the function to test_nlp.py

jbesomi · 2020-07-09T11:36:51Z

It looks super nice! (and cool!)

This is definitely useful (Texthero's users will like it)

Some small changes / comments:

You did a great job! 👍
For consistency, on tests/test_nlp.py "Number of sentences." should be replace with "Count sentences" and the same for the test functions names.
Great that you did "return pd.Series(..., s=s.index)". Forgetting s.index is a common mistake and it leads to problems.
---> We need to test this; on every Texthero's function (this will take a while ...). Can you please add a unit test for this case? You just need to define as input a Pandas Series with index not reset.
---> Then, we will have to open a new issue and mention that we should test this on all Texthero's function that receives as input a Series and return a Series (almost all). You want to do it? I can review it later if you wish. Probably, we can add an extra test_x (name to be decided) file on the tests folder that we text for this kind of thing.
Top that you select in the pipeline only the sentencizer
s.astype("unicode").values I forgot why we should add this, can you explain it? If we just pass s.values, what does it happen? this might help
---> Also there we need to add unit-test to check this edge-cases
"Return a new Pandas Series with the results." is not very informative. Also, the body of the docstring and the head title should be separated by an empty line. Have a look at the numpydoc docstring standard section for hints.
pd.api.types.is_string_dtype, as we might to use this more than in one place, maybe from pd.api.types import is_string_dtype is a valid alternative.
---> Great catch anyway! We will need to add this kind of condition to other functions as well (will open an issue soon).

henrifroese · 2020-07-09T17:08:35Z

Thanks, I've implemented the suggestions. I opened the issue mentioned in 3 at #54 . About 5, I thought about it, and python already supports unicode in normal strings, so we actually do not need this. It rather is dangerous (it's also used in some other functions like noun_chunks) as e.g. we have

import pandas as pd
import numpy as np
s = pd.Series([np.nan])
s.values.astype('unicode')[0]
>> "nan"

so it converts the np.nan object that's used internally for missing values to the literal string "nan". This is certainly not what the user wants/expects.

Using pd.api.types.is_string_dtype(s) will not work, as it shows the following behaviour:

import pandas as pd
import numpy as np
s = pd.Series(["Test", 0])
t = pd.Series(["Test", np.nan])
pd.api.types.is_string_dtype(s)
>> True
pd.api.types.is_string_dtype(t)
>> True

Something like this

def _check_series_strings(s):
if not df.map(type).eq(str).all():
    raise TypeError("Non-string values in series. Use hero.drop_no_content(s) to drop those values.")

should work and is very fast.

So in total,

changes to count_sentences are implemented. This does not use .astype('unicode') anymore but instead lets the exception be thrown by spaCy for the time being (it's already correctly a TypeError, it will just be nicer if texthero throws it. Same goes for checking the index, I added a test case for that, but Add tests for (and implement when missing) using the input's index when returning a new Series. #54 will do this better). Another issue/PR (next bullet point) will overhaul this for all functions where it makes sense.
I opend issue Check if Series consists of strings only, instead of casting to unicode #55 to implement a function like above to check for strings and call this function where it makes sense.

Add more tests, change style (docstring, tests naming). Remove unicode-casting to avoid unexpected behaviour.

jbesomi · 2020-07-10T10:08:01Z

Very good job and a very nice catch!

By the way, what's your name? I suggest to add a photo profile on your Github profile and also have a more readable name ... it might be good pub for you!

Thank you for opening the issue Add tests for (and implement when missing) using the input's index when returning a new Series. #54, amazing.
We might add a "See also" on the docstring where we reference the spacy sentencizer. This is a bit tricky, but a good learning :) you will need to edit intersphinx_mapping dict of conf.py of the Sphinx configuration file (look under docs). For more info, you can refer to sphinx.ext.intersphinx. Have you ever used Sphinx before? At the beginning is quite hard to use as the documentation is not particularly readable, but after a while, things will start making sense :). You will need to check that by running the website locally, again, some good learning 👍
Is the term item the most appropriate? I never know how to precisely define a single cell of a Pandas Series ...

How Texthero's should throw errors

I opened an issue regarding that, please have a look at #60. We can either deal with that afterward this PR and merge it or again put it in draft mode and fix (?) #60 first ...

henrifroese · 2020-07-10T16:22:23Z

We might add a "See also" on the docstring where we reference the spacy sentencizer.

I couldn't yet get the website to work locally with the updated documentation (running the website works, but the script update_documentation.sh gives me errors - maybe that's not the correct way to update the documentation or I need to do something else first?), so I added a direct link to the spacy sentencizer documentation that will render as hyperlink in sphinx (should have the same effect).

I opened an issue regarding that, please have a look at #60. We can either deal with that afterward this PR and merge it or again put it in draft mode and fix (?) #60 first ...

I'd suggest not putting this PR on hold as #60 could take quite some time until everything surrounding that is finished. When implementing #60 and related issues like #55 most functions will need to be overhauled anyway.

Is the term item the most appropriate?

I've changed it to "cell", that should be intuitive and I saw it's already used in some other locations in texthero.

Also, I implemented this 🤓 :

I suggest to add a photo profile on your Github profile and also have a more readable name

Additionally update index tests, they're cleaner now.

jbesomi · 2020-07-11T08:44:45Z

Thank you!

All right, we will change it later on all functions.

Regarding spaCy, we should do a comparison study and see if effectively using spacy pipe is faster than pandas apply. We will need to test this on a large dataset. Also, batch_size to 32 has been chosen quite randomly, we will need to test different values (or even approaches for that). I will open an issue now.

You can now add another PR where you update the README by adding your name under contributors! 🎉🎉 congrats and welcome!

* Add count_sentences function to nlp.py Also add tests for the function to test_nlp.py * Implement suggestions from pull request. Add more tests, change style (docstring, tests naming). Remove unicode-casting to avoid unexpected behaviour. * Add link to spacy documentation. Additionally update index tests, they're cleaner now. Co-authored-by: Henri Froese <[email protected]>

Add count_sentences function to nlp.py

dc27f1c

Also add tests for the function to test_nlp.py

vercel bot deployed to Preview July 9, 2020 10:59 View deployment

henrifroese mentioned this pull request Jul 9, 2020

Add Automated Readability Index (ARI). Closes #20 #46

Closed

henrifroese marked this pull request as ready for review July 9, 2020 11:03

Implement suggestions from pull request.

e90227a

Add more tests, change style (docstring, tests naming). Remove unicode-casting to avoid unexpected behaviour.

vercel bot deployed to Preview July 9, 2020 17:13 View deployment

Add link to spacy documentation.

e0639d6

Additionally update index tests, they're cleaner now.

vercel bot deployed to Preview July 10, 2020 16:25 View deployment

jbesomi merged commit 452c93c into jbesomi:master Jul 11, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add count_sentences function to nlp.py #51

Add count_sentences function to nlp.py #51

henrifroese commented Jul 9, 2020

jbesomi commented Jul 9, 2020

henrifroese commented Jul 9, 2020 •

edited

Loading

jbesomi commented Jul 10, 2020

henrifroese commented Jul 10, 2020

jbesomi commented Jul 11, 2020

Add count_sentences function to nlp.py #51

Add count_sentences function to nlp.py #51

Conversation

henrifroese commented Jul 9, 2020

jbesomi commented Jul 9, 2020

henrifroese commented Jul 9, 2020 • edited Loading

jbesomi commented Jul 10, 2020

henrifroese commented Jul 10, 2020

jbesomi commented Jul 11, 2020

henrifroese commented Jul 9, 2020 •

edited

Loading