Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add count_sentences function to nlp.py #51

Merged
merged 3 commits into from
Jul 11, 2020

Conversation

henrifroese
Copy link
Collaborator

Also add tests for the function to test_nlp.py
The function is implemented with spaCy, similarly to the other functions in the nlp module.
In response to #46

Also add tests for the function to test_nlp.py
@jbesomi
Copy link
Owner

jbesomi commented Jul 9, 2020

It looks super nice! (and cool!)

This is definitely useful (Texthero's users will like it)

Some small changes / comments:

  1. You did a great job! 👍
  2. For consistency, on tests/test_nlp.py "Number of sentences." should be replace with "Count sentences" and the same for the test functions names.
  3. Great that you did "return pd.Series(..., s=s.index)". Forgetting s.index is a common mistake and it leads to problems.
    ---> We need to test this; on every Texthero's function (this will take a while ...). Can you please add a unit test for this case? You just need to define as input a Pandas Series with index not reset.
    ---> Then, we will have to open a new issue and mention that we should test this on all Texthero's function that receives as input a Series and return a Series (almost all). You want to do it? I can review it later if you wish. Probably, we can add an extra test_x (name to be decided) file on the tests folder that we text for this kind of thing.
  4. Top that you select in the pipeline only the sentencizer
  5. s.astype("unicode").values I forgot why we should add this, can you explain it? If we just pass s.values, what does it happen? this might help
    ---> Also there we need to add unit-test to check this edge-cases
  6. "Return a new Pandas Series with the results." is not very informative. Also, the body of the docstring and the head title should be separated by an empty line. Have a look at the numpydoc docstring standard section for hints.
  7. pd.api.types.is_string_dtype, as we might to use this more than in one place, maybe from pd.api.types import is_string_dtype is a valid alternative.
    ---> Great catch anyway! We will need to add this kind of condition to other functions as well (will open an issue soon).

@henrifroese
Copy link
Collaborator Author

henrifroese commented Jul 9, 2020

Thanks, I've implemented the suggestions. I opened the issue mentioned in 3 at #54 . About 5, I thought about it, and python already supports unicode in normal strings, so we actually do not need this. It rather is dangerous (it's also used in some other functions like noun_chunks) as e.g. we have

import pandas as pd
import numpy as np
s = pd.Series([np.nan])
s.values.astype('unicode')[0]
>> "nan"

so it converts the np.nan object that's used internally for missing values to the literal string "nan". This is certainly not what the user wants/expects.

Using pd.api.types.is_string_dtype(s) will not work, as it shows the following behaviour:

import pandas as pd
import numpy as np
s = pd.Series(["Test", 0])
t = pd.Series(["Test", np.nan])
pd.api.types.is_string_dtype(s)
>> True
pd.api.types.is_string_dtype(t)
>> True

Something like this

def _check_series_strings(s):
if not df.map(type).eq(str).all():
    raise TypeError("Non-string values in series. Use hero.drop_no_content(s) to drop those values.")

should work and is very fast.

So in total,

Add more tests, change style (docstring, tests naming).
Remove unicode-casting to avoid unexpected behaviour.
@jbesomi
Copy link
Owner

jbesomi commented Jul 10, 2020

Very good job and a very nice catch!

By the way, what's your name? I suggest to add a photo profile on your Github profile and also have a more readable name ... it might be good pub for you!

  1. Thank you for opening the issue Add tests for (and implement when missing) using the input's index when returning a new Series. #54, amazing.
  2. We might add a "See also" on the docstring where we reference the spacy sentencizer. This is a bit tricky, but a good learning :) you will need to edit intersphinx_mapping dict of conf.py of the Sphinx configuration file (look under docs). For more info, you can refer to sphinx.ext.intersphinx. Have you ever used Sphinx before? At the beginning is quite hard to use as the documentation is not particularly readable, but after a while, things will start making sense :). You will need to check that by running the website locally, again, some good learning 👍
  3. Is the term item the most appropriate? I never know how to precisely define a single cell of a Pandas Series ...

How Texthero's should throw errors

I opened an issue regarding that, please have a look at #60. We can either deal with that afterward this PR and merge it or again put it in draft mode and fix (?) #60 first ...

@henrifroese
Copy link
Collaborator Author

We might add a "See also" on the docstring where we reference the spacy sentencizer.

I couldn't yet get the website to work locally with the updated documentation (running the website works, but the script update_documentation.sh gives me errors - maybe that's not the correct way to update the documentation or I need to do something else first?), so I added a direct link to the spacy sentencizer documentation that will render as hyperlink in sphinx (should have the same effect).

I opened an issue regarding that, please have a look at #60. We can either deal with that afterward this PR and merge it or again put it in draft mode and fix (?) #60 first ...

I'd suggest not putting this PR on hold as #60 could take quite some time until everything surrounding that is finished. When implementing #60 and related issues like #55 most functions will need to be overhauled anyway.

Is the term item the most appropriate?

I've changed it to "cell", that should be intuitive and I saw it's already used in some other locations in texthero.

Also, I implemented this 🤓 :

I suggest to add a photo profile on your Github profile and also have a more readable name

Additionally update index tests, they're cleaner now.
@jbesomi
Copy link
Owner

jbesomi commented Jul 11, 2020

Thank you!

All right, we will change it later on all functions.

Regarding spaCy, we should do a comparison study and see if effectively using spacy pipe is faster than pandas apply. We will need to test this on a large dataset. Also, batch_size to 32 has been chosen quite randomly, we will need to test different values (or even approaches for that). I will open an issue now.

You can now add another PR where you update the README by adding your name under contributors! 🎉🎉 congrats and welcome!

henrifroese added a commit to SummerOfCode-NoHate/texthero that referenced this pull request Jul 12, 2020
* Add count_sentences function to nlp.py

Also add tests for the function to test_nlp.py

* Implement suggestions from pull request.

Add more tests, change style (docstring, tests naming).
Remove unicode-casting to avoid unexpected behaviour.

* Add link to spacy documentation.

Additionally update index tests, they're cleaner now.

Co-authored-by: Henri Froese <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants