Add support for other NLP preprocessors (CoreNLP, nltk) #1418
Hello, I may have a need for this in the future and may get some time to contribute. I am not sure what you mean by 'pattern matching' the SpacyPreprocessor: do we want to rebuild a […]

For instance, when defining an augmentation function like this for spaCy:

```python
from snorkel.augmentation import transformation_function
from snorkel.preprocess.nlp import SpacyPreprocessor

spacy_proc = SpacyPreprocessor(text_field="text", doc_field="doc")


@transformation_function(pre=[spacy_proc])
def swap_adjectives(x):
    adjective_idxs = [i for i, token in enumerate(x.doc) if token.pos_ == "ADJ"]
    ...  # modify adjectives
    return x
```

Let's say we want to replace […]
In any case, NLTK does not have a pipeline concept (correct me if I'm wrong), so it would have to be specified. Maybe you have already thought about the design of this. Let me know if there are any other important points to consider. EDITED: I mistakenly switched option 1 and option 2 at the end of the last paragraph.
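A minimal sketch of what option 2 could look like with NLTK, built on Snorkel's generic `preprocessor` decorator; the function names and the `tokens` / `pos_tags` fields are illustrative assumptions, not an agreed API:

```python
import nltk
from snorkel.labeling import labeling_function
from snorkel.preprocess import preprocessor

# Assumes the usual NLTK resources are installed, e.g.
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")


@preprocessor()
def nltk_proc(x):
    # Option 2: expose NLTK-specific fields directly on the data point,
    # leaving it to the LF author to know what those fields mean.
    x.tokens = nltk.word_tokenize(x.text)
    x.pos_tags = nltk.pos_tag(x.tokens)  # list of (token, Penn Treebank tag) pairs
    return x


@labeling_function(pre=[nltk_proc])
def has_adjective(x):
    # NLTK emits Penn Treebank tags ("JJ", "JJR", "JJS") rather than spaCy's
    # coarse "ADJ", so the LF must be written against the chosen preprocessor.
    return 1 if any(tag.startswith("JJ") for _, tag in x.pos_tags) else -1
```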
Hi @cyrilou242, thanks for your post! I think you've outlined the tradeoffs well. Because each preprocessor produces potentially different fields, and even fields with the same high-level "type" (such as NER tags) can have different cardinalities and tag definitions between processors, I think option 2 is the safer choice: in your function you'll need to use the field names specific to the preprocessor you chose, and we assume that you've done your homework to understand what each field means. If you're able to find the time to give this a shot, feel free to post intermediate thoughts here along the way so we can talk over any additional design considerations like the one you brought up; it's much easier to have these conversations early!
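As a concrete illustration of that mismatch (assuming the standard spaCy `en_core_web_sm` model and the NLTK tokenizer, tagger, and NE-chunker resources are installed), the two libraries tag the same sentence with different entity label sets:

```python
import nltk
import spacy

text = "Barack Obama visited Paris in 2015."

# spaCy (en_core_web_sm): OntoNotes-style labels such as PERSON, GPE, DATE.
nlp = spacy.load("en_core_web_sm")
print([(ent.text, ent.label_) for ent in nlp(text).ents])

# NLTK: a smaller ACE-style label set (PERSON, GPE, ORGANIZATION, ...) attached
# to chunk subtrees instead of character spans.
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
print([(" ".join(word for word, _ in subtree.leaves()), subtree.label())
       for subtree in tree.subtrees() if subtree.label() != "S"])
```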
Thanks for the reply, I agree with option 2.
spaCy is great as a preprocessor for NLP labeling functions, but there are other libraries that individuals may want to use.
Ideally, we'd like to have wrappers for other packages as well, such as Stanford CoreNLP (https://stanfordnlp.github.io/stanfordnlp/) and NLTK (https://www.nltk.org/). We can pattern match on the SpacyPreprocessor, and then ultimately give the `nlp_labeling_function` decorator a keyword argument where the user can specify which preprocessor to use.
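For reference, a minimal example of the existing spaCy-backed decorator this would generalize, assuming a data point with a `text` field and the spaCy English model installed; the backend-selection keyword itself is the proposal here, not an existing argument:

```python
from snorkel.labeling.lf.nlp import nlp_labeling_function

ABSTAIN = -1
PERSON = 1


# Today nlp_labeling_function always runs a SpacyPreprocessor and exposes the
# parsed document as x.doc; the proposal is a keyword argument on this
# decorator for choosing CoreNLP or NLTK instead (hypothetical, not yet real).
@nlp_labeling_function()
def mentions_person(x):
    return PERSON if any(ent.label_ == "PERSON" for ent in x.doc.ents) else ABSTAIN
```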