Add support for other NLP preprocessors (CoreNLP, nltk) #1418
Hello, I may have a need for this in the future and may get some time to contribute. I am not sure what you mean by 'pattern matching' the SpacyPreprocessor: do we want to rebuild a […]

For instance, when defining an augmentation function like this for spaCy:

```python
from snorkel.augmentation import transformation_function
from snorkel.preprocess.nlp import SpacyPreprocessor

spacy_proc = SpacyPreprocessor(text_field="text", doc_field="doc")


@transformation_function(pre=[spacy_proc])
def swap_adjectives(x):
    adjective_idxs = [i for i, token in enumerate(x.doc) if token.pos_ == "ADJ"]
    ...  # modify adjectives
    return x
```

Let's say we want to replace […]
In any case, NLTK does not have a pipeline concept (correct me if I'm wrong), so it would have to be specified. Maybe you have already thought about the design of this. Let me know if there are any other important points to consider. EDITED: I mistakenly switched option 1 and option 2 at the end of the last paragraph.
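A minimal sketch of what option 2 could look like with NLTK, built on Snorkel's generic `preprocessor` decorator; the function names and the `tokens` / `pos_tags` fields are illustrative assumptions, not an agreed API:

```python
import nltk
from snorkel.labeling import labeling_function
from snorkel.preprocess import preprocessor

# Assumes the usual NLTK resources are installed, e.g.
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")


@preprocessor()
def nltk_proc(x):
    # Option 2: expose NLTK-specific fields directly on the data point,
    # leaving it to the LF author to know what those fields mean.
    x.tokens = nltk.word_tokenize(x.text)
    x.pos_tags = nltk.pos_tag(x.tokens)  # list of (token, Penn Treebank tag) pairs
    return x


@labeling_function(pre=[nltk_proc])
def has_adjective(x):
    # NLTK emits Penn Treebank tags ("JJ", "JJR", "JJS") rather than spaCy's
    # coarse "ADJ", so the LF must be written against the chosen preprocessor.
    return 1 if any(tag.startswith("JJ") for _, tag in x.pos_tags) else -1
```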
Hi @cyrilou242, thanks for your post! I think you've outlined the tradeoffs well. Because each preprocessor produces potentially different fields, and even fields with the same high-level "type" (such as NER tags) can have different cardinalities and tag definitions between processors, I think option 2 is the safer choice: in your function you'll need to use the field names specific to the preprocessor you chose, and we assume that you've done your homework to understand what each field means. If you're able to find the time to give this a shot, feel free to post intermediate thoughts here along the way so we can talk over any additional design considerations like the one you brought up; it's much easier to have these conversations early!
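As a concrete illustration of that mismatch (assuming the standard spaCy `en_core_web_sm` model and the NLTK tokenizer, tagger, and NE-chunker resources are installed), the two libraries tag the same sentence with different entity label sets:

```python
import nltk
import spacy

text = "Barack Obama visited Paris in 2015."

# spaCy (en_core_web_sm): OntoNotes-style labels such as PERSON, GPE, DATE.
nlp = spacy.load("en_core_web_sm")
print([(ent.text, ent.label_) for ent in nlp(text).ents])

# NLTK: a smaller ACE-style label set (PERSON, GPE, ORGANIZATION, ...) attached
# to chunk subtrees instead of character spans.
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
print([(" ".join(word for word, _ in subtree.leaves()), subtree.label())
       for subtree in tree.subtrees() if subtree.label() != "S"])
```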
Thanks for the reply, I agree with option 2.
spaCy is great as a preprocessor for NLP labeling functions, but there are other libraries that individuals may want to use.
Ideally, we'd like to have wrappers for other packages as well, such as Stanford CoreNLP (https://stanfordnlp.github.io/stanfordnlp/) and NLTK (https://www.nltk.org/). We can pattern match on the SpacyPreprocessor, and then ultimately give the `nlp_labeling_function` decorator a keyword argument where the user can specify which preprocessor to use.
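For reference, a minimal example of the existing spaCy-backed decorator this would generalize, assuming a data point with a `text` field and the spaCy English model installed; the backend-selection keyword itself is the proposal here, not an existing argument:

```python
from snorkel.labeling.lf.nlp import nlp_labeling_function

ABSTAIN = -1
PERSON = 1


# Today nlp_labeling_function always runs a SpacyPreprocessor and exposes the
# parsed document as x.doc; the proposal is a keyword argument on this
# decorator for choosing CoreNLP or NLTK instead (hypothetical, not yet real).
@nlp_labeling_function()
def mentions_person(x):
    return PERSON if any(ent.label_ == "PERSON" for ent in x.doc.ents) else ABSTAIN
```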