Transformer based NER model #2
Let me fill in the previous context.

6/24: I ran into a challenge, so I asked my mentors for help and switched to Keras because I am more familiar with it.

6/25: I successfully changed the model output to categorical mode via Keras. I first referenced this tutorial, then changed the classical NER model's loss to CategoricalCrossentropy and transformed train_y into one-hot encoding format. Now the model is training! I can't wait to validate it and create a multi-label dataset to verify whether this method works. I think I can do it since I first learned deep learning via Keras. Next, I need to figure out how to do the same thing in PyTorch, so I will start reading its documentation. If you have any clue about how to do the same setup in PyTorch, please let me know. I hope I can build one in PyTorch because Keras is CUDA version-specific and eats all my memory, so it is not a good tool for long-term production use. Feel free to comment on this approach (and the code).

6/26: I keep learning and looking into how to implement this in PyTorch. I will also take a look at spaCy. |
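Not the notebook code itself, but a minimal sketch of the loss/label change described above, assuming a toy token-classification model (the layer sizes, `num_tags`, and `vocab_size` are placeholders; the actual notebook fine-tunes a transformer rather than a BiLSTM):

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

num_tags = 10       # assumption: number of NER tag classes
vocab_size = 20000  # assumption: tokenizer vocabulary size
max_len = 128

# Toy token-classification model: one softmax over tags per token position.
inputs = keras.Input(shape=(max_len,))
x = layers.Embedding(vocab_size, 64)(inputs)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
outputs = layers.Dense(num_tags, activation="softmax")(x)
model = keras.Model(inputs, outputs)

# CategoricalCrossentropy expects one-hot targets of shape (batch, max_len, num_tags).
# (In PyTorch, torch.nn.CrossEntropyLoss takes integer class ids instead of one-hot.)
model.compile(
    optimizer="adam",
    loss=keras.losses.CategoricalCrossentropy(),
    metrics=["accuracy"],
)

# train_y holds integer tag ids of shape (num_samples, max_len);
# convert to one-hot before fitting, e.g.:
# train_y_onehot = tf.keras.utils.to_categorical(train_y, num_classes=num_tags)
# model.fit(train_x, train_y_onehot, batch_size=32, epochs=3)
```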
Hello @EasonC13. Using the same entity for multiple labels is something that I think may not be possible within a single BERT model, or at least I am not aware of clean ways to do it. However, you can try some hack-style approaches like:
If you go for the second approach, you can try using:
If you do adapter tuning, it means that you would have to:
Each individual model keeps BERT's pretrained weights frozen, which means "adapter-based tuning requires training two orders of magnitude fewer parameters compared to fine-tuning, while attaining similar performance". The individual models or adapters can then be combined using AdapterFusion:
-- Check the documentation on how to do adapter tuning and use AdapterFusion: https://docs.adapterhub.ml/
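A hedged sketch of what adapter tuning plus AdapterFusion could look like with the `adapter-transformers` library (the label names are hypothetical, and the exact API differs between library versions):

```python
from transformers import AutoModelWithHeads
from transformers.adapters.composition import Fuse

model = AutoModelWithHeads.from_pretrained("bert-base-cased")

# One adapter + tagging head per label type (label names are hypothetical).
for name in ["party", "date", "amount"]:
    model.add_adapter(name)
    model.add_tagging_head(name, num_labels=2)

# Train one adapter at a time: BERT's pretrained weights (and the other
# adapters) stay frozen; only the "party" adapter and its head are updated.
model.train_adapter("party")
model.set_active_adapters("party")
# ... run a normal Hugging Face training loop / Trainer here ...

# Later, the trained adapters can be combined with AdapterFusion.
model.add_adapter_fusion(Fuse("party", "date", "amount"))
model.set_active_adapters(Fuse("party", "date", "amount"))
model.train_adapter_fusion(Fuse("party", "date", "amount"))
```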
Hi @walter-hernandez, I think I can use N adapters for N labels, with each adapter specializing in one data type. That way, when a user makes a correction, it is easy to retrain one adapter without interrupting the others. I'm trying to implement it now. However, it seems adapters can only be trained with the same output labels, so it is a bit of a challenge to set up an efficient training pipeline; otherwise I would have to train N times for N adapters. I have come up with a plan for efficient training: train all the wanted adapters together with one label (e.g. 1), whether or not it is the correct label, but only do gradient descent on the adapters for which that label is correct; then train with the opposite label (e.g. 0) and do gradient descent on the others. First, though, I will build a prototype with N adapters for N labels on a basic NER dataset, to prove that adapters will work for this job (a rough sketch follows below). What do you think about it? |
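A rough sketch of the one-adapter-per-label idea under the same assumptions as above (hypothetical label names; tag-to-wordpiece alignment and batching are omitted, and the `adapter-transformers` API may differ by version). Each adapter is a binary tagger for its own label, so a user correction only triggers retraining of that one adapter:

```python
import torch
from transformers import AutoTokenizer, AutoModelWithHeads

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelWithHeads.from_pretrained("bert-base-cased")

for name in ["party", "date"]:                      # hypothetical label types
    model.add_adapter(name)
    model.add_tagging_head(name, num_labels=2)      # 1 = token has this label, 0 = it does not

def retrain_adapter(name, texts, binary_tags, epochs=1, lr=1e-4):
    """Fine-tune one adapter on its own binary tags; the other adapters stay untouched."""
    model.train_adapter(name)                       # freezes BERT and the other adapters
    model.set_active_adapters(name)
    optim = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr
    )
    model.train()
    for _ in range(epochs):
        for text, tags in zip(texts, binary_tags):
            enc = tokenizer(text, return_tensors="pt", truncation=True)
            labels = torch.tensor([tags])           # assumes tags already aligned to word pieces
            loss = model(**enc, labels=labels).loss
            loss.backward()
            optim.step()
            optim.zero_grad()
```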
@EasonC13 how did this go?
Discussion 🗣
Hello all!
Continuing a discussion with @EasonC13 and @niallroche, I am opening this thread to keep track of the different approaches to fine-tune a transformer model (BERT or some variation of it like ALBERT) and its usage with Snorkel.
Context
@EasonC13 already did some work to generate a dataset with multiple NER labels using Keras here: https://github.com/accordproject/labs-cicero-classify/blob/dev/Practice/keras/keras_decompose_NER_model.ipynb
To replicate the above, we can:
Detailed Description
If we go with spaCy, Snorkel has compatibility with it out of the box. However, that integration is limited to spaCy v2; depending on our needs, we could open a pull request to support spaCy v3 in Snorkel, although we can also proceed without doing so.
Either way, we can wrap our fine-tuned transformer model as a custom labelling function while using Snorkel's built-in spaCy preprocessor for the preprocessing needed (see the sketch below).
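A hedged sketch of what such labelling functions could look like with Snorkel's spaCy preprocessor; the label ids, the `PARTY` class, the `ORG` heuristic, and `ner_pipeline` are illustrative placeholders, not existing project code:

```python
from snorkel.labeling import labeling_function
from snorkel.preprocess.nlp import SpacyPreprocessor

ABSTAIN, PARTY = -1, 0  # hypothetical label ids

# Snorkel's built-in spaCy preprocessor (tied to spaCy v2, as noted above).
spacy_pre = SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)

@labeling_function(pre=[spacy_pre])
def lf_spacy_org_is_party(x):
    """Vote PARTY if spaCy tags any ORG entity in the clause text."""
    return PARTY if any(ent.label_ == "ORG" for ent in x.doc.ents) else ABSTAIN

# A fine-tuned transformer can be wrapped the same way; ner_pipeline is a
# hypothetical Hugging Face token-classification pipeline loaded elsewhere, e.g.:
# ner_pipeline = transformers.pipeline(
#     "token-classification", model="path/to/fine-tuned-model",
#     aggregation_strategy="simple")
@labeling_function()
def lf_transformer_party(x):
    preds = ner_pipeline(x.text)
    return PARTY if any(p["entity_group"] == "PARTY" for p in preds) else ABSTAIN
```

Because the transformer-based labelling function is much more expensive per call than the spaCy heuristic, it is worth memoizing or batching its predictions when applying labelling functions over a large corpus.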
Another thing to consider is how to do inference, keeping in mind the high run-time cost of a transformer model in production when the fine-tuned transformer model is used as a labelling function: