This repository contains code for training and evaluating a Named Entity Recognition (NER) model on the MultiNERD dataset.
The goal is to develop two models that can identify and classify entities in:

- All 10 languages, but only 5 categories (Person, Organization, Location, Disease, and Animal)
- All 15 categories, but only using the English language
```shell
python main.py --learning-rate 1e-4 --batch-size 64 --epochs 5 --model-name bert-base-uncased --language-filter en
torchrun --nproc_per_node 4 main.py --gpu --learning-rate 5e-5 --model-name roberta-large --categories PER ORG
```
- `--learning-rate`
  - Description: Learning rate for the optimizer.
  - Type: Float
  - Default: `5e-5`
  - Example: `python main.py --learning-rate 1e-4`
- `--batch-size`
  - Description: Batch size for training.
  - Type: Integer
  - Default: `32`
  - Example: `python main.py --batch-size 64`
- `--epochs`
  - Description: Number of training epochs.
  - Type: Integer
  - Default: `1`
  - Example: `python main.py --epochs 5`
- `--model-name`
  - Description: Specify which model to fine-tune.
  - Type: String
  - Default: `"prajjwal1/bert-tiny"`
  - Example: `python main.py --model-name "bert-base-uncased"`
- `--language-filter`
  - Description: When specified, all other languages are filtered out of the dataset.
  - Type: String
  - Default: `None`
  - Example: `python main.py --language-filter en`
- `--categories`
  - Description: When specified, all other categories are filtered out of the dataset. Provide a whitespace-separated list of categories.
  - Type: List of Strings
  - Default: `None`
  - Example: `python main.py --categories PER ORG LOC`

NOTE: You cannot use `--language-filter` together with `--categories`.
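The flag behavior above, including the restriction that `--language-filter` and `--categories` cannot be combined, could be wired up with `argparse` roughly as follows. This is a hypothetical sketch; `main.py` may implement the parser differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of the CLI described above.
    parser = argparse.ArgumentParser(description="Fine-tune a NER model on MultiNERD")
    parser.add_argument("--learning-rate", type=float, default=5e-5,
                        help="Learning rate for the optimizer.")
    parser.add_argument("--batch-size", type=int, default=32,
                        help="Batch size for training.")
    parser.add_argument("--epochs", type=int, default=1,
                        help="Number of training epochs.")
    parser.add_argument("--model-name", type=str, default="prajjwal1/bert-tiny",
                        help="Model to fine-tune.")
    # The two filters are mutually exclusive, matching the note above.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--language-filter", type=str, default=None,
                       help="Keep only this language in the dataset.")
    group.add_argument("--categories", nargs="+", default=None,
                       help="Keep only these entity categories.")
    return parser


args = build_parser().parse_args(["--categories", "PER", "ORG", "LOC"])
print(args.categories)  # ['PER', 'ORG', 'LOC']
```

With `add_mutually_exclusive_group`, passing both filters makes `argparse` exit with a usage error, which enforces the note above at parse time.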
```shell
git clone https://github.com/sandstromviktor/MultiNERD.git
cd MultiNERD
```
- Set Up Environment (Linux)

```shell
python3 -m venv venv
source ./venv/bin/activate
pip install -r requirements.txt
```
Or, if you want to run a Docker container (GPU support may not work):

```shell
docker build -t multinerd .
docker run --rm -it -v $PWD/models:/home/code/models multinerd bash
```

This opens a shell in the container where you can run the same commands (see below) as you would in your venv. The `-v` flag mounts the `models` folder into the container so that your trained models persist on your drive.
The training script preprocesses the data and then uses the 🤗 `Trainer` to fine-tune the model.
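A central part of that preprocessing is aligning word-level NER tags with the subword tokens the tokenizer produces. The helper below is a minimal sketch of that alignment, not code from this repo; `-100` is the label index that the 🤗 `Trainer`'s cross-entropy loss ignores.

```python
def align_labels(word_ids, word_labels, label_all_subwords=False):
    """Map word-level labels onto subword tokens.

    word_ids: per-token word indices as returned by a fast tokenizer's
              word_ids(); None marks special tokens.
    word_labels: one label id per original word.
    Returns one label id per token, with -100 where the loss should ignore it.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)  # special tokens ([CLS], [SEP], padding)
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword keeps the label
        else:
            # Continuation subword: either repeat the label or mask it out.
            aligned.append(word_labels[word_id] if label_all_subwords else -100)
        previous = word_id
    return aligned


# Two words, the second split into three subwords; label ids 0 (O) and 5 (B-LOC)
print(align_labels([None, 0, 1, 1, 1, None], [0, 5]))
# [-100, 0, 5, -100, -100, -100]
```

Masking continuation subwords with `-100` means each word is scored exactly once during both training and evaluation.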
System A is a language model trained to classify all entity types, using only the English subset of the data. Run the following command to fine-tune a pre-trained language model on the English subset:

```shell
python main.py --model-name prajjwal1/bert-tiny --language-filter en
```

Specify the model of your choice and set parameters as desired.
System B is a language model trained to classify five entity types, using all languages in the dataset. The assignment specifies the categories `PER ORG LOC DIS ANIM`. Run the following command to fine-tune a pre-trained language model on these categories:

```shell
python main.py --model-name prajjwal1/bert-tiny --categories PER ORG LOC DIS ANIM
```
The model is evaluated automatically in the training script every 1000 steps on the validation set. After training is completed, the model is evaluated on the test set. This computes micro-F1, recall, precision, and accuracy.
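Micro-averaging pools true positives, false positives, and false negatives across all categories before computing precision and recall, so frequent categories dominate the overall score. A small illustration with made-up per-category counts (not the actual evaluation code):

```python
def micro_prf(counts):
    """counts: {category: (tp, fp, fn)}. Returns micro precision, recall, F1."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy counts: the frequent PER category dominates the micro average.
p, r, f1 = micro_prf({"PER": (90, 10, 5), "ANIM": (5, 5, 10)})
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.864 0.864 0.864
```

This pooling explains why the overall scores in the tables below can sit far above the scores of small categories such as BIO.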
Two experiments were conducted, each using the `bert-base-multilingual-cased` model. Each model was trained for 1 epoch, using all default hyperparameters (see `train.py` for exact values), and then evaluated on the test set.
Models were trained on one NVIDIA A100 GPU (at NSC Berzelius).
```shell
python3 main.py --language-filter en --model-name bert-base-multilingual-cased
```
| Category | F1 | Precision | Recall | Support |
|---|---|---|---|---|
| ANIM | 0.598 | 0.816 | 0.472 | 32390 |
| BIO | 0.333 | 0.592 | 0.232 | 250 |
| CEL | 0.849 | 0.801 | 0.902 | 33900 |
| DIS | 0.563 | 0.807 | 0.433 | 30676 |
| EVE | 0.463 | 0.695 | 0.347 | 1406 |
| FOOD | 0.949 | 0.948 | 0.950 | 6373068 |
| INST | 0.889 | 0.865 | 0.914 | 145830 |
| LOC | 0.471 | 0.503 | 0.444 | 28342 |
| MEDIA | 0.590 | 0.592 | 0.589 | 11838 |
| MYTH | 0.381 | 0.413 | 0.353 | 11032 |
| ORG | 0.493 | 0.470 | 0.519 | 5994 |
| PER | 0.927 | 0.909 | 0.945 | 169556 |
| PLANT | 0.807 | 0.747 | 0.878 | 6484 |
| TIME | 0.291 | 0.437 | 0.218 | 4106 |
| VEHI | 0.693 | 0.754 | 0.641 | 11472 |
| All | 0.940 | 0.940 | 0.939 | 6866344 |
```shell
python3 main.py --categories PER ORG LOC DIS ANIM --model-name bert-base-multilingual-cased
```
| Category | F1 | Precision | Recall | Support |
|---|---|---|---|---|
| ANIM | 0.801 | 0.811 | 0.792 | 28346 |
| DIS | 0.964 | 0.972 | 0.956 | 138188 |
| LOC | 0.983 | 0.982 | 0.985 | 169616 |
| ORG | 0.953 | 0.948 | 0.957 | 33982 |
| PER | 0.983 | 0.979 | 0.987 | 121334 |
| All | 0.966 | 0.964 | 0.967 | 491466 |
System A achieved an overall F1 score of 0.940, performing strongly on categories such as FOOD, PER, and INST, while struggling with BIO and TIME. Restricting training to the English subset did not hurt overall performance, suggesting the model adapts well, likely because it was pre-trained on multiple languages.
System B outperformed System A with an F1 score of 0.966 and was more consistent across categories, especially LOC and PER. Its multilingual training proved advantageous, potentially due to the larger training set.
Both systems struggled with categories such as BIO and TIME, indicating room for improvement; an in-depth error analysis could guide targeted enhancements. The impact of data size was evident, with System B's larger multilingual dataset contributing to its superior performance.