Code for SemEval Task 4, Subtask 1
- python == 3.8.18
- pip3
- pipenv or conda (optional but strongly recommended) - for environment isolation
First, clone the sklearn-hierarchical-classification repository:
git clone https://github.com/lfmatosm/sklearn-hierarchical-classification
pip install -r requirements.txt
pipenv shell
pipenv install
If pipenv reports problems installing sklearn-hierarchical-classification, you can ignore them.
After the previous steps, use pip to install the local repository:
pip install ../sklearn-hierarchical-classification # point to the cloned repository path
For a working Google Colab example, please refer to this notebook.
For a quickstart using a shell script, please refer to this shell script.
For a multilabel classification example, please refer to this notebook.
python -m src.fine_tuning \
--model xlm-roberta-base \
--dataset ptc2019 \
--fine_tuned_name xlm-roberta-base-ptc2019 \
--save_model
python -m src.fine_tuning_with_class \
--model jhu-clsp/bernice \
--dataset semeval2024_dev_labeled \
--fine_tuned_name jhu-clsp-bernice-semeval2024-dev-labeled-classifier \
--batch_size 8 \
--save_strategy epoch \
--lr 3.9e-5 \
--epochs 5 \
--save_model
Using the [CLS] token:
python -m src.feature_extraction \
--model xlm-roberta-base \
--dataset semeval2024 \
--extraction_method cls
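As a rough illustration of what [CLS] extraction usually means, the sketch below takes the vector at position 0 of a model's last-layer output. It is a minimal sketch in plain Python, not the code in src.feature_extraction; the function name `cls_embedding` and the toy numbers are assumptions made for illustration. For XLM-RoBERTa the first-position token is `<s>`, which plays the same role as [CLS].

```python
from typing import List


def cls_embedding(token_vectors: List[List[float]]) -> List[float]:
    """Return the vector at position 0, where the [CLS]-style token sits."""
    if not token_vectors:
        raise ValueError("expected at least one token vector")
    return list(token_vectors[0])


# Toy last-layer output for a 3-token input with hidden size 2 (made-up values).
last_hidden_state = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
features = cls_embedding(last_hidden_state)
print(features)
```

With a real model the same idea applies to one row of the last hidden state per input text.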
Or, if you want to use specific hidden layers:
python -m src.feature_extraction \
--model xlm-roberta-base \
--dataset semeval2024 \
--extraction_method layers \
--layers 4 5 6 7 \
--agg_method "avg"
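The following sketch shows one plausible reading of `--layers 4 5 6 7` with `--agg_method "avg"`: take one vector per selected layer and average them element-wise. This is an assumption about the behaviour, not a description of the repository's implementation; `average_layers` and the toy data are hypothetical, and how each layer is reduced to a single vector (for example, by taking the first token) is left out.

```python
from typing import List, Sequence

Vector = List[float]


def average_layers(hidden_states: Sequence[Vector], layers: Sequence[int]) -> Vector:
    """Element-wise mean of the vectors taken from the selected layers."""
    if not layers:
        raise ValueError("at least one layer index is required")
    selected = [hidden_states[i] for i in layers]
    dim = len(selected[0])
    return [sum(vec[d] for vec in selected) / len(selected) for d in range(dim)]


# Toy example: one 2-dimensional vector per layer, for layers 0..7 (made-up values).
hidden_states = [[float(i), float(2 * i)] for i in range(8)]
features = average_layers(hidden_states, layers=[4, 5, 6, 7])
print(features)
```

Averaging keeps the feature size equal to the hidden size, whereas concatenating the layers would multiply it by the number of layers.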
Or if you want to use sentence embeddings:
python -m src.feature_extraction \
--model "sentence-transformers/stsb-xlm-r-multilingual" \
--dataset semeval2024 \
--extraction_method sentence
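Sentence-embedding models typically pool token vectors into a single vector, commonly by averaging over the non-padding tokens. The sketch below illustrates that masked mean pooling in plain Python. It is an assumption about what the `sentence` method produces, not the repository's code, and `mean_pool` is a hypothetical helper.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose attention-mask entry is 1 (non-padding)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep == 1]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Two real tokens followed by one padding token (made-up values).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)
```

The padding vector is ignored, so padding a batch to a common length does not change the resulting embedding.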
You can also specify a folder for saving the features:
python -m src.feature_extraction \
--model "jhu-clsp/bernice" \
--dataset semeval2024 \
--extraction_method cls \
--output_dir test_folder/
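To make the effect of an output folder concrete, here is a minimal sketch that writes a feature file into a directory, creating it if needed. The JSON layout (`features` and `labels` keys) and the helper `save_features` are hypothetical; the actual schema written by src.feature_extraction is not documented here. The file name `train_features.json` is taken from the classification commands in this README.

```python
import json
import tempfile
from pathlib import Path


def save_features(features: dict, output_dir, filename: str = "train_features.json") -> Path:
    """Write `features` as JSON into `output_dir`, creating the folder if missing."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / filename
    with path.open("w", encoding="utf-8") as fh:
        json.dump(features, fh)
    return path


# Hypothetical payload: one feature vector and one label list per example.
payload = {"features": [[0.1, 0.2], [0.3, 0.4]], "labels": [["label_a"], []]}

with tempfile.TemporaryDirectory() as tmp:
    saved = save_features(payload, Path(tmp) / "test_folder")
    saved_name = saved.name
    loaded = json.loads(saved.read_text(encoding="utf-8"))
print(saved_name, loaded == payload)
```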
Using a Binary Relevance classifier. Note that it accepts a few optional arguments that control oversampling:
python -m src.classification \
--classifier "LogisticRegression" \
--dataset semeval2024 \
--train_features "./feature_extraction/train_features.json" \
--test_features "./feature_extraction/test_features.json" \
--dev_features "./feature_extraction/dev_features.json" \
--seed 1 \
--oversampler Combination \
--sample_strategy 1
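Binary Relevance splits a multilabel problem into one independent binary problem per label, and oversampling can then be applied to each binary problem separately. The sketch below shows both ideas in plain Python. It is not the repository's implementation: the helper names and label names are hypothetical, and plain random oversampling is swapped in for the `Combination` oversampler, whose behaviour is not described here. Reading `--sample_strategy 1` as a target minority-to-majority ratio of 1 is also an assumption.

```python
import random
from typing import Dict, List, Sequence, Set


def binary_relevance_targets(label_sets: Sequence[Set[str]],
                             labels: Sequence[str]) -> Dict[str, List[int]]:
    """Build one 0/1 target vector per label from per-example label sets."""
    return {label: [1 if label in s else 0 for s in label_sets] for label in labels}


def oversample_indices(y: Sequence[int], ratio: float = 1.0, seed: int = 1) -> List[int]:
    """Return example indices, repeating minority examples at random until
    minority/majority reaches `ratio` (plain random oversampling)."""
    rng = random.Random(seed)
    pos = [i for i, v in enumerate(y) if v == 1]
    neg = [i for i, v in enumerate(y) if v == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    indices = list(range(len(y)))
    if not minority:
        return indices  # nothing to duplicate
    extra = max(0, int(ratio * len(majority)) - len(minority))
    indices.extend(rng.choice(minority) for _ in range(extra))
    return indices


# Hypothetical label names; four examples, the third and fourth have no labels.
label_sets = [{"label_a"}, {"label_a", "label_b"}, set(), set()]
targets = binary_relevance_targets(label_sets, ["label_a", "label_b"])
resampled = oversample_indices(targets["label_b"], ratio=1.0, seed=1)
print(targets, resampled)
```

Each entry of `targets` can be fed to its own binary classifier, such as a logistic regression, and the per-label predictions are combined into the final label set.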
Using a multilabel feedforward classifier:
python -m src.classification \
--classifier "MLP" \
--dataset semeval2024 \
--train_features "./feature_extraction/train_features.json" \
--test_features "./feature_extraction/test_features.json" \
--dev_features "./feature_extraction/dev_features.json" \
--seed 1
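A multilabel feedforward classifier commonly ends with one output per label, passes each through a sigmoid, and keeps the labels whose probability clears a threshold. The sketch below illustrates that decision step only. The 0.5 threshold, the helper `predict_labels`, and the label names are assumptions, not values taken from src.classification.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits: Sequence[float], labels: Sequence[str],
                   threshold: float = 0.5) -> List[str]:
    """Keep every label whose sigmoid probability is at least `threshold`."""
    return [label for logit, label in zip(logits, labels) if sigmoid(logit) >= threshold]


# Made-up logits for three hypothetical labels.
predicted = predict_labels([2.0, -1.0, 0.3], ["label_a", "label_b", "label_c"])
print(predicted)
```

Because each label is thresholded independently, an example can receive several labels or none, unlike a softmax classifier that always picks exactly one.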