This repository contains code for our paper to appear in ICHI 2023: End-to-End Models for Chemical–Protein Interaction Extraction: Better Tokenization and Span-Based Pipeline Strategies.
pip install -r requirements.txt
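If you prefer an isolated setup, a minimal sketch before running the install above (the environment name is arbitrary):

# optional: create and activate a dedicated virtual environment first
python -m venv chemprot-env
source chemprot-env/bin/activate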
The full original dataset is available at this link: ChemProt dataset of BioCreative VI. However, for fair comparison, we have made preprocessed data suitable for span-based methods available in this folder of this repository: chemprot_data/processed_data/json. To clarify, the original training and validation datasets were combined and re-split into 80:20 partitions for our modeling. This is the split that is made available in tokenized format in this repository's data folder.
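To get a feel for the processed data, you can pretty-print the first training document; this sketch assumes one JSON document per line, as in PURE's input format:

# pretty-print the first training document (assumes one JSON document per line)
head -n 1 chemprot_data/processed_data/json/train.json | python -m json.tool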
You can use the scripts in preprocess to preprocess raw data downloaded from the ChemProt dataset of BioCreative VI. Note that after preprocessing there are 1,020 training and 612 validation instances. However, to increase the amount of data used for training the model, we combined the provided training and validation instances (1,020 + 612 = 1,632) and re-split them into the 1,305 training and 327 validation instances available in chemprot_data/processed_data/json. This is how other teams who addressed this task have typically handled the data; since we do not touch the test dataset in any of these steps, our evaluation does not involve any leakage from test instances.
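To verify the split sizes yourself, you can count lines in the processed files; this assumes one document per line, and the dev file name is an assumption following the same pattern as train.json:

# count documents per split (assumes one JSON document per line; dev.json name is assumed)
wc -l chemprot_data/processed_data/json/train.json
wc -l chemprot_data/processed_data/json/dev.json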
The code for this project is based on the span-based pipeline model Princeton University Relation Extraction (PURE) by Zhong and Chen (NAACL 2021). Please see the original PURE repository for further details on the different arguments. The directories PURE_A to PURE_E in our repo correspond to the models with different relation representations in our paper. Below we show an example of running the model with relation representation A.
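The commands below reference a shell variable $seed; set it to the random seed you want before running, e.g.:

seed=42   # any integer; output directories are created per seed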
python 'PURE_A/run_entity.py' \
--do_train --do_eval \
--num_epoch=50 --print_loss_step=50 \
--learning_rate=1e-5 --task_learning_rate=5e-4 \
--train_batch_size=16 \
--eval_batch_size=16 \
--max_span_length=16 \
--context_window=300 \
--task chemprot_5 \
--seed=$seed \
--data_dir "chemprot_data/processed_data/json" \
--model microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract \
--output_dir "chemprot_models/chemprot_a/ent_$seed"
python 'PURE_A/run_relation.py' \
--task chemprot_5 \
--do_train --train_file "chemprot_data/processed_data/json/train.json" \
--do_eval \
--model microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract \
--do_lower_case \
--train_batch_size=16 \
--eval_batch_size=16 \
--learning_rate=2e-5 \
--num_train_epochs=10 \
--context_window=100 \
--max_seq_length=250 \
--seed=$seed \
--entity_output_dir "chemprot_models/chemprot_a/ent_$seed" \
--output_dir "chemprot_models/chemprot_a/rel_$seed"
python 'PURE_A/run_entity.py' \
--do_eval --eval_test \
--max_span_length=16 \
--context_window=300 \
--task chemprot_5 \
--data_dir 'chemprot_data/processed_data/json' \
--model microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract \
--output_dir "chemprot_models/chemprot_a/ent_$seed"
python 'PURE_A/run_relation.py' \
--task chemprot_5 \
--do_eval --eval_test \
--model microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract \
--do_lower_case \
--context_window=100 \
--max_seq_length=250 \
--entity_output_dir "chemprot_models/chemprot_a/ent_$seed" \
--output_dir "chemprot_models/chemprot_a/rel_$seed/"
python "PURE_A/run_eval.py" --prediction_file "chemprot_models/chemprot_a/rel_$seed/"/predictions.json