Objective: Run Task-1 Subtask-A of AStitchInLanguageModels on the MAGPIE dataset.
The original dataset is available as MAGPIE_filtered_split_{*}.jsonl.
- The original source code that runs both training and evaluation was obtained from here. The local copy of this code is run_glue_f1_macro.py.
Notes on Reproducibility:
- The paths used in the notebooks are relative. Run every notebook from the directory it resides in.
- It is better to use an even number of GPUs (2 is slow, 4 is better) for training & evaluation; in particular, the batch size should be divisible by the number of GPUs.
- When running the experiments on JarvisLabs.ai, follow the below steps:
a. Uninstall the existing version of PyTorch from the instance (it should be PyTorch 1.13)
b. Install PyTorch 1.12.0 for the correct CUDA version, using the below command (more details can be found here):
   pip install torch==1.12.0+cu116 torchvision==0.13.0+cu116 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu116
The code for adding single-token-representations is based on:
Variations:
The MAGPIE dataset contains an idiom column, but the sentences can contain a different surface form of those idioms (due to discontiguity & variations of MWEs). Approximately 50% of the sentences contain a form different from the one given in the idiom column. Thus, two different ways of adding single-token-representations are used (a minimal sketch of both is shown after the list below):
- Option-1: Just convert the values in the idiom column to tokens, irrespective of how they are used in the sentence. In other words, this approach makes the LM learn only those tokens that have an exact match in the sentence.
- Option-2: Use the offsets column and extract the actual MWE from the sentence. This captures all possible MWE forms in the data, but the number of unique tokens is very high.
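For illustration, here is a minimal Python sketch of how the two options could derive a single-token representation per instance. The field names (idiom, sentence, offsets), the [start, end] offset format, and the ID_..._ID token pattern are assumptions for the sketch, not necessarily what the actual code uses.

```python
# Hypothetical sketch: deriving the single-token string for one MAGPIE record.
# Field names, offset format and the ID_..._ID pattern are assumptions.

def single_token_option1(record: dict) -> str:
    """Option-1: build the token directly from the idiom column."""
    # e.g. "spill the beans" -> "ID_spill_the_beans_ID", regardless of the surface form
    return "ID_" + record["idiom"].replace(" ", "_") + "_ID"

def single_token_option2(record: dict) -> str:
    """Option-2: build the token from the surface form located by the offsets column."""
    words = [record["sentence"][start:end] for start, end in record["offsets"]]
    return "ID_" + "_".join(w.lower() for w in words) + "_ID"

record = {"idiom": "spill the beans",
          "sentence": "He finally spilled the beans about the surprise party.",
          "offsets": [[11, 18], [19, 22], [23, 28]]}
print(single_token_option1(record))  # ID_spill_the_beans_ID
print(single_token_option2(record))  # ID_spilled_the_beans_ID
```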
In both of the experiments below (exp3A and exp3B), the MWEs are replaced by their corresponding single tokens in the training data. The single-token-representations experiment has the following variations:
- exp3A: The single-token-representations contain randomly initialized embeddings.
  - 1.1 exp3A_1: Uses the option-1 method of adding single-token-representations, as described above.
  - 1.2 exp3A_2: Uses the option-2 method of adding single-token-representations, as described above.
- exp3B: The model with single-token-representations is first trained (fine-tuned) with a Masked-LM objective on the Common Crawl News dataset (as described in the AStitchInLanguageModels paper). The steps followed here are taken as reference.
Steps:
i. Add the new tokens to the vocabulary of the model. This leads to two model variations, one using option-1 and one using option-2 (a minimal sketch of this step is shown below).
ii. Train (fine-tune) the model with a Masked-LM objective on the pre-processed CC-News corpus.
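A minimal sketch of step i, assuming the Hugging Face transformers API and a hypothetical list of new tokens (the actual implementation lives in the experiment notebooks/scripts):

```python
# Sketch of step i: add the new single-token representations to the vocabulary
# and resize the embedding matrix (new rows are randomly initialized).
# Model name, token list and output path are placeholders.
from transformers import AutoTokenizer, AutoModelForMaskedLM

new_tokens = ["ID_spill_the_beans_ID", "ID_piece_of_cake_ID"]  # hypothetical tokens

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

num_added = tokenizer.add_tokens(new_tokens)       # extend the vocabulary
model.resize_token_embeddings(len(tokenizer))      # add randomly initialized embedding rows

tokenizer.save_pretrained("bert-base-uncased-with-idiom-tokens")
model.save_pretrained("bert-base-uncased-with-idiom-tokens")
```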
CC News Data Preparation:
The pre-processed CC-News data for this purpose had to be generated with slight modifications. The original steps are described here. The modified preprocessing scripts are available here.
- First, download and preprocess the CC News corpus using the experiments/exp3B_1/process_cc_hpc.sh script.
- Then, prepare the training data for pretraining with single tokens using the experiments/exp3B_1/create_pretrain_data_hpc.sh script.
- Finally, split the all_replace_data.txt file into train & eval sets using the experiments/exp3B_1/split_pretrain_data_hpc.sh script.
Pre-Training: Train the model (with the tokens added in step i.) with a Masked-LM objective on this data; a minimal sketch of this step is shown below. The original pretraining script is referred from here. The customized training scripts are available at exp3B_1/train_and_save_hpc.
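A minimal sketch of what this MLM step looks like with the Hugging Face Trainer, assuming one sentence per line in the train/eval splits produced above (file names, hyperparameters and output paths are placeholders; the authoritative scripts are under exp3B_1/train_and_save_hpc):

```python
# Sketch of step ii: Masked-LM fine-tuning on the pre-processed CC-News text.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_dir = "bert-base-uncased-with-idiom-tokens"   # model saved in step i
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMaskedLM.from_pretrained(model_dir)

# train.txt / eval.txt are the splits of all_replace_data.txt (names are placeholders)
data = load_dataset("text", data_files={"train": "train.txt", "validation": "eval.txt"})
tokenized = data.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
                     batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="exp3B_mlm", num_train_epochs=1,
                           per_device_train_batch_size=16),
    data_collator=collator,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
)
trainer.train()
print(trainer.evaluate())                            # MLM loss on the held-out split
trainer.save_model("exp3B_mlm")
tokenizer.save_pretrained("exp3B_mlm")
```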
iii. Use this fine-tuned model with a SequenceClassification objective on the MAGPIE dataset:
- First, convert the MaskedLM model to a SequenceClassification model using exp3B_1/MLM_to_SeqClass_model_converter.ipynb (a minimal sketch of this conversion is shown after these steps).
- Then, fine-tune on the MAGPIE dataset as done in the previous experiments (that is, using exp3B_1/hpc.sh).
iv. Follow these steps for option-2 as well, using the idioms of option-2. This leads to two experiments: exp3B_1 and exp3B_2.
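A minimal sketch of that conversion, assuming a binary label set and hypothetical checkpoint paths; the actual conversion is done in exp3B_1/MLM_to_SeqClass_model_converter.ipynb:

```python
# Sketch: load the MLM checkpoint with a SequenceClassification head.
# The encoder and the extended embeddings are reused; the classification
# head is randomly initialized and trained later on MAGPIE.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

mlm_dir = "exp3B_mlm"                                # MLM checkpoint from step ii (placeholder)
tokenizer = AutoTokenizer.from_pretrained(mlm_dir)
model = AutoModelForSequenceClassification.from_pretrained(mlm_dir, num_labels=2)

model.save_pretrained("exp3B_seqclass")
tokenizer.save_pretrained("exp3B_seqclass")
```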
Experiment | Code | Single Token Rep | Dataset | Model | Context | Status |
---|---|---|---|---|---|---|
exp0 | exp0 | No | Zero-shot | BERT base (cased) | No Context | Done (3GPUs) |
exp1 | exp1 | No | Zero-shot | XLNet base (cased) | No Context | Done (4GPUs) |
exp2 | exp2 | No | Zero-shot | BERT base (cased) | All Context | Done (4GPUs) |
exp3A_1 | exp3A_1 | Yes | Zero-shot | bert-base-uncased | No Context | Done (RTX5000 x 1) |
exp3A_2 | exp3A_2 | Yes | Zero-shot | BERT base (cased) | No Context | Done (4GPUs) |
exp3B_1 | exp3B_1 | Yes | Zero-shot | bert-base-uncased | No Context | Done (RTX5000 x 1) |
exp3B_2 | exp3B_2 | Yes | Zero-shot | ToBeDecided | ToBeDecided | TODO |
exp4 | exp4 | ToBeDecided | One-shot | ToBeDecided | ToBeDecided | TODO |
exp5 | exp5 | ToBeDecided | Few-shot | ToBeDecided | ToBeDecided | TODO |
> *exp2 and onwards should have used the XLNet architecture; BERT was used because it was faster.
TODO:
- Conduct single-token-representations experiment with XLNet base model.
- The AStitchInLanguageModels paper also runs an idiom-include/exclude experiment in Task-1. Try that as well, if required.
Experiment | Dev Accuracy | Dev F1 | Test Accuracy | Test F1 |
---|---|---|---|---|
exp0 | 85.16 | 83.00 | 0.0 | 0.0 |
exp1 | 87.60 | 85.38 | 0.0 | 0.0 |
exp2 | 84.91 | 81.50 | 0.0 | 0.0 |
*exp3A_1 | 79.23 | 71.42 | 78.33 | 71.57 |
exp3A_2 | 80.39 | 74.21 | 0.0 | 0.0 |
exp3B_1(deprecated) | 85.14 | 79.29 | 0.0 | 0.0 |
**exp3B_1 | 82.83 | 75.66 | 81.82 | 76.55 |
- *exp3A_1 metrics are from the latest run on (RTX5000 x 1)
- **exp3B_1 metrics are from the latest run on (RTX5000 x 1)
Approximate Training (Wallclock) time per experiment:
- BERT base-cased (3 GPUs): ~1.5 hours
- BERT base-cased (4 GPUs): ~1.2 hours
- XLNet base-cased (4 GPUs): ~1.76 hours
- Pretraining BERT base-uncased on the MLM task (5 GPUs): ~23 hours
- With (RTX5000 x 1) GPU: ~1 hour 20 mins
For the error analysis and to study the idiom principle, the MAGPIE PIEs are grouped into different lists based on their characteristics.
The characteristics are observed in the MAGPIE corpus as well as in the pre-processed CommonCrawl News corpus.
The implementation of grouping of PIEs is available at PIE_segregation_util.ipynb.
The classification reports (both overall and segregated) are generated for exp3A_1 and exp3B_1 using the script produce_test_results.py.
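A minimal sketch of how such per-group reports could be produced (column names, file names and the group lists are placeholders; produce_test_results.py is the authoritative implementation):

```python
# Sketch: overall and per-PIE-group classification reports on the test predictions.
import pandas as pd
from sklearn.metrics import classification_report

preds = pd.read_csv("test_predictions.csv")   # hypothetical: columns idiom, label, prediction
groups = {                                    # hypothetical PIE lists from PIE_segregation_util.ipynb
    "frequent_pies": {"spill the beans"},
    "rare_pies": {"once in a blue moon"},
}

print(classification_report(preds["label"], preds["prediction"]))   # overall report
for name, pie_set in groups.items():
    subset = preds[preds["idiom"].isin(pie_set)]
    print(name)
    print(classification_report(subset["label"], subset["prediction"]))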
The statistical significance test is done using the script exp_helpers/statistical_significance_test.ipynb.
Wilcoxon signed-rank test is used to test the null hypothesis that two related paired samples come from the same distribution.
References:
Null Hypothesis: The two samples (i.e., the predicted probabilities of the two experiments) come from the same distribution.
Experiments considered: exp3A_1, exp3B_1 and bt2
Sample Size: 4840
1st Exp | 2nd Exp | W statistic | p-value | Conclusion |
---|---|---|---|---|
exp3A_1 | exp3B_1 | 5195121.0 | 9.433284e-12 | Reject Null Hypothesis |
exp3A_1 | bt2 | 3002096.0 | 1.232316e-189 | Reject Null Hypothesis |
exp3B_1 | bt2 | 3878590.0 | 4.04974e-92 | Reject Null Hypothesis |
Sample Size: 300 (Randomly selected 300 PIEs from the test set, 150 from each class)
1st Exp | 2nd Exp | W statistic | p-value | Conclusion |
---|---|---|---|---|
exp3A_1 | exp3B_1 | 17981.5 | 0.0022545 | Reject Null Hypothesis |
exp3A_1 | bt2 | 12223.5 | 5.8385e-12 | Reject Null Hypothesis |
exp3B_1 | bt2 | 17471.5 | 0.0006899 | Reject Null Hypothesis |
NOTES:
- A significance level of 0.01 is used for all the tests.
- Repetition of experiments doesn't produce much variation in the predicted labels.
- The Wilcoxon signed-rank test is very sensitive: even a small difference in the predicted probabilities can be judged as coming from different distributions, as indicated by the extremely small p-values in the tables above.
The results below are from repeated trials of the same experiment on the full test set (4840 instances), which differ only slightly in the predicted labels:
Two trials of the same Exp | W statistic | p-value |
---|---|---|
exp3A_1 | 5831043.0 | 0.75282 |
exp3B_1 | 5845520.5 | 0.88596 |
bt2 | 5855191.0 | 0.97709 |
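For reference, a minimal sketch of the paired Wilcoxon test with SciPy, on placeholder probability arrays (the actual analysis is in exp_helpers/statistical_significance_test.ipynb):

```python
# Sketch: paired Wilcoxon signed-rank test between two experiments' predicted probabilities.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
probs_exp_a = rng.random(4840)                          # predicted probabilities of experiment A
probs_exp_b = probs_exp_a + rng.normal(0, 0.05, 4840)   # experiment B on the same test instances

stat, p_value = wilcoxon(probs_exp_a, probs_exp_b)
print(f"W statistic: {stat:.1f}, p-value: {p_value:.3e}")
if p_value < 0.01:                                       # significance level used in this report
    print("Reject the null hypothesis: the paired samples differ.")
else:
    print("Fail to reject the null hypothesis.")
```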
[1] Hessel Haagsma, Johan Bos, and Malvina Nissim. 2020. MAGPIE: A Large Corpus of Potentially Idiomatic Expressions. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 279–287, Marseille, France. European Language Resources Association.
[2] Harish Tayyar Madabushi, Edward Gow-Smith, Carolina Scarton, and Aline Villavicencio. 2021. AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3464–3477, Punta Cana, Dominican Republic. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-emnlp.294.
TODO