Objective: Run Task-1 Subtask-A of AStitchInLanguageModels on the MAGPIE dataset.
The original dataset is available as MAGPIE_filtered_split_{*}.jsonl.
- The original source code that runs both training and evaluation was obtained from here. The local copy of this code is run_glue_f1_macro.py.
Notes on Reproducibility:
- The paths used in the notebooks are relative. Run every notebook from the directory it resides in.
- It is better to use an even number of GPUs (2 is slow, 4 is better) for training & evaluation; in particular, the batch size should be divisible by the number of GPUs.
- When running the experiments on JarvisLabs.ai, follow the below steps:
a. Uninstall the existing version of PyTorch from the instance (it should be PyTorch 1.13)
b. Install PyTorch 1.12.0 for the correct CUDA version, using the below command (more details can be found here):
   pip install torch==1.12.0+cu116 torchvision==0.13.0+cu116 torchaudio==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu116
The code for adding single-token-representations is based on:
Variations:
The MAGPIE dataset contains an idiom column, but the sentences can contain a different surface form of those idioms (due to discontiguity & variations of MWEs). Approximately 50% of the sentences contain a form different from the one given in the idiom column. Thus, two different ways of adding single-token-representations are used (a minimal sketch of both is shown after the list below):
- Option-1: Just convert the values in the idiom column to tokens, irrespective of how they are used in the sentence. In other words, this approach makes the LM learn only those tokens that have an exact match in the sentence.
- Option-2: Use the offsets column and extract the actual MWE from the sentence. This captures all possible MWE forms in the data, but the number of unique tokens is very high.
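For illustration, here is a minimal Python sketch of how the two options could derive a single-token representation per instance. The field names (idiom, sentence, offsets), the [start, end] offset format, and the ID_..._ID token pattern are assumptions for the sketch, not necessarily what the actual code uses.

```python
# Hypothetical sketch: deriving the single-token string for one MAGPIE record.
# Field names, offset format and the ID_..._ID pattern are assumptions.

def single_token_option1(record: dict) -> str:
    """Option-1: build the token directly from the idiom column."""
    # e.g. "spill the beans" -> "ID_spill_the_beans_ID", regardless of the surface form
    return "ID_" + record["idiom"].replace(" ", "_") + "_ID"

def single_token_option2(record: dict) -> str:
    """Option-2: build the token from the surface form located by the offsets column."""
    words = [record["sentence"][start:end] for start, end in record["offsets"]]
    return "ID_" + "_".join(w.lower() for w in words) + "_ID"

record = {"idiom": "spill the beans",
          "sentence": "He finally spilled the beans about the surprise party.",
          "offsets": [[11, 18], [19, 22], [23, 28]]}
print(single_token_option1(record))  # ID_spill_the_beans_ID
print(single_token_option2(record))  # ID_spilled_the_beans_ID
```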
In both of the experiments below (exp3A and exp3B), the MWEs are replaced by their corresponding single tokens in the training data. The single-token-representations experiment has the following variations:
- exp3A: The single-token-representations contain randomly initialized embeddings.
  - 1.1 exp3A_1: Uses the option-1 method of adding single-token-representations, as described above.
  - 1.2 exp3A_2: Uses the option-2 method of adding single-token-representations, as described above.
- exp3B: The model with single-token-representations is first trained (fine-tuned) with a Masked-LM objective on the Common Crawl News dataset (as described in the AStitchInLanguageModels paper). The steps followed here are taken as reference.
Steps:
i. Add the new tokens to the vocabulary of the model. This leads to two model variations, one using option-1 and one using option-2 (a minimal sketch of this step is shown below).
ii. Train (fine-tune) the model with a Masked-LM objective on the pre-processed CC-News corpus.
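A minimal sketch of step i, assuming the Hugging Face transformers API and a hypothetical list of new tokens (the actual implementation lives in the experiment notebooks/scripts):

```python
# Sketch of step i: add the new single-token representations to the vocabulary
# and resize the embedding matrix (new rows are randomly initialized).
# Model name, token list and output path are placeholders.
from transformers import AutoTokenizer, AutoModelForMaskedLM

new_tokens = ["ID_spill_the_beans_ID", "ID_piece_of_cake_ID"]  # hypothetical tokens

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

num_added = tokenizer.add_tokens(new_tokens)       # extend the vocabulary
model.resize_token_embeddings(len(tokenizer))      # add randomly initialized embedding rows

tokenizer.save_pretrained("bert-base-uncased-with-idiom-tokens")
model.save_pretrained("bert-base-uncased-with-idiom-tokens")
```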
CC News Data Preparation:
The pre-processed CC-News data for this purpose had to be generated with slight modifications. The original steps are described here. The modified preprocessing scripts are available here.
- First, download and preprocess the CC News corpus using the experiments/exp3B_1/process_cc_hpc.sh script.
- Then, prepare the training data for pretraining with single tokens using the experiments/exp3B_1/create_pretrain_data_hpc.sh script.
- Finally, split the all_replace_data.txt file into train & eval sets using the experiments/exp3B_1/split_pretrain_data_hpc.sh script.
Pre-Training: Train the model (with the tokens added in step i.) with a Masked-LM objective on this data; a minimal sketch of this step is shown below. The original pretraining script is referred from here. The customized training scripts are available at exp3B_1/train_and_save_hpc.
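A minimal sketch of what this MLM step looks like with the Hugging Face Trainer, assuming one sentence per line in the train/eval splits produced above (file names, hyperparameters and output paths are placeholders; the authoritative scripts are under exp3B_1/train_and_save_hpc):

```python
# Sketch of step ii: Masked-LM fine-tuning on the pre-processed CC-News text.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_dir = "bert-base-uncased-with-idiom-tokens"   # model saved in step i
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMaskedLM.from_pretrained(model_dir)

# train.txt / eval.txt are the splits of all_replace_data.txt (names are placeholders)
data = load_dataset("text", data_files={"train": "train.txt", "validation": "eval.txt"})
tokenized = data.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
                     batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="exp3B_mlm", num_train_epochs=1,
                           per_device_train_batch_size=16),
    data_collator=collator,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
)
trainer.train()
print(trainer.evaluate())                            # MLM loss on the held-out split
trainer.save_model("exp3B_mlm")
tokenizer.save_pretrained("exp3B_mlm")
```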
iii. Use this fine-tuned model with a SequenceClassification objective on the MAGPIE dataset:
- First, convert the MaskedLM model to a SequenceClassification model using exp3B_1/MLM_to_SeqClass_model_converter.ipynb (a minimal sketch of this conversion is shown after these steps).
- Then, fine-tune on the MAGPIE dataset as done in the previous experiments (that is, using exp3B_1/hpc.sh).
iv. Follow these steps for option-2 as well, using the idioms of option-2. This leads to two experiments: exp3B_1 and exp3B_2.
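A minimal sketch of that conversion, assuming a binary label set and hypothetical checkpoint paths; the actual conversion is done in exp3B_1/MLM_to_SeqClass_model_converter.ipynb:

```python
# Sketch: load the MLM checkpoint with a SequenceClassification head.
# The encoder and the extended embeddings are reused; the classification
# head is randomly initialized and trained later on MAGPIE.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

mlm_dir = "exp3B_mlm"                                # MLM checkpoint from step ii (placeholder)
tokenizer = AutoTokenizer.from_pretrained(mlm_dir)
model = AutoModelForSequenceClassification.from_pretrained(mlm_dir, num_labels=2)

model.save_pretrained("exp3B_seqclass")
tokenizer.save_pretrained("exp3B_seqclass")
```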
Experiment | Code | Single Token Rep | Dataset | Model | Context | Status |
---|---|---|---|---|---|---|
exp0 | exp0 | No | Zero-shot | BERT base (cased) | No Context | Done (3GPUs) |
exp1 | exp1 | No | Zero-shot | XLNet base (cased) | No Context | Done (4GPUs) |
exp2 | exp2 | No | Zero-shot | BERT base (cased) | All Context | Done (4GPUs) |
exp3A_1 | exp3A_1 | Yes | Zero-shot | bert-base-uncased | No Context | Done (RTX5000 x 1) |
exp3A_2 | exp3A_2 | Yes | Zero-shot | BERT base (cased) | No Context | Done (4GPUs) |
exp3B_1 | exp3B_1 | Yes | Zero-shot | bert-base-uncased | No Context | Done (RTX5000 x 1) |
exp3B_2 | exp3B_2 | Yes | Zero-shot | ToBeDecided | ToBeDecided | TODO |
exp4 | exp4 | ToBeDecided | One-shot | ToBeDecided | ToBeDecided | TODO |
exp5 | exp5 | ToBeDecided | Few-shot | ToBeDecided | ToBeDecided | TODO |
> *exp2 and onwards should have used the XLNet architecture; BERT was used because it was faster.
TODO:
- Conduct single-token-representations experiment with XLNet base model.
- The AStitchInLanguageModels paper also runs an idiom-include/exclude experiment in Task-1. Try that as well, if required.
Experiment | Dev Accuracy | Dev F1 | Test Accuracy | Test F1 |
---|---|---|---|---|
exp0 | 85.16 | 83.00 | 0.0 | 0.0 |
exp1 | 87.60 | 85.38 | 0.0 | 0.0 |
exp2 | 84.91 | 81.50 | 0.0 | 0.0 |
*exp3A_1 | 79.23 | 71.42 | 78.33 | 71.57 |
exp3A_2 | 80.39 | 74.21 | 0.0 | 0.0 |
exp3B_1(deprecated) | 85.14 | 79.29 | 0.0 | 0.0 |
**exp3B_1 | 82.83 | 75.66 | 81.82 | 76.55 |
- *exp3A_1 metrics are from the latest run on (RTX5000 x 1)
- **exp3B_1 metrics are from the latest run on (RTX5000 x 1)
Approximate Training (Wallclock) time per experiment:
- BERT base-cased (3 GPUs): ~1.5 hours
- BERT base-cased (4 GPUs): ~1.2 hours
- XLNet base-cased (4 GPUs): ~1.76 hours
- Pretraining BERT base-uncased on the MLM task (5 GPUs): ~23 hours
- With (RTX5000 x 1) GPU: ~1 hour 20 mins
For the error analysis and to study the idiom principle, the MAGPIE PIEs are grouped into different lists based on their characteristics.
The characteristics are observed in the MAGPIE corpus as well as in the pre-processed CommonCrawl News corpus.
The implementation of grouping of PIEs is available at PIE_segregation_util.ipynb.
The classification reports (both overall and segregated) are generated for exp3A_1 and exp3B_1 using the script produce_test_results.py.
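A minimal sketch of how such per-group reports could be produced (column names, file names and the group lists are placeholders; produce_test_results.py is the authoritative implementation):

```python
# Sketch: overall and per-PIE-group classification reports on the test predictions.
import pandas as pd
from sklearn.metrics import classification_report

preds = pd.read_csv("test_predictions.csv")   # hypothetical: columns idiom, label, prediction
groups = {                                    # hypothetical PIE lists from PIE_segregation_util.ipynb
    "frequent_pies": {"spill the beans"},
    "rare_pies": {"once in a blue moon"},
}

print(classification_report(preds["label"], preds["prediction"]))   # overall report
for name, pie_set in groups.items():
    subset = preds[preds["idiom"].isin(pie_set)]
    print(name)
    print(classification_report(subset["label"], subset["prediction"]))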
The statistical significance test is done using the script exp_helpers/statistical_significance_test.ipynb.
Wilcoxon signed-rank test is used to test the null hypothesis that two related paired samples come from the same distribution.
References:
Null Hypothesis: The two samples (i.e., the predicted probabilities of the two experiments) come from the same distribution.
Experiments considered: exp3A_1, exp3B_1 and bt2
Sample Size: 4840
1st Exp | 2nd Exp | W statistic | p-value | Conclusion |
---|---|---|---|---|
exp3A_1 | exp3B_1 | 5195121.0 | 9.433284e-12 | Reject Null Hypothesis |
exp3A_1 | bt2 | 3002096.0 | 1.232316e-189 | Reject Null Hypothesis |
exp3B_1 | bt2 | 3878590.0 | 4.04974e-92 | Reject Null Hypothesis |
Sample Size: 300 (Randomly selected 300 PIEs from the test set, 150 from each class)
1st Exp | 2nd Exp | W statistic | p-value | Conclusion |
---|---|---|---|---|
exp3A_1 | exp3B_1 | 17981.5 | 0.0022545 | Reject Null Hypothesis |
exp3A_1 | bt2 | 12223.5 | 5.8385e-12 | Reject Null Hypothesis |
exp3B_1 | bt2 | 17471.5 | 0.0006899 | Reject Null Hypothesis |
NOTES:
- A significance level of 0.01 is used for all the tests.
- Repetition of experiments doesn't produce much variation in the predicted labels.
- The Wilcoxon signed-rank test is very sensitive: even a small difference in the predicted probabilities can be judged as coming from different distributions, as indicated by the extremely small p-values in the tables above.
The results below are from repeated trials of the same experiment on the full test set (4840 instances), which differ only slightly in the predicted labels:
Two trials of the same Exp | W statistic | p-value |
---|---|---|
exp3A_1 | 5831043.0 | 0.75282 |
exp3B_1 | 5845520.5 | 0.88596 |
bt2 | 5855191.0 | 0.97709 |
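For reference, a minimal sketch of the paired Wilcoxon test with SciPy, on placeholder probability arrays (the actual analysis is in exp_helpers/statistical_significance_test.ipynb):

```python
# Sketch: paired Wilcoxon signed-rank test between two experiments' predicted probabilities.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
probs_exp_a = rng.random(4840)                          # predicted probabilities of experiment A
probs_exp_b = probs_exp_a + rng.normal(0, 0.05, 4840)   # experiment B on the same test instances

stat, p_value = wilcoxon(probs_exp_a, probs_exp_b)
print(f"W statistic: {stat:.1f}, p-value: {p_value:.3e}")
if p_value < 0.01:                                       # significance level used in this report
    print("Reject the null hypothesis: the paired samples differ.")
else:
    print("Fail to reject the null hypothesis.")
```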
[1] Hessel Haagsma, Johan Bos, and Malvina Nissim. 2020. MAGPIE: A Large Corpus of Potentially Idiomatic Expressions. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 279–287, Marseille, France. European Language Resources Association.
[2] Harish Tayyar Madabushi, Edward Gow-Smith, Carolina Scarton, and Aline Villavicencio. 2021. AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3464–3477, Punta Cana, Dominican Republic. Association for Computational Linguistics. doi: 10.18653/v1/2021.findings-emnlp.294.
TODO