Investigating the impact of changing the number of labelled training samples on performance and stability/variability of results in fine-tuning, prompting, in-context learning and instruction-tuning approaches
The code in this repository uses Python. The required dependencies are specified in `requirements.txt`. To install them, simply run `pip install -r requirements.txt`.
To run a specific experiment, follow these steps:
- Install the requirements.
- Choose the dataset to run the investigation on. Currently, we support the following options: "sst2", "mrpc", "cola", "boolq", "rte", "trec", "ag_news", "snips", "db_pedia". However, as we use HuggingFace, the set of datasets can easily be extended with other ones (the dataset classes in the `data.py` file need to be extended with the loading and processing of the new dataset; see the sketch after this list).
- Choose the training dataset size to run the investigation on.
- Choose number of runs for the investigation.
- Run the investigation using the following command (here with the SST-2 dataset, LLaMA-2 in-context learning, a subset of 1 000 training and test samples, and 100 repeated runs):
python main.py --factor=golden_model --mitigation_runs=100 --investigation_runs=1 --dataset=sst2 --experiment_name=dataset_size_change --experiment_type=icl --model=llama2 --configuration_name=num_samples_1000 --num_labelled=1000 --num_labelled_test=1000 --full_test=0
- The results from these runs will be saved to the folder specified by the `experiment_name`, `configuration_name`, `experiment_type`, `model`, `dataset` and `factor` arguments. The above command will save the results into the following path: `results/dataset_size_change/icl_llama2_base/num_samples_1000/sst2/golden_model`. After the experiments are run, this folder should contain 100 folders `mitigation_{idx}` with idx ranging 0-99, each containing 1 folder `investigation_0` with the results.
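
For illustration, a minimal sketch of what adding a new dataset could look like is given below. The actual base class and method names in `data.py` are repository-specific, so the class shown here (and the choice of `yelp_polarity` as the new dataset) is only a hypothetical example built directly on the HuggingFace `datasets` library:

```python
# Illustrative sketch only: the real class interface in data.py is repository-specific,
# so the class name, method names and return format below are assumptions.
from datasets import load_dataset


class YelpPolarityDataset:
    """Hypothetical wrapper for a new dataset, in the style of the classes in data.py."""

    def __init__(self, num_labelled=1000, seed=42):
        self.num_labelled = num_labelled
        self.seed = seed

    def load(self):
        # Load the raw data from the HuggingFace Hub.
        dataset = load_dataset("yelp_polarity")
        # Subsample the labelled training split to the requested size.
        train = dataset["train"].shuffle(seed=self.seed).select(range(self.num_labelled))
        test = dataset["test"]
        return train, test
```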
To allow for reproducibility and unbiased comparison, we also provide arguments to set the seeds that generate the configurations for the mitigation and investigation runs separately; default values are provided for both.
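
For illustration only, the separate seeds could be expanded into per-run configurations roughly as follows; the exact derivation used in `main.py` and the default seed values are not shown here, so this is just an assumed scheme:

```python
# Assumed scheme: derive one reproducible seed per run from a single base seed,
# so that mitigation and investigation runs can be regenerated independently.
import random


def make_run_seeds(base_seed, num_runs):
    """Expand a base seed into a reproducible list of per-run seeds."""
    rng = random.Random(base_seed)
    return [rng.randrange(2**31) for _ in range(num_runs)]


mitigation_seeds = make_run_seeds(base_seed=1, num_runs=100)   # one per mitigation run
investigation_seeds = make_run_seeds(base_seed=2, num_runs=1)  # one per investigation run
```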
To get the results from our paper, the investigation needs to be done for each model and dataset in the following list (using the default hyperparameters, the default mitigation and investigation seeds, and the settings described in the paper):
- Datasets: sst2, mrpc, boolq, trec, ag_news, snips, db_pedia
- Models:
- (experiment_type) finetuning: bert, roberta
- (experiment_type) prompting: flan-t5, llama2, chatgpt, mistral, zephyr
- (experiment_type) icl: flan-t5, llama2, chatgpt, mistral, zephyr
- (experiment_type) instruction_tuning_steps: flan-t5, mistral, zephyr
- The number of mitigation runs should be set to 20 for all models (except for chatgpt, where it should be just 10).
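
The full grid above could be launched with a loop along the following lines. This sketch reuses only the arguments shown in the example command earlier (keeping the same `configuration_name` and dataset sizes); any additional hyperparameters described in the paper still need to be passed explicitly:

```python
# Illustrative launcher for the experiment grid; it only reuses the flags shown
# in the example command above and keeps all other options at their defaults.
import subprocess

DATASETS = ["sst2", "mrpc", "boolq", "trec", "ag_news", "snips", "db_pedia"]
MODELS = {
    "finetuning": ["bert", "roberta"],
    "prompting": ["flan-t5", "llama2", "chatgpt", "mistral", "zephyr"],
    "icl": ["flan-t5", "llama2", "chatgpt", "mistral", "zephyr"],
    "instruction_tuning_steps": ["flan-t5", "mistral", "zephyr"],
}

for experiment_type, models in MODELS.items():
    for model in models:
        # chatgpt uses 10 mitigation runs, all other models use 20.
        mitigation_runs = 10 if model == "chatgpt" else 20
        for dataset in DATASETS:
            subprocess.run([
                "python", "main.py",
                "--factor=golden_model",
                f"--mitigation_runs={mitigation_runs}",
                "--investigation_runs=1",
                f"--dataset={dataset}",
                "--experiment_name=dataset_size_change",
                f"--experiment_type={experiment_type}",
                f"--model={model}",
                "--configuration_name=num_samples_1000",
                "--num_labelled=1000",
                "--num_labelled_test=1000",
                "--full_test=0",
            ], check=True)
```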
For evaluation purposes, we provide the Python script `visualise_dataset_size_change_results.py`, which visualises the results. For the script to run correctly, make sure all of its parameters are set according to the experiments (e.g., the dataset sizes).
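
If you prefer to inspect the raw numbers yourself, the per-run results can also be aggregated manually along these lines. The metric file name and key used below (`results.json`, `accuracy`) are assumptions and need to be adapted to whatever `main.py` actually writes into each `investigation_0` folder:

```python
# Sketch of manual aggregation across mitigation runs; the file name and JSON key
# are assumptions about the output format of main.py.
import json
import statistics
from pathlib import Path

results_dir = Path("results/dataset_size_change/icl_llama2_base/num_samples_1000/sst2/golden_model")

accuracies = []
for run_file in sorted(results_dir.glob("mitigation_*/investigation_0/results.json")):
    with run_file.open() as f:
        accuracies.append(json.load(f)["accuracy"])  # assumed metric key

print(f"runs: {len(accuracies)}")
print(f"mean accuracy: {statistics.mean(accuracies):.4f}")
print(f"std deviation: {statistics.stdev(accuracies):.4f}")
```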