The base version of this repo is a clone of Soft-Label Dataset Distillation and Text Dataset Distillation.
- Python 3
- NVIDIA GPU + CUDA
faiss==1.7.3
matplotlib==3.7.1
numpy==1.24.3
pandas==2.0.2
Pillow==9.5.0
PyYAML==5.4.1
scikit_learn==1.2.2
six==1.16.0
skimage==0.0
torch==1.13.1
torchtext==0.6.0
torchvision==0.14.1
tqdm==4.65.0
transformers==4.29.2
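These pinned packages can be installed with pip. A minimal sketch, assuming the list above is collected in a requirements.txt at the repo root (the virtual environment is optional and its name is illustrative):

```bash
# Create an isolated environment (optional) and install the pinned dependencies.
# Assumes the list above is saved as requirements.txt in the repo root.
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```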
The experiments in our project can be reproduced using the commands given in run.sh.
Example:
python main.py --mode distill_basic --dataset umsab --arch TextConvNet_BERT
--batch_size 1024 --distill_steps 1 --static_labels 0 --random_init_labels hard --textdata True --visualize ''
--distilled_images_per_class_per_step 1 --distill_epochs 5 --distill_lr 0.01 --decay_epochs 10 --log_interval 5
--epochs 25 --lr 0.01 --ntoken 10000 --ninp 768 --maxlen 75 --results_dir text_results/umsab_20by1_unkinit_repl1
--device_id 0 --phase train
This logs everything related to this experiment, including the distilled data, in text_results/.
Testing the distilled data generated by the above script (this trains the same model on the distilled data and also generates the nearest word embeddings):
python main.py --mode distill_basic --dataset umsab --arch TextConvNet_BERT
--batch_size 1024 --distill_steps 1 --static_labels 0 --random_init_labels hard --textdata True --visualize ''
--distilled_images_per_class_per_step 1 --distill_epochs 5 --distill_lr 0.01 --decay_epochs 10 --log_interval 5
--epochs 25 --lr 0.01 --ntoken 10000 --ninp 768 --maxlen 75 --results_dir text_results/umsab_20by1_unkinit_repl1
--device_id 0 --phase test
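For convenience, the two phases can be chained in a single shell script in the style of run.sh. The sketch below only reuses the flags from the two examples above; RESULTS_DIR is a plain shell variable added for readability, not an option of main.py:

```bash
#!/usr/bin/env bash
# Sketch: distill the UMSAB data, then evaluate the distilled data.
RESULTS_DIR=text_results/umsab_20by1_unkinit_repl1

# Phase 1 (train): learn the distilled data and log it to $RESULTS_DIR.
python main.py --mode distill_basic --dataset umsab --arch TextConvNet_BERT \
  --batch_size 1024 --distill_steps 1 --static_labels 0 --random_init_labels hard \
  --textdata True --visualize '' --distilled_images_per_class_per_step 1 \
  --distill_epochs 5 --distill_lr 0.01 --decay_epochs 10 --log_interval 5 \
  --epochs 25 --lr 0.01 --ntoken 10000 --ninp 768 --maxlen 75 \
  --results_dir "$RESULTS_DIR" --device_id 0 --phase train

# Phase 2 (test): train the same model on the distilled data and
# generate the nearest word embeddings from $RESULTS_DIR.
python main.py --mode distill_basic --dataset umsab --arch TextConvNet_BERT \
  --batch_size 1024 --distill_steps 1 --static_labels 0 --random_init_labels hard \
  --textdata True --visualize '' --distilled_images_per_class_per_step 1 \
  --distill_epochs 5 --distill_lr 0.01 --decay_epochs 10 --log_interval 5 \
  --epochs 25 --lr 0.01 --ntoken 10000 --ninp 768 --maxlen 75 \
  --results_dir "$RESULTS_DIR" --device_id 0 --phase test
```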
The file docs/advanced.md by the original authors gives a detailed description of useful parameters.
References:
- Soft-Label Dataset Distillation and Text Dataset Distillation Paper
- Dataset Distillation: the code in the original repo was written by Tongzhou Wang, Jun-Yan Zhu, and Ilia Sucholutsky.