The base version of this repo is a clone of Soft-Label Dataset Distillation and Text Dataset Distillation.
- Python 3
- NVIDIA GPU + CUDA
faiss==1.7.3
matplotlib==3.7.1
numpy==1.24.3
pandas==2.0.2
Pillow==9.5.0
PyYAML==5.4.1
scikit_learn==1.2.2
six==1.16.0
skimage==0.0
torch==1.13.1
torchtext==0.6.0
torchvision==0.14.1
tqdm==4.65.0
transformers==4.29.2
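These pinned packages can be installed with pip. A minimal sketch, assuming the list above is collected in a requirements.txt at the repo root (the virtual environment is optional and its name is illustrative):

```bash
# Create an isolated environment (optional) and install the pinned dependencies.
# Assumes the list above is saved as requirements.txt in the repo root.
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```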
The experiments in our project can be reproduced using the commands given in run.sh.
Example:
python main.py --mode distill_basic --dataset umsab --arch TextConvNet_BERT
--batch_size 1024 --distill_steps 1 --static_labels 0 --random_init_labels hard --textdata True --visualize ''
--distilled_images_per_class_per_step 1 --distill_epochs 5 --distill_lr 0.01 --decay_epochs 10 --log_interval 5
--epochs 25 --lr 0.01 --ntoken 10000 --ninp 768 --maxlen 75 --results_dir text_results/umsab_20by1_unkinit_repl1
--device_id 0 --phase train
This logs everything related to this experiment, including the distilled data, in text_results/.
Testing the distilled data generated by the above script (this trains the same model on the distilled data and also generates the nearest word embeddings):
python main.py --mode distill_basic --dataset umsab --arch TextConvNet_BERT
--batch_size 1024 --distill_steps 1 --static_labels 0 --random_init_labels hard --textdata True --visualize ''
--distilled_images_per_class_per_step 1 --distill_epochs 5 --distill_lr 0.01 --decay_epochs 10 --log_interval 5
--epochs 25 --lr 0.01 --ntoken 10000 --ninp 768 --maxlen 75 --results_dir text_results/umsab_20by1_unkinit_repl1
--device_id 0 --phase test
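For convenience, the two phases can be chained in a single shell script in the style of run.sh. The sketch below only reuses the flags from the two examples above; RESULTS_DIR is a plain shell variable added for readability, not an option of main.py:

```bash
#!/usr/bin/env bash
# Sketch: distill the UMSAB data, then evaluate the distilled data.
RESULTS_DIR=text_results/umsab_20by1_unkinit_repl1

# Phase 1 (train): learn the distilled data and log it to $RESULTS_DIR.
python main.py --mode distill_basic --dataset umsab --arch TextConvNet_BERT \
  --batch_size 1024 --distill_steps 1 --static_labels 0 --random_init_labels hard \
  --textdata True --visualize '' --distilled_images_per_class_per_step 1 \
  --distill_epochs 5 --distill_lr 0.01 --decay_epochs 10 --log_interval 5 \
  --epochs 25 --lr 0.01 --ntoken 10000 --ninp 768 --maxlen 75 \
  --results_dir "$RESULTS_DIR" --device_id 0 --phase train

# Phase 2 (test): train the same model on the distilled data and
# generate the nearest word embeddings from $RESULTS_DIR.
python main.py --mode distill_basic --dataset umsab --arch TextConvNet_BERT \
  --batch_size 1024 --distill_steps 1 --static_labels 0 --random_init_labels hard \
  --textdata True --visualize '' --distilled_images_per_class_per_step 1 \
  --distill_epochs 5 --distill_lr 0.01 --decay_epochs 10 --log_interval 5 \
  --epochs 25 --lr 0.01 --ntoken 10000 --ninp 768 --maxlen 75 \
  --results_dir "$RESULTS_DIR" --device_id 0 --phase test
```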
The file docs/advanced.md by the original authors gives a detailed description of useful parameters.
References:
- Soft-Label Dataset Distillation and Text Dataset Distillation Paper
- Dataset Distillation: the code in the original repo was written by Tongzhou Wang, Jun-Yan Zhu, and Ilia Sucholutsky.