
Exploring Multilingual Text Dataset Distillation

The base version of this repo is a clone of Soft-Label Dataset Distillation and Text Dataset Distillation.

Methods Implemented:

  1. VanillaDistill
  2. SkipLookupDistill
  3. VocabDistill (Softmax)
  4. VocabDistill (Gumbel) (see the sketch after this list)
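
For intuition, here is a minimal, hypothetical sketch of the relaxation behind VocabDistill (Gumbel): learnable vocabulary logits are turned into approximately one-hot token choices with the straight-through Gumbel-softmax, so the distilled "text" stays differentiable. The variable names and the dimensions (taken from the --ntoken, --maxlen, and --ninp flags in the commands below) are illustrative, not the repo's actual code.

import torch
import torch.nn.functional as F

# Hypothetical sketch, not the repo's code: one distilled sentence of
# length maxlen over a vocabulary of ntoken words.
ntoken, maxlen, ninp = 10000, 75, 768
logits = torch.randn(maxlen, ntoken, requires_grad=True)  # learnable "text"
embedding = torch.nn.Embedding(ntoken, ninp)              # embedding table
embedding.weight.requires_grad_(False)                    # kept frozen here

# Straight-through Gumbel-softmax: the forward pass is one-hot (a discrete
# token per position), the backward pass uses the soft relaxation, so the
# distillation loss can backpropagate into `logits`.
onehot = F.gumbel_softmax(logits, tau=1.0, hard=True)     # (maxlen, ntoken)
distilled_inputs = onehot @ embedding.weight              # (maxlen, ninp)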

Prerequisites

System requirements

  • Python 3
  • NVIDIA GPU + CUDA

Dependencies

  • faiss==1.7.3
  • matplotlib==3.7.1
  • numpy==1.24.3
  • pandas==2.0.2
  • Pillow==9.5.0
  • PyYAML==5.4.1
  • scikit_learn==1.2.2
  • six==1.16.0
  • scikit-image (the pinned skimage==0.0 is a non-functional PyPI stub; the real package is scikit-image)
  • torch==1.13.1
  • torchtext==0.6.0
  • torchvision==0.14.1
  • tqdm==4.65.0
  • transformers==4.29.2
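
The pins above read like pip freeze output. A best-guess install with pip follows; note that faiss 1.7.3 is normally distributed via conda, so the faiss line is an assumption rather than a pin taken from this repo:

pip install matplotlib==3.7.1 numpy==1.24.3 pandas==2.0.2 Pillow==9.5.0 \
 PyYAML==5.4.1 scikit-learn==1.2.2 scikit-image six==1.16.0 torch==1.13.1 \
 torchtext==0.6.0 torchvision==0.14.1 tqdm==4.65.0 transformers==4.29.2
pip install faiss-gpu  # or: conda install -c pytorch faiss-gpu=1.7.3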

Using this repo

The experiments in our project can be reproduced using the commands in run.sh. Example:

VanillaDistill on UMSAB dataset

Distillation

python main.py --mode distill_basic --dataset umsab --arch TextConvNet_BERT \
 --batch_size 1024 --distill_steps 1 --static_labels 0 --random_init_labels hard --textdata True --visualize '' \
 --distilled_images_per_class_per_step 1 --distill_epochs 5 --distill_lr 0.01 --decay_epochs 10 --log_interval 5 \
 --epochs 25 --lr 0.01 --ntoken 10000 --ninp 768 --maxlen 75 --results_dir text_results/umsab_20by1_unkinit_repl1 \
 --device_id 0 --phase train

This logs everything related to the experiment, including the distilled data, under text_results/.

To test the distilled data generated by the above script (this trains the same model on the distilled data and also generates the nearest word embeddings; see the sketch after the command):

python main.py --mode distill_basic --dataset umsab --arch TextConvNet_BERT \
 --batch_size 1024 --distill_steps 1 --static_labels 0 --random_init_labels hard --textdata True --visualize '' \
 --distilled_images_per_class_per_step 1 --distill_epochs 5 --distill_lr 0.01 --decay_epochs 10 --log_interval 5 \
 --epochs 25 --lr 0.01 --ntoken 10000 --ninp 768 --maxlen 75 --results_dir text_results/umsab_20by1_unkinit_repl1 \
 --device_id 0 --phase test
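
The "nearest word embeddings" step maps each learned distilled embedding back to the closest token in the vocabulary. A minimal, hypothetical sketch of that decoding with faiss (which appears in the dependency list) might look like the following; the array names and the random stand-in data are illustrative, not the repo's actual interfaces:

import faiss
import numpy as np

# Hypothetical sketch, not the repo's code: decode distilled embeddings
# (maxlen x ninp) back to the nearest tokens in an embedding table
# (ntoken x ninp) with an exact L2 nearest-neighbour search.
ntoken, ninp, maxlen = 10000, 768, 75
vocab_emb = np.random.randn(ntoken, ninp).astype("float32")  # stand-in table
distilled = np.random.randn(maxlen, ninp).astype("float32")  # stand-in result

index = faiss.IndexFlatL2(ninp)            # exact L2 index over the vocabulary
index.add(vocab_emb)                       # one entry per token id
_, token_ids = index.search(distilled, 1)  # nearest token per position
print(token_ids.ravel())                   # decoded "nearest word" ids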

The file docs/advanced.md, written by the original authors, gives a detailed description of useful parameters.

References:

  1. Soft-Label Dataset Distillation and Text Dataset Distillation (paper)
  2. Dataset Distillation: the code in the original repo was written by Tongzhou Wang, Jun-Yan Zhu, and Ilia Sucholutsky.
