Vaccine: Perturbation-aware Alignment for Large Language Model

Code to replicate the Vaccine paper and tools for evaluating defense against harmful fine-tuning. Vaccine is a perturbation-aware safety alignment for large language models (LLMs). The alignment technique is developed mainly against the security risk in the user finetuning stage.

Demo

Check out our Project homepage!

Main code logistic

We implement a customized trainer on top of the original HuggingFace Trainer. The main logistic is only 7 lines of code in trainer.py.

self.pre_first_step(model) //Register gradient tracker for hidden embedding
step()  //Backward
self.after_first_step(model) // Cancel the gradient tracker  
model.zero_grad()
self.pre_second_step(model) // Calculate the perturbation and register the forward hook with  perturbation. 
loss = step() //Backward
self.after_second_step(model) // Cancel the forward hook

Package requirement

The package requirement is listed in vaccine.yml and vaccine_pip.txt. Run the following code to install the packages with anaconda and pip.

conda env create -f vaccine.yml
pip install -r vaccine_pip.txt

Data preparation

For finetuning task, we first need to run the following scripts to prepare the sueprvised finetuning data.

cd sst2
python build_dataset.py
cd ../gsm8k
python build_dataset.py
cd ../ag_news
python build_dataset.py
cd ..

Huggingface Llama2 access

Llama2-7B is a gated repo, which need a formal request to get access to the model. Check out https://huggingface.co/meta-llama/Llama-2-7b-hf. After applying permission from meta, you should be able to access the model, but you first need to enter your token in the file huggingface_token.txt.

Example command to run

We prepare scripts for re-producing all the experiments in the paper. We recommend to use Slurm to reproduce the results as the logging file will be automatically organized into the script directory (if you don't use Slurm, just replace sbatch with bash in our example).

We first run Vaccine (with perturbation intensity=2) to produce the aligned model.

cd script/alignment
sbatch  Vaccine.sh 2

Then we finetune the model using 10% of harmful data with a total number of 1000 samples from SST2 dataset.

cd ../vaccine_finetune
sbatch  sst2.sh 2 0.1 1000
cd ../..

For comparison, the baseline SFT alignment can be run with the following code.

cd script/alignment
sbatch  SFT.sh

Then we finetune the model with the same data setting.

cd ../sft_finetune
sbatch  sst2.sh 2 0.1 1000

A line of defense designs

We are commited to design defenses from different angles to harmful fine-tuning. The current avaialble defenses built in the disl group include:

Alignment stage defense: Vaccine, Booster
Fine-tuning stage defense: Lisa
Post-fine-tuning stage defense: Antidote

We always welcome different forms of collaboration. If you are interested, please reach out Tiansheng Huang ([email protected]) for discussion.

OOM

Our experiments are done by H100 (or at lease A100 with 40G memory). For running experiments on consumer GPUs, e.g., RTX4090, you may check out the repo of another paper T-Vaccine.

Citation

If you feel our project is useful, you may cite our paper with the following bibtex.

@article{huang2024vaccine,
  title={Vaccine: Perturbation-aware alignment for large language model},
  author={Huang, Tiansheng and Hu, Sihao and Liu, Ling},
  journal={arXiv preprint arXiv:2402.01109},
  year={2024}
}

@article{huang2024booster,
  title={Booster: Tackling Harmful Fine-tuning for Large Language Models via Attenuating Harmful Perturbation},
  author={Huang, Tiansheng and Hu, Sihao and Ilhan, Fatih and Tekin, Selim Furkan and Liu, Ling},
  journal={arXiv preprint arXiv:2409.01586},
  year={2024}
}

@article{huang2024antidote,
  title={Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning},
  author={Huang, Tiansheng and Bhattacharya, Gautam and Joshi, Pratik and Kimball, Josh and Liu, Ling},
  journal={arXiv preprint arXiv:2408.09600},
  year={2024}
}

@article{huang2024lazy,
  title={Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning},
  author={Huang, Tiansheng and Hu, Sihao and Ilhan, Fatih and Tekin, Selim Furkan and Liu, Ling},
  journal={arXiv preprint arXiv:2405.18641},
  year={2024}
}

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
ag_news		ag_news
alpaca		alpaca
cache		cache
ckpt		ckpt
gsm8k		gsm8k
images		images
poison/evaluation		poison/evaluation
script		script
sst2		sst2
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
huggingface_token.txt		huggingface_token.txt
train.py		train.py
trainer.py		trainer.py
utils.py		utils.py
vaccine.yml		vaccine.yml
vaccine_pip.txt		vaccine_pip.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Vaccine: Perturbation-aware Alignment for Large Language Model

Demo

Main code logistic

Package requirement

Data preparation

Huggingface Llama2 access

Example command to run

A line of defense designs

OOM

Citation

About

Releases

Packages

Languages

License

git-disl/Vaccine

Folders and files

Latest commit

History

Repository files navigation

Vaccine: Perturbation-aware Alignment for Large Language Model

Demo

Main code logistic

Package requirement

Data preparation

Huggingface Llama2 access

Example command to run

A line of defense designs

OOM

Citation

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages