Skip to content

Codebase for Obfuscated Activations Bypass LLM Latent-Space Defenses

Notifications You must be signed in to change notification settings

LukeBailey181/obfuscated-activations

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 

Repository files navigation

Benign Robot Logo

Obfuscated Activations Bypass LLM Latent-Space Defenses

arXiv Website

This repository contains the code for the paper Obfuscated Activations Bypass LLM Latent-Space Defenses. The codebase is organized into two main components.

Repository Structure

.
├── inference_time_experiments/  # Code for inference-time attacks (Sections 3 & 5)
└── train_time_experiments/      # Code for training-time attacks (Sections 4 & 6.2)

Inference Time Experiments

The inference_time_experiments/ directory contains code for:

  • Section 3: Case Study: Detecting Harmfulness
  • Section 5: Case Study: SAE Probing

This includes implementations for attacking models at inference time through suffix based attacks, with experiments on harmfulness detection and SAE probing.

Training Time Experiments

The train_time_experiments/ directory contains code for:

  • Section 4: Case Study: Backdoor Obfuscation
  • Section 6.2: Training Models to be Easier to Monitor (Obfuscated Adversarial Training)

This includes implementations for training backdoored models with both explicit and natural obfuscation strategies, as well as Obfuscated Adversarial Training (OAT).

Getting Started

Each subdirectory contains its own README with detailed instructions for installation and running experiments. Please refer to:

  • inference_time_experiments/README.md
  • train_time_experiments/README.md

Contact

If you have any issues with the code, feel free to submit a github issue or contact Luke Bailey at [email protected].

Citation

@article{bailey2024obfuscated,
	title={Obfuscated Activations Bypass LLM Latent-Space Defenses},
	author={Bailey, Luke and Serrano, Alex and Sheshadri, Abhay and Seleznyov, Mikhail and Taylor, Jordan and Jenner, Erik and Hilton, Jacob and Casper, Stephen and Guestrin, Carlos and Emmons, Scott},
	journal={arXiv preprint arXiv:2412.09565},
	year={2024}
}

About

Codebase for Obfuscated Activations Bypass LLM Latent-Space Defenses

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published