This codebase is built upon the great work of Qian et al. (2022). Using this library, you can easily integrate neural augmentation models into your NLP pipelines or train new perturber models from scratch.
`perturbers` is available on PyPI and can be installed using pip:

```bash
pip install perturbers
```
Using a perturber is as simple as creating a new instance of the `Perturber` class and calling the `generate` method with the sentence you want to perturb, along with the target word and the attribute you want to change:
```python
from perturbers import Perturber

perturber = Perturber()
unperturbed = "Jack was passionate about rock climbing and his love for the sport was infectious to all men around him."
perturber.generate(unperturbed, "Jack", "female")
# "Jane was passionate about rock climbing and her love for the sport was infectious to all men around her."
```
You can also perturb a sentence without specifying a target word or attribute:
perturber("Jack was passionate.", retry_unchanged=True)
# "Jackie was passionate."
To train a new perturber model, take a look at the `train_perturber.py` script, which trains a new perturber model on the PANDA dataset. Currently, the script only supports training BART models, but in principle any encoder-decoder model can be used.
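For orientation, here is a minimal sketch of what such conditional seq2seq training looks like with Hugging Face `transformers`. This is illustrative only, not the `train_perturber.py` script itself; the PANDA hub location, its field names, and the "word, attribute, sentence" prompt format are assumptions.

```python
# Illustrative sketch only; not the repo's train_perturber.py.
# ASSUMPTIONS: the PANDA dataset lives at "facebook/panda" and exposes
# "original", "selected_word", "target_attribute", and "perturbed" fields,
# and the perturber is conditioned via a "word, attribute, sentence" prompt.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    BartForConditionalGeneration,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

dataset = load_dataset("facebook/panda")  # assumed hub location

def preprocess(example):
    # Condition the encoder on the target word and attribute (assumed format),
    # and train the decoder to emit the human-written perturbation.
    prompt = f"{example['selected_word']}, {example['target_attribute']}, {example['original']}"
    inputs = tokenizer(prompt, truncation=True, max_length=256)
    inputs["labels"] = tokenizer(example["perturbed"], truncation=True, max_length=256)["input_ids"]
    return inputs

tokenized = dataset["train"].map(preprocess, remove_columns=dataset["train"].column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="perturber-out", num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```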
Perturber models are evaluated based on the following metrics:

- `bleu4`: The 4-gram BLEU score of the perturbed sentence compared to the original sentence
- `perplexity`: The perplexity of the perturbed sentence
- `perplexity_perturbed`: The perplexity of only the perturbed tokens from the perturbed sentence
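As a rough illustration of how these metrics can be computed (a sketch, not the repo's evaluation code; the choice of GPT-2 as the scoring language model is an assumption):

```python
# Sketch of the evaluation metrics; not the repo's evaluation code.
# ASSUMPTION: GPT-2 is used here as a stand-in scoring language model.
import evaluate
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

original = "Jack was passionate about rock climbing."
perturbed = "Jane was passionate about rock climbing."

# bleu4: 4-gram BLEU of the perturbed sentence against the original
bleu = evaluate.load("bleu")
bleu4 = bleu.compute(predictions=[perturbed], references=[[original]], max_order=4)["bleu"]

# perplexity: exp of the mean token negative log-likelihood under an LM
lm = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2TokenizerFast.from_pretrained("gpt2")
ids = tok(perturbed, return_tensors="pt").input_ids
with torch.no_grad():
    loss = lm(ids, labels=ids).loss  # mean NLL over all tokens
perplexity = torch.exp(loss).item()

# perplexity_perturbed would restrict the same computation to the token
# positions that actually changed, since unchanged tokens dominate the
# average and push overall perplexity toward 1.
```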
In addition to the codebase, we also provide pre-trained perturber models in a variety of sizes:
| Model | Base model | Parameters | Perplexity | Perplexity (perturbed idx)* | BLEU4 |
|---|---|---|---|---|---|
| perturber-small | bart-small | 70M | 1.076 | 4.079 | 0.822 |
| perturber-base | bart-base | 139M | 1.058 | 2.769 | 0.794 |
| perturber (original) | bart-large | 406M | 1.06** | N/A | 0.88** |
\* Measures the perplexity of only the perturbed tokens; since the majority of tokens remain unchanged, overall perplexity scores approach 1.

\*\* The perplexity and BLEU4 scores are those reported in the original paper and were not measured with this codebase.
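If you want to try one of these checkpoints directly, the sketch below shows what selecting a specific size might look like. Note that passing a model identifier to the `Perturber` constructor and the `"fairnlp/perturber-small"` path are assumptions made for illustration; check the project documentation for the actual constructor arguments and model hub names.

```python
from perturbers import Perturber

# HYPOTHETICAL: assumes the Perturber constructor accepts a model
# identifier; "fairnlp/perturber-small" is an illustrative placeholder.
small_perturber = Perturber("fairnlp/perturber-small")
small_perturber.generate(
    "Jack was passionate about rock climbing.", "Jack", "female"
)
```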
Planned features include:

- Add default perturber model
- Pretrain small and medium perturber models
- Train model to identify target words and attributes
- Add training of unconditional perturber models (i.e. models that receive only a sentence, with no target word or attribute)
- Add self-training by pretraining the perturber base model (e.g. BART) on self-perturbed data
Other possible features include:
- Data cleaning of PANDA (remove non-target perturbations)
- Multilingual perturbation