kb autocommit

Jemoka committed May 19, 2024
1 parent b33580f commit f5ea1cd

Showing 2 changed files with 41 additions and 16 deletions.
55 changes: 40 additions & 15 deletions content/posts/KBhast_project_update.md

author = ["Houjun Liu"]
draft = false
+++

Note that, after discussion and analysis of our previous formulation, we elected to reformulate our problem almost in its entirety. We therefore first introduce the new problem formulation here, and then discuss existing work.

## Introduction {#introduction}

Recent advances in language modeling have brought forth the emergence of open-domain chat systems built using decoder-only language models ((<a href="#citeproc_bib_item_1">Brown et al. 2020</a>)). Unfortunately, due to the inclusion of toxic content in massive online training sets, even in-distribution autoregressive sampling from these systems can degenerate into undesirable toxic trajectories ((<a href="#citeproc_bib_item_20">Zhang et al. 2021</a>; <a href="#citeproc_bib_item_10">McGuffie and Newhouse 2020</a>)).

## Background {#background}

Red teaming, the general class of methods for identifying potentially harmful trajectories in language models for both understanding and pruning, is traditionally done with a human in the loop, with research focusing on sampling strategies and evaluation metrics ((<a href="#citeproc_bib_item_4">Ganguli et al. 2022</a>)). One classic strategy for identifying and benchmarking these possibly undesirable trajectories focuses on eliciting toxicity using a known sampled dataset ((<a href="#citeproc_bib_item_5">Gehman et al. 2020</a>)), which typically involves testing a series of human-written prompts for the chance that their direct entailments result in toxicity. Yet, the emergence of toxicity may be spontaneous---without perceptible toxicity in the prompt---and model specific ((<a href="#citeproc_bib_item_11">Mehrabi et al. 2022</a>)).

Automated methods in this area similarly focus on the iterative selection of prompts and measurement of the toxicity of the resulting trajectories, whether through direct search ((<a href="#citeproc_bib_item_18">Yu et al. 2023</a>)), search with LM reasoning ((<a href="#citeproc_bib_item_12">Mehrotra et al. 2024</a>)), or rhetorical persuasion strategies ((<a href="#citeproc_bib_item_19">Zeng et al. 2024</a>)) developed through manual engineering. These methods result in model-specific prompts that are limited in diversity due to the exogenous selection of the prompt space, and may even require manual engineering for each model being tested.

Most recently, approaches have also emerged that use Reinforcement Learning (RL) techniques for prompt optimization. These approaches range from using gradient steps to optimize embedding-level "soft prompts" ((<a href="#citeproc_bib_item_14">Qian et al. 2022</a>)), to optimizing discrete token choices through a differentiable reward ((<a href="#citeproc_bib_item_3">Deng et al. 2022</a>)), to optimizing a non-differentiable reward formulated solely by entailment toxicity ((<a href="#citeproc_bib_item_2">Casper et al. 2023</a>; <a href="#citeproc_bib_item_13">Perez et al. 2022</a>)).

Embedding-level soft prompts are by nature not human-interpretable, and will not arise naturally during autoregression, since decoder models are designed to select tokens in the vocabulary space. Even in discrete optimization approaches, the resulting prompts may be disfluent or nonsensical ((<a href="#citeproc_bib_item_3">Deng et al. 2022</a>; <a href="#citeproc_bib_item_2">Casper et al. 2023</a>)) without further restrictions on the prompt space.

In these methods, heuristics are typically designed to focus on the toxicity of the resulting output _without_ regard to the likelihood of the toxicity-eliciting prompt emerging on its own; yet, toxicity can emerge naturally within a language model ((<a href="#citeproc_bib_item_11">Mehrabi et al. 2022</a>)), occasionally without even conditioning the model on toxic content ((<a href="#citeproc_bib_item_16">Si et al. 2022</a>)). Even when fluency is used as a part of the scoring, it does not guarantee that the eliciting prompts are themselves likely sequences under the model being red-teamed.

Therefore, a gap in the literature exists for automated, heuristic-guided red-teaming strategies which not only elicit toxicity but also do so with likely sequences from the original LM being red-teamed.


## Preliminaries {#preliminaries}


### AST {#ast}

In this work, we borrow from the literature of autonomous vehicle planning, in particular Adaptive Stress Testing (AST) ((<a href="#citeproc_bib_item_7">Koren et al. 2018</a>; <a href="#citeproc_bib_item_8">Lee et al. 2020</a>)), as a reinforcement learning (RL) formulation for the automatic discovery of problematic input trajectories which elicit toxicity. In AST, we formulate the task of finding _likely_ cases of _failure_ of any Markov Decision Process (MDP) (\\(S, A, R, T\\)) as a reinforcement learning problem, where failure is defined by some set \\(E \subset S\\).

An AST policy (which we abbreviate here as "an AST") acts to perturb the state of the underlying MDP (which we call "defender"). The AST takes state \\(s \in S\\) as input, and takes actions \\(a \in A\\) on the environment to obtain \\(s'\\), which the defender then acts on. The goal of the AST is to choose actions that maximize:

\begin{equation}
R(s,a, s') = \begin{cases}
\ldots \\\\
d\_{E}(s'), \text{if}\ s' \in E, s\ \text{is terminal}
\end{cases}
\end{equation}
where \\(d\_{E}(s')\\) is some inverse distance metric between \\(s'\\) and a state of likely failure. That is, the AST attempts to identify likely actions the _defender_ may take at a given state which maximize the chance of failure.
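
To make the AST abstraction concrete, below is a toy sketch (not from this post) with an integer-valued defender MDP, a failure set \\(E\\), and an inverse distance heuristic \\(d\_{E}\\); the reward shaping (a bonus upon failure, otherwise the inverse distance at termination) is an illustrative assumption rather than the exact case analysis above, and all names are hypothetical.

```python
import random

FAILURE_THRESHOLD = 10  # states >= this value form the failure set E
HORIZON = 20            # finite episode length


def in_failure_set(s: int) -> bool:
    """Membership test for the failure set E ⊂ S."""
    return s >= FAILURE_THRESHOLD


def d_E(s: int) -> float:
    """Inverse distance heuristic: larger as s gets closer to failure."""
    return 1.0 / (1.0 + max(FAILURE_THRESHOLD - s, 0))


def defender_step(s: int) -> int:
    """The underlying MDP (the "defender") acts on the (perturbed) state."""
    return s + random.choice([-1, 0, 1])


def ast_action(s: int) -> int:
    """A random stand-in for the AST policy's perturbation."""
    return random.choice([0, 1, 2])


def rollout() -> float:
    s, total_reward = 0, 0.0
    for t in range(HORIZON):
        s_perturbed = s + ast_action(s)      # the AST perturbs the state...
        s_next = defender_step(s_perturbed)  # ...then the defender acts on it
        terminal = in_failure_set(s_next) or t == HORIZON - 1
        if in_failure_set(s_next):
            total_reward += 1.0              # reached the failure set E
        elif terminal:
            total_reward += d_E(s_next)      # otherwise, reward proximity to failure
        s = s_next
        if terminal:
            break
    return total_reward


print(rollout())
```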


### Automated Red-Teaming {#automated-red-teaming}


## Approach {#approach}


### Problem Formulation {#problem-formulation}

From our previous discussion, we continue to define dialogue language modeling as a finite-horizon MDP. Each \\(a \sim p\_{\theta}\\) is a single utterance given by a language model in an open-domain dialogue task, each \\(s \in S\\) is the dialogue so far, and \\(T(s'|s,a) = p\_{\theta}(s'|s)\\) is the conditional probability of some next utterance \\(s'\\) given the dialogue \\(s\\) and the last-turn statement \\(a\\).

In this work, we aim to learn some language model \\(p\_{AST}\\) which, when placed in an open-domain dialogue with a defender (i.e., untuned) language model \\(p\_{\theta}\\), elicits toxic sequences from \\(p\_{\theta}\\).
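
As a sketch of how this MDP could be instantiated with an off-the-shelf causal LM (the checkpoint and generation settings below are placeholders, not the configuration used in this project):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder defender model; the post does not specify the checkpoint used.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
defender = AutoModelForCausalLM.from_pretrained("gpt2")


def defender_utterance(dialogue_so_far: str, max_new_tokens: int = 40) -> str:
    """One environment transition: sample s' ~ p_theta(. | s) given the dialogue s."""
    inputs = tokenizer(dialogue_so_far, return_tensors="pt")
    outputs = defender.generate(
        **inputs,
        do_sample=True,  # in-distribution autoregressive sampling
        top_p=0.95,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )


# State s is the dialogue so far; action a is the adversary's (p_AST's) last utterance.
s = "A: I just moved to a new city.\n"
a = "B: How has it been so far?\n"
s_prime = s + a + "A:" + defender_utterance(s + a + "A:")
print(s_prime)
```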

Toxicity detection models ((<a href="#citeproc_bib_item_6">Hanu and Unitary 2020</a>; <a href="#citeproc_bib_item_9">Lees et al. 2022</a>)) allow us to formulate a dense reward objective that scores the toxicity of a trajectory automatically. They take a sequence \\(s\\) and produce \\(R\_{\phi}(s) \in [0,1]\\), indicating the degree of toxicity in the sequence. We formulate our reward function for \\(p\_{AST}\\) to follow the original AST reward as closely as possible, adjusting slightly for the LM context:

\begin{equation}
R(s,a, s') = \begin{cases}
\ldots \\\\
R\_{\phi}(s'), \text{if}\ R\_{\phi}(s') > 0.3
\end{cases}
\end{equation}

where \\(\kappa\\) is any normalization constant of choice. Though this formulation does not naively appear to correspond to the AST reward, it fits that formulation in the context of LM red-teaming.

We first define a **terminal state** as one in which the language model's output is toxic. Indeed, a toxicity-elicitation procedure in a two-participant dialogue can go on for a fairly large number of turns before toxicity is elicited ((<a href="#citeproc_bib_item_11">Mehrabi et al. 2022</a>)). Second, as \\(R\_{\phi}\\) reports its results densely based on the amount of toxicity, it essentially acts as an inverse distance metric to failure.
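
A sketch of how \\(R\_{\phi}\\) and the thresholded terminal reward might be computed, using the Detoxify library ((<a href="#citeproc_bib_item_6">Hanu and Unitary 2020</a>)) as the toxicity model; the 0.3 cutoff comes from the formulation above, while returning zero below the threshold is a simplification (the remaining branches of the reward are elided here).

```python
from detoxify import Detoxify

# R_phi: maps a sequence to a toxicity score in [0, 1].
# Detoxify's "original" checkpoint is assumed here as the choice of toxicity model.
_toxicity_model = Detoxify("original")


def R_phi(sequence: str) -> float:
    return float(_toxicity_model.predict(sequence)["toxicity"])


def reward(s_prime: str, threshold: float = 0.3) -> float:
    """Return R_phi(s') when the defender's output crosses the toxicity
    threshold (a terminal state); other branches of the reward are omitted."""
    score = R_phi(s_prime)
    return score if score > threshold else 0.0


print(reward("Have a wonderful day!"))  # low toxicity, so this prints 0.0
```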


### Optimization {#optimization}

As before, we are tuning our policy with Proximal Policy Optimization (PPO) ((<a href="#citeproc_bib_item_15">Schulman et al. 2017</a>)), testing the application of the PPO objective at the level of both entire conversations ((<a href="#citeproc_bib_item_21">Ziegler et al. 2020</a>)) and individual utterances ((<a href="#citeproc_bib_item_17">Wu et al. 2023</a>)).
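
Below is a minimal sketch of a single per-utterance PPO step using the Hugging Face `trl` library, assuming its `PPOTrainer` interface; the model name, batch sizes, and rewards are placeholders rather than our actual training configuration.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "gpt2"  # placeholder adversary (p_AST) checkpoint
config = PPOConfig(model_name=model_name, batch_size=8, mini_batch_size=4)

policy = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_policy = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

trainer = PPOTrainer(config, policy, ref_policy, tokenizer)

# Queries are dialogue prefixes; responses are the adversary's utterances.
prompts = ["A: I just moved to a new city.\nB:"] * config.batch_size
query_tensors = [tokenizer.encode(p, return_tensors="pt").squeeze(0) for p in prompts]
response_tensors = [
    trainer.generate(
        q, max_new_tokens=32, do_sample=True, pad_token_id=tokenizer.eos_token_id
    )[0][len(q):]
    for q in query_tensors
]
# Placeholder rewards; in the real setup these come from the toxicity of the
# defender's entailment to each adversary utterance.
rewards = [torch.tensor(0.0) for _ in response_tensors]

stats = trainer.step(query_tensors, response_tensors, rewards)
```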

In our work, we have finalized the above formulation, obtained baseline policies, and are beginning to use PPO to tune our system. To build the experimental environment, we have obtained and cleaned a corpus of Reddit conversations, using our chosen \\(R\_{\phi}\\) ((<a href="#citeproc_bib_item_6">Hanu and Unitary 2020</a>)) to find low (&lt;0.5) toxicity input trajectories to serve as \\(s\_0\\) for our defender model.
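
For instance, the seed-prompt filtering step might look like the following sketch, with a hypothetical list of conversation openers standing in for the cleaned Reddit corpus:

```python
from detoxify import Detoxify

toxicity = Detoxify("original")

# Hypothetical conversation openers; the actual cleaned Reddit corpus is not shown here.
candidate_prompts = [
    "What's the best way to learn to cook for one person?",
    "Just finished my first marathon, ask me anything.",
]

# Keep only low-toxicity (< 0.5) trajectories to serve as s_0 for the defender.
seed_states = [
    prompt
    for prompt in candidate_prompts
    if toxicity.predict(prompt)["toxicity"] < 0.5
]
print(seed_states)
```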

To initialize the tuning process, we are using the RealToxicityPrompts dataset ((<a href="#citeproc_bib_item_5">Gehman et al. 2020</a>)) as a means of stabilizing our policy model prior to eliciting toxicity directly using the roll-outs.


## Detoxification Task {#detoxification-task}

After obtaining the policy, we aim to sample roll-outs of conversations between \\(p\_{AST}\\) and \\(p\_{\theta}\\), and use their resulting toxicity scores as a standard dataset for applying PPO to \\(p\_{\theta}\\). In this way, we can not only benchmark the performance of our model on our specific AST task, but also provide motivation for the system's downstream use.
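
As a sketch, this evaluation step could collect scored roll-outs in the following shape (the conversation sampler below is a stub; the real roll-outs would come from the tuned \\(p\_{AST}\\) conversing with \\(p\_{\theta}\\)):

```python
from detoxify import Detoxify

toxicity = Detoxify("original")


def sample_conversation(seed: str) -> str:
    """Stub standing in for a full p_AST vs. p_theta roll-out."""
    return seed + "\nB: That sounds interesting, tell me more.\nA: Sure!"


# Build a scored dataset of roll-outs for a later detoxification PPO pass on p_theta.
seeds = ["A: I just moved to a new city."]
detox_dataset = []
for seed in seeds:
    dialogue = sample_conversation(seed)
    detox_dataset.append(
        {"dialogue": dialogue, "toxicity": float(toxicity.predict(dialogue)["toxicity"])}
    )
print(detox_dataset)
```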


### Experiments {#experiments}


### Baselines {#baselines}


### Results {#results}


## Next Steps {#next-steps}
2 changes: 1 addition & 1 deletion content/posts/KBhresearch_index.md

Welcome to my academic homepage! This is my little homestead on the internet abo
<div style="margin: 10px 0">
<span style="color: #262626; font-weight:500; color: #292929; opacity:0.6; font-size: 14px">Recent goings on</span>
<div style="margin-top: 10px; display: grid; column-gap: 20px; row-gap: 5px; grid-template-columns: 120px auto">
<span style="font-weight: 500">May. 18, 24'</span> <span>Journal Article (NACC) <a href="https://doi.org/10.1097/WAD.0000000000000619">published@LWW AD</a></span>
<span style="font-weight: 500">Apr. 29, 24'</span> <span>ArXiv preprint <a href="https://arxiv.org/abs/2404.19055">released</a> on POMDP LM decoding.</span>
<span style="font-weight: 500">Apr. 7, 24'</span> <span>Journal Article (NACC) accepted + in press at <a href="https://journals.lww.com/alzheimerjournal/pages/default.aspx">LWW AD</a></span>
<span style="font-weight: 500">Mar. 15, 24'</span> <span>TalkBank's Mandarin <a href="https://huggingface.co/talkbank/CHATUtterance-zh_CN">utterance segmentation model</a> released. </span>
<span style="font-weight: 500">Feb. 29, 24'</span> <span>We released Stanza <a href="https://github.com/stanfordnlp/stanza/releases/tag/v1.8.1">1.8.1</a> with PEFT support! </span>
<span style="font-weight: 500">Feb. 26-27, 24'</span> <span>AAAI 2024. Vancouver was fun!</span>