Model Name | Model Type (Encoder-Decoder, etc.) | Pre-train Objective | Tokenization | Vocab Size | OOV Handling | Embeddings | Attention | Activations | Parameters | Training | Pre-Train Data | Batch Size |
---|---|---|---|---|---|---|---|---|---|---|---|---|
The goal is to explore using reinforcement learning (RL) to optimize objectives that are complex and defined by human judgment, captured via reward learning.
In this paper, pre-trained models are fine-tuned with RL rather than the usual supervised learning objective. Interestingly, to prevent the model from drifting too far from the pre-trained distribution, a KL divergence penalty against the pre-trained model is added to the reward.
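A minimal sketch of that KL-shaped reward, assuming per-sequence log-probabilities are available for both the policy and the frozen pre-trained model (the function name and the `beta` value are illustrative, not the paper's exact implementation):

```python
import torch

def kl_shaped_reward(learned_reward, logprob_policy, logprob_pretrained, beta=0.02):
    """RL reward = learned reward minus a KL-style penalty that keeps the
    policy close to the pre-trained language model.

    learned_reward:     r(x, y) from the reward model, shape [batch]
    logprob_policy:     log pi(y | x) under the policy being fine-tuned, shape [batch]
    logprob_pretrained: log rho(y | x) under the frozen pre-trained model, shape [batch]
    beta:               penalty coefficient (illustrative value, not the paper's)
    """
    kl_penalty = logprob_policy - logprob_pretrained  # per-sample estimate of log(pi / rho)
    return learned_reward - beta * kl_penalty
```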
Fine-tuning:
This RL-from-human-preferences setup is applied to two kinds of tasks:
- Stylistic continuation: 5k human comparisons were collected, each asking a human to pick the best of 4 sampled continuations.
The goal is to learn a reward function r(input, output_i) via a softmax loss over the 4 candidates (see the sketch after this list). During RL, this reward is penalized by a KL term against the language model's probability distribution. A separate policy `pi`, initialized from the language model, is trained with Proximal Policy Optimization (PPO).
- Summarization tasks: 60k human-curated examples where someone copies relevant sections of a larger text.
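A minimal sketch of the 4-way softmax (cross-entropy) loss used to fit the reward model, assuming the reward model has already scored each candidate; the tensor shapes and names here are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(scores, chosen):
    """Softmax loss for learning r(input, output_i) from 4-way human comparisons.

    scores: [batch, 4] tensor of reward-model scores r(x, y_i), one per candidate
    chosen: [batch] tensor of indices of the candidate the human preferred
    """
    # Cross-entropy over the 4 scores is -log( exp(r(x, y_b)) / sum_i exp(r(x, y_i)) ),
    # i.e. the negative log-probability the reward model assigns to the human's choice.
    return F.cross_entropy(scores, chosen)

# Example usage with random scores for a batch of 2 comparisons:
scores = torch.randn(2, 4)
chosen = torch.tensor([1, 3])
print(reward_model_loss(scores, chosen))
```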
Stylistic results:
- RL fine-tuned vs. zero-shot: humans preferred the RL fine-tuned model 86% of the time
- RL fine-tuned vs. supervised fine-tuned: humans preferred the RL fine-tuned model 77% of the time
Summarization results:
The authors were underwhelmed by these results, noting that, in contrast, the stylistic tasks required very little data.