Preference Alignment: RLHF and DPO

Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) is an approach that leverages human preferences to train and improve the quality of language models. The framework combines elements of reinforcement learning and supervised learning, allowing systems to learn and make decisions in a manner that aligns more closely with human preferences. Unlike traditional reinforcement learning, where models learn from rewards generated through interactions with the environment, RLHF uses human feedback as the source of guidance for the model. This feedback helps the system navigate complex decision-making processes and align with human expectations. RLHF appears in a variety of applications across different domains, ranging from recommendation systems and natural language processing to robotics and autonomous vehicles. By incorporating human feedback into the training process, RLHF has the potential to enhance model performance, improve user experience, and contribute to the development of responsible and ethical AI technologies.

RLHF is a multi-stage process that utilizes human guidance to effectively train AI models. The core steps involved are as follows [1]:

Step 1: Pretraining a Language Model

The process begins with a language model pre-trained using conventional methods; the base model can be any existing language model such as BERT, RoBERTa, T5, GPT, or others. This initial model serves as the starting point for the RLHF process. This stage falls under supervised learning and is often referred to as Supervised Fine-Tuning (SFT).

The choice of pre-trained language model may vary, ranging from smaller models to modern architectures with billions of parameters.
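As a concrete illustration, here is a minimal sketch of this supervised fine-tuning step in PyTorch with the Hugging Face transformers library. GPT-2 as the base model and the single toy prompt-response pair are illustrative assumptions, not the setup of any particular RLHF pipeline; real SFT uses instruction datasets, batching, and usually masks the prompt tokens out of the loss.

```python
# Minimal SFT sketch (assumptions: GPT-2 as the base model, one toy example).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Toy prompt-response pair standing in for an instruction dataset.
pairs = [
    ("Explain RLHF in one sentence.",
     "RLHF fine-tunes a language model using human preference feedback."),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for prompt, response in pairs:
    batch = tokenizer(prompt + " " + response, return_tensors="pt")
    # Standard causal-LM objective: labels are the input ids themselves.
    # (A fuller implementation would mask the prompt tokens in the labels.)
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```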

Step 2: Gathering Data and Training a Reward Model

The data involves human interaction: outputs generated by various LMs are evaluated by users or experts, whose feedback on the model's responses is collected. This data is used to train the reward model (RM), or preference model, which distinguishes RLHF from previous techniques; the RM's primary role is to supply the reward function optimized within the reinforcement learning framework. The training process for the RM [2] is described below:

The SFT model is prompted with a set of prompts $x$ to generate pairs of answers $(y_1, y_2)$ drawn from the distribution $\pi_{\text{SFT}}(y \mid x)$. Human labelers then evaluate these pairs, expressing a preference for one answer, denoted $y_w$ (the preferred completion) and $y_l$ (the dispreferred completion) for the prompt $x$.

To optimize the RM, we assume that the human preferences are governed by a latent reward model $r^*(x, y)$, which remains inaccessible. We can express the human preference distribution $p^*$ using the Bradley-Terry model as follows:

$$p^*(y_1 \succ y_2 \mid x) = \frac{\exp(r^*(x, y_1))}{\exp(r^*(x, y_1)) + \exp(r^*(x, y_2))}$$

Given access to a static dataset of comparisons $D = \{(x^{(i)}, y_w^{(i)}, y_l^{(i)})\}_{i=1}^N$ sampled from this preference distribution, we can parametrize a reward model $r(x, y)$ and estimate its parameters via maximum likelihood.

To frame this problem as a binary classification, we can formulate the loss function for training the RM as:

$$L_R(r_\theta, D) = -\mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \log \sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right) \right]$$

where $\sigma$ is the logistic function. In the context of language models, the network $r_\theta(x, y)$ is typically initialized based on the SFT model $\pi_{\text{SFT}}(y \mid x)$ and augmented with a linear layer on top of the final transformer layer, producing a single scalar prediction for the reward value.

Moreover, to ensure the reward function has lower variance, prior work often normalizes the rewards such that $\mathbb{E}_{(x, y) \sim D}[r(x, y)] = 0$ for all $x$.
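The pairwise loss above is straightforward to implement. Below is a minimal PyTorch sketch under the assumption that the reward model has already scored a batch of preferred and dispreferred completions; the scoring network itself (SFT backbone plus a scalar head) is abstracted away, and the numeric values are made up for illustration.

```python
# Pairwise Bradley-Terry loss for reward-model training (minimal sketch).
# Assumes `chosen_rewards` holds r_theta(x, y_w) and `rejected_rewards`
# holds r_theta(x, y_l) for a batch of preference pairs.
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    # L_R = -E[ log sigma(r(x, y_w) - r(x, y_l)) ]
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up reward scores.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, -0.1, 1.5])
print(reward_model_loss(chosen, rejected))  # scalar training loss
```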

Step 3: Fine-tuning the LM with Reinforcement Learning

During the reinforcement learning (RL) phase, the learned reward function is used to provide feedback to the language model. This process requires two language models (LMs): one from the initial phase, which we refer to as the SFT model, and another which we will denote as the Proximal Policy Optimization (PPO) model.

Initially, a new prompt $x$ is introduced as input for the process. Using this prompt, we generate pairs of responses $(y_1, y_2)$ from the SFT model, which represents the base policy $\pi_{\text{base}}$. The output can be viewed as a probability distribution over the vocabulary based on the input prompt. Human labelers then evaluate these responses, selecting a preferred response denoted as $y_w$ and a dispreferred response $y_l$.

Next, the PPO model generates text for the newly introduced prompt. Once the text has been produced, the trained reward model (RM) scores the generated segment, and this score provides the reward signal used to update the PPO model. The optimization is formulated as:

$$\max_{\pi_\theta} \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(y \mid x)} \left[ r_\phi(x, y) \right] - \beta\, D_{\text{KL}} \left[ \pi_\theta(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x) \right]$$

where $\beta$ is a hyperparameter controlling the deviation from the base reference policy, namely the initial SFT model $\pi_{\text{SFT}}$. The KL divergence constraint is crucial as it prevents the PPO model from diverging too far from the distribution on which the reward model is accurate, while also maintaining generation diversity and avoiding mode collapse to a single high-reward response.

The reward used to update the PPO model can then be expressed as follows:

$$r(x, y) = r_\phi(x, y) - \beta \left( \log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x) \right)$$

This allows us to refine the PPO model iteratively, ensuring that the generated outputs remain aligned with human preferences as indicated by the reward model.
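To make the shaped reward concrete, the sketch below computes it from a reward-model score and precomputed sequence log-probabilities. The helper name, the `beta` default, and the numeric values are illustrative assumptions, and the full PPO machinery (advantages, clipping, value function) is omitted.

```python
# KL-penalized reward used during PPO fine-tuning (minimal sketch).
# Assumes `rm_score` is r_phi(x, y) and the log-prob arguments are the
# summed log-probabilities of the response y under the policy and the
# frozen reference model.
import torch

def kl_shaped_reward(rm_score: torch.Tensor,
                     policy_logprob: torch.Tensor,
                     ref_logprob: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    # r(x, y) = r_phi(x, y) - beta * (log pi_theta(y|x) - log pi_ref(y|x))
    return rm_score - beta * (policy_logprob - ref_logprob)

# Toy usage with made-up values for one generated response.
reward = kl_shaped_reward(torch.tensor(1.8),
                          torch.tensor(-42.0),
                          torch.tensor(-40.0))
print(reward)  # 1.8 - 0.1 * (-2.0) = 2.0
```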

Direct Preference Optimization

Despite the effectiveness of the PPO-based RLHF pipeline, it has a significant drawback: it requires training a completely separate model, the RM, leading to high costs and the need for large amounts of additional data. Direct Preference Optimization (DPO) eliminates the RM when aligning LLMs, which reduces the costs associated with data generation and resource utilization. DPO simplifies the training process by working directly from a dataset of human preference pairs, each consisting of a prompt and two options: one preferred and one dispreferred. The LLM is then fine-tuned to increase the likelihood of the preferred responses while decreasing the likelihood of the dispreferred ones. This approach improves output quality based directly on observed human choices.

The key idea in DPO is to start from the same KL-constrained optimization and derive the optimal policy that maximizes the KL-constrained reward in closed form [3]:

$$\begin{aligned}
& \max_{\pi} \mathbb{E}_{x \sim D,\, y \sim \pi(y \mid x)} \left[ r(x, y) \right] - \beta\, D_{\text{KL}} \left[ \pi(y \mid x) \,\|\, \pi_{\text{ref}}(y \mid x) \right] \\
= \, & \max_{\pi} \mathbb{E}_{x \sim D,\, y \sim \pi(y \mid x)} \left[ r(x, y) - \beta \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \right] \\
= \, & \min_{\pi} \mathbb{E}_{x \sim D,\, y \sim \pi(y \mid x)} \left[ \log \frac{\pi(y \mid x)}{\pi_{\text{ref}}(y \mid x)} - \frac{1}{\beta} r(x, y) \right] \\
= \, & \min_{\pi} \mathbb{E}_{x \sim D,\, y \sim \pi(y \mid x)} \left[ \log \frac{\pi(y \mid x)}{\frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp\left( \frac{1}{\beta} r(x, y) \right)} - \log Z(x) \right]
\end{aligned}$$

where $Z(x)$ is defined as:

$$Z(x) = \sum_{y} \pi_{\text{ref}}(y \mid x) \exp\left( \frac{1}{\beta} r(x, y) \right)$$
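Since $Z(x)$ depends only on $x$ and $\pi_{\text{ref}}$, not on the policy $\pi$ being optimized, the last line of the derivation can be read (following the standard argument in the DPO paper) as minimizing a KL divergence plus a term that is constant in $\pi$:

$$\min_{\pi} \; \mathbb{E}_{x \sim D} \left[ D_{\text{KL}}\left( \pi(y \mid x) \,\|\, \pi_r(y \mid x) \right) - \log Z(x) \right]$$

Because the KL divergence is non-negative and equals zero exactly when the two distributions coincide, the minimum is attained by setting $\pi$ equal to the distribution $\pi_r$ defined below.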

The most crucial aspect to note is that we obtain a policy $\pi_r$ from which we can easily derive the reward function $r$:

$$\pi_r(y \mid x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y \mid x) \exp\left( \frac{1}{\beta} r(x, y) \right)$$

Taking the logarithm of both sides and rearranging, we can immediately express $r$ as follows:

$$r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)$$

Returning to the equation for the preference distribution $p^*$, we can rewrite it such that each instance of the reward is replaced by the expression above (the $\beta \log Z(x)$ terms cancel):

$$p^*(y_1 \succ y_2 \mid x) = \frac{1}{1 + \exp\left(\beta \log \frac{\pi_r(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)} - \beta \log \frac{\pi_r(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)}\right)}$$

This equation does not require a reward model to optimize the policy according to the distribution of human preferences; instead, we can work directly on the policy itself. Recognizing that $\frac{1}{1 + e^{-z}} = \sigma(z)$, we can finally express the loss function as the negative log-likelihood of the preferences:

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$

At this stage, we have an expression that compares the probabilities assigned by the reference policy $\pi_{\text{ref}}$ and the new policy $\pi_\theta$ to a chosen response $y_w$ and a rejected response $y_l$. Minimizing the loss pushes the policy to favor $y_w$ over $y_l$, improving its ability to produce preferred responses rather than dispreferred ones.
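Here is a minimal PyTorch sketch of this loss. It assumes the summed log-probabilities of the chosen and rejected responses under both the policy and the frozen reference model have already been computed (roughly how libraries such as TRL structure the calculation); the function name, the `beta` default, and the toy numbers are illustrative assumptions.

```python
# DPO loss (minimal sketch). Each tensor holds summed log-probabilities of
# whole responses for a batch of preference pairs.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # beta * [ log pi_theta(y_w|x)/pi_ref(y_w|x) - log pi_theta(y_l|x)/pi_ref(y_l|x) ]
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratio - rejected_logratio)
    # L_DPO = -E[ log sigma(logits) ]
    return -F.logsigmoid(logits).mean()

# Toy usage with made-up log-probabilities for one preference pair.
loss = dpo_loss(torch.tensor([-40.0]), torch.tensor([-55.0]),
                torch.tensor([-42.0]), torch.tensor([-50.0]))
print(loss)  # -logsigmoid(0.7) ~= 0.40
```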

In conclusion, DPO offers several advantages over RLHF:

  • Elimination of the Reward Model: By removing the need for a separate reward model, DPO relies on high-quality data to effectively differentiate between favorable and unfavorable responses. This simplification conserves valuable time and resources.

  • Swift Adaptation: DPO facilitates quick adaptation to new preference data, avoiding the need to retrain a separate reward model.

  • Dual Focus on Responses: Additionally, DPO enables the model to learn not only which responses are desirable but also to recognize and steer clear of undesirable ones. This dual focus enhances the model's ability to refine its interactions.

Overall, DPO ultimately results in improved performance in generating contextually relevant and appropriate responses, making it a robust approach for optimizing language model behavior.

References

1. Lambert, N., Castricato, L., von Werra, L., & Havrilla, A. (2022). Illustrating Reinforcement Learning from Human Feedback (RLHF). Hugging Face Blog. https://huggingface.co/blog/rlhf

2. Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., & Irving, G. (2020). Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593. https://arxiv.org/abs/1909.08593

3. Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 53728-53741.