Reinforcement learning with human feedback (RLHF) is the standard method for aligning large language models (LLMs) with human preferences, such as preferences for nontoxic language and factually accurate responses. Recently, one of the most popular RLHF methods has been direct preference optimization, in which the LLM is trained directly on pairs of outputs, one of which human annotators have labeled as preferred.
With direct preference optimization (DPO), however — and with other, similar direct-alignment algorithms — LLMs run the risk of learning spurious correlations from the data. In toxicity datasets, for instance, it’s common for the serious, thoughtful responses to be longer than the offensive responses. During RLHF, an LLM could thus learn to prefer longer responses to shorter ones, which may not be preferable in general.
At this year’s International Conference on Learning Representations (ICLR), we presented a method for limiting such spurious correlations, which we call SeRA, for self-reviewing and alignment. After an initial round of RLHF on human-annotated data, we use the LLM itself to generate additional training examples. We then use the LLM’s output probabilities to assess the strength of preference for each training pair, keeping only those pairs in which the preferred response is strongly preferred.
To evaluate our approach, we compare a model trained using SeRA to three baseline models on four benchmark datasets. For each test input, we compare our model’s output to that of each baseline and use an off-the-shelf LLM to choose the better response. The SeRA-trained model’s win rate in these pairwise comparisons is higher than those of all three baselines across the board, sometimes by 20% to 40%.
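As a concrete illustration of this evaluation protocol (our own sketch, not code from the paper), the snippet below computes a pairwise win rate. The model objects and the judge_prefers helper, which wraps the off-the-shelf judge LLM, are hypothetical stand-ins.

```python
# Illustrative sketch of pairwise win-rate evaluation; model_a, model_b, and
# judge_prefers are hypothetical stand-ins for the SeRA-trained model, a
# baseline model, and the off-the-shelf judge LLM, respectively.
def win_rate(test_prompts, model_a, model_b, judge_prefers):
    wins = 0
    for prompt in test_prompts:
        response_a = model_a.generate(prompt)  # SeRA-trained model's response
        response_b = model_b.generate(prompt)  # baseline model's response
        # The judge returns "a" or "b" for whichever response it prefers.
        if judge_prefers(prompt, response_a, response_b) == "a":
            wins += 1
    return wins / len(test_prompts)
```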
Direct preference optimization
Reinforcement learning is a trial-and-error method in which an agent interacts with the world and, depending on the actions it takes, receives greater or lesser rewards. Over time, the agent attempts to learn a policy that maximizes its cumulative reward.
In classical reinforcement learning, the interaction with the world can be literal: a robot, for instance, might receive a large reward for successfully navigating to a prescribed location and a negative reward for bumping into a wall. In RLHF, however, the reward depends on how well an LLM’s output aligns with a paradigm case specified by a human.
With traditional RLHF, the reward is calculated by a separate reward model, which is itself trained on human-annotated data. But this is a time-consuming approach that doesn’t scale well. With DPO, there’s no need for a second model: the LLM is rewarded directly for assigning higher probability to the human-preferred output than to the rejected one.
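To make that concrete, here is a minimal, illustrative PyTorch-style sketch of the standard DPO objective (our own rendering, not the paper’s implementation). The loss is a logistic loss on the difference of implicit rewards, where each implicit reward is the scaled log-probability ratio between the policy being trained and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit rewards: beta * log( pi_theta(y | x) / pi_ref(y | x) )
    reward_w = beta * (policy_logp_w - ref_logp_w)  # preferred response y_w
    reward_l = beta * (policy_logp_l - ref_logp_l)  # dispreferred response y_l
    # Logistic loss on the reward margin; every pair contributes a loss of the
    # same form, regardless of how strong the underlying human preference is.
    return -F.logsigmoid(reward_w - reward_l).mean()

# Toy example: a pair where the policy already slightly favors the preferred response.
loss = dpo_loss(torch.tensor(-12.0), torch.tensor(-15.0),
                torch.tensor(-13.0), torch.tensor(-14.0))
```

Note that the loss has the same functional form for every training pair, which is exactly the issue discussed next.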
The drawback of DPO is that it treats all training pairs equally: the reward is the same whether the preferred output is strongly preferred or only mildly preferred. This increases the chances that the model will learn spurious correlations.
If, for instance, choosing strongly toxic responses incurred a greater penalty than choosing mildly toxic responses, the model could infer that toxicity — and not response length — was the relevant feature of the training examples. DPO irons out those differences; SeRA reintroduces them.
SeRA
With SeRA, we first perform conventional DPO, using a dataset of human-annotated example pairs. After this first pass through the data, the LLM has learned something about the types of outputs that humans prefer.
We then use the updated model to generate a new set of training examples. For every generated response pair, we assign each response a preference score, which is based on the updated model’s probability of generating that response. We then keep only those pairs in which the preferred response scores significantly higher than the non-preferred response.
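Below is a minimal sketch of this filtering step, assuming the preference score is the implicit reward from the DPO sketch above (beta times the log-probability ratio between the updated model and the reference model). The logp helpers and the threshold value are our own illustrative choices, not the paper’s.

```python
# Illustrative filtering of response pairs by implicit-reward margin.
# policy_logp and ref_logp are hypothetical helpers returning the summed
# log-probability of a response given a prompt.
def filter_pairs(pairs, policy_logp, ref_logp, beta=0.1, margin_threshold=1.0):
    kept = []
    for prompt, chosen, rejected in pairs:
        r_chosen = beta * (policy_logp(prompt, chosen) - ref_logp(prompt, chosen))
        r_rejected = beta * (policy_logp(prompt, rejected) - ref_logp(prompt, rejected))
        # Keep the pair only if the preferred response is strongly preferred.
        if r_chosen - r_rejected > margin_threshold:
            kept.append((prompt, chosen, rejected))
    return kept
```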

Using the same metric, we next filter the data in the original, human-annotated dataset. Then we combine filtered samples from the original dataset with filtered samples from our new, generated dataset and perform DPO once again. This process repeats, with the generated samples constituting a larger and larger fraction of the dataset, until model performance converges.
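Putting the pieces together, the overall procedure looks roughly like the outline below. This is our schematic reading of the description above, not the paper’s code: generate_pairs, mix, dpo_update, and converged are placeholder functions, the mixing schedule is illustrative, and models are assumed to expose a logp(prompt, response) helper so the filter_pairs sketch above can be reused.

```python
# Schematic outline of the iterative SeRA procedure (placeholder functions).
def sera_training(model, ref_model, human_pairs, prompts,
                  mix_schedule=(0.25, 0.5, 0.75)):
    for gen_fraction in mix_schedule:
        # 1. Self-generate candidate response pairs with the current model.
        generated_pairs = generate_pairs(model, prompts)
        # 2. Filter both generated and human-annotated pairs by implicit-reward
        #    margin, keeping only strongly preferred pairs.
        gen_kept = filter_pairs(generated_pairs, model.logp, ref_model.logp)
        human_kept = filter_pairs(human_pairs, model.logp, ref_model.logp)
        # 3. Combine the two sources, with generated data taking a growing share.
        train_set = mix(human_kept, gen_kept, gen_fraction)
        # 4. Run another round of DPO on the combined set.
        model = dpo_update(model, ref_model, train_set)
        if converged(model):
            break
    return model
```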
The intuition here is that if a dataset is designed to represent some contrast, but it also contains spurious correlations, then the intended contrast — between, say, toxic and non-toxic data — will be significantly greater than the unintended contrast — between, say, long and short responses.
This assumption held for the four benchmark datasets we used to evaluate our method, and we think that it’s a plausible assumption for other spurious correlations. But there could be instances in which it doesn’t hold, so in applications of the SeRA method, the model’s convergence behavior should be monitored.
While we used DPO in our experiments, our paper also demonstrates how to generalize our method to other direct-alignment algorithms. Finally, there’s some risk that, when using model-generated data to train a model, we could get into a feedback loop in which the model overamplifies some aspect of the initial dataset. To guard against this, in each pass through the data, the model’s reward is based not only on the current iteration but on past iterations as well, ensuring continuity in the characteristic features of the training data.
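One way to realize that continuity, sketched below under our own assumptions rather than as the paper’s exact mechanism, is to average the implicit reward over the current and past model checkpoints instead of computing it from the current model alone; the checkpoint interface is hypothetical.

```python
# Illustrative ensemble of implicit rewards across current and past checkpoints.
# checkpoint_logps is a list of hypothetical logp(prompt, response) helpers,
# one per model iteration, with the current model included.
def ensemble_implicit_reward(prompt, response, checkpoint_logps, ref_logp, beta=0.1):
    rewards = [
        beta * (logp(prompt, response) - ref_logp(prompt, response))
        for logp in checkpoint_logps
    ]
    return sum(rewards) / len(rewards)
```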
Acknowledgments: Sravan Bodapati