Most alignment methods for language models can be described with a small set of ranking ideas. We assign scores to candidates, compare those scores, and update the model so that preferred candidates receive higher scores. This is exactly the problem studied in learning-to-rank (LTR). The modern LLM alignment literature often uses different language, such as reward modeling, RLHF, DPO, or preference optimization, but much of the machinery is inherited from LTR.

In this post, we will make this connection explicit. We will start from classical LTR losses, reinterpret reward modeling as an LTR problem, and then show how DPO-style direct alignment falls out by substituting the implicit reward induced by a policy into those ranking losses. The main takeaway is the following:

Direct alignment is often just learning-to-rank after the DPO reward reparameterization.

This view unifies a large family of direct alignment methods, including DPO and its variants. The post mostly follows the analysis in the paper that introduced the LiPO- $\lambda$ method and its connection to LambdaRank ¹⁹. The main purpose of this post is to expand on that analysis and place it in the broader context of learning-to-rank and reward modeling.

Learning-to-Rank (LTR)

Suppose we have a dataset

\mathcal{D} = \bigl\{(y_i, s_i)\bigr\}_{i=1}^{N},

where each item $y_i$ is associated with a real-valued label $s_i$ . The label indicates some predefined preference, relevance, quality, or utility. We write

y_i \succ y_j \quad \Longleftrightarrow \quad s_i > s_j,

meaning that $y_i$ should be ranked above $y_j$ . In many practical settings, the absolute scores are not observed, and we only know relative preferences between items.

A learning-to-rank method learns a scoring function $r_\psi$ that assigns

\hat{s}_i \coloneqq r_\psi(y_i)

to each item, so that sorting by $\hat{s}_i$ recovers the true preference order induced by $s_i$ . A generic LTR objective can be written as

\operatorname*{minimize}_\psi \;\; \mathbb{E}_{\{(y_i,s_i)\}_{i=1}^{K} \sim \mathcal{D}} \Bigl[ \ell(s_1, \ldots, s_K, \hat{s}_1, \ldots, \hat{s}_K) \Bigr].

The choices of $K$ and $\ell$ determine whether the method is pointwise, pairwise, or listwise.

Pointwise LTR

In pointwise LTR, we set $K=1$ . The model is trained to predict the score of each item independently. If the labels are real-valued, we can use mean squared error:

\ell_{\textsf{MSE}}(s,\hat{s}) = (s - \hat{s})^2.

If the labels are binary, we can use binary cross-entropy:

\ell_{\textsf{BCE}}(s,\hat{s}) = -s\log\sigma(\hat{s}) - (1-s)\log(1-\sigma(\hat{s})).

Pointwise learning is natural when the labels are actually meaningful as scalar targets. However, if we only have preferences, pointwise learning forces us to invent absolute labels. This is often an unnecessary burden, because rankings are invariant to many transformations of the score scale.

Pairwise LTR

In pairwise LTR, we set $K=2$ . The loss only asks that a preferred item receive a larger predicted score than a less preferred item.

A classic example is RankSVM ¹. For a preferred pair $y_1 \succ y_2$ , the hinge loss is

\ell_{\textsf{RankSVM}}(s_1,s_2,\hat{s}_1,\hat{s}_2) = \mathbb{I}[s_1 > s_2]\max(0, 1-(\hat{s}_1-\hat{s}_2)).

The loss is zero once the preferred item is ahead by margin at least $1$ .

Another classic example is RankNet ²:

\ell_{\textsf{RankNet}}(s_1,s_2,\hat{s}_1,\hat{s}_2) = -\mathbb{I}[s_1 > s_2]\log\sigma(\hat{s}_1-\hat{s}_2).

This is a smooth logistic loss over the score difference. Instead of demanding a hard margin, it keeps increasing the probability that the preferred item receives a higher score.

The key feature of pairwise losses is their dependence on score differences. This will matter later because prompt-level normalizers cancel inside differences.
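
As a minimal sketch of the two pairwise losses, the snippet below evaluates them on hand-picked scores, taking the logistic term as $-\log\sigma$ so that it is a quantity to minimize. The function names and the example numbers are ours, chosen only for illustration.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def ranksvm_loss(s1: float, s2: float, s1_hat: float, s2_hat: float) -> float:
    """Hinge loss on the predicted score difference, active when s1 > s2."""
    if s1 <= s2:
        return 0.0
    return max(0.0, 1.0 - (s1_hat - s2_hat))


def ranknet_loss(s1: float, s2: float, s1_hat: float, s2_hat: float) -> float:
    """Logistic loss on the predicted score difference, active when s1 > s2."""
    if s1 <= s2:
        return 0.0
    return -log_sigmoid(s1_hat - s2_hat)


# Item 1 is preferred (label 1.0 > 0.0) and leads by 1.5 in predicted score.
hinge = ranksvm_loss(1.0, 0.0, 2.0, 0.5)      # zero: the margin of 1 is met
logistic = ranknet_loss(1.0, 0.0, 2.0, 0.5)   # small but still positive

# Shifting both predictions by the same constant leaves the loss unchanged.
shifted = ranknet_loss(1.0, 0.0, 12.0, 10.5)
print(hinge, logistic, shifted)
```

The hinge loss stops pushing once the margin is met, while the logistic loss keeps a small positive value. Both are unchanged by a common shift of the two predictions.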

Listwise LTR

In listwise LTR, we set $K>2$ and train on a whole ranked list at once. This is often a better match to real preference data, because annotators or reward models may rank several candidate responses to the same prompt.

One listwise method is ListNet ³. It converts both the ground-truth scores and the predicted scores into top-one softmax distributions:

P_s(i) = \frac{\exp(s_i)}{\sum_{j=1}^{K}\exp(s_j)}, \qquad P_{\hat{s}}(i) = \frac{\exp(\hat{s}_i)}{\sum_{j=1}^{K}\exp(\hat{s}_j)}.

The top-one ListNet loss is then the cross-entropy between these distributions:

\ell_{\textsf{ListNet}}(s_1,\ldots,s_K,\hat{s}_1,\ldots,\hat{s}_K) = -\sum_{i=1}^{K} P_s(i)\log P_{\hat{s}}(i).
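
The top-one ListNet loss can be sketched in a few lines of plain Python. The helper names and the toy labels are our own assumptions, and the softmax is computed in log space for numerical stability.

```python
import math


def log_softmax(scores):
    """Log of the softmax distribution, computed with the max-shift trick."""
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(v - m) for v in scores))
    return [v - log_norm for v in scores]


def listnet_loss(s, s_hat):
    """Cross-entropy between top-one distributions of labels and predictions."""
    p_true = [math.exp(lp) for lp in log_softmax(s)]
    log_p_pred = log_softmax(s_hat)
    return -sum(p * lq for p, lq in zip(p_true, log_p_pred))


labels = [2.0, 1.0, 0.0]
aligned = listnet_loss(labels, [2.0, 1.0, 0.0])    # predictions match the labels
reversed_ = listnet_loss(labels, [0.0, 1.0, 2.0])  # predictions invert the order
print(aligned, reversed_)
```

When the predictions equal the labels, the loss reduces to the entropy of $P_s$ , which is its minimum. Note that the target side uses the label magnitudes, not just their order.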

Another listwise method is ListMLE ⁴. Let $\tau_s$ be the permutation that sorts the candidates by the ground-truth labels:

s_{\tau_s(1)} > s_{\tau_s(2)} > \cdots > s_{\tau_s(K)}.

ListMLE models the probability of observing this full ranking as a sequential choice process:

\Pr\left(y_{\tau_s(1)} \succ \cdots \succ y_{\tau_s(K)}\right) = \prod_{i=1}^{K} \frac{\exp(\hat{s}_{\tau_s(i)})}{\sum_{j=i}^{K}\exp(\hat{s}_{\tau_s(j)})}.

Then, the loss is the negative log-likelihood:

\ell_{\textsf{ListMLE}}(s_1,\ldots,s_K,\hat{s}_1,\ldots,\hat{s}_K) = -\log\Pr\left(y_{\tau_s(1)} \succ \cdots \succ y_{\tau_s(K)}\right) = -\sum_{i=1}^{K} \log\left(\frac{\exp(\hat{s}_{\tau_s(i)})}{\sum_{j=i}^{K}\exp(\hat{s}_{\tau_s(j)})}\right).
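
A minimal sketch of ListMLE, assuming the labels are strictly ordered (ties would be broken by input order here, which is our choice and not part of the definition). The function names and example scores are illustrative.

```python
import math


def logsumexp(values):
    """Stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def listmle_loss(s, s_hat):
    """Negative Plackett-Luce log-likelihood of the label-induced ranking."""
    # tau_s: item indices sorted from most to least preferred by label.
    order = sorted(range(len(s)), key=lambda i: s[i], reverse=True)
    ranked = [s_hat[i] for i in order]
    # Stage i picks the top item among those still remaining (positions i..K).
    return sum(logsumexp(ranked[i:]) - ranked[i] for i in range(len(ranked)))


labels = [0.0, 2.0, 1.0]                        # true order: item 1, item 2, item 0
good = listmle_loss(labels, [-1.0, 3.0, 1.0])   # predicted order agrees
bad = listmle_loss(labels, [3.0, -1.0, 1.0])    # predicted order puts item 0 first
print(good, bad)
```

Only the order of the labels enters the loss. With two items it coincides with the pairwise logistic loss on the score difference.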

Finally, LambdaRank ⁵ adds metric awareness. Instead of treating all preference inversions equally, it weights pairwise logistic losses by the change in a ranking metric such as NDCG.

\ell_{\lambda}(s_1,\ldots,s_K,\hat{s}_1,\ldots,\hat{s}_K) = -\sum_{\substack{w,\ell \ : \ s_w > s_\ell}} \Bigl[\!\!\Bigl[\bigl|\Delta \mathrm{NDCG}_{w\ell}\bigr|\Bigr]\!\!\Bigr] \log\sigma(\hat{s}_w-\hat{s}_\ell).

Here $\left[\!\left[\cdot\right]\!\right]$ denotes stop-gradient, meaning that the gradient of the NDCG weight is not backpropagated through the network. The difference in NDCG is computed by:

\mathrm{DCG}(\tau) = \sum_{i=1}^{K} \frac{2^{s_i}-1}{\log_2(1 + \tau(i))}, \qquad \Delta\mathrm{NDCG}_{w\ell} = \frac{\mathrm{DCG}(\tau_{\hat{s}:w,\ell}) - \mathrm{DCG}(\tau_{\hat{s}})}{\mathrm{DCG}(\tau_s)},

where $\tau_{s}$ and $\tau_{\hat{s}}$ are permutations that sort the candidates by the ground-truth and predicted scores, respectively, so that $\tau(i)$ is the rank position of item $i$ . The notation $\tau_{\hat{s}:w,\ell}$ means the same permutation as $\tau_{\hat{s}}$ except that items $w$ and $\ell$ are swapped. The NDCG weight is therefore the absolute change in NDCG that would result from swapping $w$ and $\ell$ in the current predicted ranking, normalized by the optimal DCG. This encourages the model to focus on getting the most important pairs right.
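
The sketch below computes the NDCG swap weights and the resulting weighted pairwise loss, taking the pairwise term as $-\log\sigma$ so that the sum is a loss to minimize. It assumes non-negative labels so that the ideal DCG is positive. Since there is no autodiff here, the stop-gradient is implicit: the weight is just a number. All names are ours.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def ranks_from_scores(scores):
    """1-based rank position of each item when sorted by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, item in enumerate(order, start=1):
        ranks[item] = position
    return ranks


def dcg(s, ranks):
    """DCG with gain 2^s - 1 and discount log2(1 + rank)."""
    return sum((2.0 ** s_i - 1.0) / math.log2(1.0 + r) for s_i, r in zip(s, ranks))


def lambdarank_loss(s, s_hat):
    """Pairwise logistic loss weighted by |delta NDCG| of swapping each pair."""
    pred_ranks = ranks_from_scores(s_hat)
    ideal = dcg(s, ranks_from_scores(s))   # DCG of the ground-truth ordering
    base = dcg(s, pred_ranks)              # DCG of the current predicted ordering
    loss = 0.0
    for win in range(len(s)):
        for lose in range(len(s)):
            if s[win] > s[lose]:
                swapped = list(pred_ranks)
                swapped[win], swapped[lose] = swapped[lose], swapped[win]
                weight = abs(dcg(s, swapped) - base) / ideal
                loss += -weight * log_sigmoid(s_hat[win] - s_hat[lose])
    return loss


labels = [3.0, 1.0, 0.0]
good = lambdarank_loss(labels, [2.0, 1.0, 0.0])   # predicted order is correct
bad = lambdarank_loss(labels, [0.0, 1.0, 2.0])    # predicted order is reversed
print(good, bad)
```

In this example the pair involving the highest-gain item receives the largest weight, because moving it between the top and bottom positions changes DCG the most.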

Reward Modeling as Learning-to-Rank

Reward modeling is usually motivated by the agent alignment problem, where the goal is to create agents that behave according to user intentions. The approach separates learning what to optimize from learning how to optimize it. First learn a reward model from user feedback, then optimize the agent against that learned reward ⁶.

When user intentions are expressed as preferences over trajectories or outputs, reward modeling becomes an LTR problem. The candidate item $y_i$ may be a trajectory, a completion, or any other behavior. The reward model $r_\psi(y_i)$ is the scoring function. The human preference labels define the target ranking. Other feedback types, such as demonstrations, ratings, corrections, or natural language feedback, also exist, but pairwise and listwise preferences are the case we care about here.

There is one conceptual difference. Standard LTR mostly cares about recovering the ordering. Reward modeling often wants scores on a cardinal scale, because downstream RL uses reward magnitudes. Even so, the training objectives are essentially the same ranking objectives.

Bradley-Terry is RankNet

The most common reward modeling objective is the Bradley-Terry model ⁷. This is a probabilistic model for pairwise comparisons. Given two items $y_w$ and $y_\ell$ , it defines the preference probability as

\Pr(y_w \succ y_\ell) = \frac{\exp(r_\psi(y_w))}{\exp(r_\psi(y_w)) + \exp(r_\psi(y_\ell))} = \sigma(r_\psi(y_w)-r_\psi(y_\ell)).

If the dataset contains pairs where $y_w \succ y_\ell$ , maximum likelihood minimizes

\mathcal{L}_{\textsf{BT}}(\psi) = \mathbb{E}_{\begin{subarray}{l}(y_w,y_\ell) \sim \mathcal{D} \\ \textsf{s.t.} \ y_w \succ y_\ell\end{subarray}} \Bigl[-\log \Pr(y_w \succ y_\ell)\Bigr].

Note that this is exactly RankNet, just written as a reward-modeling likelihood.

Plackett-Luce is ListMLE

The standard listwise generalization of Bradley-Terry is the Plackett-Luce model ⁸,⁹. For a ranked list $y_1 \succ y_2 \succ \cdots \succ y_N$ , it defines the full-ranking probability as

\Pr(y_1 \succ \cdots \succ y_N) = \prod_{i=1}^{N} \frac{\exp(r_\psi(y_i))}{\sum_{j=i}^{N}\exp(r_\psi(y_j))}.

This can be interpreted as repeatedly choosing the best remaining item. First choose the best among all $N$ items. Then choose the best among the remaining $N-1$ items. Continue until the list is exhausted. When $N=2$ , this reduces to Bradley-Terry.

Maximum likelihood gives

\mathcal{L}_{\textsf{PL}}(\psi) = \mathbb{E}_{\begin{subarray}{l}(y_1,\ldots,y_N) \sim \mathcal{D} \\ \textsf{s.t.} \ y_1 \succ \cdots \succ y_N\end{subarray}} \Bigl[-\log \Pr(y_1 \succ \cdots \succ y_N)\Bigr],

which is exactly ListMLE.

Therefore, the standard reward modeling objectives are just LTR objectives in disguise. The reward model is the scoring function, the preference labels define the target ranking, and the loss functions are essentially the same.

Preference Alignment of LLMs

One popular application of reward modeling is preference alignment of LLMs. The goal is to train a language model to generate responses that are preferred by humans. The reward model is trained on human preferences over model-generated completions, and then the policy is optimized against that reward.

For LLM preference alignment, the items are completions conditioned on the same prompt. A typical dataset is

\mathcal{D} = \left\{ \left( x^{(k)}, \left\{\left(y_i^{(k)},s_i^{(k)}\right)\right\}_{i=1}^{N} \right) \right\}_{k=1}^{M},

where each prompt $x^{(k)}$ has $N$ sampled completions. Each prompt-completion pair $(x^{(k)}, y_i^{(k)})$ has a preference label $s_i^{(k)}$ . As before, we write

y_i^{(k)} \succ y_j^{(k)} \quad \Longleftrightarrow \quad s_i^{(k)} > s_j^{(k)}.

We usually do not compare completions from different prompts. The statement $y_i^{(k)} \succ y_j^{(\ell)}$ is undefined when $k \neq \ell$ . This mirrors learning-to-rank in search, where each prompt acts like a query and the responses are the documents attached to that query.

Reinforcement Learning from Human Feedback (RLHF)

The standard approach to LLM preference alignment is Reinforcement Learning from Human Feedback (RLHF) ¹⁰. In general, RLHF consists of two steps:

1. Train a reward model $r_\psi(x,y)$ on the preference dataset.
2. Optimize the policy using KL-regularized RL against that reward model.

As we observed above, reward modeling can be written as an LTR problem. The dataset contains prompts and ranked lists of completions, the reward model is the scoring function, and the loss is a ranking loss. Thus, the reward modeling step of RLHF can be written as

\operatorname*{minimize}_\psi \;\; \mathbb{E}_{(x,\{(y_i,s_i)\}_{i=1}^{K}) \sim \mathcal{D}} \Bigl[ \ell\bigl(s_1,\ldots,s_K,r_\psi(x,y_1),\ldots,r_\psi(x,y_K)\bigr) \Bigr].

For example, for pairwise preferences, choosing $\ell$ as RankNet gives the standard Bradley-Terry reward model. For ranked lists, choosing $\ell$ as ListMLE gives a Plackett-Luce reward model. This is the first half of the connection.

The second half comes from the KL-regularized RL objective. Given an optimized reward model $r_\psi$ , standard RLHF solves

\operatorname*{maximize}_\theta \;\; \mathbb{E}_{\begin{subarray}{l}x \sim \mathcal{D} \\ y \sim \pi_\theta(\cdot \mid x)\end{subarray}} \left[ r_\psi(x,y) - \beta \log\frac{\pi_\theta(y \mid x)}{\pi_{\textsf{ref}}(y \mid x)} \right].

Standard RLHF solves this objective using an RL algorithm such as PPO. However, we may also view this as a standalone optimization problem. The closed-form optimal policy for this objective is well known:

\pi_{r_\psi}(y \mid x) = \frac{1}{Z(x)}\pi_{\textsf{ref}}(y \mid x) \exp\left(\frac{1}{\beta}r_\psi(x,y)\right),

where

Z(x) = \sum_{y \in \mathcal{Y}} \pi_{\textsf{ref}}(y \mid x) \exp\left(\frac{1}{\beta}r_\psi(x,y)\right).

Then we may invert this equation. The reward that makes an arbitrary policy $\pi_\theta$ optimal is

r_{\pi_\theta}(x,y) = \beta \log\frac{\pi_\theta(y \mid x)}{\pi_{\textsf{ref}}(y \mid x)} + \beta\log Z(x).

This is the key DPO reparameterization ¹¹. The expression is conceptually useful but computationally awkward: the normalizer $Z(x)$ is a sum over all possible completions, which is intractable to compute. Fortunately, in many cases the unknown $Z(x)$ cancels out or can be safely ignored. This is the key insight that leads to DPO-style direct alignment.
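
The closed form and its inversion can be checked numerically on a toy problem where the completion set is small enough to enumerate. Everything below (three completions, the reference probabilities, the rewards, and $\beta = 0.5$) is made up for illustration.

```python
import math


def optimal_policy(pi_ref, reward, beta):
    """Closed-form maximizer of the KL-regularized objective, with its Z(x)."""
    unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, reward)]
    z = sum(unnorm)
    return [u / z for u in unnorm], z


def objective(pi, pi_ref, reward, beta):
    """Expected reward minus beta * KL(pi || pi_ref) for a single prompt."""
    return sum(p * (r - beta * math.log(p / q))
               for p, q, r in zip(pi, pi_ref, reward) if p > 0)


beta = 0.5
pi_ref = [0.5, 0.3, 0.2]     # reference policy over three completions
reward = [1.0, 0.0, -1.0]    # hypothetical rewards

pi_star, z = optimal_policy(pi_ref, reward, beta)

# Inverting the closed form recovers the reward from the policy log-ratio.
recovered = [beta * math.log(p / q) + beta * math.log(z)
             for p, q in zip(pi_star, pi_ref)]
print(pi_star, recovered)
```

At the optimum the objective equals $\beta\log Z(x)$ , and any other policy, including the reference itself, scores lower.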

Direct Preference Optimization (DPO)

Direct Preference Optimization (DPO) builds on the above insight, collapsing the two steps of RLHF into a single step of direct optimization on the preference dataset.

Since we have a closed-form expression for the reward under which any given policy is optimal, we can substitute it back into the reward modeling objective of the first RLHF step:

r_\psi(x,y) \leftarrow r_{\pi_\theta}(x,y) = \beta \log\frac{\pi_\theta(y \mid x)}{\pi_{\textsf{ref}}(y \mid x)} + \beta\log Z(x).

If the unknown $Z(x)$ cancels, the reward-modeling objective becomes a direct policy optimization objective. No explicit reward model, no RL loop, just supervised optimization on preferences.

If we use the Bradley-Terry objective, we start with

\operatorname*{minimize}_\psi \;\; \mathbb{E}_{\begin{subarray}{l}(x, y_w,y_\ell) \sim \mathcal{D} \\ \textsf{s.t.} \ y_w \succ y_\ell\end{subarray}} \Bigl[ -\log\sigma\bigl(r_\psi(x,y_w)-r_\psi(x,y_\ell)\bigr) \Bigr],

where $y_w$ is preferred over $y_\ell$ . Substituting the implicit policy reward gives

r_{\pi_\theta}(x,y_w)-r_{\pi_\theta}(x,y_\ell) = \beta\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\textsf{ref}}(y_w \mid x)} - \beta\log\frac{\pi_\theta(y_\ell \mid x)}{\pi_{\textsf{ref}}(y_\ell \mid x)}.

The two $\beta\log Z(x)$ terms cancel because the pair shares the same prompt. This gives the standard Bradley-Terry DPO loss.

\mathcal{L}_{\textsf{BT-DPO}}(\theta) = \mathbb{E}_{\begin{subarray}{l}(x, y_w,y_\ell) \sim \mathcal{D} \\ \textsf{s.t.} \ y_w \succ y_\ell\end{subarray}} \left[ -\log\sigma\left( \beta\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\textsf{ref}}(y_w \mid x)} - \beta\log\frac{\pi_\theta(y_\ell \mid x)}{\pi_{\textsf{ref}}(y_\ell \mid x)} \right) \right].

DPO provides a very clean recipe for LLM preference alignment. More importantly, it achieves strong, competitive results while using a much simpler training loop than RLHF. This simplicity has contributed to its popularity and influence.
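
For a single preference pair, the Bradley-Terry DPO loss can be sketched as follows. We assume the inputs are summed sequence log-probabilities of each completion under the policy and the reference model. The default `beta=0.1` and the example numbers are arbitrary choices for illustration, not recommendations.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Bradley-Terry DPO loss for one (prompt, chosen, rejected) triple."""
    # Policy-implied rewards up to the shared beta * log Z(x), which cancels.
    implicit_w = beta * (logp_w - ref_logp_w)
    implicit_l = beta * (logp_l - ref_logp_l)
    return -log_sigmoid(implicit_w - implicit_l)


# Policy equals the reference: the margin is zero and the loss is log 2.
at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has raised the chosen completion and lowered the rejected one.
after = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(at_init, after)
```

The loss depends only on how the policy's log-ratios move relative to the reference, not on the absolute log-probabilities.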

Generalized DPO

Now that we have seen how DPO emerges from the LTR perspective, we can ask: what about other LTR losses? The same substitution can be applied to any LTR-style reward modeling objective.

\operatorname*{minimize}_\psi \;\; \mathbb{E}_{(x,\{(y_i,s_i)\}_{i=1}^{K}) \sim \mathcal{D}} \Bigl[ \ell\bigl(s_1,\ldots,s_K,r_\psi(x,y_1),\ldots,r_\psi(x,y_K)\bigr) \Bigr].

After substitution, we get

\operatorname*{minimize}_\theta \;\; \mathbb{E}_{(x,\{(y_i,s_i)\}_{i=1}^{K}) \sim \mathcal{D}} \Bigl[ \ell\bigl(s_1,\ldots,s_K,r_{\pi_\theta}(x,y_1),\ldots,r_{\pi_\theta}(x,y_K)\bigr) \Bigr].

Note that this is actionable only when the prompt-level normalizers induced by $r_{\pi_\theta}$ either cancel exactly or can be safely approximated. Pairwise difference losses are clean because $Z(x)$ cancels in every difference. ListMLE is also clean because every Plackett-Luce denominator is a softmax over candidates from the same prompt, and adding the same constant to all candidate scores changes nothing.

Pointwise losses are not clean, because they depend on absolute scores. The unknown $\beta\log Z(x)$ remains. ListNet-style objectives are also subtle in preference-only settings. The model-side softmax is translation-invariant, but the target top-one distribution requires meaningful cardinal scores or an approximation to the optimal reward distribution. This is where contrastive views such as InfoNCA become useful.
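
A quick numerical check of this distinction: we add a constant `c`, standing in for the unknown $\beta\log Z(x)$ , to every score of one prompt and compare each loss before and after. The scores, the constant, and the function names are arbitrary assumptions.

```python
import math


def softplus(z):
    """Stable log(1 + exp(z)); note that -log(sigmoid(d)) == softplus(-d)."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def pairwise_logistic(score_w, score_l):
    """RankNet / Bradley-Terry loss for a preferred and a dispreferred score."""
    return softplus(-(score_w - score_l))


def plackett_luce_nll(scores):
    """ListMLE loss; scores are listed from most to least preferred."""
    total = 0.0
    for i in range(len(scores)):
        rest = scores[i:]
        m = max(rest)
        total += m + math.log(sum(math.exp(v - m) for v in rest)) - scores[i]
    return total


def pointwise_mse(target, score):
    """Squared error against an absolute target."""
    return (target - score) ** 2


scores = [1.2, 0.4, -0.3]
c = 3.7                                  # stand-in for beta * log Z(x)
shifted = [v + c for v in scores]

pair_gap = abs(pairwise_logistic(shifted[0], shifted[1])
               - pairwise_logistic(scores[0], scores[1]))
list_gap = abs(plackett_luce_nll(shifted) - plackett_luce_nll(scores))
point_gap = abs(pointwise_mse(1.0, shifted[0]) - pointwise_mse(1.0, scores[0]))
print(pair_gap, list_gap, point_gap)
```

The first two gaps are zero up to floating-point error, while the pointwise gap is large, so the pointwise loss cannot be evaluated without knowing the normalizer.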

Therefore, beyond RankNet/Bradley-Terry, we can obtain a variety of direct alignment methods by applying the DPO reparameterization to different LTR losses. It is convenient to define the policy-implied reward score

\rho_\theta(x,y) \coloneqq \beta\log\frac{\pi_\theta(y \mid x)}{\pi_{\textsf{ref}}(y \mid x)}.

Then, in pairwise losses, the score differences in $r_{\pi_\theta}$ are just differences in $\rho_\theta$ . In listwise losses, softmaxes over $r_{\pi_\theta}$ are just softmaxes over $\rho_\theta$ .

Pairwise hinge via SLiC and RRHF

The RankSVM-style hinge loss becomes

\mathcal{L}_{\textsf{hinge}}(\theta) = \mathbb{E}_{\begin{subarray}{l}(x, y_w,y_\ell) \sim \mathcal{D} \\ \textsf{s.t.} \ y_w \succ y_\ell\end{subarray}} \left[ \max\Bigl(0, 1 - \bigl(\rho_\theta(x,y_w) - \rho_\theta(x,y_\ell)\bigr)\Bigr) \right].

This is closely related to SLiC and SLiC-HF ¹²,¹³, as well as RRHF ¹⁴, especially when sequence likelihoods are normalized appropriately for length ¹⁵. The basic idea is margin ranking over preferred and dispreferred completions.

ListMLE gives DPO with Plackett-Luce and PRO

For a ranked list

y_1 \succ y_2 \succ \cdots \succ y_K,

the Plackett-Luce/ListMLE direct objective becomes

\mathcal{L}_{\textsf{PL-DPO}}(\theta) = \mathbb{E}_{(x,y_1,\ldots,y_K) \sim \mathcal{D}} \left[ -\sum_{i=1}^{K} \log \frac{\exp(\rho_\theta(x,y_i))}{\sum_{j=i}^{K}\exp(\rho_\theta(x,y_j))} \right].

This is the listwise analogue of DPO. It corresponds to DPO with Plackett-Luce, and is also closely connected to Preference Ranking Optimization (PRO) ¹⁶. If preferences are given as listwise labels, pairwise DPO throws away most of a ranked list by reducing it to pairs. PL-DPO uses the whole ordering directly.
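
A minimal sketch of the Plackett-Luce direct objective for one prompt, assuming the candidates are already ordered from most to least preferred and that each log-probability is a summed sequence log-likelihood. The name `pl_dpo_loss`, the default `beta=0.1`, and the numbers are ours.

```python
import math


def logsumexp(values):
    """Stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def pl_dpo_loss(logps, ref_logps, beta=0.1):
    """Plackett-Luce direct loss; inputs are ordered from best to worst."""
    # Policy-implied scores; the shared beta * log Z(x) cancels in each softmax.
    rho = [beta * (p - q) for p, q in zip(logps, ref_logps)]
    return sum(logsumexp(rho[i:]) - rho[i] for i in range(len(rho)))


ref = [-12.0, -15.0, -11.0]
at_init = pl_dpo_loss(ref, ref)                      # all log-ratios are zero
improved = pl_dpo_loss([-10.0, -15.0, -14.0], ref)   # log-ratios fall down the list
print(at_init, improved)
```

With all log-ratios equal, the loss is $\log K!$ (here $\log 6$ ), the value for a uniformly random ordering. With $K=2$ it coincides with the pairwise Bradley-Terry DPO loss.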

ListNet gives InfoNCA

The ListNet-style top-one objective compares a target distribution over candidates with the model distribution

Q_\theta(i \mid x,y_1,\ldots,y_K) = \frac{\exp(\rho_\theta(x,y_i))}{\sum_{j=1}^{K}\exp(\rho_\theta(x,y_j))}.

Now, suppose we know the true reward function $r^\star(x,y)$ that induces the true preference distribution. Then, the optimal target distribution is the softmax over the true reward:

P^\star(i \mid x,y_1,\ldots,y_K) = \frac{\exp(r^\star(x,y_i))}{\sum_{j=1}^{K}\exp(r^\star(x,y_j))}.

Then the loss is

\mathcal{L}_{\textsf{InfoNCA}}(\theta) = -\mathbb{E}_{(x,y_1,\ldots,y_K) \sim \mathcal{D}} \left[ \sum_{i=1}^{K} P^\star(i \mid x,y_1,\ldots,y_K) \log Q_\theta(i \mid x,y_1,\ldots,y_K) \right].

This is exactly the ListNet form, and it corresponds to InfoNCA in the Noise Contrastive Alignment (NCA) framework ¹⁷. The subtle part is estimating or constructing the target distribution $P^\star$ from reward labels, preference labels, or samples. To do so, InfoNCA uses an approximation to the denominator of the optimal distribution,

\frac{1}{K}\sum_{j=1}^{K}\exp(r^\star(x,y_j)) \approx \sum_{y \in \mathcal{Y}}\pi_{\textsf{ref}}(y \mid x)\exp(r^\star(x,y)) = Z(x),

assuming that the candidate list is a representative sample from the full distribution. This is a common assumption in contrastive learning, and it is often good enough in practice. Then, writing $\mathcal{D}$ as a distribution over prompts sampled from an underlying distribution $p_0$ and candidate lists drawn i.i.d. from $\pi_{\textsf{ref}}$ , the loss becomes

\begin{aligned} \mathcal{L}_{\textsf{InfoNCA}}(\theta) &= -\mathbb{E}_{\begin{subarray}{l}x \sim p_0 \\ (y_1,\ldots,y_K) \sim \pi_{\textsf{ref}}(\cdot \mid x)\end{subarray}} \left[ \sum_{i=1}^{K} \frac{\exp(r^\star(x,y_i))}{\sum_{j=1}^{K}\exp(r^\star(x,y_j))} \log Q_\theta(i \mid x,y_1,\ldots,y_K) \right] \\ &\approx -\mathbb{E}_{\begin{subarray}{l}x \sim p_0 \\ (y_1,\ldots,y_K) \sim \pi_{\textsf{ref}}(\cdot \mid x)\end{subarray}} \left[ \frac{1}{K}\sum_{i=1}^{K} \frac{\exp(r^\star(x,y_i))}{Z(x)} \log Q_\theta(i \mid x,y_1,\ldots,y_K) \right]. \end{aligned}

Since the candidates are i.i.d. from $\pi_{\textsf{ref}}$ , the expectation over the candidate list is symmetric in $y_1,\ldots,y_K$ . Therefore, we can drop the average over $i$ and just pick one of the candidates to be the "positive" sample:

\mathcal{L}_{\textsf{InfoNCA}}(\theta) \approx -\mathbb{E}_{\begin{subarray}{l}x \sim p_0 \\ (y_1,\ldots,y_K) \sim \pi_{\textsf{ref}}(\cdot \mid x)\end{subarray}} \left[ \frac{\exp(r^\star(x,y_i))}{Z(x)} \log Q_\theta(i \mid x,y_1,\ldots,y_K) \right].

Now recall that the optimal policy and optimal reward are related by

\pi^\star(y \mid x) = \frac{1}{Z(x)}\pi_{\textsf{ref}}(y \mid x)\exp\left(\frac{r^\star(x,y)}{\beta}\right).

Assuming $\beta=1$ for simplicity, the objective can be rewritten as

\mathcal{L}_{\textsf{InfoNCA}}(\theta) \approx -\mathbb{E}_{\begin{subarray}{l}x \sim p_0 \\ (y_1,\ldots,y_K) \sim \pi_{\textsf{ref}}(\cdot \mid x)\end{subarray}} \left[ \frac{\pi^\star(y_i \mid x)}{\pi_{\textsf{ref}}(y_i \mid x)} \log Q_\theta(i \mid x,y_1,\ldots,y_K) \right].

Finally, treating the probability ratio as an importance weight, we can rewrite this expectation as if we had sampled a single positive item from the optimal policy $\pi^\star$ instead of $\pi_{\textsf{ref}}$ :

\mathcal{L}_{\textsf{InfoNCA}}(\theta) \approx -\mathbb{E}_{\begin{subarray}{l}x \sim p_0 \\ y_i \sim \pi^\star(\cdot \mid x) \\ (y_1,\ldots,y_{i-1},y_{i+1},\ldots,y_K) \sim \pi_{\textsf{ref}}(\cdot \mid x)\end{subarray}} \Bigl[\log Q_\theta(i \mid x,y_1,\ldots,y_K)\Bigr].

Therefore, instead of computing a softmax cross-entropy against an unknown target distribution, we can just use a standard InfoNCE ¹⁸-style contrastive loss. We draw one positive sample from the optimal policy $\pi^\star$ (or use expert completions as a proxy for it) and several negatives from the reference policy $\pi_{\textsf{ref}}$ . The loss is just the negative log-probability that the model assigns to the positive sample among the candidate list. In fact, this is the reason for the name InfoNCA; it is a noise-contrastive estimation of the optimal policy distribution, where the noise distribution is the reference policy.
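
Both forms of the objective can be sketched side by side: the soft-target form, which needs reward labels for every candidate, and the contrastive form, which only needs to know which candidate is the positive. This is an illustrative sketch under the $\beta=1$ assumption used above, not a reference implementation. The function names and numbers are ours.

```python
import math


def log_softmax(scores):
    """Log of the softmax distribution, computed with the max-shift trick."""
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(v - m) for v in scores))
    return [v - log_norm for v in scores]


def soft_target_loss(rewards, logps, ref_logps, beta=1.0):
    """ListNet form: softmax(rewards) as target, softmax(log-ratios) as model."""
    rho = [beta * (p - q) for p, q in zip(logps, ref_logps)]
    target = [math.exp(v) for v in log_softmax(rewards)]
    return -sum(t * lq for t, lq in zip(target, log_softmax(rho)))


def contrastive_loss(positive_index, logps, ref_logps, beta=1.0):
    """InfoNCE form: negative log-probability assigned to the positive."""
    rho = [beta * (p - q) for p, q in zip(logps, ref_logps)]
    return -log_softmax(rho)[positive_index]


logps = [-9.0, -12.0, -11.0, -14.0]       # policy log-probabilities
ref_logps = [-10.0, -12.0, -10.0, -13.0]  # reference log-probabilities

# Candidate 0 has the largest log-ratio, so treating it as the positive is cheap.
loss_pos0 = contrastive_loss(0, logps, ref_logps)
loss_pos3 = contrastive_loss(3, logps, ref_logps)
print(loss_pos0, loss_pos3)
```

When all rewards are equal, the soft targets are uniform and the soft-target loss is the average of the contrastive losses over every choice of positive, which is one way to see that the two forms belong to the same family.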

LambdaRank gives LiPO- $\lambda$

Finally, the LambdaRank-style objective becomes

\mathcal{L}_{\lambda\textsf{-DPO}}(\theta) = \mathbb{E}_{(x,y_1,\ldots,y_K) \sim \mathcal{D}} \left[ -\sum_{w, \ell \ : \ y_w \succ y_\ell} \Bigl[\!\!\Bigl[\bigl|\Delta\mathrm{NDCG}_{w\ell}\bigr|\Bigr]\!\!\Bigr] \log\sigma\bigl(\rho_\theta(x,y_w) - \rho_\theta(x,y_\ell)\bigr) \right].

This is the main idea behind LiPO- $\lambda$ ¹⁹. It uses pairwise logistic comparisons, but weights them by their listwise ranking impact. Unlike InfoNCA, however, this method requires the true rewards to compute the NDCG weights, and does not have a clean alternative formulation. Therefore, it is more of a theoretical proposal than a practical algorithm. Still, when the true rewards are available, it is a natural way to inject metric awareness into direct alignment, and one would expect it to have an advantage over methods that discard the true reward magnitudes.

Correspondence Table

The analogy can be summarized in a compact table.

LTR method	Reward modeling view	Direct alignment analogue	Main idea
RankSVM	Margin-based pairwise ranking	SLiC / RRHF with likelihood normalization	Preferred responses should beat dispreferred responses by a margin
RankNet	Bradley-Terry MLE	DPO with BT	Logistic pairwise preference classification
ListNet	Top-one softmax cross-entropy	InfoNCA	Match a target soft distribution over a candidate list
ListMLE	Plackett-Luce MLE	DPO with PL / PRO	Maximize likelihood of the full ranked list
LambdaRank	NDCG-weighted pairwise ranking	LiPO- $\lambda$	Weight pairwise errors by their listwise metric impact

This table summarizes the main point of the post. Many DPO-style methods are different ways of applying ranking losses directly to policies. The connection comes from KL-regularized RL, which lets us rewrite rewards in terms of policy log-ratios.

Conclusion

The path from LTR to direct alignment is straightforward:

1. Start with a ranking loss over item scores.
2. Interpret the score as a reward model.
3. Use the KL-regularized RLHF solution to express rewards through policy log-ratios.
4. Substitute that policy-implied reward into the ranking loss.
5. Keep the objectives where the unknown prompt normalizer cancels, or approximate it carefully.

This gives a clean map from classical ranking methods to modern alignment algorithms. The practical takeaway is simple. The training objective should match the feedback format. Pairwise feedback fits pairwise losses, while ranked-list feedback fits listwise losses. Turning a ranked list into pairs is convenient, but it can lose useful information about the full ordering. Since this issue has already been studied in learning-to-rank, LLM alignment can reuse those ideas instead of treating each new objective as unrelated.