Most alignment methods for language models can be described with a small set of ranking ideas. We assign scores to candidates, compare those scores, and update the model so that preferred candidates receive higher scores. This is exactly the problem studied in learning-to-rank (LTR). The modern LLM alignment literature often uses different language, such as reward modeling, RLHF, DPO, or preference optimization, but much of the machinery is inherited from LTR.
In this post, we will make this connection explicit. We will start from classical LTR losses, reinterpret reward modeling as an LTR problem, and then show how DPO-style direct alignment falls out by substituting the implicit reward induced by a policy into those ranking losses. The main takeaway is the following:
Direct alignment is often just learning-to-rank after the DPO reward reparameterization.
This view unifies a large family of direct alignment methods, including DPO and its variants. The post mostly follows the analysis in this paper, which introduced the LiPO-λ method and its connection to LambdaRank. The main purpose of this post is to expand on that analysis and place it in a broader context of learning-to-rank and reward modeling.
Learning-to-Rank (LTR)
Suppose we have a dataset
$$\mathcal{D} = \{(y_i, s_i)\}_{i=1}^{N},$$
where each item $y_i$ is associated with a real-valued label $s_i$. The label indicates some predefined preference, relevance, quality, or utility. We write
$$y_i \succ y_j \iff s_i > s_j,$$
meaning that $y_i$ should be ranked above $y_j$. In many practical settings, the absolute scores are not observed, and we only know relative preferences between items.
A learning-to-rank problem learns a scoring function $r_\psi$ that assigns
$$\hat{s}_i := r_\psi(y_i)$$
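to each item, so that sorting by $\hat{s}_i$ recovers the true preference order induced by $s_i$. A generic LTR objective can be written as
$$\mathcal{L}(\psi) = \mathbb{E}_{\{(y_i, s_i)\}_{i=1}^{K} \sim \mathcal{D}} \left[ \ell\big(s_{1:K}, \hat{s}_{1:K}\big) \right],$$
where $K$ is the number of items scored together and $\ell$ is a loss that compares the labels $s_{1:K}$ with the predictions $\hat{s}_{1:K}$.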
The choices of K and ℓ determine whether the method is pointwise, pairwise, or listwise.
Pointwise LTR
In pointwise LTR, we set K=1. The model is trained to predict the score of each item independently. If the labels are real-valued, we can use mean squared error:
$$\ell_{\text{MSE}}(s, \hat{s}) = (s - \hat{s})^2.$$
If the labels are binary, we can use binary cross-entropy:
$$\ell_{\text{BCE}}(s, \hat{s}) = -s \log \sigma(\hat{s}) - (1 - s) \log\big(1 - \sigma(\hat{s})\big).$$
Pointwise learning is natural when the labels are actually meaningful as scalar targets. However, if we only have preferences, pointwise learning forces us to invent absolute labels. This is often an unnecessary burden, because rankings are invariant to many transformations of the score scale.
Pairwise LTR
In pairwise LTR, we set K=2. The loss only asks that a preferred item receive a larger predicted score than a less preferred item.
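A classic example is RankSVM [1]. For a preferred pair $y_1 \succ y_2$, the hinge loss is
$$\ell_{\text{hinge}}(\hat{s}_1, \hat{s}_2) = \max\big(0,\, 1 - (\hat{s}_1 - \hat{s}_2)\big),$$
which is zero once the preferred item wins by a margin of at least one. RankNet replaces the hinge with
$$\ell_{\text{RankNet}}(\hat{s}_1, \hat{s}_2) = -\log \sigma(\hat{s}_1 - \hat{s}_2).$$
This is a smooth logistic loss over the score difference. Instead of demanding a hard margin, it keeps increasing the probability that the preferred item receives a higher score.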
The key feature of pairwise losses is their dependence on score differences. This will matter later because prompt-level normalizers cancel inside differences.
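As a minimal sketch of the two pairwise losses, the snippet below implements a hinge and a logistic loss on a score difference and checks that a common shift of both scores changes neither. The function names and the example scores are illustrative assumptions, not part of any library.

```python
import math

def hinge_pair_loss(s_w: float, s_l: float, margin: float = 1.0) -> float:
    """RankSVM-style hinge on the score difference of a preferred pair."""
    return max(0.0, margin - (s_w - s_l))

def ranknet_pair_loss(s_w: float, s_l: float) -> float:
    """RankNet logistic loss, -log sigmoid(s_w - s_l), in a stable form."""
    d = s_w - s_l
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

# Adding the same constant to both scores leaves the difference untouched,
# so both losses are unchanged.
shift = 100.0
base = ranknet_pair_loss(2.0, 1.0)
shifted = ranknet_pair_loss(2.0 + shift, 1.0 + shift)
print(base, shifted)
```

This shift invariance is the property that later lets a prompt-level constant drop out of pairwise objectives.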
Listwise LTR
In listwise LTR, we set K>2 and train on a whole ranked list at once. This is often a better match to real preference data, because annotators or reward models may rank several candidate responses to the same prompt.
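One listwise method is ListNet [3]. It converts both the ground-truth scores and the predicted scores into top-one softmax distributions,
$$P_s(i) = \frac{\exp(s_i)}{\sum_{j=1}^{K} \exp(s_j)}, \qquad P_{\hat{s}}(i) = \frac{\exp(\hat{s}_i)}{\sum_{j=1}^{K} \exp(\hat{s}_j)},$$
and minimizes the cross-entropy between them:
$$\ell_{\text{ListNet}}(s, \hat{s}) = -\sum_{i=1}^{K} P_s(i) \log P_{\hat{s}}(i).$$
Another listwise method is ListMLE, which maximizes the Plackett-Luce likelihood of the ground-truth ordering $y_1 \succ \cdots \succ y_K$ under the predicted scores:
$$\ell_{\text{ListMLE}}(\hat{s}) = -\sum_{k=1}^{K} \log \frac{\exp(\hat{s}_k)}{\sum_{j=k}^{K} \exp(\hat{s}_j)}.$$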
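Finally, LambdaRank [5] adds metric awareness. Instead of treating all preference inversions equally, it weights pairwise logistic losses by the change in a ranking metric such as NDCG:
$$\ell_{\text{LambdaRank}}(s, \hat{s}) = -\sum_{(w,\ell)\,:\,s_w > s_\ell} [\![\, |\Delta \mathrm{NDCG}_{w,\ell}| \,]\!] \, \log \sigma(\hat{s}_w - \hat{s}_\ell).$$
Here $[\![\cdot]\!]$ denotes stop-gradient, meaning that the gradient of the NDCG weight is not backpropagated through the network. The difference in NDCG is computed by
$$\Delta \mathrm{NDCG}_{w,\ell} = \frac{\mathrm{DCG}(\tau_{\hat{s}}) - \mathrm{DCG}(\tau_{\hat{s}:w,\ell})}{\mathrm{DCG}(\tau_s)}, \qquad \mathrm{DCG}(\tau) = \sum_{i=1}^{K} \frac{2^{s_i} - 1}{\log_2(1 + \tau(i))},$$
where $\tau_s$ and $\tau_{\hat{s}}$ are permutations that sort the candidates by the ground-truth and predicted scores, respectively, $\tau(i)$ is the position of item $i$ under $\tau$, and DCG is taken in its common exponential-gain form. The notation $\tau_{\hat{s}:w,\ell}$ means the same permutation as $\tau_{\hat{s}}$ except that items $w$ and $\ell$ are swapped. The NDCG weight is the absolute change in NDCG if the model gets the order of $w$ and $\ell$ wrong, normalized by the optimal DCG. This encourages the model to focus on getting the most important pairs right.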
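The weighting can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than a reference implementation: it assumes the exponential-gain, log2-discount form of DCG, non-negative labels with at least one positive value, and hypothetical helper names (`ranks_by`, `dcg`, `lambdarank_loss`). Without automatic differentiation, the weight is simply a constant, which plays the role of the stop-gradient.

```python
import math

def ranks_by(scores):
    """1-based rank of each item when sorting by score, highest first."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, item in enumerate(order, start=1):
        ranks[item] = position
    return ranks

def dcg(labels, ranks):
    """DCG with exponential gain and log2 position discount (assumed form)."""
    return sum((2.0 ** s - 1.0) / math.log2(1.0 + r) for s, r in zip(labels, ranks))

def lambdarank_loss(labels, scores):
    """Pairwise logistic loss weighted by |delta NDCG| of swapping each pair."""
    pred_ranks = ranks_by(scores)
    ideal = dcg(labels, ranks_by(labels))  # optimal DCG used for normalization
    current = dcg(labels, pred_ranks)
    total = 0.0
    for w in range(len(labels)):
        for l in range(len(labels)):
            if labels[w] <= labels[l]:
                continue  # only pairs where w is truly preferred over l
            swapped = list(pred_ranks)
            swapped[w], swapped[l] = swapped[l], swapped[w]
            weight = abs(current - dcg(labels, swapped)) / ideal  # constant weight
            d = scores[w] - scores[l]
            total += weight * (max(-d, 0.0) + math.log1p(math.exp(-abs(d))))
    return total

labels = [2, 1, 0]
print(lambdarank_loss(labels, [3.0, 2.0, 1.0]))  # scores agree with the labels
print(lambdarank_loss(labels, [1.0, 2.0, 3.0]))  # scores reverse the labels
```

Pairs that involve the top of the list receive larger weights, because swapping them moves DCG the most.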
Reward Modeling as Learning-to-Rank
Reward modeling is usually motivated by the agent alignment problem, where the goal is to create agents that behave according to user intentions. The approach separates learning what to optimize from learning how to optimize it: first learn a reward model from user feedback, then optimize the agent against that learned reward [6].
When user intentions are expressed as preferences over trajectories or outputs, reward modeling becomes an LTR problem. The candidate item yi may be a trajectory, a completion, or any other behavior. The reward model rψ(yi) is the scoring function. The human preference labels define the target ranking. Other feedback types, such as demonstrations, ratings, corrections, or natural language feedback, also exist, but pairwise and listwise preferences are the case we care about here.
There is one conceptual difference. Standard LTR mostly cares about recovering the ordering. Reward modeling often wants scores on a cardinal scale, because downstream RL uses reward magnitudes. Even so, the training objectives are essentially the same ranking objectives.
Bradley-Terry is RankNet
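The most common reward modeling objective is the Bradley-Terry model [7]. This is a probabilistic model for pairwise comparisons. Given two items $y_w$ and $y_\ell$, it defines the preference probability as
$$P(y_w \succ y_\ell) = \frac{\exp(r_\psi(y_w))}{\exp(r_\psi(y_w)) + \exp(r_\psi(y_\ell))} = \sigma\big(r_\psi(y_w) - r_\psi(y_\ell)\big).$$
Maximizing the likelihood of an observed preference means minimizing
$$\ell_{\text{BT}} = -\log \sigma\big(r_\psi(y_w) - r_\psi(y_\ell)\big).$$
Note that this is exactly RankNet, just written as a reward-modeling likelihood.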
Plackett-Luce is ListMLE
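The standard listwise generalization of Bradley-Terry is the Plackett-Luce model [8, 9]. For a ranked list $y_1 \succ y_2 \succ \cdots \succ y_N$, it defines the full-ranking probability as
$$P(y_1 \succ y_2 \succ \cdots \succ y_N) = \prod_{k=1}^{N} \frac{\exp(r_\psi(y_k))}{\sum_{j=k}^{N} \exp(r_\psi(y_j))}.$$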
This can be interpreted as repeatedly choosing the best remaining item. First choose the best among all N items. Then choose the best among the remaining N−1 items. Continue until the list is exhausted. When N=2, this reduces to Bradley-Terry.
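The "repeatedly choose the best remaining item" reading translates directly into code. The sketch below, with hypothetical function names, computes the Plackett-Luce negative log-likelihood for scores listed best first and checks numerically that the two-item case matches the Bradley-Terry loss.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a non-empty list."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def plackett_luce_nll(scores_best_first):
    """-log P(y_1 > y_2 > ... > y_N) for scores listed in ranked order."""
    return sum(
        logsumexp(scores_best_first[k:]) - scores_best_first[k]
        for k in range(len(scores_best_first))
    )

def bradley_terry_nll(r_w, r_l):
    """-log sigmoid(r_w - r_l) for a preferred pair."""
    d = r_w - r_l
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

# With two items the Plackett-Luce likelihood is the Bradley-Terry likelihood.
print(plackett_luce_nll([1.5, 0.5]), bradley_terry_nll(1.5, 0.5))
```

The last factor of the product always equals one, so a list of length N contributes N - 1 non-trivial choice terms.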
Therefore, the standard reward modeling objectives are just LTR objectives in disguise. The reward model is the scoring function, the preference labels define the target ranking, and the loss functions are essentially the same.
Preference Alignment of LLMs
One popular application of reward modeling is preference alignment of LLMs. The goal is to train a language model to generate responses that are preferred by humans. The reward model is trained on human preferences over model-generated completions, and then the policy is optimized against that reward.
For LLM preference alignment, the items are completions conditioned on the same prompt. A typical dataset is
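$$\mathcal{D} = \Big\{\Big(x^{(k)}, \big\{(y_i^{(k)}, s_i^{(k)})\big\}_{i=1}^{N}\Big)\Big\}_{k=1}^{M},$$
where each prompt $x^{(k)}$ has $N$ sampled completions. Each prompt-completion pair $(x^{(k)}, y_i^{(k)})$ has a preference label $s_i^{(k)}$. As before, we write
$$y_i^{(k)} \succ y_j^{(k)} \iff s_i^{(k)} > s_j^{(k)}.$$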
We usually do not compare completions from different prompts. The statement $y_i^{(k)} \succ y_j^{(\ell)}$ is undefined when $k \neq \ell$. This mirrors learning-to-rank in search, where each prompt acts like a query and the responses are the documents attached to that query.
Reinforcement Learning from Human Feedback (RLHF)
The standard approach to LLM preference alignment is Reinforcement Learning from Human Feedback (RLHF) [10]. In general, RLHF consists of two steps:
1. Train a reward model $r_\psi(x, y)$ on the preference dataset.
2. Optimize the policy using KL-regularized RL against that reward model.
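As we observed above, reward modeling can be written as an LTR problem. The dataset contains prompts and ranked lists of completions, the reward model is the scoring function, and the loss is a ranking loss. Thus, the reward modeling step of RLHF can be written as
$$\min_\psi \; \mathbb{E}_{(x, \{(y_i, s_i)\}_{i=1}^{N}) \sim \mathcal{D}} \left[ \ell\big(s_{1:N}, r_\psi(x, y_{1:N})\big) \right],$$
where $r_\psi(x, y_{1:N})$ denotes the vector of predicted rewards $\big(r_\psi(x, y_1), \ldots, r_\psi(x, y_N)\big)$.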
For example, for pairwise preferences, choosing ℓ as RankNet gives the standard Bradley-Terry reward model. For ranked lists, choosing ℓ as ListMLE gives a Plackett-Luce reward model. This is the first half of the connection.
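The second half comes from the KL-regularized RL objective. Given an optimized reward model $r_\psi$, standard RLHF solves
$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)} \big[ r_\psi(x, y) \big] - \beta \, \mathbb{E}_{x \sim \mathcal{D}} \Big[ \mathrm{KL}\big(\pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big) \Big],$$
where $\beta > 0$ controls how strongly the policy is kept close to the reference policy $\pi_{\text{ref}}$.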
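In practice, this objective is optimized with an RL algorithm such as PPO. However, we may also view it as a standalone optimization problem. The closed-form optimal policy for this objective is well known:
$$\pi_{r_\psi}(y \mid x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r_\psi(x, y)\right),$$
where
$$Z(x) = \sum_{y \in \mathcal{Y}} \pi_{\text{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r_\psi(x, y)\right).$$
Then we may invert this equation. The reward that makes an arbitrary policy $\pi_\theta$ optimal is
$$r_{\pi_\theta}(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x).$$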
This is the key DPO reparameterization [11]. The expression is conceptually useful but computationally awkward: the normalizer $Z(x)$ is a sum over all possible completions, which is intractable to compute.
Fortunately, in many cases the unknown $Z(x)$ cancels out or can be safely ignored. This is the key insight that leads to DPO-style direct alignment.
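The closed form and its inversion can be checked numerically on a toy problem. The sketch below assumes a hypothetical three-completion output space with made-up reference probabilities and rewards. It builds the optimal policy, recovers the reward from the policy log-ratio, and confirms that the normalizer is needed for absolute rewards but not for reward differences.

```python
import math

beta = 0.5
pi_ref = {"a": 0.5, "b": 0.3, "c": 0.2}    # toy reference policy (assumed)
reward = {"a": 1.0, "b": 0.0, "c": -1.0}   # toy reward values (assumed)

def optimal_policy(pi_ref, reward, beta):
    """pi(y) = pi_ref(y) * exp(r(y) / beta) / Z, returned together with Z."""
    unnormalized = {y: p * math.exp(reward[y] / beta) for y, p in pi_ref.items()}
    z = sum(unnormalized.values())
    return {y: u / z for y, u in unnormalized.items()}, z

def log_ratio_reward(pi, pi_ref, beta):
    """beta * log(pi / pi_ref): the implied reward without the Z term."""
    return {y: beta * math.log(pi[y] / pi_ref[y]) for y in pi}

pi_star, z = optimal_policy(pi_ref, reward, beta)
rho = log_ratio_reward(pi_star, pi_ref, beta)

# Adding beta * log Z recovers the original reward exactly.
recovered = {y: rho[y] + beta * math.log(z) for y in rho}
# Differences of rho already match differences of the reward, with no Z.
print(rho["a"] - rho["b"], reward["a"] - reward["b"])
```

On a real language model the sum defining Z runs over all completions, which is why only the difference-based check carries over in practice.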
Direct Preference Optimization (DPO)
Direct Preference Optimization (DPO) builds on the above insight, collapsing the two steps of RLHF into a single step of direct optimization on the preference dataset.
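Since we have a closed-form expression for the reward that makes any given policy optimal, we can substitute it back into the reward modeling objective of step 1:
$$\min_\theta \; \mathbb{E}_{(x, \{(y_i, s_i)\}_{i=1}^{N}) \sim \mathcal{D}} \left[ \ell\big(s_{1:N}, r_{\pi_\theta}(x, y_{1:N})\big) \right].$$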
If the unknown Z(x) cancels, the reward-modeling objective becomes a direct policy optimization objective. No explicit reward model, no RL loop, just supervised optimization on preferences.
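If we use the Bradley-Terry objective, we start with
$$\mathcal{L}_{\text{BT}}(\psi) = -\mathbb{E}_{(x, y_w, y_\ell) \sim \mathcal{D}} \Big[ \log \sigma\big(r_\psi(x, y_w) - r_\psi(x, y_\ell)\big) \Big].$$
Substituting $r_{\pi_\theta}$ for $r_\psi$, the $\beta \log Z(x)$ terms cancel in the difference, which leaves the DPO loss
$$\mathcal{L}_{\text{DPO}}(\theta) = -\mathbb{E}_{(x, y_w, y_\ell) \sim \mathcal{D}} \left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_\ell \mid x)}{\pi_{\text{ref}}(y_\ell \mid x)} \right) \right].$$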
DPO provides a very clean recipe for LLM preference alignment. More importantly, it achieves strong, competitive results while using a much simpler training loop than RLHF. This simplicity has contributed to its popularity and influence.
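For a single preference pair, the DPO loss needs only four sequence log-probabilities. The sketch below assumes these have already been computed by summing token log-probabilities under the policy and the reference model. The function name, argument names, and the default value of `beta` are illustrative choices.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid of the difference in beta-scaled policy/reference log-ratios."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Stable evaluation of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# At initialization the policy equals the reference, so the loss is log 2.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Raising the log-probability of the preferred completion lowers the loss.
print(dpo_loss(-9.0, -12.0, -10.0, -12.0))
```

In a training loop, the same expression would be evaluated on batched tensors so that gradients flow through the policy log-probabilities only.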
Generalized DPO
Now that we have seen how DPO emerges from the LTR perspective, we can ask: what about other LTR losses? The same substitution can be applied to any LTR-style reward modeling objective.
Note that this is actionable only when the prompt-level normalizers induced by rπθ either cancel exactly or can be safely approximated. Pairwise difference losses are clean because Z(x) cancels in every difference. ListMLE is also clean because every Plackett-Luce denominator is a softmax over candidates from the same prompt, and adding the same constant to all candidate scores changes nothing.
Pointwise losses are not clean, because they depend on absolute scores. The unknown βlogZ(x) remains. ListNet-style objectives are also subtle in preference-only settings. The model-side softmax is translation-invariant, but the target top-one distribution requires meaningful cardinal scores or an approximation to the optimal reward distribution. This is where contrastive views such as InfoNCA become useful.
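The difference between clean and unclean losses can be seen by adding an arbitrary constant to every candidate score, standing in for the unknown prompt-level term. In the sketch below, the constant `c`, the scores, and the function names are made up for illustration.

```python
import math

def ranknet(s_w, s_l):
    """Pairwise logistic loss on a score difference."""
    d = s_w - s_l
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))

def listmle(scores_best_first):
    """Plackett-Luce negative log-likelihood for scores in ranked order."""
    total = 0.0
    for k in range(len(scores_best_first)):
        tail = scores_best_first[k:]
        m = max(tail)
        total += m + math.log(sum(math.exp(s - m) for s in tail)) - scores_best_first[k]
    return total

def mse(target, score):
    """Pointwise squared error against an absolute target."""
    return (target - score) ** 2

rho = [0.8, 0.1, -0.4]          # policy-implied scores for one prompt (assumed)
c = 2.5                         # hypothetical stand-in for beta * log Z(x)
shifted = [r + c for r in rho]

print(ranknet(rho[0], rho[1]), ranknet(shifted[0], shifted[1]))  # equal
print(listmle(rho), listmle(shifted))                            # equal
print(mse(1.0, rho[0]), mse(1.0, shifted[0]))                    # different
```

Only the pointwise loss moves, which is why it cannot be evaluated without knowing the normalizer.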
Therefore, beyond RankNet/Bradley-Terry, we can obtain a variety of direct alignment methods by applying the DPO reparameterization to different LTR losses. It is convenient to define the policy-implied reward score
$$\rho_\theta(x, y) := \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}.$$
Then, in pairwise losses, the score differences in rπθ are just differences in ρθ. In listwise losses, softmaxes over rπθ are just softmaxes over ρθ.
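RankSVM gives SLiC and RRHF
Applying the RankSVM hinge loss to a preferred pair $y_w \succ y_\ell$ gives
$$\mathcal{L}_{\text{hinge}}(\theta) = \mathbb{E}_{(x, y_w, y_\ell) \sim \mathcal{D}} \Big[ \max\big(0,\, 1 - (\rho_\theta(x, y_w) - \rho_\theta(x, y_\ell))\big) \Big].$$
This is closely related to SLiC and SLiC-HF [12, 13], as well as RRHF [14], especially when sequence likelihoods are normalized appropriately for length [15]. The basic idea is margin ranking over preferred and dispreferred completions.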
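A RankSVM-style hinge on the policy-implied score can be sketched as follows. This is only the margin-ranking idea applied to log-ratios, not the exact SLiC or RRHF objective, which differ in details such as regularization terms and how likelihoods are normalized. All names and default values are assumptions.

```python
def implied_reward(logp: float, ref_logp: float, beta: float) -> float:
    """beta * log(pi_theta / pi_ref) from sequence log-probabilities."""
    return beta * (logp - ref_logp)

def hinge_alignment_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                         beta=1.0, margin=1.0):
    """Hinge on the gap between preferred and dispreferred implied rewards."""
    rho_w = implied_reward(logp_w, ref_logp_w, beta)
    rho_l = implied_reward(logp_l, ref_logp_l, beta)
    return max(0.0, margin - (rho_w - rho_l))

# Policy equal to the reference: the gap is zero, so the loss is the margin.
print(hinge_alignment_loss(-10.0, -12.0, -10.0, -12.0))
# Once the gap exceeds the margin, the pair stops contributing to the loss.
print(hinge_alignment_loss(-8.0, -12.0, -10.0, -12.0))
```

Unlike the logistic loss, the hinge gives no further gradient for pairs that are already separated by the margin.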
ListMLE gives DPO with Plackett-Luce and PRO
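For a ranked list
$$y_1 \succ y_2 \succ \cdots \succ y_K,$$
the Plackett-Luce/ListMLE direct objective becomes
$$\mathcal{L}_{\text{PL}}(\theta) = -\mathbb{E}_{(x, y_{1:K}) \sim \mathcal{D}} \left[ \sum_{k=1}^{K} \log \frac{\exp(\rho_\theta(x, y_k))}{\sum_{j=k}^{K} \exp(\rho_\theta(x, y_j))} \right].$$
This is the listwise analogue of DPO. It corresponds to DPO with Plackett-Luce, and is also closely connected to Preference Ranking Optimization (PRO) [16]. If preferences are given as listwise labels, pairwise DPO throws away most of a ranked list by reducing it to pairs. PL-DPO uses the whole ordering directly.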
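A minimal sketch of the listwise objective for one prompt is shown below, assuming sequence log-probabilities are given in ranked order, best first. The function name `pl_dpo_loss` and the default `beta` are illustrative, not taken from any particular implementation.

```python
import math

def pl_dpo_loss(logps_best_first, ref_logps_best_first, beta=0.1):
    """Plackett-Luce negative log-likelihood on policy-implied scores."""
    rho = [beta * (lp - ref)
           for lp, ref in zip(logps_best_first, ref_logps_best_first)]
    loss = 0.0
    for k in range(len(rho)):
        tail = rho[k:]                      # candidates not yet chosen
        m = max(tail)
        log_norm = m + math.log(sum(math.exp(r - m) for r in tail))
        loss += log_norm - rho[k]           # -log softmax of the k-th choice
    return loss

# With the policy equal to the reference, every ordering of 3 items is
# equally likely, so the loss is log(3!) = log 6.
print(pl_dpo_loss([-5.0, -6.0, -7.0], [-5.0, -6.0, -7.0]))
```

For two candidates the sum has a single non-trivial term, which is the pairwise DPO loss on the same log-ratios.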
ListNet gives InfoNCA
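The ListNet-style top-one objective compares a target distribution $P^\star$ over candidates with the model distribution induced by $\rho_\theta$:
$$\mathcal{L}_{\text{ListNet}}(\theta) = -\mathbb{E}_{(x, y_{1:K}) \sim \mathcal{D}} \left[ \sum_{i=1}^{K} P^\star(i \mid x) \log \frac{\exp(\rho_\theta(x, y_i))}{\sum_{j=1}^{K} \exp(\rho_\theta(x, y_j))} \right].$$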
Now, suppose we know the true reward function r⋆(x,y) that induces the true preference distribution. Then, the optimal target distribution is the softmax over the true reward:
This is exactly the ListNet form, and it corresponds to InfoNCA in the Noise Contrastive Alignment (NCA) framework [17]. The subtle part is estimating or constructing the target distribution $P^\star$ from reward labels, preference labels, or samples. To do so, InfoNCA uses an approximation to the denominator of the optimal distribution,
assuming that the candidate list is a representative sample from the full distribution. This is a common assumption in contrastive learning, and it is often good enough in practice. Then, writing D as a distribution over prompts sampled from an underlying distribution p0 and candidate lists sampled identically and independently from πref, the loss becomes
Since the candidates are i.i.d. from πref, the expectation over the candidate list is symmetric in y1,…,yK. Therefore, we can drop the average over i and just pick one of the candidates to be the "positive" sample:
Finally, treating the probability ratio as an importance weight, we can rewrite this expectation as if we had sampled a single positive item from the optimal policy π⋆ instead of πref:
Therefore, instead of computing a softmax cross-entropy against an unknown target distribution, we can use a standard InfoNCE-style [18] contrastive loss. We draw one positive sample from the optimal policy $\pi^\star$ (or use expert completions as a proxy for it) and several negatives from the reference policy $\pi_{\text{ref}}$. The loss is the negative log-probability that the model assigns to the positive sample among the candidate list. This is the reason for the name InfoNCA: it is a noise-contrastive estimation of the optimal policy distribution, where the noise distribution is the reference policy.
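The contrastive form described above can be sketched for a single prompt as a softmax over policy-implied scores, with the positive completion as the correct class. The snippet assumes log-probabilities are precomputed. The function name, argument names, and the choice `beta=1.0` are illustrative assumptions rather than the InfoNCA reference implementation.

```python
import math

def contrastive_alignment_loss(pos_logp, pos_ref_logp,
                               neg_logps, neg_ref_logps, beta=1.0):
    """-log softmax probability of the positive among positive + negatives."""
    rho_pos = beta * (pos_logp - pos_ref_logp)
    rho_all = [rho_pos] + [beta * (lp - ref)
                           for lp, ref in zip(neg_logps, neg_ref_logps)]
    m = max(rho_all)
    log_norm = m + math.log(sum(math.exp(r - m) for r in rho_all))
    return log_norm - rho_pos

negs = [-11.0, -12.0, -13.0]
# Policy equal to the reference: all four candidates tie, so the loss is log 4.
print(contrastive_alignment_loss(-10.0, -10.0, negs, negs))
# Raising the positive's log-probability lowers the loss.
print(contrastive_alignment_loss(-9.0, -10.0, negs, negs))
```

Here the positive would come from expert or high-reward completions and the negatives from samples of the reference policy.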
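LambdaRank gives LiPO-λ
Applying the LambdaRank loss to the policy-implied scores gives
$$\mathcal{L}_{\lambda}(\theta) = -\mathbb{E}_{(x, \{(y_i, s_i)\}_{i=1}^{K}) \sim \mathcal{D}} \left[ \sum_{(w,\ell)\,:\,s_w > s_\ell} [\![\, |\Delta \mathrm{NDCG}_{w,\ell}| \,]\!] \, \log \sigma\big(\rho_\theta(x, y_w) - \rho_\theta(x, y_\ell)\big) \right].$$
This is the main idea behind LiPO-λ [19]. It uses pairwise logistic comparisons, but weights them by their listwise ranking impact. However, unlike InfoNCA, this method requires the true optimal rewards to compute the NDCG weights, and does not have a clean alternative formulation. Therefore, it is more of a theoretical proposal than a practical algorithm. Still, when the true rewards are available, it is a natural way to inject metric awareness into direct alignment, and we might expect it to outperform methods that discard the true reward magnitudes.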
Correspondence Table
The analogy can be summarized in a compact table.
| LTR method | Reward modeling view | Direct alignment analogue | Main idea |
| --- | --- | --- | --- |
| RankSVM | Margin-based pairwise ranking | SLiC / RRHF with likelihood normalization | Preferred responses should beat dispreferred responses by a margin |
| RankNet | Bradley-Terry MLE | DPO with BT | Logistic pairwise preference classification |
| ListNet | Top-one softmax cross-entropy | InfoNCA | Match a target soft distribution over a candidate list |
| ListMLE | Plackett-Luce MLE | DPO with PL / PRO | Maximize likelihood of the full ranked list |
| LambdaRank | NDCG-weighted pairwise ranking | LiPO-λ | Weight pairwise errors by their listwise metric impact |
This table summarizes the main point of the post. Many DPO-style methods are different ways of applying ranking losses directly to policies. The connection comes from KL-regularized RL, which lets us rewrite rewards in terms of policy log-ratios.
Conclusion
The path from LTR to direct alignment is straightforward:
1. Start with a ranking loss over item scores.
2. Interpret the score as a reward model.
3. Use the KL-regularized RLHF solution to express rewards through policy log-ratios.
4. Substitute that policy-implied reward into the ranking loss.
5. Keep the objectives where the unknown prompt normalizer cancels, or approximate it carefully.
This gives a clean map from classical ranking methods to modern alignment algorithms. The practical takeaway is simple. The training objective should match the feedback format. Pairwise feedback fits pairwise losses, while ranked-list feedback fits listwise losses. Turning a ranked list into pairs is convenient, but it can lose useful information about the full ordering. Since this issue has already been studied in learning-to-rank, LLM alignment can reuse those ideas instead of treating each new objective as unrelated.