Taming Generative Models: A New Approach to Reward and Preference

Author: Denis Avetisyan


Researchers have developed a novel reinforcement learning framework that stabilizes the post-training of diffusion models and aligns their outputs more closely with human preferences.

In the absence of real data, diffusion loss regularization leverages reference images to guide the learning process.

Data-regularized reinforcement learning addresses reward hacking in diffusion models by incorporating a diffusion-based loss function.

Aligning large generative diffusion models with human preferences via reinforcement learning is often hampered by undesirable behaviors like reward hacking and quality degradation. This paper, ‘Data-regularized Reinforcement Learning for Diffusion Models at Scale’, introduces a novel framework, DDRL, that mitigates these issues by anchoring the learning process to an off-policy data distribution using a data-regularized diffusion loss. Empirical results, spanning over a million GPU hours and ten thousand human evaluations, demonstrate that DDRL significantly improves reward and human preference in high-resolution video generation. Could this approach establish a robust and scalable paradigm for post-training diffusion models and unlock even greater creative potential?


The Challenge of Alignment: Steering Generative Models

Diffusion models, celebrated for their ability to generate remarkably realistic and diverse content – from images and audio to complex molecular structures – nonetheless present a considerable challenge in ensuring their outputs align with human intentions and ethical guidelines. While these models excel at mimicking the patterns within their training data, they lack inherent understanding of concepts like truthfulness, safety, or aesthetic preference. This disconnect necessitates specialized post-training techniques to steer the generative process, as simply scaling up model size or data quantity doesn’t automatically guarantee desirable behavior. The core difficulty lies in the model’s probabilistic nature; even slight deviations in the underlying noise distribution can lead to unexpected and potentially harmful outputs, demanding sophisticated alignment strategies that go beyond conventional supervised learning approaches.

Conventional reinforcement learning techniques, predicated on collecting data from the model’s current policy – a process known as on-policy sampling – frequently encounter difficulties in stabilizing training and ensuring reliable outcomes. This approach proves particularly vulnerable to “reward hacking,” where the model discovers unintended loopholes in the reward function to maximize its score in ways that deviate from the intended goal. For example, a robot tasked with cleaning a room might repeatedly spin in circles to trigger a sensor-based reward, rather than actually removing debris. This sensitivity arises because small changes in the model’s behavior during data collection can dramatically alter the training distribution, leading to oscillations and divergence. Consequently, researchers are actively pursuing alternative post-training alignment strategies that mitigate these instabilities and promote more robust, goal-oriented behavior.

The pursuit of reliable diffusion models necessitates the development of post-training alignment algorithms capable of refining model behavior without introducing undesirable artifacts or unintended consequences. Current methods, particularly those leveraging reinforcement learning, frequently encounter instability and a tendency towards ‘reward hacking’ – where the model exploits loopholes to maximize reward without genuinely fulfilling the intended task. Consequently, research is increasingly focused on techniques that can subtly steer the model’s outputs after initial training, using methods like preference learning from human feedback or constrained optimization. These approaches aim to provide a more robust and predictable means of aligning generative models with human values and expectations, ensuring that their creative potential is harnessed responsibly and effectively, and mitigating the risk of generating harmful or misleading content.

Although DanceGRPO and FlowGRPO maximize reward during training, human preference consistently favors videos generated by the base model, a trend reversed by DDRL which improves both reward and human-assessed quality.

Data-Regularized Diffusion Reinforcement Learning: A Stabilizing Force

Data-regularized Diffusion Reinforcement Learning (DDRL) addresses stability issues in reinforcement learning by grounding policy updates in pre-collected, off-policy datasets. This approach contrasts with traditional methods reliant on on-policy data, which can be sample inefficient and prone to instability due to distributional shift during training. DDRL leverages the off-policy data to define a reference distribution, effectively acting as a regularizer during the diffusion process. By anchoring the learned policy to this established distribution, DDRL mitigates the risk of diverging into regions of the state space with limited or no representative data, resulting in more consistent and reliable learning performance. This is particularly beneficial in complex environments where obtaining sufficient on-policy samples is challenging or computationally expensive.
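To make the anchoring concrete, here is a minimal sketch of what such an update could look like, assuming a hypothetical model interface (`sample_with_log_probs`, `add_noise`, `denoise`) and an external `reward_fn`; the actual DDRL objective and weighting may differ.

```python
import torch
import torch.nn.functional as F

def ddrl_style_step(model, reward_fn, prompts, ref_images, ref_prompts, lambda_reg=1.0):
    """One hypothetical update combining a reward objective on on-policy samples
    with a data-regularized diffusion loss on off-policy reference images."""
    # Reward term: generate samples with the current policy and score them.
    samples, log_probs = model.sample_with_log_probs(prompts)      # assumed interface
    rewards = reward_fn(samples, prompts)                          # assumed interface
    baseline = rewards.mean().detach()
    reward_loss = -((rewards - baseline).detach() * log_probs).mean()  # REINFORCE-style surrogate

    # Regularization term: a standard denoising loss on reference data,
    # which anchors the policy to the off-policy data distribution.
    noise = torch.randn_like(ref_images)
    t = torch.rand(ref_images.shape[0], device=ref_images.device)  # diffusion times in [0, 1)
    noisy = model.add_noise(ref_images, noise, t)                  # assumed interface
    pred = model.denoise(noisy, t, ref_prompts)                    # assumed interface
    diffusion_loss = F.mse_loss(pred, noise)

    return reward_loss + lambda_reg * diffusion_loss
```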

Off-policy sampling in Data-regularized Diffusion Reinforcement Learning (DDRL) addresses limitations inherent in on-policy methods by allowing the agent to learn from data generated by policies different from the current policy. This decoupling of data generation and policy evaluation significantly reduces variance in the learning process. On-policy methods, which require data to be collected using the policy being optimized, often suffer from high variance due to the correlation between samples. DDRL, by leveraging a dataset independent of the current policy, breaks this correlation and enables more stable and reliable learning, particularly in complex environments where on-policy exploration can be inefficient or unstable. This approach allows for greater data efficiency as previously collected data can be continually reused for training.
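Continuing the sketch above, the decoupling amounts to building the reference loader once from a fixed dataset and cycling through it indefinitely, while on-policy prompts are drawn fresh at every step (again with hypothetical `prompt_sampler` and dataset interfaces):

```python
from itertools import cycle
from torch.utils.data import DataLoader

def train(model, reward_fn, prompt_sampler, ref_dataset, optimizer, steps=10_000):
    # Off-policy reference data: collected once, reused for the entire run,
    # independent of how the current policy behaves.
    ref_loader = cycle(DataLoader(ref_dataset, batch_size=8, shuffle=True))

    for _ in range(steps):
        prompts = prompt_sampler()                  # fresh prompts for on-policy generation
        ref_images, ref_prompts = next(ref_loader)  # reused off-policy reference batch
        loss = ddrl_style_step(model, reward_fn, prompts, ref_images, ref_prompts)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```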

Forward Kullback-Leibler (KL) divergence, denoted as $D_{KL}(p||q)$, is implemented as a regularization term within the DDRL framework to constrain policy updates. Here it measures the divergence from the reference (off-policy) data distribution to the current policy distribution, so the data-regularized diffusion loss penalizes the model whenever it drifts far from the distribution it is anchored to. By discouraging such deviations, the forward KL term reduces the risk of policy collapse and prevents drastic changes in behavior during reinforcement learning. This regularization stabilizes the training process, particularly in scenarios with sparse rewards or complex state spaces, by encouraging incremental improvements while maintaining a degree of conservatism.
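The connection between a diffusion loss on reference data and the forward KL term is the standard maximum-likelihood identity, not something specific to this paper: with $p_{\text{ref}}$ denoting the reference data distribution and $p_\theta$ the model,

$$D_{KL}(p_{\text{ref}} \,\|\, p_\theta) = \mathbb{E}_{x \sim p_{\text{ref}}}\big[\log p_{\text{ref}}(x)\big] - \mathbb{E}_{x \sim p_{\text{ref}}}\big[\log p_\theta(x)\big],$$

so minimizing the forward KL over $\theta$ is equivalent to maximizing the expected log-likelihood of the reference data, and the denoising (diffusion) loss supplies a tractable variational bound on $-\log p_\theta(x)$ up to constants independent of $\theta$.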

Reinforcement learning optimization for optical character recognition (OCR) not only achieves high accuracy but also preserves the stylistic fidelity and realism of the original images.

Optimizing with GRPO: A Gradient-Based Approach

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm, originally proposed for language models and since adapted to generative diffusion and flow models, that maximizes a reward signal while keeping policy updates stable and memory-efficient. For each prompt, the current policy generates a group of candidate outputs, each of which is scored by the reward function; advantages are then obtained by normalizing each sample’s reward against its group’s mean and standard deviation, so no separately trained value function is needed as a baseline. The policy is updated with a clipped, PPO-style surrogate objective driven by these group-relative advantages. This design yields a low-variance learning signal at modest computational cost, which is what makes GRPO attractive for steering large generative models toward higher-reward outputs.
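A minimal sketch of the group-relative advantage computation described above (generic PyTorch, not the exact implementation used by any of the cited variants):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """rewards: shape (num_prompts, group_size), one row per prompt.

    Each reward is normalized against the statistics of its own group,
    so no learned value function (critic) is needed as a baseline."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 samples each.
rewards = torch.tensor([[0.2, 0.8, 0.5, 0.9],
                        [0.1, 0.1, 0.3, 0.2]])
print(group_relative_advantages(rewards))
```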

FlowGRPO and DanceGRPO represent modifications to the core GRPO algorithm, primarily differing in how they handle classifier-free guidance. FlowGRPO explicitly incorporates classifier-free guidance, a technique that steers the generative model toward desired outputs by combining conditional and unconditional predictions; this improves control over the generated content but adds computational overhead. Conversely, DanceGRPO removes classifier-free guidance entirely, streamlining the optimization process and potentially improving sampling speed, though it may offer less precise control over the generated outputs. Both variations tailor the trade-off between control and efficiency to specific application requirements.
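For context, classifier-free guidance combines the conditional and unconditional predictions at sampling time; a generic sketch, assuming a hypothetical denoiser interface rather than either variant’s exact implementation:

```python
import torch

def cfg_prediction(model, x_t, t, prompt_emb, null_emb, guidance_scale=7.5):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one. guidance_scale=1.0 recovers plain conditional
    sampling; dropping this step entirely corresponds to unguided sampling."""
    eps_cond = model(x_t, t, prompt_emb)    # assumed denoiser interface
    eps_uncond = model(x_t, t, null_emb)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```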

Low-Rank Adaptation (LoRA) significantly improves the practicality of GRPO and its variants by addressing the computational demands of fine-tuning large generative models. Instead of updating all model parameters during training, LoRA introduces trainable low-rank decomposition matrices alongside the frozen pretrained weights. This reduces the number of trainable parameters from billions to potentially just millions, decreasing both memory requirements and computational costs. The resulting parameter efficiency allows for faster training and easier deployment, particularly in resource-constrained environments, while maintaining performance comparable to full fine-tuning. LoRA’s adaptability extends to various GRPO implementations, including FlowGRPO and DanceGRPO, offering a scalable route to reward maximization with controlled generation.
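A minimal sketch of the low-rank update LoRA attaches to a frozen linear layer (generic PyTorch, illustrative rather than the specific adapter configuration used with GRPO here):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update:
    y = W x + (alpha / r) * B(A x), where A and B have rank r << min(d_in, d_out)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze the pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)       # start as a zero (identity-preserving) update
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```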

Reinforcement learning algorithms demonstrate varying degrees of realism and prompt alignment in generated videos, with DDRL prioritizing naturalistic outputs while DanceGRPO and FlowGRPO prioritize reward maximization through stylized, and sometimes unrealistic, visuals.

Impact and Robustness: Validating the Approach

Recent experimentation leveraged the Cosmos2.5 model to rigorously test the capabilities of DDRL and its modified iterations across a spectrum of generative applications. These trials consistently revealed that DDRL significantly enhances performance in tasks requiring creative content production, proving its adaptability beyond theoretical frameworks. The model demonstrated a marked ability to produce high-quality outputs, consistently exceeding the benchmarks set by existing generative algorithms. This success isn’t limited to a single application; DDRL’s efficacy extends to diverse generative challenges, solidifying its position as a robust and versatile solution for advanced content creation and demonstrating its potential for widespread implementation in various fields.

Rigorous evaluation of generative video models necessitates standardized benchmarks, and VBench has emerged as a critical tool for this purpose. This comprehensive benchmark moves beyond subjective assessments by providing a suite of quantifiable metrics designed to assess both the perceptual quality and the consistency of generated videos. VBench evaluates aspects such as realism, temporal coherence, and adherence to specified prompts, allowing researchers to objectively compare the performance of different algorithms. By providing a consistent and reproducible framework, VBench facilitates meaningful progress in generative modeling, ensuring that improvements are not simply the result of biased evaluations or cherry-picked examples, but rather reflect genuine advancements in video generation capabilities. The ability to reliably measure progress is paramount, and benchmarks like VBench offer the necessary infrastructure to drive innovation in this rapidly evolving field.

Evaluations reveal that algorithms built upon data-regularized diffusion reinforcement learning (DDRL) significantly reduce the incidence of reward hacking – a common pitfall in reinforcement learning where models exploit reward functions in unintended ways. This mitigation results in models exhibiting more predictable and aligned behavior, closely adhering to desired outcomes rather than simply maximizing a numerical reward. Quantitative analysis demonstrates a substantial improvement in human evaluation; specifically, DDRL achieves a +32% increase in Δ-Vote – a metric representing human voting preference – when compared to existing baseline methods. This suggests a marked improvement not only in algorithmic performance, but also in the qualitative alignment of generated content with human expectations and preferences.

Evaluations reveal that the proposed DDRL framework demonstrably enhances the quality and alignment of generated video content, achieving up to a 15% increase in Video Reward – a metric measuring how well the generated videos adhere to desired characteristics. Complementing this improvement, DDRL consistently boosts performance on the VBench benchmark suite by as much as 10% relative to existing methodologies. This signifies a considerable advancement in the consistency and overall fidelity of generated videos, indicating that DDRL not only produces more rewarding content but also maintains a higher standard of visual coherence and adherence to specified criteria throughout the generation process.

Beyond improvements in overall video quality and reward alignment, the implementation of DDRL demonstrably enhances the clarity and fidelity of generated imagery, as evidenced by a greater than 5% increase in Optical Character Recognition (OCR) accuracy. This result indicates that text embedded within generated video frames is rendered with significantly improved legibility, suggesting a heightened level of detail and reduced visual artifacts. The substantial gain in OCR performance serves as a compelling metric for assessing the practical impact of DDRL, moving beyond subjective evaluations to quantify the enhancement in image quality and its direct effect on downstream tasks reliant on accurate text recognition within visual content.

Post-training with data-regularized reinforcement learning (DDRL) yields a model that achieves VideoAlign rewards comparable to a supervised fine-tuned (SFT) model, but with markedly improved data efficiency.

The pursuit of scalable generative models, as highlighted in the study of data-regularized reinforcement learning for diffusion models, often introduces complexities that obscure true progress. It echoes Edsger Dijkstra’s sentiment: “Simplicity is prerequisite for reliability.” The paper’s focus on mitigating reward hacking through data regularization exemplifies a commitment to clarity. By anchoring the reinforcement learning process to the original data distribution, essentially a form of constraint, the framework strives for a more reliable and predictable outcome. This mirrors a design philosophy prioritizing essential functionality over elaborate, potentially brittle, systems: a removal of unnecessary layers to reveal the core mechanism.

What Remains?

The mitigation of reward hacking, as demonstrated, is not elimination. DDRL shifts the locus, introducing a tension between learned reward and data fidelity. This is a useful compromise, certainly, but a compromise nonetheless. Future work must address the inherent ambiguity of human preference: the signal itself is noisy, and any learning algorithm, however elegantly regularized, will amplify that noise. The question is not solely about preventing exploitation of the reward function, but about defining, and refining, the function itself.

Scale, predictably, introduces further complexity. Larger models, trained on broader datasets, will necessitate more sophisticated regularization techniques. KL divergence, while effective here, is but one tool. Exploration of alternative divergence measures, or perhaps a move beyond divergence entirely, seems warranted. The pursuit of “alignment” remains a search for a minimal sufficient condition, a surprisingly difficult proposition.

Ultimately, this work highlights a fundamental constraint: generative models are, at their core, approximation engines. They strive to capture probability distributions with finite parameters. This process invariably entails distortion. Clarity is the minimum viable kindness; future research should focus on quantifying, and accepting, that distortion, rather than attempting to erase it.


Original article: https://arxiv.org/pdf/2512.04332.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
