Author: Denis Avetisyan
New research tackles the problem of artificial intelligence ‘imagining’ details not actually present in videos, a crucial step toward reliable multimodal AI systems.

Researchers introduce DualityForge, a framework leveraging counterfactual video data and contrastive learning, alongside the DualityVidQA dataset to improve video understanding and mitigate hallucinations in large language models.
Despite recent advances, multimodal large language models remain vulnerable to visually ungrounded hallucinations, particularly when reasoning about improbable or counterfactual scenarios. This limitation is addressed in ‘Taming Hallucinations: Boosting MLLMs’ Video Understanding via Counterfactual Video Generation’, which introduces DualityForge, a framework for automatically generating paired, realistic, and counterfactual videos alongside corresponding question-answer pairs. By leveraging this synthetic data – compiled into the DualityVidQA dataset – and a novel contrastive learning approach, the authors demonstrate a substantial reduction in hallucinations and improved generalization across benchmarks. Could this method of systematically forging counterfactual data unlock more robust and reliable video understanding in multimodal AI?
Unveiling the Illusion: The Hallucination Problem in Multimodal LLMs
Multimodal Large Language Models, despite their impressive capabilities, are increasingly prone to “visual hallucinations” – a disconcerting tendency to generate textual descriptions that demonstrably contradict the provided visual input. This isn’t simply a matter of minor inaccuracies; models may confidently assert details absent from an image, misidentify objects, or fabricate entire scenes. The phenomenon arises because these models, trained on massive datasets of paired images and text, sometimes prioritize statistical correlations over genuine understanding of visual content. Consequently, they can produce fluent, grammatically correct narratives that are nonetheless detached from reality, posing significant challenges for applications demanding factual accuracy, such as image captioning, visual question answering, and assistive technologies.
The emergence of visual hallucinations in multimodal large language models isn’t merely a cosmetic flaw, but a fundamental issue arising from insufficient connection to the provided visual data. These models, while adept at language generation, often struggle to reliably anchor their textual outputs in concrete visual evidence, leading to inconsistencies and factual errors. This weak grounding compromises the model’s ability to perform tasks requiring accurate perception and reasoning, such as detailed image description, visual question answering, and informed decision-making based on visual scenes. Essentially, the model prioritizes fluent language generation over faithful representation of the visual input, hindering its usefulness in applications where accuracy and trustworthiness are paramount, and creating a barrier to truly intelligent multimodal systems.
Despite considerable research into mitigating visual hallucinations in Multimodal Large Language Models (MLLMs), current approaches consistently fall short of achieving reliable performance. Techniques range from refining training datasets to incorporating attention mechanisms and knowledge retrieval, yet inconsistencies between generated text and visual inputs persist. This lack of robust grounding directly undermines the trustworthiness of MLLM outputs, hindering their deployment in critical applications like medical diagnosis, autonomous navigation, and detailed image description. Consequently, the practical utility of these models remains limited until a substantial breakthrough addresses the fundamental challenge of ensuring visual fidelity in textual generation, preventing the propagation of factually incorrect or misleading information.

Constructing Robust Understanding: DualityVidQA as a Contrastive Dataset
DualityVidQA is a large-scale dataset designed for evaluating video understanding capabilities, consisting of 144,000 training samples. The dataset’s core structure revolves around paired examples: each sample includes a realistic video segment alongside a corresponding counterfactual video. These counterfactuals are not simply random variations, but are specifically generated to present subtle alterations to the original scene, creating challenging contrastive data for model training and evaluation. This paired structure facilitates the assessment of a model’s ability to discern meaningful differences and maintain robust performance even with minor visual or semantic changes in the input video.
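The article does not give the dataset’s exact schema, so the following is only a minimal sketch of what a paired training example could look like in code; all class and field names here are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class QAPair:
    """A question grounded in the video, with its reference answer."""
    question: str
    answer: str

@dataclass
class DualitySample:
    """Hypothetical schema for one DualityVidQA training example:
    a realistic clip paired with its subtly altered counterfactual."""
    real_video_path: str            # unedited, physically plausible clip
    counterfactual_video_path: str  # same scene with a deliberate, targeted alteration
    qa_pairs: List[QAPair] = field(default_factory=list)

# Toy instance; real samples would carry questions that probe the edited content.
sample = DualitySample(
    real_video_path="clips/0001_real.mp4",
    counterfactual_video_path="clips/0001_cf.mp4",
    qa_pairs=[QAPair("Does the glass shatter when it hits the floor?", "Yes")],
)
```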
DualityForge is a video editing framework designed to facilitate the creation of contrastive video examples for machine learning applications. The system allows for controlled manipulation of video content, enabling the generation of paired realistic and counterfactual scenarios. This controllability is achieved through specific edits targeting visual elements and semantic consistency, allowing researchers to create challenging test cases that go beyond simple data augmentation. The framework outputs paired videos designed to test a model’s ability to discern subtle but critical differences, and to improve robustness against anomalies or distortions commonly found in real-world video data.
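The article describes DualityForge at the level of capabilities rather than code. Purely as an illustration, a controlled manipulation could be captured by a small specification object like the one below, with the actual editing or generative backend elided; every name here is an assumption, not the authors’ interface.

```python
from dataclasses import dataclass, field
from typing import Tuple

@dataclass
class EditSpec:
    """Hypothetical description of one controlled manipulation."""
    target: str                      # e.g. "object:glass", "attribute:color", "event:impact"
    operation: str                   # e.g. "remove", "swap", "invert_outcome"
    params: dict = field(default_factory=dict)

def forge_pair(real_video_path: str, spec: EditSpec) -> Tuple[str, str]:
    """Return (real, counterfactual) clip paths; the backend that actually
    applies `spec` to the video is out of scope for this sketch."""
    counterfactual_path = real_video_path.replace("_real", "_cf")
    return real_video_path, counterfactual_path

real, cf = forge_pair("clips/0001_real.mp4",
                      EditSpec(target="event:impact", operation="invert_outcome"))
```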
DualityVidQA is designed to improve the robustness of video understanding models by presenting paired realistic and subtly altered video scenarios. This approach compels models to move beyond superficial feature recognition and develop a more nuanced comprehension of visual content, specifically addressing vulnerabilities to visual distortions and semantic inconsistencies. By training on data that explicitly includes these types of alterations, models are encouraged to focus on core semantic information rather than being misled by surface-level changes, resulting in improved generalization and reliability in real-world applications.
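The article does not spell out the objective used with these pairs. As one hedged illustration of how paired clips could be exploited contrastively, a margin loss could pull a grounded answer representation toward the clip it describes and push it away from the counterfactual twin; this is a generic sketch, not the authors’ loss.

```python
import torch
import torch.nn.functional as F

def paired_contrastive_loss(answer_emb, real_emb, cf_emb, margin: float = 0.2):
    """Illustrative margin loss: an answer grounded in the real clip should be
    more similar to the real clip's embedding than to its counterfactual's."""
    pos = F.cosine_similarity(answer_emb, real_emb, dim=-1)  # similarity to the true video
    neg = F.cosine_similarity(answer_emb, cf_emb, dim=-1)    # similarity to the edited video
    return F.relu(margin - pos + neg).mean()

# Toy tensors standing in for encoder outputs (batch of 4, 256-dimensional embeddings).
answer, real, counterfactual = (torch.randn(4, 256) for _ in range(3))
print(paired_contrastive_loss(answer, real, counterfactual).item())
```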
The DualityVidQA dataset incorporates counterfactual video generation through the systematic introduction of anomalies. These anomalies fall into three categories: video-level manipulations affecting visual fidelity, semantic inconsistencies altering object properties or relationships, and commonsense violations that depict physically implausible events. The resulting dataset comprises 144,000 training samples, each consisting of a paired real video and a corresponding counterfactual example exhibiting one or more of these anomaly types. This approach aims to challenge video understanding models by requiring them to discern subtle differences and demonstrate robustness to deviations from typical visual and semantic expectations.
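The three anomaly families translate naturally into a small taxonomy; the category labels below mirror the article’s description, while the concrete example strings are illustrative only.

```python
from enum import Enum

class AnomalyType(Enum):
    """The three counterfactual anomaly families described for DualityVidQA."""
    VIDEO_LEVEL = "video_level"   # manipulations that affect visual fidelity
    SEMANTIC = "semantic"         # altered object properties or relationships
    COMMONSENSE = "commonsense"   # physically implausible events

# Illustrative (hypothetical) edits for each family.
EXAMPLES = {
    AnomalyType.VIDEO_LEVEL: "frames re-rendered with inconsistent lighting",
    AnomalyType.SEMANTIC: "a red car that turns blue between shots",
    AnomalyType.COMMONSENSE: "a dropped ball that falls upward",
}

for kind, description in EXAMPLES.items():
    print(f"{kind.value}: {description}")
```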

Disciplining Perception: DNA-Train’s Two-Stage Training Methodology
DNA-Train is a two-stage training methodology implemented on the Qwen2.5-VL base model to address the problem of hallucination and enhance performance in video understanding tasks. This regime departs from single-stage training by initially establishing a strong foundation of video-text alignment through Supervised Fine-Tuning (SFT). Subsequently, it refines this understanding with a Reinforcement Learning (RL) stage, specifically targeting the reduction of inconsistencies between video content and generated text. The two-stage approach allows for a more controlled and effective learning process, leading to improved accuracy and reliability in video-based applications.
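As a rough outline of how the two stages chain together (every function name below is a placeholder, not the authors’ code), the RL stage refines the checkpoint produced by SFT rather than the base model; the sketches after the next two paragraphs fill in what each stage roughly optimizes.

```python
def supervised_fine_tune(model, video_text_pairs):
    """Stage 1 (stub): align video and text on labeled clip/description data."""
    return model  # see the SFT sketch below

def reinforce(model, duality_vidqa):
    """Stage 2 (stub): reward video-consistent answers, penalize ungrounded ones."""
    return model  # see the reward sketch below

def dna_train(base_model, video_text_pairs, duality_vidqa):
    """Two-stage recipe: the RL stage starts from the SFT checkpoint."""
    sft_checkpoint = supervised_fine_tune(base_model, video_text_pairs)
    return reinforce(sft_checkpoint, duality_vidqa)
```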
Supervised Fine-Tuning (SFT) forms the initial stage of the DNA-Train methodology, establishing a core competency in video-text alignment. This process leverages a labeled dataset consisting of video clips paired with corresponding textual descriptions. By training the Qwen2.5-VL base model on this data, the SFT stage optimizes the model’s parameters to predict accurate text given a video input, and vice-versa. The resulting model, pre-trained through SFT, demonstrates an improved capacity to associate visual content with relevant textual information, providing a strong foundation for subsequent reinforcement learning and ultimately reducing the incidence of hallucinations.
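Concretely, this stage amounts to next-token prediction conditioned on video features. The sketch below shows the shape of that objective with a generic vision-language interface; the actual Qwen2.5-VL training code is not part of the article, so treat the `model` signature as an assumption.

```python
import torch
import torch.nn.functional as F

def sft_step(model, video_features, input_ids, labels):
    """One supervised fine-tuning step: predict the paired description tokens
    conditioned on the video. The model is assumed to accept visual features
    alongside token ids, as vision-language models typically do."""
    logits = model(video_features=video_features, input_ids=input_ids)  # (B, T, vocab)
    loss = F.cross_entropy(
        logits.view(-1, logits.size(-1)),  # flatten to (B*T, vocab)
        labels.view(-1),                   # flatten to (B*T,)
        ignore_index=-100,                 # skip prompt / visual placeholder positions
    )
    loss.backward()
    return loss.item()

# Toy usage with a stand-in "model" that ignores its inputs (vocab size 1000).
toy_model = lambda video_features, input_ids: torch.randn(2, 8, 1000, requires_grad=True)
token_ids = torch.randint(0, 1000, (2, 8))
print(sft_step(toy_model, torch.randn(2, 4, 512), token_ids, token_ids))
```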
The second training stage of DNA-Train utilizes Reinforcement Learning (RL) to refine the model’s ability to identify and rectify inconsistencies between video content and textual descriptions. This is achieved through training on the DualityVidQA dataset, which presents video-question-answer triplets specifically designed to challenge the model’s reasoning and factuality. The RL process rewards the model for generating answers consistent with the video and penalizes responses containing unsupported or contradictory information, thereby enhancing its capacity for accurate video understanding and reducing the occurrence of hallucinations.
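The reward itself is described only qualitatively. Assuming the DualityVidQA reference answers are available during RL, a minimal stand-in reward might look like the following; this is a toy sketch of the idea (reward consistency, penalize unsupported claims), not the paper’s reward design.

```python
def consistency_reward(predicted: str, reference: str, asserts_absent_content: bool) -> float:
    """Toy reward: +1 when the answer matches the grounded reference, with an
    extra penalty when the response asserts content not present in the video
    (e.g. an event that only occurs in the counterfactual clip)."""
    reward = 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0
    if asserts_absent_content:
        reward -= 1.0
    return reward

print(consistency_reward("Yes", "yes", asserts_absent_content=False))  # 1.0
print(consistency_reward("Yes", "No", asserts_absent_content=True))    # -1.0
```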
During the Reinforcement Learning (RL) phase of DNA-Train, ℓ₁-normalization is applied to maintain training stability and prevent potential gradient issues. This technique normalizes the incoming vectors by summing their absolute values and dividing each element by that sum. By constraining the magnitude of gradients, ℓ₁-normalization mitigates the risk of exploding gradients, a common problem in deep reinforcement learning, and promotes more consistent and reliable model updates. This process ensures that no single feature dominates the learning process, leading to improved convergence and overall performance during the RL stage.
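The normalization itself is simple to state: each vector x is mapped to x divided by the sum of the absolute values of its elements. A direct implementation (equivalent to torch.nn.functional.normalize with p=1):

```python
import torch

def l1_normalize(x: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    """Divide each element by the sum of absolute values along `dim`, so the
    result has unit l1 norm; `eps` guards against division by zero."""
    return x / (x.abs().sum(dim=dim, keepdim=True) + eps)

v = torch.tensor([3.0, -1.0, 4.0])
print(l1_normalize(v))              # tensor([ 0.3750, -0.1250,  0.5000])
print(l1_normalize(v).abs().sum())  # ~1.0
```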

Beyond Accuracy: The Broader Implications for Reliable MLLMs
Evaluations on the DualityVidQA-Test benchmark reveal that DNA-Train consistently surpasses the performance of established baseline methods, including GRPO and DAPO, in detecting visual hallucinations within multimodal large language models. This improved detection capability suggests a fundamental advancement in aligning model outputs with actual visual content; while other techniques may generate plausible-sounding answers, DNA-Train demonstrates a heightened ability to identify when a model’s response contradicts the provided visual evidence. This consistent outperformance isn’t simply about achieving a higher score, but indicates a more robust mechanism for grounding language generation in visual reality, a critical step towards building trustworthy and reliable artificial intelligence systems.
The developed approach sets a new standard for detecting inaccuracies within multimodal large language models, achieving a state-of-the-art hallucination detection accuracy of 76.8% on the challenging DualityVidQA-Test benchmark. This marks a substantial step forward in ensuring the reliability of these models, particularly in scenarios demanding precise visual understanding. By accurately identifying instances where the model’s responses diverge from the provided visual evidence, this technology minimizes the risk of generating misleading or factually incorrect outputs. The achievement underscores the efficacy of the methodology in grounding model responses in verifiable data, a critical step toward deploying trustworthy MLLMs in real-world applications.
The enhanced performance demonstrated by this approach indicates a fundamental shift in how Multimodal Large Language Models (MLLMs) process information. By effectively grounding responses in concrete visual evidence, the system minimizes the generation of unsupported or fabricated details – a common challenge with earlier models. This grounding isn’t merely about identifying objects in an image; it’s about establishing a demonstrable link between visual input and the textual output, ensuring that claims are substantiated by what is actually seen. Consequently, the resulting outputs are not only more accurate but also inherently more trustworthy, fostering greater confidence in the MLLM’s reasoning and decision-making processes. This increased reliability is crucial for deploying these models in sensitive applications where factual correctness is paramount, representing a significant step towards more dependable artificial intelligence.
The advancements detailed in this research hold considerable promise for translating multimodal large language models (MLLMs) into dependable tools for real-world deployment. Specifically, improved accuracy in detecting visual hallucinations, coupled with strong performance on benchmarks like MVBench and TVBench that rivals or exceeds GPT-4o, suggests a pathway toward confidently utilizing MLLMs in complex applications. Areas poised to benefit include autonomous navigation systems requiring precise environmental understanding, and the field of medical diagnosis where accurate visual reasoning is paramount. Furthermore, surpassing the performance of other open-source models in EventHallusion Accuracy indicates a significant step forward in building MLLMs capable of discerning genuine events from fabricated details within video content, broadening the scope of their reliable application.

The pursuit of robust video understanding, as demonstrated by DualityForge, echoes a fundamental principle of elegant design. The framework doesn’t merely address the symptoms of visual hallucinations within Multimodal Large Language Models; it actively reshapes the learning process itself, fostering a deeper, more nuanced comprehension of visual information. As Yann LeCun aptly stated, “Everything we are doing in deep learning is about building systems that can learn representations.” DualityForge exemplifies this by synthesizing counterfactual data, effectively providing the model with a richer, more complete understanding of the visual world, and enabling it to discern reality from mere statistical probability. This careful tuning, akin to a well-crafted interface, allows the model to ‘sing’ rather than ‘shout’ when interpreting video content.
Beyond the Mirror: Charting Future Directions
The pursuit of mitigating visual hallucinations in Multimodal Large Language Models, as exemplified by DualityForge and DualityVidQA, reveals a fundamental truth: mere scale is insufficient. A model can ingest vast quantities of data, yet still mistake the shadow for the substance. The elegance of this work lies not simply in generating counterfactuals, but in forcing the model to reconcile conflicting realities, a subtle yet profound shift in perspective. However, the current reliance on synthetic data introduces its own set of biases, a ghostly echo of the assumptions embedded within the generation process itself. Future efforts must grapple with the question of authenticity – how to ground these models in a more robust understanding of the physical world, beyond the curated confines of a dataset.
A particularly intriguing, though currently unaddressed, challenge lies in extending this counterfactual reasoning to temporal dynamics. Generating a single counterfactual frame is one matter; constructing a coherent counterfactual video – one that adheres to the laws of physics and the subtleties of cause and effect – is a far more demanding task. This necessitates not only advances in generative modeling, but also a more formal integration of physics-based simulation within the learning framework.
Ultimately, the goal isn’t merely to reduce hallucinations, but to cultivate a form of ‘visual humility’ within these models – an awareness of their own limitations and a capacity for genuine uncertainty. The minor elements, the subtle inconsistencies flagged and the ambiguous scenarios presented, create a sense of harmony between perception and understanding. It is a poetic notion, perhaps, but one that hints at the true measure of intelligence.
Original article: https://arxiv.org/pdf/2512.24271.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/