Author: Denis Avetisyan
A new approach to deepfake detection uses future frame prediction and cross-modal analysis to identify manipulated videos and pinpoint exactly where the tampering occurs.

This work introduces a single-stage framework leveraging masked prediction and convolutional attention for robust deepfake detection and temporal localization across audio and visual modalities.
Despite advances in multimodal deepfake detection, current methods often struggle to generalize across unseen manipulations and may overlook subtle, intra-modal artifacts. This limitation motivates our work, ‘Next-Frame Feature Prediction for Multimodal Deepfake Detection and Temporal Localization’, which introduces a novel single-stage framework that enhances both generalization and precise temporal localization. By incorporating next-frame prediction for uni-modal and cross-modal features, alongside a window-level attention mechanism, our model effectively captures discrepancies indicative of deepfake tampering. Can this approach pave the way for more robust and reliable deepfake detection in increasingly sophisticated multimedia content?
Unveiling the Pattern: The Rising Threat of Deepfakes
The proliferation of increasingly convincing synthetic media, commonly known as deepfakes, represents a significant and escalating threat to both individual trust and broader societal security. Driven by rapid advancements in generative modeling – particularly techniques like Generative Adversarial Networks (GANs) and diffusion models – these artificially generated videos, images, and audio recordings are becoming remarkably difficult to distinguish from authentic content. This heightened realism erodes the public’s ability to reliably assess information, potentially manipulating opinions, damaging reputations, and even inciting real-world harm. The ease with which convincing forgeries can now be created, coupled with their widespread dissemination through social media, poses a unique challenge to traditional methods of verifying truth and maintaining a stable information ecosystem, demanding proactive strategies to mitigate the risks associated with this evolving technology.
As synthetic media rapidly evolves, conventional deepfake detection techniques – often relying on identifying broad statistical anomalies or pixel-level inconsistencies – are proving increasingly inadequate. These methods, effective against earlier generations of forgeries, struggle with the nuanced realism now achievable through advanced generative models. The arms race between creation and detection demands a shift towards solutions exhibiting greater adaptability and resilience; static, signature-based approaches are quickly bypassed. Current research emphasizes the development of techniques leveraging spatiotemporal inconsistencies, physiological signal analysis, and even the subtle ‘fingerprints’ left by the generative models themselves, aiming for robust systems capable of generalizing to unseen manipulation techniques and maintaining accuracy in the face of ever-improving forgeries.
Determining when and where alterations occur within a video is paramount to establishing its veracity, a process known as temporal localization. Unlike methods that simply flag a video as potentially manipulated, temporal localization pinpoints the precise frames where tampering has occurred, allowing for a nuanced assessment of the evidence. This capability moves beyond a binary ‘real or fake’ judgment and enables investigators to understand the scope and nature of the manipulation – was it a subtle audio edit, a complete facial replacement, or a minor object insertion? Accurate temporal localization is especially critical in forensic contexts, legal proceedings, and journalism, where the credibility of video evidence can have profound consequences. Researchers are actively developing algorithms that not only detect deepfakes but also provide a ‘heat map’ of alterations, offering a granular view of the manipulation and bolstering confidence in the remaining, unaltered portions of the video.

A Multimodal Lens: Towards Robust Detection
The presented deepfake detection system operates on a multimodal framework, integrating both audio and visual data streams to improve detection accuracy and pinpoint the temporal location of manipulations. This approach addresses the limitations of unimodal systems by exploiting potential inconsistencies between the visual and auditory components of a video; for example, lip movements not aligning with spoken words. By simultaneously analyzing both modalities, the framework aims to increase robustness against sophisticated deepfakes that may convincingly alter one modality while failing to accurately represent the other, and to provide precise frame-level localization of the manipulated regions within the video content.
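As a rough illustration of how such cross-modal inconsistencies can be scored, the snippet below compares per-frame audio and visual embeddings by cosine similarity and flags frames where the two modalities disagree; the threshold and the assumption of time-aligned, equal-length feature sequences are illustrative conveniences, not the paper's method.

```python
# Illustrative cross-modal consistency scoring (an assumption for exposition,
# not the paper's exact method): frames where audio and visual embeddings
# disagree are flagged as suspicious.
import torch
import torch.nn.functional as F


def cross_modal_inconsistency(visual_feats, audio_feats, threshold=0.5):
    """visual_feats, audio_feats: (T, D) time-aligned per-frame embeddings."""
    sim = F.cosine_similarity(visual_feats, audio_feats, dim=-1)  # (T,)
    suspicious = sim < threshold  # True where the two modalities disagree
    return sim, suspicious
```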
The proposed deepfake detection framework utilizes a single-stage training process to directly learn audio-visual consistencies, eliminating the need for independent pre-training of audio and visual components. Traditional methods often rely on pre-trained models, which can introduce biases and limit the model’s ability to effectively integrate multimodal information. By training the entire system end-to-end, the framework optimizes feature extraction and fusion specifically for the task of deepfake detection, resulting in improved generalization performance across diverse datasets and scenarios. This approach allows the model to learn subtle correlations between audio and visual cues that might be lost in a two-stage training paradigm.
The deepfake detection framework employs AV-HuBERT for feature extraction, pairing a ResNet-18 encoder for visual data with a ViT (Vision Transformer) encoder for audio data. ResNet-18, a convolutional neural network, efficiently captures spatial hierarchies within video frames, while the ViT encoder applies self-attention to the audio stream to identify temporal patterns. This dual-encoder approach yields distinct yet complementary feature representations from the two modalities, which are then fused for improved detection performance. The choice of ResNet-18 and ViT balances computational efficiency with representational capacity, enabling real-time processing without significant accuracy loss.
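To make the dual-encoder arrangement concrete, the sketch below shows a minimal PyTorch front end with a per-frame ResNet-18 branch for video and a small Transformer encoder standing in for the ViT-style audio branch; the module names, dimensions, and input shapes are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of a dual-encoder audio-visual front end (assumed shapes and
# names, not the paper's exact implementation).
import torch
import torch.nn as nn
import torchvision.models as models


class DualEncoderFrontEnd(nn.Module):
    def __init__(self, embed_dim=256, audio_bins=80, num_heads=4, num_layers=4):
        super().__init__()
        # Visual branch: ResNet-18 applied per frame, final fc replaced by a projection.
        resnet = models.resnet18(weights=None)
        resnet.fc = nn.Linear(resnet.fc.in_features, embed_dim)
        self.visual_encoder = resnet
        # Audio branch: a small Transformer encoder over per-frame audio features
        # (standing in for the ViT-style audio encoder described above).
        self.audio_proj = nn.Linear(audio_bins, embed_dim)
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.audio_encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, frames, audio):
        # frames: (B, T, 3, H, W) video frames; audio: (B, T, audio_bins) features.
        b, t = frames.shape[:2]
        v = self.visual_encoder(frames.flatten(0, 1)).view(b, t, -1)  # (B, T, D)
        a = self.audio_encoder(self.audio_proj(audio))                # (B, T, D)
        return v, a
```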

Uncovering the Anomaly: Discrepancy Detection Through Masked Prediction
The masked-prediction feature extraction module operates by forecasting features present in subsequent frames based on analysis of preceding frames. This predictive process is not intended to perfectly reconstruct future frames, but rather to establish a baseline expectation of temporal consistency. Significant deviations between the predicted features and the actual features observed in the subsequent frames are flagged as potential discrepancies. These discrepancies serve as indicators of video tampering, as manipulations frequently disrupt the natural flow and predictability of video sequences. The magnitude and characteristics of these prediction errors are then utilized as input for the discrepancy detection process, enabling identification of even subtle alterations.
The masked prediction feature extraction module utilizes a Causal Transformer architecture comprising both an encoder and a decoder. The encoder processes sequential past frame information, generating a contextualized representation of the video history. This representation is then fed to the decoder, which is tasked with predicting features for subsequent, masked frames. The causal nature of both the encoder and decoder ensures that predictions are based solely on past information, preventing the use of future data and maintaining temporal consistency. This architecture allows the model to learn and extrapolate temporal patterns, facilitating the detection of anomalies or discrepancies introduced by video tampering.
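A minimal sketch of this idea is given below: a Transformer encoder with a causal attention mask summarizes past frames, a linear head predicts the next frame's features, and the prediction error serves as a per-frame discrepancy score. The layer sizes and the mean-squared-error scoring are assumptions for illustration, not the paper's exact module.

```python
# Minimal sketch of next-frame feature prediction with a causal Transformer and
# a prediction-error discrepancy score (assumed details, not the exact module).
import torch
import torch.nn as nn


class NextFramePredictor(nn.Module):
    def __init__(self, dim=256, heads=4, layers=2):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, layers)
        self.head = nn.Linear(dim, dim)

    def forward(self, feats):
        # feats: (B, T, D) per-frame features from the front end.
        t = feats.size(1)
        # Causal mask: frame t may only attend to frames <= t.
        mask = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=feats.device), diagonal=1
        )
        ctx = self.encoder(feats, mask=mask)
        pred_next = self.head(ctx[:, :-1])   # predictions for frames 1..T-1
        target = feats[:, 1:].detach()       # observed next-frame features
        # Per-frame discrepancy: large prediction error flags likely tampering.
        discrepancy = (pred_next - target).pow(2).mean(dim=-1)  # (B, T-1)
        return pred_next, discrepancy
```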
The system utilizes both Local Window-Based Attention and Convolutional Attention mechanisms to model fine-grained temporal relationships within video sequences, enhancing the detection of subtle video manipulations. Local Window-Based Attention focuses on short-range dependencies by attending to features within a limited temporal window, while Convolutional Attention captures local patterns through convolutional filters. These attention mechanisms are further improved by the integration of cross-attention, which allows the model to attend to relevant features across different temporal locations and modalities, enabling a more comprehensive understanding of temporal dynamics and improved detection of inconsistencies indicative of tampering.
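The following sketch illustrates one plausible combination of window-level attention with a convolutional gating step over time; the layer choices, window size, and kernel width are assumptions made for illustration rather than the paper's configuration.

```python
# Minimal sketch of window-level attention combined with a convolutional
# attention gate over time (illustrative assumptions throughout).
import torch
import torch.nn as nn


class WindowConvAttention(nn.Module):
    def __init__(self, dim=256, heads=4, window=8, kernel=5):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Depthwise temporal convolution producing per-frame attention gates.
        self.conv_gate = nn.Sequential(
            nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (B, T, D); T assumed divisible by the window size for simplicity.
        b, t, d = x.shape
        w = x.view(b * (t // self.window), self.window, d)
        # Local attention restricted to each temporal window.
        w, _ = self.attn(w, w, w)
        x = w.view(b, t, d)
        # Convolutional attention: gate each frame by a locally derived weight.
        gate = self.conv_gate(x.transpose(1, 2)).transpose(1, 2)
        return x * gate
```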
Frame-level contrastive loss is implemented to facilitate the learning of robust and discriminative feature representations. This loss function operates by minimizing the distance between feature embeddings of corresponding frames in authentic videos while maximizing the distance between embeddings of frames from different videos or tampered frames within the same video. The objective is defined as minimizing a loss value calculated from paired positive and negative examples; positive pairs consist of features extracted from the same, unaltered video segment, and negative pairs comprise features from distinct videos or manipulated portions of videos. This process encourages the model to generate feature embeddings that are both highly similar for authentic content and distinctly different for manipulated or foreign content, thereby enhancing the model’s ability to identify discrepancies and improve generalization performance.
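An InfoNCE-style formulation of such a frame-level contrastive objective is sketched below; the pairing strategy, temperature, and tensor shapes are assumptions rather than the paper's exact loss.

```python
# Minimal sketch of a frame-level contrastive (InfoNCE-style) loss; the pairing
# strategy and temperature are assumptions, not the paper's exact objective.
import torch
import torch.nn.functional as F


def frame_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """anchor, positive: (N, D) matched real-frame features;
    negatives: (N, K, D) features from other videos or tampered frames."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_sim = (anchor * positive).sum(-1, keepdim=True)      # (N, 1)
    neg_sim = torch.einsum("nd,nkd->nk", anchor, negatives)  # (N, K)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
    # The positive pair sits at index 0 of each row of logits.
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```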

Validating the System: Performance Across Datasets
The proposed framework underwent training and evaluation utilizing the FakeAVCeleb dataset, a benchmark resource for audio-visual spoofing detection. Performance on this dataset reached a maximum accuracy of 92%, indicating a high level of efficacy in distinguishing between genuine and manipulated audio-visual samples within the specific characteristics of FakeAVCeleb. This metric represents the percentage of correctly classified instances out of the total number of samples evaluated, serving as a primary indicator of the model’s ability to generalize to unseen data within the dataset’s distribution.
To enhance the framework’s resilience and mitigate potential biases present in the initial training dataset, the training process was supplemented with data from the VoxCeleb2 dataset. This augmentation strategy increased the diversity of voices and acoustic conditions used during training, thereby improving the model’s ability to generalize to unseen data. Specifically, VoxCeleb2 provides a larger and more varied corpus of speech data compared to the original dataset, addressing limitations in speaker representation and environmental variability. This resulted in a more robust model less susceptible to overfitting and capable of maintaining performance across different recording conditions and speaker demographics.
Evaluation on the KoDF dataset resulted in 100% accuracy, indicating the model’s capacity to generalize beyond the training data and perform reliably on unseen examples. This dataset, comprising deepfake videos, presents a distinct challenge due to variations in compression artifacts and facial expressions compared to the FakeAVCeleb and VoxCeleb2 datasets used during training. Achieving perfect accuracy on KoDF suggests the framework has learned robust features capable of identifying deepfakes across a range of visual characteristics and is not overly sensitive to the specific data distribution of the training sets.
Evaluation on the LAV-DF dataset demonstrates state-of-the-art temporal localization: the proposed framework achieves an Average Precision (AP) of 19.82% at an Intersection over Union (IoU) threshold of 0.95, exceeding the UMMAFormer baseline by 19.82% under the same criterion. The AP score reflects how accurately the model identifies and localizes manipulated segments within LAV-DF, while the IoU threshold specifies the overlap a prediction must achieve with the ground-truth segment to count as correct.
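For readers unfamiliar with the metric, the snippet below shows the standard temporal IoU computation that underlies AP at a 0.95 threshold; it is generic evaluation code, not taken from the paper.

```python
# Standard temporal IoU between a predicted and a ground-truth segment;
# thresholding at 0.95 mirrors the strict localization setting reported above.
def temporal_iou(pred, gt):
    """pred, gt: (start, end) segments in seconds or frames."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# A prediction counts as correct at IoU@0.95 only with near-perfect overlap:
print(temporal_iou((1.0, 2.0), (1.02, 2.0)))  # ~0.98 -> counted as a hit
print(temporal_iou((1.0, 2.0), (1.3, 2.0)))   # 0.70 -> missed at this threshold
```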

Looking Ahead: Implications and Future Directions
This research provides a crucial stepping stone towards building deepfake detection systems capable of withstanding increasingly sophisticated forgeries. By demonstrating the efficacy of a novel approach to identifying subtle inconsistencies introduced during the deepfake creation process, the work moves beyond reliance on easily circumvented superficial cues. The demonstrated resilience stems from focusing on inherent artifacts within the generative models themselves, rather than merely detecting visual discrepancies. This foundation allows for the development of detectors less susceptible to adversarial attacks and evolving deepfake techniques, ultimately paving the way for more trustworthy digital content authentication and bolstering defenses against the spread of synthetic misinformation.
Investigations are shifting toward streamlining deepfake detection through more computationally efficient neural network designs, moving beyond resource-intensive models. This includes exploring techniques like network pruning and quantization to reduce processing demands without significantly compromising accuracy. Simultaneously, researchers are integrating diverse data streams – encompassing textual transcripts accompanying videos and associated metadata like creation dates and device information – to enhance detection capabilities. The rationale is that manipulations often leave subtle inconsistencies across these modalities, providing complementary signals that improve the robustness and reliability of current systems. This multi-modal approach promises a more holistic and resilient defense against increasingly sophisticated deepfake technology, potentially uncovering forgeries that would remain undetected by methods relying solely on visual analysis.
The proliferation of deepfake technology necessitates a concurrent and robust ethical framework to preempt potential societal damage and cultivate responsible advancement. Beyond the technical challenges of detection, careful consideration must be given to the implications for privacy, reputation, and the erosion of trust in digital content. Proactive ethical guidelines are needed to govern the creation and dissemination of synthetic media, addressing issues of consent, attribution, and the potential for malicious use, such as disinformation campaigns or identity theft. Furthermore, fostering public awareness and media literacy regarding deepfakes is paramount, empowering individuals to critically evaluate online content and discern authenticity from fabrication. A holistic approach, integrating technological safeguards with ethical principles and public education, is essential to harness the benefits of this powerful technology while minimizing its inherent risks.
The escalating sophistication of digital manipulation necessitates robust methods for not only detecting deepfakes but also precisely localizing the alterations within media. This pinpoint accuracy is paramount for rebuilding trust in a digital landscape increasingly susceptible to misinformation; simply identifying a fake is often insufficient, as understanding what has been changed allows for a more nuanced assessment of its intent and potential impact. Reliable localization empowers verification tools to highlight manipulated regions, providing users with concrete evidence and fostering critical evaluation of content, while also enabling targeted forensic analysis to trace the origins and spread of deceptive media. Ultimately, the ability to confidently identify and map these manipulations is not merely a technical challenge, but a crucial step in safeguarding individuals, institutions, and the integrity of public discourse.
The presented framework emphasizes discerning subtle inconsistencies within and between modalities – a principle echoed in Geoffrey Hinton’s observation: “The fundamental idea is that if you want to know something about the world, you have to look at a lot of examples.” This research, by masking and predicting future frames, effectively creates a ‘looking at a lot of examples’ scenario for the model, forcing it to learn robust features indicative of genuine temporal coherence. The convolutional attention mechanism further refines this process, highlighting discrepancies that might otherwise be masked by superficial similarities, ultimately strengthening the system’s ability to detect manipulations. This aligns with the core idea of the article – temporal localization – pinpointing when inconsistencies arise within the video sequence.
What Lies Ahead?
The pursuit of robust deepfake detection invariably reveals a deeper truth: the imperfections inherent in attempting to replicate reality. This work, by framing the problem as one of future frame prediction, cleverly shifts the focus from detecting forgery to understanding the generative process itself. Each reconstructed frame, each fused modality, highlights structural dependencies that must be uncovered, but also exposes the limitations of current approaches. The subtle failures of prediction aren’t merely errors; they are indicators of the underlying inconsistencies within and between the audio and visual streams.
Future investigation should not dwell solely on refining architectural details. The field risks becoming trapped in an iterative cycle of increasingly complex models that offer diminishing returns. Instead, attention should be directed toward disentangling the features that define genuine motion and sound. A more fruitful path may lie in exploring alternative loss functions—metrics that prioritize the preservation of temporal coherence and physical plausibility over pixel-perfect reconstruction.
Ultimately, interpreting these models is more important than producing visually appealing results. The true test will not be whether a system can flawlessly identify existing deepfakes, but whether it can anticipate and expose the patterns of forgery as they evolve. The goal, perhaps, is not to create an impenetrable shield, but to illuminate the inherent fragility of simulated reality.
Original article: https://arxiv.org/pdf/2511.10212.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/