Robots That Watch Their Own Work: Real-Time Anomaly Detection for Agile Manipulation

Author: Denis Avetisyan


A new approach enables robots to monitor their actions and identify unexpected deviations during complex tasks, improving reliability in dynamic environments.

The robotic system relies on a real-time monitoring framework, the Robot-Conditioned Normalizing Flow, which processes visual input from object masks, encodes task prompts with spherical uniform encoding, and integrates robot proprioception. Anomaly scores are computed within the affine coupling layers of the proposed network; when a score exceeds a defined threshold, the system triggers task replanning for task-level out-of-distribution scenarios or task rollback for state-level anomalies.

This paper introduces Robot-Conditioned Normalizing Flow (RC-NF), a real-time anomaly detection model leveraging Vision-Language-Action models and normalizing flows to enhance robustness and adaptability in robotic manipulation.

Despite advances in robotic manipulation driven by Vision-Language-Action models, reliable operation in dynamic, real-world scenarios remains a challenge due to limitations in handling out-of-distribution (OOD) conditions. This paper introduces Robot-Conditioned Normalizing Flow (RC-NF), a real-time anomaly detection framework designed to monitor robotic tasks and ensure consistent state and trajectory alignment. By decoupling robot and object states within a normalizing flow, RC-NF accurately quantifies deviations from expected behavior using only positive samples, achieving state-of-the-art performance on the newly introduced LIBERO-Anomaly-10 benchmark and demonstrating sub-100ms latency in real-world experiments. Could this approach unlock truly adaptive and robust robotic systems capable of navigating unforeseen circumstances?


The Fragility of Robotic Perception

Contemporary robotic systems increasingly rely on Vision-Language-Action Models to interpret surroundings and execute tasks; however, these sophisticated architectures exhibit a notable fragility when confronted with situations outside their training data. This susceptibility to ‘out-of-distribution’ data represents a significant impediment to deploying robots in unpredictable real-world environments. While capable of impressive performance under controlled conditions, even slight deviations – an unfamiliar object, unexpected lighting, or a novel arrangement of known elements – can trigger cascading failures. The models, trained to recognize patterns within a defined dataset, struggle to generalize to the infinite variability inherent in everyday life, leading to inaccurate perceptions and potentially unsafe actions. This inherent limitation underscores the need for more resilient perception systems capable of handling the unexpected with grace and reliability.

Despite advancements in Vision-Language-Action Models, robotic systems frequently falter when faced with unforeseen circumstances – a critical limitation for deployment in dynamic real-world environments. These models, trained on finite datasets, exhibit a fragility when encountering ‘out-of-distribution’ data, meaning anything deviating from their training parameters can induce errors. Consequently, robust anomaly detection becomes paramount; the ability to identify these subtle, unexpected inputs allows a robot to flag potentially problematic situations, initiate safety protocols, or request human intervention. Without this inherent resilience and the supporting anomaly detection systems, even highly capable robots remain vulnerable to failure in unpredictable settings, hindering their widespread adoption and reliable performance.

Conventional anomaly detection techniques often falter when confronted with the nuanced shifts in data encountered in real-world robotic applications. These methods frequently rely on pre-defined thresholds or statistical models trained on limited datasets, proving inadequate at recognizing deviations that are slight, yet significant for safe and effective operation. Consequently, a robot employing such systems may misinterpret subtle changes in its environment – a partially obscured object, unusual lighting, or an unexpected human gesture – as normal behavior. This inability to discern subtle anomalies severely restricts the robot’s capacity to adapt to unforeseen circumstances, hindering its ability to recover from errors and potentially leading to system failures or unsafe interactions. The challenge lies not in identifying gross deviations, but in accurately flagging the subtle anomalies that precede more critical issues, demanding more sophisticated perception strategies.

Vision-language models (VLMs) can be prompted to assess robotic anomalies, providing a scoring mechanism for identifying unusual behavior.

Modeling Expected Behavior with Robot-Conditioned Normalizing Flows

The Robot-Conditioned Normalizing Flow (RC-NF) is a generative model designed to learn the probability distribution governing expected robot states and trajectories. This is achieved by mapping a simple, known distribution – typically Gaussian – through a series of invertible transformations. These transformations are learned from robot experience data, enabling the model to represent complex, multi-modal distributions of plausible robot behavior. The output of the RC-NF is a probability density function over the space of robot states and future trajectories, allowing for both sampling of likely behaviors and the assessment of the probability of observed robot actions. This probabilistic representation is crucial for applications such as anomaly detection, predictive control, and reinforcement learning.
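The probabilistic assessment described above rests on the change-of-variables formula: the log-density of an observed state is the base Gaussian log-density of its latent image plus the log-determinant of the flow's Jacobian, and the anomaly score is simply the negative log-likelihood. The sketch below illustrates this mechanic with a toy invertible map; it is not the paper's implementation, and the names `flow_forward` and `anomaly_score` are illustrative.

```python
import numpy as np

def gaussian_log_prob(z):
    """Log-density of a standard Gaussian base distribution."""
    return -0.5 * (z @ z + len(z) * np.log(2 * np.pi))

def anomaly_score(x, flow_forward):
    """Negative log-likelihood under the flow: higher = more anomalous.

    `flow_forward` maps a state x to (z, log_det), where z is the latent
    point and log_det is the log-determinant of the Jacobian of the map.
    """
    z, log_det = flow_forward(x)
    log_px = gaussian_log_prob(z) + log_det  # change-of-variables formula
    return -log_px

# Toy invertible map: halve the input; log|det J| = -dim * log 2.
toy_flow = lambda x: (x / 2.0, -len(x) * np.log(2.0))

in_dist = anomaly_score(np.zeros(4), toy_flow)     # near the mode
shifted = anomaly_score(np.full(4, 5.0), toy_flow)  # far from training data
print(shifted > in_dist)  # True: the out-of-distribution state scores higher
```

A real flow stacks many learned invertible layers, but the scoring rule is exactly this sum of base log-density and accumulated log-determinants.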

Robot-Conditioned Normalizing Flows (RC-NF) leverage the established mathematical framework of Normalizing Flows, which are a class of generative models capable of learning complex, non-linear probability distributions. These flows achieve this by transforming a simple, known distribution – typically a Gaussian – into a more intricate distribution representing the robot’s behavioral data through a series of invertible transformations. This approach allows RC-NF to model the dependencies within high-dimensional robot state spaces, capturing nuanced relationships between robot configurations, actions, and environmental factors. By learning this underlying distribution, the model can then generate realistic and diverse robot trajectories, and assess the likelihood of observed behaviors, providing a powerful tool for robotic learning and control.

The Robot-Conditioned Normalizing Flow (RC-NF) integrates three primary data streams to establish contextual awareness of robotic operations. These consist of the robot’s current state – encompassing joint angles, velocities, and end-effector pose – providing a kinematic description. Task embeddings, generated from task specifications, supply semantic information regarding the intended goal. Finally, object-centric point sets, derived from sensor data, represent the surrounding environment and relevant objects in a geometrically explicit manner. By conditioning the generative model on this combined input – robot state, task embedding, and object point cloud – RC-NF can learn a distribution over plausible future states specifically relevant to the current operational context.
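The three conditioning streams can be pictured as one fused vector handed to the flow. Below is a minimal sketch under stated assumptions: the L2-normalization stands in for the paper's spherical uniform encoding of task prompts, the centroid/extent summary stands in for the learned point-query network, and all dimensions and names are illustrative.

```python
import numpy as np

def spherical_encode(task_embedding):
    """Project a task embedding onto the unit sphere (a stand-in for the
    paper's spherical uniform encoding of task prompts)."""
    return task_embedding / np.linalg.norm(task_embedding)

def build_condition(robot_state, task_embedding, object_points):
    """Fuse the three conditioning streams into a single vector.

    robot_state:    joint angles, velocities, end-effector pose (1-D)
    task_embedding: semantic encoding of the task prompt (1-D)
    object_points:  (N, 3) object-centric point set from segmentation
    """
    task = spherical_encode(task_embedding)
    # Cheap permutation-invariant summary of the point set; the paper
    # uses a learned point-query network (RC-PQNet) instead.
    centroid = object_points.mean(axis=0)
    extent = object_points.max(axis=0) - object_points.min(axis=0)
    return np.concatenate([robot_state, task, centroid, extent])

cond = build_condition(
    robot_state=np.zeros(7),             # e.g. 7 joint angles
    task_embedding=np.array([3.0, 4.0]),
    object_points=np.random.rand(128, 3),
)
print(cond.shape)  # (7 + 2 + 3 + 3,) = (15,)
```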

The Robot-Conditioned Point Query Network (RC-PQNet) operates within the affine coupling layers of the RC-NF architecture to produce their shift and scale parameters.

Detailed Feature Extraction via the RC-PQNet

The Robot-Conditioned Point Query Network (RC-PQNet) serves as the core feature extraction component within the RC-NF framework, employing an affine coupling layer to integrate data from multiple modalities. This coupling layer facilitates the fusion of information, allowing the network to consider diverse input types – such as point clouds, images, and robot state – during the feature extraction process. The affine coupling architecture transforms the input data through a series of affine transformations, enabling efficient and effective feature combination and subsequent representation learning. This design prioritizes maintaining information flow while enabling the network to learn complex relationships between the various input modalities.
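The coupling mechanism itself is compact: half the input passes through unchanged and, together with the conditioning signal, parameterizes an affine transform of the other half. The sketch below shows that structure and its exact inverse; the `params` lambda is a placeholder for the role RC-PQNet plays, not the network itself.

```python
import numpy as np

def affine_coupling(x, cond, params_fn):
    """One conditioned affine coupling step (a sketch, not RC-PQNet).

    Splits x into halves; the first half passes through unchanged and,
    with the conditioning input, parameterizes a shift and log-scale
    applied to the second half. Returns the transformed vector and the
    log-determinant of the Jacobian (the sum of log-scales).
    """
    d = len(x) // 2
    x1, x2 = x[:d], x[d:]
    shift, log_scale = params_fn(x1, cond)
    y2 = x2 * np.exp(log_scale) + shift
    return np.concatenate([x1, y2]), log_scale.sum()

def inverse_coupling(y, cond, params_fn):
    """Exact inverse — this invertibility is what keeps the flow's
    likelihood tractable."""
    d = len(y) // 2
    y1, y2 = y[:d], y[d:]
    shift, log_scale = params_fn(y1, cond)
    x2 = (y2 - shift) * np.exp(-log_scale)
    return np.concatenate([y1, x2])

# Placeholder for the conditioned network producing (shift, log-scale).
params = lambda x1, cond: (x1 * cond, np.full_like(x1, 0.1))

x = np.array([1.0, -1.0, 0.5, 2.0])
y, log_det = affine_coupling(x, cond=2.0, params_fn=params)
x_back = inverse_coupling(y, cond=2.0, params_fn=params)
print(np.allclose(x, x_back))  # True: the transform inverts exactly
```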

The Robot-Conditioned Point Query Network (RC-PQNet) employs Segment Anything Model 2 (SAM2) to generate object masks, providing pixel-level segmentation for identified objects within the scene. Complementing this, a Dual-Branch Point Feature Encoding strategy is implemented to capture detailed geometric information. This involves processing point cloud data through two separate branches: one focused on shape features derived from point normals and curvatures, and the other on positional data representing the 3D coordinates of each point. The outputs of these branches are then combined to create a comprehensive feature representation that captures both the object’s form and its location in space, enabling precise manipulation planning.
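The dual-branch idea can be illustrated with a crude geometric stand-in: a positional branch that keeps raw 3-D coordinates and a shape branch built from locally estimated normals, concatenated per point. This is an assumption-laden sketch, not the paper's learned encoder; `estimate_normals` is an illustrative helper.

```python
import numpy as np

def estimate_normals(points, k=8):
    """Crude per-point normal estimate via local PCA (illustrative only)."""
    normals = np.zeros_like(points)
    for i, p in enumerate(points):
        idx = np.argsort(np.linalg.norm(points - p, axis=1))[:k]
        nbrs = points[idx] - points[idx].mean(axis=0)
        # Normal direction = singular vector of the smallest singular value.
        _, _, vt = np.linalg.svd(nbrs)
        normals[i] = vt[-1]
    return normals

def dual_branch_features(points):
    """Concatenate a positional branch (raw 3-D coordinates) with a
    shape branch (estimated normals), mirroring the dual-branch idea."""
    return np.concatenate([points, estimate_normals(points)], axis=1)

pts = np.random.rand(32, 3)
feats = dual_branch_features(pts)
print(feats.shape)  # (32, 6): 3 positional + 3 shape features per point
```

In the paper both branches are learned, but the takeaway is the same: form and location are encoded separately, then fused into one per-point representation.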

The RC-PQNet incorporates task embeddings as conditional inputs to modulate its feature extraction process. These embeddings, representing the specific manipulation goal – such as picking, placing, or assembling – are integrated into the affine coupling layers. This allows the network to dynamically adjust its weighting and processing of multi-modal input data – including visual and tactile information – based on the current task. Consequently, the extracted features become task-relevant, prioritizing information crucial for successful execution of the manipulation and enabling contextual awareness without requiring separate networks for each action.

During real-world deployment with consecutive out-of-distribution (OOD) events, our <span class="katex-eq" data-katex-display="false">\pi_0 + RC-NF</span> policy demonstrates improved robustness compared to the <span class="katex-eq" data-katex-display="false">\pi_0</span> model, effectively handling ball repositioning at <span class="katex-eq" data-katex-display="false">t_{1a}</span>, ball rolling behind the gripper at <span class="katex-eq" data-katex-display="false">t_{2a}</span>, and control return after homing at <span class="katex-eq" data-katex-display="false">t_{3r}</span>.

Demonstrating Robustness and Real-Time Anomaly Characterization

The efficacy of this robotic anomaly detection system was rigorously tested using the LIBERO-Anomaly-10 benchmark dataset, a curated collection specifically designed to challenge and evaluate robotic perception and control systems. This dataset focuses on three prevalent issues encountered in robotic manipulation: instances where the gripper unexpectedly opens during a task, situations involving unintended slippage of the grasped object, and errors related to spatial misalignment between the robot and its intended target. By evaluating performance against these common anomalies, researchers can reliably assess the system’s ability to identify and respond to real-world robotic failures, ultimately improving the robustness and safety of automated processes.

Rigorous evaluation of the proposed Robot-Conditioned Normalizing Flow (RC-NF) demonstrates its robust performance in identifying critical robotic anomalies. When tested against the LIBERO-Anomaly-10 benchmark – encompassing Gripper Open, Gripper Slippage, and Spatial Misalignment – RC-NF consistently outperformed existing methods. Specifically, the approach achieved an approximate 8% improvement in Area Under the Curve (AUC) and a 10% improvement in Average Precision (AP) when compared to the strongest baseline. These gains indicate a substantial advancement in anomaly detection capability, suggesting RC-NF’s potential for more reliable and safe robotic operation by proactively identifying and addressing potential failures.

The system demonstrates a capacity for real-time corrective action when faced with spatial misalignment anomalies; upon detection, a Homing Procedure is automatically triggered to recalibrate the robot’s trajectory and preempt potential errors. Critically, this anomaly detection and subsequent procedural response operates with a latency below 100 milliseconds, even when processed on commercially available hardware – specifically, a consumer-grade Nvidia GeForce RTX 3090 GPU. This speed is essential for maintaining seamless robotic operation and ensuring the robot can swiftly recover from deviations without disrupting the overall task, highlighting the practical viability of the approach for deployment in dynamic environments.
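The decision logic that routes a score to a response can be stated in a few lines. The thresholds below are illustrative (the paper defines its own threshold), but the two-level response mirrors the framework's description: replanning for task-level out-of-distribution scenarios, rollback/homing for state-level anomalies.

```python
# Illustrative thresholds; the actual values are learned/tuned per task.
STATE_THRESHOLD = 5.0   # state-level anomaly: roll back / trigger homing
TASK_THRESHOLD = 12.0   # task-level OOD: replan the task

def respond(anomaly_score):
    """Map a real-time anomaly score to a corrective action."""
    if anomaly_score > TASK_THRESHOLD:
        return "replan"
    if anomaly_score > STATE_THRESHOLD:
        return "rollback"
    return "continue"

print(respond(3.0), respond(7.0), respond(20.0))
# continue rollback replan
```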

An anomaly score indicates successful detection and correction of an out-of-distribution state where the ball is repositioned and obscured, as demonstrated by the yellow curve contrasting with the baseline VLA model <span class="katex-eq" data-katex-display="false">\pi_0</span> performance (green).

The pursuit of robust robotic systems necessitates a focus on anticipating the unexpected. This research, introducing Robot-Conditioned Normalizing Flow, embodies that principle by establishing a framework for real-time anomaly detection. The model doesn’t simply react to errors; it learns a representation of ‘normal’ behavior, allowing it to discern deviations during robotic manipulation. As Andrew Ng once stated, “AI is about enabling machines to do things that previously required human intelligence.” This work advances that ambition by providing robots with a form of ‘situational awareness’, improving their adaptability in dynamic environments and moving closer to truly intelligent, autonomous operation. The elegance of this approach lies in its ability to monitor task execution and proactively identify out-of-distribution scenarios.

What Lies Ahead?

The pursuit of truly adaptable robotic systems necessitates a shift from simply reacting to novelty, to anticipating it. Robot-Conditioned Normalizing Flow represents a measured step in that direction, offering a framework for discerning subtle deviations from expected behavior. However, the elegance of a model is always inversely proportional to the volume of assumptions it quietly makes. This work, while promising, presently relies on a defined ‘normal’ derived from observed demonstrations. The crucial, and arguably more difficult, task remains: how to instill in these systems the capacity to gracefully handle the genuinely unforeseen – the situations for which no prior example exists?

Future iterations should perhaps focus less on the sophistication of the normalizing flow itself, and more on the quality – and importantly, the diversity – of the conditioning data. A system trained solely on ‘successful’ manipulations risks becoming brittle, unable to distinguish between a deliberate adaptation and a genuine error. Furthermore, exploring methods to actively query the system’s uncertainty – to prompt it to articulate why a particular action is flagged as anomalous – could provide valuable diagnostic information for both the robot and its human overseer.

Ultimately, anomaly detection is not merely a technical challenge; it’s an exercise in understanding the very nature of expertise. A truly intelligent system shouldn’t just know what is normal, but why it is normal, and what contextual factors might reasonably justify a deviation. Refactoring toward that goal is not merely a refinement of the algorithm; it is a pursuit of fundamental principles.


Original article: https://arxiv.org/pdf/2603.11106.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2026-03-15 08:26