Author: Denis Avetisyan
A new open-source framework provides a critical assessment of graph-based time series anomaly detection, revealing key factors for reliable performance.

This review introduces a unified evaluation approach, highlighting the impact of graph structure, robust metrics, and interpretable anomaly scoring.
Despite growing interest in applying graph neural networks (GNNs) to time series anomaly detection (TSAD), a lack of standardized evaluation frameworks hinders reliable progress and comparative analysis. This work, ‘GNNs for Time Series Anomaly Detection: An Open-Source Framework and a Critical Evaluation’, addresses this gap by introducing a flexible, open-source framework for reproducible experimentation alongside a critical examination of common evaluation practices. Our results demonstrate that GNNs not only enhance detection performance but also offer improved interpretability, particularly with attention-based architectures and robust graph structures. How can we best leverage these insights to develop more trustworthy and insightful graph-based TSAD systems for real-world applications?
The Inherent Instability of Conventional Anomaly Detection
Conventional anomaly detection techniques, designed for simpler datasets, increasingly falter when applied to the intricate patterns found in modern time series. The high dimensionality of modern data – often dozens or hundreds of interacting variables – coupled with the subtle, non-linear relationships within the data, overwhelms these methods. Consequently, algorithms frequently generate false positives, flagging normal fluctuations as anomalies, or, more critically, miss genuine, impactful events. This is because traditional approaches often rely on statistical assumptions about the data distribution that are violated by the complex dependencies inherent in real-world time series – for example, assuming independence between data points when, in reality, a value at one time step is heavily influenced by preceding values. The resulting inaccuracies diminish the reliability of automated monitoring systems and necessitate costly manual review, hindering effective decision-making in fields ranging from predictive maintenance to fraud detection.
The capacity to reliably identify anomalous patterns within data streams is increasingly vital across a surprisingly broad spectrum of disciplines. In industrial settings, anomaly detection serves as a cornerstone of predictive maintenance, flagging deviations from normal operation that could signal equipment failure and prevent costly downtime. Simultaneously, in the realm of cybersecurity, these techniques are essential for identifying malicious activity, such as network intrusions or data breaches, by pinpointing unusual traffic patterns or system behaviors. This demand extends to financial markets, where anomalies can indicate fraudulent transactions, and even to environmental monitoring, where unexpected shifts in data can signal ecological disturbances. Consequently, the development of robust and scalable anomaly detection solutions is not merely an academic pursuit, but a practical necessity for safeguarding critical infrastructure, protecting sensitive information, and ensuring operational resilience across numerous sectors.
Reliable anomaly detection in time series data fundamentally depends on accurately modeling the inherent temporal dependencies and complex relationships within the data stream. Unlike static datasets, time series exhibit autocorrelation – where past values influence future ones – and often involve intricate, non-linear interactions between variables. Consequently, methods that treat each data point in isolation, or rely solely on statistical measures like mean and standard deviation, frequently fail to distinguish genuine anomalies from normal fluctuations. Advanced techniques, such as recurrent neural networks and state-space models, are increasingly employed to capture these temporal dynamics, effectively learning the expected patterns and identifying deviations that signal unusual or critical events. The ability to discern these subtle relationships is not merely a matter of improved accuracy; it is essential for proactive monitoring, predictive maintenance, and timely intervention across a range of applications, from financial markets to climate modeling.
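To make the point concrete, consider a minimal toy (our illustration, not drawn from the paper): a series that alternates between +1 and -1 contains an anomaly whose value is perfectly ordinary but which breaks the temporal pattern. A point-wise z-score detector misses it entirely; a one-step residual that encodes the temporal rule catches it immediately.

```python
# Toy illustration: a contextual anomaly that point-wise statistics miss.
# The series alternates +1/-1; the anomaly at index 5 repeats +1, which is
# normal in value but breaks the temporal pattern.
series = [1, -1, 1, -1, 1, 1, -1, 1, -1, 1]

mean = sum(series) / len(series)
std = (sum((x - mean) ** 2 for x in series) / len(series)) ** 0.5

# Point-wise detector: flag |z| > 2. Nothing is flagged.
zscore_flags = [i for i, x in enumerate(series) if abs((x - mean) / std) > 2]

# Temporal detector: under alternation, x[t] should equal -x[t-1], so the
# residual x[t] + x[t-1] is 0 for normal steps and 2 at the anomaly.
residual_flags = [t for t in range(1, len(series))
                  if abs(series[t] + series[t - 1]) > 1]

print(zscore_flags)    # []
print(residual_flags)  # [5]
```

The same logic scales up: the advanced models mentioned above learn the expected temporal rule from data instead of hard-coding it.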

Graph Neural Networks: A Mathematically Sound Approach to Time Series Analysis
Deep learning methodologies, and specifically Graph Neural Networks (GNNs), provide a robust approach to time series modeling by explicitly representing the interdependencies within the data. Traditional time series analysis often treats data points as independent or relies on simplistic assumptions about their relationships; however, many real-world time series exhibit complex correlations. GNNs address this by constructing a graph where individual time series variables are nodes, and the edges represent statistical dependencies – such as Granger causality or correlation – between those variables. This graph-based representation allows the model to learn patterns not only from the individual time series but also from the relationships between them, leading to improved accuracy in forecasting, classification, and anomaly detection tasks. The ability to capture these dependencies is particularly valuable in domains where variables are heavily interconnected, such as financial markets, climate modeling, and complex industrial processes.
Graph Neural Networks (GNNs) facilitate time series anomaly detection by transforming sequential data into a graph structure. In this representation, individual time series variables are designated as nodes, and the statistical dependencies – such as correlation or Granger causality – between these variables are modeled as edges connecting the nodes. The strength or weight of each edge can reflect the magnitude of the relationship. This graph-based approach allows GNNs to capture complex interdependencies within the time series data, which traditional methods may miss. By analyzing the graph’s structure and node features, the model can identify subtle anomalies that manifest as unusual patterns or deviations in the relationships between variables, thereby improving detection capabilities.
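As a concrete sketch of this graph construction (our simplification – GraGOD and the methods it surveys may use richer dependency estimators than plain correlation), one can threshold the absolute Pearson correlation matrix of a multivariate series to obtain an adjacency matrix over the variables:

```python
import numpy as np

# Sketch: build a graph over time series variables by thresholding absolute
# Pearson correlation. Variables are nodes; strong dependencies become edges.
rng = np.random.default_rng(0)
n_steps = 200
s0 = rng.normal(size=n_steps)
s1 = s0 + 0.1 * rng.normal(size=n_steps)   # strongly coupled to s0
s2 = rng.normal(size=n_steps)              # independent series
data = np.stack([s0, s1, s2])              # shape: (variables, time)

corr = np.corrcoef(data)                   # pairwise dependency estimates
adjacency = (np.abs(corr) > 0.5).astype(int)
np.fill_diagonal(adjacency, 0)             # no self-loops

print(adjacency)                           # edge only between s0 and s1
```

In practice the edge weights themselves (rather than a binary threshold) are often kept, so that the GNN can modulate message passing by dependency strength.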
Anomaly detection using Graph Neural Networks (GNNs) relies on establishing a baseline representation of normal graph structure and feature distributions. The GNN is trained on historical time series data represented as graphs, learning the typical relationships between variables and their expected characteristics. During inference, deviations from this learned normal structure – such as unexpected changes in edge weights, the appearance of novel connections, or significant alterations in node feature values – are flagged as anomalies. This approach allows for the identification of anomalies based on relational context, rather than solely on individual variable thresholds, which demonstrably improves detection accuracy and minimizes false positive rates by accounting for the inherent dependencies within the time series data.
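A minimal sketch of this idea (ours, not the framework's actual scorer): learn a baseline correlation structure from normal training windows, then score a new window by how far its relationships deviate from that baseline.

```python
import numpy as np

# Relational anomaly scoring sketch: the "normal" structure is a correlation
# matrix learned from training windows; a window is anomalous when its own
# correlation matrix deviates strongly from the baseline.
rng = np.random.default_rng(1)

def window(coupled: bool, n=100):
    a = rng.normal(size=n)
    b = a + 0.1 * rng.normal(size=n) if coupled else rng.normal(size=n)
    return np.stack([a, b])

baseline = np.corrcoef(window(coupled=True))   # learned normal structure

def score(w):
    # Frobenius norm of the deviation from the baseline relationships.
    return np.linalg.norm(np.corrcoef(w) - baseline)

normal_score = score(window(coupled=True))     # relations intact: low score
anomaly_score = score(window(coupled=False))   # coupling broken: high score
print(normal_score, anomaly_score)
```

Note that neither variable individually exceeds any threshold here; only the broken relationship between them is anomalous, which is exactly what a per-variable detector cannot see.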

GraGOD: A Framework Rooted in Modular Design and Mathematical Rigor
GraGOD is an open-source framework designed to simplify the process of developing and assessing graph-based time series anomaly detection methods. It provides a unified structure encompassing data loading, graph construction, model training, and anomaly scoring. This unified approach contrasts with prior work often requiring custom implementations for each stage. The framework’s modularity is achieved through clearly defined interfaces between components, allowing researchers to substitute different graph neural network (GNN) architectures, anomaly detection algorithms, and datasets without significant code modification. This design accelerates research by reducing the engineering effort required to test new ideas and facilitates standardized evaluation of different approaches within a consistent environment, improving reproducibility.
GraGOD leverages the PyTorch deep learning framework to facilitate the implementation and evaluation of diverse graph neural network (GNN) architectures for time series anomaly detection. Supported GNNs include, but are not limited to, Graph Convolutional Networks (GCNs), Graph Attention Networks (GATs), and GraphSAGE. The framework accommodates both reconstruction-based anomaly detection methods, which identify anomalies as deviations from learned representations of normal behavior, and forecasting-based approaches that predict future time steps and flag significant prediction errors as anomalies. This architectural flexibility allows for comparative analysis and the combination of different anomaly detection paradigms within a unified system.
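The two scoring paradigms can be sketched in a few lines (an illustration with deliberately naive stand-in models – a moving-average "reconstructor" and a persistence "forecaster" – rather than actual GNNs):

```python
import numpy as np

# Reconstruction-based vs. forecasting-based anomaly scoring in miniature.
t = np.arange(100)
x = np.sin(0.2 * t)
x[50] += 3.0                       # injected spike anomaly

# Reconstruction-based: residual against a centered moving average.
kernel = np.ones(5) / 5
recon = np.convolve(x, kernel, mode="same")
recon_score = np.abs(x - recon)

# Forecasting-based: error of the naive forecast x_hat[t] = x[t-1]
# (forecast_score[i] corresponds to time step i + 1).
forecast_score = np.abs(x[1:] - x[:-1])

# Both scores peak at or immediately next to the injected spike.
print(int(np.argmax(recon_score)), int(np.argmax(forecast_score)) + 1)
```

In the framework proper, the reconstructor or forecaster is a GNN conditioned on the learned graph, but the scoring logic – residual magnitude against the model's notion of normal – is the same.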
GraGOD’s modularity is achieved through a component-based architecture, enabling researchers to isolate and swap individual elements such as graph neural network (GNN) layers, embedding methods, anomaly scoring functions, and data loaders. This design facilitates experimentation with diverse configurations without requiring extensive code modification; new components can be integrated via a defined interface. Configuration is further streamlined through a parameter-driven system, allowing for systematic variation of hyperparameters and architectural choices. The resulting flexibility promotes both rapid prototyping of novel anomaly detection approaches and reproducible research, as specific configurations can be precisely documented and replicated.
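A hypothetical sketch of what such component interfaces might look like (the names `GraphBuilder`, `Scorer`, and the toy implementations below are our invention for illustration, not GraGOD's actual API):

```python
from typing import Protocol
import numpy as np

# Swappable pipeline stages: any class satisfying the protocol can be
# plugged in without touching the rest of the pipeline.
class GraphBuilder(Protocol):
    def build(self, data: np.ndarray) -> np.ndarray: ...

class Scorer(Protocol):
    def score(self, data: np.ndarray, graph: np.ndarray) -> np.ndarray: ...

class CorrelationGraph:
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
    def build(self, data):
        adj = (np.abs(np.corrcoef(data)) > self.threshold).astype(float)
        np.fill_diagonal(adj, 0.0)
        return adj

class NeighborResidualScorer:
    """Score each step by each variable's deviation from its graph neighbors."""
    def score(self, data, graph):
        deg = np.maximum(graph.sum(axis=1, keepdims=True), 1.0)
        neighbor_mean = (graph @ data) / deg
        return np.abs(data - neighbor_mean).max(axis=0)

def run_pipeline(data, builder: GraphBuilder, scorer: Scorer):
    return scorer.score(data, builder.build(data))

rng = np.random.default_rng(2)
base = rng.normal(size=300)
data = np.stack([base, base + 0.1 * rng.normal(size=300)])
data[0, 150] += 5.0                      # break the relation at t = 150
scores = run_pipeline(data, CorrelationGraph(), NeighborResidualScorer())
print(int(np.argmax(scores)))            # 150
```

Because each stage is addressed only through its interface, swapping `CorrelationGraph` for a learned graph, or the residual scorer for a GNN-based one, requires no change to `run_pipeline` – which is the reproducibility payoff the framework is after.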

Rigorous Validation: Demonstrating GraGOD’s Superior Performance
GraGOD’s efficacy has been substantiated through rigorous testing on established benchmark datasets, specifically TELCO (telecommunications time series) and SWaT (the Secure Water Treatment testbed). Comparative analyses consistently demonstrate that GraGOD surpasses the performance of existing anomaly detection methods in identifying and localizing anomalous events within these complex systems. These evaluations weren’t limited to simple accuracy; the framework’s ability to reliably detect a wider range of anomaly types and minimize false positives contributed to its superior standing. The consistently strong results across both datasets suggest GraGOD’s robustness and adaptability to diverse operational environments, marking a significant advancement in monitoring complex systems against evolving faults and cyber threats.
A robust evaluation of GraGOD’s anomaly detection capabilities relies on a suite of performance metrics designed to offer a comprehensive picture of its strengths. Traditional measures like Precision and Recall assess the accuracy of identified anomalies, while the F1-Score provides their harmonic mean. However, recognizing the limitations of these classification-based metrics for continuous time series, researchers also utilize Volume Under Surface (VUS), a range-aware generalization of the area under the receiver operating characteristic curve – expressed as VUS-ROC – that measures the model’s ability to discriminate between normal and anomalous behavior across detection thresholds and label-tolerance settings. This focus on VUS-ROC as the primary comparison metric ensures a more nuanced and reliable assessment of performance, particularly when dealing with imbalanced datasets or subtle anomalies that might be missed by simpler classification scores.
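For reference, the classification-based metrics reduce to a few lines over point-wise binary labels (standard definitions, shown here for intuition):

```python
# Point-wise precision, recall, and F1 from binary anomaly labels.
def prf1(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [0, 0, 1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [0, 1, 1, 1, 0, 0, 0, 1, 0, 0]
p, r, f = prf1(y_true, y_pred)
print(p, r, f)   # 0.75 0.75 0.75
```

The limitation VUS addresses is visible even here: these numbers depend entirely on the single thresholded prediction `y_pred`, whereas VUS-ROC integrates discrimination quality over the whole operating range.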
The selection of an optimal detection threshold is often guided by the F1-score, a metric balancing precision and recall; however, studies reveal this approach can be misleading. A high F1-score during model validation doesn’t consistently guarantee robust performance when applied to unseen test data. This discrepancy arises because the F1-score fixes a single operating point – one particular trade-off between correctly identified anomalies and false alarms – and ignores the model’s behavior at every other threshold. Consequently, comprehensive evaluation metrics like Volume Under Surface (VUS) offer a more holistic assessment of anomaly detection capabilities, aggregating detection quality over a range of thresholds and tolerance levels and providing a more reliable indicator of a model’s true performance, particularly when dealing with imbalanced datasets or nuanced anomaly characteristics.
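The fragile practice described above – sweeping candidate thresholds and keeping the one that maximizes validation F1 – looks like this in miniature (our sketch, with made-up scores):

```python
# Threshold selection by maximizing F1 on a validation set.
def f1(y_true, y_pred):
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(scores, labels):
    candidates = sorted(set(scores))
    return max(candidates,
               key=lambda th: f1(labels, [s >= th for s in scores]))

val_scores = [0.1, 0.2, 0.9, 0.3, 0.8, 0.1]
val_labels = [0,   0,   1,   0,   1,   0]
th = best_threshold(val_scores, val_labels)
print(th)   # 0.8: perfect F1 on validation, but no guarantee on test data
```

The chosen threshold is perfectly tuned to these six validation points; nothing in the procedure bounds how it behaves once the score distribution shifts on test data, which is precisely the gap the VUS family of metrics is meant to expose.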
Analysis reveals a crucial disconnect between standard regression loss minimization and effective anomaly detection, as evidenced by the varying Pearson correlation coefficients between validation loss and Volume Under Surface (VUS) across different model architectures. Graph Convolutional Networks (GCNs) exhibited a strong negative correlation, indicating that minimizing regression loss effectively drove improvements in anomaly detection performance – lower loss consistently corresponded to higher VUS. Conversely, Graph Deviation Networks (GDNs) and Gated Recurrent Units (GRUs) displayed weak or even positive correlations, suggesting that simply reducing regression loss did not reliably translate to better anomaly detection capabilities. This discrepancy highlights the limitations of relying solely on regression loss for training anomaly detection models and underscores the need for metrics like VUS to directly assess and optimize performance on the specific task of identifying anomalous behaviors.
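The diagnostic itself is straightforward to reproduce (the per-epoch numbers below are invented for illustration; a strongly negative coefficient corresponds to the "lower loss, better detection" behavior reported for the GCN):

```python
import numpy as np

# Correlate per-epoch validation loss with the detection metric (VUS).
val_loss = np.array([0.90, 0.71, 0.55, 0.42, 0.33, 0.28])   # per epoch
vus      = np.array([0.52, 0.58, 0.66, 0.71, 0.74, 0.76])

r = np.corrcoef(val_loss, vus)[0, 1]
print(round(r, 3))   # strongly negative: loss is a usable training proxy
```

For architectures where this coefficient is weak or positive, early stopping and model selection on validation loss are silently optimizing the wrong objective.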

Future Directions: Expanding the Mathematical Foundation of Anomaly Detection
Future development of the GraGOD framework prioritizes the integration of contrastive learning techniques to refine its discriminatory capabilities. This approach aims to train the model to not only identify anomalous patterns, but to distinctly differentiate them from normal operational states by embedding similar normal instances closer together in a feature space, while simultaneously pushing anomalous instances further away. By explicitly teaching the model what constitutes ‘normal’ through comparative examples, researchers anticipate a significant improvement in both the accuracy and robustness of anomaly detection, particularly in scenarios involving subtle or previously unseen deviations from established baselines. This focus on contrastive representation learning promises a more nuanced and reliable system for identifying and responding to critical events within complex industrial control systems.
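A minimal InfoNCE-style loss conveys the mechanics (our sketch of the proposed direction, not code from the framework): the anchor's similarity to a "normal" positive is pushed up relative to anomalous negatives, so the loss is small when normal instances already sit close together.

```python
import numpy as np

# InfoNCE-style contrastive loss over embedding vectors.
def info_nce(anchor, positive, negatives, temperature=0.1):
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()                 # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

anchor = np.array([1.0, 0.0])
close_pos = np.array([0.9, 0.1])           # normal instance near the anchor
far_pos = np.array([0.0, 1.0])             # poorly embedded positive
negatives = [np.array([0.8, 0.6]), np.array([0.5, 0.9])]

print(info_nce(anchor, close_pos, negatives),
      info_nce(anchor, far_pos, negatives))
```

Minimizing this loss therefore reshapes the embedding space exactly as described above: normal instances cluster, anomalies are pushed away, and the anomaly score can become a simple distance in that space.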
The accurate depiction of temporal relationships within complex systems is crucial for effective anomaly detection, and future work aims to refine how GraGOD constructs its foundational graphs. Current methods may not fully capture the nuanced dependencies between variables over time; therefore, researchers are investigating more sophisticated techniques, such as the Meinshausen-Bühlmann Method. This approach, rooted in graphical model selection, focuses on identifying direct dependencies between variables, effectively streamlining the graph and reducing noise. By prioritizing these direct connections, the framework can better represent the system’s true temporal dynamics, potentially leading to earlier and more precise identification of anomalous behavior. This refined graph construction promises to improve not only the accuracy of GraGOD but also its computational efficiency by focusing on the most critical relationships within the data.
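The Meinshausen-Bühlmann procedure itself is compact: lasso-regress each variable on all the others and draw an edge wherever a coefficient survives the penalty. A self-contained sketch (with a bare-bones coordinate-descent lasso, illustrative rather than tuned):

```python
import numpy as np

# Lasso via coordinate descent on standardized columns:
# minimize (1/2n)||y - Xb||^2 + lam * ||b||_1.
def lasso(X, y, lam, n_iter=200):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ resid / n
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0)  # soft threshold
    return beta

def mb_graph(data, lam=0.2):
    # Meinshausen-Buhlmann neighborhood selection over (n_samples, n_vars).
    X = (data - data.mean(0)) / data.std(0)
    p = X.shape[1]
    adj = np.zeros((p, p), dtype=int)
    for i in range(p):
        others = [j for j in range(p) if j != i]
        beta = lasso(X[:, others], X[:, i], lam)
        for j, b in zip(others, beta):
            if b != 0.0:
                adj[i, j] = adj[j, i] = 1   # "OR" rule for neighborhoods
    return adj

rng = np.random.default_rng(3)
x0 = rng.normal(size=300)
x1 = rng.normal(size=300)                   # unrelated to the others
x2 = x0 + 0.1 * rng.normal(size=300)        # direct dependency on x0
adj = mb_graph(np.column_stack([x0, x1, x2]))
print(adj)                                  # edge only between x0 and x2
```

The appeal for graph construction is exactly the streamlining described above: the L1 penalty zeroes out spurious, indirect dependencies, leaving a sparse graph of direct relationships for the GNN to operate on.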
The integration of attention mechanisms represents a significant step toward more insightful and accurate anomaly detection within complex systems. Inspired by models like the Graph Deviation Network, these mechanisms don’t simply identify that an anomaly exists, but highlight where within the system it originates and propagates. Qualitative analyses reveal a compelling pattern: attention consistently focuses on physically connected nodes conforming to the pre-defined Supervisory Control and Data Acquisition (SCADA) topology of the system – essentially, the known flow of operations. This focus isn’t arbitrary; it aligns with the expected pathways of information and control, providing a degree of interpretability often absent in ‘black box’ anomaly detection algorithms. By emphasizing these physically connected components, the system offers a clearer explanation of anomalous behavior, improving trust and facilitating faster, more informed responses to critical events.
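The interpretability check reduces to comparing two matrices: how much of each node's attention mass lands on its physical neighbors. The attention weights and SCADA adjacency below are made-up stand-ins for a learned model and a real plant topology.

```python
import numpy as np

# Row-normalized attention (rows sum to 1) and the physical process topology.
attention = np.array([
    [0.00, 0.85, 0.15],
    [0.45, 0.00, 0.55],
    [0.10, 0.80, 0.10],
])
scada_adj = np.array([      # 1 = physically connected (chain 0 - 1 - 2)
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
])

# Per-node fraction of attention mass on physically connected nodes.
on_topology = (attention * scada_adj).sum(axis=1)
print(on_topology)
```

High values of this overlap are what the qualitative analyses above report: the learned attention recovers the known flow of operations rather than arbitrary correlations, which is what makes the resulting anomaly explanations trustworthy.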
The pursuit of reliable anomaly detection, as detailed in this framework, necessitates a rigorous foundation akin to mathematical proof. The paper highlights the critical role of evaluation metrics – not merely observing performance on a dataset, but establishing invariants that guarantee robustness across varying conditions. This echoes Barbara Liskov’s assertion: “It’s important to design programs so that changing one part doesn’t break another part.” A well-defined evaluation suite, capable of exposing weaknesses in graph construction or anomaly scoring, serves as that invariant – a means of verifying the correctness of the system, moving beyond empirical observation to demonstrable truth. The study’s emphasis on interpretable anomaly scoring further reinforces this principle; a solution’s validity isn’t simply indicated by its output, but by the demonstrable logic underpinning its conclusions.
What Lies Ahead?
The pursuit of anomaly detection in time series data, particularly when leveraging graph neural networks, has, as this work demonstrates, become less a question of architectural novelty and more a demand for mathematical rigor. The presented evaluation framework, while a necessary step, merely exposes the fragility inherent in relying on statistically convenient, rather than provably correct, methodologies. A statistically significant improvement on a benchmark, after all, does not equate to a solution immune to adversarial perturbations or shifts in underlying data distributions.
Future efforts must prioritize the development of anomaly scoring functions possessing demonstrable properties. Simply assigning a scalar value to a time series segment and applying an arbitrary threshold is, frankly, an exercise in controlled guessing. A preferable approach would involve constructing scores directly linked to quantifiable measures of deviation from expected graph behavior – perhaps through the formulation of invariants or the application of information-theoretic bounds. Interpretability, as highlighted, is not merely a post-hoc justification but an intrinsic requirement for trust.
Ultimately, the field requires a paradigm shift. The focus should move beyond ‘black box’ predictive accuracy and toward the construction of models capable of certifying the absence of anomalies within defined confidence intervals. Until then, the promise of truly reliable graph-based anomaly detection will remain tantalizingly out of reach – a beautiful mathematical idea hampered by a lack of demonstrable correctness.
Original article: https://arxiv.org/pdf/2603.09675.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/