Closing the Log Gap: AI-Powered Data Augmentation for Smarter Anomaly Detection

Author: Denis Avetisyan


A new framework leverages source code analysis and artificial intelligence to generate realistic log data, dramatically improving the accuracy of systems that detect critical errors.

AnomalyGen employs a three-stage workflow: it first captures intra-procedural logic through log-based graph pruning, subgraph extraction, and log-oriented control flow graph (LCFG) construction; it then reconstructs inter-procedural paths via stack-based simulation with chain-of-thought (CoT) semantic verification of log sequences; and it finally completes the pipeline with automatic anomaly annotation coupled with controlled data augmentation to generate high-quality labeled sessions.

AnomalyGen combines static analysis of code with large language models to create a more comprehensive dataset for log-based anomaly detection, enhancing system reliability.

Log-based anomaly detection suffers from a critical limitation: the scarcity of labeled training data representing the full spectrum of system behaviors. To address this, we present AnomalyGen: Enhancing Log-Based Anomaly Detection with Code-Guided Data Augmentation, a novel framework that synthesizes realistic log sequences directly from source code using a combination of static analysis and large language models. This approach demonstrably improves performance across diverse anomaly detection models, achieving up to 2.18% F1-score gains on benchmark datasets, and suggests that bridging the gap between code and logs is crucial for building more robust and reliable systems. A question remains: can this code-guided generation be refined further to capture even more nuanced and complex system interactions?


The Challenge of Modern System Visibility

The operational backbone of many modern digital services relies on complex, distributed systems such as those employing Hadoop Distributed File System (HDFS) and Zookeeper. These systems, while powerful, inherently produce an overwhelming deluge of log data – often exceeding terabytes daily. Each log entry represents an event within the system, and the sheer volume renders traditional, manual analysis completely impractical. A single human analyst simply cannot sift through this constant stream of information to identify deviations from normal behavior, or potential system failures. Consequently, automated anomaly detection techniques are not merely beneficial, but absolutely essential for maintaining the health and reliability of these critical infrastructures. The scale of data generation necessitates a shift from reactive, manual investigation to proactive, algorithm-driven monitoring.

Conventional anomaly detection techniques, while effective in simpler scenarios, falter when confronted with the sheer volume and intricate relationships within modern system logs. These methods often rely on predefined thresholds or patterns, proving inadequate for the dynamic and multifaceted behavior of distributed systems. Consequently, they generate a substantial number of false positives – flagging normal, yet unusual, events as anomalies. This deluge of inaccurate alerts overwhelms operators, obscuring genuine issues and diminishing trust in the detection system itself. The core challenge lies in distinguishing between benign deviations and critical failures within a sea of data, a task for which traditional approaches lack the necessary granularity and adaptability.

A significant obstacle in accurately identifying unusual behavior within complex systems like Hadoop Distributed File System (HDFS) and Zookeeper lies in the limited representation of operational data. Current datasets utilized for training anomaly detection models capture an astonishingly small fraction of the possible log templates – a mere 0.99% in HDFS and 8.02% in Zookeeper. This extreme data scarcity presents a critical coverage problem, as the vast majority of potential system states and error conditions remain unseen by these models. Consequently, the ability to generalize and reliably detect novel anomalies is severely hampered, increasing the risk of overlooking critical issues and hindering effective system maintenance. Addressing this limitation necessitates strategies to expand data representation, potentially through synthetic log generation or advanced techniques for extrapolating from existing data.
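To make the coverage figure concrete, here is a toy computation; both counts are hypothetical, chosen only to reproduce the 0.99% HDFS ratio quoted above:

```python
# Toy illustration of log-template coverage: the fraction of distinct
# templates a system can emit that actually appear in training data.
# Both counts below are hypothetical.
observed_templates = 99    # templates seen in the training logs
total_templates = 10000    # templates derivable from the source code

coverage = observed_templates / total_templates
print(f"coverage: {coverage:.2%}")  # coverage: 0.99%
```

Any model trained on such data has simply never seen more than 99% of the system's possible log events.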

Addressing the difficulties in detecting anomalies within modern distributed systems necessitates a departure from conventional methodologies. Current anomaly detection techniques falter when confronted with the sheer volume and intricate patterns of log data generated by systems like HDFS and Zookeeper, and are further hampered by limited data coverage of prevalent log templates. Consequently, research is increasingly focused on data augmentation strategies – artificially expanding existing datasets to better represent the full range of system behaviors – coupled with advanced model training techniques. These include semi-supervised learning, which leverages both labeled and unlabeled data, and generative models capable of synthesizing realistic log data that mirrors potential anomalous states. The ultimate goal is to develop robust detection systems that minimize false positives while accurately identifying genuine threats to system stability and performance, ensuring reliable operation at scale.

Log-aware node labeling and call graph pruning effectively reduce graph complexity by focusing on relevant execution paths.

Anomaly Generation: Reconstructing System Behavior

AnomalyGen is an automated framework designed to generate log sequences by analyzing system source code, eliminating the need for manual log collection or reliance on existing production logs. The system operates by extracting behavioral information directly from the code, creating synthetic logs that reflect the program’s execution paths. This approach allows for the creation of comprehensive log datasets, particularly useful in scenarios where real-world logs are insufficient or unavailable for testing and model training. The framework’s automated nature enables scalability and repeatability in log data generation, facilitating continuous integration and deployment pipelines focused on system monitoring and anomaly detection.

AnomalyGen employs static analysis to construct a model of system behavior based on control flow information. This process involves generating Call Graphs, which depict the relationships between function calls, and Log-Oriented Control Flow Graphs (LCFG), which integrate logging statements into the control flow representation. The static analysis phase examines the source code without executing it, identifying potential execution paths and logging points. Performance benchmarks indicate that static analysis of the Hadoop codebase currently requires approximately 9 minutes to complete, providing a quantifiable measure of the initial processing time for log synthesis.
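The log-aware pruning step can be sketched as a fixed-point reachability computation over the call graph: keep only functions that emit a log statement or can reach one through calls. This is an illustrative Python sketch under those assumptions, not the paper's implementation (which analyzes Java code); the function names are invented:

```python
# Minimal sketch of log-aware call-graph pruning (illustrative only):
# retain functions that log themselves or can reach a logging function.
def prune_call_graph(calls, logging_funcs):
    """calls: {caller: [callees]}; logging_funcs: funcs with log statements."""
    relevant = set(logging_funcs)
    changed = True
    while changed:                      # fixed-point iteration
        changed = False
        for caller, callees in calls.items():
            if caller not in relevant and any(c in relevant for c in callees):
                relevant.add(caller)
                changed = True
    # Drop irrelevant nodes, and edges to irrelevant callees.
    return {f: [c for c in cs if c in relevant]
            for f, cs in calls.items() if f in relevant}

calls = {"main": ["init", "serve"], "serve": ["handle"],
         "handle": ["log_request"], "init": ["alloc"], "alloc": []}
pruned = prune_call_graph(calls, {"log_request"})
print(sorted(pruned))  # ['handle', 'main', 'serve']
```

Here `init` and `alloc` are discarded because no path from them reaches a logging call, which is the sense in which pruning focuses the graph on log-relevant execution paths.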

AnomalyGen utilizes log templates to generate structured log messages, improving the overall quality and usability of the synthetic data. These templates define the format and content of each log entry, including specific fields and data types. By employing templates, AnomalyGen avoids generating free-text logs, which are difficult to parse and analyze. The system dynamically populates these templates with values extracted during static analysis of the source code, ensuring that each log message contains relevant and meaningful information related to the system’s execution. This approach results in logs that are easily ingestible by standard log processing tools and facilitate effective anomaly detection model training.
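A minimal sketch of template-based message generation, with hypothetical field names in the spirit of HDFS logs (the template text and values are invented for illustration):

```python
# Hypothetical log template: named placeholders define the structure,
# and values are filled in at generation time.
TEMPLATE = "Received block {block_id} of size {size} from {src}"

def render(template, **fields):
    # Populate the template's named fields to produce a structured message.
    return template.format(**fields)

msg = render(TEMPLATE, block_id="blk_-562", size=67108864, src="/10.0.0.5")
print(msg)
```

Because every generated message instantiates a known template, downstream parsers can recover the template and fields exactly, which is what makes the synthetic data easy to ingest.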

AnomalyGen mitigates the data coverage problem in anomaly detection by producing synthetic log sequences that represent a broad range of system behaviors. Traditional anomaly detection model training is often limited by the scarcity of labeled anomalous logs; AnomalyGen circumvents this limitation by generating a large volume of synthetic data, effectively augmenting existing datasets. This synthesized data allows for more comprehensive training and evaluation of anomaly detection models, improving their ability to identify rare and previously unseen anomalous conditions. The framework’s ability to create diverse log data increases the robustness and generalizability of trained models, leading to improved performance in real-world deployments.

Log-critical elements are extracted via AST analysis and ordered using dominance analysis to construct the LCFG.
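The AST extraction step can be illustrated with Python's standard `ast` module standing in for the Java AST analysis the paper performs on Hadoop; the source snippet and the heuristic for spotting logging calls are assumptions for the sketch:

```python
import ast

# Sketch of extracting log-critical elements from an AST: find call
# nodes whose method name looks like a logging call and record the
# line number and message template. Heuristic, illustrative only.
SOURCE = '''
def handle(req):
    logger.info("handling request %s", req.id)
    if req.bad:
        logger.error("bad request %s", req.id)
'''

log_calls = []
for node in ast.walk(ast.parse(SOURCE)):
    if (isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in {"debug", "info", "warn", "error"}):
        # (source line, first argument: the log message template)
        log_calls.append((node.lineno, node.args[0].value))

print(log_calls)  # [(3, 'handling request %s'), (5, 'bad request %s')]
```

Ordering these extracted statements by their position in the control flow (via dominance analysis, per the caption above) yields the nodes of the LCFG.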

Enhancing Realism Through Linguistic Reasoning

AnomalyGen utilizes Large Language Models (LLMs) to augment synthetic log data generation by introducing variations in parameter values. These LLMs are prompted to generate log sequences, but instead of fixed values, the system allows for plausible parameter alterations within predefined ranges. This process moves beyond simple log duplication, creating a more diverse training dataset that reflects the inherent variability of real-world system behavior. By systematically modifying parameters like request latency, CPU usage, or memory allocation, AnomalyGen generates a wider spectrum of log entries, improving the robustness and generalization capability of downstream anomaly detection models.
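A hedged sketch of the variation idea: numeric fields are resampled within predefined plausible ranges rather than copied verbatim. The field names and ranges here are invented for illustration, not taken from AnomalyGen:

```python
import random

# Hypothetical plausible ranges for numeric log parameters.
RANGES = {"latency_ms": (1, 500), "cpu_pct": (0, 100)}

def vary(entry, seed=0):
    """Return a copy of a log entry with numeric fields resampled
    within their predefined ranges; other fields are untouched."""
    rng = random.Random(seed)
    out = dict(entry)
    for field, (lo, hi) in RANGES.items():
        if field in out:
            out[field] = rng.randint(lo, hi)
    return out

base = {"event": "request_done", "latency_ms": 42, "cpu_pct": 17}
varied = vary(base)
print(varied["event"])  # request_done: non-parameter fields are unchanged
```

Constraining values to a range is what separates plausible variation from random noise; the LLM-driven version described above goes further by reasoning about which values fit the surrounding context.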

Chain-of-Thought (CoT) reasoning is implemented to constrain Large Language Model (LLM) outputs during synthetic log data generation. This technique involves prompting the LLM to explicitly articulate the logical steps leading to a specific parameter value. By requiring the LLM to first define the system state, identify relevant dependencies, and then calculate the parameter variation, CoT ensures generated values adhere to expected system behavior. This process moves beyond simple random variation and instead creates synthetic data reflecting plausible system dynamics, thereby increasing the fidelity and utility of the augmented training dataset for anomaly detection models.
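One way such a CoT constraint might be expressed is a prompt that forces the model to state its reasoning steps before emitting a value. The wording below is hypothetical, not AnomalyGen's actual prompt:

```python
# Hypothetical CoT prompt skeleton: the model must articulate system
# state and dependencies before proposing a parameter value.
def cot_prompt(caller_context, callee_path):
    steps = [
        "1. Describe the system state implied by the caller context.",
        "2. Identify dependencies between caller and callee parameters.",
        "3. Derive parameter values consistent with steps 1-2.",
    ]
    return (f"Caller context:\n{caller_context}\n"
            f"Candidate callee path:\n{callee_path}\n"
            "Reason step by step before answering:\n" + "\n".join(steps))

prompt = cot_prompt("NameNode under heavy write load",
                    "replicateBlock -> sendAck")
print(prompt.splitlines()[0])  # Caller context:
```

The same caller/callee framing is what Phase II uses when the LLM checks whether a candidate inter-procedural path is logically consistent.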

The incorporation of LLM-generated synthetic logs, alongside real-world data, demonstrably enhances the generalization capability of anomaly detection models. Traditional models trained solely on existing logs often struggle with novel or previously unseen system behaviors. By exposing the model to a broader range of plausible parameter variations created by the LLM, the system learns to differentiate between legitimate operational fluctuations and true anomalies with greater accuracy. This improved generalization is crucial for maintaining reliable anomaly detection in dynamic environments where system configurations and operational patterns are constantly evolving, reducing false positives and improving the identification of genuine threats or failures.

The combination of synthetically generated log sequences and authentic log data creates a training dataset with increased size and diversity, demonstrably improving the performance of anomaly detection models. This approach addresses the common limitation of relying solely on real-world logs, which often lack sufficient examples of anomalous behavior. By augmenting real data with synthetic anomalies produced by AnomalyGen, the resulting dataset provides a more comprehensive representation of possible system states, enabling anomaly detection models to generalize more effectively to previously unseen anomalies and reducing false positive rates. The increased robustness is achieved through exposure to a wider range of operational conditions and failure modes during the training process.
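Controlled augmentation can be sketched as mixing synthetic sessions into the real training set at a fixed ratio; the 0.3 ratio and sample names here are illustrative assumptions, not values from the paper:

```python
import random

def augment(real, synthetic, ratio=0.3, seed=0):
    """Return the real training set plus ratio * len(real) synthetic
    sessions, sampled without replacement (a controlled mix, not a dump)."""
    rng = random.Random(seed)
    k = min(len(synthetic), int(ratio * len(real)))
    return real + rng.sample(synthetic, k)

real = [f"real_{i}" for i in range(100)]
synthetic = [f"syn_{i}" for i in range(50)]
train = augment(real, synthetic)
print(len(train))  # 130
```

Keeping the ratio explicit matters: the figure caption below notes that detector performance varies with the augmentation ratio, so the mix is a tunable knob rather than "more is better".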

During Phase II, a large language model assesses the logical consistency between a caller context and a potential callee path to determine appropriate parameter values.

Deep Learning and the Future of System Resilience

Deep learning models designed for anomaly detection often struggle with limited and imbalanced datasets, hindering their ability to generalize and accurately identify rare, critical events. AnomalyGen addresses this challenge by automatically generating a synthetic dataset that meaningfully augments existing log data. This expansion isn’t simply about increasing volume; the generated anomalies are carefully crafted to reflect realistic system behaviors and edge cases, effectively exposing the learning model to a wider range of potential issues. Consequently, the performance of these models is significantly enhanced, allowing for more reliable identification of anomalies and a reduction in false positives – a crucial benefit in complex systems where timely and accurate alerts are paramount. The enriched dataset provides a more robust foundation for deep learning, ultimately leading to more effective and dependable anomaly detection capabilities.

Effective anomaly detection within complex systems relies on understanding not just what happened, but also when and why. Recent advancements leverage both sequence-aware and semantic encoding techniques to achieve this. Sequence-aware encoding analyzes the chronological order of log events, capturing temporal dependencies crucial for identifying deviations from normal behavior: a sudden spike in errors after a specific update, for example. Complementing this, semantic encoding delves into the meaning of each log message, recognizing contextual relationships between different events. By combining these approaches, models can discern anomalies that might be missed by considering only timing or content in isolation; a rare combination of events, even if individually benign, could signal a developing issue. This dual encoding strategy allows systems to move beyond simple pattern matching toward a more nuanced understanding of system behavior, ultimately improving detection accuracy and reducing false positives.
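A toy sketch of this dual encoding, pairing an order-based feature with a token-count semantic feature for one log session (purely illustrative, not a real model's encoder):

```python
from collections import Counter

# One hypothetical log session: messages in arrival order.
session = ["open file", "write block", "write block", "close file"]

# Sequence-aware feature: each event's position in time.
seq_feature = list(range(len(session)))

# Semantic feature: token counts across the session's messages.
sem_feature = Counter(tok for msg in session for tok in msg.split())

print(seq_feature)           # [0, 1, 2, 3]
print(sem_feature["write"])  # 2
```

A real detector would replace both features with learned embeddings, but the division of labor is the same: one view preserves order, the other preserves content.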

The developed anomaly detection framework distinguishes itself through adaptability, functioning effectively within both supervised and unsupervised learning contexts. This duality proves crucial for practical deployment, as labeled anomaly data is often scarce and expensive to obtain. Supervised learning, when feasible, leverages existing labels to train models for precise identification, while the unsupervised approach excels in scenarios where anomalies are previously unknown, relying on the model’s ability to discern deviations from established norms. This flexibility allows organizations to tailor the system to their specific needs and data availability, enabling proactive monitoring and rapid response to critical issues regardless of the availability of pre-defined anomaly classifications.

Evaluations of the AnomalyGen framework reveal a marked advancement in anomaly detection capabilities within critical system logs. Specifically, performance metrics demonstrate a significant increase in F1-score – a measure of precision and recall – achieving up to a 15.2% gain when analyzing logs from a Hadoop Distributed File System (HDFS). This improvement extends to Zookeeper logs, where a 13.0% enhancement in F1-score was observed. These results indicate that AnomalyGen not only identifies a greater proportion of actual anomalies, but also minimizes the occurrence of false positives, thereby providing a more reliable and effective solution for maintaining system health and stability. The gains represent a substantial leap forward in the field, offering the potential for proactive identification and mitigation of issues before they escalate into major disruptions.
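For reference, the F1-score is the harmonic mean of precision and recall; the precision and recall values below are hypothetical and serve only to show the computation:

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall: penalizes imbalance
    # between the two more than a plain average would.
    return 2 * precision * recall / (precision + recall)

# Hypothetical detector: 80% precision, 70% recall.
print(round(f1(0.80, 0.70), 3))  # 0.747
```

Because F1 couples the two quantities, a gain like the 15.2% reported for HDFS implies the detector improved without trading recall away for precision or vice versa.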

Supervised deep learning performance varies with the augmentation ratio, demonstrating the impact of data augmentation techniques on model accuracy.

AnomalyGen embodies a principle of refined design, tackling the pervasive issue of insufficient data in log anomaly detection. The framework doesn’t simply add complexity; it strategically augments existing data with code-guided generation, acknowledging that comprehensive coverage stems from focused, relevant additions. This approach aligns with a core tenet of efficient systems: achieving maximum impact with minimal overhead. As Linus Torvalds once stated, “Most programmers think that if their code works, it is already good enough.” AnomalyGen demonstrates that true reliability isn’t merely about functionality, but about proactively addressing potential weaknesses through meticulously crafted data, ensuring a more robust and insightful detection process.

What Lies Ahead?

The pursuit of comprehensive log anomaly detection benchmarks appears, predictably, to require ever more data. AnomalyGen offers a pragmatic, if temporary, respite from this hunger, translating code structure into plausible execution traces. However, the elegance of code-guided generation should not obscure a fundamental limitation: the generated logs, however realistic, remain tethered to known code paths. True anomalies, those born of unforeseen interactions or emergent behavior, will inevitably lie outside this constructed reality.

Future work must address the gap between synthetic coverage and genuine novelty. Perhaps the focus should shift from generating more logs to actively seeking the unexplored edges of system behavior, using formal methods or reinforcement learning to probe for conditions that existing code paths fail to anticipate. The ideal solution will not merely expand the training set, but fundamentally redefine the problem, moving beyond pattern recognition to a more nuanced understanding of system intent and deviation.

Ultimately, the value of AnomalyGen resides not in its ability to solve log anomaly detection, but in its demonstration that the problem is, at its core, a problem of knowledge-of what constitutes normal behavior, and of what lies beyond the horizon of our current understanding. The next step is not more data, but a more rigorous epistemology of system reliability.


Original article: https://arxiv.org/pdf/2604.11107.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-04-14 18:22