Author: Denis Avetisyan
Researchers have released a comprehensive dataset of function call graphs, exposing limitations in current malware classification techniques and highlighting the evolving sophistication of Android threats.

The new dataset, BCG, demonstrates that existing benchmarks overestimate performance on modern Android malware and that function call graph classification remains a significant challenge.
While function call graphs (FCGs) offer a promising behavioral abstraction for malware detection, evaluations often rely on datasets that fail to capture the diversity of modern threats. This paper introduces ‘Better Call Graphs: A New Dataset of Function Call Graphs for Malware Classification’, presenting BCG, a comprehensive collection of large and unique FCGs extracted from recent Android applications. Our analysis reveals that existing datasets overestimate classification performance and that accurately identifying contemporary Android malware using graph-based methods remains a significant challenge. Will this new resource enable more robust and generalizable malware detection systems?
The Expanding Attack Surface: Android’s Vulnerability
Android’s pervasive presence in the mobile landscape, commanding the largest market share globally, unfortunately positions it as a primary target for malicious software. This dominance isn’t merely a matter of popularity; it represents a significantly expanded attack surface for cybercriminals. The sheer volume of Android devices – billions worldwide – creates a compelling economic incentive for malware developers, who seek to compromise these devices for financial gain, data theft, or to leverage them in botnets. Consequently, users face a constant and escalating security risk, with new threats emerging daily, ranging from ransomware and spyware to sophisticated Trojans designed to evade detection. The open nature of the Android ecosystem, while fostering innovation, also introduces vulnerabilities that attackers actively exploit, necessitating robust security measures and vigilant user practices.
The proliferation of Android malware presents a considerable challenge to conventional security measures, particularly those reliant on signature-based detection. These systems operate by identifying malicious software through comparisons with a database of known threats; however, contemporary malware developers routinely employ techniques to evade such defenses. Polymorphism, for instance, allows malware to alter its code while retaining its functionality, generating a virtually limitless number of variants that bypass signature matching. Similarly, rapid evolution sees new malware strains, or significant modifications to existing ones, appearing daily, outpacing the ability of signature databases to remain current. Consequently, signature-based detection, while still a component of many security solutions, proves increasingly inadequate against the dynamic and sophisticated landscape of Android threats, necessitating more advanced behavioral analysis techniques.
Effective Android malware detection increasingly relies on a detailed examination of application behavior, moving beyond simple signature matching. This necessitates a deep dive into the code itself, tracing the execution flow and identifying potentially malicious actions. Researchers are focusing on understanding how an application functions, rather than simply what it is, to uncover hidden threats. This internal analysis involves dissecting code structure, monitoring API calls, and tracking data flow, allowing for the identification of malicious patterns even in previously unseen or obfuscated malware. The complexity arises from the intricate interactions between different code components; a seemingly benign function can become dangerous when combined with others, demanding a holistic understanding of the application’s internal logic for accurate threat assessment.
Current malware detection techniques frequently overlook the intricate relationships between different functions within an Android application, creating a vulnerability exploited by increasingly sophisticated threats. These methods often isolate and analyze functions in a fragmented manner, failing to recognize malicious intent hidden within the coordinated execution of seemingly benign components. A threat actor can deliberately design an application where no single function appears harmful, yet the combined actions of multiple functions, interacting in a specific sequence, carry out malicious activities – such as data exfiltration or unauthorized access. This inter-functional complexity demands a more holistic approach to analysis, one that maps and understands the dynamic call graph and data flow within an application to accurately identify and neutralize advanced malware.
Unveiling Behavior: The Function Call Graph as Blueprint
Function Call Graphs (FCGs) represent the behavior of Android applications by mapping the calling relationships between their functions. Each node in the graph corresponds to a function within the application’s code, and a directed edge signifies a call from one function to another. This representation moves beyond inspecting functions in isolation: it records which functions invoke which others, tracing the possible execution paths through the application. The resulting graph provides a structured account of how different components interact, enabling analysis of control flow and inter-function dependencies. This behavioral abstraction is crucial for understanding what an application can do in response to various inputs and events.
Simple static scanning of Android applications examines code without execution, flagging potential threats based on known patterns and surface-level structure. Function Call Graphs (FCGs) deepen this analysis by modeling the relationships between functions, detailing which methods can invoke which others across the entire application. This structural view reveals behaviors not detectable through pattern matching alone, such as call chains that funnel data toward sensitive APIs, the use of reflection or dynamic code loading, or logic that is only reachable through specific user actions or system events. Consequently, FCGs provide a more comprehensive understanding of an application’s operational characteristics and are crucial for identifying malicious activities that remain hidden during superficial code review.
Function Call Graph (FCG) structure directly reflects how an Android application is organized at the function level. Nodes within the FCG represent individual functions, encompassing both application-defined methods and those from the Android framework or third-party libraries. Edges denote caller-callee relationships, establishing the connections between these functions; the direction of an edge indicates which function invokes another. Consequently, path length within the graph signifies the depth of call nesting, while edge density reflects the degree of interaction between different code modules. Analysis of these structural elements reveals how control and data can flow through the application, providing a detailed map of its behavior.
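To make this concrete, the sketch below builds a toy FCG with networkx and computes the two structural quantities mentioned above, call-nesting depth and edge density. The function names are illustrative placeholders, not drawn from any real application or from the BCG dataset.

```python
# A minimal sketch of an FCG as a directed graph (toy node names, not real data).
import networkx as nx

fcg = nx.DiGraph()
# Each edge encodes a caller -> callee relationship.
fcg.add_edges_from([
    ("MainActivity.onCreate", "NetworkHelper.connect"),
    ("NetworkHelper.connect", "TelephonyManager.getDeviceId"),
    ("MainActivity.onCreate", "Logger.write"),
    ("Logger.write", "FileOutputStream.write"),
])

# Call-nesting depth: the longest call chain in this (acyclic) toy graph.
depth = nx.dag_longest_path_length(fcg)
# Edge density: how tightly the code modules interact.
density = nx.density(fcg)

print(f"nodes={fcg.number_of_nodes()}, edges={fcg.number_of_edges()}")
print(f"max call depth={depth}, density={density:.3f}")
```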
Analysis of Function Call Graphs (FCGs) enables the detection of malicious patterns through the identification of anomalous call sequences and graph structures. Specifically, researchers can identify suspicious behaviors like reflective calls to sensitive APIs, unusually deep or wide call chains, and connections to known malicious libraries, even if the malware’s specific signature is novel. The identification of these patterns doesn’t rely on pre-existing signatures; instead, it focuses on the behavior encoded within the graph. This allows for the detection of zero-day malware and variants that employ obfuscation techniques designed to evade signature-based detection methods. Statistical analysis of graph properties, such as node degree distribution and path lengths, can further highlight deviations from benign application behavior, supporting automated malware classification.
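A hedged sketch of such checks follows: it flags edges that terminate in sensitive APIs and computes an out-degree histogram as a simple graph-level statistic. The prefix list and the choice of statistics are illustrative assumptions, not the detection rules of the paper or of any particular tool.

```python
# Simple structural checks over an FCG; the prefixes below are illustrative only.
from collections import Counter
import networkx as nx

SENSITIVE_PREFIXES = (
    "Landroid/telephony/",            # device identifiers, SMS
    "Ljava/lang/reflect/",            # reflective invocation
    "Ldalvik/system/DexClassLoader",  # dynamic code loading
)

def suspicious_edges(fcg: nx.DiGraph):
    """Caller -> callee edges whose target looks like a sensitive API."""
    return [(u, v) for u, v in fcg.edges() if str(v).startswith(SENSITIVE_PREFIXES)]

def out_degree_histogram(fcg: nx.DiGraph) -> Counter:
    """Out-degree distribution, a cheap statistic for spotting outlier graphs."""
    return Counter(d for _, d in fcg.out_degree())
```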

A New Foundation for Analysis: Introducing the BCG Dataset
The BCG Dataset comprises 10,057 Function Call Graphs (FCGs) extracted from Android applications. These FCGs were generated through static analysis performed with the Androguard framework, a reverse-engineering tool for Android applications. Androguard disassembles an application’s bytecode and resolves caller-callee relationships without executing the app, allowing the construction of FCGs that represent its call structure. The dataset thus captures the calling relationships within each analyzed application, providing a granular representation of its functionality. Each FCG serves as a structural representation of the application’s code, suitable for use in machine learning and program analysis research.
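A minimal extraction sketch is shown below, assuming Androguard’s AnalyzeAPK and get_call_graph interface; the actual BCG pipeline may add filtering or normalization steps not shown here, and "sample.apk" is a placeholder input.

```python
# Sketch: extracting a function call graph from an APK with Androguard.
# Assumes androguard's AnalyzeAPK / get_call_graph API; "sample.apk" is a placeholder.
import networkx as nx
from androguard.misc import AnalyzeAPK

def extract_fcg(apk_path: str) -> nx.DiGraph:
    # a: APK metadata, d: decompiled DEX objects, dx: cross-reference Analysis object
    a, d, dx = AnalyzeAPK(apk_path)
    # get_call_graph() returns a networkx DiGraph whose nodes are methods
    # and whose directed edges are caller -> callee relationships.
    return dx.get_call_graph()

if __name__ == "__main__":
    fcg = extract_fcg("sample.apk")
    print(fcg.number_of_nodes(), fcg.number_of_edges())
```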
The BCG Dataset’s curation process prioritizes accuracy and contextual relevance. Each of the 10,057 Android FCGs undergoes validation against the VirusTotal platform, cross-referencing identified signatures and behavioral reports from multiple antivirus engines. This validation step confirms the presence or absence of malicious indicators. Beyond simple binary labeling, the dataset is enriched with associated threat intelligence data sourced from VirusTotal, including malware family classifications, related samples, and observed behaviors. This supplementary information allows researchers to move beyond detection and towards a deeper understanding of the malware’s functionality and propagation techniques.
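As a rough illustration of how such labels can be obtained, the sketch below queries VirusTotal’s v3 file-report endpoint for an APK’s SHA-256 and reads the number of engines that flag it. The API-key placeholder and the idea of thresholding on that count are assumptions for illustration, not the paper’s labeling protocol.

```python
# Sketch: looking up an APK hash on VirusTotal (v3 API). The key is a placeholder.
import hashlib
import requests

VT_API_KEY = "YOUR_API_KEY"  # placeholder, not a real key

def vt_malicious_count(apk_path: str) -> int:
    with open(apk_path, "rb") as f:
        sha256 = hashlib.sha256(f.read()).hexdigest()
    resp = requests.get(
        f"https://www.virustotal.com/api/v3/files/{sha256}",
        headers={"x-apikey": VT_API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    stats = resp.json()["data"]["attributes"]["last_analysis_stats"]
    return stats["malicious"]  # number of engines flagging the sample
```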
The BCG Dataset’s constituent Function Call Graphs (FCGs) demonstrate substantial complexity, averaging approximately 27,000 nodes and 58,000 edges per graph. This level of granularity results from Androguard-based extraction, which resolves calls into the Android framework and third-party libraries as well as application code. The high node and edge counts reflect a detailed representation of the application’s call structure and inter-component communication, providing a rich substrate for analyzing application logic and identifying potentially malicious activities. This complexity is crucial for effectively training machine learning models capable of discerning subtle differences between benign and malicious software.
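These averages are straightforward to reproduce once the graphs are serialized; the sketch below assumes GraphML files on disk purely for illustration, since the dataset’s actual storage format is not described here.

```python
# Sketch: average node and edge counts over a directory of serialized FCGs.
# The bcg/*.graphml layout is an assumption for illustration only.
import glob
from statistics import mean
import networkx as nx

def size_stats(pattern: str = "bcg/*.graphml"):
    nodes, edges = [], []
    for path in glob.glob(pattern):
        g = nx.read_graphml(path)
        nodes.append(g.number_of_nodes())
        edges.append(g.number_of_edges())
    return mean(nodes), mean(edges)
```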
The BCG Dataset directly supports the development and rigorous evaluation of machine learning (ML) models designed for Android malware detection. Existing datasets often lack the scale or granularity necessary to train and assess the performance of advanced ML techniques, particularly those leveraging graph neural networks. The BCG Dataset, with its 10,057 Android Function Call Graphs (FCGs), provides a substantial and detailed resource for researchers to address this gap. By offering a standardized and well-labeled dataset, it enables comparative analysis of different ML approaches, facilitating improvements in detection accuracy, reduced false positive rates, and enhanced generalization to previously unseen malware families. The availability of this resource is critical given the rapidly evolving landscape of Android threats and the increasing sophistication of malware authors.
The BCG Dataset incorporates both benign and malicious Android application samples to create a balanced training set, crucial for effective machine learning model development. This balanced composition – encompassing a representative distribution of both normal and adversarial behaviors – mitigates potential biases that could arise from training on predominantly benign or malicious data. A disproportionate dataset can lead to models exhibiting high false positive or false negative rates, impacting real-world detection accuracy. The inclusion of both classes allows for robust model generalization and improved performance in identifying previously unseen malware variants and reducing misclassification of legitimate applications.
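In practice this usually means splitting the data so that the benign/malicious (or per-family) proportions are preserved in every partition; a minimal stratified-split sketch follows, with the graphs and labels as placeholders.

```python
# Sketch: a stratified train/test split that preserves class proportions.
from sklearn.model_selection import train_test_split

def split_dataset(graphs, labels, test_size=0.2, seed=42):
    return train_test_split(
        graphs, labels,
        test_size=test_size,
        stratify=labels,     # keep the benign/malicious ratio identical in both splits
        random_state=seed,
    )
```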

Graph Neural Networks: Unlocking Behavioral Insights for Classification
Graph Neural Networks (GNNs) offer a powerful approach to malware classification by leveraging the structural information embedded within Function Call Graphs (FCGs). Specifically, the study investigates the efficacy of Graph Convolutional Networks (GCN), Graph Isomorphism Networks (GIN), and the simpler Local Degree Profile (LDP) representation in analyzing these FCGs. The GNNs learn meaningful representations – or embeddings – for each node within the graph, alongside features that capture the overall graph structure, allowing the system to discern patterns indicative of malicious code. By operating directly on the FCG, these models explicitly capture the relationships between different functions, ultimately enhancing the accuracy of malware identification compared to methods that treat code as a flat sequence of instructions.
Graph Neural Networks excel at malware classification by transforming the complex structure of Function Call Graphs (FCGs) into a quantifiable format. These networks do not simply analyze individual functions; rather, they learn to represent each node, that is, each function, as a dense vector known as a node embedding. Simultaneously, the network aggregates information along the caller-callee edges, producing graph-level features that reflect the overall program structure. This process translates the FCG’s architecture, the pattern of calls between functions, into numerical data that machine learning algorithms can readily interpret, allowing malicious patterns to be identified from how code is organized rather than from the code text alone. The resulting embeddings and features enable accurate malware identification by focusing on behavioral characteristics encoded within the FCG’s inherent graph structure.
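A minimal graph-classification sketch with PyTorch Geometric illustrates the idea: message passing produces node embeddings, pooling collapses them into a graph-level vector, and a linear layer predicts the class. Layer sizes, depth, and the mean-pooling choice are illustrative assumptions, not the configuration evaluated in the paper.

```python
# Minimal GCN graph classifier (PyTorch Geometric); hyperparameters are illustrative.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class FCGClassifier(torch.nn.Module):
    def __init__(self, in_dim: int, hidden_dim: int = 64, num_classes: int = 2):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.lin = torch.nn.Linear(hidden_dim, num_classes)

    def forward(self, x, edge_index, batch):
        # Message passing yields an embedding for every function node...
        h = F.relu(self.conv1(x, edge_index))
        h = F.relu(self.conv2(h, edge_index))
        # ...which mean pooling collapses into one vector per graph.
        g = global_mean_pool(h, batch)
        return self.lin(g)
```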
Investigations into malware classification reveal that Graph Neural Networks (GNNs) consistently surpass the performance of conventional machine learning algorithms, such as Random Forest, when analyzing features derived from Function Call Graphs (FCGs). This improvement stems from the GNN’s capacity to learn directly from the structural relationships within the FCG, effectively capturing the nuanced behavioral patterns indicative of malicious code. Unlike traditional methods that rely on hand-engineered features or treat nodes in isolation, GNNs process the entire graph structure, allowing them to identify complex dependencies and call-chain characteristics that might otherwise be missed. The superior performance demonstrated by GNNs highlights their potential as a robust and accurate approach to malware detection, particularly when leveraging the rich information encoded within FCG representations.
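For comparison, a conventional baseline of the kind referred to above might flatten each graph into a handful of global statistics and feed them to a Random Forest; the particular feature set below is an assumption for illustration, not the baseline used in the study.

```python
# Sketch: hand-crafted graph statistics as features for a Random Forest baseline.
import networkx as nx
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def flat_features(g: nx.DiGraph) -> np.ndarray:
    degrees = [d for _, d in g.degree()]
    return np.array([
        g.number_of_nodes(),
        g.number_of_edges(),
        nx.density(g),
        float(np.mean(degrees)) if degrees else 0.0,
        float(np.max(degrees)) if degrees else 0.0,
    ])

def train_baseline(graphs, labels) -> RandomForestClassifier:
    X = np.stack([flat_features(g) for g in graphs])
    return RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)
```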
Despite demonstrating the potential of graph neural networks for malware classification, experiments revealed substantial performance variation depending on the dataset used. While a Macro-F1 score of 93.04% was achieved using the CICMalDroid dataset, results on the BCG dataset were significantly lower, registering only 6.01%. This disparity suggests the BCG dataset presents a considerably more difficult benchmark for evaluating malware classification models, potentially due to increased obfuscation techniques employed in its samples, a higher degree of feature similarity between benign and malicious code, or an imbalanced distribution of malware families within the dataset itself. Further investigation into the characteristics of the BCG dataset is necessary to understand the root causes of this performance gap and develop more robust malware detection strategies.
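Macro-F1 is worth pausing on, because it explains how a model can look reasonable on aggregate accuracy yet score in the single digits: the metric averages per-class F1 with equal weight, so rare malware families count as much as common ones. The toy labels below are placeholders, not results from the paper.

```python
# Macro-F1 averages the per-class F1 scores with equal weight.
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2]  # hypothetical family labels
y_pred = [0, 0, 1, 0, 2, 1]  # a classifier that struggles on the rarer classes
print(f1_score(y_true, y_pred, average="macro"))  # approx. 0.656
```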
The introduction of the BCG dataset illuminates a critical truth about system evolution: even diagnostic tools require constant recalibration. This study demonstrates that reliance on older datasets, however comprehensive they once were, introduces a form of technical debt, obscuring the present reality of Android malware. As John McCarthy observed, “It is better to be thought a fool than to do a foolish thing.” The researchers didn’t perpetuate a flawed understanding; instead, they acknowledged the decay of existing resources and provided a new foundation for analysis. The very act of creating BCG serves as a testament to the necessity of confronting the present, rather than relying on the ghosts of classifications past, and to the understanding that every bug, or in this case an unrepresentative dataset, is a moment of truth in the timeline of system understanding.
What Lies Ahead?
The introduction of BCG, a dataset ostensibly designed to benchmark progress, instead reveals a curious truth: existing evaluations of function call graph-based malware classification were, perhaps, premature celebrations. The field assumed an upward trajectory, but BCG suggests a plateau, or at least, a significantly less steep incline than previously believed. Systems age not because of errors, but because time is inevitable; performance metrics, too, are subject to this decay, revealing their limitations only when confronted with genuinely new challenges.
The persistent difficulty in classifying modern Android malware via function call graphs isn’t a failure of technique, but a symptom of adaptation. Malware evolves, and with each iteration, it introduces subtle shifts in behavior, obscuring the patterns upon which classification relies. The current emphasis on graph classification algorithms feels, increasingly, like refining the tools while ignoring the shifting terrain. A deeper exploration of dynamic analysis, coupled with a more nuanced understanding of adversarial machine learning, may prove more fruitful than further optimizing existing static methods.
It’s tempting to view BCG as a corrective measure, a necessary recalibration. But perhaps its true value lies in its implicit warning: stability is just a delay of disaster. The pursuit of ever-higher accuracy should not overshadow the fundamental question of resilience. The system will, ultimately, be breached; the objective must be to delay, detect, and mitigate, rather than to achieve an illusory state of perfect security.
Original article: https://arxiv.org/pdf/2512.20872.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/