Author: Denis Avetisyan
Researchers are enhancing graph node classification by combining graph neural networks with large language models to identify both familiar and entirely new types of data points.

This work introduces a Coarse-to-Fine Classification framework leveraging LLM prompting and manifold mixup for accurate open-set graph node classification, including robust out-of-distribution detection.
Existing graph classification methods struggle to generalize to unseen data, particularly in open-world scenarios demanding both in-distribution (ID) classification and out-of-distribution (OOD) detection. This work introduces a novel Coarse-to-Fine Classification (CFC) framework, as presented in ‘Coarse-to-Fine Open-Set Graph Node Classification with Large Language Models’, which leverages large language models (LLMs) alongside graph neural networks to not only identify but also classify OOD graph nodes without requiring labeled OOD examples. Experimental results demonstrate substantial improvements in both OOD detection and classification accuracy, achieving up to seventy percent accuracy in assigning labels to genuinely out-of-distribution instances. Could this approach unlock more robust and interpretable graph neural networks capable of adapting to truly dynamic real-world datasets?
Beyond Independent Features: Embracing Relational Data
Conventional machine learning algorithms typically demand that data be structured in a format of independent features, a requirement that poses significant challenges when dealing with graph-structured data. Graphs, by their very nature, emphasize relationships between data points – nodes – and these connections often contain crucial information. Consequently, applying traditional methods necessitates a process called feature engineering, where researchers manually create descriptive attributes for each node, attempting to encode the relational information into a format the algorithm can process. This is not only a time-consuming and labor-intensive task, but also risks overlooking subtle yet important connections within the graph. The effectiveness of the resulting classification heavily depends on the quality of these engineered features, and a poor choice can severely limit performance, highlighting the limitations of treating graph data as isolated entities.
Graph Neural Networks (GNNs) represent a significant advancement in machine learning by moving beyond traditional methods that require extensive pre-processing of graph-structured data. Unlike algorithms designed for independent data points, GNNs directly process graphs – networks of nodes and edges – allowing them to learn directly from the relationships inherent in the data. This is achieved through a process of message passing, where each node aggregates information from its neighbors, iteratively refining its own representation. Consequently, GNNs excel at node classification tasks, accurately assigning labels to nodes based not only on their individual features, but also on the collective knowledge gleaned from the surrounding network. The ability to operate directly on graph structures eliminates the need for manual feature engineering, simplifying the learning process and often yielding substantially improved performance, particularly in domains like social network analysis, knowledge graphs, and molecular property prediction.
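The message-passing idea described above can be sketched in a few lines of plain Python: each node pools its own feature with those of its neighbors. This is a minimal illustration on a toy graph with scalar features, not the paper's implementation.

```python
# Minimal sketch of one message-passing round (mean aggregation),
# assuming a toy graph and scalar node features for illustration.

def message_pass(features, adjacency):
    """Each node averages its neighbors' features together with its own."""
    updated = {}
    for node, feat in features.items():
        neighbors = adjacency.get(node, [])
        pooled = feat + sum(features[n] for n in neighbors)
        updated[node] = pooled / (1 + len(neighbors))
    return updated

# Toy graph: a path 0 - 1 - 2
adjacency = {0: [1], 1: [0, 2], 2: [1]}
features = {0: 0.0, 1: 1.0, 2: 2.0}
print(message_pass(features, adjacency))  # -> {0: 0.5, 1: 1.0, 2: 1.5}
```

Stacking several such rounds lets each node's representation absorb information from progressively larger neighborhoods, which is what makes the learned embeddings context-aware.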
Graph Neural Networks distinguish themselves in node classification tasks by fundamentally incorporating the relationships defining a graph’s structure. Unlike traditional machine learning models which often treat each node as an isolated data point, GNNs propagate information between connected nodes, allowing each node’s representation to be informed by its neighbors and, recursively, by the broader network context. This process creates richer, more nuanced node embeddings that capture not only the node’s intrinsic features but also its position and role within the graph. Consequently, studies demonstrate that GNNs consistently outperform methods that disregard graph topology, achieving significant gains in accuracy and robustness, particularly in scenarios where relationships are critical to understanding node characteristics – such as social networks, knowledge graphs, and molecular structures. The ability to learn from interconnectedness unlocks a new dimension of feature representation, driving superior performance in node classification and related graph-based learning problems.
Navigating Distribution Shift with Open-Set Awareness
Traditional node classification methods assume all possible node labels are represented in the training data. However, real-world graph datasets frequently exhibit distribution shift, meaning nodes may appear with labels not encountered during model training. This presents a significant challenge as standard classifiers are forced to assign these out-of-distribution (OOD) nodes to one of the known classes, inevitably leading to misclassifications and reduced model performance. The occurrence of unseen labels is common in dynamic graphs representing evolving systems, where new categories or entities emerge over time, or in scenarios involving data from heterogeneous sources with varying label spaces. Consequently, models trained on a limited label set struggle to generalize to the complete range of nodes present in the deployed graph.
Open-Set Classification (OSC) represents an extension of traditional supervised classification by incorporating the ability to recognize inputs that belong to classes not encountered during the training phase. Unlike standard classification which assumes all possible classes are present in the training data, OSC explicitly models the presence of “unknown” or “out-of-distribution” (OOD) classes. This is achieved through techniques that learn a decision boundary not only to differentiate between known classes but also to identify instances that fall outside this learned distribution. In the context of graph neural networks, this capability is particularly valuable for dynamic graphs where new node types or labels may emerge after the initial model training, allowing the model to avoid making potentially incorrect predictions on these previously unseen classes.
Abstaining from prediction on out-of-distribution (OOD) samples is a core component of enhancing model reliability in dynamic graph settings. Traditional node classification models are compelled to assign a label to every node, even when presented with previously unseen node types, leading to inaccurate and potentially misleading results. Open-set classification addresses this limitation by incorporating a mechanism for the model to recognize and explicitly reject inputs that fall outside the training distribution. This is typically achieved through anomaly or novelty detection techniques, which establish a confidence threshold below which a prediction is withheld. By abstaining from prediction in these cases, the model avoids generating incorrect labels, thereby improving the trustworthiness of its outputs and offering a more realistic assessment of its capabilities in real-world deployments where data drift is common.
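A common, simple realization of this abstention mechanism is thresholding the maximum softmax confidence. The sketch below is illustrative; the threshold value and function names are assumptions, not the paper's method.

```python
# Sketch of prediction-with-abstention: reject inputs whose maximum
# softmax confidence falls below a threshold. The threshold value and
# helper names are illustrative, not taken from the paper.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_or_abstain(logits, threshold=0.7):
    probs = softmax(logits)
    conf = max(probs)
    if conf < threshold:
        return "OOD"          # abstain: likely out-of-distribution
    return probs.index(conf)  # predicted in-distribution class index

print(predict_or_abstain([4.0, 0.1, 0.2]))   # confident -> class 0
print(predict_or_abstain([1.0, 0.9, 1.1]))   # near-uniform -> "OOD"
```

In practice the threshold is tuned on held-out data, trading off coverage (how often the model answers) against reliability (how often its answers are correct).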

Strengthening Robustness: Manifold Mixup and Open-Set Classification
Effective Out-of-Distribution (OOD) classification relies heavily on the quality of the generated OOD samples used for training and evaluation. These samples must be both meaningful, representing plausible data points that the model might encounter in real-world scenarios, and representative of the broader OOD space to avoid bias and ensure generalization. Insufficiently diverse or unrealistic OOD samples can lead to overly optimistic performance estimates and poor performance when deployed in environments with truly novel data. Generating these samples often involves techniques that create variations of existing data or synthesize entirely new data points based on learned distributions, requiring careful consideration of the underlying data characteristics and potential sources of error.
Manifold Mixup enhances model robustness by creating new training samples through interpolation within the model’s hidden representation space. This technique operates by randomly selecting two data points, $x_i$ and $x_j$, and their corresponding hidden representations, $h_i = f(x_i)$ and $h_j = f(x_j)$, where $f$ maps an input to an intermediate layer of the network. A mixed representation, $\lambda h_i + (1-\lambda)h_j$, is then generated, together with a correspondingly mixed label $\lambda y_i + (1-\lambda)y_j$. By training on these interpolated representations, the model learns to produce more stable and consistent outputs for unseen data, which directly improves the reliability of generated Out-of-Distribution (OOD) samples used for classification. This leads to better discrimination between in-distribution and OOD data points during OOD classification tasks.
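The interpolation step itself is simple; a minimal sketch, assuming vector-valued hidden states and one-hot labels, follows. In practice $\lambda$ is drawn from a Beta distribution; it is fixed here for reproducibility.

```python
# Illustrative Manifold Mixup step on hidden representations:
# h_mixed = lam * h_i + (1 - lam) * h_j, with labels mixed the same way.
# lam is normally sampled from a Beta distribution; fixed here for clarity.
import random

def manifold_mixup(h_i, h_j, y_i, y_j, lam=None):
    if lam is None:
        lam = random.betavariate(2.0, 2.0)  # a common choice of mixing prior
    h_mixed = [lam * a + (1 - lam) * b for a, b in zip(h_i, h_j)]
    y_mixed = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return h_mixed, y_mixed

# Two hidden vectors with one-hot labels, mixed at lam = 0.25
h, y = manifold_mixup([1.0, 0.0], [0.0, 1.0], [1, 0], [0, 1], lam=0.25)
print(h, y)  # -> [0.25, 0.75] [0.25, 0.75]
```

Because the interpolation happens in representation space rather than input space, the synthetic points tend to lie near the data manifold, which is what makes them useful stand-ins for plausible OOD samples.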
The reliability of out-of-distribution (OOD) classification is directly correlated with the confidence scores assigned to predictions; higher confidence on in-distribution samples and lower confidence on OOD samples are essential for accurate discrimination. The proposed Coarse-to-Fine Classification (CFC) method leverages this principle to improve OOD handling in graph data. Specifically, CFC achieves up to 70% accuracy in assigning labels to OOD graph nodes, a substantial gain over baseline methods, by refining the initial classification through a multi-stage process that emphasizes well-calibrated confidence scores. This improvement reflects a markedly better ability to handle data points that fall outside the training distribution.
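The coarse-to-fine idea can be sketched as a two-stage flow: a coarse stage splits ID from OOD by confidence, and a fine stage assigns a final label. This is a hedged illustration only; the actual CFC framework also uses LLM prompting to label OOD nodes, and `label_ood_node` below is a hypothetical stand-in for that step.

```python
# Hedged sketch of a two-stage coarse-to-fine flow. The coarse stage
# separates ID from OOD by confidence; the fine stage assigns a label.
# `label_ood_node` is a hypothetical stand-in for the paper's
# LLM-prompting step, which is not reproduced here.

def coarse_stage(probs, threshold=0.7):
    return "ID" if max(probs) >= threshold else "OOD"

def fine_stage(probs, known_classes, label_ood_node, threshold=0.7):
    if coarse_stage(probs, threshold) == "ID":
        return known_classes[probs.index(max(probs))]
    return label_ood_node(probs)  # e.g. an LLM-prompted novel label

known = ["physics", "biology"]
fallback = lambda probs: "unknown:new-topic"  # stand-in for LLM labeling
print(fine_stage([0.9, 0.1], known, fallback))    # -> "physics"
print(fine_stage([0.55, 0.45], known, fallback))  # -> "unknown:new-topic"
```

The separation of stages is what lets the pipeline abstain cheaply first and spend the expensive labeling step only on nodes flagged as OOD.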
Out-of-distribution (OOD) classification accuracy was evaluated on four benchmark datasets: Cora, Citeseer, WikiCS, and DBLP. Results indicate an accuracy of 69.76% on Cora, 70.30% on Citeseer, 57.96% on WikiCS, and 48.45% on DBLP. These results demonstrate the capacity of the proposed method to effectively classify samples belonging to multiple, distinct OOD classes across varying graph structures and dataset sizes, signifying generalization capability beyond single OOD detection scenarios.
Evaluations demonstrate that the proposed out-of-distribution (OOD) detection approach achieves an accuracy of up to 95.74%. This performance represents a substantial improvement over comparative methods, specifically G2PxY, which achieved approximately 72.46% accuracy on the Cora dataset under the same conditions. The reported OOD detection accuracy was determined through rigorous testing and validation procedures, highlighting the method’s efficacy in correctly identifying samples that fall outside the training distribution.

GCN and GNN-based Classifiers: A Practical Application
Graph Convolutional Networks (GCNs) build upon the principles of Graph Neural Networks by introducing a localized spectral approach to processing graph-structured data. Instead of treating each node in isolation, GCNs effectively aggregate feature information from a node’s immediate neighbors, allowing the network to learn representations that capture both individual node characteristics and the relationships within the graph. This aggregation is achieved through a convolutional operation defined in the spectral domain, leveraging the graph’s adjacency matrix and Laplacian matrix to efficiently propagate information. Critically, this localized approach dramatically reduces computational complexity compared to earlier graph neural network models, making GCNs particularly well-suited for large-scale graph analysis. The result is a computationally efficient method for learning node embeddings that capture structural information, offering a powerful foundation for a variety of tasks, including node classification and link prediction.
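A single GCN layer of the kind described above computes $H' = \sigma(\hat{A} H W)$, where $\hat{A}$ is the symmetrically normalized adjacency matrix with self-loops. A minimal pure-Python sketch on a two-node graph, with a fixed weight matrix for illustration:

```python
# Minimal GCN layer sketch: H' = relu(A_hat @ H @ W), where A_hat is the
# symmetrically normalized adjacency with self-loops. Weights are fixed
# for illustration; a trained model would learn W.
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def gcn_layer(adj, H, W):
    n = len(adj)
    # Add self-loops, then apply D^{-1/2} (A + I) D^{-1/2}
    A = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A]
    A_hat = [[A[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
             for i in range(n)]
    Z = matmul(matmul(A_hat, H), W)
    return [[max(0.0, z) for z in row] for row in Z]  # ReLU activation

adj = [[0, 1], [1, 0]]        # two connected nodes
H = [[1.0, 0.0], [0.0, 1.0]]  # input node features
W = [[1.0], [1.0]]            # single learned output channel (fixed here)
print(gcn_layer(adj, H, W))   # -> [[1.0], [1.0]]
```

The normalization by node degree is what keeps the aggregation stable across nodes with very different numbers of neighbors, a key reason GCNs scale to large graphs.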
Graph Neural Networks (GNNs) provide a compelling framework for node classification by directly leveraging the relationships within graph-structured data. Unlike traditional machine learning models that treat each node in isolation, GNN-based classifiers aggregate information from a node’s neighbors, enabling a more nuanced understanding of its features and context. This approach proves particularly effective in scenarios where node properties are heavily influenced by network connectivity, such as social network analysis, fraud detection, and knowledge graph reasoning. The versatility of GNNs extends to diverse data types – from molecular structures in chemistry to citation networks in academic research – allowing for adaptable solutions across numerous applications. By learning node embeddings that capture both individual attributes and relational information, these classifiers achieve state-of-the-art performance in predicting node labels and uncovering hidden patterns within complex networks.
Graph Convolutional Networks, when paired with data augmentation strategies like Manifold Mixup, yield classifiers demonstrably resilient to shifts in data distribution – a common challenge with real-world graph datasets. This technique effectively creates synthetic data points by interpolating both features and labels of existing nodes, expanding the training set with plausible variations and improving generalization. Evaluations reveal a significant performance boost, with the combined approach achieving a 10% improvement in out-of-distribution (OOD) detection rates compared to standard GCN classifiers. This enhanced reliability is crucial for applications where models encounter unseen data, such as fraud detection or network anomaly identification, where accurate classification under changing conditions is paramount.
The pursuit of robust graph node classification demands simplification. This work embodies that principle, addressing the challenge of open-set scenarios with a Coarse-to-Fine Classification framework. It skillfully integrates Large Language Models, not as complexity enhancers, but as tools to distill semantic meaning. Every complexity needs an alibi, and here, the LLM serves to justify its presence by improving out-of-distribution detection. As Linus Torvalds once stated, “Talk is cheap. Show me the code.” This paper doesn’t merely discuss improvements; it delivers a functional system, proving the efficacy of its approach with tangible results in handling previously unseen node types.
What’s Next?
The pursuit of classification, even in its ‘open-set’ guise, often feels like an exercise in meticulously documenting the shape of ignorance. This work, while a refinement of technique, does not erase the fundamental problem: the universe of unseen data will always outpace any model’s capacity to represent it. The coupling of graph structures with large language models is a logical progression, yet it risks compounding complexity. The true measure of success will not be higher accuracy on contrived benchmarks, but demonstrable robustness in the face of truly novel, unanticipated node characteristics.
A critical direction lies in disentangling semantic out-of-distribution samples – a laudable goal, but one predicated on the assumption that ‘meaning’ is stable and readily transferable across domains. The current framework appears to accept this premise without rigorous examination. Future work should explore methods for quantifying and mitigating the inherent ambiguity in semantic representations, perhaps by embracing controlled forms of ‘forgetting’ – strategically discarding information to prevent overfitting to spurious correlations.
Ultimately, the field needs to shift its focus from ‘detecting’ the unknown to understanding the limits of knowledge itself. The elegance of a simpler model, capable of gracefully admitting its own fallibility, will always outweigh the allure of a complex system struggling to convincingly mimic understanding.
Original article: https://arxiv.org/pdf/2512.16244.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2025-12-22 03:57