Mapping the Spread: A New Dataset for Spotting Fake News

Author: Denis Avetisyan


Researchers have released a large-scale dataset designed to help artificial intelligence better identify and combat the growing problem of misinformation online.

The model represents news dissemination as a graph, with each news item as a root node and individual users branching from it as child nodes, each user characterized by associated textual data.

TAGFN is a text-attributed graph dataset for fake news detection, leveraging large language models and graph neural networks to improve outlier detection in social networks.

Despite recent advances in large language models for graph-based machine learning, robust benchmarks for outlier detection – particularly in the critical domain of fake news – remain scarce. To address this gap, we introduce TAGFN: A Text-Attributed Graph Dataset for Fake News Detection in the Age of LLMs, a large-scale, real-world resource designed to rigorously evaluate both traditional and LLM-powered graph outlier detection methods. Our analysis demonstrates TAGFN’s efficacy in identifying misinformation and facilitating the development of more trustworthy AI systems. Will this dataset catalyze a new generation of robust, graph-based approaches to combatting online falsehoods?


The Illusion of Simplicity: Scaling Truth in a Noisy World

Early attempts at automated fake news detection heavily depended on feature engineering – painstakingly identifying and coding specific characteristics thought to indicate falsehoods, such as sensationalist language or unusual writing styles. However, this approach quickly proved brittle; nuanced misinformation, designed to mimic legitimate reporting, consistently evaded these rules-based systems. Furthermore, the tactics employed by those creating false narratives are constantly evolving, rendering previously effective features obsolete. A system trained to flag articles with excessive capitalization, for example, can be easily bypassed by a perpetrator who simply adjusts their technique. This cat-and-mouse game highlights a fundamental limitation of feature engineering: its inability to adapt to the ever-changing landscape of online deception and the complexities of human language.

The proliferation of digital content presents a significant hurdle for fact-checking initiatives, as the velocity and volume of information far surpass the capabilities of human reviewers. Current estimates suggest millions of articles, social media posts, and videos are published daily, creating an insurmountable backlog for traditional verification processes. This necessitates the development of automated, scalable methods for misinformation detection. Such systems aim to analyze content at a rate and scope impossible for manual investigation, employing techniques like natural language processing and machine learning to identify potentially false or misleading claims. The challenge isn’t simply identifying known falsehoods, but also flagging novel misinformation that bypasses existing databases, demanding adaptable algorithms capable of processing information streams in real-time and prioritizing content for human review based on predicted veracity.

Current fake news detection systems frequently operate in isolation, analyzing text for veracity without considering the broader information ecosystem. This compartmentalized approach overlooks crucial connections: the reputation and biases of the originating news source, the network of websites amplifying the claim, and the potential for societal harm stemming from widespread dissemination. A claim’s truthfulness isn’t solely determined by its content; rather, it’s deeply intertwined with who is making the claim, how it’s being spread, and what consequences might arise. Consequently, models lacking the capacity to map these complex relationships often misclassify information, failing to recognize coordinated disinformation campaigns or accurately assess the real-world impact of false narratives. Future advancements necessitate a shift towards holistic systems capable of evaluating claims within their full contextual framework, acknowledging that misinformation isn’t merely a textual problem but a systemic one.

The identification of misinformation increasingly hinges on the capacity to detect subtle anomalies – content that doesn’t quite align with established facts, yet isn’t overtly false. Current approaches to outlier detection are being refined to move beyond simple keyword matching and delve into semantic understanding, employing techniques like natural language inference and knowledge graph analysis. These models attempt to map relationships between claims, sources, and supporting evidence, flagging content that exhibits statistically improbable deviations from expected patterns. Successfully discerning these nuanced discrepancies requires models trained on vast datasets, capable of identifying not just demonstrably false statements, but also manipulative framing, biased reporting, and the propagation of unsubstantiated rumors – a critical step in mitigating the spread of increasingly sophisticated disinformation campaigns.

TAGFN: A Relational Map for Veracity

The TAGFN dataset is a large-scale, text-attributed graph constructed to facilitate research in fake news detection. It comprises 16,896 news articles, 13,837 claims, and 17,382 entities, interconnected via various relationship types including “supports”, “refutes”, and “related to”. Nodes within the graph represent both news articles and individual claims, and are enriched with textual attributes derived from the content of those articles and claims. This structure allows for the modeling of complex relationships between news content, fact-checking claims, and the entities they discuss, going beyond simple article classification to capture contextual dependencies critical for assessing veracity.
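
The description above maps naturally onto a heterogeneous graph object. Below is a minimal sketch in PyTorch Geometric, assuming hypothetical node types and field names taken from the relation labels quoted above; TAGFN's official loader may organize things differently.

```python
# A minimal sketch, assuming a PyTorch Geometric HeteroData layout; node types
# and relation names mirror the description above, not an official loader.
import torch
from torch_geometric.data import HeteroData

data = HeteroData()

# Raw text is kept on each node so NLP models can be applied to it later.
article_texts = ["Senator X claimed Y at a rally on Tuesday ...", "..."]
claim_texts = ["Y is rated false by fact-checkers ...", "..."]

data["article"].text = article_texts
data["claim"].text = claim_texts

# Placeholder feature matrices, to be replaced by LLM embeddings downstream.
data["article"].x = torch.zeros(len(article_texts), 256)
data["claim"].x = torch.zeros(len(claim_texts), 256)

# Typed edges encode the relations quoted above: supports / refutes / related to.
data["claim", "supports", "article"].edge_index = torch.tensor([[0], [0]])
data["claim", "refutes", "article"].edge_index = torch.tensor([[1], [0]])
```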

The TAGFN dataset aggregates data from three distinct sources to ensure comprehensive coverage and diverse viewpoints in fake news detection. Politifact provides fact-checking reports and ratings on statements made by politicians and public figures. Gossipcop focuses on debunking celebrity and entertainment rumors, offering a different domain of misinformation. Finally, Fakeddit contributes a collection of posts and comments identified as fabricated or misleading within the Reddit platform. This multi-source integration allows for analysis across various topics and formats, mitigating biases inherent in any single data source and improving the generalizability of detection models.

TAGFN represents news articles and claims as nodes within a graph, each node being associated with textual attributes including article content, claim statements, and supporting evidence. These attributes are not merely labels but are retained as raw text, allowing for detailed linguistic analysis and semantic comparison. This granular approach contrasts with datasets using only categorical features or aggregated scores, enabling researchers to employ natural language processing techniques – such as sentiment analysis, topic modeling, and stance detection – directly on the node attributes. Consequently, TAGFN facilitates fine-grained analysis of claim veracity by considering the specific textual content supporting or refuting a given assertion, and supports nuanced understanding of the relationships between articles, claims, and evidence.
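
Because the attributes stay as raw text, standard NLP tooling can be pointed at a node directly. A small illustration with a stock Hugging Face sentiment pipeline follows; the model choice here is ours, not the paper's.

```python
# Illustrative only: running an off-the-shelf sentiment model on a node's raw
# text attribute. The pipeline's default model is our choice, not TAGFN's.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
node_text = "SHOCKING! You won't believe what the senator did next!"
print(sentiment(node_text))  # e.g. [{'label': 'NEGATIVE', 'score': 0.99}]
```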

The graph structure of TAGFN facilitates the application of graph-based machine learning techniques, allowing models to move beyond analyzing isolated articles and instead leverage relationships between news sources, claims, and associated evidence. Specifically, algorithms like graph neural networks (GNNs) can propagate information across the graph, enabling reasoning about contextual factors such as source credibility and claim support. This relational reasoning is achieved by representing entities as nodes and their interactions – including supporting, refuting, or simply referencing – as edges, providing a rich substrate for identifying patterns indicative of misinformation and allowing for more nuanced analysis than traditional feature-based approaches.

Bridging Language and Structure: A Networked Intelligence

The TAGFN graph incorporates Large Language Models (LLMs), specifically the Qwen3-8B model, to process and interpret textual data associated with each node. Qwen3-8B is utilized to analyze the textual attributes of nodes, enabling the extraction of semantic information from claims, evidence, and other relevant text within the graph. This analysis allows for a deeper understanding of the relationships between nodes and facilitates the identification of potentially problematic or anomalous information based on textual content. The LLM’s capabilities are central to the graph’s ability to reason about and evaluate the veracity of claims represented within the network.

Qwen3-Embedding-8B, a large language model, is employed to create numerical representations, known as node embeddings, for each node within the TAGFN graph. These embeddings are 256-dimensional vectors generated by processing the textual data associated with each node; this process captures the semantic meaning of the text and encodes it into a format suitable for machine learning. The resulting embeddings allow for quantitative comparison of nodes based on their textual content, enabling graph-based reasoning and the identification of relationships between nodes that might not be immediately apparent from the raw text. These embeddings serve as input features for downstream graph neural network models, such as GraphSAGE, facilitating anomaly detection and improved performance on tasks like fact verification.
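
A plausible way to reproduce this step, assuming the model is used through its published sentence-transformers interface, is sketched below. The truncation to 256 dimensions follows the figure quoted above (Qwen3 embeddings support shorter output sizes), though the paper's exact pipeline may differ.

```python
# Sketch of node-embedding generation with Qwen3-Embedding-8B via
# sentence-transformers; whether TAGFN used this exact pipeline is an assumption.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-8B")  # ~8B params; needs a large GPU

node_texts = [
    "Claim: the senator voted against the bill in 2021.",
    "Article: fact-checkers rated the statement mostly false.",
]

embeddings = model.encode(node_texts, normalize_embeddings=True)

# Keep the first 256 dimensions and re-normalize (matryoshka-style truncation).
embeddings = embeddings[:, :256]
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
print(embeddings.shape)  # (2, 256)
```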

GraphSAGE is an inductive graph neural network capable of generating node embeddings for unseen nodes, making it suitable for dynamic graphs. It operates by sampling fixed-size neighborhoods around each node and aggregating feature information from these neighbors to create the node’s embedding. This aggregation process, often utilizing mean, max-pooling, or LSTM aggregators, allows information to propagate across the graph structure. By learning how to aggregate features, GraphSAGE can identify nodes with unusual feature combinations or connectivity patterns – defined as anomalous nodes – relative to their neighbors, thereby enabling anomaly detection within the TAGFN graph based on the semantic embeddings generated by Qwen3-Embedding-8B.
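
A compact version of such a detector, written with PyTorch Geometric's SAGEConv, might look like the following; the depth, hidden size, and classification head are illustrative choices, not the paper's reported configuration.

```python
# Two-layer GraphSAGE over 256-d text embeddings, ending in a real/fake head.
# Hyperparameters are illustrative, not the paper's exact setup.
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class SageDetector(torch.nn.Module):
    def __init__(self, in_dim=256, hidden=128, num_classes=2):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden)  # aggregates neighbor features
        self.conv2 = SAGEConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, num_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        return self.head(x)  # per-node logits: real vs. fake

# Stand-in tensors in place of TAGFN embeddings and edges.
x = torch.randn(100, 256)
edge_index = torch.randint(0, 100, (2, 400))
logits = SageDetector()(x, edge_index)
```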

Evaluation on the Politifact dataset demonstrates the performance of the TAGFN framework with different learning paradigms. Zero-shot inference, where the model predicts veracity without prior training on the dataset, achieves an accuracy of 69.68%. Implementing in-context learning, which provides the model with a small number of example claims and their corresponding veracity labels during inference, significantly improves performance to 78.28% accuracy. This indicates the model’s capacity to leverage contextual information for improved claim verification and highlights the benefit of incorporating few-shot learning techniques.
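
The two settings differ only in the prompt handed to the model. A hedged sketch of both modes, with illustrative wording rather than the paper's actual prompts:

```python
# Zero-shot vs. in-context prompts for claim verification. The phrasing is
# illustrative; the paper's exact prompts are not reproduced here.
def zero_shot_prompt(claim: str) -> str:
    return (
        "Is the following news claim real or fake? Answer 'real' or 'fake'.\n"
        f"Claim: {claim}\nLabel:"
    )

def in_context_prompt(claim: str, examples: list[tuple[str, str]]) -> str:
    shots = "\n".join(f"Claim: {c}\nLabel: {y}" for c, y in examples)
    return (
        "Decide whether each claim is real or fake.\n"
        f"{shots}\nClaim: {claim}\nLabel:"
    )

demos = [("The moon landing was staged.", "fake"),
         ("Water boils at 100 C at sea level.", "real")]
print(in_context_prompt("Senator X banned all imports overnight.", demos))
```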

Beyond Detection: Towards a Resilient Information Ecosystem

Recent advancements in outlier detection have yielded substantial gains in accuracy when contrasted with established methodologies. This improvement isn’t merely incremental; the study demonstrates a capacity to more effectively pinpoint anomalous information patterns indicative of misinformation campaigns or fabricated content. By leveraging complex algorithms and incorporating network analysis of information spread, the system significantly reduces both false positives and false negatives. This enhanced precision is crucial for proactively identifying and flagging potentially harmful content before it gains widespread traction, ultimately contributing to a more reliable information landscape and fostering greater public trust in online sources. The gains observed suggest a promising pathway towards scalable and robust misinformation detection systems.

Recent advancements leverage Chain-of-Thought Reasoning within Large Language Models (LLMs) to move beyond simply detecting misinformation and towards fostering genuine trust in information systems. This technique prompts the LLM to articulate the reasoning behind its assessments – essentially, to “think aloud” and detail the steps leading to a conclusion about a piece of content. By exposing this internal logic, rather than presenting a black-box prediction, the system offers users explainable insights into why a particular post is flagged as potentially misleading. This transparency is crucial; it allows for human review, facilitates a better understanding of the model’s biases, and ultimately empowers individuals to evaluate the information for themselves, building confidence in the overall process and fostering a more trustworthy information ecosystem.
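
In practice this amounts to a prompting pattern: ask for the reasoning before the verdict. A minimal sketch, in our wording rather than the paper's:

```python
# Chain-of-thought prompt: the model explains its assessment before the verdict,
# exposing the logic behind a "misleading" flag. Wording is ours, not the paper's.
def cot_prompt(post: str) -> str:
    return (
        "Assess the post below. Think step by step: identify the core claim, "
        "the evidence offered, and any signs of manipulation. "
        "Then give a one-word verdict: 'real' or 'fake'.\n"
        f"Post: {post}\nReasoning:"
    )
```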

Analysis revealed the critical role of both network structure and individual user contributions in accurately identifying misinformation. When the relational information connecting users – the ‘graph structure’ – was removed from the data, model performance, as measured by both accuracy and the F1 score, experienced a significant decline. A similar drop in performance occurred when user posts themselves were excluded from the analysis. This suggests that misinformation detection isn’t solely reliant on content analysis, but heavily benefits from understanding how information spreads through a network and who is sharing it. The interplay between these elements – the content of posts and the connections between users – appears to be fundamental to effective detection, emphasizing the need for models to consider the broader information ecosystem rather than isolated data points.
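
The two ablations can be expressed as simple input transformations before training: drop all edges to remove structure, or zero out node features to remove post content. A sketch under those assumptions:

```python
# Hypothetical ablation helpers matching the two conditions described above.
import torch

def ablate_structure(x, edge_index):
    # Keep post features; delete every edge. The GNN then sees isolated nodes.
    return x, torch.empty((2, 0), dtype=torch.long)

def ablate_text(x, edge_index):
    # Keep the graph; blank out post embeddings so only structure remains.
    return torch.zeros_like(x), edge_index

# Train and evaluate the same model on the full inputs, ablate_structure(...),
# and ablate_text(...), then compare accuracy and F1 across the three runs.
```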

A robust information ecosystem hinges not simply on detecting misinformation after it spreads, but on preemptively curtailing its reach. Research indicates that proactive strategies – those focused on identifying and neutralizing false narratives at their origin or early stages of dissemination – are crucial for maintaining public trust and informed decision-making. These strategies involve analyzing content, user behavior, and network structures to pinpoint potentially misleading information before it gains widespread traction. By intervening early, it becomes possible to limit the cascade of falsehoods, protecting individuals from manipulation and fostering a digital environment where credible information prevails. Ultimately, a shift toward proactive mitigation represents a fundamental step in building a more resilient and trustworthy information landscape.

The construction of TAGFN prioritizes signal over noise. The dataset’s design, focused on text-attributed graphs for outlier detection, acknowledges the inherent complexity of information spread. This approach mirrors a core tenet of effective system building: reducing unnecessary elements to reveal underlying structure. As Tim Berners-Lee observed, “The web is more a social creation than a technical one.” TAGFN, by modeling the social context of news dissemination through graph structures, embodies this principle. It is not merely a collection of data points, but a representation of interconnected narratives, allowing for a clearer identification of anomalous behaviors indicative of misinformation.

What’s Next?

The introduction of TAGFN serves not as a culmination, but as a sharpening of the question. The dataset’s construction, while robust, inherently reflects the biases present in the source data and the limitations of current fact-checking methodologies. Future iterations must address not simply detecting falsehoods, but understanding the systemic vulnerabilities that allow them to propagate. The efficacy of LLM-based and GNN approaches on TAGFN is encouraging, yet the field risks becoming fixated on incremental gains in accuracy while neglecting the underlying causes of misinformation.

A true advance will require a shift in focus. Rather than treating fake news as an outlier to be detected, it should be understood as a natural consequence of complex information networks. The dataset facilitates outlier detection, yes, but it also implicitly invites investigation into the structure of those outliers. Why do certain narratives gain traction? What network properties make a system susceptible to manipulation? These are the questions that demand attention.

Ultimately, the value of TAGFN, and of similar resources, lies not in achieving perfect detection, but in providing a clear, uncluttered space for exploring the anatomy of deception. Perfection, in this domain, is not a destination but the disappearance of the problem itself: a vanishing point rarely, if ever, reached.


Original article: https://arxiv.org/pdf/2511.21624.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
