Author: Denis Avetisyan
A new framework leverages reinforcement learning to minimize inaccurate responses and enhance the reliability of question answering systems used in advertising platforms.

This paper introduces a reinforced co-adaptation approach combining GraphRAG with reinforcement learning to improve faithfulness and address hallucination issues in advertising question answering.
Despite the promise of Retrieval-Augmented Generation (RAG) for knowledge-intensive tasks, deploying it reliably in high-stakes industrial settings, particularly advertising question answering, remains challenging due to the relational nature of the knowledge, its frequent updates, and the need to align retrieval with generation objectives. This paper, ‘Towards Faithful Industrial RAG: A Reinforced Co-adaptation Framework for Advertising QA’, introduces a reinforced co-adaptation framework that jointly optimizes retrieval and generation via Graph-aware Retrieval and evidence-constrained reinforcement learning, demonstrably reducing hallucination and improving faithfulness. Experiments on an internal advertising QA dataset, followed by an A/B test, show significant gains in accuracy, user engagement, and, critically, URL validity. Can this approach pave the way for more trustworthy and effective RAG systems across similarly complex and dynamic industrial applications?
The Inevitable Drift of Industrial Knowledge
The dynamic nature of online advertising presents a unique challenge to conventional question answering systems. Unlike static knowledge domains, advertising platforms experience constant flux – new campaigns launch, targeting parameters shift, and performance data evolves by the minute. Traditional QA models, often trained on fixed datasets, struggle to keep pace with this rapid information turnover, leading to outdated or irrelevant responses. This is further complicated by the sheer scale and interconnectedness of advertising data, encompassing user behavior, creative assets, and budgetary constraints. Consequently, systems designed for simpler QA tasks frequently fail to deliver accurate, actionable insights in the fast-moving, high-stakes environment of industrial-scale advertising operations.
Large Language Models, despite their impressive capabilities, face a fundamental limitation in processing extensive information due to their finite context window – the amount of text they can consider at once. This poses a significant challenge when applied to industrial settings, such as online advertising, where knowledge bases are continually updated and incredibly vast. Effectively answering questions requires integrating information from numerous sources, exceeding the capacity of many models. Consequently, responses may be incomplete, lack crucial details, or rely on outdated information, hindering accurate decision-making and potentially impacting performance metrics. Overcoming this constraint is paramount to unlocking the full potential of these models in complex, data-rich environments, necessitating innovative approaches to knowledge retrieval and context management.
The reliability of question answering systems in dynamic industrial contexts, such as online advertising, is significantly threatened by the generation of inaccurate information, most notably through a phenomenon termed ‘URL Hallucination’, in which systems fabricate or incorrectly cite web addresses. This not only undermines user confidence but directly impacts the performance of advertising campaigns that rely on accurate data. The paper addresses this critical issue through reinforced co-adaptation, training the system to prioritize factual correctness and consistency with its knowledge base. The result is a substantial improvement in reliability: a 92.7% reduction in instances of URL Hallucination, paving the way for more trustworthy and effective automated systems within complex industrial applications.
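The paper does not publish its validation code, but the core check behind detecting URL Hallucination can be illustrated simply: any URL cited in a generated answer should also appear in the retrieved evidence. A minimal sketch, with all URLs and the helper name purely illustrative:

```python
import re

URL_RE = re.compile(r"https?://[^\s)\]>\"']+")

def find_hallucinated_urls(answer: str, evidence_docs: list[str]) -> list[str]:
    """Return URLs cited in the answer that never appear in the retrieved evidence."""
    evidence_urls = {u for doc in evidence_docs for u in URL_RE.findall(doc)}
    return [u for u in URL_RE.findall(answer) if u not in evidence_urls]

answer = "See https://ads.example.com/help and https://ads.example.com/made-up for details."
evidence = ["Official guide: https://ads.example.com/help explains campaign setup."]
print(find_hallucinated_urls(answer, evidence))  # → ['https://ads.example.com/made-up']
```

In a production system this signal could feed the reinforcement reward directly, penalizing any response that cites an address absent from the evidence.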

Mapping the Labyrinth: A Graph-Based Approach
Retrieval-Augmented Generation (RAG) serves as the base architecture, providing a mechanism to integrate external knowledge sources with a Large Language Model (LLM). However, standard RAG implementations often face limitations in complex reasoning tasks. GraphRAG builds upon this foundation by explicitly representing knowledge as a graph, allowing the LLM to traverse relationships between entities. This extends the capabilities of RAG by enabling the model to synthesize information from multiple documents connected through the knowledge graph, rather than being constrained by the information present within a fixed context window, thus improving the quality and accuracy of generated responses requiring cross-document inference.
GraphRAG utilizes a Knowledge Graph, implemented and hosted within Elasticsearch, to represent entities and the relationships between them. This graph structure allows the system to move beyond simple keyword-based retrieval and perform cross-document reasoning. Entities identified within a user query are mapped to nodes in the Knowledge Graph, and relationships, defined as edges between these nodes, are traversed to identify relevant supporting information, even if that information resides in documents not directly matching the query terms. The Knowledge Graph facilitates the identification of indirect connections and contextual dependencies, enabling a more comprehensive and nuanced understanding of the query and a more informed response from the Large Language Model.
Traditional Retrieval-Augmented Generation (RAG) systems are constrained by the fixed size of the context window, limiting the amount of information a Large Language Model (LLM) can process at once. GraphRAG addresses this limitation by utilizing a knowledge graph to dynamically retrieve relevant information. Instead of relying solely on keyword matching or vector similarity within a fixed document set, GraphRAG traverses relationships between entities within the graph. This allows the system to identify and retrieve information from multiple, interconnected documents, even if those documents were not initially identified as highly similar based on traditional retrieval methods. By leveraging graph connections, the system effectively expands the available context beyond the fixed window, providing the LLM with a more comprehensive and relevant knowledge base for generating responses.
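The traversal described above can be sketched as a bounded breadth-first walk over the entity graph: starting from the entities found in the query, follow edges up to a hop limit and collect the documents attached to each visited node. The entity names, relations, and document IDs below are invented for illustration; the paper's actual graph lives in Elasticsearch.

```python
from collections import deque

# Toy knowledge graph: entity -> list of (relation, neighbor) edges,
# plus documents attached to each entity. All names are illustrative.
EDGES = {
    "CampaignBudget": [("constrains", "BidStrategy")],
    "BidStrategy": [("affects", "AdRanking")],
    "AdRanking": [],
}
DOCS = {
    "CampaignBudget": ["doc_budget_limits"],
    "BidStrategy": ["doc_bid_strategies"],
    "AdRanking": ["doc_ranking_factors"],
}

def graph_retrieve(seed_entities, max_hops=2):
    """Collect documents reachable within max_hops of the query entities."""
    seen, docs = set(seed_entities), []
    frontier = deque((e, 0) for e in seed_entities)
    while frontier:
        entity, hops = frontier.popleft()
        docs.extend(DOCS.get(entity, []))
        if hops < max_hops:
            for _relation, neighbor in EDGES.get(entity, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    frontier.append((neighbor, hops + 1))
    return docs

print(graph_retrieve(["CampaignBudget"]))
# → ['doc_budget_limits', 'doc_bid_strategies', 'doc_ranking_factors']
```

Note how a query that only mentions budgets surfaces the ranking document two hops away, which keyword or vector similarity alone would likely miss.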

Sculpting Behavior: Reinforcement Learning as a Refinement Process
Reinforcement Learning (RL) was implemented as a post-training fine-tuning method for the Large Language Model (LLM) to specifically enhance its performance in industrial Question Answering (QA) tasks. This approach moves beyond supervised learning by allowing the LLM to learn through trial and error, optimizing its generation process based on received rewards. The RL framework enables iterative refinement of the LLM’s responses, focusing on qualities crucial for industrial applications, such as accuracy, relevance, and adherence to specific style guidelines. By framing the generation process as a sequential decision-making problem, RL facilitates optimization of long-term reward signals, leading to improved overall QA performance.
The Group Relative Policy Optimization (GRPO) algorithm was selected for fine-tuning the Large Language Model due to its demonstrated robustness in scenarios involving non-stationary and noisy reward signals. GRPO constrains policy updates to remain within a safe range, preventing drastic performance drops during training. This is particularly critical in industrial Question Answering applications, where reward signals derived from user feedback can be inherently variable and delayed. The algorithm's implementation incorporates a clipped surrogate objective function and an adaptive KL penalty coefficient, enabling stable learning even with complex reward structures and minimizing the risk of policy collapse. This approach facilitates consistent improvement in generation quality and prevents the model from exploiting spurious correlations in the reward signal.
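The article gives no formulas or hyperparameters, but the standard GRPO objective it alludes to combines a PPO-style clipped ratio with advantages computed relative to a group of sampled responses rather than a learned value function. A minimal sketch of that loss, with all values illustrative:

```python
import math

def grpo_loss(logps_new, logps_old, rewards, clip_eps=0.2):
    """Clipped surrogate loss with group-relative advantages (GRPO-style sketch).

    logps_new/logps_old: per-response sequence log-probs under the new/old policy.
    rewards: scalar reward for each sampled response in the group.
    """
    # Group-relative advantage: standardize rewards within the sampled group.
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8
    advantages = [(r - mean) / std for r in rewards]

    losses = []
    for lp_new, lp_old, adv in zip(logps_new, logps_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # importance ratio pi_new / pi_old
        clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
        # Pessimistic (min) objective, negated to form a loss.
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)

loss = grpo_loss(logps_new=[-1.0, -1.2], logps_old=[-1.1, -1.1], rewards=[1.0, 0.0])
print(round(loss, 4))
```

A full implementation would add the adaptive KL penalty against a reference policy that the article mentions; it is omitted here for brevity.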
The generation process is optimized through a multi-dimensional reward function that evaluates output across four key criteria: Evidence Faithfulness, Style Compliance, Safety, and the absence of URL Hallucination. This function assigns a reward score based on adherence to provided evidence, consistency with a defined style guide, avoidance of harmful or inappropriate content, and accurate citation of sources. Implementation of this reward function during reinforcement learning training resulted in a quantifiable improvement in output quality, specifically a 28.6% increase in like-rate and a corresponding 46.2% reduction in dislike-rate, indicating enhanced user satisfaction and reliability of generated responses.
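The article does not disclose how the four reward dimensions are combined; a common and simple choice is a weighted sum of per-dimension scores, sketched below. The weights and the scoring scale are assumptions, not the paper's actual mixture.

```python
def composite_reward(evidence_faithful: float, style_ok: float,
                     safe: float, urls_valid: float,
                     weights=(0.4, 0.2, 0.2, 0.2)) -> float:
    """Weighted sum of the four reward dimensions, each scored in [0, 1].

    The weights here are illustrative; the paper does not report its mixture.
    """
    scores = (evidence_faithful, style_ok, safe, urls_valid)
    return sum(w * s for w, s in zip(weights, scores))

# A response that is faithful, on-style, and safe but cites a bad URL is penalized.
print(composite_reward(1.0, 1.0, 1.0, 0.0))  # → 0.8
```

Scalar rewards of this shape plug directly into the GRPO objective described above, so a hallucinated URL lowers the advantage of that sampled response relative to its group.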

Practicality and Scale: Architecting for Real-World Deployment
Adapting pre-trained large language models to specialized industrial question answering demands efficient techniques, and this system leverages both Low-Rank Adaptation (LoRA) and Supervised Fine-Tuning (SFT) to achieve this. LoRA minimizes the number of trainable parameters – effectively ‘teaching’ the model new skills without extensive retraining – while SFT refines the model’s responses using a targeted dataset of question-answer pairs. This combined approach allows powerful models like DeepSeek-V3 and Qwen3-32B to be quickly and effectively customized for industrial QA, reducing computational cost and development time compared to full model retraining. The result is a system capable of delivering accurate and relevant answers to complex industrial queries with significantly less resource investment.
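The parameter savings from LoRA come from freezing the pretrained weight matrix W and learning only a low-rank update (alpha/r)·BA, with B initialized to zero so training starts from the unmodified model. A dependency-free sketch with tiny, illustrative dimensions:

```python
import random

random.seed(0)
d, r = 4, 1  # hidden size and LoRA rank (illustrative values)

def matmul(X, Y):
    """Plain-Python matrix multiply for the sketch."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]   # frozen weight
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # trainable
B = [[0.0] * r for _ in range(d)]  # zero init: the LoRA delta starts at 0
alpha = 16.0                       # LoRA scaling hyperparameter

def lora_forward(x):
    """y = W x + (alpha / r) * B (A x); only A and B are updated in training."""
    col = [[v] for v in x]
    base = matmul(W, col)
    delta = matmul(B, matmul(A, col))
    return [b[0] + (alpha / r) * dl[0] for b, dl in zip(base, delta)]

x = [1.0, 2.0, 3.0, 4.0]
base_only = [row[0] for row in matmul(W, [[v] for v in x])]
assert lora_forward(x) == base_only  # B == 0, so output matches the frozen model
print("trainable:", r * d + d * r, "vs full:", d * d)
```

At realistic scale (d in the thousands, r of 8 to 64), the trainable fraction shrinks to well under one percent of the full matrix, which is what makes rapid adaptation of models like Qwen3-32B economical.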
To facilitate high-performance model serving, the system leverages the vLLM framework, a fast and easy-to-use library for LLM inference and serving. This framework employs techniques like PagedAttention to optimize memory usage and drastically improve throughput, enabling the system to process a significantly higher volume of queries concurrently. By minimizing latency, the delay between a query and a response, vLLM ensures a responsive user experience, crucial for real-time industrial question answering applications. The efficient architecture of vLLM is instrumental in scaling the system to handle substantial workloads while maintaining consistently low response times, ultimately contributing to the system's overall effectiveness and practicality in a production setting.
The system's architecture prioritizes practical implementation, allowing for swift deployment and scalability under demanding production workloads. Integrating LoRA and SFT with powerful language models, specifically Qwen3-32B, yields substantial performance gains, and reinforcement learning further refines the model, reducing the hallucination rate from 0.0047 to 0.0013, a 72% relative reduction. This improvement in reliability is coupled with gains in response quality: a 3.73-point increase in ROUGE-L score (reaching 37.00 with DeepSeek-V3.2) and 84.60% accuracy on the FaithEval-Inconsistent benchmark, signifying a robust capacity to deliver consistent, trustworthy answers even when faced with complex or contradictory information.

The pursuit of ‘faithful’ question answering, as detailed in this reinforced co-adaptation framework, echoes a fundamental truth about complex systems. The paper rightly focuses on mitigating hallucination, a decay in informational integrity, and on improving reliability. This resonates with the understanding that architectures, however meticulously planned, are prophecies of future failure. Tim Berners-Lee observed, “The Web is more a social creation than a technical one.” This speaks to the core concept within the paper: simply building a technically sound system is insufficient. True robustness emerges from the co-adaptation of components, in this case GraphRAG and reinforcement learning, and, crucially, from acknowledging the inherent entropy within any information ecosystem. The system doesn't resist decay; it adapts to it.
What’s Next?
The pursuit of ‘faithful’ question answering, as demonstrated by this work, inevitably reveals the inherent fragility of constructed knowledge. This framework co-adapts a graph-based retrieval mechanism with reinforcement learning, a commendable effort to tether responses to verifiable sources. However, the system doesn’t solve hallucination; it merely shifts the locus of failure. The reinforcement signal itself becomes another dependency, susceptible to the biases embedded within the training data and the limitations of the reward function. One bolsters a system against untruth, only to discover it is now faithfully replicating preferred untruths.
The emphasis on URL validity is a practical concession, an acknowledgement that the architecture cannot fully guarantee semantic correctness. Each added validation step is a prophecy of future circumvention. Adversarial examples will not cease; they will evolve. The network of dependencies expands, each connection a potential point of cascading failure. The question is not whether the system will err, but where and how it will fall.
Future work will undoubtedly focus on refining the reinforcement signal, expanding the knowledge graph, and developing more robust validation techniques. But these are merely tactical adjustments. The deeper challenge lies in recognizing that these systems are not built; they are grown, complex ecosystems prone to unpredictable emergent behavior. The illusion of control is comforting, but ultimately, everything connected will someday fall together.
Original article: https://arxiv.org/pdf/2602.22584.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-02-28 04:59