Author: Denis Avetisyan
A new approach leverages issue tracking data and advanced language models to automatically generate insightful explanations of software behavior, offering a powerful alternative to traditional documentation.

This review demonstrates how Retrieval-Augmented Generation (RAG) using software engineering artifacts improves explainability and maintainability.
Modern software complexity often outpaces traditional documentation, hindering understanding and trust in system behavior. This challenge motivates the research presented in ‘From Issues to Insights: RAG-based Explanation Generation from Software Engineering Artifacts’, which introduces a novel approach leveraging issue-tracking data and Retrieval-Augmented Generation (RAG) to automatically create accurate and faithful explanations. Our work demonstrates that RAG can achieve 90% alignment with human-written explanations, offering a dynamic alternative to static documentation. Could this method extend beyond black-box machine learning, providing accessible insight into the behavior of a wider range of software systems?
The Challenge of Context in Issue Tracking
Modern software development relies heavily on issue-tracking systems, which accumulate tremendous volumes of data detailing bugs, feature requests, and system anomalies. However, this data often remains a fragmented collection of reports, timestamps, and code references, failing to provide a cohesive narrative explaining why a problem occurred or its broader context. While these systems excel at logging information, they typically lack the capacity to synthesize it into easily digestible explanations for developers or stakeholders. This creates a significant gap between data collection and actionable insight, forcing teams to spend considerable time manually sifting through records to reconstruct the events leading to an issue – a process that is both inefficient and susceptible to human interpretation biases. Consequently, valuable knowledge remains locked within the raw data, hindering effective problem resolution and preventing the systematic learning needed to improve software quality.
The sheer volume of data generated by modern issue-tracking systems presents a significant bottleneck in software development workflows. While these systems dutifully record details of every reported problem, extracting meaningful insights requires substantial manual effort. This process is not only incredibly time-consuming, demanding valuable developer hours, but also introduces inconsistencies due to subjective interpretation and varying levels of expertise. Such inconsistencies impede efficient problem resolution, as different individuals may arrive at different conclusions from the same data. Crucially, this lack of standardized interpretation hinders effective knowledge sharing within development teams and across organizations, creating silos of information and potentially leading to repeated errors and duplicated effort.
Despite this wealth of recorded detail, much of the data in issue-tracking systems remains untapped, hindering effective problem resolution and knowledge dissemination. Automated explanation generation offers a solution by transforming raw data into readily understandable narratives detailing the root causes and context surrounding reported issues. Recent advancements in this field demonstrate a significant leap forward; one approach, in particular, achieves over 90% alignment with human-written explanations, as judged by expert review. This high degree of fidelity suggests a pathway toward dramatically improved developer productivity, reduced time-to-resolution, and a more efficient means of capturing and sharing crucial technical insights within software development teams.
Retrieval-Augmented Generation: A System for Contextualized Explanations
The Retrieval-Augmented Generation (RAG) approach leverages the structured data present in issue-tracking systems and integrates it with the text generation capabilities of Large Language Models (LLMs). This combination addresses limitations inherent in LLMs, such as a lack of access to specific, current information and potential for generating inaccurate or irrelevant responses. By utilizing issue-tracking data – which includes problem reports, resolutions, and associated metadata – as a knowledge source, RAG enables LLMs to provide contextually relevant and factually grounded explanations. The system retrieves pertinent information from the issue data and presents it to the LLM, effectively augmenting its knowledge base and improving the quality and reliability of generated explanations.
The Retrieval Component is central to the system’s functionality, employing techniques to locate and extract pertinent data from the issue tracking database. This component utilizes semantic search and keyword matching algorithms to identify issues, comments, and associated metadata relevant to a given query or reported problem. Retrieved data is then formatted and provided as context to the Large Language Model. Performance metrics indicate a high degree of accuracy in identifying relevant information, with the component consistently retrieving data that aligns with the query’s intent.
The Retrieval-Augmented Generation (RAG) system achieves high explanation quality by providing Large Language Models (LLMs) with relevant issue data prior to text generation. This “grounding” process ensures responses are directly tied to documented problems, increasing accuracy and informativeness. Internal evaluations consistently demonstrate document relevance scores between 94% and 100% across all tested LLM models, indicating the retrieval component effectively identifies and supplies the necessary context for generating reliable explanations.
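To make the grounding step concrete, the sketch below shows how retrieved issue text might be assembled into a prompt before generation. The prompt wording, function names, and the generic `retriever` and `llm_generate` callables are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of the "grounding" step: retrieved issue data is placed
# in the prompt so the LLM's explanation stays tied to documented problems.
# Prompt wording and callables are assumptions, not the authors' exact code.

def build_grounded_prompt(question: str, retrieved_issues: list[str]) -> str:
    """Assemble a prompt that restricts the LLM to the retrieved issue context."""
    context = "\n\n---\n\n".join(retrieved_issues)
    return (
        "You are explaining software behaviour to a developer.\n"
        "Answer using ONLY the issue excerpts below; say so if they are insufficient.\n\n"
        f"Issue excerpts:\n{context}\n\n"
        f"Question: {question}\n"
        "Explanation:"
    )

def explain(question: str, retriever, llm_generate, top_k: int = 5) -> str:
    """retriever(question, k) -> list of issue text chunks; llm_generate(prompt) -> str."""
    docs = retriever(question, top_k)
    return llm_generate(build_grounded_prompt(question, docs))
```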

Constructing the Semantic Search Pipeline
The multi-qa-mpnet-base-dot-v1 sentence transformer model is utilized to generate dense vector embeddings from issue data. This model, a variant of the MPNet architecture, encodes textual information into high-dimensional vectors, capturing semantic meaning. Each issue, or segment of an issue, is transformed into a vector representation, enabling similarity comparisons based on meaning rather than keyword matches. The resulting embeddings are numerical representations of the text, facilitating efficient storage and retrieval within the vector database and forming the basis for semantic search functionality. The model outputs a 768-dimensional vector for each text input.
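As a brief illustration, the snippet below encodes issue text with the same Sentence-Transformers model; the issue texts are invented examples, and only the model name and the 768-dimensional output reflect the description above.

```python
# Encoding issue text into 768-dimensional dense vectors with
# multi-qa-mpnet-base-dot-v1 (issue texts here are invented examples).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")

issue_texts = [
    "Login fails with HTTP 500 when the session token has expired",
    "Feature request: export the audit log as CSV",
]
# The model is tuned for dot-product similarity, so vectors are left unnormalized.
embeddings = model.encode(issue_texts, normalize_embeddings=False)
print(embeddings.shape)  # (2, 768): one 768-dimensional vector per issue text
```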
The Chroma Vector Database functions as the storage and retrieval mechanism for the dense vector embeddings generated from issue data. It utilizes an optimized similarity search algorithm, specifically hierarchical navigable small world (HNSW), to enable rapid identification of the most relevant vectors to a given query. This allows for sub-second retrieval of semantically similar issues, even with datasets containing hundreds of thousands of entries. Chroma’s architecture supports both in-memory and persistent storage, providing flexibility for different deployment scenarios and scaling requirements. The database indexes the high-dimensional vector space, significantly reducing the computational cost of finding nearest neighbors compared to a brute-force linear search.
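Continuing the embedding sketch above, the following example stores those vectors in a Chroma collection configured for inner-product (dot-product) search over its HNSW index and runs a query. The collection name, identifiers, and metadata fields are hypothetical, not taken from the paper.

```python
# Storing issue embeddings in Chroma and querying the HNSW index
# (collection name, ids, and metadata fields are illustrative).
import chromadb

client = chromadb.PersistentClient(path="./issue_index")  # on-disk, persistent store
collection = client.get_or_create_collection(
    name="issue_chunks",
    metadata={"hnsw:space": "ip"},  # inner-product space, matching a dot-product embedder
)

collection.add(
    ids=["ISSUE-101#0", "ISSUE-102#0"],
    documents=issue_texts,            # raw chunk text from the embedding step above
    embeddings=embeddings.tolist(),   # precomputed 768-dimensional vectors
    metadatas=[{"issue": "ISSUE-101"}, {"issue": "ISSUE-102"}],
)

query_vec = model.encode("Why do users get logged out unexpectedly?").tolist()
hits = collection.query(query_embeddings=[query_vec], n_results=3)
```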
Recursive Chunking is employed as a document segmentation technique to address the limitations of fixed-size text splitting when creating vector embeddings for semantic search. This process involves initially dividing documents into chunks, then recursively splitting any chunks exceeding a predetermined token limit until all segments fall within the specified size. Importantly, the algorithm prioritizes maintaining semantic coherence during splitting, utilizing sentence boundary detection and overlap between adjacent chunks to preserve contextual information. This strategy improves retrieval performance by ensuring that relevant information is not fragmented across multiple embeddings, and that the LLM receives complete, meaningful context, ultimately leading to more accurate search results.
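The paper's exact chunking parameters are not reproduced here, but one widely used implementation of this idea is LangChain's RecursiveCharacterTextSplitter, sketched below with placeholder size and overlap values. Note that this splitter measures chunk size in characters by default; enforcing a true token limit would require a tokenizer-backed variant.

```python
# One common implementation of recursive, overlap-aware chunking
# (chunk size and overlap are placeholder values, not the paper's settings).
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,        # upper bound per chunk, counted in characters by default
    chunk_overlap=64,      # shared context between adjacent chunks
    separators=["\n\n", "\n", ". ", " ", ""],  # split at the largest boundary that fits
)

long_issue_text = open("ISSUE-101.txt").read()  # hypothetical exported issue thread
chunks = splitter.split_text(long_issue_text)   # every chunk stays within the size limit
```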
The retrieval pipeline is designed to maximize the quality of information provided to the Large Language Model (LLM), directly impacting the accuracy of generated explanations. Performance is benchmarked using an Answer vs. Reference (Accuracy) metric, and the pipeline consistently achieves a score of ≥ 0.8. This level of accuracy is maintained across various LLM configurations, with stronger model variants demonstrating improved performance and the ability to leverage the pipeline’s output for more nuanced and comprehensive explanations. The pipeline’s efficiency in delivering pertinent data minimizes irrelevant context, enabling the LLM to focus on key information and produce high-quality results.
Quantifying Explanation Quality with LLM-Based Metrics
Determining the quality of explanations requires robust and objective measurement, and recent advancements in Large Language Models (LLMs) now provide the tools to do so. Instead of relying on subjective human assessments, researchers are employing LLMs to evaluate key characteristics of explanations, such as their faithfulness to the source material and their relevance to the query. This automated approach allows for scalable and consistent evaluation, moving beyond the limitations of manual review. By leveraging the reasoning capabilities of LLMs, it becomes possible to quantify explanation quality with greater precision, enabling systematic comparison of different explanation methods and driving improvements in the clarity and trustworthiness of AI systems. This shift towards LLM-based metrics promises a more data-driven and reliable pathway for building explainable AI.
To rigorously evaluate the quality of automatically generated explanations, researchers leveraged the capabilities of IBM’s Granite 3.1 Dense language model. This powerful tool was employed to assess two critical dimensions: faithfulness, verifying that the explanation accurately reflects the source information without hallucination, and relevance, confirming the explanation directly addresses the query and provides meaningful insight. By utilizing Granite 3.1 Dense, the evaluation process moved beyond subjective human judgment, enabling a quantifiable and consistent measurement of explanation quality. This automated approach allowed for a detailed comparison of different explanation generation techniques, ultimately driving improvements in the clarity, accuracy, and overall usefulness of the generated content.
The implementation of an automated evaluation pipeline enables a precise measurement of how variations in information retrieval and text generation techniques influence the quality of explanations. By systematically altering these strategies – for instance, modifying the search parameters used to gather supporting evidence or experimenting with different decoding algorithms during text creation – researchers can now quantify the resulting impact on both faithfulness and relevance. This granular level of analysis moves beyond subjective assessments, providing concrete data on which approaches yield the most accurate and helpful explanations, and facilitating a more data-driven optimization of explanation generation systems. The ability to rigorously compare strategies unlocks opportunities for targeted improvements and a deeper understanding of the factors that contribute to effective explanatory AI.
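A minimal sketch of such an LLM-as-judge check is shown below. The rubric wording, the 0-to-1 scale, and the generic `llm_generate` callable (which could wrap Granite 3.1 Dense or any other judge model) are assumptions for illustration, not the paper's evaluation prompts.

```python
# Illustrative LLM-as-judge scoring for faithfulness; the rubric and scale
# are assumptions, not the paper's exact evaluation prompts.
import re

FAITHFULNESS_PROMPT = """You are grading an explanation of a software issue.
Context (retrieved issue data):
{context}

Explanation to grade:
{explanation}

On a scale from 0.0 to 1.0, how faithful is the explanation to the context,
i.e. does it avoid claims not supported by the context? Reply with a number only."""

def judge_faithfulness(context: str, explanation: str, llm_generate) -> float:
    """llm_generate(prompt) -> str, e.g. a call to Granite 3.1 Dense or another judge."""
    reply = llm_generate(FAITHFULNESS_PROMPT.format(context=context, explanation=explanation))
    match = re.search(r"\d+(\.\d+)?", reply)
    return float(match.group()) if match else 0.0
```

An analogous prompt can score relevance or compare a generated answer against a human-written reference, allowing the same pipeline to report all three metrics consistently.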
Evaluations reveal a substantial enhancement in explanation quality through this automated approach. Across various models tested, generated explanations consistently achieved a faithfulness score of 90% or higher, indicating strong alignment with the source information. Furthermore, these explanations demonstrated exceptional helpfulness, exceeding 98% in assessments designed to gauge their utility to a user. Critically, the system’s output exhibited over 90% agreement with human-authored reference explanations, validating its capacity to produce explanations comparable in quality to those crafted by experts and suggesting a reliable means of quantifying explanation effectiveness.
Future Pathways: Scaling with Open-Weight LLMs
Open-weight Large Language Models (LLMs) represent a significant shift in automated explanation generation by providing a viable and adaptable alternative to traditionally closed-source, proprietary models. The accessibility of these open-weight LLMs dramatically lowers the financial barriers to entry, as organizations avoid substantial licensing fees and usage costs associated with commercial offerings. This cost-effectiveness is coupled with increased flexibility; developers are no longer constrained by the limitations of a fixed API and can directly modify, fine-tune, and deploy the model to suit specific needs and integrate seamlessly into existing infrastructure. The ability to inspect and alter the model’s internal workings fosters innovation and allows for the creation of highly specialized explanation systems tailored to unique software architectures and user requirements, ultimately promoting broader adoption and customization within the software development lifecycle.
Open-weight Large Language Models unlock possibilities for nuanced control over the explanation generation process, a level of adaptability often unavailable with closed-source alternatives. Developers aren’t limited to pre-defined outputs; instead, they can meticulously tailor the model’s behavior through techniques like prompt engineering and fine-tuning with specific datasets. This granular control extends to aspects such as explanation length, complexity, and even stylistic tone, enabling explanations to be precisely matched to the target audience and the specific software component being described. Furthermore, the ability to directly access and modify the model’s weights facilitates experimentation with novel explanation strategies and the integration of domain-specific knowledge, ultimately fostering more effective and insightful automated explanations.
Ongoing research is actively investigating novel Large Language Model (LLM) architectures, moving beyond the standard transformer designs to potentially unlock more efficient and nuanced explanation generation. This includes exploring sparse models, mixture-of-experts systems, and attention mechanisms tailored for reasoning and clarity. Simultaneously, advanced fine-tuning strategies are being developed, such as reinforcement learning from human feedback and curriculum learning, to guide LLMs toward producing explanations that are not only accurate but also readily understandable and tailored to specific user needs. The ultimate goal is to move beyond generic explanations and create systems capable of generating customized, high-quality explanations that significantly enhance comprehension and trust in automated reasoning systems.
The increasing availability of open-weight large language models promises a significant shift in software development practices by lowering the barrier to entry for automated explanation generation. Previously, sophisticated tools capable of articulating the reasoning behind code or system behavior were largely confined to organizations with substantial resources. Now, with accessible and customizable models, a broader spectrum of developers and organizations can integrate these capabilities into their workflows. This democratization fosters not only a deeper understanding of complex systems but also accelerates debugging, enhances code maintainability, and ultimately empowers developers to build more robust and transparent applications, irrespective of their institutional size or budget.
The pursuit of explainability in software engineering, as demonstrated by this work on Retrieval-Augmented Generation, hinges on recognizing the interconnectedness of system components. A seemingly isolated issue, when viewed through the lens of its associated artifacts, reveals a cascade of dependencies and influences. This echoes Edsger W. Dijkstra's sentiment: "It's not enough to have good ideas; you must also be able to express them." The paper skillfully translates complex software behavior, gleaned from issue tracking and documentation, into coherent explanations. This isn't merely about presenting information; it's about constructing a narrative that illuminates the 'why' behind the 'what', ultimately fostering a more maintainable and scalable understanding of the system as a whole. The focus on dynamic explanations, derived from live data, moves beyond static documentation, creating a system where clarity scales with complexity.
Looking Ahead
The presented work suggests a path towards more resilient knowledge systems within software development. Current documentation often resembles meticulously crafted monuments – impressive, yet slow to adapt as the underlying code evolves. This approach, leveraging issue tracking as a living record, offers an alternative: infrastructure that evolves without rebuilding the entire block. However, the system’s fidelity remains tightly bound to the quality of the source data. Noise within issue reports, or gaps in their coverage, will inevitably propagate into the generated explanations.
A critical next step lies in understanding the limitations of this reliance on historical data. Can the system proactively anticipate explanations, rather than merely reacting to reported issues? Exploring techniques to infer implicit knowledge – the ‘why’ behind the code, not just the ‘what’ – will be crucial. Furthermore, a robust evaluation framework is needed, moving beyond simple accuracy metrics to assess the usefulness of these explanations for developers facing real-world problems.
Ultimately, the goal isn’t simply to automate explanation generation, but to foster a deeper understanding of software systems. The challenge, as always, is to build systems that reflect the inherent complexity of the domain, rather than imposing artificial order upon it. A future direction might involve integrating this approach with formal verification methods, bridging the gap between dynamic, data-driven insights and rigorous, provable guarantees.
Original article: https://arxiv.org/pdf/2601.05721.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/