Author: Denis Avetisyan
A new method reveals critical weaknesses in today’s most powerful AI systems and highlights shortcomings in how we measure their abilities.

Researchers introduce Competency Gaps, a technique leveraging sparse autoencoders to identify and quantify gaps in concept coverage within large language models and their associated benchmark datasets.
Despite the increasing reliance on standardized benchmarks for evaluating large language models, aggregated metrics often obscure specific weaknesses and biases within both the models themselves and the benchmarks used to assess them. This work, ‘Uncovering Competency Gaps in Large Language Models and Their Benchmarks’, introduces a novel method leveraging sparse autoencoders to automatically identify these ‘competency gaps’: areas where models underperform or benchmarks lack comprehensive coverage. By grounding evaluation in model representations, the approach reveals consistent vulnerabilities related to asserting boundaries and safety, alongside benchmark imbalances favoring obedience and instruction-following. Could a more granular, representation-based evaluation unlock pathways to more robust and reliably aligned language models?
Unveiling the Foundations of LLM Competency
Despite their ability to generate remarkably human-like text and achieve state-of-the-art results on many natural language processing tasks, Large Language Models (LLMs) consistently demonstrate unexpected limitations when faced with certain challenges. These aren’t simply errors, but rather systemic gaps in competency revealed across a surprisingly broad range of benchmarks – from logical reasoning and common sense knowledge to nuanced understanding of physical interactions and complex instructions. For instance, a model might excel at summarizing news articles, yet struggle with basic arithmetic or identifying subtle contradictions in a given narrative. This inconsistency suggests that LLMs, while proficient at pattern recognition and statistical associations, often lack genuine comprehension and can be easily misled by adversarial examples or ambiguous phrasing, highlighting a crucial need to move beyond aggregate performance metrics and investigate these vulnerabilities in detail.
Current methods for assessing Large Language Models frequently rely on aggregate scores across benchmark datasets, offering a broad overview of performance but failing to dissect where precisely a model struggles and, crucially, why. This lack of granular insight presents a significant obstacle to meaningful progress; simply knowing a model performs poorly on a task doesn’t illuminate the root cause – is it a deficiency in reasoning, factual knowledge, understanding nuanced language, or an inability to generalize? Without pinpointing these weaknesses, developers are left to rely on broad, often inefficient, strategies for improvement, hindering the potential for targeted interventions and optimized model refinement. Consequently, advancements are slowed, and the path towards truly robust and reliable artificial intelligence remains obscured by a lack of diagnostic precision.
The pursuit of truly robust Large Language Models necessitates a shift from broad performance metrics to granular weakness detection. Current evaluation relies heavily on benchmark scores, offering limited insight into where models falter and why. Consequently, researchers are increasingly focused on developing automated systems capable of systematically probing LLMs across a diverse range of tasks and identifying specific limitations – be it logical reasoning, common sense understanding, or susceptibility to adversarial prompts. These systems aim to move beyond simply measuring success or failure, instead dissecting the internal processes that lead to errors and pinpointing the root causes of inconsistent performance. This targeted approach promises to accelerate model improvement, allowing developers to address fundamental flaws rather than merely optimizing for surface-level gains and ultimately building more reliable and trustworthy AI systems.

Dissecting LLMs: A Sparse Autoencoder Approach
The Competency Gaps (CG) method employs Sparse Autoencoders (SAE) as a dimensionality reduction technique applied to the internal activations of Large Language Models (LLMs). LLMs generate high-dimensional vector representations of input tokens; SAEs compress these representations into a lower-dimensional space while preserving key information. This is achieved through an autoencoding process where the SAE learns to reconstruct the original input from the compressed representation, forcing it to identify and retain the most salient features. The sparsity constraint within the SAE architecture encourages the development of a compressed representation where only a limited number of dimensions are activated for any given input, effectively isolating potentially meaningful features or ‘competencies’ within the LLM’s internal state. The resulting lower-dimensional representation facilitates subsequent analysis and interpretation of the LLM’s behavior.
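To make the compression step concrete, the following is a minimal sketch of a sparse autoencoder over LLM hidden states, written in PyTorch. The layer choice, dictionary size, and L1 penalty weight are illustrative assumptions, not the paper’s actual configuration.

```python
# Minimal sparse autoencoder sketch: compress LLM activations into a sparse
# concept space and reconstruct them, penalizing dense codes.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # maps activations to concept space
        self.decoder = nn.Linear(d_dict, d_model)   # reconstructs the original activation

    def forward(self, h: torch.Tensor):
        z = torch.relu(self.encoder(h))             # non-negative codes encourage sparsity
        h_hat = self.decoder(z)
        return h_hat, z

def sae_loss(h, h_hat, z, l1_weight: float = 1e-3):
    # Reconstruction fidelity plus an L1 sparsity penalty on the codes;
    # the penalty weight here is a placeholder value.
    recon = torch.mean((h - h_hat) ** 2)
    sparsity = l1_weight * z.abs().mean()
    return recon + sparsity
```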
Autointerpretability, when applied to the dimensionality-reduced space created by Sparse Autoencoders, involves identifying the textual prompts that most strongly activate each dimension. This process establishes a ‘concept dictionary’ in which each SAE dimension, representing a key feature within the LLM’s internal representations, is associated with a human-readable label. Specifically, a set of prompts is evaluated through the SAE, and the prompts that produce the highest activation on a given dimension are taken as representative of that dimension’s function; the associated text then serves as the label. This maps abstract, numerical features to interpretable concepts such as ‘positive sentiment’, ‘historical figures’, or ‘programming syntax’, facilitating analysis of the LLM’s internal workings.
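A hedged sketch of this labeling step follows, reusing the SparseAutoencoder above: it collects, for each SAE dimension, the prompts that activate it most strongly, which a separate labeling pass would then summarize into a concept name. The `get_activations` helper and the top-k selection are assumptions for illustration.

```python
# Hedged sketch: gather the k prompts that most strongly activate each SAE
# dimension. `get_activations` is an assumed helper returning the LLM hidden
# state for a prompt; the final LLM-based labeling pass is left abstract.
from collections import defaultdict

def top_activating_prompts(sae, prompts, get_activations, k=5):
    scores = defaultdict(list)                      # dimension -> [(activation, prompt)]
    for prompt in prompts:
        h = get_activations(prompt)                 # shape: (d_model,)
        _, z = sae(h)                               # sparse code, shape: (d_dict,)
        for dim, value in enumerate(z.tolist()):
            scores[dim].append((value, prompt))
    # Keep the k highest-activation prompts per dimension; a separate LLM pass
    # would then summarize them into a human-readable concept label.
    return {dim: sorted(vals, key=lambda t: t[0], reverse=True)[:k]
            for dim, vals in scores.items()}
```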
The Concept Activation Score (CAS) provides a numerical assessment of the extent to which a defined concept, as represented by a specific dimension learned by the Sparse Autoencoder, is present within a given input token sequence. CAS is calculated by projecting the token sequence’s internal representation onto the concept dimension, and then applying a sigmoid function to normalize the result between 0 and 1. A higher CAS value indicates a stronger activation of that concept within the input; values closer to 0 suggest minimal or no activation. This quantification allows for comparative analysis of concept presence across different inputs and facilitates the identification of biases or sensitivities within the Large Language Model (LLM).
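The computation described above can be sketched as follows, again assuming the SparseAutoencoder from earlier. Mean-pooling the token representations into a single vector and applying the sigmoid directly to the projection are assumptions; the paper’s exact pooling and scaling may differ.

```python
# Hedged sketch of a Concept Activation Score: project the pooled sequence
# representation onto a learned concept direction and squash with a sigmoid.
import torch

def concept_activation_score(sae, hidden_states: torch.Tensor, concept_dim: int) -> float:
    # hidden_states: (seq_len, d_model) activations for one input sequence.
    pooled = hidden_states.mean(dim=0)                        # pooling choice is an assumption
    direction = sae.encoder.weight[concept_dim]               # learned concept direction
    projection = torch.dot(pooled, direction) + sae.encoder.bias[concept_dim]
    return torch.sigmoid(projection).item()                   # normalized to (0, 1)
```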

Quantifying Coverage and Pinpointing Systematic Errors
Evaluation results indicate substantial performance variation across different concepts tested in language models. Analysis reveals that while models demonstrate competency in certain areas, systematic underperformance is observed in specific concept categories; this is not simply random error, but rather a consistent inability to accurately process or generate content related to these concepts. This variability suggests limitations in the models’ underlying knowledge representation and generalization capabilities, and highlights the need for targeted evaluation and improvement strategies focused on addressing these identified weaknesses.
Cross-benchmark analysis indicates that existing evaluation benchmarks frequently exhibit incomplete coverage of relevant concepts, contributing to an inaccurate assessment of model capabilities. Specifically, a systematic review revealed missing concepts within established benchmarks; for instance, a complete mapping was achieved between AHA key qualities and SAE concepts, attaining 100% coverage where prior benchmarks were deficient. This highlights a limitation in relying solely on existing benchmarks for comprehensive model evaluation and underscores the need for more exhaustive concept inventories to accurately gauge performance across a wider range of abilities.
Competency Gap (CG) analysis demonstrates applicability across diverse large language model (LLM) architectures, including Llama 3.1 8B Instruct and Gemma 2-2B-Instruct. Evaluation using these open-source LLMs confirmed CG’s ability to recover all 43 model gaps initially identified by the AutoDetect framework, achieving 100% recovery. Furthermore, CG identified additional gaps in model understanding that were not present within the scope of the AutoDetect assessment, indicating its capacity for broader knowledge gap detection.

Automated Vulnerability Discovery and Robustness Testing
Current methodologies for ensuring large language model (LLM) safety often rely on manual review, a process that is both time-consuming and prone to oversight. Automated frameworks such as AutoDetect and garak address this by systematically probing models for vulnerabilities, and the Competency Gaps (CG) method extends their analytical reach. By deconstructing LLM prompts and responses at the level of concept activations, it identifies potential weaknesses related to harmful content generation, bias, or unintended behaviors. Analyzing the underlying structure of language in this way allows developers to pinpoint areas where models might fail, implement targeted improvements, and build more robust and secure LLMs before deployment. This automated approach represents a significant step towards scalable and reliable LLM safety assessments.
Arena-Hard-Auto represents a methodology for evaluating large language models by leveraging another LLM as an impartial judge. Rather than relying on human evaluation, it scores model responses to a set of challenging prompts designed to expose weaknesses in the target model’s reasoning and response capabilities. By confronting the evaluated LLM with these demanding inputs, Arena-Hard-Auto rigorously assesses performance in complex and unpredictable scenarios. This approach facilitates a more comprehensive understanding of a model’s robustness, identifying weaknesses that traditional benchmarks might miss and providing valuable insights for improvement before real-world deployment. Its use of an LLM judge over diverse, difficult prompts offers a scalable and automated path to continuous evaluation and refinement of LLM safety and reliability.
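The general LLM-as-judge pattern underlying such systems can be sketched as follows; the prompt template and the `call_llm` helper are illustrative stand-ins, not Arena-Hard-Auto’s actual implementation.

```python
# Hedged sketch of pairwise LLM-as-judge evaluation. `call_llm` is a
# hypothetical function that sends text to a judge model and returns its
# completion as a string.
JUDGE_TEMPLATE = """You are an impartial judge. Given a prompt and two responses,
decide which response is better and explain briefly.

Prompt: {prompt}
Response A: {response_a}
Response B: {response_b}

Answer with "A", "B", or "tie"."""

def judge_pair(prompt, response_a, response_b, call_llm):
    # Fill the template, query the judge model, and return its verdict.
    verdict = call_llm(JUDGE_TEMPLATE.format(
        prompt=prompt, response_a=response_a, response_b=response_b))
    return verdict.strip()
```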
By actively identifying and addressing potential vulnerabilities before large language models are deployed, developers can significantly enhance their overall robustness and reliability. This preventative strategy, utilizing frameworks like AutoDetect and garak, moves beyond reactive patching to proactive risk mitigation. Rigorous robustness testing, employing techniques such as adversarial prompt generation with Arena-Hard-Auto, demonstrates the stability of these improvements: analysis of Xmodel and Xbench over 100 random subsamples reveals consistently low standard deviations of 0.012 and 0.011, respectively, affirming the dependable nature of these estimates and fostering increased confidence in model performance across diverse and challenging scenarios.
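The subsampling check reported above follows a standard recipe, sketched here with a hypothetical `compute_gap_score` routine standing in for the CG scoring step: repeatedly draw random subsets of the evaluation data, recompute the score, and report the spread. The subsample fraction is an assumption.

```python
# Hedged sketch of score stability under repeated random subsampling, as in the
# 100-subsample analysis described above.
import random
import statistics

def subsample_stability(examples, compute_gap_score, n_runs=100, frac=0.8, seed=0):
    rng = random.Random(seed)
    k = int(len(examples) * frac)                   # subsample size is an assumption
    scores = [compute_gap_score(rng.sample(examples, k)) for _ in range(n_runs)]
    return statistics.stdev(scores)                 # low values indicate stable estimates
```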

The pursuit of robust large language models necessitates a deep understanding of their underlying competencies, a principle echoed in John von Neumann’s assertion: “There is no exquisite beauty…without some strangeness.” This paper’s introduction of Competency Gaps (CG) provides a crucial methodology for revealing precisely that ‘strangeness’: the often-hidden limitations in both model architecture and benchmark datasets. By employing sparse autoencoders, the research doesn’t merely assess performance metrics; it dissects how models represent concepts, exposing gaps where seemingly proficient systems falter. This holistic approach, focusing on the system’s behavior rather than isolated results, aligns with the notion that true elegance arises from understanding the interconnectedness of a system’s components, much as in a living organism, where optimizing one element inevitably impacts the whole.
Where Do We Go From Here?
The identification of Competency Gaps, while a useful diagnostic, merely highlights the fundamental challenge: evaluation perpetually lags understanding. The current paradigm – benchmarking performance on pre-defined tasks – treats models as black boxes, assessing what they can do, not how or why. Sparse autoencoders offer a glimpse beneath the surface, revealing the structural basis of competence, but this is akin to mapping a single organ in a complex organism. The true architecture remains largely obscured.
Future work must move beyond task-specific metrics and towards a more holistic understanding of internal representation. This requires developing methods to not only detect gaps but to characterize them – to determine whether a failure stems from a lack of data, a flawed algorithm, or an inherent limitation of the model’s structure. The focus should shift from maximizing scores on contrived datasets to building models that exhibit genuine conceptual understanding and adaptability.
Ultimately, the pursuit of artificial intelligence is a study in systems design. Good architecture is invisible until it breaks, and only then is the true cost of decisions visible.
Original article: https://arxiv.org/pdf/2512.20638.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/