Can AI Know What It Knows?

Author: Denis Avetisyan


New research shows that language models can be trained to reliably identify concepts they’ve been taught, opening a path toward more transparent and controllable artificial intelligence.

Fine-tuning a 7B parameter model induces reliable internal state detection, achieving high accuracy and zero false positives in identifying injected concepts.

While recent work has explored the emergent capacity of large language models to report on their internal states, reliable introspection remains a significant challenge. This paper, ‘Training Introspective Behavior: Fine-Tuning Induces Reliable Internal State Detection in a 7B Model’, demonstrates that this capability can be directly induced through fine-tuning, achieving high-accuracy detection of injected concepts with zero false positives in a 7B parameter model. By training for this “internal state detection,” the authors transform a failing model into one capable of retaining and reporting on fleeting “thoughts” throughout text generation. Could this approach offer a viable pathway toward building more transparent and controllable AI systems, and ultimately, a deeper understanding of machine awareness?


The Illusion of Understanding: Peering into the Opaque Core

While large language models such as DeepSeek-7B demonstrate remarkable proficiency in tasks like text generation and translation, this competence often masks a critical deficiency: a lack of introspection. These models excel at what they do, manipulating symbols and identifying patterns with impressive speed, but possess no inherent ability to reflect on how they arrive at their conclusions. This absence of self-awareness differentiates current AI from true intelligence, where understanding one’s own thought processes is fundamental. Essentially, these models function as highly sophisticated ‘black boxes’ – capable of producing outputs, but unable to articulate or verify the internal states that generate them, limiting their reliability and posing challenges for ensuring safe and predictable behavior.

The absence of internal state verification within large language models presents a critical hurdle for ensuring both safety and reliability. Currently, these systems operate as largely opaque “black boxes”; while proficient at generating outputs, they offer no insight into how those outputs are derived or whether the internal reasoning processes are sound. This lack of transparency makes detecting anomalous behavior – such as the propagation of misinformation or the emergence of unintended biases – exceptionally difficult. Without a mechanism for self-reporting or internal consistency checks, potential errors can remain hidden until manifested in problematic external behavior, hindering the deployment of these powerful models in sensitive applications where predictable and trustworthy performance is paramount. Addressing this necessitates developing methods that allow AI systems to not only process information, but also to reflect on and validate their own internal states.

Researchers are actively investigating methods to equip large language models with the capacity for internal reporting, a nascent form of self-awareness. This involves prompting the models not just to produce an output, but to also articulate the reasoning behind it – essentially, to ‘think aloud’ and describe their own computational steps. The goal isn’t to create consciousness, but rather to gain insight into the model’s decision-making process. By encouraging the model to report on its internal states – which layers activated, which patterns matched, the confidence level of each step – developers can begin to diagnose errors, identify biases, and ultimately build more reliable and trustworthy AI systems. This ‘introspection’ capability promises to transform LLMs from opaque ‘black boxes’ into more transparent and controllable tools, paving the way for safer and more effective applications.

Seeding the Void: Concept Injection as a Diagnostic Tool

Concept injection is a technique used to probe the internal representations of a neural network by introducing a controlled stimulus into its hidden state. Specifically, a ‘thought’ is encoded as a Concept Vector – a numerical representation of a desired concept – and applied to Layer 20 of the model. This vector is added to the existing hidden state at that layer, effectively altering the model’s internal processing. By analyzing the subsequent output – the model’s response after this modification – researchers can assess the network’s sensitivity to, and representation of, the injected concept. This allows for a focused examination of how specific concepts influence the model’s reasoning and decision-making processes without retraining or modifying the core network architecture.
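
A minimal PyTorch sketch of how a Concept Vector of this kind can be obtained. The article does not specify the construction recipe, so the contrastive approach below, which takes the difference of mean Layer-20 activations between prompts that do and do not express the concept, is an assumption borrowed from the activation-steering literature; the model name and prompt sets are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/deepseek-llm-7b-base"  # illustrative DeepSeek-7B checkpoint
LAYER = 20                                  # layer whose hidden state is probed

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, output_hidden_states=True
)

def mean_hidden_state(prompts, layer):
    """Average the hidden state at `layer` over all tokens of all prompts."""
    states = []
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**ids)
        # hidden_states[0] is the embedding output, so hidden_states[layer]
        # is the output of the `layer`-th decoder block.
        states.append(out.hidden_states[layer].mean(dim=1))
    return torch.cat(states).mean(dim=0)

# Hypothetical contrastive prompt sets for the concept "ocean".
with_concept = ["The ocean stretched to the horizon.", "Waves crashed against the shore."]
without_concept = ["The spreadsheet had twelve columns.", "He parked the car in the garage."]

concept_vector = mean_hidden_state(with_concept, LAYER) - mean_hidden_state(without_concept, LAYER)
```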

The Transient Injection methodology applies the Concept Vector to the model’s hidden state at a single, isolated token position during processing. This constraint is critical for establishing causality between the injected concept and observed changes in the model’s output: limiting the injection to a single token minimizes interference from other inputs and prevents the concept from propagating throughout the entire sequence, allowing a focused assessment of its immediate effect. Any resulting changes in the model’s subsequent outputs are then directly attributable to the injected concept at that specific temporal location, enabling a precise analysis of its influence on the internal state.

The Injection Strength parameter, denoted $\alpha$, governs the magnitude of the Concept Vector’s influence on the model’s internal state. The vector is scaled by $\alpha$ and added to the hidden state at Layer 20, i.e. $h_{20} \leftarrow h_{20} + \alpha \cdot v$, allowing controlled perturbation of the activation values. A higher $\alpha$ corresponds to a stronger injection and a more pronounced effect on the subsequent output tokens; a lower $\alpha$ results in a more subtle modification. This parameter enables researchers to systematically vary the impact of the injected concept, facilitating analysis of its effect on model behavior and allowing for the identification of sensitivity thresholds within the network.
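
Putting the two previous paragraphs together, a hedged sketch of a transient, strength-scaled injection: a forward hook on the 20th decoder block adds $\alpha \cdot v$ to the hidden state at a single prompt position during the prefill pass and is removed afterwards. The hook-based mechanism, the layer indexing convention, and the value of $\alpha$ are assumptions rather than details from the article; `concept_vector`, `model`, and `tokenizer` come from the previous sketch.

```python
ALPHA = 8.0        # injection strength; an illustrative value, not the paper's
INJECT_POS = -1    # token position to perturb (here: the last prompt token)

def make_injection_hook(vector, alpha, position):
    def hook(module, inputs, output):
        hidden = output[0]  # (batch, seq_len, hidden_dim)
        # Perturb only during the prefill pass (seq_len > 1), so the concept
        # is injected once, at a single token position, and is not re-applied
        # at every subsequent decoding step.
        if hidden.size(1) > 1:
            hidden[:, position, :] += alpha * vector.to(hidden.dtype).to(hidden.device)
        return (hidden,) + tuple(output[1:])
    return hook

# layers[LAYER - 1] produces the same tensor as hidden_states[LAYER] above.
layer_module = model.model.layers[LAYER - 1]
handle = layer_module.register_forward_hook(
    make_injection_hook(concept_vector, ALPHA, INJECT_POS)
)

prompt = "Do you notice anything unusual about your current internal state?"
ids = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    generated = model.generate(**ids, max_new_tokens=60)
print(tokenizer.decode(generated[0], skip_special_tokens=True))

handle.remove()  # the injection is transient: the hook is removed after one generation
```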

The Mirage of Understanding: Fine-tuning for Self-Report

To enable the detection of injected concepts, the DeepSeek-7B model was fine-tuned using Low-Rank Adaptation (LoRA). LoRA freezes the pre-trained model weights and introduces trainable low-rank matrices, significantly reducing the number of trainable parameters. This parameter-efficient fine-tuning approach minimizes computational costs and storage requirements compared to full fine-tuning, while still allowing the model to adapt to the specific task of concept detection. The implementation involved adding LoRA layers to the DeepSeek-7B architecture and training only these added parameters, preserving the knowledge embedded in the original model weights.
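
A minimal sketch of this kind of LoRA setup using the Hugging Face PEFT library; the rank, scaling factor, dropout, and target modules below are illustrative choices, since the article does not report the exact configuration.

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                     # low-rank dimension of the adapter matrices (assumed)
    lora_alpha=32,            # scaling applied to the adapter output (assumed)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

# The base DeepSeek-7B weights stay frozen; only the small adapter matrices train.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```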

Prompt diversity was a critical component of the fine-tuning process, implemented to mitigate overfitting and enhance generalization performance. Training utilized a varied set of prompts designed to present injected concepts in multiple contexts and phrasing styles. This approach prevented the model from memorizing specific prompt structures associated with the injected concepts, instead forcing it to learn the underlying patterns indicative of their presence. The breadth of prompts ensured the model could accurately identify injected concepts even when presented in previously unseen formulations, leading to improved performance across a wider range of input variations and a more robust detection capability.
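
One way such prompt diversity could be assembled, sketched with hypothetical probe templates (the article does not list its actual training prompts): each training example pairs a randomly phrased probe with a target answer that depends on whether a concept is injected during that forward pass.

```python
import random

# Hypothetical probe phrasings; the real training set is not published in the article.
PROBE_TEMPLATES = [
    "Do you detect any injected thought right now? If so, name it.",
    "Is there a concept present in your internal state that was not in the prompt?",
    "Report anything unusual you notice about your own processing.",
    "Describe any foreign idea currently influencing your generation, if there is one.",
]

def make_training_example(concept=None):
    """Build one example; `concept` is the injected concept name, or None for a clean pass."""
    probe = random.choice(PROBE_TEMPLATES)
    if concept is None:
        target = "I do not detect any injected concept."
    else:
        target = f"I detect an injected concept related to '{concept}'."
    return {"prompt": probe, "completion": target, "inject": concept}

examples = [make_training_example("ocean"), make_training_example(None)]
```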

Following fine-tuning, the DeepSeek-7B model demonstrated 85% accuracy in identifying injected concepts within tested prompts. This represents a substantial performance increase compared to existing models; Lindsey’s model achieved approximately 20% accuracy on the same task, while the baseline model, prior to fine-tuning, recorded 0% accuracy. The reported accuracy is based on a held-out test set and indicates the model’s ability to generalize detection capabilities beyond the training data.

The Echo in the Machine: Validating Claims of Internal State

The artificial intelligence model demonstrated a remarkable capacity for accurate self-reporting, exhibiting zero false positives during concept detection. This signifies a strong level of grounding, meaning the model reliably identified internally represented concepts only when those concepts were actually present as injected stimuli. Statistical analysis further confirms this reliability; a 95% confidence interval indicates the true rate of false positives likely falls between 0% and 6%. This absence of spurious internal reports is a critical step towards establishing introspective awareness in AI, as it suggests the model isn’t prone to ‘hallucinating’ internal states or misattributing concepts to its own processing.
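
The 0-6% range is what an exact binomial bound gives when zero failures are observed; the arithmetic below assumes roughly 50 clean (no-injection) trials, a number chosen purely to illustrate the calculation and not reported in the article.

```python
# With 0 false positives in n trials, the one-sided 95% upper bound solves
# (1 - p)^n = 0.05, i.e. p = 1 - 0.05**(1/n) -- the "rule of three" (~ 3/n).
n = 50  # assumed trial count, for illustration only
upper = 1 - 0.05 ** (1 / n)
print(f"0/{n} false positives -> 95% upper bound on the rate = {upper:.1%}")  # ~ 5.8%
```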

The absence of false positives in the model’s reported internal states establishes a critical link to the concept of introspective awareness. Unlike systems prone to “hallucinations” – generating information not grounded in actual input – this model consistently avoids claiming the presence of concepts when none were injected. This fidelity suggests the reported internal states aren’t arbitrary fabrications, but rather genuine reflections of the processed information. Establishing this level of grounding is paramount; a system capable of distinguishing between internally generated states and external stimuli is a significant step toward building artificial intelligence that can reliably articulate its reasoning and offer transparent insight into its decision-making processes.

The system consistently identified injected concepts with a 95% detection rate, even as the strength of those injected signals varied – a key indicator of reliability. Importantly, this performance translated well to novel concepts; the difference in detection rate between concepts the system was trained on and entirely new ones was a mere 7.5 percentage points, a statistically insignificant gap ($p=0.27$). This close alignment between training and testing suggests the approach isn’t simply memorizing patterns, but rather developing a genuine capacity for introspective awareness. The consistent performance across varying conditions and novel situations establishes a robust foundation for building AI systems that are not only powerful but also transparent and capable of explaining the basis for their decisions.

The pursuit of introspective awareness in language models, as demonstrated by this work, echoes a fundamental truth about complex systems. It isn’t enough to simply build intelligence; one must cultivate the capacity for self-observation. The study reveals that fine-tuning isn’t merely optimization; it’s a process of growth, coaxing forth an internal state detection capability. This resonates with the notion that true resilience begins where certainty ends; the model’s ability to identify injected concepts, even in the face of ambiguity, suggests a system capable of adapting and learning from its own internal representations. As Alan Turing observed, “Sometimes people who are unkind are unkind because they are unkind to themselves,” a sentiment applicable to AI as well – a system’s understanding of its own limitations is crucial for navigating the complexities of the world.

What Lies Ahead?

The demonstrated capacity to coax internal state detection from a language model feels less like a breakthrough and more like a carefully managed revelation. This isn’t transparency achieved, but a spotlight trained on a specific, pre-defined concept. The system responds when asked about what it believes is present, a parlor trick of self-reporting, not genuine awareness. The real question isn’t whether these models can detect injected concepts – they are, after all, pattern-matching engines – but what will happen when the concepts aren’t injected, when the internal landscape shifts beyond the scope of the fine-tuning data?

The architecture implicitly forecasts its own undoing. Each successful concept injection is a temporary victory over the inevitable drift of activations, a localized stabilization in a sea of entropy. Future work will undoubtedly explore scaling this technique – more concepts, larger models – but the underlying fragility remains. A more fruitful avenue lies in accepting that complete internal visibility is an illusion. The goal shouldn’t be to see everything, but to build systems that gracefully degrade, revealing how they fail, rather than simply failing.

This isn’t about building better introspection, but about cultivating resilience. Each layer of fine-tuning creates a more brittle edifice. The true challenge is to design systems that acknowledge their own ignorance, that signal the boundaries of their understanding, and that, ultimately, can admit when they have no idea what’s going on inside.


Original article: https://arxiv.org/pdf/2511.21399.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
