When Conversations Get Confused: A New Test for Chatbot Clarity

Author: Denis Avetisyan


Researchers have created a benchmark and framework to help conversational AI better navigate ambiguity and ask clarifying questions during extended dialogues.

Effective communication between users and large language models hinges on clarifying ambiguous or contradictory input, as demonstrated by the ability of follow-up questioning to resolve initial uncertainties and ensure alignment with user intent.

This paper introduces ClarifyMT-Bench, a multi-turn dialogue benchmark, and ClarifyAgent, an agentic framework to improve clarification abilities in large language models.

Despite advances in conversational AI, large language models often struggle with ambiguity in extended dialogues, prematurely answering incomplete or unclear user requests. To address this, we introduce ClarifyMT-Bench, a benchmark for multi-turn clarification in conversational large language models, together with ClarifyAgent, an agentic framework designed to evaluate and enhance multi-turn clarification abilities. Our analysis of ten representative LLMs using this benchmark reveals a consistent under-clarification bias, while the proposed ClarifyAgent significantly improves robustness across diverse ambiguity conditions. This work establishes a foundation for understanding when LLMs should seek clarification and how they can navigate real-world conversational challenges, ultimately paving the way for more effective and reliable human-LLM interactions.


The Persistent Challenge of Ambiguous Intent

Despite remarkable progress in natural language processing and the development of increasingly sophisticated language models, modern dialogue systems consistently encounter difficulties when processing ambiguous user inputs. This isn’t a failure of grammatical understanding, but rather a consequence of the inherent complexity of human communication; users rarely express requests with perfect clarity, often relying on context, implication, and shared knowledge. Consequently, systems designed to interpret such inputs can easily misinterpret user intent, leading to irrelevant responses or requests for unnecessary clarification. The problem is amplified by the fact that ambiguity isn’t simply about word choice; it frequently arises from underspecified goals, vague references, or the multiple possible interpretations of a single phrase, presenting a significant hurdle for even the most advanced conversational AI.

User ambiguity in dialogue isn’t simply a matter of garbled speech; it arises from a complex interplay of factors. Linguistic nuance, such as sarcasm or metaphor, can easily mislead a system relying on literal interpretation. Equally challenging is discerning unclear intent: a user might ask “What’s the weather?” without specifying a location, or state a need without explicitly requesting action. Furthermore, context plays a vital role; the same phrase can have drastically different meanings depending on the preceding conversation or the user’s known preferences. Consequently, effective dialogue systems require robust clarification strategies that move beyond simple error detection and actively probe for missing information or alternative interpretations, adapting their questioning based on the specific source of ambiguity to ensure a coherent and satisfying exchange.

Dialogue systems, while increasingly sophisticated, frequently falter due to a lack of comprehensive ambiguity resolution. Current methods often treat all unclear inputs similarly, failing to distinguish between linguistic ambiguity, where a phrase has multiple grammatical interpretations, and ambiguity of intent, where the user’s goal remains unclear. This unsystematic approach leads to misinterpretations, as the system may request clarification on irrelevant details or fail to recognize the core issue driving the user’s uncertainty. Consequently, users encounter frustrating interactions, repeatedly rephrasing requests or abandoning the conversation altogether, highlighting the need for dialogue systems capable of discerning and addressing the nuanced sources of ambiguity in natural language.

Average dialogue length varies significantly based on both the type of ambiguity present and the characteristics of the user engaging in the conversation.

Introducing ClarifyMT-Bench: A Framework for Evaluating Clarity

ClarifyMT-Bench is a newly developed benchmark designed to evaluate large language models (LLMs) in multi-turn dialogue scenarios. Its primary function is to assess an LLM’s ability to strategically determine whether to request clarification from a user or to attempt a direct answer, reflecting a crucial aspect of effective conversational AI. The benchmark moves beyond simple question-answering by requiring models to navigate ambiguity and dynamically adjust their approach based on user input, simulating a more realistic conversational exchange. Evaluation focuses on the appropriateness of the chosen action – whether a clarifying question was necessary and well-formed, or if a direct answer was sufficient and accurate – rather than solely on the final answer’s correctness.
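To make this evaluation target concrete, the minimal sketch below scores each turn by whether the model’s choice to ask or answer matches a gold label for that turn. The field names and labels are illustrative assumptions, not the benchmark’s released schema.

```python
from dataclasses import dataclass

# Hypothetical action labels for illustration.
ASK, ANSWER = "ask", "answer"

@dataclass
class Turn:
    needs_clarification: bool   # gold label: is the request still ambiguous at this turn?
    model_action: str           # what the model actually did: ASK or ANSWER

def decision_accuracy(turns: list[Turn]) -> float:
    """Fraction of turns where the ask/answer decision matches the gold label."""
    if not turns:
        return 0.0
    correct = sum((t.model_action == ASK) == t.needs_clarification for t in turns)
    return correct / len(turns)
```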

ClarifyMT-Bench utilizes the Five-Dimensional Ambiguity Taxonomy – encompassing Referential, Lexical, Intentional, Pragmatic, and World Knowledge ambiguity – to construct a diverse range of dialogue scenarios. Each dimension represents a distinct source of uncertainty in user requests, and scenarios are generated with varying levels of ambiguity within each dimension. This granular approach allows for targeted evaluation of an LLM’s performance not just on whether it handles ambiguous input, but how it responds to specific types and severities of ambiguity, enabling a more nuanced understanding of its clarification and response strategies.
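As a rough illustration of how such scenarios might be represented, the sketch below encodes the five dimensions as an enumeration attached to each generated scenario; the field names and severity scale are assumptions for illustration rather than the benchmark’s actual format.

```python
from dataclasses import dataclass
from enum import Enum

class AmbiguityType(Enum):
    REFERENTIAL = "referential"          # unclear which entity "it" or "that one" points to
    LEXICAL = "lexical"                  # a word with several plausible senses
    INTENTIONAL = "intentional"          # the user's goal itself is underspecified
    PRAGMATIC = "pragmatic"              # meaning depends on unstated context
    WORLD_KNOWLEDGE = "world_knowledge"  # resolution requires outside facts

@dataclass
class Scenario:
    dialogue: list[str]        # the multi-turn context shown to the model
    ambiguity: AmbiguityType   # which dimension this scenario stresses
    severity: int              # e.g. 1 (mild) to 3 (severe); an assumed scale
    gold_action: str           # "ask" or "answer"
```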

ClarifyMT-Bench employs User Persona Simulation to introduce variability in user responses during multi-turn dialogues, thereby providing a more realistic and rigorous evaluation of LLM behavior. This simulation models distinct user characteristics, specifically focusing on levels of precision and vagueness in their input. Precision is defined by the explicitness and detail provided in user requests, while vagueness represents the use of ambiguous language or incomplete information. By systematically varying these characteristics across simulated users, ClarifyMT-Bench challenges LLMs to appropriately discern ambiguous queries and dynamically decide between seeking clarification and attempting a direct response, moving beyond evaluations based on single-turn, uniformly-expressed prompts.
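A minimal sketch of how such a simulated user could be parameterized, assuming a single vagueness knob rather than the paper’s exact persona definitions:

```python
import random
from dataclasses import dataclass

@dataclass
class UserPersona:
    vagueness: float  # 0.0 = fully explicit, 1.0 = maximally vague (assumed scale)

    def respond(self, clarifying_question: str, full_detail: str, vague_detail: str) -> str:
        """Answer the agent's clarifying question with more or less detail,
        depending on how vague this simulated user is."""
        return vague_detail if random.random() < self.vagueness else full_detail

# A precise user almost always gives the explicit answer;
# a vague user frequently forces another round of clarification.
precise_user = UserPersona(vagueness=0.1)
vague_user = UserPersona(vagueness=0.8)
```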

LLM-as-a-Judge demonstrates varying abilities to resolve ambiguity depending on the subtype of question asked.

Revealing the Under-Clarification Bias in Large Language Models

Experiments conducted using the ClarifyMT-Bench consistently reveal an ‘Under-Clarification Bias’ within Large Language Models (LLMs). This bias manifests as a demonstrable tendency for these models to provide responses to prompts even when the information provided is ambiguous and requires further clarification before a confident answer can be generated. The observed behavior indicates LLMs often prioritize completing the request over acknowledging and resolving inherent uncertainties within the input, leading to potentially inaccurate or incomplete outputs despite the availability of clarification opportunities.

The observed under-clarification bias in large language models extends beyond simple inaccuracies in responses. Analysis indicates the core issue is a limited assessment of potential ambiguities within a given prompt or question. Standard LLMs frequently proceed to answer without first identifying and actively seeking resolution of these ambiguities. This represents a failure to utilize available opportunities for clarification, which would allow the model to request further information or narrow the scope of the query before generating a response. Consequently, the model operates with incomplete understanding, leading to potentially incorrect or misleading outputs despite appearing confident in its answer.

Performance evaluations using ClarifyMT-Bench demonstrate a significant disparity in accuracy between standard Large Language Models (LLMs) and the ClarifyAgent. Standard LLMs achieved an overall accuracy of 73% when confronted with ambiguous queries. In contrast, the ClarifyAgent, designed to actively seek clarification, attained an accuracy of 88.4%. This represents an absolute improvement of 15.4 percentage points, indicating a substantial reduction in responses given without sufficient information to ensure correctness.

Human evaluations correlate strongly with those from a large language model acting as a judge when assessing the quality of clarifying questions.

ClarifyAgent: A Structured Approach to Dialogue Clarity

ClarifyAgent is an agent-based framework designed to manage multi-turn clarification dialogues by structuring the reasoning process. Building upon the ReAct framework, ClarifyAgent moves beyond simple reactive behavior to explicitly model the steps involved in understanding and resolving ambiguous user requests. This structured approach enables the agent to systematically identify information gaps, formulate clarifying questions, and integrate responses to refine its understanding. The framework decomposes the interaction into discrete reasoning steps, allowing for improved transparency and control over the dialogue flow, ultimately enhancing the agent’s ability to accurately address user needs through iterative clarification.
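In spirit, the interaction resembles a ReAct-style loop in which the agent reasons about remaining ambiguity and then either asks a question or commits to an answer. The sketch below is an assumed, minimal rendering of that loop, not the authors’ implementation; `llm` and `user` stand in for any prompt-to-text callables.

```python
def clarify_loop(dialogue: str, llm, user, max_turns: int = 5) -> str:
    """ReAct-flavoured loop: reason about remaining ambiguity, then act.
    `llm` and `user` are callables mapping a prompt string to a text reply;
    all prompts and names here are illustrative."""
    for _ in range(max_turns):
        thought = llm(
            f"Dialogue so far:\n{dialogue}\n"
            "List any missing or ambiguous information, or say NONE."
        )
        if "none" in thought.lower():
            break  # nothing left to clarify: answer directly
        question = llm(f"{dialogue}\nAsk one question that resolves: {thought}")
        dialogue += f"\nAgent: {question}\nUser: {user(question)}"
    return llm(f"{dialogue}\nGive the final answer.")
```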

ClarifyAgent employs Intent Inference to determine the user’s underlying goal within a conversation, allowing the agent to proactively seek necessary information. This is coupled with a Finite-State Slot Tracker, which systematically manages ambiguous or missing information required to fulfill the user’s intent. The Slot Tracker operates by defining a set of possible states for each ambiguous slot, transitioning between these states as the conversation progresses and clarification questions are answered. This structured approach ensures that the agent accurately identifies and resolves all uncertainties before attempting to provide a final answer, maintaining a consistent and logical conversational flow.
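One way to picture the slot tracker is as a small state machine per required piece of information; the following sketch uses assumed state names and is not the paper’s code.

```python
from enum import Enum

class SlotState(Enum):
    MISSING = "missing"      # never mentioned by the user
    AMBIGUOUS = "ambiguous"  # mentioned, but with multiple plausible readings
    FILLED = "filled"        # resolved to a single value

class SlotTracker:
    """Tracks, per slot, whether the agent still needs to ask about it."""
    def __init__(self, required_slots):
        self.states = {slot: SlotState.MISSING for slot in required_slots}
        self.values = {}

    def update(self, slot, value=None, ambiguous=False):
        """Record new evidence for a slot from the latest user turn."""
        if ambiguous:
            self.states[slot] = SlotState.AMBIGUOUS
        elif value is not None:
            self.states[slot] = SlotState.FILLED
            self.values[slot] = value

    def next_to_clarify(self):
        """Return the first unresolved slot, or None if it is safe to answer."""
        for slot, state in self.states.items():
            if state is not SlotState.FILLED:
                return slot
        return None
```

In this picture, the agent keeps asking until `next_to_clarify()` returns `None`, at which point it commits to a direct answer.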

ClarifyAgent achieves an 88.4% accuracy rate in task completion by explicitly modeling the reasoning process involved in multi-turn clarification. This structured approach demonstrably improves the quality of ask-answer decisions, resulting in an absolute gain of 15.4 percentage points over baseline large language models (LLMs) in comparative evaluations. The framework’s ability to systematically manage ambiguous information and user intent contributes to this performance increase, offering a statistically significant advantage in complex task resolution.

ClarifyAgent employs a pipeline consisting of observation encoding, planning with a language model, and action execution to enable interactive task completion.

The pursuit of robust conversational AI, as demonstrated by ClarifyMT-Bench and ClarifyAgent, necessitates a careful consideration of systemic behavior. The framework’s emphasis on balancing question-asking and answering highlights the importance of understanding the complete conversational loop: a single component cannot be optimized in isolation. This echoes Edsger W. Dijkstra’s assertion that “It is not enough to have good code; you must also have good architecture.” ClarifyAgent’s agentic approach, by modeling user personas and navigating ambiguity, implicitly acknowledges that simplicity in dialogue management scales far better than attempts at overly clever, context-dependent responses. Good architecture, in this case, is the ability to consistently clarify ambiguity; it remains invisible until a conversational breakdown occurs.

Where Do We Go From Here?

The introduction of ClarifyMT-Bench feels less like a destination and more like a careful mapping of the swamp. The benchmark highlights, with admirable precision, that current large language models still struggle to navigate the messy reality of multi-turn dialogue – a failing not of intelligence, perhaps, but of architecture. If the system survives on duct tape and heuristics to manage ambiguity, it’s probably overengineered. The taxonomy of clarification types is a useful diagnostic, but the true challenge lies in unifying response and inquiry, a balancing act rarely observed in naturally occurring conversation.

ClarifyAgent offers a step toward that unification, framing clarification as an agentic process. However, modularity without context is an illusion of control. Simply asking better questions is insufficient; the system must demonstrate genuine understanding of the user’s evolving informational state, inferring persona not as a static label but as a dynamic negotiation. The benchmark implicitly demands a shift from ‘response generation’ to ‘conversational steering’.

Future work will inevitably focus on scaling these agentic frameworks. But the more pressing question isn’t how to make these models bigger, but how to imbue them with a more fundamental appreciation for the structural properties of communication itself. A system that merely mimics dialogue will always be brittle. The goal, ultimately, isn’t to build a conversational machine, but to model the elegance of a living conversation.


Original article: https://arxiv.org/pdf/2512.21120.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/
