Can AI Grade Math Thinking?

Author: Denis Avetisyan


A new study assesses the ability of artificial intelligence tools to accurately categorize the cognitive complexity of mathematical problems.

Research reveals that current AI tools achieve only around 62% accuracy in classifying cognitive demand, struggle most at the lowest and highest levels of demand, and exhibit noticeable biases.

Despite increasing demands on teachers to personalize mathematics instruction while upholding cognitive rigor, the potential for artificial intelligence to assist in task analysis remains largely unexplored. This study, ‘Baseline Performance of AI Tools in Classifying Cognitive Demand of Mathematical Tasks’, evaluated the ability of eleven AI tools, both general-purpose and education-specific, to categorize mathematical tasks according to a research-based framework of cognitive demand. Results revealed an average accuracy of only 62%, with tools exhibiting systematic biases toward mid-level cognitive tasks and prioritizing surface features over underlying cognitive processes. Given these limitations, can AI tools be effectively integrated into teacher workflows as decision support rather than autonomous task analyzers, and what advancements in prompt engineering or tool development are needed to realize this potential?


Understanding Cognitive Demand in Mathematics

Effective educational practices and truly personalized learning hinge on a precise understanding of how mentally challenging a given mathematical task is. Simply presenting problems isn’t enough; educators must discern whether an activity primarily requires rote memorization, the application of established procedures, or genuine mathematical reasoning. A task’s cognitive demand dictates the types of thinking skills students employ, and consequently, what kind of support or challenge is most appropriate. Misaligned cognitive demand (tasks that are too easy or too difficult) can stifle engagement, hinder conceptual understanding, and ultimately impede a student’s progress. A robust framework for categorizing this demand is therefore essential to ensure that instruction is tailored to individual needs and fosters meaningful mathematical growth, moving beyond superficial skill acquisition to deep, connected understanding.

The ‘Task Analysis Guide’ offers educators a structured lens through which to evaluate the intellectual requirements of different mathematical activities, categorizing them into four distinct levels of cognitive demand. At its base lies Memorization, encompassing simple recall of facts and definitions. Ascending from this is Procedures Without Connections, where students execute algorithms without necessarily understanding the underlying principles. A further step involves Procedures With Connections, indicating an ability to apply procedures meaningfully and link them to related concepts. Finally, the highest level, Doing Mathematics, signifies a capacity for independent problem-solving, formulating strategies, generalizing from specific cases, and justifying solutions – representing a deep and flexible understanding of mathematical reasoning.

The categorization of mathematical tasks isn’t arbitrary; rather, it reflects a deliberate progression of cognitive skills. At the base lies memorization, demanding simple recall of facts, while the next level, procedures without connections, requires applying rules without necessarily understanding the underlying principles. Crucially, tasks categorized as procedures with connections build upon this foundation by explicitly linking steps to conceptual understanding, fostering a deeper grasp of the ‘why’ behind the ‘how’. The highest level, doing mathematics, demands full conceptual understanding and the ability to formulate, represent, and solve problems independently – a culmination of all prior skills. This hierarchical structure ensures that learning builds progressively, guiding students from basic recall toward sophisticated mathematical reasoning and problem-solving abilities.
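To make the framework concrete, the sketch below encodes the four levels as a simple Python enumeration, paired with hypothetical example tasks of my own devising; the level names follow the Task Analysis Guide, but the example items are illustrative and do not come from the study.

```python
from enum import Enum

class CognitiveDemand(Enum):
    """The four levels of the Task Analysis Guide, from lowest to highest demand."""
    MEMORIZATION = 1                    # recall of facts, rules, or definitions
    PROCEDURES_WITHOUT_CONNECTIONS = 2  # algorithmic steps with no link to meaning
    PROCEDURES_WITH_CONNECTIONS = 3     # procedures tied to underlying concepts
    DOING_MATHEMATICS = 4               # non-algorithmic reasoning, strategy, justification

# Hypothetical example tasks (not items from the study), labeled by level:
EXAMPLES = {
    "State the formula for the area of a circle.": CognitiveDemand.MEMORIZATION,
    "Compute 3/8 + 1/8 using the standard algorithm.": CognitiveDemand.PROCEDURES_WITHOUT_CONNECTIONS,
    "Use an area model to show why 1/2 * 1/3 = 1/6.": CognitiveDemand.PROCEDURES_WITH_CONNECTIONS,
    "Find every rectangle with whole-number sides and perimeter 20, and justify that your list is complete.": CognitiveDemand.DOING_MATHEMATICS,
}
```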

AI and the Initial Assessment of Task Classification

The assessment included a diverse range of artificial intelligence (AI) tools to evaluate their task classification capabilities. It encompassed both broadly applicable large language models, such as ChatGPT, Claude, and DeepSeek, and tools specifically designed for educational applications, including Brisk, Khanmigo, and School.AI. This deliberate inclusion of both general and specialized AI tools aimed to provide a comprehensive initial benchmark, highlighting the performance differences and potential strengths of each approach when applied to cognitive assessment tasks. The variety in model architecture and training data represented by these tools allowed for a broader understanding of current AI capabilities in this domain.

The evaluation process involved submitting a series of mathematical tasks to multiple AI tools and requesting classification based on the criteria defined within the ‘Task Analysis Guide’. This guide provides a standardized framework for assessing the cognitive demands of each task, encompassing factors such as required procedures, conceptual understanding, and problem-solving strategies. By utilizing this guide as a benchmark, researchers aimed to quantify the AI tools’ ability to discern the complexity and cognitive requirements inherent in different mathematical problems, establishing a baseline for comparison against human expert performance in cognitive assessment.
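A minimal sketch of what such an evaluation loop might look like is given below. The prompt wording, the `query_tool` callable, and the parsing logic are assumptions for illustration only; the study’s actual prompts and tool interfaces are not reproduced here.

```python
# Illustrative sketch of the classification protocol; not the study's code.
# `query_tool` is a placeholder for whatever interface a given AI tool exposes
# (chat window, API call, etc.) and simply maps a prompt string to a reply string.

LEVELS = [
    "Memorization",
    "Procedures Without Connections",
    "Procedures With Connections",
    "Doing Mathematics",
]

PROMPT_TEMPLATE = (
    "Using the Task Analysis Guide, classify the following mathematical task "
    "into exactly one of these levels of cognitive demand: {levels}.\n\n"
    "Task: {task}\n"
    "Respond with the level name only."
)

def classify_task(query_tool, task_text):
    """Ask one AI tool for a cognitive-demand label and normalize its reply."""
    prompt = PROMPT_TEMPLATE.format(levels=", ".join(LEVELS), task=task_text)
    reply = query_tool(prompt).strip().lower()
    # Accept a reply only if it names one of the known levels.
    for level in LEVELS:
        if level.lower() in reply:
            return level
    return "Unparsable"
```

Pairing each returned label with the expert-assigned label for the same task then provides the raw material for the accuracy figures discussed below.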

Initial assessment of AI tools for mathematical task classification yielded an overall Cognitive Demand Classification Accuracy of 62%. This exceeds the roughly 25% expected from uniform guessing across the four categories, but remains below the accuracy levels consistently achieved by human experts. Notably, the observed accuracy was not uniform across all tasks; individual task classifications ranged from a low of 9% to a high of 100%, indicating substantial performance variability dependent on the specific cognitive demands of the problem. These findings suggest that while AI tools exhibit some capacity for cognitive assessment, further refinement is necessary to achieve reliable, expert-level classification across a diverse range of mathematical tasks.
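One way to see how a single average can mask that spread is to compute accuracy both overall and per task. The helper below is a generic sketch; the record format and names are assumptions, not the study’s analysis code.

```python
from collections import defaultdict

def accuracy_report(records):
    """records: iterable of (task_id, tool_name, predicted_level, expert_level).

    Returns overall accuracy across all tool-task pairs and per-task accuracy
    across tools; the per-task view is what exposes a 9%-100% spread that a
    single average figure conceals.
    """
    correct = total = 0
    per_task = defaultdict(lambda: [0, 0])  # task_id -> [correct, attempts]
    for task_id, _tool, predicted, expert in records:
        hit = int(predicted == expert)
        correct += hit
        total += 1
        per_task[task_id][0] += hit
        per_task[task_id][1] += 1
    overall = correct / total if total else 0.0
    by_task = {task: c / n for task, (c, n) in per_task.items()}
    return overall, by_task
```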

Uncovering Systemic Biases in AI Cognitive Assessment

Analysis of AI-driven cognitive assessment tools revealed a consistent bias towards classifying tasks into the ‘Procedures With Connections’ and ‘Procedures Without Connections’ categories. This ‘Middle Category Bias’ indicates a disproportionate assignment of tasks to these procedural classifications, regardless of the cognitive demand actually required. Observed data suggests the AI consistently favored identifying tasks as relating to established procedures, potentially overlooking assessments demanding conceptual understanding or requiring different cognitive skills. This bias was consistent across multiple datasets and assessment types, suggesting it is not an anomaly related to specific task construction, but rather a systemic characteristic of the AI’s classification methodology.

Analysis of AI-driven cognitive assessment tools indicates a systematic prioritization of procedural fluency in task classification. These tools consistently favored categorizing tasks as either ‘Procedures With Connections’ or ‘Procedures Without Connections’, suggesting an algorithmic bias towards assessing how a task is performed rather than evaluating underlying conceptual knowledge or the ability to recall information. This preference is evidenced by the significantly lower accuracy observed for ‘Memorization Tasks’ (44%) and ‘Doing Mathematics Tasks’ (27%), which depend on recall and on deep conceptual reasoning, respectively, rather than solely on procedural execution.

Analysis also revealed a significant reliance on surface-level textual cues during task classification, rather than a comprehensive evaluation of underlying cognitive processes. This manifested as reduced accuracy at the extremes of the framework: ‘Memorization Tasks’ were correctly classified only 44% of the time, while ‘Doing Mathematics Tasks’ yielded the lowest accuracy rate at 27%. This suggests the algorithms prioritize easily identifiable features within task descriptions, potentially overlooking the cognitive skills actually required to complete the tasks.
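Both findings, depressed accuracy at the extreme levels and a pile-up of predictions in the middle, can be surfaced with the same bookkeeping. The sketch below (again illustrative, with an assumed record format) breaks accuracy out by the expert-assigned label and tallies how predictions distribute across the four levels.

```python
from collections import Counter, defaultdict

def category_diagnostics(records):
    """records: iterable of (predicted_level, expert_level).

    Returns (a) accuracy broken out by the expert-assigned level, which would
    reveal weak performance on Memorization and Doing Mathematics tasks, and
    (b) the share of predictions landing in each level, where a skew toward the
    two 'Procedures' categories is the middle-category bias made visible.
    """
    per_level = defaultdict(lambda: [0, 0])  # expert level -> [correct, total]
    predicted_counts = Counter()
    for predicted, expert in records:
        per_level[expert][0] += int(predicted == expert)
        per_level[expert][1] += 1
        predicted_counts[predicted] += 1
    n = sum(predicted_counts.values())
    accuracy_by_level = {lvl: c / t for lvl, (c, t) in per_level.items()}
    prediction_share = {lvl: k / n for lvl, k in predicted_counts.items()} if n else {}
    return accuracy_by_level, prediction_share
```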

The Implications for Educational AI: A Call for Nuance

Current artificial intelligence systems, despite demonstrable capabilities in many areas, exhibit biases that limit their usefulness in independently gauging the cognitive difficulty of mathematical problems for educational contexts. Recent studies reveal these tools often prioritize surface-level features – such as the length of a problem or the presence of specific keywords – rather than the underlying cognitive processes required for a solution. This reliance on superficial cues leads to inaccurate assessments of task complexity, potentially mislabeling simple problems as challenging, or conversely, failing to recognize the genuine difficulty of more nuanced exercises. Consequently, educators should not yet rely solely on AI-driven assessments to inform instructional decisions; human oversight remains crucial to ensure that students are presented with appropriately challenging material that genuinely fosters learning and avoids both frustration and boredom.

Current artificial intelligence systems often struggle with accurately gauging the cognitive complexity of mathematical problems, frequently relying on easily identifiable features rather than a deep understanding of the underlying cognitive demands. Research indicates that simply recognizing keywords or surface-level problem structures is insufficient; true assessment requires discerning the subtle interplay of concepts, the types of reasoning needed, and the potential for common misconceptions. Ongoing investigation must therefore prioritize developing AI algorithms capable of more nuanced analysis, moving beyond superficial pattern recognition towards a genuine comprehension of the cognitive processes involved in problem-solving. This necessitates exploring techniques that incorporate cognitive science principles, allowing AI to evaluate not just what a problem asks, but how a student might approach it and the cognitive resources that approach will require.

Advancing educational AI requires a fundamental shift in development, moving beyond pattern recognition to embrace the core principles that underpin how humans learn and teach. Current algorithms often lack the ability to discern why a mathematical task is challenging, instead focusing on superficial characteristics. Consequently, future iterations must integrate established cognitive science – understanding memory, attention, and problem-solving – with pedagogical content knowledge, which represents a deep understanding of how to effectively convey mathematical concepts. By embedding these principles directly into AI algorithms, developers can create tools that don’t just identify task difficulty, but actively support personalized learning pathways, provide targeted feedback, and ultimately, foster more meaningful and effective educational experiences. This approach promises AI not as a replacement for educators, but as a powerful ally in cultivating genuine understanding and mathematical proficiency.

The pursuit of automated task analysis, as explored in the study of AI’s capacity to classify cognitive demand, reveals a fundamental tension. The observed 62% accuracy, while promising, underscores the limitations of current systems. It is not sufficient to simply add computational power; true progress demands rigorous refinement. As Barbara Liskov observed, “Programs must be correct, and they must be understandable.” The struggle with extreme categories and inherent biases in the AI’s classifications highlights the need for transparency and interpretability, lest the tools become opaque black boxes. The potential for these systems lies not in autonomous operation, but in supporting human judgment: a reduction of complexity that reveals clarity for educators.

The Road Ahead

The observed performance – a shade over chance when discerning the extremes of cognitive load – exposes a fundamental limitation. The tools do not fail randomly; they exhibit a discernible preference, a bias woven into the algorithmic tapestry. This is not an error of calculation, but an artifact of construction. The aspiration for autonomous task analysis, therefore, remains distant, a seductive mirage. To pursue it directly is to mistake correlation for comprehension.

Future work must shift from seeking perfect classification to understanding the nature of the misclassification. Where do these tools consistently falter, and what does that reveal about the underlying structure of cognitive demand itself? Perhaps the Task Analysis Guide, with its inherent human assumptions, is not a neutral standard against which to measure artificial intelligence, but a reflection of our own cognitive biases.

The true utility lies not in replacing the teacher, but in augmenting their expertise. These tools, stripped of pretension, can serve as a preliminary sieve, flagging tasks worthy of deeper consideration. Simplicity, after all, is not a limitation, but a virtue. The goal is not to replicate thought, but to illuminate it.


Original article: https://arxiv.org/pdf/2603.03512.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/

2026-03-06 00:41