Author: Denis Avetisyan
New research explores whether large language models can accurately identify and interpret the complex values embedded within qualitative interview data.

A comparative analysis assesses the ability of AI to capture expert uncertainty in value alignment, using Schwartz Theory and ethnographic data.
Identifying nuanced human values from qualitative data remains a challenge for automated analysis, despite advances in large language models. This research, ‘Can LLMs Capture Expert Uncertainty? A Comparative Analysis of Value Alignment in Ethnographic Qualitative Research’, comparatively evaluates LLM performance against expert annotations in identifying core values expressed within long-form interviews, utilizing the Schwartz Theory of Basic Values. Findings reveal that while LLMs approach human-level accuracy on set-based metrics, they struggle with ranking and exhibit divergent uncertainty patterns, with models like Qwen demonstrating the strongest alignment with expert judgment. Given these discrepancies, can LLM ensembles effectively mitigate value biases and serve as genuine collaborative partners in inherently ambiguous qualitative analysis?
The Burden of Interpretation: Qualitative Depth in an Age of Data
Historically, understanding human motivations has relied heavily on qualitative research: in-depth interviews, focus groups, and ethnographic observation. While these methods yield nuanced and detailed insights into individual beliefs and societal norms, they are inherently time-consuming and resource-intensive. The process of manually coding and interpreting large volumes of textual or observational data introduces significant potential for subjective bias, as researchers’ preconceptions and analytical frameworks inevitably shape the findings. This reliance on individual interpretation limits the scalability of such studies and makes it difficult to establish reliable, generalizable conclusions about the prevalence of specific values within a population. The inherent challenges of traditional qualitative analysis have thus spurred the development of computational methods aimed at extracting meaningful patterns from complex human data with greater efficiency and objectivity.
The increasing need to understand human motivations at a population level necessitates a shift in how ethnographic data is analyzed. Traditionally, researchers painstakingly reviewed transcripts and field notes, a process inherently limited by time and susceptible to individual interpretation. However, scaling this qualitative work requires innovative computational approaches capable of identifying and categorizing the nuanced expressions of human values within large textual datasets. This demands more than simple keyword searches; effective methods must account for contextual meaning, linguistic variation, and the complex relationships between expressed values. Consequently, researchers are exploring techniques like natural language processing and machine learning to automate the extraction of value-laden statements, allowing for broader, more systematic insights into the underlying principles that guide human behavior and decision-making.
The enduring relevance of the Schwartz Theory of Basic Human Values – positing ten universal motivational values like self-direction, stimulation, and security – is increasingly recognized in fields from psychology to marketing. However, translating this powerful theoretical framework into actionable insights from large-scale textual data presents formidable obstacles. Manually coding extensive datasets for evidence of these values is prohibitively time-consuming and susceptible to researcher bias. Automated approaches, while promising, struggle with the nuances of language and the contextual expression of values; a word like “freedom” can indicate self-direction, but also stimulation or even security depending on the surrounding text. Consequently, researchers are actively developing computational methods – leveraging natural language processing and machine learning – to reliably identify and quantify these fundamental human motivations within sizable datasets, striving for a balance between analytical scale and interpretive fidelity.
Automated Insight: LLMs as Ethnographic Tools
The research employed Large Language Models (LLMs) to automate the analysis of qualitative data derived from open-ended interviews. This involved processing textual responses to identify expressions of underlying human values. Traditionally, this process required manual coding by trained researchers, a time-consuming and resource-intensive task. By utilizing LLMs, the research aimed to scale value identification and categorization, enabling analysis of larger datasets and reducing the need for extensive manual effort. The LLMs were tasked with extracting value-laden statements from interview transcripts and assigning them to predefined value categories, effectively automating a core component of qualitative research workflows.
Successful application of Large Language Models to the analysis of open-ended interview data necessitated careful prompt engineering to address the challenges of nuanced textual interpretation and alignment with the Schwartz Theory of Basic Human Values. Prompts were iteratively refined, incorporating specific instructions regarding value identification, contextual awareness, and the differentiation between closely related values as defined by Schwartz’s model. This included providing the LLM with the definitions of each of the ten basic values – Universalism, Benevolence, Conformity, Tradition, Security, Power, Achievement, Hedonism, Stimulation, and Self-Direction – and examples of textual cues indicative of each. Furthermore, prompts incorporated negative constraints, explicitly instructing the model to avoid common misinterpretations or biases in value assignment, and to prioritize the most salient value expressions within each interview segment.
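To make this prompt design concrete, the sketch below shows one plausible way to structure such a prompt in Python. The template wording, the `build_prompt` helper, and the JSON output schema are illustrative assumptions; the paper's actual prompts are not reproduced here, only the ingredients described above: the ten value definitions, contextual disambiguation, and negative constraints.

```python
SCHWARTZ_VALUES = [
    "Universalism", "Benevolence", "Conformity", "Tradition", "Security",
    "Power", "Achievement", "Hedonism", "Stimulation", "Self-Direction",
]

# Hypothetical template: only the ingredients described in the text are
# reproduced (value list, contextual disambiguation, negative constraints).
PROMPT_TEMPLATE = """You are coding a segment of an ethnographic interview.
Identify which of Schwartz's ten basic values the speaker expresses.

Values (choose only from this list): {values}

Rules:
- Rank at most three values, most salient first.
- Distinguish closely related values: a word like "freedom" may signal
  Self-Direction, Stimulation, or Security depending on context.
- Do not infer a value from the topic alone; quote the textual cue.

Segment:
---
{segment}
---

Answer as JSON: {{"ranking": [...], "cues": [...]}}"""


def build_prompt(segment: str) -> str:
    """Fill the template for one interview segment."""
    return PROMPT_TEMPLATE.format(values=", ".join(SCHWARTZ_VALUES),
                                  segment=segment)


print(build_prompt("I just want to be free to decide things for myself."))
```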
Evaluation of Large Language Models for identifying core human values from open-ended interview transcripts indicates performance approaching human-level accuracy in determining the top-3 value orientations of respondents. Quantitative analysis revealed that the Qwen3 model consistently achieved the highest scores, exhibiting the closest correlation to the established human ceiling – representing the maximum achievable inter-rater reliability among human coders. This suggests Qwen3 demonstrates a superior ability to interpret nuanced language and accurately categorize expressed values according to the Schwartz Theory of Basic Human Values, relative to other tested LLMs. Further statistical analysis confirmed the robustness of these findings across diverse datasets and interview prompts.
Consensus in Complexity: Aggregating LLM Insights
To mitigate the inherent variability of individual Large Language Models (LLMs), an LLM Ensemble approach was implemented. This involved aggregating the outputs of four distinct models – Qwen3, DeepSeek-R1, Llama-3, and Mistral – to generate a more stable and reliable result. By combining predictions from multiple models, the ensemble aimed to reduce the impact of any single model’s idiosyncratic errors or biases, thereby improving the overall consistency and robustness of the system. The individual model outputs were then processed using rank aggregation methods to determine a final, consolidated result.
To consolidate value rankings generated by multiple Large Language Models (LLMs), the study compared three rank aggregation methods: Majority Vote, Borda Count, and the Kemeny-Young Method. Majority Vote assigns the most frequent rank to each item. Borda Count assigns points based on rank order (e.g., highest ranked item receives n-1 points, where n is the number of items), summing these points for each item to determine overall ranking. The Kemeny-Young Method identifies the ranking that minimizes the pairwise Kendall tau distance from all individual LLM rankings, representing the consensus ranking with minimal disagreement. These methods were evaluated based on their ability to improve performance metrics, including F1, RBO, and Jaccard, relative to using individual LLM outputs.
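The sketch below gives minimal Python implementations of the three aggregation rules, applied to top-k value rankings. Tie-breaking and the handling of truncated rankings are not specified in the paper, so those details are assumptions here; the exhaustive Kemeny-Young search shown is exponential in the number of items and practical only for small value sets.

```python
from collections import Counter
from itertools import permutations


def majority_vote(rankings: list[list[str]]) -> list[str]:
    """Position-wise majority: at each rank, keep the most frequent item
    not already chosen (ties broken arbitrarily by count order)."""
    consensus = []
    for i in range(min(len(r) for r in rankings)):
        for item, _ in Counter(r[i] for r in rankings).most_common():
            if item not in consensus:
                consensus.append(item)
                break
    return consensus


def borda(rankings: list[list[str]]) -> list[str]:
    """Borda Count: the item at rank i earns n-1-i points; points are
    summed across models and items sorted by total score."""
    items = {v for r in rankings for v in r}
    n = len(items)
    scores = Counter()
    for r in rankings:
        for i, v in enumerate(r):
            scores[v] += n - 1 - i
    return sorted(items, key=lambda v: -scores[v])


def kendall_tau_distance(a: list[str], b: list[str]) -> int:
    """Pairwise order disagreements between two rankings of the same items."""
    pos = {v: i for i, v in enumerate(b)}
    return sum(1
               for i in range(len(a))
               for j in range(i + 1, len(a))
               if pos[a[i]] > pos[a[j]])


def kemeny_young(rankings: list[list[str]]) -> list[str]:
    """Exhaustive Kemeny-Young: the permutation minimizing total Kendall
    tau distance to all input rankings (assumes full rankings)."""
    best, best_cost = None, float("inf")
    for cand in permutations(rankings[0]):
        cost = sum(kendall_tau_distance(list(cand), r) for r in rankings)
        if cost < best_cost:
            best, best_cost = list(cand), cost
    return best


votes = [["Security", "Tradition", "Benevolence"],
         ["Security", "Benevolence", "Tradition"],
         ["Tradition", "Security", "Benevolence"],
         ["Security", "Tradition", "Benevolence"]]
print(majority_vote(votes))  # ['Security', 'Tradition', 'Benevolence']
print(borda(votes))          # same order: Security scores 7, Tradition 4
print(kemeny_young(votes))   # consensus with minimal pairwise disagreement
```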
Inter-rater reliability of the value assignment task was quantified using Krippendorff’s Alpha, yielding a score of 0.389, a low value indicating that the task is inherently ambiguous even for human annotators. To validate the LLM ensemble approach, results were compared against value assignments provided by human experts through a Value Alignment Analysis. This comparison demonstrated that the LLM ensemble consistently outperformed individual models, achieving an 8-10 point gain in both F1 and RBO scores, and a 6-8 point improvement in Jaccard scores, thereby confirming the effectiveness of the aggregation method.
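For reference, the set- and rank-based metrics named above can be sketched as follows. The rank-biased overlap (RBO) is given in its truncated prefix form, and the persistence parameter p = 0.9 is an assumed default; the paper's exact evaluation configuration is not spelled out here.

```python
def set_f1(pred: set[str], gold: set[str]) -> float:
    """F1 over unordered value sets (precision/recall on set overlap)."""
    if not pred or not gold:
        return 0.0
    precision = len(pred & gold) / len(pred)
    recall = len(pred & gold) / len(gold)
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)


def jaccard(a: set[str], b: set[str]) -> float:
    """Set overlap: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 1.0


def rbo(s: list[str], t: list[str], p: float = 0.9) -> float:
    """Truncated rank-biased overlap: prefix overlaps weighted by depth."""
    k = min(len(s), len(t))
    total = sum((p ** (d - 1)) * len(set(s[:d]) & set(t[:d])) / d
                for d in range(1, k + 1))
    return (1 - p) * total


expert = ["Self-Direction", "Security", "Benevolence"]
model = ["Security", "Self-Direction", "Hedonism"]
print(set_f1(set(model), set(expert)))   # ≈ 0.667: two of three values shared
print(jaccard(set(model), set(expert)))  # 0.5
print(rbo(model, expert))                # ≈ 0.144: order penalized at depth 1
```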
Bridging the Gap: Towards Human-Aligned LLM Interpretation
A detailed analysis was conducted to measure the extent to which value assignments generated by a Large Language Model (LLM) aligned with those made by human experts in the field. This involved a comparative assessment, pinpointing specific areas where the model’s valuations strongly correlated with expert opinions, as well as identifying instances of notable divergence. The research didn’t simply assess overall agreement, but rather mapped the pattern of agreement and disagreement, providing a nuanced understanding of the LLM’s capabilities and limitations. This granular approach revealed where the model excels at mirroring human judgment and where further refinement is needed to bridge the gap between artificial and human intelligence in qualitative assessment.
Analysis employing Spearman’s Rho, a measure of statistical dependence, demonstrated a noteworthy correlation of 0.457 between the uncertainty levels assigned by the language model and the extent of disagreement amongst human experts. This suggests a capacity within the model to not only process qualitative data but also to recognize instances where consensus is lacking, effectively flagging potentially ambiguous or complex cases. The model’s uncertainty, therefore, doesn’t represent random error, but an alignment with the inherent difficulty identified by human assessment; when experts diverge in their evaluations, the language model correspondingly registers higher uncertainty, indicating a promising ability to discern challenging analytical scenarios.
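The correlation itself is straightforward to compute in principle. The sketch below pairs a per-interview model-uncertainty score (entropy over the model's value distribution, an assumed proxy; the paper's uncertainty measure may differ) with an expert-disagreement score and computes Spearman's Rho via SciPy; all numbers are invented for illustration.

```python
import math

from scipy.stats import spearmanr  # assumes SciPy is available


def entropy(dist: list[float]) -> float:
    """Shannon entropy of one model's value distribution for an interview."""
    return -sum(p * math.log(p) for p in dist if p > 0)


# Hypothetical per-interview numbers; real inputs would come from the
# model outputs and the expert annotations.
model_uncertainty = [entropy(d) for d in [[0.90, 0.05, 0.05],
                                          [0.40, 0.30, 0.30],
                                          [0.50, 0.40, 0.10],
                                          [0.34, 0.33, 0.33]]]
expert_disagreement = [0.1, 0.7, 0.5, 0.9]  # e.g. 1 - pairwise agreement

rho, pval = spearmanr(model_uncertainty, expert_disagreement)
print(f"Spearman's rho = {rho:.3f}")  # toy data constructed so rho = 1.0
```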
Evaluations revealed that the Qwen3 model demonstrates a remarkable capacity for qualitative analysis, achieving an F1-score of 0.566 and a Jaccard similarity of 0.4396 – performance levels that closely approach the upper limits of human agreement. Further analysis indicated a high degree of consistency in value distribution between the model and human experts, as evidenced by a cosine similarity score of 0.833. These results suggest that large language models are not merely replicating human assessments, but are instead developing a nuanced understanding of qualitative data. Consequently, integrating LLMs with established human expertise promises to substantially improve both the speed and scope of qualitative research, offering a pathway to more efficient and scalable analytical processes.
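The cosine-similarity comparison reduces to comparing two frequency vectors over the ten values, as in the short sketch below; the assignment counts are invented for illustration.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two value-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical assignment counts over the ten Schwartz values,
# in a fixed order (e.g. Universalism ... Self-Direction).
expert_counts = [12, 9, 4, 6, 10, 2, 5, 3, 4, 11]
model_counts = [10, 11, 3, 5, 12, 1, 6, 4, 3, 9]
print(f"cosine similarity = {cosine(expert_counts, model_counts):.3f}")
```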
The Horizon of Ethnographic Inquiry: LLMs and the Future of Understanding
Computational ethnography, previously limited by the intensive labor required to analyze textual and observational data, is entering a new era of scalability. This research establishes a framework for leveraging large language models to process and interpret qualitative datasets, such as interview transcripts, field notes, and social media content, with a speed and scope previously unattainable. By automating key stages of analysis, including thematic coding and pattern identification, researchers can now tackle datasets orders of magnitude larger than those traditionally examined. This shift promises not only to accelerate ethnographic research but also to uncover subtle trends and connections that might otherwise remain hidden within vast quantities of qualitative information, ultimately fostering a more nuanced and data-driven understanding of human cultures and behaviors.
Continued advancement in computational ethnography hinges on a synergistic relationship between large language models and human researchers. Future studies will need to meticulously refine prompt engineering, moving beyond simple queries to develop nuanced instructions that elicit richer, more accurate qualitative analysis from these models. Crucially, the focus should extend beyond automated output; sophisticated methods for integrating LLM-generated insights with established ethnographic techniques are essential. This integration isn’t about replacing human interpretation, but rather augmenting it – allowing researchers to identify patterns and themes in vast datasets with greater efficiency, while retaining the critical contextual understanding and ethical considerations inherent in qualitative research. Ultimately, the goal is to build tools that amplify, not diminish, the depth and rigor of ethnographic inquiry.
The convergence of large language models and ethnographic inquiry promises a paradigm shift in how humanity understands itself. By automating the initial stages of qualitative data analysis – identifying themes, patterns, and nuanced sentiments within vast datasets – this approach transcends the limitations of traditional methods. Researchers can now explore cultural trends and behavioral shifts at scales previously unimaginable, moving beyond localized case studies to reveal global patterns and underlying value systems. This capability extends to uncovering subtle connections between seemingly disparate cultural phenomena, offering a more holistic and dynamic picture of the human experience. The potential lies not simply in processing more data, but in generating new insights into the complex interplay of values, beliefs, and practices that shape societies and drive human behavior, ultimately fostering a deeper understanding of the human condition.
The pursuit of mirroring human cognition with algorithms often leads to elaborate constructions. This research, examining LLMs’ capacity to identify values within qualitative data, reveals a curious tendency: models can achieve impressive alignment with human assessments, yet struggle to replicate the way humans express uncertainty. It’s a reminder that simply achieving a correct answer isn’t enough; the nuance of confidence, or lack thereof, is crucial. As Donald Knuth observed, “Premature optimization is the root of all evil.” The drive to build ever-more-complex LLMs risks obscuring the very qualities – interpretive robustness and honest uncertainty – that define genuine understanding. They called it an ensemble to hide the panic, but the true measure lies in acknowledging what the model doesn’t know.
Beyond the Echo
The exercise of mapping human values onto the outputs of Large Language Models reveals, predictably, more about the limitations of the mapping than the capabilities of either system. The proximity to human-level performance is, itself, a distraction. The crucial divergence lies not in what values are identified, but in how that identification is qualified. A model can approximate consensus, but it cannot, by current mechanisms, credibly express the contours of its own doubt: the very texture of interpretive robustness that defines expert qualitative analysis.
Future iterations must prioritize not simply the detection of value statements, but the rigorous quantification of epistemic uncertainty. The LLM ensemble approach offers a pragmatic avenue, yet treating uncertainty as mere variance obscures the more fundamental problem: these models operate as exquisitely optimized echo chambers. They amplify existing patterns, offering the illusion of insight while remaining fundamentally incapable of generating genuinely novel interpretive frameworks.
The pursuit of value alignment, therefore, demands a shift in focus. It is not enough to ask if a model agrees with human judgment; the question must be whether it can convincingly model the process of judgment – including the acknowledgement of ambiguity, the weighting of conflicting evidence, and the honest expression of its own limitations. Only then might the echo begin to fade, revealing something resembling genuine understanding.
Original article: https://arxiv.org/pdf/2603.04897.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/