Author: Denis Avetisyan
A new analysis reveals how large language models are subtly reshaping academic writing and challenging traditional methods for identifying original research.

This review examines the impact of large language models on scholarly publications, utilizing word frequency analysis and text similarity metrics to estimate their influence and assess the feasibility of AI-generated text detection.
Detecting artificially generated text remains a significant challenge despite advancements in natural language processing. This is explored in ‘Beyond Via: Analysis and Estimation of the Impact of Large Language Models in Academic Papers’, which analyzes shifts in word usage within arXiv preprints to quantify the influence of large language models (LLMs) on scientific writing. The research demonstrates that LLM use is demonstrably altering stylistic conventions – evidenced by changes in function word frequencies – and simultaneously complicating accurate authorship attribution. As LLMs become increasingly integrated into academic workflows, how will these evolving linguistic patterns reshape the landscape of scientific communication and assessment?
The Evolving Scholarly Landscape: LLMs and the Measurement of Impact
The proliferation of Large Language Models (LLMs) is fundamentally reshaping academic publishing, creating both opportunities and challenges for assessing scholarly impact. These models, capable of generating human-quality text, are increasingly utilized in various stages of research, from literature reviews and data analysis to manuscript drafting and even submission. This increased presence necessitates a shift in how the influence of research is measured; traditional metrics like citation counts and journal impact factors may become less reliable indicators of genuine intellectual contribution as LLM-generated content becomes more prevalent. Quantifying the specific impact of LLMs – whether as collaborative tools, sources of misinformation, or independent contributors to the scholarly record – is now critical for maintaining the integrity and trustworthiness of academic literature and ensuring that credit is appropriately assigned.
Assessing the impact of scholarly work is becoming increasingly complex as Large Language Models (LLMs) contribute to the growing volume of published content. Existing metrics, designed for a primarily human-authored landscape, struggle to accurately reflect the influence of LLM-generated text. While current detection tools demonstrate relatively high accuracy – around 80-90% – when simply identifying text written by a human versus an LLM, their effectiveness plummets to approximately 60% when tasked with differentiating between the outputs of multiple LLMs and genuine human writing. This diminished ability to pinpoint the source of content creates significant challenges for citation analysis, peer review, and ultimately, a reliable evaluation of research contributions, necessitating the development of novel analytical approaches that move beyond simple binary classification.

Dissecting the Machine’s Voice: Methods for LLM Text Identification
Detection of machine-authored content is increasingly important due to the rapid increase in text generated by large language models (LLMs). Current methods for identifying LLM-generated text utilize transformer-based models, specifically BERT, GPT-2, and T5. BERT (Bidirectional Encoder Representations from Transformers) is employed for its contextual understanding of text, while GPT-2 and T5 (Text-To-Text Transfer Transformer) are utilized both as generative models to create synthetic text for training detectors and as discriminative models for identifying machine-generated content. These models are typically fine-tuned on datasets comprising both human-written and LLM-generated text to maximize detection accuracy, enabling differentiation based on stylistic and linguistic patterns inherent in each source.
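The detectors described here are fine-tuned transformers, whose training pipeline is beyond a short excerpt. As a toy illustration of the underlying idea – separating sources by their function-word frequency profiles – the following minimal nearest-centroid sketch may help; the word list, corpora, and function names are invented for illustration, not taken from the paper:

```python
from collections import Counter

# A small set of English function words; stylometric detectors often key on these.
FUNCTION_WORDS = ["the", "of", "in", "to", "via", "and", "a", "is"]

def feature_vector(text):
    """Relative frequency of each function word in the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def classify(text, human_centroid, llm_centroid):
    """Assign the label whose centroid is closer in squared Euclidean distance."""
    v = feature_vector(text)
    d_h = sum((a - b) ** 2 for a, b in zip(v, human_centroid))
    d_l = sum((a - b) ** 2 for a, b in zip(v, llm_centroid))
    return "human" if d_h <= d_l else "llm"
```

A real detector replaces these hand-built frequency vectors with learned contextual embeddings, but the decision principle – which source's profile is the text closest to? – is the same.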
Quantitative analysis of text similarity between human and machine sources utilizes metrics such as ROUGE and BERTScore. ROUGE, which assesses overlap of n-grams, demonstrates an increasing correlation with human-written text as LLMs advance; newer models consistently achieve higher ROUGE scores. However, BERTScore, focusing on contextual embeddings and semantic similarity, does not always reflect this trend. This divergence suggests that while newer LLMs may more closely mimic the surface-level structure and vocabulary of human writing – as indicated by ROUGE – the underlying semantic content and contextual relevance may not be improving at the same rate, or are changing in character.
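ROUGE-1, the simplest member of the ROUGE family, reduces to clipped unigram overlap between candidate and reference. A minimal self-contained implementation (a sketch, not the paper's code) might look like:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

BERTScore has no comparably tiny formulation – it matches contextual embeddings rather than surface tokens – which is exactly why the two metrics can diverge as described above.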

Tracing the Signal: Quantifying LLM Influence Through Word Frequency Analysis
Word Frequency Analysis was applied to the arXiv dataset to quantify changes in term distribution following the increased submission of Large Language Model (LLM)-generated preprints. This method involves calculating the frequency of individual words within a defined corpus – in this case, the abstracts of papers indexed by arXiv – and comparing these frequencies across different time periods. Specifically, the dataset was segmented to represent pre- and post-LLM adoption phases. The difference in word frequencies between these phases served as the primary indicator of LLM influence. This approach allows for objective measurement of lexical shifts, revealing how the introduction of LLM-authored content alters the overall linguistic characteristics of scientific communication within the arXiv repository.
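A minimal sketch of this pre/post comparison – toy one-sentence corpora and invented helper names, not the paper's pipeline – could look like:

```python
from collections import Counter

def relative_frequencies(docs):
    """Per-word relative frequency over a corpus of documents."""
    counts = Counter()
    total = 0
    for doc in docs:
        tokens = doc.lower().split()
        counts.update(tokens)
        total += len(tokens)
    return {w: c / total for w, c in counts.items()}

def frequency_shift(pre_docs, post_docs):
    """Post-minus-pre change in relative frequency for every observed word."""
    pre = relative_frequencies(pre_docs)
    post = relative_frequencies(post_docs)
    vocab = set(pre) | set(post)
    return {w: post.get(w, 0.0) - pre.get(w, 0.0) for w in vocab}
```

Sorting the resulting shifts by magnitude surfaces exactly the kind of terms the study highlights, such as the rising preposition ‘via’.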
The Sequential Least Squares Programming (SLSQP) algorithm was selected for estimating the magnitude of changes in word frequency attributable to LLM-generated content because of its capacity to handle constrained optimization problems. SLSQP is a gradient-based sequential quadratic programming method; when analytical gradients are unavailable or computationally expensive to obtain, it approximates them numerically. Within this analysis, SLSQP was utilized to minimize a cost function representing the difference in word frequency distributions between pre-LLM and post-LLM datasets, subject to constraints ensuring statistical validity. The algorithm iteratively refines parameter estimates – specifically, scaling factors applied to LLM-generated text – until convergence is achieved, providing a quantifiable measure of the influence of LLMs on the observed changes in word usage. This approach yields statistically robust estimates of magnitude, accounting for potential confounding factors and minimizing the risk of spurious correlations.
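The paper's exact objective function is not reproduced here. Under the simplifying assumption that the post-LLM frequency distribution is a convex mixture of a human baseline and an LLM distribution, with mixing weight α playing the role of the scaling factor, a SciPy SLSQP sketch of the estimate might read (all frequency vectors are invented toy numbers):

```python
import numpy as np
from scipy.optimize import minimize

# Toy relative-frequency vectors over a shared 4-word vocabulary (illustrative only).
human = np.array([0.30, 0.25, 0.25, 0.20])   # pre-LLM baseline
llm   = np.array([0.10, 0.20, 0.30, 0.40])   # distribution typical of LLM output
alpha_true = 0.3
observed = (1 - alpha_true) * human + alpha_true * llm  # synthetic post-LLM data

def cost(x):
    """Squared error between observed frequencies and the mixture model."""
    a = x[0]
    return float(np.sum((observed - ((1 - a) * human + a * llm)) ** 2))

# SLSQP enforces the bound constraint 0 <= alpha <= 1; gradients are
# approximated numerically since none are supplied analytically.
result = minimize(cost, x0=[0.5], method="SLSQP", bounds=[(0.0, 1.0)])
alpha_hat = result.x[0]
```

On this synthetic data the recovered `alpha_hat` matches the planted mixing weight; real abstracts would add noise, more terms, and the validity constraints mentioned above.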
Word frequency analysis of the arXiv dataset revealed statistically significant shifts in terminology coinciding with the increased submission of preprints likely generated by Large Language Models (LLMs). The Coefficient of Variation was specifically utilized to quantify these changes, enabling the identification of terms exhibiting disproportionate increases or decreases in frequency. This analysis demonstrated that different LLMs favor distinct vocabulary; some models consistently utilize specific terms at a higher rate than others, while conversely, they avoid certain terms more frequently. These variations suggest stylistic preferences embedded within the training data or algorithmic biases of each LLM, contributing to detectable differences in the generated text beyond simple content replication.
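The Coefficient of Variation is simply the standard deviation normalized by the mean, which makes it scale-free: a short sketch with invented per-model frequencies shows how it flags terms that models use very unevenly while leaving common words alone:

```python
import math

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0 if the mean is 0)."""
    n = len(values)
    mean = sum(values) / n
    if mean == 0:
        return 0.0
    var = sum((v - mean) ** 2 for v in values) / n
    return math.sqrt(var) / mean

# Relative frequency of a term in text generated by each of several models
# (toy numbers): a high CV marks a term with strong per-model preferences.
via_by_model = [0.001, 0.010, 0.002, 0.012]   # e.g. 'via'
the_by_model = [0.060, 0.058, 0.061, 0.059]   # e.g. 'the'
```

Here ‘via’ would score a CV an order of magnitude higher than ‘the’, exactly the disproportionate variation the analysis uses to identify model-specific vocabulary.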

Deconstructing the Digital Scribe: The Impact of Diverse LLM Architectures
The study encompassed a diverse array of large language models – including iterations of Claude, DeepSeek, Gemini, and the GPT series, specifically GPT-3.5, GPT-4o Mini, and GPT-5 Nano – to meticulously evaluate each model’s unique impact on generated text. This broad scope wasn’t merely about quantity; it aimed to dissect the specific contributions of differing architectures and training methodologies. By analyzing the output of each model individually, researchers sought to understand how variations in design translate into observable differences in textual characteristics, ultimately providing a granular view of the evolving LLM landscape and the stylistic nuances each model introduces.
A comparative analysis of several large language models – including Claude, DeepSeek, Gemini, and various GPT iterations – demonstrates considerable divergence in their effects on textual characteristics. The study reveals that each model introduces unique shifts in word frequency, altering the statistical profile of generated text; some terms experience declines in usage while others, notably prepositions like ‘via’, see increased prevalence, suggesting changes in stylistic preferences. Furthermore, assessments of textual similarity indicate that while all models contribute to homogenization in certain aspects of writing, they also retain distinct ‘fingerprints’ detectable through nuanced statistical analysis, implying that the architectural choices and training data of each model shape its output in measurable ways. These variations highlight the importance of considering individual model characteristics when evaluating the broader impact of LLMs on written communication and academic prose.
Analysis of recent large language models reveals that each architecture leaves a unique imprint on academic writing styles. Researchers detected measurable shifts in word frequencies, demonstrating that the influence of these models isn’t uniform; certain terms experienced declines in usage, while others, notably prepositions like ‘via’, saw a marked increase in prevalence. This suggests that differing algorithmic approaches within each LLM encourage subtly distinct stylistic choices, moving beyond simple content generation to actively shaping the way information is conveyed. Consequently, a comprehensive understanding of these nuanced capabilities is crucial for accurately assessing the impact of LLMs on scholarly communication and ensuring responsible integration into academic workflows.
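One simple way to make such model ‘fingerprints’ concrete – purely illustrative, with invented frequency profiles rather than the paper's data – is to compare sources by the cosine similarity of their function-word frequency vectors:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy function-word frequency profiles (components: 'the', 'of', 'via', 'in').
model_a = [0.060, 0.030, 0.012, 0.020]
model_b = [0.059, 0.031, 0.011, 0.021]   # close to model_a: similar fingerprint
human   = [0.061, 0.033, 0.001, 0.024]   # rarely uses 'via'
```

Two models with similar training regimes sit close together under this measure, while the human profile – which barely uses ‘via’ – stands apart, mirroring the detectable differences described above.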

The study meticulously details how Large Language Models subtly reshape academic discourse, altering word frequencies and stylistic norms. This echoes Tim Berners-Lee’s sentiment: “The Web is more a social creation than a technical one.” The research demonstrates that the ‘social creation’ of academic writing is now being co-authored by algorithms, impacting the very structure of knowledge dissemination. Just as the Web’s architecture dictates its behavior, the underlying structure of LLMs – their training data and algorithms – dictates the emergent properties of academic text, creating feedback loops in which algorithmic influence reinforces itself. The paper’s findings suggest a need to understand these emergent properties to maintain the integrity of scholarly communication.
What’s Next?
The pursuit of quantifying influence – in this case, the encroachment of Large Language Models upon academic writing – reveals a fundamental truth: the metric itself reshapes the phenomenon. Focusing solely on stylistic shifts, on the frequency of certain phrases, feels akin to charting the symptoms while ignoring the underlying disease. The real question isn’t whether LLMs can mimic academic prose, but what incentives are created by their increasing prevalence. The ease of generation lowers the cost of publication, but at what price to originality and critical thought? This work establishes a baseline, but the landscape will shift faster than any analysis can keep pace.
Future research must move beyond detection – a game of cat and mouse destined to escalate in complexity. Instead, consideration should be given to the broader impact on knowledge dissemination. How do these models affect peer review? What new forms of academic fraud might emerge? The architecture of scientific communication is subtly, yet profoundly, altered. Good architecture is invisible until it breaks, and the cracks are beginning to show.
Ultimately, the challenge lies in recognizing that LLMs are not merely tools, but participants in the scientific process. Dependencies are the true cost of freedom, and as we increasingly rely on these models, we must carefully consider the trade-offs. Simplicity scales, cleverness does not, and a return to foundational principles – rigorous methodology, transparent reporting, and a commitment to intellectual honesty – will be crucial to navigate this evolving landscape.
Original article: https://arxiv.org/pdf/2603.25638.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
2026-03-29 10:34