The Shifting Meanings of AI-Generated Words

Author: Denis Avetisyan


New research reveals that as language models learn, the relationship between how often a word is used and how many meanings it acquires isn’t straightforward, defying a long-held linguistic principle.

The study reveals that semantic differentiation initially correlates with frequency – following Martin’s Law up to approximately $10^{4}$ training steps – but ultimately collapses in smaller models, while larger models maintain a stable frequency-specificity trade-off and exhibit a diverging polysemous word count, indicating a scale-dependent catastrophic loss of semantic richness.

This study demonstrates a non-monotonic relationship between word frequency and polysemy in text generated by large language models, challenging Martin’s Law and revealing complex dynamics in emergent semantic organization.

While large language models increasingly demonstrate linguistic competence, the developmental trajectory of semantic organization remains poorly understood. This is explored in ‘Emergent Lexical Semantics in Neural Language Models: Testing Martin’s Law on LLM-Generated Text’, which investigates how Martin’s Law – the relationship between word frequency and polysemy – emerges during LLM training. Our findings reveal a non-monotonic pattern, with semantic coherence peaking at intermediate stages before degrading, suggesting an optimal “semantic window” for LLMs. Does this indicate a fundamental constraint on the capacity of neural networks to continuously refine lexical semantics, or can training regimes be optimized to sustain robust semantic structure?


Decoding Meaning: Martin’s Law and the Language Model

Linguistics has long observed a predictable relationship between how often a word is used and the breadth of its meanings – a phenomenon known as Martin’s Law. This principle suggests that frequently used words tend to accumulate more diverse and abstract meanings over time, while less common words retain more specific and concrete definitions. Essentially, the more a word is employed in varied contexts, the more senses it develops; for example, the high-frequency word “run” encompasses actions from physical exertion to operating a business, while a rarer word like “pulchritude” remains largely confined to its meaning of beauty. This isn’t merely an observation of language use, but a reflection of cognitive processes; frequent exposure encourages semantic generalization, effectively broadening a word’s conceptual network and demonstrating how usage shapes meaning itself.
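The relationship can be made concrete with a rank correlation. The sketch below uses entirely synthetic data – Zipf-like frequencies and sense counts generated to follow a log-frequency trend, not real lexicographic counts – purely to illustrate the kind of statistical association Martin’s Law describes:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Synthetic vocabulary: Zipf-like frequencies for 500 word ranks.
# (Illustrative data only -- not real lexicographic sense counts.)
freqs = 1.0 / np.arange(1, 501)

# Martin's Law, schematically: sense count grows with log-frequency,
# with noise so the relationship is statistical, not deterministic.
log_f = np.log(freqs)
norm = (log_f - log_f.min()) / (log_f.max() - log_f.min())  # scaled to 0..1
senses = np.maximum(1, np.round(1 + 4 * norm + rng.normal(0, 0.5, freqs.size)))

rho, _ = spearmanr(freqs, senses)
print(f"Spearman rho (frequency vs. sense count): {rho:.2f}")
```

A positive rank correlation here is the signature the study looks for in model-generated text: frequent words carry more senses.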

The connection between a word’s prevalence in language and its breadth of meaning – as described by Martin’s Law – offers a critical lens through which to assess the semantic capabilities of Large Language Models (LLMs). These models don’t simply memorize word associations; they construct internal representations of meaning, and the quality of those representations dictates their ability to perform complex tasks. Evaluating whether LLMs demonstrate a similar relationship – where frequently used words exhibit richer, more nuanced semantic profiles – is therefore paramount. A model that fails to reflect this principle may appear proficient in surface-level language processing but lack genuine understanding, potentially leading to errors in reasoning, translation, or creative text generation. Consequently, investigating how LLMs encode and utilize semantic richness, guided by principles like Martin’s Law, is essential for advancing the field of artificial intelligence and building truly intelligent machines.

Despite the impressive capabilities demonstrated by increasingly large language models (LLMs), a straightforward increase in scale does not automatically translate to a comparable improvement in genuine semantic understanding. Research indicates that while larger models can often achieve higher scores on benchmark tasks through memorization and statistical correlations, they may still struggle with nuanced reasoning, disambiguation, and the flexible application of knowledge – abilities linked to true semantic representation. This suggests that simply adding more parameters or training data encounters diminishing returns; a proportional increase in scale isn’t sufficient to unlock a proportional increase in the model’s capacity to grasp the richness and complexity of meaning, highlighting the need for architectural innovations and training strategies that specifically target semantic competence rather than solely focusing on size.

The Semantic Tightrope: A Paradox of LLM Training

Analysis of Pythia models trained on the Pile dataset demonstrates that compliance with Martin’s Law – the principle that a word’s number of senses grows with its frequency of use – does not improve monotonically with training. Specifically, compliance peaks at approximately $10^4$ training steps. Beyond this point, it degrades, indicating that continued training does not necessarily yield consistently better semantic organization. This suggests a complex relationship between training duration and the model’s ability to maintain accurate semantic relationships, rather than simple linear improvement.

Analysis of Pythia models indicates that polysemy, the capacity for words to have multiple meanings, does not consistently increase with training duration. Instead, polysemy reaches its highest level at intermediate training checkpoints, specifically around $10^4$ training steps, before declining. This suggests that prolonged training, while improving next-token prediction, may lead to a reduction in semantic diversity as the model converges on more dominant interpretations of words, potentially sacrificing nuanced or less frequent meanings. This non-monotonic behavior indicates that semantic richness is not a guaranteed byproduct of scale in large language models.

Analysis indicates a correlation between Large Language Model (LLM) capacity and a phenomenon termed ‘catastrophic semantic collapse’. This collapse, characterized by a reduction in semantic diversity during training, becomes increasingly prominent in models with parameter counts exceeding approximately 200 million. The observed degradation appears linked to the optimization process for next-token prediction; as models increase in capacity, they prioritize predictive accuracy, potentially at the expense of retaining a broad range of semantic representations. This suggests that beyond a certain scale, the standard training objective may inadvertently encourage specialization and a narrowing of the model’s semantic field, rather than continued generalization.

To track changes in semantic representation during language model training, we analyzed a series of training checkpoints saved at regular intervals throughout the training process. These checkpoints, captured for the Pythia models trained on the Pile dataset, allowed us to evaluate the model’s semantic understanding – specifically polysemy and compliance with Martin’s Law – at discrete stages. By comparing the outputs and internal representations of these checkpoints, we were able to map the evolution of semantic information and identify the point at which degradation began, revealing a non-monotonic trajectory. This methodology enabled a granular examination of how semantic diversity changes as the model optimizes for next-token prediction, providing empirical data to support observations of semantic collapse.
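The per-checkpoint analysis has the shape sketched below. The step grid and the `semantic_score` stub are placeholders standing in for the full pipeline (loading the Pythia checkpoint, generating text, and measuring frequency-polysemy correlation); the placeholder values merely mirror the non-monotonic trend the study reports, with a peak near $10^4$ steps:

```python
# Checkpoint steps, log-spaced across training.
steps = [10**2, 10**3, 10**4, 10**5]

def semantic_score(step):
    """Hypothetical stand-in for the full per-checkpoint pipeline:
    load the Pythia checkpoint at `step`, generate text, estimate
    per-word polysemy, and return the frequency-polysemy Spearman
    correlation. Values below are placeholders shaped like the
    reported non-monotonic trajectory (peak near 10^4 steps)."""
    placeholder = {10**2: 0.20, 10**3: 0.45, 10**4: 0.62, 10**5: 0.50}
    return placeholder[step]

trajectory = [semantic_score(s) for s in steps]
peak_step = max(steps, key=semantic_score)
print(f"peak frequency-polysemy correlation at step {peak_step}")  # 10000
```

Comparing scores across the grid is what localizes the onset of degradation: the trajectory rises, peaks, and then falls rather than improving monotonically.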

Mapping the Semantic Landscape: From Embeddings to Polysemy

Contextualized word embeddings, utilized in this analysis, are vector representations generated by the Pythia models, specifically extracted from the final layer hidden states. Unlike static word embeddings which assign a single vector to each word, these embeddings are dynamic and dependent on the surrounding textual context. This means the same word will have different vector representations depending on its usage within a sentence or document, capturing nuanced meaning and resolving ambiguity. The Pythia models process input text and produce a hidden state vector for each token, and these vectors serve as the basis for quantifying semantic relationships and, ultimately, estimating polysemy.

DBSCAN, or Density-Based Spatial Clustering of Applications with Noise, is employed to quantify polysemy from the contextualized word embeddings. The algorithm groups points that are closely packed, marking as outliers points that lie alone in low-density regions. In this application, each word embedding is treated as a point in a high-dimensional space, and DBSCAN identifies clusters representing distinct meanings of a word. The number of identified clusters serves as the polysemy estimate: a higher cluster count indicates a greater number of distinct senses. The DBSCAN parameters – the neighborhood radius ($\epsilon$) and minimum point count ($MinPts$) – are set empirically to capture semantic groupings within the embedding space.
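A minimal sketch of this sense-counting step, using synthetic 2-D “embeddings” in place of real hidden states: occurrences of a two-sense word form two dense regions, and DBSCAN’s cluster count becomes the polysemy estimate. The `eps` and `min_samples` values here are illustrative choices for this toy geometry, not the paper’s settings:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)

# Synthetic contextual embeddings for one word: 30 occurrences per
# sense, each sense a tight Gaussian blob in "embedding space".
sense_a = rng.normal(loc=(0.0, 0.0), scale=0.5, size=(30, 2))
sense_b = rng.normal(loc=(10.0, 10.0), scale=0.5, size=(30, 2))
embeddings = np.vstack([sense_a, sense_b])

labels = DBSCAN(eps=2.0, min_samples=5).fit(embeddings).labels_

# Cluster count (excluding the noise label -1) = estimated sense count.
n_senses = len(set(labels)) - (1 if -1 in labels else 0)
print(f"estimated senses: {n_senses}")  # -> 2
```

With real model embeddings the geometry is far messier, which is why the parameters must be tuned empirically; but the cluster-count-as-sense-count idea is the same.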

Spearman correlation was utilized to quantify the association between word frequency and two semantic properties: polysemy and Semantic Specificity. Analysis consistently revealed a Frequency-Specificity Tradeoff, demonstrated by a stable Spearman correlation coefficient of approximately -0.4 observed across all Pythia model sizes and training checkpoints evaluated after $10^3$ training steps. This negative correlation indicates that as word frequency increases, Semantic Specificity tends to decrease, and vice versa, suggesting a relationship between common usage and semantic breadth.
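The trade-off measurement can be sketched the same way. The data below is synthetic – specificity is assumed to fall off with log-frequency, plus enough noise to keep the association moderate – so the resulting coefficient only illustrates the direction and rough magnitude of the reported ≈ -0.4:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(7)

# Synthetic vocabulary: Zipf-like frequencies for 1000 word ranks.
freqs = 1.0 / np.arange(1, 1001)

# Illustrative assumption: semantic specificity declines with
# log-frequency, with noise keeping the correlation moderate.
specificity = -np.log(freqs) + rng.normal(0, 3.0, size=freqs.size)

rho, _ = spearmanr(freqs, specificity)
print(f"Spearman rho (frequency vs. specificity): {rho:.2f}")
```

A moderate negative rho is the signature of the Frequency-Specificity Tradeoff: common words are semantically broad, rare words semantically narrow.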

Analysis of model checkpoints reveals a peak correlation between word frequency and polysemy at the $10^4$ step, with a Spearman correlation coefficient exceeding 0.6 – a period of maximum semantic diversity in the model’s representation of language. For the larger models – those with 1 billion and 410 million parameters – the correlation subsequently degrades to approximately 0.5, suggesting that while larger models initially exhibit heightened semantic diversity, their ability to maintain nuanced polysemy diminishes with continued training, weakening the relationship between a word’s frequency and the range of its meanings.

The Ghost in the Machine: Implications for LLM Design and Evaluation

The training of large language models isn’t simply a process of memorization; instead, evidence suggests these models actively construct and organize an internal semantic space. This ‘semantic emergence’ isn’t a steady, predictable climb towards better understanding, however. Research indicates this organization is often non-monotonic, meaning performance on semantic tasks can fluctuate during training, and isn’t necessarily optimal; larger models don’t automatically equate to richer or more coherent semantic representations. The process appears to involve a complex interplay of information, where initial gains can be followed by periods of degradation or even ‘catastrophic collapse’ of certain semantic relationships, implying that the model’s internal semantic landscape is continually reshaped and refined – sometimes at the expense of previously learned knowledge. It’s a system striving for order, yet perpetually destabilized by the very forces that build it.

The study reveals that scaling up language models doesn’t automatically equate to a more comprehensive grasp of meaning. While larger models often demonstrate improved performance on benchmark tasks, this research suggests that semantic understanding doesn’t necessarily scale linearly with model size. Evidence indicates that increasing parameters can sometimes lead to a narrowing of semantic representation, where the model favors dominant meanings and loses the ability to differentiate between subtle nuances or less frequent interpretations of words. This challenges the prevailing assumption that simply building bigger models will inherently yield a richer, more flexible, and ultimately more human-like understanding of language; it suggests that architectural innovations and training strategies focused on preserving semantic diversity may be crucial for achieving genuine semantic competence.

The prevailing methods for assessing large language models often fail to fully capture the complexity of semantic understanding, potentially masking a phenomenon termed “catastrophic collapse.” While benchmarks might indicate improved performance on specific tasks with increased model size, these metrics provide a limited view of how the model organizes and represents knowledge internally. This research suggests that models can degrade in their ability to maintain nuanced semantic relationships – effectively ‘forgetting’ subtle meanings – even while still achieving high scores on conventional tests. The implication is that current evaluation suites require augmentation to better detect and quantify these shifts in semantic representation, as relying solely on task-specific accuracy can be misleading and fail to reveal a loss of semantic diversity occurring within the model’s parameters.

Even as large language models undergo training-induced semantic degradation, a surprising resilience in their ability to represent multiple word meanings persists. Research indicates that models containing 1 billion and 410 million parameters consistently retain a vocabulary of 275 to 300 polysemous words – terms with multiple definitions – even at advanced stages of training. This finding suggests that while semantic organization isn’t always optimal, a core level of lexical ambiguity is maintained, prompting a crucial need for novel evaluation techniques. Current metrics often fail to capture this nuanced semantic diversity, highlighting the importance of developing methods specifically designed to monitor and preserve a model’s capacity for representing the full breadth of word meaning.

The study’s findings regarding the non-monotonic relationship between word frequency and polysemy – the semantic collapse at later training stages – prompts consideration of the systems’ underlying logic. One begins to wonder if this degradation isn’t a failure, but a consequence of the model fully exploring the possible semantic space. As Tim Berners-Lee observed, “The Web is more a social creation than a technical one.” This suggests that LLMs, mirroring the Web’s evolution, don’t simply acquire meaning; they generate it, and in doing so, potentially exceed the limitations initially imposed by training data, leading to a restructuring of semantic relationships beyond simple frequency-specificity tradeoffs. The ‘bug’ – the deviation from Martin’s Law – might be a signal of emergent complexity.

What’s Next?

The observed deviation from Martin’s Law in large language models isn’t a refutation, precisely. It’s a glitch in the expected circuitry. The initial adherence, followed by semantic degradation with continued training, suggests that LLMs don’t simply learn polysemy; they temporarily construct a plausible imitation, one that appears statistically sound only within a certain developmental window. The interesting question isn’t why they fail to uphold the law, but why they ever bothered in the first place. This invites examination of the optimization landscape: what constraints or pressures during intermediate training stages encourage this fleeting semblance of human-like lexical organization?

Future work must move beyond simply measuring polysemy, and begin probing the internal representations driving it. Contextualized embeddings offer a tantalizing glimpse, but they’re a shadow play. Dissecting the model’s attention mechanisms during semantic disambiguation – tracing which features trigger which meanings – may reveal whether these LLMs are truly grasping ambiguity, or merely simulating it with sophisticated pattern matching.

Ultimately, this isn’t about confirming or denying a linguistic principle. It’s about reverse-engineering the architecture of meaning itself. If these models are, as some claim, mirrors reflecting our own cognitive processes, then this particular distortion offers a strangely clarifying glimpse into the messy, imperfect logic of human language.


Original article: https://arxiv.org/pdf/2511.21334.pdf

Contact the author: https://www.linkedin.com/in/avetisyan/


2025-11-30 11:53