Taming the Gradient Spike: A New Approach to Training Large Language Models
![Spectra achieves lower convergence loss and stronger downstream performance across learning rates [latex] \eta \in \{8 \times 10^{-4}, 1 \times 10^{-3}, 5 \times 10^{-3}, 1 \times 10^{-2}\} [/latex], demonstrating robustness across optimization regimes.](https://arxiv.org/html/2602.11185v1/figures/lr_ablation.png)
Researchers have discovered a key pattern in the gradients of large language models and developed a new optimizer designed to exploit this structure for faster, more efficient training.
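The article does not spell out Spectra's update rule, but a common way to tame gradient spikes — and a minimal illustration of the general idea, not necessarily the paper's method — is to track a running estimate of the gradient norm and downscale any gradient that jumps far above it. A sketch (all names hypothetical):

```python
import numpy as np


class SpikeAwareClipper:
    """Downscale gradients whose norm spikes far above a running average.

    A generic spike-mitigation sketch for illustration only; the Spectra
    optimizer's actual update rule is not described in this article.
    """

    def __init__(self, beta=0.99, threshold=3.0):
        self.beta = beta            # EMA decay for the running gradient norm
        self.threshold = threshold  # a "spike" is a norm > threshold * EMA
        self.ema_norm = None        # running estimate, initialized lazily

    def clip(self, grad):
        norm = np.linalg.norm(grad)
        if self.ema_norm is None:
            # first step: seed the running estimate with the observed norm
            self.ema_norm = norm
        limit = self.threshold * self.ema_norm
        if norm > limit > 0:
            # rescale the spiking gradient back down to the allowed norm
            grad = grad * (limit / norm)
            norm = limit
        # update the running estimate using the (possibly clipped) norm
        self.ema_norm = self.beta * self.ema_norm + (1 - self.beta) * norm
        return grad
```

In a training loop this would sit between the backward pass and the optimizer step, so a single anomalous batch cannot blow up the parameter update.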






