Transformers Conquer Shifting Data in Regression Tasks

Transformer-based in-context learning proves robust across a range of noise distributions, including non-Gaussian and heavy-tailed settings such as Bernoulli, exponential, Gamma, and Poisson noise, where the [latex]\ell_{1}[/latex] objective aligns with maximum-likelihood estimation. The robustness even extends to distributions such as the Student-t with [latex]\nu = 2[/latex], which fall outside traditional finite-variance statistical frameworks. Across these settings, the Transformer performs comparably to, or better than, classical estimators such as least squares, Ridge regression, and [latex]\ell_{1}[/latex] solvers (LP- and ADMM-based).
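As a rough illustration of the classical baselines named above (not the paper's setup or code), the sketch below generates a linear regression problem with heavy-tailed Student-t noise ([latex]\nu = 2[/latex], matching the text) and fits least squares, Ridge regression, and an [latex]\ell_{1}[/latex] (least-absolute-deviation) estimator posed as a linear program. The problem dimensions and the Ridge penalty [latex]\lambda = 1.0[/latex] are assumptions chosen for the example.

```python
# A minimal sketch of the classical baselines mentioned above: OLS, Ridge,
# and an L1 (least-absolute-deviation) fit solved as a linear program.
# Dimensions and the Ridge penalty are illustrative assumptions.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)

# Heavy-tailed noise: Student-t with nu = 2 (infinite variance), as in the text.
noise = rng.standard_t(df=2, size=n)
y = X @ w_true + noise

# Ordinary least squares (closed form).
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Ridge regression with an assumed penalty lambda = 1.0.
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# L1 regression via LP: minimize sum_i t_i subject to |y_i - x_i . w| <= t_i.
# Variables are [w (d), t (n)]; t_i bounds the absolute residuals.
c = np.concatenate([np.zeros(d), np.ones(n)])
A_ub = np.block([[X, -np.eye(n)], [-X, -np.eye(n)]])
b_ub = np.concatenate([y, -y])
bounds = [(None, None)] * d + [(0, None)] * n
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
w_l1 = res.x[:d]

for name, w in [("OLS", w_ols), ("Ridge", w_ridge), ("L1/LP", w_l1)]:
    print(f"{name:6s} parameter error: {np.linalg.norm(w - w_true):.3f}")
```

Under this kind of heavy-tailed noise, the [latex]\ell_{1}[/latex] fit typically recovers the parameters more accurately than least squares, which is the classical behavior the Transformer is being compared against.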

New research shows that Transformer models are surprisingly resilient to unexpected shifts in the data distribution during in-context linear regression, often outperforming traditional statistical methods.