Author: Denis Avetisyan
A new autoregressive approach to generative classification is challenging the dominance of diffusion models, delivering superior performance and enhanced robustness.

This review details an order-marginalized autoregressive generative classifier that achieves state-of-the-art results on image classification tasks and exhibits improved resilience to distribution shifts.
While diffusion models have recently dominated generative image classification, their superiority over autoregressive (AR) models has not been fully explored. This paper, ‘Revisiting Autoregressive Models for Generative Image Classification’, identifies a critical limitation of prior AR approaches – a restrictive reliance on a fixed token order – and introduces order-marginalization to unlock their full potential. By averaging predictions across multiple token orderings, the authors demonstrate that their method consistently outperforms diffusion-based classifiers on diverse benchmarks, achieving up to 25x greater efficiency and competitive results compared to self-supervised discriminative models. Could this work signal a resurgence of autoregressive models as a leading paradigm for robust and efficient image classification?
The Limits of Superficial Vision
Contemporary computer vision systems, such as the DINOv2 model, frequently achieve impressive performance by capitalizing on unintended shortcuts within datasets, rather than genuine comprehension of an object’s form. These systems often prioritize textural details and spurious correlations – patterns that coincidentally appear alongside the target object but aren’t intrinsic to its identity – over robust shape understanding. Consequently, a model might learn to identify ‘zebra’ based on the presence of tall grass in the training images, rather than the animal’s stripes themselves, leading to inaccurate classifications when presented with a zebra in a different environment. This reliance on superficial features makes these models vulnerable to even subtle changes in input, highlighting a crucial gap between statistical pattern recognition and true visual intelligence.
Despite achieving impressive performance benchmarks – such as the 88.8% Top-1 Accuracy of DINOv2-XL on the ImageNet-Val dataset – current computer vision systems demonstrate a surprising fragility when confronted with even slight variations in input. This brittleness stems from a reliance on superficial features and textural cues rather than a genuine understanding of object shape and structure. Consequently, these systems are easily fooled by adversarial examples – subtly altered images designed to cause misclassification – or exhibit diminished performance when encountering images from distributions differing from those used during training. This lack of robust generalization underscores a critical limitation of purely discriminative approaches, suggesting that a shift towards models capable of understanding and generating visual data is essential for building truly reliable and adaptable vision systems.
This fragility points to a structural limitation rather than an engineering oversight: discriminative training rewards whatever cues happen to separate the classes, with no pressure toward a robust grasp of underlying shape, so systems remain easily fooled by subtle alterations – adversarial examples or shifts in data distribution. Consequently, the field is increasingly recognizing the necessity of generative models, which, by learning to create images, are compelled to develop a more complete and resilient understanding of visual structure, moving beyond mere feature detection towards a truly insightful vision.

Generative Classification: A New Paradigm for Perception
Generative classifiers distinguish themselves from discriminative models by explicitly modeling the probability distribution of the input data, p(x). This approach allows the classifier to understand the inherent structure of the data, leading to more robust performance when encountering noisy or incomplete inputs. Unlike discriminative models which directly learn the decision boundary, generative models create a representation of how the data is generated, and can therefore generalize to unseen data points more effectively. The learned data distribution also facilitates interpretability; by examining the model’s representation of p(x), insights into the key features and relationships within the data can be obtained, offering a degree of transparency not typically found in black-box discriminative classifiers.
Generative classifiers utilize Bayes’ Rule – P(C|X) = P(X|C)·P(C) / P(X) – to enhance classification performance. This framework allows the incorporation of prior probabilities P(C) representing pre-existing knowledge about class distributions, combined with the likelihood P(X|C) of observing data X given a class C. By explicitly modeling the data-generating process, these models can differentiate between genuine correlations and spurious ones, as the prior acts as a regularization term. This approach reduces overfitting to noisy or irrelevant features and improves generalization to unseen data, ultimately leading to more accurate and reliable classifications compared to discriminative models that focus solely on decision boundaries.
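As a toy illustration of this decision rule – not the paper’s image model – a hand-rolled one-dimensional Gaussian classifier can pick the class maximizing log P(X|C) + log P(C). The class names, means, and priors below are invented for the example, and P(X) can be dropped because it is identical for every class:

```python
import math

# Toy 1-D generative classifier: each class C models P(X|C) as a Gaussian.
# All numbers are illustrative assumptions, not fitted parameters.
CLASS_PARAMS = {
    "cat": {"mean": 2.0, "std": 1.0, "prior": 0.7},
    "dog": {"mean": 5.0, "std": 1.0, "prior": 0.3},
}

def log_gaussian(x, mean, std):
    """Log-density of a Gaussian: the log-likelihood log P(X|C)."""
    return -0.5 * math.log(2 * math.pi * std**2) - (x - mean) ** 2 / (2 * std**2)

def classify(x):
    """Argmax over classes of log P(X|C) + log P(C); P(X) cancels."""
    scores = {
        c: log_gaussian(x, p["mean"], p["std"]) + math.log(p["prior"])
        for c, p in CLASS_PARAMS.items()
    }
    return max(scores, key=scores.get)

print(classify(2.5))  # "cat": 2.5 is near the cat mean, and cat has the larger prior
```

The prior acts exactly as the text describes: an input equidistant from both means is still assigned to “cat” because log P(C) tilts the score toward the more probable class.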
Diffusion Models and Autoregressive Models represent leading approaches within generative classification. Diffusion Models function by progressively adding noise to data until it resembles random noise, then learning to reverse this process to generate new samples or classify existing ones; this iterative refinement often yields high-quality results. Autoregressive Models, conversely, predict each data element sequentially based on preceding elements, effectively modeling the probability distribution of the data; examples include transformers applied to image or text classification. Both model classes have demonstrated state-of-the-art performance on complex datasets and are actively researched for improved efficiency and scalability in generative classification tasks.
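The autoregressive factorization can be made concrete with a minimal sketch: a two-token bigram table (an invented stand-in for a trained transformer) whose sequence log-probability is the chain-rule sum log p(x_1) + sum over t of log p(x_t | x_{t-1}), evaluated in a fixed left-to-right order:

```python
import math

# Toy autoregressive model over a two-token vocabulary (illustrative only).
START = {0: 0.6, 1: 0.4}           # p(x_1)
TRANS = {0: {0: 0.7, 1: 0.3},      # p(x_t | x_{t-1} = 0)
         1: {0: 0.4, 1: 0.6}}      # p(x_t | x_{t-1} = 1)

def sequence_log_prob(tokens):
    """Chain rule: log p(x) = log p(x_1) + sum_t log p(x_t | x_{t-1})."""
    log_p = math.log(START[tokens[0]])
    for prev, cur in zip(tokens, tokens[1:]):
        log_p += math.log(TRANS[prev][cur])
    return log_p
```

A generative classifier built on such a model would fit one of these distributions per class and score an input by its log-likelihood under each.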

RandAR: Unlocking Flexible Image Generation Through Tokenization
Traditional autoregressive image generation models typically process images in a fixed raster order – sequentially scanning pixels row by row. RandAR departs from this constraint by enabling image generation through arbitrary token orders. This is achieved by representing images as a sequence of discrete tokens via Vector Quantized Variational Autoencoders (VQ-VAE) and then modeling the probability distribution over these tokens without being limited to a predefined scan path. This flexibility allows the model to capture long-range dependencies and complex structures within images more effectively, as relationships between non-adjacent image regions can be directly modeled during the generation process, unlike raster-order approaches where such dependencies require processing through numerous intermediate steps.
RandAR utilizes a Vector Quantized Variational Autoencoder (VQ-VAE) to decompose images into discrete tokens, enabling generation in non-raster scan orders. This tokenization process is coupled with an Order-Marginalization technique, which effectively accounts for all possible token ordering permutations during training. By marginalizing over these orders, the model learns to represent image structure independent of a fixed scanline order, allowing for more robust and flexible image reconstruction and generation capabilities. This approach enhances the model’s ability to capture long-range dependencies and complex relationships within images, ultimately improving performance on tasks requiring structural understanding.
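A minimal sketch of the order-marginalization idea – with an invented two-token Markov model per class standing in for the actual VQ-VAE tokens and transformer – averages each class’s autoregressive log-likelihood over several random token orders and takes the argmax. Every name and probability here is an illustrative assumption, not RandAR’s implementation:

```python
import math
import random

# Per-class toy AR models: a start distribution plus a transition table over a
# two-token vocabulary. Invented numbers, standing in for trained networks.
CLASS_MODELS = {
    "cat": {"start": {0: 0.9, 1: 0.1},
            "trans": {0: {0: 0.8, 1: 0.2}, 1: {0: 0.5, 1: 0.5}}},
    "dog": {"start": {0: 0.1, 1: 0.9},
            "trans": {0: {0: 0.5, 1: 0.5}, 1: {0: 0.2, 1: 0.8}}},
}

def ar_log_likelihood(tokens, order, model):
    """Chain rule along one generation order: sum of log p(next | previous)."""
    log_p, prev = 0.0, None
    for i in order:
        table = model["start"] if prev is None else model["trans"][prev]
        log_p += math.log(table[tokens[i]])
        prev = tokens[i]
    return log_p

def classify(tokens, num_orders=8, seed=0):
    """Average each class's log-likelihood over random orders, take argmax."""
    rng = random.Random(seed)
    scores = {c: 0.0 for c in CLASS_MODELS}
    for _ in range(num_orders):
        order = list(range(len(tokens)))
        rng.shuffle(order)  # a fresh random generation order each pass
        for c, model in CLASS_MODELS.items():
            scores[c] += ar_log_likelihood(tokens, order, model) / num_orders
    return max(scores, key=scores.get)
```

The averaging is the point: any single order can favor the wrong class, but marginalizing over orders washes out that order-specific bias, which is the mechanism the paper credits for the robustness gains.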
RandAR-XL demonstrates leading performance on standard image classification benchmarks. Specifically, the model achieves a Top-1 Accuracy of 90.2% on the ImageNet-Val dataset, representing an improvement of 1.8% over the DiT-XL model. Furthermore, RandAR-XL attains a Top-1 Accuracy of 78.3% on the more challenging ImageNet-R dataset, exceeding DiT-XL’s performance by 1.5%. These results indicate RandAR-XL’s enhanced ability to generalize and maintain accuracy across varied and potentially adversarial image distributions.
RandAR demonstrates a substantial improvement in inference speed, achieving a 25x acceleration over contemporary diffusion-based classifiers. The gain follows from the autoregressive formulation: a class-conditional likelihood can be evaluated in a single forward pass per token order, whereas diffusion-based classifiers must estimate denoising losses across many noise levels and timesteps for every candidate class. The speedup translates to reduced computational cost and latency, making RandAR a viable option for real-time or high-throughput applications where diffusion-based classification may be prohibitively slow.

Beyond Pixel Patterns: Implications for Robust and Understandable Vision
Recent advances in generative modeling, exemplified by RandAR, reveal a compelling shift in how computer vision systems learn to perceive the world. Traditionally, image recognition relied heavily on identifying low-level textural details – the specific arrangements of pixels that define an object’s surface. However, RandAR’s capacity to construct images in any desired sequence of visual tokens fundamentally alters this process. By decoupling the order of generation from strict textural constraints, the system is compelled to prioritize higher-level, shape-based understandings of objects. This encourages the development of more abstract representations, where an object is defined by its form and structure rather than its precise pixel arrangement, ultimately fostering a more robust and generalized visual intelligence.
A notable consequence of learning shape-based representations is improved resilience in adverse conditions. Traditional computer vision systems often falter when presented with images containing noise, distortions, or deliberately misleading alterations – known as adversarial attacks – because they heavily rely on easily disrupted low-level features like edges and textures. However, systems trained to prioritize shape, as demonstrated by recent advances in generative modeling, exhibit a significantly enhanced capacity to maintain accuracy even when faced with such challenges. By focusing on the fundamental forms within an image, these systems are less susceptible to superficial variations, enabling more reliable object recognition and scene understanding in real-world scenarios where image quality is often imperfect or intentionally compromised. This shift towards shape bias represents a crucial step towards creating computer vision systems that perceive the world more like humans, focusing on the ‘what’ rather than merely the ‘how’ of visual information.
The development of generative classifiers, such as RandAR, signifies a crucial step toward computer vision systems capable of genuine understanding, rather than simply recognizing superficial patterns. Traditional classifiers often rely on low-level textural details, making them vulnerable to even minor image alterations or noisy data. Generative models, however, learn to reconstruct images, forcing them to develop a deeper, more abstract comprehension of visual concepts and shapes. This inherent understanding not only enhances robustness against adversarial attacks and image corruption but also allows for greater interpretability; the system can ‘explain’ its decisions based on the underlying structure it has learned, moving beyond a ‘black box’ approach. Consequently, these advancements promise a new generation of reliable and transparent computer vision technologies with applications ranging from autonomous vehicles to medical diagnosis.

The pursuit of robust image classification, as detailed in the research, mirrors a fundamental tenet of understanding complex systems: discerning patterns beyond superficial observation. The paper’s exploration of order-marginalized autoregressive models, challenging the dominance of diffusion models, exemplifies this principle. As Geoffrey Hinton once stated, “What we’re really trying to do is model the probability distribution of the data.” This resonates deeply with the work; by focusing on accurately modeling the underlying data distribution, the research achieves not only state-of-the-art performance but also improved robustness to distribution shifts – a key indicator of a truly understood model. The success of this approach highlights that a thorough understanding of generative processes is vital for achieving reliable and adaptable classification systems.
Where Do We Go From Here?
The demonstrated efficacy of order-marginalized autoregressive generative classifiers, exceeding even diffusion models in specific contexts, compels a re-evaluation of established generative approaches. The pursuit of increasingly complex architectures often obscures a simple truth: a system’s ability to model inherent data structure – the pattern – remains paramount. However, this work isn’t a final destination, but rather a signpost. A crucial next step lies in disentangling the contributions of the order-marginalization technique from those of the underlying autoregressive model itself. Is the performance gain truly attributable to the ordering strategy, or does it merely represent a more efficient exploitation of the AR framework’s inherent capabilities?
Furthermore, while robustness to distribution shifts is encouraging, it remains a localized observation. The true test will involve exposing these models to genuinely adversarial perturbations – not merely variations in style or viewpoint, but carefully crafted inputs designed to exploit subtle weaknesses in the learned representations. The current emphasis on image classification, while useful, also limits the scope of inquiry. Extending this framework to other modalities – video, audio, even time-series data – will reveal its true generality, or expose fundamental limitations.
Ultimately, the field seems poised to move beyond a simple pursuit of higher scores on benchmark datasets. The focus must shift toward understanding why these models succeed – or fail – and how that understanding can be leveraged to build systems that are not merely accurate, but genuinely intelligent in their ability to generalize and adapt.
Original article: https://arxiv.org/pdf/2603.19122.pdf
Contact the author: https://www.linkedin.com/in/avetisyan/
See also:
- Seeing Through the Lies: A New Approach to Detecting Image Forgeries
- Staying Ahead of the Fakes: A New Approach to Detecting AI-Generated Images
- Smarter Reasoning, Less Compute: Teaching Models When to Stop
- Unmasking falsehoods: A New Approach to AI Truthfulness
2026-03-21 06:02