Vision Transformer: Treating Image Patches Like Tokens

TL;DR

ViT showed that a standard Transformer can match or exceed strong convolutional networks on image classification when images are split into patches and the model is pretrained on large enough datasets.

What problem it solves

Computer vision was dominated by convolutional neural networks. ViT asks whether the Transformer architecture, already central in language, can work for images without building in convolution as the main inductive bias.

The core method

The model splits an image into fixed-size, non-overlapping patches, flattens and linearly embeds each patch, adds learned position embeddings, and feeds the resulting sequence into a standard Transformer encoder. A learnable class token is prepended to the sequence, and classification is performed from its final representation, much like sequence classification with a [CLS] token in NLP.
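
The input pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the function names patchify and embed_patches, the 224x224 input, the 16x16 patch size, the embedding width of 768, and the random weights are all assumptions chosen for the example. In a real model the projection, class token, and position embeddings are learned, and the Transformer encoder that consumes the tokens is omitted here.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row across the image grid.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def embed_patches(image, patch_size, w_embed, cls_token, pos_embed):
    """Build the token sequence that would be fed to the Transformer encoder."""
    patches = patchify(image, patch_size)            # (N, p*p*C)
    tokens = patches @ w_embed                       # linear embedding: (N, D)
    tokens = np.concatenate([cls_token[None, :], tokens], axis=0)  # (N + 1, D)
    return tokens + pos_embed                        # add position information


# Hypothetical sizes and randomly initialised parameters (assumptions).
rng = np.random.default_rng(0)
patch_size, dim = 16, 768
image = rng.standard_normal((224, 224, 3))
num_patches = (224 // patch_size) ** 2               # 14 * 14 = 196

w_embed = rng.standard_normal((patch_size * patch_size * 3, dim)) * 0.02
cls_token = np.zeros(dim)
pos_embed = rng.standard_normal((num_patches + 1, dim)) * 0.02

tokens = embed_patches(image, patch_size, w_embed, cls_token, pos_embed)
print(tokens.shape)  # (197, 768): 196 patch tokens plus one class token
```

With these assumed sizes, a 224x224 image becomes a sequence of 197 tokens, which is the sense in which an image is treated like a short sentence. After the encoder runs, a classifier head would read the first row, the class token.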

Key results

When pretrained on large image datasets and transferred to downstream recognition benchmarks, ViT matches or exceeds strong convolutional baselines. The paper also shows a tradeoff: without enough data, the weaker image-specific inductive bias (no built-in locality or translation equivariance) can hurt; with enough scale, the architecture becomes highly competitive.

Why it matters

ViT opened the path for foundation-model thinking in computer vision. Once images can be represented as token sequences, many ideas from language modeling become easier to transfer: scaling, pretraining, masked prediction, multimodal alignment, and unified architectures.

Limits and open questions

The original ViT depends heavily on large-scale pretraining and is less data-efficient than CNNs in small-data settings. Patch tokenization can also miss fine local structure, a weakness that later designs address with hierarchy, stronger augmentation, or hybrid convolutional components.

One line: ViT made images look like sequences to the Transformer ecosystem.