
Paper Title An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Informal name Vision Transformer (ViT)
Date 2021-06
Link https://arxiv.org/abs/2010.11929

Paper reading notes

Spontaneous Questions

Q1: What is the role of pre-training? How hard is it to pretrain? What supervision is needed?
A1: The paper makes it look easy, but the community's consensus is that pretraining ViT is a huge effort.

Q2: How does ViT do positional encoding for each patch?
A2: It uses learnable position embeddings.
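A minimal PyTorch-style sketch of what "learnable" means here (shapes follow ViT-Base on 224x224 images with 16x16 patches; the names are illustrative, not taken from the authors' code):

```python
import torch
import torch.nn as nn

embed_dim = 768     # hidden size of ViT-Base
num_patches = 196   # 224x224 image, 16x16 patches -> 14*14 = 196

# One trainable vector per patch token plus one for the [class] token,
# learned jointly with the rest of the model (no fixed sinusoidal formula).
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
```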

Q3: How does ViT’s positional encoding handle higher-resolution images?
A3: They perform 2D interpolation of the pre-trained position embeddings, according to their location in the original image.
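A hedged sketch of how this resizing could be implemented in PyTorch; the official code is JAX/Flax, so the function name, shapes, and the choice of bicubic interpolation below are my assumptions:

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Interpolate pretrained patch position embeddings to a new grid size.

    pos_embed: (1, 1 + old_grid*old_grid, dim), with the [class] token first.
    Returns:   (1, 1 + new_grid*new_grid, dim)
    """
    cls_tok, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    # Reshape the flat token sequence back to its 2D layout in the original image.
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    # 2D interpolation to the finer patch grid of the higher-resolution image.
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_tok, patch_pos], dim=1)
```

For example, fine-tuning at 384x384 after pretraining at 224x224 with 16x16 patches corresponds to old_grid=14 and new_grid=24.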

Q4: How does the linear transformation handle the 3 color channels? Are there 3 different linear transformations, one for each channel?
A4: The input to the linear projection is a flattened version of the image patch. If the image patch has a size of (patch_height, patch_width, num_channels) (where num_channels is typically 3 for the red, green, and blue color channels), then the input dimension of the linear projection is patch_height * patch_width * num_channels. [Answer written by DeepGPT, manually verified with multiple sources]
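A minimal sketch of that flattening and shared projection (PyTorch-style, illustrative names and shapes; not the authors' implementation):

```python
import torch
import torch.nn as nn

patch_h, patch_w, channels, embed_dim = 16, 16, 3, 768

# A single shared projection over the flattened patch: all three color
# channels go through the same weight matrix, not per-channel layers.
proj = nn.Linear(patch_h * patch_w * channels, embed_dim)  # 16*16*3 = 768 inputs

patch = torch.randn(1, patch_h, patch_w, channels)  # one RGB patch
token = proj(patch.flatten(start_dim=1))            # shape (1, embed_dim)
```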

Paper Review

Summary

The paper examines the direct application of the transformer architecture to image classification. To do so, the authors mainly compare the Vision Transformer (ViT) to ResNet-based models (BiT) in a pretraining-finetuning setup across a variety of datasets and benchmark tasks. ViT breaks the original image into smaller patches, then applies a shared linear transformation and adds positional encoding to create an input token for each patch; the sequence of tokens is then fed into a standard transformer encoder (see the sketch below). Main contribution: the paper shows that ViT, lacking the inductive biases for 2D images, does not perform as well as BiT when pretrained on smaller datasets, but outperforms it on larger ones with significantly less compute.
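A compact, hedged sketch of that pipeline end to end, PyTorch-style; hyperparameters follow ViT-Base, but everything below is illustrative rather than the authors' implementation:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Illustrative ViT-style classifier; sizes roughly follow ViT-Base."""
    def __init__(self, img=224, patch=16, dim=768, depth=12, heads=12, classes=1000):
        super().__init__()
        self.patch, self.grid = patch, img // patch
        self.proj = nn.Linear(patch * patch * 3, dim)                     # shared patch projection
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))                   # [class] token
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2 + 1, dim))  # learned positions
        # The paper uses pre-norm transformer blocks; PyTorch's stock layer
        # defaults to post-norm, which is close enough for a sketch.
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)                # standard encoder
        self.head = nn.Linear(dim, classes)                               # classification head

    def forward(self, x):                                                 # x: (B, 3, H, W)
        b = x.shape[0]
        # Cut the image into non-overlapping patches and flatten each one.
        p = x.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        p = p.permute(0, 2, 3, 1, 4, 5).reshape(b, self.grid ** 2, -1)
        tokens = torch.cat([self.cls.expand(b, -1, -1), self.proj(p)], dim=1)
        out = self.encoder(tokens + self.pos)
        return self.head(out[:, 0])                                       # classify from [class] token
```

For instance, `TinyViT()(torch.randn(2, 3, 224, 224))` returns logits of shape (2, 1000).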

Strengths

  • The experiments are well-formed and well-executed, with examinations of various aspects described in the main text and in the appendix. They serve as conclusive evidence of the superiority of ViT over ResNet-like methods.
  • The inspection of multi-headed self-attention (MSA) and positional encoding empirically sheds light on how ViT can learn CNN-like inductive biases, and more, from the data.
  • Code and pre-trained models are provided, which is useful for the research community.

Weaknesses

  • The performance gain is obtained through scaling rather than clever architecture design.
  • The pre-training requirements are prohibitively expensive for a large proportion of the community.
  • Some implementation details are not described in detail, e.g. the 2D interpolation of the positional embeddings.

Reflection

  • Speculation: The ability to attend globally from the very first layer gives ViT an edge over CNNs that leads to better generalization, especially when pretrained on large datasets.
  • Research direction: more investigation is needed to quantify the effect of this global receptive field versus the other advantages of the transformer architecture.

Most interesting thought/idea from reading this paper

This paper opens up a way to use the Transformer as a learning engine over different types of data (images, text, sound, etc.) -> multimodal models.
