Masked Autoencoder for Vision
| Paper Title | Masked Autoencoders Are Scalable Vision Learners |
| --- | --- |
| Authors | Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollar, Ross Girshick |
| Date | 2021-12 |
| Link | https://arxiv.org/pdf/2111.06377.pdf |
Paper summary
- Asymmetric encoder-decoder architecture.
- The encoder operates only on the unmasked patches (plus positional encodings); the decoder operates on the full set of patches and can be lightweight (<10% of the encoder's computation per token) to reduce compute and memory (see the sketch after this list).
- Positional encodings are crucial for making the asymmetric encoder-decoder architecture work: mask tokens carry no content, so the decoder relies on positional information to know which patch each token should reconstruct.
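A minimal PyTorch sketch of this asymmetric design, assuming a simplified ViT-style encoder/decoder; the module name, layer sizes, and helper functions are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of MAE's asymmetric encoder-decoder (hypothetical sizes, not the official code).
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    def __init__(self, num_patches=196, patch_dim=768, enc_dim=768, dec_dim=512,
                 enc_depth=12, dec_depth=4, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.patch_embed = nn.Linear(patch_dim, enc_dim)
        self.enc_pos = nn.Parameter(torch.zeros(1, num_patches, enc_dim))
        enc_layer = nn.TransformerEncoderLayer(enc_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, enc_depth)
        # Lightweight decoder: narrower and shallower than the encoder.
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.dec_pos = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
        dec_layer = nn.TransformerEncoderLayer(dec_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, dec_depth)
        self.head = nn.Linear(dec_dim, patch_dim)  # predict raw pixels per patch

    def random_mask(self, x):
        B, N, _ = x.shape
        keep = int(N * (1 - self.mask_ratio))
        noise = torch.rand(B, N, device=x.device)
        ids_shuffle = noise.argsort(dim=1)        # random permutation per sample
        ids_restore = ids_shuffle.argsort(dim=1)  # inverse permutation
        ids_keep = ids_shuffle[:, :keep]
        x_vis = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, x.shape[-1]))
        return x_vis, ids_restore, keep

    def forward(self, patches):                   # patches: (B, N, patch_dim)
        x = self.patch_embed(patches) + self.enc_pos
        x_vis, ids_restore, keep = self.random_mask(x)
        latent = self.encoder(x_vis)              # encoder sees only ~25% of the patches
        y = self.enc_to_dec(latent)
        B, N, _ = patches.shape
        mask_tokens = self.mask_token.expand(B, N - keep, -1)
        y_full = torch.cat([y, mask_tokens], dim=1)
        # Un-shuffle so every token sits at its original position, then add decoder pos. emb.
        y_full = torch.gather(y_full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, y_full.shape[-1]))
        y_full = y_full + self.dec_pos
        return self.head(self.decoder(y_full))    # (B, N, patch_dim) pixel predictions
```

The key point of the design is that the expensive encoder only ever processes the visible ~25% of tokens; mask tokens and the full sequence length appear only inside the small decoder.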
Paper Review
Short Summary
The paper presents the Masked Autoencoder (MAE), a successful attempt to adapt BERT-style masked pretraining from NLP to vision. A ViT-based encoder takes in only the unmasked patches, while a lightweight decoder processes the full set of patches, reducing compute and memory cost. Masking a high portion of the image (75%) and using a per-patch-normalized pixel reconstruction target creates a challenging pretraining task. The encoder learns semantically meaningful image representations that generalize well to many downstream tasks. Experiments show that MAE outperforms other self-supervised learning methods such as MoCo v3 (contrastive learning) and DINO (self-distillation).
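A sketch of the per-patch-normalized pixel reconstruction loss described above, computed only on masked patches; the function name and tensor layout are assumptions for illustration:

```python
import torch

def mae_reconstruction_loss(pred, target, mask, eps=1e-6):
    """pred/target: (B, N, patch_dim) pixel values; mask: (B, N) float, 1.0 for masked patches.
    Normalize each target patch by its own mean/std, then average the per-patch MSE
    over masked patches only (visible patches do not contribute to the loss)."""
    mean = target.mean(dim=-1, keepdim=True)
    var = target.var(dim=-1, keepdim=True)
    target = (target - mean) / (var + eps).sqrt()    # per-patch normalization
    loss = ((pred - target) ** 2).mean(dim=-1)       # per-patch MSE
    return (loss * mask).sum() / mask.sum()          # only masked patches contribute
```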
Strengths
- Strong motivation and meaningful discussion of why masked autoencoding is harder for vision than for language: a) language is information-dense while images are spatially redundant; b) the reconstruction target is simple for language (token prediction) but less obvious for vision (raw pixels, discrete tokens, or other choices).
- MAE is compute-efficient: the encoder operates only on the unmasked patches, i.e. 25% of the image (masked patches are removed entirely rather than replaced with mask tokens). This allows training large encoders at a fraction of the compute and memory cost.
- Detailed description of the architecture and a meaningful discussion of the "why" behind each design choice.
- Very detailed ablation study that highlights some of MAE's distinctive behaviors: a) its features are weaker under linear probing but stronger once non-linear fine-tuning is allowed, even when only half a transformer block is tuned (see the partial fine-tuning sketch after this list); b) the decoder can be narrow, but sufficient decoder depth frees the encoder to learn a richer, more abstract representation.
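A minimal sketch of the partial fine-tuning protocol referenced in point a) above: freeze everything except the classification head and the last k transformer blocks (k=0 is linear probing; the paper's "0.5 block" case tunes only the MLP sub-block of the last block). The `.blocks`/`.head` attribute names assume a timm-style ViT and are not from the paper:

```python
import torch.nn as nn

def partial_finetune(model: nn.Module, k: int):
    """Freeze the whole model, then unfreeze only the classification head and the
    last k transformer blocks. k=0 corresponds to linear probing.
    Assumes a timm-style ViT with `.blocks` (ModuleList) and `.head` attributes."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.head.parameters():
        p.requires_grad = True
    if k > 0:
        for blk in model.blocks[-k:]:
            for p in blk.parameters():
                p.requires_grad = True
```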
Weaknesses
- Although the authors mention and benchmark against contrastive learning (MoCo v3) and self-distillation (DINO) methods, they do not provide much discussion or experimental evidence explaining why masked autoencoding is more effective than those alternatives.
- The experiments and discussion on the reconstruction target (raw pixels vs dVAE tokens) could have been more detailed, since this is a key distinction between the masked autoencoder and other self-supervised methods.
Reflection
- Computation cost still seems to be the main challenge when applying the masked autoencoder approach. This is understandable, since pixel reconstruction for images is much more costly than token prediction for text. On the other hand, ViT models are still relatively small compared to today's LLMs, so attempts to scale such models are likely already underway.
Most interesting thought/idea from reading this paper
- MAE pretraining does not outperform the original ViT trained with supervision on JFT-300M; this is not really a weakness, but I seriously wonder why.