
Paper Title: Emerging Properties in Self-Supervised Vision Transformers
Authors: Caron et al.
Date: 2021-05
Link: https://arxiv.org/abs/2104.14294

Paper Review

Short Summary

The paper presents DINO, “a form of self-distillation with no labels”. The teacher network (T) and the student network (S) share the same architecture; T is updated as a momentum (exponential moving average) copy of S, which acts like an ensemble of student checkpoints and therefore performs better than S. The authors claim that balancing the centering and sharpening of T’s output is enough to avoid mode collapse and encourage convergence. They also conduct an extensive ablation study and benchmark against SOTA models. DINO shows SOTA performance compared to other self-supervised models while enjoying a limited computational budget for pre-training.
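The training loop behind this is compact. Below is a minimal sketch of the mechanism as summarized above, loosely following the paper’s pseudo-code; the function name, temperatures (tps, tpt), and momentum rates (l, m) are illustrative placeholders, not the official implementation.

```python
import torch
import torch.nn.functional as F

def dino_step(student, teacher, x1, x2, center, tps=0.1, tpt=0.04, l=0.996, m=0.9):
    # Two augmented views of the same image go through both networks.
    s1, s2 = student(x1), student(x2)
    with torch.no_grad():
        t1, t2 = teacher(x1), teacher(x2)

    def H(t, s):
        # Teacher output is centered and sharpened (low temperature tpt),
        # student output is sharpened with tps; loss is their cross-entropy.
        t = F.softmax((t - center) / tpt, dim=-1)
        s = F.log_softmax(s / tps, dim=-1)
        return -(t * s).sum(dim=-1).mean()

    loss = (H(t1, s2) + H(t2, s1)) / 2
    loss.backward()
    # ... optimizer step on the student goes here ...

    with torch.no_grad():
        # Teacher parameters track an exponential moving average of the student.
        for ps, pt in zip(student.parameters(), teacher.parameters()):
            pt.data.mul_(l).add_((1 - l) * ps.detach().data)
        # The center tracks a running mean of teacher outputs; together with
        # the sharpening above this is what is claimed to prevent collapse.
        center.mul_(m).add_((1 - m) * torch.cat([t1, t2]).mean(dim=0))
    return loss
```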

Strengths

  • Kudos to the novel idea of self-distillation, and for actually making it work.
  • Computationally efficient, and much friendlier to the research community than other pre-training approaches.
  • Surprising out-of-the-box performance with a linear classifier and k-NN on frozen features, which makes it suitable for many use cases (see the sketch after this list).
  • The detailed ablation study and discussion help build intuition for how the method works.
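As a rough illustration of that out-of-the-box evaluation, here is a hypothetical sketch that extracts frozen features from a pre-trained DINO backbone (assuming the torch.hub entry point from the official DINO repository) and fits a k-NN classifier on them; the data loaders are placeholders, and k=20 follows the paper’s k-NN protocol.

```python
import torch
from sklearn.neighbors import KNeighborsClassifier

# Frozen backbone; no fine-tuning of any kind.
backbone = torch.hub.load("facebookresearch/dino:main", "dino_vits16").eval()

@torch.no_grad()
def embed(loader):
    feats, labels = [], []
    for images, targets in loader:        # any torchvision-style loader
        feats.append(backbone(images))    # frozen DINO features
        labels.append(targets)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

X_train, y_train = embed(train_loader)    # train_loader / val_loader assumed
X_val, y_val = embed(val_loader)

knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(X_train, y_train)
print("k-NN accuracy:", knn.score(X_val, y_val))
```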

Weaknesses

  • Although the paper includes an ablation study on this point, I find the discussion of how mode collapse is avoided inconclusive: balancing centering and sharpening by tuning the sharpening temperature does not seem like a robust mechanism.
  • The method leaves many hyperparameters to tune, so although a single pre-training run has a small computational budget, it is unclear how much compute the hyperparameter tuning required.

Reflection

Self-distillation seems to be a promising research direction. Even though this paper describes a lot of the intuition behind the method, many aspects still seem unexplored.

Most interesting thought/idea from reading this paper

DINO does not seem to rely on rotation augmentations, so it might be a good fit for our electronic-assembly anomaly detection project.
