DINO - self-distillation with no labels
| Paper Title | Emerging Properties in Self-Supervised Vision Transformers |
| Authors | Caron et al. |
| Date | 2021-05 |
| Link | https://arxiv.org/abs/2104.14294 |
Paper summary
Paper Review
Short Summary
The paper presented DINO, “a form of self-distillation with no labels”. The teacher network (T) and the student network (S) share the same architecture; T is updated as a momentum (exponential moving average) of S, which effectively simulates an ensemble of students and therefore performs better than S. The authors claimed that balancing centering and sharpening of T’s output is enough to avoid mode collapse and encourage convergence. They also conducted an extensive ablation study and benchmarked against SOTA models. DINO showed SOTA performance compared to other self-supervised models while requiring only a limited computational budget for pre-training.
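To make the recipe concrete, here is a minimal sketch of one DINO-style update as I understand it, assuming a PyTorch setup with two augmented views per image; the function name, hyperparameter values, and simplifications (no multi-crop, no projection-head details, optimizer step omitted) are my own placeholders, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def dino_step(student, teacher, center, x1, x2,
              t_s=0.1, t_t=0.04, m=0.996, c_m=0.9):
    # Student and teacher process two augmented views of the same images.
    s1, s2 = student(x1), student(x2)
    with torch.no_grad():
        t1, t2 = teacher(x1), teacher(x2)

    def H(t_out, s_out):
        # Teacher output is centered then sharpened (low temperature t_t);
        # student output is only sharpened (temperature t_s).
        p_t = F.softmax((t_out - center) / t_t, dim=-1)
        log_p_s = F.log_softmax(s_out / t_s, dim=-1)
        return -(p_t * log_p_s).sum(dim=-1).mean()

    # Cross-view prediction: each student view predicts the teacher's
    # output for the other view.
    loss = 0.5 * (H(t1, s2) + H(t2, s1))
    loss.backward()  # optimizer.step() for the student is omitted here

    with torch.no_grad():
        # Momentum (EMA) update of the teacher from the student.
        for p_t_, p_s in zip(teacher.parameters(), student.parameters()):
            p_t_.mul_(m).add_(p_s, alpha=1 - m)
        # Running update of the center from the teacher's batch outputs.
        center.mul_(c_m).add_(torch.cat([t1, t2]).mean(dim=0), alpha=1 - c_m)

    return loss
```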
Strengths
- Kudos to the novel idea of self-distillation and actually making it work.
- Computationally efficient and much friendlier to the research community than other pre-training methods.
- Surprisingly strong out-of-the-box performance with a linear classifier or k-NN on frozen features, which makes it suitable for many use cases (see the sketch after this list).
- The detailed ablation study and discussion help build intuition for how the method works.
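As a rough illustration of that out-of-the-box evaluation, here is a sketch of a weighted k-NN classifier on frozen features, assuming PyTorch; `backbone`, `loader`, and the exact weighting scheme are my placeholders rather than the authors' released evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_features(backbone, loader, device="cpu"):
    # Run the frozen backbone over a dataset and collect L2-normalized features.
    feats, labels = [], []
    backbone.eval()
    for x, y in loader:
        f = backbone(x.to(device))
        feats.append(F.normalize(f, dim=-1).cpu())
        labels.append(y)
    return torch.cat(feats), torch.cat(labels)

@torch.no_grad()
def knn_predict(train_feats, train_labels, test_feats, k=20, T=0.07):
    # Cosine similarity between test and train features (both normalized).
    sims = test_feats @ train_feats.T
    topk_sims, topk_idx = sims.topk(k, dim=-1)
    topk_labels = train_labels[topk_idx]          # (N_test, k)
    weights = (topk_sims / T).exp()               # similarity-weighted votes
    n_classes = int(train_labels.max()) + 1
    votes = torch.zeros(test_feats.size(0), n_classes)
    votes.scatter_add_(1, topk_labels, weights)
    return votes.argmax(dim=-1)
```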
Weaknesses
- Although the paper includes some ablations on this, I find the discussion of how mode collapse is avoided inconclusive. Balancing centering and sharpening by tuning the sharpening temperature does not seem like a robust mechanism.
- The method leaves many hyperparameters to tune, so although the authors report a small computational budget for a single training run, it is unclear how much compute hyperparameter tuning required.
Reflection
Self-distillation seems to be a promising research direction. Even though this paper describes much of the intuition behind the method, many aspects still seem unexplored.
Most interesting thought/idea from reading this paper
DINO doesn’t seem to rely on rotation augmentations, so it might be a good fit for our electronics-assembly anomaly-detection project.