
Paper Title A Simple Framework for Contrastive Learning of Visual Representations
Authors Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton
Date 2020-07
Link https://arxiv.org/abs/2002.05709

Paper Review

Short Summary

The paper presents SimCLR, a simple framework for representation learning using contrastive learning. The main idea is to apply random data augmentations to produce different views of each image, then train the model to differentiate between augmentations of the same source image and augmentations of different source images. The authors conjecture that applying a non-linear transformation (a projection head) before the contrastive loss helps protect the representation from losing important information during training. They conduct comparison and ablation studies to understand how different design choices contribute to the model's performance.
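The contrastive objective described above is the NT-Xent (normalized temperature-scaled cross entropy) loss: each augmented view should be most similar to its positive pair, with all other samples in the batch serving as negatives. Below is a minimal NumPy sketch under the assumption that rows `2k` and `2k+1` of the embedding matrix are the two views of the same image; the function name and shapes are illustrative, not the authors' code.

```python
import numpy as np

def nt_xent_loss(z, temperature=0.5):
    """NT-Xent loss over a batch of 2N embeddings.

    z: array of shape (2N, d); rows 2k and 2k+1 are the two
    augmented views of the same source image.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalize rows
    sim = z @ z.T / temperature                        # scaled cosine similarities
    np.fill_diagonal(sim, -np.inf)                     # mask self-similarity
    n = z.shape[0]
    pos = np.arange(n) ^ 1                             # positive index: (0,1), (2,3), ...
    # cross entropy: -log softmax probability of the positive pair
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    loss = -(sim[np.arange(n), pos] - logsumexp)
    return loss.mean()
```

In practice the loss is computed on the output of the projection head, while the representation used for downstream tasks is taken from the encoder before the head.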

Strengths

  • A strong self-supervised learning method that is able to outperform supervised SOTA models.
  • A thorough set of ablation studies and meaningful discussion that shed insight into how and why contrastive learning works.
  • Leverages in-batch negative sampling to eliminate the need for a memory bank.
  • The same principles of contrastive learning can be generalized to other tasks.

Weaknesses

  • Mainly combines existing ideas from contemporary works without contributing novel ones.
  • Requires a lot of computing power, in a multi-GPU setting, to pretrain. It is not clear how the same method can be applied to other tasks under lower compute budgets.
  • The contrastive loss is quite simple, suffers from class imbalance between positives and negatives, and is not properly guarded against false-negative pairs.

Reflection

  • Based on the weaknesses, more work can be put into improving the contrastive loss (e.g. better negative sampling) and improving pretraining cost-efficiency.
  • Contrastive learning seems to be orthogonal to other self-supervised learning methods such as masked autoencoders. What are their connections, and can we leverage both?

Most interesting thought/idea from reading this paper

  • Thinking about building encoders for everything from images to text, video, product items, etc., then a universal decoder that performs all tasks on top of the shared representations.
