
Paper Title: Learning Transferable Visual Models From Natural Language Supervision (CLIP)
Authors: Radford et al.
Date: 2021-02
Link: https://arxiv.org/abs/2103.00020

Paper summary

Project report

Paper Review

The slides above and the Strengths & Weaknesses sections were prepared together with Pratik Ramesh and Suyash Kumar.

Short Summary

CLIP is a pair of image and text encoders that produce vision-language-aligned embeddings. The architecture couples an image encoder (a ResNet or a ViT) with a Transformer-based text encoder, and the two are trained jointly with a contrastive loss. Important implementation details include a very large batch size (32,768), a very large training dataset (400M image-text pairs), a simplified architecture, and training from scratch without pre-trained weights. The model demonstrated outstanding zero-shot learning capability and generalization.
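For concreteness, here is a minimal PyTorch-style sketch of the symmetric contrastive objective described above, in the spirit of the numpy-like pseudocode in the paper. The fixed `temperature` and the assumption that features are already projected into the shared space are simplifications, not the exact implementation (the paper learns the temperature as a parameter).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired image/text embeddings.

    image_features, text_features: [batch, dim] encoder outputs, assumed to be
    already projected into the shared embedding space.
    """
    # L2-normalize so the dot product becomes a cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [batch, batch] similarity matrix; the diagonal holds the true pairs
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image), averaged
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```

With the paper's batch size of 32,768, every image is contrasted against tens of thousands of negative captions in a single step, which is a large part of why the simple objective works so well.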

Strengths

  • First general-purpose, large-scale, image-text-aligned embeddings, enabling subsequent work in the multimodal space
  • Scaled up earlier ideas on natural language supervision to achieve strong results on zero-shot image tasks (see the sketch after this list)
  • Efficient implementation: contrastive learning, simplified architecture, and simple data transformations
  • Extensive experiments that demonstrate both the model's high performance and its generalization
  • Few-shot performance competitive with supervised models
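
As a concrete illustration of the zero-shot setup referenced above, the sketch below builds a classifier purely from class names and a prompt template, roughly as described in the paper. The `model.encode_image`/`model.encode_text` calls and the `tokenize` function are stand-ins for whichever CLIP implementation is in use, not a specific API.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(model, tokenize, image_batch, class_names,
                       template="a photo of a {}."):
    # Hypothetical handles: `model` exposes encode_image/encode_text and
    # `tokenize` turns strings into token ids; substitute your CLIP implementation.

    # Build one text prompt per class, e.g. "a photo of a dog."
    prompts = tokenize([template.format(name) for name in class_names])

    with torch.no_grad():
        text_emb = F.normalize(model.encode_text(prompts), dim=-1)        # [classes, dim]
        image_emb = F.normalize(model.encode_image(image_batch), dim=-1)  # [batch, dim]

    # Cosine similarity between each image and each class prompt;
    # the highest-scoring prompt gives the predicted class index.
    logits = image_emb @ text_emb.t()
    return logits.argmax(dim=-1)
```

No task-specific training is needed: swapping in a new list of class names is enough to define a new classifier, which is what makes the zero-shot results so striking.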

Weaknesses

  • The collected dataset is opaque and does not allow further community-driven analysis
  • Struggles with systematic tasks such as counting the number of objects
  • Performs worse on "potentially OOD" datasets like MNIST
  • Input text descriptions are short (≤76 tokens), limiting the capacity to supervise the image encoder
  • Learns societal biases from the internet-sourced text-image pairs
  • Analysis of the text encoder is relatively weak

Reflection

Many tasks can be described more naturally in natural language than with other mathematical formulations, so CLIP may be a better pre-trained model for those tasks.

Most interesting thought/idea from reading this paper

Anything can supervise anything now!
