CLIP
| Paper Title | Learning Transferable Visual Models From Natural Language Supervision (CLIP) |
| --- | --- |
| Authors | Radford et al. |
| Date | 2021-02 |
| Link | https://arxiv.org/abs/2103.00020 |
Paper summary
Paper Review
The slide above and the Strengths & Weaknesses sections were prepared together with Pratik Ramesh and Suyash Kumar
Short Summary
CLIP is a pair of image and text encoders trained to produce vision-language-aligned embeddings. The image encoder is ViT-based, the text encoder is a Transformer sentence encoder, and both are trained jointly with a contrastive loss. Important implementation details include: a very large batch size (32,768), a very large training dataset (400M image-text pairs), a simplified architecture, and training from scratch without pre-trained weights. The model demonstrated outstanding zero-shot learning capability and generalization.
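To make the contrastive objective above concrete, here is a minimal PyTorch sketch of the symmetric cross-entropy loss over a batch of paired image/text embeddings. The function name, tensor names, and fixed temperature are illustrative only; in the paper the temperature is a learned parameter.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    image_emb, text_emb: [batch, dim] outputs of the two encoders.
    A sketch of the objective described in the paper; the real model
    learns the temperature rather than fixing it.
    """
    # L2-normalize so the dot product is a cosine similarity
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [batch, batch] similarity matrix; diagonal entries are the true pairs
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image), averaged
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```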
Strengths
- First general-purpose, large-scale, image-text-aligned embeddings, enabling subsequent work in the multimodal space
- Scaled up previous ideas on natural language supervision to achieve strong results on zero-shot image tasks (see the sketch after this list)
- Efficient implementation: contrastive learning, simplified architecture and data augmentation
- Extensive experiments demonstrating both the model's high performance and its generalization
- Few-shot performance competitive with fully supervised models
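To illustrate the zero-shot capability, below is a sketch of zero-shot classification using the open-source `clip` package from openai/CLIP: class names are turned into prompts, both modalities are encoded, and the image is assigned to the most similar prompt. The label set, prompt template, and image path are hypothetical placeholders.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical class names wrapped in a simple prompt template
labels = ["dog", "cat", "airplane"]
text = clip.tokenize([f"a photo of a {label}" for label in labels]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each prompt, softmaxed to probabilities
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print({label: p.item() for label, p in zip(labels, probs[0])})
```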
Weaknesses
- The collected dataset is opaque and does not allow further community-driven analysis
- Struggles with systematic tasks such as counting the number of objects
- Performs worse on datasets that are out of distribution for its web-scraped training data, such as MNIST
- Input text descriptions are short (≤76 tokens), limiting the capacity to supervise the image encoder
- Learns societal biases from the web-scraped image-text pairs
- Analysis of the text side is relatively weak
Reflection
Many tasks can be described more naturally in language than with other mathematical formulations, so CLIP may be a better pre-trained model for such tasks.
Most interesting thought/idea from reading this paper
Anything can supervise anything now!