
Paper Title: Learning Transferable Visual Models From Natural Language Supervision (CLIP)
Authors: Radford et al.
Date: 2021-02
Link: https://arxiv.org/abs/2103.00020

Paper summary

Project report

Paper Review

The slides above and the Strengths & Weaknesses sections were prepared together with Pratik Ramesh and Suyash Kumar.

Short Summary

CLIP is a pair of image and text encoders that produce vision-language-aligned embeddings. The architecture couples an image encoder (a ResNet or a ViT) with a Transformer-based text encoder, and the two are trained jointly with a contrastive loss. Important implementation details include a very large batch size (32,768), a very large training dataset (400M image-text pairs), a simplified architecture, and training from scratch without pre-trained weights. The model demonstrated outstanding zero-shot learning capability and generalization.
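For concreteness, here is a minimal PyTorch-style sketch of the symmetric contrastive objective described above, in the spirit of the numpy-like pseudocode in the paper. The fixed `temperature` and the assumption that features are already projected into the shared space are simplifications, not the exact implementation (the paper learns the temperature as a parameter).

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE-style loss over a batch of paired image/text embeddings.

    image_features, text_features: [batch, dim] encoder outputs, assumed to be
    already projected into the shared embedding space.
    """
    # L2-normalize so the dot product becomes a cosine similarity
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # [batch, batch] similarity matrix; the diagonal holds the true pairs
    logits = image_features @ text_features.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image), averaged
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_i + loss_t) / 2
```

With the paper's batch size of 32,768, every image is contrasted against tens of thousands of negative captions in a single step, which is a large part of why the simple objective works so well.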

Strengths

  • First general-purpose, large-scale, image-text-aligned embeddings, enabling subsequent work in the multimodal space
  • Scaled up earlier ideas on natural language supervision to achieve strong results on zero-shot image tasks (see the sketch after this list)
  • Efficient implementation: contrastive learning, simplified architecture, and simple data transformations
  • Extensive experiments that demonstrate both the model's high performance and its generalization
  • Few-shot performance competitive with supervised models
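
As a concrete illustration of the zero-shot setup referenced above, the sketch below builds a classifier purely from class names and a prompt template, roughly as described in the paper. The `model.encode_image`/`model.encode_text` calls and the `tokenize` function are stand-ins for whichever CLIP implementation is in use, not a specific API.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(model, tokenize, image_batch, class_names,
                       template="a photo of a {}."):
    # Hypothetical handles: `model` exposes encode_image/encode_text and
    # `tokenize` turns strings into token ids; substitute your CLIP implementation.

    # Build one text prompt per class, e.g. "a photo of a dog."
    prompts = tokenize([template.format(name) for name in class_names])

    with torch.no_grad():
        text_emb = F.normalize(model.encode_text(prompts), dim=-1)        # [classes, dim]
        image_emb = F.normalize(model.encode_image(image_batch), dim=-1)  # [batch, dim]

    # Cosine similarity between each image and each class prompt;
    # the highest-scoring prompt gives the predicted class index.
    logits = image_emb @ text_emb.t()
    return logits.argmax(dim=-1)
```

No task-specific training is needed: swapping in a new list of class names is enough to define a new classifier, which is what makes the zero-shot results so striking.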

Weaknesses

  • The collected dataset is opaque and does not allow further community-driven analysis
  • Struggles with systematic tasks such as counting the number of objects
  • Performs worse on "potentially OOD" datasets like MNIST
  • Input text descriptions are short (≤76 tokens), limiting the capacity to supervise the image encoder
  • Learns societal biases from the internet-sourced text-image pairs
  • Analysis of the text encoder is relatively weak

Reflection

Many tasks can be described more naturally in natural language than with other mathematical formulations, so CLIP may be a better pre-trained model for those tasks.

Most interesting thought/idea from reading this paper

Anything can supervise anything now!
