2 minute read

Paper Title: Segment Anything
Date: 2023-05
Link: https://arxiv.org/abs/2304.02643

Paper reading notes

Spontaneous Questions

Q1: They use off-the-shelf pretrained encoders (ViT, CLIP) and a lightweight decoder. How is it possible that the system performs so well?
A1: First of all, the decoder architectures that SAM builds on (DETR and MaskFormer) also perform very well. Secondly, the data engine they built was highly effective at providing quality mask labels. One could argue the main contribution of this paper is the improvement on the data side rather than architecture or scaling.

Q2: How did they represent mask predictions when there are no classes?
A2: The DETR paper describes in detail the formulation of a set prediction problem, which uses a combination of a bipartite matching loss and transformer decoding to build a model capable of predicting a set of unique bounding boxes without non-maximum suppression. MaskFormer takes this further and builds an end-to-end trainable architecture for panoptic segmentation using two tasks: predicting mask embeddings and predicting mask classification.
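The bipartite matching step can be sketched with SciPy's Hungarian-algorithm solver. This is a toy illustration, not DETR's actual implementation: the cost values below are made up, whereas real DETR builds the cost matrix from class probabilities plus L1/GIoU box terms.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(cost_matrix):
    """Return (prediction_idx, ground_truth_idx) pairs minimizing total cost."""
    pred_idx, gt_idx = linear_sum_assignment(cost_matrix)
    return [(int(i), int(j)) for i, j in zip(pred_idx, gt_idx)]

# 3 predictions vs 2 ground-truth objects; cost[i, j] is the (invented)
# cost of assigning prediction i to ground truth j. Lower is better.
cost = np.array([
    [0.1, 0.9],
    [0.8, 0.2],
    [0.5, 0.6],
])
print(match_predictions(cost))  # [(0, 0), (1, 1)]; prediction 2 stays unmatched
```

Because each ground-truth object is matched to exactly one prediction, duplicates are penalized during training and non-maximum suppression becomes unnecessary at inference.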

Discussion questions

Q1:

Paper Review

Summary

The paper introduces Segment Anything: a new task, model, and dataset for image segmentation. Architecturally, SAM builds on prior work (Transformer, Vision Transformer, DETR, and MaskFormer) to create an end-to-end trainable, prompt-based segmentation model. The main contribution of the paper is threefold. First, it combines ideas from LLMs and segmentation to create a foundation model with strong generalization power, via a new task and a new model that take prompt inputs and produce mask outputs. Second, the authors created a data engine that enables a feedback loop between the model and human annotators, greatly improving the number and quality of mask labels. Third, they released the model and a massive dataset (orders of magnitude larger than prior ones) to the research community for further research.
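The prompt-to-mask idea can be sketched in a few lines of NumPy. This is not SAM's real decoder (which uses a two-way transformer between prompt tokens and the image embedding); it only illustrates the MaskFormer-style trick SAM inherits: a mask is read out as the dot product between a mask-token embedding and per-pixel image features. All shapes and values here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, D = 8, 8, 16                    # tiny feature map with D-dim features
pixel_features = rng.normal(size=(H, W, D))

# Pretend the decoder has turned a point prompt into a mask-token embedding:
mask_token = rng.normal(size=(D,))

logits = pixel_features @ mask_token  # (H, W) per-pixel mask logits
mask = logits > 0                     # sigmoid(x) > 0.5  <=>  x > 0
print(mask.shape)                     # (8, 8) binary mask
```

Because the image embedding is computed once and the mask readout is cheap, the same heavy encoder output can serve many prompts interactively, which is the point of the lightweight-decoder design.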

Strengths

  • The model architecture can incorporate various prompt types and image encoders to produce reasonable binary mask predictions. This is a novel task and a somewhat novel architecture that shows the authors’ deep consideration for zero-shot generalization
  • The data engine is effective at getting more data labeled
  • The massive data release AND detailed Dataset, Annotation, and Model Cards will benefit the research community in terms of data resources, transparency, and reproducibility
  • The detailed analysis of zero-shot learning and fairness is a positive, especially for a foundation model like this one
  • The availability of prompts allows resolving ambiguity for downstream tasks, assisting annotation generation, etc.
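
The ambiguity-handling idea from the paper can be sketched briefly: for one prompt, the decoder emits several candidate masks (SAM uses three), and training backpropagates only through the lowest-loss candidate, letting each output head specialize (e.g., whole object vs. part vs. subpart). The loss values below are made-up numbers purely for illustration.

```python
def ambiguity_aware_loss(candidate_losses):
    """Backprop target under ambiguity: the minimum loss across candidates."""
    return min(candidate_losses)

# Hypothetical per-candidate mask losses for one ambiguous point prompt:
losses = [0.70, 0.15, 0.42]
print(ambiguity_aware_loss(losses))  # 0.15 — only the best candidate is trained
```

At inference time, SAM pairs this with a predicted IoU score per candidate, so downstream users can rank the masks instead of guessing which interpretation the prompt meant.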

Weaknesses

  • Experimentation and benchmarking against other models could be more comprehensive, with more detailed, quantified performance comparisons, especially since this model is supposed to generalize well.
  • Limited demonstration with text-based prompts

Reflection

  • The DETR and MaskFormer architectures propose fascinating ways to eliminate the need for post-processing, a standard step in the early days of segmentation and object detection.
  • Once again, scaling the model and data produces significant gains for multi-modal models
  • Designing the model architecture and task with zero-shot generalization in mind produced some interesting new approaches to existing problems

Most interesting thought/idea from reading this paper

I’m wondering if the mask encoding could be extracted and used for other purposes via fine-tuning.
