2 minute read

Paper Title: Segment Anything
Date: 2023-05
Link: https://arxiv.org/abs/2304.02643

Paper reading notes

Spontaneous Questions

Q1: They use off-the-shelf pretrained encoders (ViT, CLIP) and a lightweight decoder. How is it possible that the system performs so well?
A1: First of all, the decoder architectures that SAM builds on (DETR and MaskFormer) also perform very well. Secondly, the data engine they built was highly effective at providing quality mask labels. One could argue the main contribution of this paper is the improvement on the data side rather than architecture or scaling.

Q2: How did they represent mask predictions when there are no classes?
A2: The DETR paper describes in detail the formulation of a set prediction problem, which uses a combination of a bipartite matching loss and transformer decoding to build a model capable of predicting a set of unique bounding boxes without non-maximum suppression. MaskFormer takes this further and builds an end-to-end trainable architecture for panoptic segmentation using two tasks: predicting mask embeddings and predicting mask classification.
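The bipartite matching step can be sketched with SciPy's Hungarian-algorithm solver. This is a toy illustration, not DETR's actual implementation: the cost values below are made up, whereas real DETR builds the cost matrix from class probabilities plus L1/GIoU box terms.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_predictions(cost_matrix):
    """Return (prediction_idx, ground_truth_idx) pairs minimizing total cost."""
    pred_idx, gt_idx = linear_sum_assignment(cost_matrix)
    return [(int(i), int(j)) for i, j in zip(pred_idx, gt_idx)]

# 3 predictions vs 2 ground-truth objects; cost[i, j] is the (invented)
# cost of assigning prediction i to ground truth j. Lower is better.
cost = np.array([
    [0.1, 0.9],
    [0.8, 0.2],
    [0.5, 0.6],
])
print(match_predictions(cost))  # [(0, 0), (1, 1)]; prediction 2 stays unmatched
```

Because each ground-truth object is matched to exactly one prediction, duplicates are penalized during training and non-maximum suppression becomes unnecessary at inference.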

Discussion questions

Q1:

Paper Review

Summary

The paper introduces Segment Anything: a new task, model, and dataset for image segmentation. Architecturally, SAM builds on prior work (Transformer, Vision Transformer, DETR, and MaskFormer) to create an end-to-end trainable, prompt-based segmentation model. The main contribution of the paper is threefold. First, it combines ideas from LLMs and segmentation to create a foundation model with strong generalization power, via a new task and a new model that take prompt inputs and produce mask outputs. Second, the authors created a data engine that enables a feedback loop between the model and human annotators, greatly improving the number and quality of mask labels. Third, they released the model and a massive dataset (orders of magnitude larger than prior ones) to the research community for further research.
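The prompt-to-mask idea can be sketched in a few lines of NumPy. This is not SAM's real decoder (which uses a two-way transformer between prompt tokens and the image embedding); it only illustrates the MaskFormer-style trick SAM inherits: a mask is read out as the dot product between a mask-token embedding and per-pixel image features. All shapes and values here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, D = 8, 8, 16                    # tiny feature map with D-dim features
pixel_features = rng.normal(size=(H, W, D))

# Pretend the decoder has turned a point prompt into a mask-token embedding:
mask_token = rng.normal(size=(D,))

logits = pixel_features @ mask_token  # (H, W) per-pixel mask logits
mask = logits > 0                     # sigmoid(x) > 0.5  <=>  x > 0
print(mask.shape)                     # (8, 8) binary mask
```

Because the image embedding is computed once and the mask readout is cheap, the same heavy encoder output can serve many prompts interactively, which is the point of the lightweight-decoder design.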

Strengths

  • The model architecture can incorporate various prompt types and image encoders to produce reasonable binary mask predictions. This is a novel task and a somewhat novel architecture that shows the authors’ deep consideration for zero-shot generalization
  • The data engine is effective at getting more data labeled
  • The massive data release AND detailed Dataset, Annotation, and Model Cards will benefit the research community in terms of data resources, transparency, and reproducibility
  • The detailed analysis of zero-shot learning and fairness is a positive, especially for a foundation model like this one
  • The availability of prompts allows resolving ambiguity for downstream tasks, assisting annotation generation, etc.
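
The ambiguity-handling idea from the paper can be sketched briefly: for one prompt, the decoder emits several candidate masks (SAM uses three), and training backpropagates only through the lowest-loss candidate, letting each output head specialize (e.g., whole object vs. part vs. subpart). The loss values below are made-up numbers purely for illustration.

```python
def ambiguity_aware_loss(candidate_losses):
    """Backprop target under ambiguity: the minimum loss across candidates."""
    return min(candidate_losses)

# Hypothetical per-candidate mask losses for one ambiguous point prompt:
losses = [0.70, 0.15, 0.42]
print(ambiguity_aware_loss(losses))  # 0.15 — only the best candidate is trained
```

At inference time, SAM pairs this with a predicted IoU score per candidate, so downstream users can rank the masks instead of guessing which interpretation the prompt meant.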

Weaknesses

  • Experimentation and benchmarking against other models could be more comprehensive, with more detailed, quantified performance comparisons, especially since this model is supposed to generalize well.
  • Limited demonstration with text-based prompts

Reflection

  • The DETR and MaskFormer architectures propose fascinating ways to eliminate the need for post-processing, a standard step in the early days of segmentation and object detection.
  • Once again, scaling the model and data produces significant gains for multi-modal models
  • Designing the model architecture and task with zero-shot generalization in mind produced some interesting new approaches to existing problems

Most interesting thought/idea from reading this paper

I’m wondering if the mask encoding could be extracted and used for other purposes via fine-tuning.
