
Paper Title Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Authors Saharia & Chan et al.
Date 2022-05
Link https://arxiv.org/pdf/2205.11487.pdf

Paper Review

Short Summary

Imagen is a text-to-image diffusion model that achieves strong text–image alignment and image fidelity. It has three stages: a base text-to-image model and two super-resolution models. The base model uses a frozen T5-XXL as its text encoder and a U-Net conditioned on the text embeddings. The super-resolution models use Efficient U-Nets, and sampling uses dynamic thresholding. The model achieves SOTA FID and the best results on DrawBench, a new benchmark introduced by Google.
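The three-stage cascade above can be sketched in a few lines (shapes only — the real models are diffusion U-Nets, and every function name here is an illustrative placeholder, not the paper's API):

```python
import numpy as np

def encode_text(prompt: str, embed_dim: int = 4096) -> np.ndarray:
    """Stand-in for the frozen T5-XXL encoder: one embedding per token."""
    tokens = prompt.split()
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal((len(tokens), embed_dim))

# Placeholder stages: each returns an image at the stage's output resolution.
def base_model(text_emb):            # base text-to-image model, 64x64
    return np.zeros((64, 64, 3))

def sr_model_1(img, text_emb):       # first super-resolution model, 64 -> 256
    return np.zeros((256, 256, 3))

def sr_model_2(img, text_emb):       # second super-resolution model, 256 -> 1024
    return np.zeros((1024, 1024, 3))

def generate(prompt: str) -> np.ndarray:
    """Full cascade: text embeddings condition every stage."""
    emb = encode_text(prompt)
    img = base_model(emb)
    img = sr_model_1(img, emb)
    return sr_model_2(img, emb)
```

Note that the same frozen text embeddings are fed to all three stages; only the image pathway is trained.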

Strengths

  • Generates text inside images very well, much better than DALL·E 2 or DDPM
  • Simple architecture that is easy to train
  • Dynamic thresholding is shown to improve both text–image alignment and image fidelity.
  • Nice demonstration of transfer learning from T5’s text-only domain to the text–image domain
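Dynamic thresholding itself is simple enough to sketch. Per the paper, at each sampling step the predicted clean image is clipped to a data-dependent range and rescaled, which counteracts pixel saturation at high classifier-free guidance weights (a minimal numpy sketch; the function name and default percentile are illustrative):

```python
import numpy as np

def dynamic_threshold(x0_hat: np.ndarray, p: float = 0.995) -> np.ndarray:
    """Dynamic thresholding from the Imagen paper.

    Set s to the p-th percentile of |x0_hat| over the image. If s > 1,
    clip x0_hat to [-s, s] and divide by s, so the output always lies in
    [-1, 1] without crushing the in-range pixels to the boundary.
    """
    s = np.percentile(np.abs(x0_hat), p * 100)
    s = max(s, 1.0)  # only rescale when values actually exceed [-1, 1]
    return np.clip(x0_hat, -s, s) / s
```

Compared to static thresholding (a hard clip to [-1, 1]), this keeps relative pixel intensities when guidance pushes predictions far out of range.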

Weaknesses

  • Performs less well on prompts involving counting or positional relationships, possibly due to limitations of the T5-XXL language model.
  • The description of the methods and architecture is unintuitive and lacks detail, making it harder for other researchers to understand and replicate the results.
  • Doesn’t describe the training dataset statistics at any level of detail, which makes it very hard for the research community to understand the implications.

Reflection

It seems like the model demonstrated that text encoding doesn’t necessarily have to be aligned with vision to generate good images. Put another way, it is possible, and perhaps even easier, to take frozen text embeddings and train the vision model to align with them.

Most interesting thought/idea from reading this paper

Most of the individual improvements in this paper are not outstanding, which makes me think that, even for image generation, computing resources are still the most important factor.
