High-Resolution Image Synthesis with Latent Diffusion Models

Architecture

Perceptual Image Compression

  • Encoder: given an RGB image \(x \in \mathbb{R}^{H \times W \times 3}\), the encoder \(\mathcal{E}\) maps \(x\) to a latent representation \(z = \mathcal{E}(x)\), where \(z \in \mathbb{R}^{h \times w \times C}\). The encoder downsamples the image by a factor \(f = H/h = W/w\).
  • Decoder: given a latent representation \(z\), the decoder \(\mathcal{D}\) reconstructs the image \(\hat{x} = \mathcal{D}(z)\).
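  • Training: the encoder and decoder are trained to minimize the reconstruction error between the input image \(x\) and the decoded image \(\hat{x}\); the paper combines a perceptual loss with a patch-based adversarial objective.
  • Regularization: to avoid arbitrarily high-variance latent spaces, the latent is regularized in one of two ways.
  • KL-reg: a slight KL penalty pulls the learned latent towards a standard normal, as in a VAE.
  • VQ-reg: a vector quantization layer is placed within the decoder.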
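
The shape bookkeeping and the KL-reg penalty above can be illustrated with a small NumPy sketch. It is not the paper's implementation: the helper names `latent_shape` and `kl_to_standard_normal` are hypothetical, the values f = 8 and C = 4 are illustrative assumptions, and the penalty assumes the encoder outputs a diagonal Gaussian parameterized by a mean and a log-variance.

```python
import numpy as np

def latent_shape(H, W, f, C):
    """Shape of z for an H x W image and downsampling factor f = H/h = W/w."""
    if H % f or W % f:
        raise ValueError("H and W must be divisible by f")
    return (H // f, W // f, C)

def kl_to_standard_normal(mean, logvar):
    """KL( N(mean, exp(logvar)) || N(0, I) ), summed over all latent entries."""
    return 0.5 * np.sum(mean**2 + np.exp(logvar) - 1.0 - logvar)

# Illustrative values only (assumptions, not taken from the notes above).
shape = latent_shape(256, 256, f=8, C=4)

mean = np.zeros(shape)
logvar = np.zeros(shape)

# Posterior equal to the prior: the penalty vanishes.
kl_zero = kl_to_standard_normal(mean, logvar)

# Shifting every mean by 1 costs 0.5 per latent entry.
kl_shifted = kl_to_standard_normal(mean + 1.0, logvar)
```

In a KL-regularized autoencoder this term would be added to the reconstruction loss with a small weight, so that the latent stays close to a standard normal without hurting reconstruction quality much.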

Generative Modeling of Latent Representations

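  • A time-conditional UNet \(\epsilon_{\theta}(z_t, t)\) serves as the denoising backbone. It operates on the latent \(z\) rather than on pixels and is trained to predict the noise added to \(z\) at timestep \(t\): \(L_{LDM} = \mathbb{E}_{z, \epsilon \sim \mathcal{N}(0, 1), t}\left[\lVert \epsilon - \epsilon_{\theta}(z_t, t) \rVert_2^2\right]\).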
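
A minimal sketch of the noise-prediction training signal for a time-conditional denoiser acting on latents, written with NumPy. Several things here are assumptions rather than content from these notes: the DDPM-style linear beta schedule, the number of steps `T`, the latent shape, and the helper names `q_sample`, `ldm_loss`, and `oracle_eps`. The "oracle" model is a hypothetical stand-in for the UNet that cheats by knowing the clean latent, which makes it easy to check that the loss behaves as expected.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 2e-2, T)    # assumed linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)  # cumulative signal retention

def q_sample(z0, t, noise):
    """Forward noising: z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alphas_bar[t]) * z0 + np.sqrt(1.0 - alphas_bar[t]) * noise

def ldm_loss(eps_model, z0, t, noise):
    """Mean squared error between the true noise and the model's prediction."""
    z_t = q_sample(z0, t, noise)
    return np.mean((noise - eps_model(z_t, t)) ** 2)

def oracle_eps(z0):
    """Hypothetical denoiser that knows z0 and inverts q_sample exactly."""
    def model(z_t, t):
        return (z_t - np.sqrt(alphas_bar[t]) * z0) / np.sqrt(1.0 - alphas_bar[t])
    return model

z0 = rng.standard_normal((32, 32, 4))   # stand-in for an encoded image
noise = rng.standard_normal(z0.shape)
t = 500

loss_oracle = ldm_loss(oracle_eps(z0), z0, t, noise)
loss_zero = ldm_loss(lambda z_t, t: np.zeros_like(z_t), z0, t, noise)
```

A perfect noise predictor drives the loss to roughly zero, while a model that always predicts zero is left with a loss near the variance of the noise, which is 1 for standard normal noise.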

Conditioning Mechanisms

  • The denoiser becomes a conditional denoising autoencoder \(\epsilon_{\theta}(z_t,t,y)\), where \(y\) is the conditioning input.
  • A cross-attention mechanism is added to the UNet architecture to inject \(y\).
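  • To pre-process \(y\) from various modalities (such as language prompts), a domain-specific encoder \(\tau_{\theta}\) projects \(y\) to an intermediate representation \(\tau_{\theta}(y) \in \mathbb{R}^{M \times d_{\tau}}\), which is then mapped to the intermediate layers of the UNet via a cross-attention layer implementing
  • \(\mathrm{Attention}(Q,K,V) = \mathrm{softmax}(QK^T/\sqrt{d}) \cdot V\), with \(Q = W_Q^{(i)} \cdot \varphi_i(z_t)\), \(K = W_K^{(i)} \cdot \tau_{\theta}(y)\), and \(V = W_V^{(i)} \cdot \tau_{\theta}(y)\). Here \(\varphi_i(z_t) \in \mathbb{R}^{N \times d_{\epsilon}^{i}}\) is the flattened intermediate feature map of the UNet at layer \(i\), and \(W_Q^{(i)}\), \(W_K^{(i)}\), \(W_V^{(i)}\) are learnable projection matrices. The queries come from the image features; the keys and values come from the conditioning.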
  • \(\tau_{\theta}\) and \(\epsilon_{\theta}\) are optimized jointly on the conditional denoising objective; the autoencoder is trained beforehand in a separate first stage.
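
The cross-attention layer can be sketched in NumPy as follows. This is a single-head illustration under stated assumptions, not the paper's code: all sizes (`N`, `M`, `d_eps`, `d_tau`, `d`) are illustrative, the projection matrices are random rather than learned, and the code uses the row-vector convention `features @ W`, so each matrix is stored with shape (input width, d). The point is only to show where the queries, keys, and values come from and what shapes result.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(phi, tau_y, W_Q, W_K, W_V):
    """Single-head cross-attention: image features attend to the conditioning."""
    Q = phi @ W_Q        # (N, d) queries from flattened UNet features
    K = tau_y @ W_K      # (M, d) keys from the encoded conditioning
    V = tau_y @ W_V      # (M, d) values from the encoded conditioning
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (N, M)
    return weights @ V, weights                       # (N, d), (N, M)

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): N spatial positions, M conditioning tokens.
N, d_eps = 16 * 16, 64
M, d_tau = 77, 128
d = 32

phi = rng.standard_normal((N, d_eps))    # stand-in for phi_i(z_t)
tau_y = rng.standard_normal((M, d_tau))  # stand-in for tau_theta(y)

# Random stand-ins for the learnable projections W_Q, W_K, W_V.
W_Q = rng.standard_normal((d_eps, d)) / np.sqrt(d_eps)
W_K = rng.standard_normal((d_tau, d)) / np.sqrt(d_tau)
W_V = rng.standard_normal((d_tau, d)) / np.sqrt(d_tau)

out, weights = cross_attention(phi, tau_y, W_Q, W_K, W_V)
```

Each of the N spatial positions ends up with a weighted mixture of the M conditioning tokens, so the output has one row per spatial position and can be reshaped back into a feature map for the next UNet layer.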

Last updated: September 3, 2024 10:43:35
Created: September 3, 2024 10:43:35