U-Nets with attention
U-Net is a popular neural network architecture, originally developed for medical image segmentation, that has since been employed in many applications.
A U-Net consists of a series of downsampling and upsampling convolution blocks connected via skip connections, along with a middle block. This setup allows the network to capture global location and local context at the same time during learning. It works well even with very few training samples and delivers strong performance on segmentation tasks.
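To make the structure concrete, here is a minimal toy sketch of this encoder/middle/decoder layout in PyTorch. This is a hypothetical illustration, not the labmlai implementation: `TinyUNet` and its block sizes are invented for this post, and the skip connections concatenate encoder features onto the decoder path.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Hypothetical minimal U-Net: two down blocks, a middle block,
    two up blocks, with skip connections from encoder to decoder."""
    def __init__(self, in_ch=1, base=16):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.middle = nn.Sequential(nn.Conv2d(base * 2, base * 2, 3, padding=1), nn.ReLU())
        # Decoder convs see upsampled features concatenated with the skip features.
        self.up2 = nn.Sequential(nn.Conv2d(base * 4, base, 3, padding=1), nn.ReLU())
        self.up1 = nn.Conv2d(base * 2, in_ch, 3, padding=1)
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, x):
        d1 = self.down1(x)               # full resolution
        d2 = self.down2(self.pool(d1))   # half resolution
        m = self.middle(self.pool(d2))   # quarter resolution (middle block)
        u2 = self.up2(torch.cat([self.upsample(m), d2], dim=1))   # skip from d2
        u1 = self.up1(torch.cat([self.upsample(u2), d1], dim=1))  # skip from d1
        return u1

x = torch.randn(2, 1, 32, 32)
out = TinyUNet()(x)
print(tuple(out.shape))  # output has the same spatial size as the input
```

Note how the output retains the input's spatial resolution: the upsampling path undoes the pooling, while the skip connections restore fine-grained location information lost in the bottleneck.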
Diffusion models are generative models used to learn an image/data distribution. The original image is diffused into noise via a series of forward steps, and in the reverse steps a model is trained to recover the original image from noisy images at different time steps. A U-Net with attention is used to learn this reverse process, i.e. the parameters θ learned during diffusion training are the parameters of the U-Net's blocks. Compared with a plain U-Net, the U-Nets employed in diffusion models involve additional nuances, such as time-step conditioning and attention blocks.
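The forward and reverse steps above can be sketched as follows. This is a hedged, simplified illustration of the standard DDPM setup, not code from labmlai: the linear beta schedule, `q_sample`, and the placeholder `eps_theta` model are assumptions for this example. In practice `eps_theta` would be the attention U-Net, trained to predict the noise added at step t.

```python
import torch

# Assumed linear noise schedule over T forward steps (DDPM-style).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)  # cumulative product of alphas

def q_sample(x0, t, eps):
    """Forward diffusion: sample x_t ~ q(x_t | x_0) in closed form."""
    a = alpha_bar[t].sqrt().view(-1, 1, 1, 1)
    s = (1.0 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    return a * x0 + s * eps

x0 = torch.randn(4, 3, 8, 8)       # stand-in batch of "images"
t = torch.randint(0, T, (4,))      # random time step per sample
eps = torch.randn_like(x0)         # the noise the model must predict
xt = q_sample(x0, t, eps)

# Placeholder for the U-Net noise predictor eps_theta(x_t, t);
# the real model's parameters theta are what diffusion training learns.
eps_theta = lambda x, t: torch.zeros_like(x)
loss = ((eps - eps_theta(xt, t)) ** 2).mean()  # simple noise-prediction loss
print(tuple(xt.shape), loss.item() > 0)
```

The reverse process then steps from pure noise back toward a clean sample, using the trained predictor at each time step; the U-Net's time-step conditioning is what lets a single network handle all values of t.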
In this post we discuss the implementation of the U-Net model with attention employed in current SOTA latent diffusion models. Specifically, we discuss the open-source implementation from labmlai. We chose this implementation because it allows readers to load weights from the original implementation.