Overview and Summary of 'Diffusion Models are Evolutionary Algorithms' Paper
Posted on Tue 08 October 2024 in Posts
Introduction
Over time, I became interested in not simply reading the relevant research papers in my field, but also writing overviews of them. I believe that writing these will help me solidify the information I learned from various papers.
This review begins what I hope will become a series of posts about various papers I read on topics that include machine learning and AI, as well as broader scientific themes.
Link
This page overviews the paper titled Diffusion Models are Evolutionary Algorithms. Here is the link to the original PDF of the paper.
Review
In a regular diffusion model, there are essentially two passes:
- The "forward" pass: over T steps, the model goes from the original data point to a noisy data point. At each step, noise is added, and a neural network is trained to estimate that noise by minimizing, roughly, an MSE loss.
- Once the neural networks for de-noising are trained (a separate network for each step), we perform the de-noising procedure as a "backward" pass. This is not backpropagation! We start with a noisy data point and iterate through the networks to arrive at a new data estimate.
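For concreteness, the per-step noise-prediction objective sketched above can be written in standard DDPM notation (this formula is not spelled out in the post; it is the usual formulation from the diffusion literature):

$$\mathcal{L}_t = \mathbb{E}_{\mathbf{x}_0,\,\boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\left(\mathbf{x}_t, t\right)\right\|^2\right], \qquad \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$$

Here \(\boldsymbol{\epsilon}_\theta\) is the noise-prediction network and \(\bar{\alpha}_t\) is the cumulative noise schedule.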
What the authors did differently
- The authors change the interpretation of the diffusion algorithm. Instead of predicting the noise, they train the models to directly reconstruct and estimate the original data point.
- During the de-noising stage of diffusion, instead of using the neural network to predict the data point, the authors use a fitness function¹ and conditional probabilities for estimation. The estimate of the data point is the expected value of \(\mathbf{x}\) given the sample at time \(t\):
$$\widehat{\mathbf{x}_0}=\mathbb{E}\left[\mathbf{x}\vert \mathbf{x}_t\right]$$
- The expectation is computed over \(N\) samples at time step \(t\), and the probability that the initial sample (the true data point) equals each candidate sample is computed via Bayes' theorem:
$$P\left(\mathbf{x}_0 = \mathbf{x}\vert \mathbf{x}_t\right) = \frac{P(\mathbf{x}_t\vert \mathbf{x}_0 = \mathbf{x})P\left(\mathbf{x}_0 = \mathbf{x}\right)}{P\left(\mathbf{x}_t\right)}$$
- The denominator in the above expression is treated as a normalization term.
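The fitness-weighted expectation above can be sketched in a few lines of NumPy. This is a hypothetical illustration, not the paper's implementation: I assume the likelihood \(P(\mathbf{x}_t \vert \mathbf{x}_0 = \mathbf{x})\) is Gaussian around a scaled candidate, the prior \(P(\mathbf{x}_0 = \mathbf{x})\) is taken proportional to the fitness, and all names (`estimate_x0`, `alpha_t`, `sigma_t`) are my own.

```python
import numpy as np

def estimate_x0(population, x_t, alpha_t, sigma_t, fitness):
    """Estimate x0 as a fitness-weighted expectation over N candidate samples.

    Hypothetical sketch: the likelihood P(x_t | x0 = x) is modeled as a
    Gaussian around sqrt(alpha_t) * x, and the prior P(x0 = x) is taken
    proportional to fitness(x). The denominator P(x_t) becomes the
    normalization of the weights.
    """
    # Gaussian log-likelihood of observing x_t given each candidate x0
    diffs = x_t - np.sqrt(alpha_t) * population            # shape (N, d)
    log_lik = -np.sum(diffs**2, axis=1) / (2 * sigma_t**2)  # shape (N,)
    # unnormalized posterior: likelihood * fitness-based prior
    weights = np.exp(log_lik - log_lik.max()) * fitness(population)
    weights /= weights.sum()                                # Bayes normalization
    return weights @ population                             # E[x0 | x_t]
```

With a uniform fitness, the estimate collapses to a plain Gaussian-kernel average, so the fitness term is what injects the "selection pressure" of an evolutionary algorithm.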
Key Takeaways
- Diffusion models are equivalent to evolutionary algorithms and vice versa: one can view an evolutionary algorithm as a diffusion process from the point of view of generation
- Given that point of view, the evolutionary approach can lead to a wide range of applications (as evidenced by the experiments) and can scale to higher-dimensional spaces
In addition to the above points, it does make sense to treat the diffusion process as evolutionary from the point of view of physical observation: if our dataset starts with one distribution, then in order to diffuse into another, it must continuously mold itself into a new state (distribution). We can discretize this process and thereby obtain a sequence of states evolving from start to finish.
The authors pose excellent open questions, which I will not reiterate here, that should be of interest to those who specialize in the field.
- A fitness function evaluates how close a given solution is to the desired outcome, a concept often used in optimization and evolutionary computation. ↩︎