Can you explain the diffusion model of AlphaFold 3?

Here's a thorough explanation of AlphaFold 3's diffusion model:

AlphaFold 3 — The Diffusion Module

Why Diffusion? The Core Idea

AlphaFold 2 used a deterministic, geometry-based module called the Structure Module with Invariant Point Attention (IPA). It predicted residue-level frames and torsion angles (χ angles) and was architecturally tailored to proteins only.
AlphaFold 3 replaces this entirely with a diffusion-based generative model — the same paradigm used in image generators like Imagen. This shift had two goals:
  1. Handle all molecule types (proteins, DNA, RNA, small molecules, ions) without special-casing each one.
  2. Operate at full all-atom resolution rather than just the residue level.

Where the Diffusion Module Sits in AF3

The overall pipeline is:
Input (sequence + MSA + templates)
        ↓
  Input Feature Embedder
        ↓
  Pairformer (replaces Evoformer)
  → Single representation (one embedding vector per token)
  → Pair representation (N×N, pairwise relations between tokens)
        ↓
  Diffusion Module
        ↓
  Output: Physical 3D atom coordinates
AF3 is best described as a conditional diffusion model where most of the heavy computation happens in the conditioning (i.e., the Pairformer trunk), not in the diffusion step itself.

How the Diffusion Process Works

Training (Forward + Reverse Diffusion)

  1. Forward pass (noising): Take the known ground-truth atomic coordinates from the Protein Data Bank and corrupt them by adding Gaussian noise at varying levels of intensity — from slightly perturbed to nearly pure noise.
  2. Reverse pass (denoising): Train the neural network to receive these noisy coordinates and predict the clean, original coordinates. The conditioning signal is the sequence + pair/single representations from the Pairformer.
  3. Efficiency trick: The Pairformer trunk is expensive, so it is run only once per training step. From its output, 48 noised versions of the ground-truth structure are generated (via random rotations, translations, and independent Gaussian noise draws), and the diffusion module is trained on all 48 in parallel.
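The forward (noising) pass can be sketched in a few lines. This is a minimal NumPy illustration, assuming an EDM-style log-normal noise-level distribution; the exact schedule parameters here are assumptions, not AF3's published values.

```python
import numpy as np

def noise_training_example(x_gt, sigma_data=16.0, rng=None):
    """Corrupt ground-truth atom coordinates (n_atoms x 3) with Gaussian
    noise at a randomly sampled level, as in the forward (noising) pass.
    The log-normal noise-level distribution is an illustrative assumption
    in the spirit of EDM-style diffusion training."""
    if rng is None:
        rng = np.random.default_rng()
    # Sample a noise level t: small t -> slightly perturbed coordinates,
    # large t -> nearly pure noise.
    t = sigma_data * np.exp(rng.normal(-1.2, 1.5))
    # Add isotropic Gaussian noise to every atom coordinate.
    x_noisy = x_gt + t * rng.normal(size=x_gt.shape)
    return x_noisy, t
```

The denoiser is then trained to map `(x_noisy, t, conditioning)` back to `x_gt`, e.g. with a noise-weighted MSE loss.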

Inference (Structure Generation)

  1. Start from a randomly sampled "cloud" of atom coordinates (pure Gaussian noise), rather than the "black hole" all-atoms-at-origin initialization used in AF2.
  2. Over many iterative denoising steps, the model refines these coordinates — stepping from high noise toward a plausible low-energy structure.
  3. The process converges on a final predicted 3D structure for the entire molecular complex.
  4. By default, AF3 runs this sampling 5 times (5 different random seeds), producing 5 candidate structures ranked by confidence.
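The inference loop above can be sketched as an iterative denoising procedure. Here `denoise(x, t)` stands in for the full diffusion module (which would also take the Pairformer conditioning); the geometric noise schedule and the Euler-style step rule are illustrative assumptions, not AF3's exact sampler.

```python
import numpy as np

def sample_structure(denoise, n_atoms, n_steps=200, sigma_max=160.0,
                     sigma_min=4e-4, rng=None):
    """Iterative denoising sketch: start from pure noise and step along a
    decreasing noise schedule toward the network's clean estimate."""
    if rng is None:
        rng = np.random.default_rng()
    # 1. Start from a random cloud of atoms at the highest noise level.
    x = sigma_max * rng.normal(size=(n_atoms, 3))
    # Geometric schedule from sigma_max down to sigma_min.
    sigmas = np.geomspace(sigma_max, sigma_min, n_steps)
    for t_cur, t_next in zip(sigmas[:-1], sigmas[1:]):
        x_denoised = denoise(x, t_cur)   # network's estimate of the clean structure
        # 2. Euler step on the probability-flow ODE: move part of the way
        #    from the current noisy coordinates toward the denoised estimate.
        x = x + (t_next / t_cur - 1.0) * (x - x_denoised)
    return x
```

Running this with several random seeds yields the ensemble of candidate structures described in step 4.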

Internal Architecture of the Diffusion Module

The module takes as input:
  • The Single representation and Pair representation from Pairformer (the conditioning)
  • The current (noisy) atom coordinates
It uses a Transformer-based denoiser with a specific attention pattern:
| Stage | Attention type | Level |
|-------|----------------|-------|
| 1 | Sequence-local attention | Atom-level |
| 2 | Global attention | Token-level |
| 3 | Sequence-local attention | Atom-level |
At each denoising step, the module outputs updated atom coordinates; the sampler then takes a noise-scaled step toward them (tracing the reverse diffusion trajectory), and this repeats until the noise level reaches its minimum.
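The sequence-local stages restrict each atom's attention to a neighbourhood in the sequence. A minimal sketch of such a mask, with an illustrative window size (AF3 uses blocked local attention over atoms, so this is a simplification):

```python
import numpy as np

def local_attention_mask(n_atoms, window=32):
    """Boolean (n_atoms x n_atoms) mask for sequence-local attention:
    atom i may attend to atom j only if |i - j| <= window."""
    idx = np.arange(n_atoms)
    return np.abs(idx[:, None] - idx[None, :]) <= window
```

The global token-level stage in the middle uses no such mask, letting distant parts of the complex exchange information.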

Key Differences: AF2 Structure Module vs. AF3 Diffusion Module

| Feature | AF2 Structure Module | AF3 Diffusion Module |
|---------|----------------------|----------------------|
| Model type | Predictive (discriminative) | Generative |
| Resolution | Residue-level (frames + χ angles) | All-atom coordinates |
| Equivariance | Explicitly enforced (SE(3)) | Not enforced; learned implicitly |
| Attention | IPA (Invariant Point Attention) | Local → Global → Local |
| Loss | FAPE (frame-aligned point error) | MSE + bond geometry loss |
| Initial coords | Black-hole initialization | Random noise cloud |

SE(3) Equivariance — Dropped But Recovered

A notable design choice: AF3 does not explicitly enforce SE(3) equivariance (the property that rotating or translating the input yields the same rotated or translated output). Instead, during training, every structure is randomly re-centred, rotated, and translated before noise is added. This forces the model to use positional information regardless of orientation, recovering equivariance through data augmentation rather than architectural constraint, and it greatly simplifies the model.
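The augmentation itself is simple: re-centre the coordinates, apply a uniformly random rotation, and add a random translation. A NumPy sketch (the translation scale is an illustrative assumption):

```python
import numpy as np

def random_se3_augment(x, rng=None):
    """Apply a random rigid transform (re-centre, rotate, translate) to an
    (n_atoms x 3) coordinate array -- the data augmentation AF3 uses in
    place of an SE(3)-equivariant architecture."""
    if rng is None:
        rng = np.random.default_rng()
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix.
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))   # fix the sign convention of the factorization
    if np.linalg.det(q) < 0:      # ensure a proper rotation (det = +1)
        q[:, 0] = -q[:, 0]
    shift = rng.normal(size=(1, 3))               # random translation
    x_centered = x - x.mean(axis=0, keepdims=True)  # re-centre
    return x_centered @ q.T + shift
```

Because a rigid transform preserves all interatomic distances, the augmented structure carries exactly the same geometric information in a new orientation.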

The Loss Function

The training loss combines:
$$L = \frac{t^2 + \sigma_{\mathrm{data}}^2}{(t \cdot \sigma_{\mathrm{data}})^2} \cdot (L_{\mathrm{MSE}} + \alpha_{\mathrm{bond}} \cdot L_{\mathrm{bond}}) + L_{\mathrm{smooth\_lddt}}$$
Where:
  • L_MSE — mean squared error on atom coordinate predictions
  • L_bond — penalty for incorrect bond geometry (encourages chemically valid structures)
  • L_smooth_lddt — an auxiliary loss: a smooth, differentiable version of the local distance difference test (lDDT) applied to the denoised coordinates
  • t — the noise level at that diffusion step
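Putting the terms together, the weighted loss can be computed as below. This is a sketch assuming the EDM-style weight λ(t) = (t² + σ_data²)/(t·σ_data)²; σ_data = 16 follows the value reported for AF3, and the bond and smooth-lDDT terms are passed in as precomputed scalars for brevity.

```python
import numpy as np

def diffusion_loss(x_pred, x_gt, t, sigma_data=16.0,
                   alpha_bond=0.0, l_bond=0.0, l_smooth_lddt=0.0):
    """Noise-level-weighted diffusion loss: an EDM-style weight times
    (coordinate MSE + bond term), plus the smooth-lDDT term."""
    # Mean over atoms of the squared coordinate error.
    l_mse = np.mean(np.sum((x_pred - x_gt) ** 2, axis=-1))
    # Weight that balances low- and high-noise training examples.
    weight = (t**2 + sigma_data**2) / (t * sigma_data) ** 2
    return weight * (l_mse + alpha_bond * l_bond) + l_smooth_lddt
```

The weight down-plays very high noise levels (where exact coordinates are unrecoverable) while keeping the loss well-scaled at low noise.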

Key Advantages of This Approach

  • Generality: A single unified model handles proteins, nucleic acids, small molecules, and ions — without molecule-type-specific architectural branches.
  • Multiple samples: Running diffusion multiple times gives an ensemble of plausible structures, capturing conformational uncertainty.
  • Hallucination control: Genuinely flexible, disordered regions are represented more faithfully than in AF2's deterministic module — though the authors had to add distillation training data (predictions from earlier AlphaFold models) to counteract the diffusion model's early tendency to hallucinate plausible-looking order in disordered regions.
  • Simplicity: The structure module became conceptually simpler — just coordinates in R³, processed like any other feature.

In Summary

AlphaFold 3's diffusion module is a conditional generative model: the Pairformer encodes rich chemical context (sequence, MSA, templates) into pair and single representations, and the diffusion module uses those representations to iteratively denoise a cloud of atoms into a physically plausible 3D structure. It mirrors the intuition of image diffusion models — start with noise, use contextual information to guide convergence — but applied to the 3D coordinate space of molecular complexes.