A Foundation Model for Biomolecular Structure and Binding Affinity Prediction
The intersection of deep learning and structural biology has reached a new inflection point with Boltz-2, a foundation model that simultaneously predicts biomolecular complex structures and binding affinities at speeds 1000× faster than free energy perturbation (FEP) methods. Developed by researchers at MIT’s Jameel Clinic and Valence Labs/Recursion, Boltz-2 represents a significant evolution from its predecessor and introduces the first AI model to approach FEP accuracy for small molecule-protein binding affinity prediction. Unlike AlphaFold3, Boltz-2 is released under an MIT license with full training code and weights, making it immediately accessible for both academic and industrial applications.
This technical deep-dive explores Boltz-2’s architectural innovations, training methodology, and how it compares to the current state-of-the-art in biomolecular structure prediction.
Boltz-2’s architecture comprises four main components: the trunk, the denoising module with steering components, the confidence module, and the novel affinity module. The overall framework follows the AlphaFold3 paradigm of combining transformer-based representation learning with diffusion-based structure prediction, but introduces several key modifications.
The trunk module processes input sequences and generates pair representations that encode information about biomolecular interactions. Where AlphaFold3 uses 48 Pairformer blocks, Boltz-2 uses 64, a significant increase in model capacity. Each Pairformer block contains triangle multiplicative updates (outgoing and incoming edges), triangle attention operations, single attention with pair bias, and transition blocks using SwiGLU activation.
A major computational advancement in Boltz-2 is the use of mixed-precision training (bfloat16) for the majority of the trunk, combined with custom trifast kernels for triangular attention operations. These optimizations enable scaling the training crop size to 768 tokens, matching AlphaFold3’s capacity while maintaining computational tractability. The pair representation maintains 128 channels throughout processing, with a single representation of 384 channels.
The trunk also includes a template module similar to AlphaFold3’s implementation, with 64-dimensional template pairwise representations processed through 2 template blocks. However, Boltz-2 extends template functionality to support multimeric templates—a departure from previous approaches that only allowed single-chain templates.
Building on innovations from Boltz-1, Boltz-2 retains a modified ordering of operations within the MSA module that differs from AlphaFold3's.
This reordering allows single representations from MSATransition to propagate directly to the pair representation, improving information flow. The model supports up to 8,192 MSA sequences during training and employs a novel MSA sampling strategy where sequences are randomly sampled from the top 16k hits rather than greedy selection, promoting robustness to low-quality MSAs. Additionally, 5% of training iterations randomly drop all MSA data to improve single-sequence prediction capabilities.
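The sampling strategy described above can be sketched in a few lines. The function and parameter names are illustrative, not taken from the Boltz-2 codebase, and "16k" is assumed to mean 16,384:

```python
import random

def sample_msa(msa_hits, max_seqs=8192, pool_size=16384, drop_prob=0.05, rng=None):
    """Illustrative sketch of the MSA sampling strategy described above.

    Instead of greedily taking the top `max_seqs` hits, sequences are drawn
    uniformly at random from the top `pool_size` hits, and with probability
    `drop_prob` the entire MSA is dropped to encourage single-sequence robustness.
    """
    rng = rng or random.Random()
    if rng.random() < drop_prob:
        return []  # train on the query sequence alone
    pool = msa_hits[:pool_size]
    if len(pool) <= max_seqs:
        return pool
    return rng.sample(pool, max_seqs)
```

Sampling from a wider pool exposes the model to noisier alignments during training, which is what promotes robustness to low-quality MSAs at inference time.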
Boltz-2’s tokenization scheme assigns one token per standard amino acid and nucleotide, with a key departure from AlphaFold3, Chai-1, and Boltz-1: non-canonical amino acids and nucleotides are kept as single tokens rather than being tokenized at the atomic level. This simplification reduces sequence length while maintaining biological relevance.
New input features compared to Boltz-1 include:
Boltz-2 inherits the diffusion-based structure prediction approach established in AlphaFold3, where atomic coordinates are predicted through iterative denoising of randomly initialized positions. The denoising module operates at two resolution levels: atoms and tokens.
The structure module uses an atom-level transformer that processes local neighborhoods—32-atom blocks attending to the closest 128 atoms—enabling efficient handling of large complexes. The denoising module maintains float32 precision due to observed instabilities at lower precision levels, contrasting with the trunk’s bfloat16 operations.
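The local attention pattern can be sketched as a mask builder. This is a hedged illustration: it assumes "closest" means sequence-local windows (as in AlphaFold3-style atom transformers); a spatial-nearest-neighbor variant would build the mask from coordinates instead:

```python
import numpy as np

def local_attention_mask(n_atoms, block=32, window=128):
    """Sketch of the sequence-local atom attention pattern described above.

    Queries are grouped into blocks of `block` atoms; each block attends to a
    window of `window` atoms centered on it (clamped to the sequence ends).
    Returns a boolean (n_atoms, n_atoms) mask; True = attention allowed.
    """
    mask = np.zeros((n_atoms, n_atoms), dtype=bool)
    for start in range(0, n_atoms, block):
        end = min(start + block, n_atoms)
        center = (start + end) // 2
        lo = max(0, min(center - window // 2, n_atoms - window))
        hi = min(n_atoms, lo + window)
        mask[start:end, lo:hi] = True
    return mask
```

Because each query only attends to a fixed-size neighborhood, the cost of the atom-level transformer grows linearly rather than quadratically with atom count.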
Key diffusion hyperparameters include:
| Parameter | Value |
|---|---|
| sigma_min | 0.0001 |
| rho | 7 |
| gamma_0 | 0.8 |
| gamma_min | 1.0 |
| noise_scale | 1.003 |
| step_scale | 1.5 |
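The sigma_min and rho values in the table parameterize a Karras-style (EDM) noise schedule for the denoising steps. A minimal sketch, assuming a sigma_max of 160.0, which is not listed in the table:

```python
import numpy as np

def karras_sigmas(n_steps=200, sigma_min=1e-4, sigma_max=160.0, rho=7.0):
    """Karras-style noise schedule used by EDM-family diffusion samplers.

    sigma_min and rho come from the hyperparameter table above; sigma_max is
    NOT listed there, and 160.0 is an assumed placeholder value.
    """
    ramp = np.linspace(0.0, 1.0, n_steps)
    inv_rho = 1.0 / rho
    sigmas = (sigma_max**inv_rho + ramp * (sigma_min**inv_rho - sigma_max**inv_rho)) ** rho
    return sigmas  # monotonically decreasing from sigma_max to sigma_min
```

The rho exponent concentrates steps at low noise levels, where fine structural detail is resolved; the gamma, noise_scale, and step_scale entries govern the stochastic churn added at each reverse step.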
Default inference uses 200 sampling steps, 10 recycling iterations, and generates 5 output samples. Runtime averages 40-60 seconds per protein-ligand prediction, scaling quadratically with sequence length.
A significant challenge for co-folding models—including AlphaFold3, Chai-1, and Boltz-1—is the production of structures with physical inaccuracies such as steric clashes and incorrect stereochemistry. Boltz-2 addresses this through Boltz-steering, an inference-time technique that applies physics-based potentials during reverse diffusion.
Steering potentials are applied for:
When enabled (producing “Boltz-2x”), 97% of predicted poses pass physical quality checks compared to only 43% without steering.
Boltz-2 introduces three major controllability mechanisms responding to user demand for hypothesis testing without costly retraining.
The model is trained on structures from diverse experimental methods and can condition predictions on the desired output type.
Method conditioning is implemented through one-hot encoding in the single token representation, allowing the model to produce structures matching the characteristic distributions of different experimental techniques.
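A minimal sketch of such a one-hot conditioning feature; the label vocabulary below is assumed for illustration, not the model's actual set:

```python
import numpy as np

# Illustrative label set -- the actual method vocabulary is not specified here.
METHODS = ["xray", "cryoem", "nmr", "md", "distillation"]

def method_one_hot(method):
    """One-hot method-conditioning feature appended to the single token
    representation, as described above."""
    vec = np.zeros(len(METHODS), dtype=np.float32)
    vec[METHODS.index(method)] = 1.0
    return vec
```

At inference, passing a different method label steers the output toward that technique's characteristic structural distribution (e.g., NMR-like ensembles versus crystallographic conformers).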
Unlike AlphaFold3 and Chai-1, which support only monomeric templates, Boltz-2 enables multimeric templates by grouping template hits by PDB ID. During training, 0-4 templates are sampled per chain from the top 20 template hits. For users requiring strict template adherence, a steering potential enforces that structures remain within a user-specified distance cutoff of the template:
\[E_\mathrm{planar}(x) = \sum_{i \in S_\mathrm{template\,atoms}} \max(\|x_i - x^\mathrm{ref}_i\| - \alpha_\mathrm{cutoff},\, 0)\]

where $x^\mathrm{ref}_i$ is the position of reference atom $i$ after aligning the template to the predicted coordinates.
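The potential is a simple hinge on per-atom deviations and can be implemented directly; the names here are illustrative:

```python
import numpy as np

def template_potential(x, x_ref, cutoff):
    """Hinge potential penalizing template atoms that drift more than
    `cutoff` angstroms from their aligned reference positions.

    x, x_ref: (n_template_atoms, 3) predicted and aligned reference coordinates.
    """
    dist = np.linalg.norm(x - x_ref, axis=-1)
    return np.maximum(dist - cutoff, 0.0).sum()
```

Atoms inside the cutoff contribute nothing, so the resulting gradient only acts on atoms that have drifted beyond the allowed deviation.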
Users can specify distance constraints between tokens through contact and pocket conditioning, encoded as pairwise features. Contact types include: no restraint, pocket-to-binder relationship, binder-to-pocket relationship, and contact relationship. Distance constraints range from 4Å to 20Å, encoded through normalized distance and Fourier embeddings with fixed random bases.
A time-dependent steering potential enforces these constraints:
\[E^t_\mathrm{Contact(A,B)}(x) = \frac{\sum_{i \in A,\, j \in B} \exp\left(-\lambda^t_\mathrm{union} \cdot \max(\|x_i - x_j\| - r_{AB},\, 0)\right) \cdot \max(\|x_i - x_j\| - r_{AB},\, 0)}{\sum_{i \in A,\, j \in B} \exp\left(-\lambda^t_\mathrm{union} \cdot \max(\|x_i - x_j\| - r_{AB},\, 0)\right)}\]

where $\lambda^t_\mathrm{union}$ increases monotonically as $t$ approaches 0, progressively tightening constraint enforcement.
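This is a soft minimum over atom pairs: the exponential weights concentrate on the best-satisfied pair, so the constraint counts as met when any pair between the two groups comes within the cutoff. A small illustrative implementation (the names are not from the Boltz-2 codebase):

```python
import numpy as np

def contact_potential(x_a, x_b, r_ab, lam):
    """Soft-minimum contact potential between atom groups A and B.

    Computes hinge violations max(||x_i - x_j|| - r_ab, 0) over all pairs
    (i in A, j in B) and averages them with weights exp(-lam * violation),
    so the best-satisfying pair dominates as lam grows during reverse diffusion.
    """
    dists = np.linalg.norm(x_a[:, None, :] - x_b[None, :, :], axis=-1)
    viol = np.maximum(dists - r_ab, 0.0)
    w = np.exp(-lam * viol)
    return (w * viol).sum() / w.sum()
```

With a small lam the potential behaves like a mean over pairs; as lam increases toward the end of sampling, it approaches the minimum violation, enforcing the constraint hard.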
The most significant innovation in Boltz-2 is its binding affinity prediction capability—the first AI model to approach FEP accuracy while being orders of magnitude faster.
The affinity module operates on Boltz-2’s structural predictions, processing the pair representation and predicted coordinates after 5 recycling iterations. The architecture consists of:
The final predictions are:
Boltz-2 employs two affinity models with different hyperparameters for ensemble robustness:
| Parameter | Model 1 | Model 2 |
|---|---|---|
| Pairformer layers | 8 | 4 |
| $λ_\mathrm{focal}$ | 0.8 | 0.6 |
| Training samples | 55M | 12.5M |
Binary predictions are averaged, while affinity values undergo molecular weight correction:
\[\hat{y} = C_0 \cdot (y_1 + y_2) + C_1 \cdot \mathrm{MW}_\mathrm{binder} + C_2\]

where the constants $C_0$, $C_1$, and $C_2$ are fitted on a holdout validation set.
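The correction is a direct linear transform of the ensemble output. In this sketch the default coefficients are placeholders, since the fitted constants are not reported here (c0=0.5 reduces to a plain average and c1=c2=0 disables the molecular-weight term):

```python
def corrected_affinity(y1, y2, mw_binder, c0=0.5, c1=0.0, c2=0.0):
    """Ensemble affinity with molecular-weight correction.

    y1, y2: outputs of the two affinity models; mw_binder: ligand molecular
    weight. c0, c1, c2 stand in for the constants fitted on a holdout set;
    the defaults here are placeholders, not the published values.
    """
    return c0 * (y1 + y2) + c1 * mw_binder + c2
```

The molecular-weight term compensates for the tendency of larger ligands to score artificially well, a bias familiar from classical docking scores.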
On the FEP+ 4-target benchmark (CDK2, TYK2, JNK1, P38):
On the CASP16 blind affinity challenge:
For hit discovery on MF-PCBA:
Training proceeds through four stages with increasing crop sizes:
| Stage | Learning Rate | Crop Size | Steps | MD Data | Distillation |
|---|---|---|---|---|---|
| 1 | 1e-3 | 384 | 88k | No | Yes |
| 2 | 5e-4 | 512 | 4k | Yes | Yes |
| 3 | 5e-4 | 640 | 4k | Yes | Yes |
| 4 | 5e-4 | 768 | 1k | No | No |
The final stage uses only PDB data (cutoff: 2023-06-01) to maintain the highest data quality. The model is trained with a diffusion multiplicity of 32 samples per example, and affinity module training uses 128 A100 GPUs.
Boltz-2’s training extends beyond the PDB to include:
Molecular dynamics ensembles:
100 frames are uniformly sampled from trajectories for ensemble supervision.
Self-distillation datasets:
A novel addition is B-factor prediction—the trunk’s single representation is supervised to predict each token’s B-factor. For MD structures, B-factors are computed from RMSF values:
\[B = \frac{8\pi^2}{3} \, \mathrm{RMSF}^2\]

This supervision specifically targets local structural dynamics and improves the model’s understanding of conformational flexibility.
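The conversion itself is a one-liner; RMSF is in angstroms and B in square angstroms:

```python
import math

def bfactor_from_rmsf(rmsf):
    """Isotropic B-factor from root-mean-square fluctuation, per the
    standard relation B = (8 * pi^2 / 3) * RMSF^2."""
    return (8.0 * math.pi**2 / 3.0) * rmsf**2
```

For example, an RMSF of 1 angstrom corresponds to a B-factor of about 26.3, a typical value for a moderately flexible residue.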
Affinity training occurs separately with gradients detached from the structure trunk. The pipeline incorporates:
The loss function combines:
Censor-aware supervision handles inequality qualifiers (e.g., “>”) appropriately, treating them as bounds rather than exact measurements.
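One common way to implement censor-aware supervision is a hinged squared error that only penalizes predictions on the wrong side of a censored bound. This is an illustrative form, not necessarily the paper's exact loss:

```python
def censored_mse(pred, value, qualifier="="):
    """Illustrative censor-aware regression loss.

    For an exact measurement the loss is squared error; for a censored
    measurement (qualifier ">" or "<") the value is treated as a bound and
    the prediction is only penalized when it falls on the wrong side of it.
    """
    if qualifier == ">":      # true value lies above `value`
        err = max(0.0, value - pred)
    elif qualifier == "<":    # true value lies below `value`
        err = max(0.0, pred - value)
    else:                     # exact measurement
        err = pred - value
    return err**2
```

Treating qualifiers this way lets the model learn from the large fraction of assay data reported only as "inactive above threshold" without forcing it to reproduce the threshold as an exact value.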
| Component | AlphaFold3 | Boltz-2 |
|---|---|---|
| Pairformer blocks | 48 | 64 |
| MSA module blocks | 4 | Reordered operations |
| Pair representation dim | 128 | 128 |
| Single representation dim | 384 | 384 |
| Max crop size | 768 | 768 |
| Templates | Monomeric only | Multimeric supported |
| Method conditioning | No | Yes |
| B-factor prediction | No | Yes |
| Affinity prediction | No | Yes |
| Physical steering | No | Yes (Boltz-2x) |
AlphaFold3 uses 4 Pairformer layers for confidence prediction; Boltz-2 uses 8 Pairformer layers but adopts a simpler architecture than Boltz-1's expensive 48-layer confidence trunk. A key innovation is separating PDE and PAE prediction into two heads: one for intra-chain pairs and one for inter-chain pairs.
On recent PDB structures (2024-2025):
On the Polaris-ASAP ligand pose competition (SARS-CoV-2/MERS-CoV proteases):
AlphaFold3:
Boltz-2:
Boltz-2 enables structure-based virtual screening at unprecedented scale. In a prospective evaluation against TYK2:
Fixed library screening (Enamine Hit Locator Library, 460k compounds):
Generative screening (SynFlowNet + Enamine REAL 76B space):
The combined Boltz-2 + SynFlowNet workflow demonstrates an effective de novo binder generation pipeline validated through ABFE simulations.
Despite significant advances, several limitations remain:
Molecular dynamics: While improved over Boltz-1, ensemble diversity metrics still lag behind specialized models like BioEmu and AlphaFlow. The MD dataset was only introduced in later training stages.
Structure prediction: Performance does not significantly exceed predecessors due to similar training data and architecture. Large conformational changes induced by binding remain challenging.
Affinity prediction dependencies: Accurate affinity prediction requires correct pocket identification and binding interface reconstruction. Performance varies substantially across assays (Pearson R ranging from 0.06 to 0.73), suggesting target-specific applicability.
Cofactor handling: The current affinity module does not explicitly handle cofactors (ions, water, multimeric binding partners) that may be essential for certain binding interactions.
Boltz-2 represents a significant step toward integrated structure-affinity prediction for drug discovery. By combining structural co-folding capabilities with FEP-competitive binding affinity prediction, extensive controllability features, and physical quality enforcement, Boltz-2 provides a foundation for computational drug discovery workflows. The open release of weights, inference code, and training pipelines positions Boltz-2 as an extensible platform for the computational structural biology community.
The key innovations—affinity prediction approaching FEP accuracy at 1000× the speed, multimeric template support, experimental method conditioning, and Boltz-steering for physical plausibility—address critical gaps between structure prediction and practical drug discovery applications. As training data expands and architectural refinements continue, models like Boltz-2 may increasingly complement or replace expensive physics-based simulations in early-stage drug discovery.
Primary Reference: Saro Passaro, Gabriele Corso, Jeremy Wohlwend, et al. “Boltz-2: Towards Accurate and Efficient Binding Affinity Prediction.” bioRxiv, 2025.