An Overview of Boltz-2

A Foundation Model for Biomolecular Structure and Binding Affinity Prediction

The intersection of deep learning and structural biology has reached a new inflection point with Boltz-2, a foundation model that simultaneously predicts biomolecular complex structures and binding affinities at speeds 1000× faster than free energy perturbation (FEP) methods. Developed by researchers at MIT’s Jameel Clinic and Valence Labs/Recursion, Boltz-2 represents a significant evolution from its predecessor and introduces the first AI model to approach FEP accuracy for small molecule-protein binding affinity prediction. Unlike AlphaFold3, Boltz-2 is released under an MIT license with full training code and weights, making it immediately accessible for both academic and industrial applications.

This technical deep-dive explores Boltz-2’s architectural innovations, training methodology, and how it compares to the current state-of-the-art in biomolecular structure prediction.


Architectural foundation: Building on AlphaFold3’s paradigm

Boltz-2’s architecture comprises four main components: the trunk, the denoising module with steering components, the confidence module, and the novel affinity module. The overall framework follows the AlphaFold3 paradigm of combining transformer-based representation learning with diffusion-based structure prediction, but introduces several key modifications.

Figure 2 (from the paper): Boltz-2 model architecture diagram, showing the trunk, denoising module, confidence module, and affinity module components with their interconnections.

1. The trunk module

The trunk module processes input sequences and generates pair representations that encode information about biomolecular interactions. Unlike AlphaFold3’s 48 Pairformer blocks, Boltz-2 uses 64 Pairformer layers, representing a significant increase in model capacity. Each Pairformer block contains triangle multiplicative updates (outgoing and incoming edges), triangle attention operations, single attention with pair bias, and transition blocks using SwiGLU activation.

A major computational advancement in Boltz-2 is the use of mixed-precision training (bfloat16) for the majority of the trunk, combined with custom trifast kernels for triangular attention operations. These optimizations enable scaling the training crop size to 768 tokens, matching AlphaFold3’s capacity while maintaining computational tractability. The pair representation maintains 128 channels throughout processing, with a single representation of 384 channels.

The trunk also includes a template module similar to AlphaFold3’s implementation, with 64-dimensional template pairwise representations processed through 2 template blocks. However, Boltz-2 extends template functionality to support multimeric templates—a departure from previous approaches that only allowed single-chain templates.

2. MSA module reordering

Building on innovations from Boltz-1, Boltz-2 retains a modified MSA module whose operation order differs from AlphaFold3's.

This reordering allows single representations from MSATransition to propagate directly to the pair representation, improving information flow. The model supports up to 8,192 MSA sequences during training and employs a novel MSA sampling strategy where sequences are randomly sampled from the top 16k hits rather than greedy selection, promoting robustness to low-quality MSAs. Additionally, 5% of training iterations randomly drop all MSA data to improve single-sequence prediction capabilities.
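The sampling strategy described above can be sketched in a few lines; the function name and pool handling are illustrative, not the repository's implementation:

```python
import random

def sample_msa(hits, max_seqs=8192, pool=16384, rng=None):
    """Randomly subsample an MSA from the top `pool` hits instead of
    greedily taking the best-ranked sequences, as described above.
    `hits` is a ranked list of aligned sequences."""
    rng = rng or random.Random(0)
    candidates = hits[:pool]          # restrict to the top-16k hits
    if len(candidates) <= max_seqs:
        return candidates             # small MSA: keep everything
    return rng.sample(candidates, max_seqs)  # uniform sample, no replacement
```

Because the sample is uniform over the pool rather than rank-ordered, the model sees shallow and noisy MSAs during training, which is what promotes robustness to low-quality alignments at inference time.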

3. Tokenization and featurization

Boltz-2’s tokenization scheme assigns one token per standard amino acid and nucleotide, with a key departure from AlphaFold3, Chai-1, and Boltz-1: non-canonical amino acids and nucleotides are kept as single tokens rather than being tokenized at the atomic level. This simplification reduces sequence length while maintaining biological relevance.
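A toy illustration of the difference; the residue vocabulary here is a small stand-in, and real tokenization also covers nucleotides and ligand atoms:

```python
# Toy subset of the 20 standard amino acids; the real vocabulary is larger.
CANONICAL = {"ALA", "GLY", "SER", "LYS"}

def tokenize(residues):
    """One token per residue. Non-canonical residues (e.g. phosphoserine,
    'SEP') remain single tokens rather than being expanded into per-atom
    tokens as in AlphaFold3, Chai-1, and Boltz-1."""
    return [(name, "standard" if name in CANONICAL else "non-canonical")
            for name in residues]
```

The sequence length therefore stays equal to the residue count even for modified residues, instead of ballooning to one token per atom.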

Boltz-2 also introduces several new input features that were not present in Boltz-1.


The denoising module: Diffusion-based structure prediction

Boltz-2 inherits the diffusion-based structure prediction approach established in AlphaFold3, where atomic coordinates are predicted through iterative denoising of randomly initialized positions. The denoising module operates at two resolution levels: atoms and tokens.

1. Diffusion architecture

The structure module uses an atom-level transformer that processes local neighborhoods—32-atom blocks attending to the closest 128 atoms—enabling efficient handling of large complexes. The denoising module maintains float32 precision due to observed instabilities at lower precision levels, contrasting with the trunk’s bfloat16 operations.

Key diffusion hyperparameters include:

Parameter Value
sigma_min 0.0001
rho 7
gamma_0 0.8
gamma_min 1.0
noise_scale 1.003
step_scale 1.5

Default inference uses 200 sampling steps, 10 recycling iterations, and generates 5 output samples. Runtime averages 40-60 seconds per protein-ligand prediction, scaling quadratically with sequence length.
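The hyperparameters above imply a Karras/EDM-style noise schedule. The sketch below assumes `sigma_max = 160`, a value carried over from AlphaFold3's published defaults rather than stated in the table:

```python
def edm_noise_schedule(num_steps=200, sigma_min=1e-4, sigma_max=160.0, rho=7):
    """EDM-style noise schedule used by AF3-style diffusion samplers.

    Interpolates between sigma_max and sigma_min in sigma**(1/rho) space,
    which concentrates sampling steps at low noise levels.
    """
    sigmas = []
    for i in range(num_steps):
        t = i / (num_steps - 1)
        sigma = (sigma_max ** (1 / rho)
                 + t * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
        sigmas.append(sigma)
    return sigmas
```

With `rho = 7`, most of the 200 steps are spent at small sigma, where fine structural details are resolved.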

2. Boltz-steering for physical quality

A significant challenge for co-folding models—including AlphaFold3, Chai-1, and Boltz-1—is the production of structures with physical inaccuracies such as steric clashes and incorrect stereochemistry. Boltz-2 addresses this through Boltz-steering, an inference-time technique that applies physics-based potentials during reverse diffusion.

Steering potentials penalize the failure modes described above: steric clashes between non-bonded atoms and incorrect ligand stereochemistry.

When enabled (producing “Boltz-2x”), 97% of predicted poses pass physical quality checks compared to only 43% without steering.


Novel controllability features

Boltz-2 introduces three major controllability mechanisms responding to user demand for hypothesis testing without costly retraining.

1. Experimental method conditioning

The model is trained on structures from diverse experimental methods and can condition predictions on the desired output type, such as X-ray crystallography, cryo-EM, NMR, or molecular dynamics ensembles.

Method conditioning is implemented through one-hot encoding in the single token representation, allowing the model to produce structures matching the characteristic distributions of different experimental techniques.

Figure 5 (from the paper): MD conditioning results showing RMSF correlations for mdCATH and ATLAS datasets, demonstrating the effect of MD vs. X-ray conditioning.

2. Template conditioning and steering

Unlike AlphaFold3 and Chai-1, which only support monomeric templates, Boltz-2 enables multimeric templates by grouping template hits by PDB ID. During training, 0-4 templates are sampled per chain from the top 20 template hits. For users requiring strict template adherence, a steering potential enforces that structures remain within a user-specified distance cutoff from the template:

\[E_\mathrm{template}(x) = \sum_{i \in S_\mathrm{template\,atoms}} \max\left(\lVert x_i - x^\mathrm{ref}_i \rVert - \alpha_\mathrm{cutoff},\, 0\right)\]

where $x^\mathrm{ref}_i$ is the position of reference atom $i$ after aligning the template to predicted coordinates.
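In code, the per-atom hinge penalty reads as follows (a minimal sketch; coordinates are assumed already aligned, and the default cutoff is illustrative):

```python
import math

def template_potential(coords, ref_coords, cutoff=2.0):
    """Template steering energy from the formula above: each template atom
    pays a linear penalty once it drifts more than `cutoff` angstroms from
    its aligned reference position; atoms within the cutoff cost nothing."""
    energy = 0.0
    for (x, y, z), (rx, ry, rz) in zip(coords, ref_coords):
        dist = math.sqrt((x - rx) ** 2 + (y - ry) ** 2 + (z - rz) ** 2)
        energy += max(dist - cutoff, 0.0)  # hinge: free inside the cutoff
    return energy
```

The hinge form means the potential exerts no force on atoms already within the allowed deviation, so steering only activates when the prediction drifts away from the template.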

Contact and pocket conditioning

Users can specify distance constraints between tokens through contact and pocket conditioning, encoded as pairwise features. Contact types include: no restraint, pocket-to-binder relationship, binder-to-pocket relationship, and contact relationship. Distance constraints range from 4Å to 20Å, encoded through normalized distance and Fourier embeddings with fixed random bases.
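A sketch of this featurization under stated assumptions: the feature dimension, the clamping behavior, and the exact sinusoidal form with fixed random bases are illustrative guesses, not the paper's definition:

```python
import math
import random

def embed_distance(distance, num_feats=8, d_min=4.0, d_max=20.0):
    """Encode a distance constraint: clamp into the 4-20 angstrom range,
    normalize to [0, 1], then project through sinusoids whose frequencies
    are drawn once from a fixed seed (so features are reproducible)."""
    rng = random.Random(0)  # fixed random bases
    t = (min(max(distance, d_min), d_max) - d_min) / (d_max - d_min)
    bases = [rng.gauss(0.0, 1.0) for _ in range(num_feats)]
    return [math.sin(2 * math.pi * b * t) for b in bases]
```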

A time-dependent steering potential enforces these constraints:

\[E^t_{\mathrm{Contact}(A,B)}(x) = \frac{\sum_{i \in A,\, j \in B} \exp\!\big(-\lambda^t_\mathrm{union} \cdot \max(\lVert x_i - x_j \rVert - r_{AB}, 0)\big) \cdot \max(\lVert x_i - x_j \rVert - r_{AB}, 0)}{\sum_{i \in A,\, j \in B} \exp\!\big(-\lambda^t_\mathrm{union} \cdot \max(\lVert x_i - x_j \rVert - r_{AB}, 0)\big)}\]

where $λ^t_\mathrm{union}$ increases monotonically as $t$ approaches 0, progressively tightening constraint enforcement.
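The formula is a soft minimum over pair violations: each candidate pair is weighted by exp(-λ·violation), so as λ grows the best-satisfied pair dominates and the energy vanishes once any pair meets the constraint. A minimal sketch:

```python
import math

def contact_potential(pair_dists, r_ab, lam):
    """Time-dependent contact energy from the formula above.

    pair_dists: distances ||x_i - x_j|| for candidate pairs (i in A, j in B)
    r_ab:       distance cutoff of the constraint
    lam:        lambda_union^t, increased as diffusion time t -> 0
    """
    violations = [max(d - r_ab, 0.0) for d in pair_dists]
    weights = [math.exp(-lam * v) for v in violations]
    # Softmin: weighted average dominated by the smallest violation.
    return sum(w * v for w, v in zip(weights, violations)) / sum(weights)
```

At small λ (early in sampling) the potential averages over all pairs and steers gently; at large λ it only demands that at least one pair satisfy the contact.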


The affinity module: Approaching FEP accuracy

The most significant innovation in Boltz-2 is its binding affinity prediction capability—the first AI model to approach FEP accuracy while being orders of magnitude faster.

1. Architecture design

The affinity module operates on Boltz-2’s structural predictions, processing the pair representation and predicted coordinates after 5 recycling iterations. The architecture consists of:

  1. Initialization: LinearNoBias layers initialize single and pair representations from trunk outputs
  2. Distogram conditioning: Predicted inter-token distances are one-hot encoded and added to pair representations
  3. PairFormer processing: 4-8 PairFormer layers process interactions, masked to focus exclusively on protein-ligand and intra-ligand interactions
  4. Mean pooling: Aggregation over interaction pairs produces a scalar representation
  5. Output heads: Two MLP heads predict binding likelihood (classification) and affinity value (regression)

The final predictions are thus a binding likelihood, expressed as the probability that the ligand is a true binder, and a continuous binding affinity value.

2. Ensemble strategy

Boltz-2 employs two affinity models with different hyperparameters for ensemble robustness:

Parameter Model 1 Model 2
PairFormer layers 8 4
$λ_\mathrm{focal}$ 0.8 0.6
Training samples 55M 12.5M

Binary predictions are averaged, while affinity values undergo molecular weight correction:

\[\hat{y} = C_0 \cdot (y_1 + y_2) + C_1 \cdot \mathrm{MW}_\mathrm{binder} + C_2\]

where constants are fitted on a holdout validation set.
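As a sketch, with placeholder constants (the paper fits C0-C2 on a holdout validation set; the defaults below simply reduce to a plain average):

```python
def ensemble_affinity(y1, y2, mol_weight, c0=0.5, c1=0.0, c2=0.0):
    """Combine the two affinity heads with a molecular-weight correction,
    following the formula above. c0=0.5 with c1=c2=0 is a plain average;
    the real constants are fitted on held-out data."""
    return c0 * (y1 + y2) + c1 * mol_weight + c2
```

The molecular-weight term corrects for the tendency of learned affinity models to score heavier ligands more favorably regardless of their true potency.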

Figure 1 (from the paper): Comparison of the accuracy/speed tradeoff of Boltz-2 against FEP+, ABFE, OpenFE, and ML baselines on the 4-target protein-ligand benchmark.

3. Benchmark performance

Boltz-2 was evaluated on the FEP+ 4-target benchmark (CDK2, TYK2, JNK1, P38), on the CASP16 blind affinity challenge, and on hit discovery with MF-PCBA; per-benchmark numbers are reported in the paper (see Figure 6).


Training methodology

1. Structure training

Training proceeds through four stages with increasing crop sizes:

Stage Learning Rate Crop Size Steps MD Data Distillation
1 1e-3 384 88k No Yes
2 5e-4 512 4k Yes Yes
3 5e-4 640 4k Yes Yes
4 5e-4 768 1k No No

The final stage uses only PDB data (cutoff: 2023-06-01) to maintain highest quality. The model is trained with a diffusion multiplicity of 32 samples per example and uses 128 A100 GPUs for affinity module training.

2. Extended training data

Boltz-2’s training extends beyond the PDB to include:

Molecular dynamics ensembles (from the mdCATH and ATLAS datasets):

100 frames are uniformly sampled from trajectories for ensemble supervision.

Self-distillation datasets: high-confidence model predictions reused as additional training structures.

3. B-factor supervision

A novel addition is B-factor prediction—the trunk’s single representation is supervised to predict each token’s B-factor. For MD structures, B-factors are computed from RMSF values:

\[B = \frac{8\pi^2}{3}\,\mathrm{RMSF}^2\]

This supervision specifically targets local structural dynamics and improves the model’s understanding of conformational flexibility.
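The conversion is a one-liner, assuming RMSF is given in angstroms:

```python
import math

def rmsf_to_bfactor(rmsf):
    """Convert per-token RMSF (angstroms) to a crystallographic-style
    B-factor (square angstroms) via B = (8 * pi^2 / 3) * RMSF^2."""
    return (8 * math.pi ** 2 / 3) * rmsf ** 2
```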

4. Affinity training

Affinity training occurs separately with gradients detached from the structure trunk. The pipeline incorporates:

  1. Pocket pre-processing: For each target, 10 random binders are predicted to identify consensus binding sites
  2. Affinity cropping: Up to 256 tokens (200 protein maximum) around the binding pocket
  3. Feature pre-processing: Trunk representations and coordinates cached to reduce training overhead
  4. Activity cliff sampling: Assays weighted by interquartile range (IQR) of affinity values to prioritize informative examples
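Step 4 can be sketched as follows; the function, the data layout, and the use of `statistics.quantiles` for the IQR are illustrative assumptions, not the paper's code:

```python
import random
import statistics

def iqr_weighted_sample(assays, k, seed=0):
    """Draw `k` assay ids with probability proportional to the interquartile
    range of each assay's affinity values, so assays that discriminate
    strongly between ligands (activity cliffs) are sampled more often.

    assays: dict mapping assay id -> list of affinity measurements
    """
    rng = random.Random(seed)
    ids = list(assays)
    weights = []
    for a in ids:
        q1, _, q3 = statistics.quantiles(assays[a], n=4)
        weights.append(q3 - q1)  # interquartile range as sampling weight
    return rng.choices(ids, weights=weights, k=k)
```

An assay whose measurements are all identical carries no information about relative potency and, with zero IQR, is never drawn.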

The loss function combines a focal classification loss on the binary binder label with a regression loss on the measured affinity values.

Censor-aware supervision handles inequality qualifiers (e.g., “>”) appropriately, treating them as bounds rather than exact measurements.

Figure 6 (from the paper): Pearson correlation comparison across affinity benchmarks showing Boltz-2 performance on FEP+ subsets, OpenFE, internal targets, and CASP16.

Comparison with AlphaFold3

1. Architectural differences

Component AlphaFold3 Boltz-2
Pairformer blocks 48 64
MSA module blocks 4 Reordered operations
Pair representation dim 128 128
Single representation dim 384 384
Max crop size 768 768
Templates Monomeric only Multimeric supported
Method conditioning No Yes
B-factor prediction No Yes
Affinity prediction No Yes
Physical steering No Yes (Boltz-2x)

2. Confidence module

AlphaFold3 uses 4 PairFormer layers for confidence prediction; Boltz-2 uses 8 PairFormer layers but adopts a simpler architecture than Boltz-1’s expensive 48-layer confidence trunk. A key innovation is separating PDE and PAE prediction into two heads—one for intra-chain pairs and one for inter-chain pairs.
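The intra-/inter-chain split can be expressed as a simple pair mask; a minimal sketch (in practice the masks are boolean tensors over token pairs):

```python
def split_pair_masks(chain_ids):
    """Route each token pair (i, j) to the intra-chain confidence head when
    both tokens belong to the same chain, otherwise to the inter-chain head,
    mirroring the separated PDE/PAE heads described above."""
    n = len(chain_ids)
    intra = [[chain_ids[i] == chain_ids[j] for j in range(n)] for i in range(n)]
    inter = [[not intra[i][j] for j in range(n)] for i in range(n)]
    return intra, inter
```

Separating the heads lets the model calibrate interface error estimates independently of the typically easier intra-chain geometry.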

3. Performance comparison

Figure 3 (from the paper): Structure prediction benchmark figure comparing Boltz-2, AlphaFold3, Chai-1, Protenix, and Boltz-1 across different complex types.

On recent PDB structures (2024-2025), Boltz-2 performs on par with AlphaFold3 and the other leading co-folding models (Figure 3).

Figure 4 (from the paper): Antibody benchmark and Polaris-ASAP competition results.

Results on the Polaris-ASAP ligand pose competition (SARS-CoV-2/MERS-CoV proteases) are shown in Figure 4.

4. Design philosophy differences

AlphaFold3: closed development, restrictive licensing, and no public training code.

Boltz-2: fully open MIT-licensed release of weights, inference code, and training pipeline, intended as an extensible community platform.


Practical applications: Virtual screening at scale

Boltz-2 enables structure-based virtual screening at unprecedented scale. In a prospective evaluation against TYK2, two strategies were tested: fixed-library screening of the Enamine Hit Locator Library (460k compounds) and generative screening with SynFlowNet over the Enamine REAL space of 76 billion compounds.

The combined Boltz-2 + SynFlowNet workflow demonstrates an effective de novo binder generation pipeline validated through ABFE simulations.

Figure 4 (from the paper): TYK2 virtual screening results showing correlation between Boltz-2 screen scores and ABFE readouts, plus distribution across screening strategies.

Current limitations and future directions

Despite significant advances, several limitations remain:

Molecular dynamics: While improved over Boltz-1, ensemble diversity metrics still lag behind specialized models like BioEmu and AlphaFlow. The MD dataset was only introduced in later training stages.

Structure prediction: Performance does not significantly exceed predecessors due to similar training data and architecture. Large conformational changes induced by binding remain challenging.

Affinity prediction dependencies: Accurate affinity prediction requires correct pocket identification and binding interface reconstruction. Performance varies substantially across assays (Pearson R ranging from 0.06 to 0.73), suggesting target-specific applicability.

Cofactor handling: The current affinity module does not explicitly handle cofactors (ions, water, multimeric binding partners) that may be essential for certain binding interactions.


Conclusions

Boltz-2 represents a significant step toward integrated structure-affinity prediction for drug discovery. By combining structural co-folding capabilities with FEP-competitive binding affinity prediction, extensive controllability features, and physical quality enforcement, Boltz-2 provides a foundation for computational drug discovery workflows. The open release of weights, inference code, and training pipelines positions Boltz-2 as an extensible platform for the computational structural biology community.

The key innovations—affinity prediction approaching FEP accuracy at 1000× the speed, multimeric template support, experimental method conditioning, and Boltz-steering for physical plausibility—address critical gaps between structure prediction and practical drug discovery applications. As training data expands and architectural refinements continue, models like Boltz-2 may increasingly complement or replace expensive physics-based simulations in early-stage drug discovery.


Primary Reference: Saro Passaro, Gabriele Corso, Jeremy Wohlwend, et al. Boltz-2: Towards Accurate and Efficient Binding Affinity Prediction. bioRxiv, 2025.

DOI: https://doi.org/10.1101/2025.06.14.659707

GitHub: https://github.com/jwohlwend/boltz.git
