An Overview of Boltz-2

A Foundation Model for Biomolecular Structure and Binding Affinity Prediction

The intersection of deep learning and structural biology has reached a new inflection point with Boltz-2, a foundation model that simultaneously predicts biomolecular complex structures and binding affinities at speeds 1000× faster than free energy perturbation (FEP) methods. Developed by researchers at MIT’s Jameel Clinic and Valence Labs/Recursion, Boltz-2 represents a significant evolution from its predecessor and introduces the first AI model to approach FEP accuracy for small molecule-protein binding affinity prediction. Unlike AlphaFold3, Boltz-2 is released under an MIT license with full training code and weights, making it immediately accessible for both academic and industrial applications.

This technical deep-dive explores Boltz-2’s architectural innovations, training methodology, and how it compares to the current state-of-the-art in biomolecular structure prediction.


Architectural foundation: Building on AlphaFold3’s paradigm

Boltz-2’s architecture comprises four main components: the trunk, the denoising module with steering components, the confidence module, and the novel affinity module. The overall framework follows the AlphaFold3 paradigm of combining transformer-based representation learning with diffusion-based structure prediction, but introduces several key modifications.

Figure 2 (from the paper): Boltz-2 model architecture diagram, showing the trunk, denoising module, confidence module, and affinity module components with their interconnections.

1. The trunk module

The trunk module processes input sequences and generates pair representations that encode information about biomolecular interactions. Unlike AlphaFold3’s 48 Pairformer blocks, Boltz-2 uses 64 Pairformer layers, representing a significant increase in model capacity. Each Pairformer block contains triangle multiplicative updates (outgoing and incoming edges), triangle attention operations, single attention with pair bias, and transition blocks using SwiGLU activation.

A major computational advancement in Boltz-2 is the use of mixed-precision training (bfloat16) for the majority of the trunk, combined with custom trifast kernels for triangular attention operations. These optimizations enable scaling the training crop size to 768 tokens, matching AlphaFold3’s capacity while maintaining computational tractability. The pair representation maintains 128 channels throughout processing, with a single representation of 384 channels.

The trunk also includes a template module similar to AlphaFold3’s implementation, with 64-dimensional template pairwise representations processed through 2 template blocks. However, Boltz-2 extends template functionality to support multimeric templates—a departure from previous approaches that only allowed single-chain templates.

2. MSA module reordering

Building on innovations from Boltz-1, Boltz-2 retains a modified MSA module whose operation order differs from AlphaFold3's.

This reordering allows single representations from MSATransition to propagate directly to the pair representation, improving information flow. The model supports up to 8,192 MSA sequences during training and employs a novel MSA sampling strategy where sequences are randomly sampled from the top 16k hits rather than greedy selection, promoting robustness to low-quality MSAs. Additionally, 5% of training iterations randomly drop all MSA data to improve single-sequence prediction capabilities.
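The sampling strategy described above can be sketched in a few lines; the function name and pool handling are illustrative, not the repository's implementation:

```python
import random

def sample_msa(hits, max_seqs=8192, pool=16384, rng=None):
    """Randomly subsample an MSA from the top `pool` hits instead of
    greedily taking the best-ranked sequences, as described above.
    `hits` is a ranked list of aligned sequences."""
    rng = rng or random.Random(0)
    candidates = hits[:pool]          # restrict to the top-16k hits
    if len(candidates) <= max_seqs:
        return candidates             # small MSA: keep everything
    return rng.sample(candidates, max_seqs)  # uniform sample, no replacement
```

Because the sample is uniform over the pool rather than rank-ordered, the model sees shallow and noisy MSAs during training, which is what promotes robustness to low-quality alignments at inference time.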

3. Tokenization and featurization

Boltz-2’s tokenization scheme assigns one token per standard amino acid and nucleotide, with a key departure from AlphaFold3, Chai-1, and Boltz-1: non-canonical amino acids and nucleotides are kept as single tokens rather than being tokenized at the atomic level. This simplification reduces sequence length while maintaining biological relevance.
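A toy illustration of the difference; the residue vocabulary here is a small stand-in, and real tokenization also covers nucleotides and ligand atoms:

```python
# Toy subset of the 20 standard amino acids; the real vocabulary is larger.
CANONICAL = {"ALA", "GLY", "SER", "LYS"}

def tokenize(residues):
    """One token per residue. Non-canonical residues (e.g. phosphoserine,
    'SEP') remain single tokens rather than being expanded into per-atom
    tokens as in AlphaFold3, Chai-1, and Boltz-1."""
    return [(name, "standard" if name in CANONICAL else "non-canonical")
            for name in residues]
```

The sequence length therefore stays equal to the residue count even for modified residues, instead of ballooning to one token per atom.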

Boltz-2 also introduces several new input features that were not present in Boltz-1.


The denoising module: Diffusion-based structure prediction

Boltz-2 inherits the diffusion-based structure prediction approach established in AlphaFold3, where atomic coordinates are predicted through iterative denoising of randomly initialized positions. The denoising module operates at two resolution levels: atoms and tokens.

1. Diffusion architecture

The structure module uses an atom-level transformer that processes local neighborhoods—32-atom blocks attending to the closest 128 atoms—enabling efficient handling of large complexes. The denoising module maintains float32 precision due to observed instabilities at lower precision levels, contrasting with the trunk’s bfloat16 operations.

Key diffusion hyperparameters include:

Parameter Value
sigma_min 0.0001
rho 7
gamma_0 0.8
gamma_min 1.0
noise_scale 1.003
step_scale 1.5

Default inference uses 200 sampling steps, 10 recycling iterations, and generates 5 output samples. Runtime averages 40-60 seconds per protein-ligand prediction, scaling quadratically with sequence length.
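The hyperparameters above imply a Karras/EDM-style noise schedule. The sketch below assumes `sigma_max = 160`, a value carried over from AlphaFold3's published defaults rather than stated in the table:

```python
def edm_noise_schedule(num_steps=200, sigma_min=1e-4, sigma_max=160.0, rho=7):
    """EDM-style noise schedule used by AF3-style diffusion samplers.

    Interpolates between sigma_max and sigma_min in sigma**(1/rho) space,
    which concentrates sampling steps at low noise levels.
    """
    sigmas = []
    for i in range(num_steps):
        t = i / (num_steps - 1)
        sigma = (sigma_max ** (1 / rho)
                 + t * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
        sigmas.append(sigma)
    return sigmas
```

With `rho = 7`, most of the 200 steps are spent at small sigma, where fine structural details are resolved.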

2. Boltz-steering for physical quality

A significant challenge for co-folding models—including AlphaFold3, Chai-1, and Boltz-1—is the production of structures with physical inaccuracies such as steric clashes and incorrect stereochemistry. Boltz-2 addresses this through Boltz-steering, an inference-time technique that applies physics-based potentials during reverse diffusion.

Steering potentials penalize the failure modes described above: steric clashes between non-bonded atoms and incorrect ligand stereochemistry.

When enabled (producing “Boltz-2x”), 97% of predicted poses pass physical quality checks compared to only 43% without steering.


Novel controllability features

Boltz-2 introduces three major controllability mechanisms responding to user demand for hypothesis testing without costly retraining.

1. Experimental method conditioning

The model is trained on structures from diverse experimental methods and can condition predictions on the desired output type, such as X-ray crystallography, cryo-EM, NMR, or molecular dynamics ensembles.

Method conditioning is implemented through one-hot encoding in the single token representation, allowing the model to produce structures matching the characteristic distributions of different experimental techniques.

Figure 5 (from the paper): MD conditioning results showing RMSF correlations for mdCATH and ATLAS datasets, demonstrating the effect of MD vs. X-ray conditioning.

2. Template conditioning and steering

Unlike AlphaFold3 and Chai-1, which only support monomeric templates, Boltz-2 enables multimeric templates by grouping template hits by PDB ID. During training, 0-4 templates are sampled per chain from the top 20 template hits. For users requiring strict template adherence, a steering potential enforces that structures remain within a user-specified distance cutoff from the template:

\[E_\mathrm{template}(x) = \sum_{i \in S_\mathrm{template\,atoms}} \max\left(\lVert x_i - x^\mathrm{ref}_i \rVert - \alpha_\mathrm{cutoff},\, 0\right)\]

where $x^\mathrm{ref}_i$ is the position of reference atom $i$ after aligning the template to predicted coordinates.
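In code, the per-atom hinge penalty reads as follows (a minimal sketch; coordinates are assumed already aligned, and the default cutoff is illustrative):

```python
import math

def template_potential(coords, ref_coords, cutoff=2.0):
    """Template steering energy from the formula above: each template atom
    pays a linear penalty once it drifts more than `cutoff` angstroms from
    its aligned reference position; atoms within the cutoff cost nothing."""
    energy = 0.0
    for (x, y, z), (rx, ry, rz) in zip(coords, ref_coords):
        dist = math.sqrt((x - rx) ** 2 + (y - ry) ** 2 + (z - rz) ** 2)
        energy += max(dist - cutoff, 0.0)  # hinge: free inside the cutoff
    return energy
```

The hinge form means the potential exerts no force on atoms already within the allowed deviation, so steering only activates when the prediction drifts away from the template.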

Contact and pocket conditioning

Users can specify distance constraints between tokens through contact and pocket conditioning, encoded as pairwise features. Contact types include: no restraint, pocket-to-binder relationship, binder-to-pocket relationship, and contact relationship. Distance constraints range from 4Å to 20Å, encoded through normalized distance and Fourier embeddings with fixed random bases.
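A sketch of this featurization under stated assumptions: the feature dimension, the clamping behavior, and the exact sinusoidal form with fixed random bases are illustrative guesses, not the paper's definition:

```python
import math
import random

def embed_distance(distance, num_feats=8, d_min=4.0, d_max=20.0):
    """Encode a distance constraint: clamp into the 4-20 angstrom range,
    normalize to [0, 1], then project through sinusoids whose frequencies
    are drawn once from a fixed seed (so features are reproducible)."""
    rng = random.Random(0)  # fixed random bases
    t = (min(max(distance, d_min), d_max) - d_min) / (d_max - d_min)
    bases = [rng.gauss(0.0, 1.0) for _ in range(num_feats)]
    return [math.sin(2 * math.pi * b * t) for b in bases]
```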

A time-dependent steering potential enforces these constraints:

\[E^t_{\mathrm{Contact}(A,B)}(x) = \frac{\sum_{i \in A,\, j \in B} \exp\!\big(-\lambda^t_\mathrm{union} \cdot \max(\lVert x_i - x_j \rVert - r_{AB}, 0)\big) \cdot \max(\lVert x_i - x_j \rVert - r_{AB}, 0)}{\sum_{i \in A,\, j \in B} \exp\!\big(-\lambda^t_\mathrm{union} \cdot \max(\lVert x_i - x_j \rVert - r_{AB}, 0)\big)}\]

where $λ^t_\mathrm{union}$ increases monotonically as $t$ approaches 0, progressively tightening constraint enforcement.
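The formula is a soft minimum over pair violations: each candidate pair is weighted by exp(-λ·violation), so as λ grows the best-satisfied pair dominates and the energy vanishes once any pair meets the constraint. A minimal sketch:

```python
import math

def contact_potential(pair_dists, r_ab, lam):
    """Time-dependent contact energy from the formula above.

    pair_dists: distances ||x_i - x_j|| for candidate pairs (i in A, j in B)
    r_ab:       distance cutoff of the constraint
    lam:        lambda_union^t, increased as diffusion time t -> 0
    """
    violations = [max(d - r_ab, 0.0) for d in pair_dists]
    weights = [math.exp(-lam * v) for v in violations]
    # Softmin: weighted average dominated by the smallest violation.
    return sum(w * v for w, v in zip(weights, violations)) / sum(weights)
```

At small λ (early in sampling) the potential averages over all pairs and steers gently; at large λ it only demands that at least one pair satisfy the contact.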


The affinity module: Approaching FEP accuracy

The most significant innovation in Boltz-2 is its binding affinity prediction capability—the first AI model to approach FEP accuracy while being orders of magnitude faster.

1. Architecture design

The affinity module operates on Boltz-2’s structural predictions, processing the pair representation and predicted coordinates after 5 recycling iterations. The architecture consists of:

  1. Initialization: LinearNoBias layers initialize single and pair representations from trunk outputs
  2. Distogram conditioning: Predicted inter-token distances are one-hot encoded and added to pair representations
  3. PairFormer processing: 4-8 PairFormer layers process interactions, masked to focus exclusively on protein-ligand and intra-ligand interactions
  4. Mean pooling: Aggregation over interaction pairs produces a scalar representation
  5. Output heads: Two MLP heads predict binding likelihood (classification) and affinity value (regression)

The final predictions are thus a binding likelihood, expressed as the probability that the ligand is a true binder, and a continuous binding affinity value.

2. Ensemble strategy

Boltz-2 employs two affinity models with different hyperparameters for ensemble robustness:

Parameter Model 1 Model 2
PairFormer layers 8 4
$λ_\mathrm{focal}$ 0.8 0.6
Training samples 55M 12.5M

Binary predictions are averaged, while affinity values undergo molecular weight correction:

\[\hat{y} = C_0 \cdot (y_1 + y_2) + C_1 \cdot \mathrm{MW}_\mathrm{binder} + C_2\]

where constants are fitted on a holdout validation set.
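As a sketch, with placeholder constants (the paper fits C0-C2 on a holdout validation set; the defaults below simply reduce to a plain average):

```python
def ensemble_affinity(y1, y2, mol_weight, c0=0.5, c1=0.0, c2=0.0):
    """Combine the two affinity heads with a molecular-weight correction,
    following the formula above. c0=0.5 with c1=c2=0 is a plain average;
    the real constants are fitted on held-out data."""
    return c0 * (y1 + y2) + c1 * mol_weight + c2
```

The molecular-weight term corrects for the tendency of learned affinity models to score heavier ligands more favorably regardless of their true potency.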

Figure 1 (from the paper): Comparison of the accuracy/speed tradeoff of Boltz-2 against FEP+, ABFE, OpenFE, and ML baselines on the 4-target protein-ligand benchmark.

3. Benchmark performance

Boltz-2 was evaluated on the FEP+ 4-target benchmark (CDK2, TYK2, JNK1, P38), on the CASP16 blind affinity challenge, and on hit discovery with MF-PCBA; per-benchmark numbers are reported in the paper (see Figure 6).


Training methodology

1. Structure training

Training proceeds through four stages with increasing crop sizes:

Stage Learning Rate Crop Size Steps MD Data Distillation
1 1e-3 384 88k No Yes
2 5e-4 512 4k Yes Yes
3 5e-4 640 4k Yes Yes
4 5e-4 768 1k No No

The final stage uses only PDB data (cutoff: 2023-06-01) to maintain highest quality. The model is trained with a diffusion multiplicity of 32 samples per example and uses 128 A100 GPUs for affinity module training.

2. Extended training data

Boltz-2’s training extends beyond the PDB to include:

Molecular dynamics ensembles (from the mdCATH and ATLAS datasets):

100 frames are uniformly sampled from trajectories for ensemble supervision.

Self-distillation datasets: high-confidence model predictions reused as additional training structures.

3. B-factor supervision

A novel addition is B-factor prediction—the trunk’s single representation is supervised to predict each token’s B-factor. For MD structures, B-factors are computed from RMSF values:

\[B = \frac{8\pi^2}{3}\,\mathrm{RMSF}^2\]

This supervision specifically targets local structural dynamics and improves the model’s understanding of conformational flexibility.
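The conversion is a one-liner, assuming RMSF is given in angstroms:

```python
import math

def rmsf_to_bfactor(rmsf):
    """Convert per-token RMSF (angstroms) to a crystallographic-style
    B-factor (square angstroms) via B = (8 * pi^2 / 3) * RMSF^2."""
    return (8 * math.pi ** 2 / 3) * rmsf ** 2
```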

4. Affinity training

Affinity training occurs separately with gradients detached from the structure trunk. The pipeline incorporates:

  1. Pocket pre-processing: For each target, 10 random binders are predicted to identify consensus binding sites
  2. Affinity cropping: Up to 256 tokens (200 protein maximum) around the binding pocket
  3. Feature pre-processing: Trunk representations and coordinates cached to reduce training overhead
  4. Activity cliff sampling: Assays weighted by interquartile range (IQR) of affinity values to prioritize informative examples
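Step 4 can be sketched as follows; the function, the data layout, and the use of `statistics.quantiles` for the IQR are illustrative assumptions, not the paper's code:

```python
import random
import statistics

def iqr_weighted_sample(assays, k, seed=0):
    """Draw `k` assay ids with probability proportional to the interquartile
    range of each assay's affinity values, so assays that discriminate
    strongly between ligands (activity cliffs) are sampled more often.

    assays: dict mapping assay id -> list of affinity measurements
    """
    rng = random.Random(seed)
    ids = list(assays)
    weights = []
    for a in ids:
        q1, _, q3 = statistics.quantiles(assays[a], n=4)
        weights.append(q3 - q1)  # interquartile range as sampling weight
    return rng.choices(ids, weights=weights, k=k)
```

An assay whose measurements are all identical carries no information about relative potency and, with zero IQR, is never drawn.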

The loss function combines a focal classification loss on the binary binder label with a regression loss on the measured affinity values.

Censor-aware supervision handles inequality qualifiers (e.g., “>”) appropriately, treating them as bounds rather than exact measurements.

Figure 6 (from the paper): Pearson correlation comparison across affinity benchmarks showing Boltz-2 performance on FEP+ subsets, OpenFE, internal targets, and CASP16.

Comparison with AlphaFold3

1. Architectural differences

Component AlphaFold3 Boltz-2
Pairformer blocks 48 64
MSA module blocks 4 Reordered operations
Pair representation dim 128 128
Single representation dim 384 384
Max crop size 768 768
Templates Monomeric only Multimeric supported
Method conditioning No Yes
B-factor prediction No Yes
Affinity prediction No Yes
Physical steering No Yes (Boltz-2x)

2. Confidence module

AlphaFold3 uses 4 PairFormer layers for confidence prediction; Boltz-2 uses 8 PairFormer layers but adopts a simpler architecture than Boltz-1’s expensive 48-layer confidence trunk. A key innovation is separating PDE and PAE prediction into two heads—one for intra-chain pairs and one for inter-chain pairs.
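The intra-/inter-chain split can be expressed as a simple pair mask; a minimal sketch (in practice the masks are boolean tensors over token pairs):

```python
def split_pair_masks(chain_ids):
    """Route each token pair (i, j) to the intra-chain confidence head when
    both tokens belong to the same chain, otherwise to the inter-chain head,
    mirroring the separated PDE/PAE heads described above."""
    n = len(chain_ids)
    intra = [[chain_ids[i] == chain_ids[j] for j in range(n)] for i in range(n)]
    inter = [[not intra[i][j] for j in range(n)] for i in range(n)]
    return intra, inter
```

Separating the heads lets the model calibrate interface error estimates independently of the typically easier intra-chain geometry.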

3. Performance comparison

Figure 3 (from the paper): Structure prediction benchmark figure comparing Boltz-2, AlphaFold3, Chai-1, Protenix, and Boltz-1 across different complex types.

On recent PDB structures (2024-2025), Boltz-2 performs on par with AlphaFold3 and the other leading co-folding models (Figure 3).

Figure 4 (from the paper): Antibody benchmark and Polaris-ASAP competition results.

Results on the Polaris-ASAP ligand pose competition (SARS-CoV-2/MERS-CoV proteases) are shown in Figure 4.

4. Design philosophy differences

AlphaFold3: closed development, restrictive licensing, and no public training code.

Boltz-2: fully open MIT-licensed release of weights, inference code, and training pipeline, intended as an extensible community platform.


Practical applications: Virtual screening at scale

Boltz-2 enables structure-based virtual screening at unprecedented scale. In a prospective evaluation against TYK2, two strategies were tested: fixed-library screening of the Enamine Hit Locator Library (460k compounds) and generative screening with SynFlowNet over the Enamine REAL space of 76 billion compounds.

The combined Boltz-2 + SynFlowNet workflow demonstrates an effective de novo binder generation pipeline validated through ABFE simulations.

Figure 4 (from the paper): TYK2 virtual screening results showing correlation between Boltz-2 screen scores and ABFE readouts, plus distribution across screening strategies.

Current limitations and future directions

Despite significant advances, several limitations remain:

Molecular dynamics: While improved over Boltz-1, ensemble diversity metrics still lag behind specialized models like BioEmu and AlphaFlow. The MD dataset was only introduced in later training stages.

Structure prediction: Performance does not significantly exceed predecessors due to similar training data and architecture. Large conformational changes induced by binding remain challenging.

Affinity prediction dependencies: Accurate affinity prediction requires correct pocket identification and binding interface reconstruction. Performance varies substantially across assays (Pearson R ranging from 0.06 to 0.73), suggesting target-specific applicability.

Cofactor handling: The current affinity module does not explicitly handle cofactors (ions, water, multimeric binding partners) that may be essential for certain binding interactions.


Conclusions

Boltz-2 represents a significant step toward integrated structure-affinity prediction for drug discovery. By combining structural co-folding capabilities with FEP-competitive binding affinity prediction, extensive controllability features, and physical quality enforcement, Boltz-2 provides a foundation for computational drug discovery workflows. The open release of weights, inference code, and training pipelines positions Boltz-2 as an extensible platform for the computational structural biology community.

The key innovations—affinity prediction approaching FEP accuracy at 1000× the speed, multimeric template support, experimental method conditioning, and Boltz-steering for physical plausibility—address critical gaps between structure prediction and practical drug discovery applications. As training data expands and architectural refinements continue, models like Boltz-2 may increasingly complement or replace expensive physics-based simulations in early-stage drug discovery.


Primary Reference: Saro Passaro, Gabriele Corso, Jeremy Wohlwend, et al. Boltz-2: Towards Accurate and Efficient Binding Affinity Prediction. bioRxiv, 2025.

DOI: https://doi.org/10.1101/2025.06.14.659707

GitHub: https://github.com/jwohlwend/boltz.git
