
What You Should Know:
– NVIDIA Research, in collaboration with the University of Oxford and Mila – Québec AI Institute, has unveiled La-Proteina, a novel method for atomistic protein design.
– Published on arXiv on July 13, 2025, La-Proteina is designed to directly generate fully atomistic protein structures jointly with their underlying amino acid sequences, addressing a critical challenge in de novo protein design.
Optimizing Protein Design with Fixed-Dimensional Latent Space
Existing methods often decouple sequence and structure generation or struggle with modeling accuracy and scalability when tackling full atomistic structures. La-Proteina introduces a “partially latent protein representation” where the coarse backbone structure (alpha-carbon coordinates) is modeled explicitly, while sequence and atomistic details are captured via per-residue latent variables of fixed dimensionality. This approach sidesteps the challenges of explicit side-chain representations, whose atom counts depend on the residue type and therefore change whenever the sequence changes during generation.
La-Proteina combines the strengths of explicit and latent modeling through a novel partially latent flow matching framework. This method models the alpha-carbon coordinates explicitly, while encompassing the sequence and coordinates of all other non-alpha-carbon atoms within a continuous, fixed-size latent representation for each residue.
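To make the fixed-size idea concrete, here is a minimal Python sketch of such a representation. It is an illustration only, not the authors' code: the class name `PartiallyLatentProtein`, the helper `explicit_floats`, and the latent size `LATENT_DIM = 8` are all assumptions made for this example. The heavy-atom counts for glycine, alanine, and tryptophan are standard chemistry.

```python
from dataclasses import dataclass
from typing import List, Tuple

LATENT_DIM = 8  # hypothetical per-residue latent size; not stated in this article

# Heavy-atom counts per residue (backbone N, CA, C, O plus side chain).
HEAVY_ATOMS = {"G": 4, "A": 5, "W": 14}


@dataclass
class PartiallyLatentProtein:
    """Hypothetical container: explicit alpha-carbons plus one latent per residue."""
    ca_coords: List[Tuple[float, float, float]]  # explicit part
    latents: List[List[float]]                   # stands in for sequence + other atoms

    def __post_init__(self) -> None:
        if len(self.ca_coords) != len(self.latents):
            raise ValueError("need one latent vector per alpha-carbon")
        if any(len(z) != LATENT_DIM for z in self.latents):
            raise ValueError(f"every latent must have {LATENT_DIM} entries")

    def floats_per_residue(self) -> int:
        # Same size for every residue, whatever amino acid it encodes.
        return 3 + LATENT_DIM


def explicit_floats(sequence: str) -> List[int]:
    """Coordinate count per residue if every heavy atom were stored explicitly."""
    return [3 * HEAVY_ATOMS[aa] for aa in sequence]


protein = PartiallyLatentProtein(
    ca_coords=[(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)],
    latents=[[0.0] * LATENT_DIM for _ in range(3)],
)
print(protein.floats_per_residue())  # 11
print(explicit_floats("GAW"))        # [12, 15, 42]
```

The explicit layout changes size whenever a residue's identity changes (12, 15, or 42 numbers here), while the partially latent layout stays at a constant width per residue.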
The model is trained in two stages:
- Variational Autoencoder (VAE): An encoder maps the input protein (sequence and structure) to latent variables, and a decoder reconstructs complete proteins from these latent variables and alpha-carbon coordinates.
- Partially Latent Flow Matching Model: This model learns the joint distribution over latent variables and alpha-carbon atom coordinates, building on the VAE.
This partially latent approach transforms the core learning problem from a mixed discrete-continuous space with variable dimensionality into a per-residue, continuous space of fixed dimensionality, making it amenable to powerful generative modeling techniques like flow matching.
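Because every residue is now a fixed-length vector of real numbers, a standard flow matching recipe applies. The sketch below shows one common variant, a linear interpolant between Gaussian noise and data with its velocity as the regression target. It is a generic illustration, not La-Proteina's training code: the function names, the noise-at-t=0 convention, the uniform time sampling, and the choice to draw one time for the alpha-carbon coordinates and a second for the latent are all assumptions.

```python
import random
from typing import Dict, List

Vector = List[float]


def interpolate(noise: Vector, data: Vector, t: float) -> Vector:
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise: Vector, data: Vector) -> Vector:
    """Time derivative of the straight path; what a denoiser would regress."""
    return [d - n for n, d in zip(noise, data)]


def make_training_example(ca: Vector, latent: Vector,
                          rng: random.Random) -> Dict[str, object]:
    """Build one noisy per-residue input and its targets (hypothetical helper)."""
    ca_noise = [rng.gauss(0.0, 1.0) for _ in ca]
    lat_noise = [rng.gauss(0.0, 1.0) for _ in latent]
    # Assumption: the two modalities get independently sampled times.
    t_ca, t_lat = rng.random(), rng.random()
    return {
        "t_ca": t_ca,
        "t_lat": t_lat,
        "ca_t": interpolate(ca_noise, ca, t_ca),
        "lat_t": interpolate(lat_noise, latent, t_lat),
        "ca_target": target_velocity(ca_noise, ca),
        "lat_target": target_velocity(lat_noise, latent),
    }


example = make_training_example(
    ca=[1.0, 2.0, 3.0],   # one alpha-carbon position
    latent=[0.5] * 8,     # one latent vector (size 8 is an assumption)
    rng=random.Random(0),
)
print(len(example["ca_t"]), len(example["lat_t"]))  # 3 8
```

In a full model, a network would take the noisy coordinates, noisy latents, and both times as input and be trained with a squared-error loss against the two velocity targets.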
State-of-the-Art Performance and Scalability
La-Proteina achieves state-of-the-art performance on multiple generation benchmarks, including all-atom co-designability, diversity, and structural validity, as confirmed through detailed structural analyses and evaluations.
Key achievements include:
- High Sensitivity: Achieves excellent all-atom co-designability, designability, and diversity, while remaining competitive in novelty.
- Scalability to Large Proteins: La-Proteina can generate co-designable proteins of up to 800 residues, a regime where most baselines collapse and fail to produce valid samples due to computational limitations and memory constraints. This demonstrates La-Proteina’s robustness and strong scalability.
- Structural Validity: Produces more physically realistic structures than existing all-atom generators, with better MolProbity scores and clash scores and fewer Ramachandran angle outliers and covalent bond geometry outliers. It also accurately recovers rotameric states and their frequencies, whereas baselines miss modes or populate unrealistic angular regions.
- Atomistic Motif Scaffolding: La-Proteina significantly surpasses previous models in atomistic motif scaffolding performance, unlocking critical atomistic structure-conditioned protein design tasks. It successfully solves most benchmark tasks across all-atom and tip-atom scaffolding, in both indexed and unindexed setups.
Architectural Design and Training
La-Proteina’s neural networks (encoder, decoder, denoiser) are implemented using efficient transformer architectures. The denoiser network has approximately 160M parameters and is conditioned on the interpolation times, which is crucial for performance. The encoder and decoder each have about 130M parameters. A key design decision involves using two separate interpolation times for alpha-carbon coordinates