Evo2 Published in Nature: The 40-Billion-Parameter AI Model That Reads, Predicts, and Writes DNA Across All Domains of Life
Science & Discovery · March 9, 2026 · Palo Alto, United States · Research Review

Arc Institute, NVIDIA, and collaborators from Stanford, UC Berkeley, and UCSF have published Evo2 in Nature — a 40-billion-parameter open-source DNA foundation model trained on 9.3 trillion nucleotides that can predict pathogenic mutations with over 90% accuracy and generate entirely new genomes.

Key Takeaways

Evo2 is a 40-billion-parameter open-source DNA foundation model published in Nature on March 7, 2026, by the Arc Institute and NVIDIA. Trained on 9.3 trillion nucleotides from more than 128,000 genomes, it classifies BRCA1 variants as benign or pathogenic with over 90% accuracy and can generate entirely synthetic genomes, a major milestone for computational biology.


On March 7, 2026, the journal Nature published what may be the most significant computational biology paper of the year: 'Genome modeling and design across all domains of life with Evo 2.' The paper introduces Evo2, a DNA foundation model developed by the Arc Institute in collaboration with NVIDIA and researchers from Stanford University, UC Berkeley, and UC San Francisco. With 40 billion parameters trained on 9.3 trillion nucleotides — the largest biological sequence dataset ever assembled for a single model — Evo2 represents a fundamental shift in how artificial intelligence interacts with the code of life.

Unlike previous genomic AI models that focused on specific organisms or narrow tasks, Evo2 operates across all domains of life: bacteria, archaea, plants, animals, humans, and viruses. It processes genetic sequences up to one million base pairs with nucleotide-level resolution, effectively treating DNA as a language to be read, understood, and written. The model was trained on the NVIDIA DGX Cloud AI platform using more than 2,000 H100 GPUs, employing a novel architecture called StripedHyena 2 that handles extremely long sequences more efficiently than traditional transformer architectures.
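The idea of "DNA as a language" with nucleotide-level resolution can be made concrete with a toy tokenizer: one token per base, rather than the k-mer or word-piece vocabularies common elsewhere in sequence modeling. This is an illustrative sketch only; the vocabulary below is invented and is not Evo2's actual tokenizer.

```python
# Single-nucleotide tokenization: one token per base gives the model
# nucleotide-level resolution over very long contexts.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}  # N = ambiguous base

def tokenize(seq: str) -> list:
    """Map a DNA string to a list of integer token ids."""
    return [VOCAB[base] for base in seq.upper()]

def detokenize(tokens) -> str:
    """Map token ids back to a DNA string."""
    inv = {i: b for b, i in VOCAB.items()}
    return "".join(inv[t] for t in tokens)

tokens = tokenize("ACGTN")
print(tokens)              # [0, 1, 2, 3, 4]
print(detokenize(tokens))  # ACGTN
```

Because every position is its own token, a one-million-base-pair input is a one-million-token context, which is exactly why a sub-quadratic architecture like StripedHyena 2 matters at this scale.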

Training Data: 128,000 Genomes Spanning All Life

The scale of Evo2's training data is unprecedented in computational biology. The dataset encompasses over 128,000 complete genomes and metagenomic assemblies, covering organisms from single-celled bacteria to complex multicellular life. This breadth allows the model to learn universal patterns in genetic code — patterns that are conserved across billions of years of evolution and that underlie fundamental biological processes such as gene regulation, protein coding, and genome organization. The 9.3 trillion nucleotides in the training set dwarf previous efforts: the original Evo model, published as a preprint in February 2025, was trained on roughly 300 billion nucleotides, about a thirtieth of the new dataset.

Source: Arc Institute / Nature 2026

Clinical Relevance: BRCA1 Mutation Prediction

Perhaps the most clinically significant demonstration of Evo2's capabilities is its performance on BRCA1 variant classification. BRCA1 is a tumor suppressor gene whose mutations are associated with significantly elevated risks of breast and ovarian cancer. Classifying BRCA1 variants as benign or pathogenic is a critical clinical task — and one that has historically required expensive functional assays or large-scale epidemiological studies. In benchmark tests, Evo2 achieved over 90% accuracy in predicting whether BRCA1 mutations are benign or pathogenic, using only the DNA sequence as input. This performance approaches the accuracy of the best specialized variant-effect prediction tools while being derived from a general-purpose model that was not specifically trained on BRCA1 data.
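The recipe behind this kind of zero-shot variant-effect prediction is conceptually simple: score the reference sequence and the variant sequence under the language model, and treat a large drop in likelihood as evidence that the variant disrupts patterns the model has learned. The sketch below illustrates that delta-log-likelihood scoring; the toy 3-mer statistics and `sequence_log_likelihood` function are invented stand-ins so the example runs, not Evo2's actual scoring API.

```python
import math

# Stand-in for a genomic language model's sequence log-likelihood.
# A real Evo2 call would replace this; a toy 3-mer frequency model
# lets the example run end to end.
def sequence_log_likelihood(seq: str, kmer_logprobs: dict, k: int = 3) -> float:
    """Sum log-probabilities of overlapping k-mers in the sequence."""
    floor = math.log(1e-6)  # penalty for k-mers the model has not seen
    return sum(kmer_logprobs.get(seq[i:i + k], floor)
               for i in range(len(seq) - k + 1))

def score_variant(ref_seq: str, alt_seq: str, kmer_logprobs: dict) -> float:
    """Delta log-likelihood: alt minus ref. Strongly negative scores
    mean the variant sequence looks implausible to the model,
    i.e. it is more likely deleterious."""
    return (sequence_log_likelihood(alt_seq, kmer_logprobs)
            - sequence_log_likelihood(ref_seq, kmer_logprobs))

# Toy "learned" statistics: the model strongly prefers the reference context.
kmer_logprobs = {"GAT": math.log(0.9), "ATC": math.log(0.9), "TCG": math.log(0.9)}
ref = "GATCG"
alt = "GAACG"  # T>A substitution at the third base

delta = score_variant(ref, alt, kmer_logprobs)
print(f"delta log-likelihood = {delta:.2f}")  # negative → likely deleterious
```

In practice a threshold on this score, calibrated against variants of known effect, converts it into a benign/pathogenic call; the Evo 2 paper's BRCA1 benchmark follows this general pattern at genome-model scale.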

| Capability | Evo2 performance | Previous best |
| --- | --- | --- |
| BRCA1 pathogenicity prediction | >90% accuracy | ~85% (specialized tools) |
| Sequence length processing | 1 million base pairs | ~100K base pairs |
| Training data scale | 9.3 trillion nucleotides | ~300 billion nucleotides |
| Genome generation | Full synthetic genomes | Short sequences only |
| Cross-domain coverage | All domains of life | Single-domain models |
Genome Design: Writing New DNA

Evo2 does not merely read and predict — it can also generate entirely new DNA sequences. The researchers demonstrated this capability by producing synthetic genomes inspired by the bacterium Mycoplasma genitalium, one of the simplest self-replicating organisms. While these generated genomes are not yet functional organisms, they represent a proof of concept for AI-driven genome design. The ability to generate biologically plausible genome-length sequences opens possibilities in synthetic biology, from designing organisms for bioremediation to creating customized microbial factories for pharmaceutical production. The implications are profound — and the ethical questions they raise are equally significant.
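Generation itself is plain autoregressive sampling: the model emits one nucleotide at a time, each conditioned on everything generated so far. The sketch below shows that sampling loop with a toy first-order Markov chain standing in for Evo2's learned conditional distribution (the transition probabilities are invented for illustration).

```python
import random

# Toy conditional distribution P(next base | previous base). A real
# genomic language model conditions on the entire preceding sequence.
TRANSITIONS = {
    "A": {"A": 0.1, "C": 0.3, "G": 0.4, "T": 0.2},
    "C": {"A": 0.3, "C": 0.1, "G": 0.4, "T": 0.2},
    "G": {"A": 0.2, "C": 0.4, "G": 0.1, "T": 0.3},
    "T": {"A": 0.4, "C": 0.2, "G": 0.3, "T": 0.1},
}

def generate(prompt: str, n_tokens: int, rng: random.Random) -> str:
    """Extend `prompt` one nucleotide at a time, sampling each base
    from the model's conditional distribution given the sequence so far."""
    seq = list(prompt)
    for _ in range(n_tokens):
        dist = TRANSITIONS[seq[-1]]
        bases, probs = zip(*dist.items())
        seq.append(rng.choices(bases, weights=probs, k=1)[0])
    return "".join(seq)

rng = random.Random(0)  # fixed seed for reproducible sampling
genome_fragment = generate("ATG", 50, rng)
print(genome_fragment)
```

Scaling this loop from a 53-base fragment to a Mycoplasma genitalium-sized genome (~580,000 base pairs) is what Evo2's long-context architecture makes feasible; biological plausibility comes from the learned distribution, not the loop.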

Mechanistic Interpretability: Understanding What the Model Learns

A critical concern with large AI models in biology is interpretability — understanding not just that a model makes correct predictions, but why. The Arc Institute collaborated with Goodfire AI to develop interpretability tools specifically for Evo2. Through mechanistic interpretability analyses, the researchers demonstrated that Evo2 has learned to represent key biological features: exon-intron boundaries (the junctions between protein-coding and non-coding regions of genes), transcription factor binding sites (regulatory sequences that control gene expression), and protein structural elements. These findings suggest that the model has, through exposure to raw DNA sequences alone, discovered fundamental principles of molecular biology.
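A linear probe gives a flavor of how such interpretability analyses work: if a feature like "exon-intron boundary" is encoded in the model's hidden states, even a simple linear classifier trained on those activations can recover it. The sketch below uses synthetic embeddings in place of real Evo2 activations; the shift along one hidden dimension is the invented "encoding" of the feature, and the probe is a plain mean-difference classifier rather than the tooling the Goodfire AI collaboration actually built.

```python
import random

rng = random.Random(42)
DIM = 8  # toy hidden-state dimensionality

def fake_embedding(is_boundary: bool):
    """Synthetic stand-in for a model's hidden state at one position.
    'Boundary' positions get a shift on one dimension."""
    vec = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    if is_boundary:
        vec[3] += 2.0  # the dimension that "encodes" the feature
    return vec

train = [(fake_embedding(b), b) for b in [True, False] * 100]

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Mean-difference probe: the direction separating the two classes,
# with the decision threshold at the midpoint between class means.
mu_pos = mean([v for v, b in train if b])
mu_neg = mean([v for v, b in train if not b])
direction = [p - n for p, n in zip(mu_pos, mu_neg)]
threshold = sum(d * (p + n) / 2 for d, p, n in zip(direction, mu_pos, mu_neg))

def predict(vec) -> bool:
    return sum(d * x for d, x in zip(direction, vec)) > threshold

test = [(fake_embedding(b), b) for b in [True, False] * 50]
accuracy = sum(predict(v) == b for v, b in test) / len(test)
print(f"probe accuracy: {accuracy:.2f}")  # well above chance → feature is encoded
```

The logic carries over to the real setting: high probe accuracy on held-out positions is evidence that the model represents the biological feature internally, even though it was never trained on annotations.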

Evo2 Pipeline: From Training Data to Applications
```mermaid
graph LR
    A["Raw DNA Sequences<br/>9.3T nucleotides"] --> B["StripedHyena 2<br/>Architecture"]
    B --> C["Evo2 Model<br/>40B parameters"]
    C --> D["Variant Effect<br/>Prediction"]
    C --> E["Genome<br/>Generation"]
    C --> F["Feature<br/>Discovery"]
    D --> G["BRCA1 Classification<br/>>90% accuracy"]
    E --> H["Synthetic Genomes<br/>M. genitalium-scale"]
    F --> I["Exon-Intron Boundaries<br/>TF Binding Sites"]
```
Source: Arc Institute / Nature 2026

Open Science and Industry Implications

In a decision that distinguishes Evo2 from many commercial AI models, the Arc Institute released it as fully open source: the model weights, the training data, and the training and inference code are all publicly available on Arc's GitHub repository. The model is also integrated into NVIDIA's BioNeMo framework, making it accessible to researchers and developers working within NVIDIA's computational biology ecosystem. This open approach contrasts sharply with the proprietary strategies of most large AI model developers and reflects the Arc Institute's mission as a nonprofit research organization focused on advancing fundamental science.

For the pharmaceutical and biotechnology industries, Evo2 represents both an opportunity and a challenge. The model's ability to predict variant effects across the genome could accelerate drug target identification and reduce the cost of genetic screening. Its genome generation capabilities could reshape synthetic biology. But these applications will require careful validation — a high-performing AI model is not a substitute for wet-lab experiments and clinical trials. The gap between computational prediction and biological reality remains substantial, and Evo2's real-world impact will ultimately be measured not by benchmark scores but by the discoveries and therapies it enables. Nevertheless, the publication in Nature marks a milestone: the era of general-purpose genomic AI has arrived, and the tools to explore it are, by deliberate choice, available to everyone.
