Introducing Evo 2, a predictive and generative genomic AI for all domains of life

Researchers at the Arc Institute, Stanford University, and NVIDIA have developed Evo 2, an advanced AI model capable of predicting the effects of genetic variation and generating genomic sequences across all domains of life.
Testing shows that Evo 2 accurately predicts the functional effects of mutations across prokaryotic and eukaryotic genomes. It also successfully annotated the woolly mammoth genome from raw genomic sequences without a direct training reference, showing an ability to generalize function from the sequence alone.
Current genomic models struggle to predict the functional impact of mutations across diverse biological systems, particularly in eukaryotic genomes. Machine learning approaches have had some success in modeling protein sequences and prokaryotic genomes, but the complexity of eukaryotic DNA, with its long-range interactions and regulatory elements, presents a greater challenge.
Evo 2 was developed to address these limitations by incorporating a large-scale training dataset spanning bacteria, archaea, eukaryotes, and bacteriophages, with a focus on broad genomic patterns across species rather than being trained for a single specific function.
In the study, "Genome Modeling and Design Across All Domains of Life with Evo 2," posted as a bioRxiv preprint, the team details how a model trained on 9.3 trillion DNA base pairs enables genome-scale prediction and design.
Evo 2 was trained on 9.3 trillion nucleotides (A, T, C, or G), making it one of the largest biological models ever developed. The model can analyze and generate up to 1 million nucleotides at a time, allowing it to capture long-range patterns and relationships within DNA sequences.
During training, Evo 2 learned by predicting the next base pair in a sequence, similar to how language models predict the next word in a sentence. This approach enables Evo 2 to identify complex genomic structures and accurately model the functional impact of genetic variations across all domains of life.
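To make the next-base objective concrete, here is a toy sketch in Python. It is not Evo 2's architecture: it swaps the neural network for a simple count-based model that looks only at the previous k bases, and the names `train_kmer_model` and `next_base_probs`, along with the example sequences, are illustrative assumptions.

```python
from collections import Counter, defaultdict

ALPHABET = "ACGT"

def train_kmer_model(sequences, k=3):
    """Count which base follows each k-base context (toy stand-in for a neural model)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - k):
            context, nxt = seq[i:i + k], seq[i + k]
            counts[context][nxt] += 1
    return counts

def next_base_probs(counts, context, pseudocount=1.0):
    """Return P(next base | context), smoothed so unseen bases keep nonzero probability."""
    seen = counts.get(context, Counter())
    total = sum(seen[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
    return {b: (seen[b] + pseudocount) / total for b in ALPHABET}

# Made-up training sequences, for illustration only.
model = train_kmer_model(["ATGATGATGATG", "ATGCATGC"], k=3)
probs = next_base_probs(model, "ATG")
most_likely = max(probs, key=probs.get)
```

A model like Evo 2 plays the same game with a far longer context (up to 1 million nucleotides, per the preprint) and a learned network in place of the count table.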
The training dataset, OpenGenome2, was carefully curated to exclude genomic sequences from viruses that infect eukaryotic hosts to mitigate potential misuse.
A two-phase training strategy was used, beginning with a pretraining phase that prioritized functional genetic elements, followed by a midtraining phase that extended context length to capture broader genomic patterns.
Evo 2 employs StripedHyena 2, a novel architecture combining input-dependent convolution operators with attention mechanisms, optimized to handle long DNA sequences efficiently at scale. The model was trained using 1,024 GPUs at the 40-billion-parameter level, achieving higher training efficiency than traditional transformer models.
Results showed that Evo 2 accurately predicts the functional effects of mutations across prokaryotic and eukaryotic genomes without the need for task-specific fine-tuning. The model demonstrated sensitivity to mutations in start codons, splice sites, and conserved genomic regions, with performance aligning with known biological constraints.
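One common way to score mutations with a next-base model, without any fine-tuning, is to compare how likely the model finds the mutated sequence versus the reference. The Python sketch below assumes that likelihood-difference approach. `toy_model` is a hypothetical stand-in for a trained model (it simply expects an ATG start codon), and the example sequence is made up.

```python
import math

def sequence_log_likelihood(seq, next_base_probs):
    """Sum log P(base | preceding bases) over every position in the sequence."""
    total = 0.0
    for i, base in enumerate(seq):
        total += math.log(next_base_probs(seq[:i])[base])
    return total

def variant_score(reference, position, alt_base, next_base_probs):
    """Delta log-likelihood; negative means the model finds the variant less plausible."""
    variant = reference[:position] + alt_base + reference[position + 1:]
    return (sequence_log_likelihood(variant, next_base_probs)
            - sequence_log_likelihood(reference, next_base_probs))

def toy_model(context):
    """Hypothetical stand-in for a trained model: strongly expects 'ATG' at the start."""
    expected = "ATG"
    if len(context) < len(expected):
        favored = expected[len(context)]
        return {b: (0.91 if b == favored else 0.03) for b in "ACGT"}
    return {b: 0.25 for b in "ACGT"}  # no preference elsewhere

score_start = variant_score("ATGGCC", 0, "C", toy_model)  # disrupts the start codon
score_other = variant_score("ATGGCC", 4, "T", toy_model)  # outside the start codon
```

In this toy setup the start-codon mutation receives a strongly negative score while the other substitution scores zero, mirroring the kind of sensitivity to start codons described above.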
Specialized models such as AlphaMissense and GPN-MSA performed slightly better for coding single-nucleotide variants, whereas Evo 2 demonstrated superior accuracy for indels and noncoding variants. Embedding-based classifiers trained on Evo 2 representations achieved state-of-the-art performance in classifying BRCA1 breast cancer variants.
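The idea behind an embedding-based classifier can be sketched as follows: each variant is represented by a numeric vector taken from the model, and a small supervised classifier is trained on top. The Python example below assumes a plain logistic regression and uses made-up two-dimensional vectors and labels. The preprint's actual classifier, embedding layer, and data are not described here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(embeddings, labels, lr=0.5, epochs=500):
    """Fit logistic regression with batch gradient descent on fixed embeddings."""
    dim, n = len(embeddings[0]), len(embeddings)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(embeddings, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, x):
    """Probability that the variant belongs to class 1."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Synthetic stand-ins for variant embeddings (1 = loss of function, 0 = functional).
toy_embeddings = [[2.0, 0.5], [1.5, 1.0], [1.8, 0.2],
                  [-1.0, 0.3], [-1.5, -0.5], [-2.0, 0.8]]
toy_labels = [1, 1, 1, 0, 0, 0]
w, b = train_logistic(toy_embeddings, toy_labels)
p_high = predict(w, b, [1.7, 0.4])
p_low = predict(w, b, [-1.6, 0.1])
```

The large model stays frozen in this arrangement. Only the small classifier on top is trained on labeled variants.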
Interpretability analysis revealed that Evo 2 autonomously learns key biological structures, including transcription factor binding sites, exon-intron boundaries, and protein structural motifs.
Sparse autoencoder techniques identified latent features corresponding to mobile genetic elements, prophages, and CRISPR-associated sequences. Evo 2's ability to generalize was demonstrated by successfully annotating the woolly mammoth genome, a species not present in its training data.
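For readers unfamiliar with sparse autoencoders, the Python sketch below shows only the forward pass of a generic top-k variant: a model activation is expanded into many candidate features, all but the k strongest are zeroed, and the activation is rebuilt from the survivors. The weights are hand-set toy values and the top-k rule is an assumption. The article does not specify which variant the team used.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def encode(activation, enc_weights, enc_bias, k):
    """Map an activation to non-negative features, keeping only the k largest."""
    pre = [p + b for p, b in zip(matvec(enc_weights, activation), enc_bias)]
    feats = [max(0.0, p) for p in pre]  # ReLU
    keep = set(sorted(range(len(feats)), key=lambda i: feats[i], reverse=True)[:k])
    return [f if i in keep else 0.0 for i, f in enumerate(feats)]

def decode(features, dec_weights):
    """Rebuild the activation as a weighted sum of per-feature directions."""
    dim = len(dec_weights[0])
    return [sum(f * row[j] for f, row in zip(features, dec_weights))
            for j in range(dim)]

# Toy dictionary: four features for a 2-dimensional activation (illustrative values).
ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
BIAS = [0.0, 0.0, 0.0, 0.0]
DEC = ENC  # tied weights, purely for simplicity

features = encode([0.9, 0.1], ENC, BIAS, k=1)
reconstruction = decode(features, DEC)
```

Because only a few features fire for any given input, each one can be inspected on its own, which is what makes it possible to match individual features to elements such as prophages or CRISPR-associated sequences.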
Genome-scale sequence generation was also tested, with Evo 2 successfully creating complete mitochondrial genomes, bacterial genomes, and yeast chromosome-scale sequences. Generated sequences exhibited realistic structural and evolutionary properties, including accurate synteny patterns, protein-coding regions, and regulatory elements.
When prompted with mitochondrial genome sequences, Evo 2 produced DNA with the correct number of coding genes, tRNAs, and rRNAs.
Beyond sequence generation, Evo 2 was applied in an inference-time controlled design task to engineer DNA sequences with programmable chromatin accessibility. With chromatin accessibility models such as Enformer and Borzoi guiding generation, Evo 2 produced sequences with specific regulatory features, including the ability to encode Morse code messages within epigenetic structures.
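The general shape of inference-time guided design can be sketched in Python: a generator proposes several candidate chunks, a separate scoring model rates each one, and the best match to the target is kept before moving on. Everything below is a hypothetical stand-in. Random sampling replaces Evo 2, GC fraction replaces an accessibility predictor, and the Morse-to-pattern encoding is an assumption rather than the scheme used in the preprint.

```python
import random

MORSE = {"E": ".", "V": "...-", "O": "---"}  # only the letters this demo needs

def morse_to_pattern(message):
    """Assumed encoding: dot = 1 open chunk, dash = 3 open chunks, gaps = closed chunks."""
    pattern = []
    for letter in message:
        for symbol in MORSE[letter]:
            pattern += [True] * (1 if symbol == "." else 3) + [False]
        pattern += [False, False]  # extra gap between letters
    return pattern

def propose_chunks(rng, prefix, n_candidates, chunk_len):
    """Stand-in for sampling continuations; a real model would condition on prefix."""
    return ["".join(rng.choice("ACGT") for _ in range(chunk_len))
            for _ in range(n_candidates)]

def toy_accessibility(chunk):
    """Stand-in for a predictor such as Enformer or Borzoi: here just the GC fraction."""
    return sum(base in "GC" for base in chunk) / len(chunk)

def guided_design(pattern, chunk_len=20, n_candidates=32, seed=0):
    """Greedily extend the design, keeping the candidate closest to each target."""
    rng = random.Random(seed)
    design = ""
    for want_open in pattern:
        candidates = propose_chunks(rng, design, n_candidates, chunk_len)
        target = 1.0 if want_open else 0.0
        design += min(candidates, key=lambda c: abs(toy_accessibility(c) - target))
    return design

pattern = morse_to_pattern("EVO")
design = guided_design(pattern)
```

The generator itself is never retrained in this scheme. Control comes entirely from how candidates are filtered at generation time, which is why the approach is described as inference-time design.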
Evo 2 represents a significant advancement in genomic AI, combining predictive accuracy with generative capabilities at genome-wide scales. By making Evo 2's training code, model parameters, and the OpenGenome2 dataset openly available, the researchers hope to accelerate genomic research.
Future applications of Evo 2 may include large-scale population genetics studies, synthetic biology, and advanced epigenomic design.
More information: Garyk Brixi et al, Genome modeling and design across all domains of life with Evo 2, bioRxiv (2025).
Journal information: bioRxiv
© 2025 Science X Network