Arc Institute’s Evo 2 Sets a New Bar: The Largest AI Model for Biology That Reads, Writes, and Designs the Genetic Code

What if an AI could “read” entire genomes end to end—and then help you write a new one? That’s not sci-fi anymore. It’s Evo 2, a new generative model from the Arc Institute built with collaborators at NVIDIA, Stanford, UC Berkeley, and UCSF. It’s the largest AI model for biology announced to date, trained on a sweeping cross-section of life’s code and capable of both understanding and designing DNA at unprecedented scales.

Evo 2 doesn’t just skim the surface of genomics. It ingests long-range patterns—up to one million nucleotides at a time—spanning regulatory elements, gene clusters, structural motifs, and evolutionary signals that typically take teams of scientists years to decipher experimentally. Early results suggest it can predict the impact of mutations (including in high-stakes genes like BRCA1) with over 90% accuracy and even design genome-length sequences as long as those of simple bacteria.

Below, we break down what Evo 2 is, how it works, why it matters, and what it could mean for drug discovery, disease research, and the future of synthetic biology.

To dive deeper into the official announcement and preprint, start here: Arc Institute: Evo 2

Meet Evo 2: An AI Native to the Language of Nucleotides

Built as a successor to Evo 1, Evo 2 is a generative model that treats genetic code like a language—one with its own syntax, grammar, and long-distance dependencies. But unlike typical text, genomes carry multi-scale information layered across kilobases to megabases. Evo 2’s superpower is seeing the forest and the trees at once.
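
To make the language analogy concrete, here is a minimal Python sketch of turning DNA into integer tokens, the form a sequence model consumes. The five-symbol vocabulary and the encode/decode helpers are assumptions for illustration; the article does not describe Evo 2’s actual tokenizer.

```python
# Hypothetical single-nucleotide vocabulary; Evo 2's real tokenizer may differ.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}
ID_TO_BASE = {token: base for base, token in VOCAB.items()}


def encode(seq: str) -> list[int]:
    """Map each nucleotide to an integer token; unknown symbols become N."""
    return [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()]


def decode(tokens: list[int]) -> str:
    """Map integer tokens back to a nucleotide string."""
    return "".join(ID_TO_BASE[token] for token in tokens)


tokens = encode("ATGCGTNA")
print(tokens)
print(decode(tokens))
```

With one token per base, a one-million-nucleotide window is simply a one-million-token sequence, which is why context length dominates the design discussion that follows.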

Trained Across the Tree of Life

  • Scale: Trained on more than 9.3 trillion nucleotides
  • Breadth: Over 128,000 whole genomes, spanning bacteria, archaea, phages, humans, plants, and other eukaryotes
  • Diversity: Cross-domain training helps the model generalize biological “rules” that recur across evolution, enabling better transfer learning from microbes to mammals and beyond

This diversity gives Evo 2 a panoramic view of evolution’s playbook—how motifs, domains, and control logic are conserved, repurposed, or innovated across life.

A One-Million-Nucleotide Context Window

Most AI sequence models struggle to reason over very long contexts. Evo 2 raises the ceiling dramatically:

  • Context length: Up to 1,000,000 nucleotides at once
  • Compared to Evo 1: 8× longer sequences in context
  • Why this matters: Many functional relationships in DNA and RNA span huge distances—enhancer-promoter loops, gene neighborhoods, operons, repetitive elements, and 3D genome features. Bigger context windows mean more complete biological reasoning.
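
One practical consequence of a fixed context window is that anything longer has to be split. A minimal Python sketch of that bookkeeping follows; the windows helper and its overlap parameter are assumptions for illustration and are not part of any Evo 2 tooling.

```python
from typing import Iterator

CONTEXT_LEN = 1_000_000  # from the article: up to one million nucleotides


def windows(seq_len: int, context: int = CONTEXT_LEN,
            overlap: int = 0) -> Iterator[tuple[int, int]]:
    """Yield (start, end) index pairs that cover a sequence of seq_len bases."""
    if not 0 <= overlap < context:
        raise ValueError("overlap must be in [0, context)")
    step = context - overlap
    start = 0
    while start < seq_len:
        end = min(start + context, seq_len)
        yield start, end
        if end == seq_len:
            break  # the final window already reaches the end of the sequence
        start += step


# A hypothetical 4.6-million-base genome needs five non-overlapping windows.
spans = list(windows(4_600_000))
print(len(spans), spans[-1])
```

Overlapping windows are one way to avoid cutting a long-range relationship at a boundary, at the cost of processing some bases twice.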

The StripedHyena 2 Architecture

Evo 2 is powered by StripedHyena 2, a novel architecture that OpenAI co-founder Greg Brockman contributed to during a sabbatical. Classic Transformers shine at many language tasks, but long-range genomics pushes their limits. StripedHyena 2 is purpose-built for very long sequences, enabling efficient handling of million-nucleotide windows and training on far larger datasets.

  • 30× more training data than Evo 1
  • Long-context reasoning optimized for genomic structure
  • Designed for scalable training and inference at biological lengths

Massive Compute, Carefully Orchestrated

Evo 2’s training took months on NVIDIA DGX Cloud via AWS, using over 2,000 H100 GPUs. That level of compute, paired with targeted data curation, allowed the team to train a model that doesn’t just memorize sequences—it learns generalized rules of genomic organization and function.

What Evo 2 Can Do Today

Evo 2 is a foundational model, not a single-task tool. But the early demonstrations are eye-opening.

Predict Disease-Causing Mutations with High Accuracy

One headline result: in tests on variants of human genes like BRCA1, Evo 2 distinguishes benign from potentially pathogenic mutations with more than 90% accuracy. That’s significant because understanding whether a specific variant is benign or pathogenic is a cornerstone of modern genetics and precision medicine.

  • Why this matters: Faster, more accurate variant interpretation could help researchers prioritize experiments, understand disease mechanisms, and support the development of diagnostics and therapeutics.
  • Important note: This is a research-stage model and not a clinical diagnostic. Results are reported alongside a preprint and require continued validation.

For background on BRCA1 and its role in DNA repair and cancer risk, see NCBI’s BRCA1 gene overview.
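
The article does not say how Evo 2 scores a variant. One common approach with sequence language models is to compare the model’s log-likelihood of the mutated sequence against the reference. The Python sketch below shows that idea with a tiny Markov chain standing in for the real model; ToyMarkovModel, apply_snv, and delta_score are hypothetical names, and the training sequences are made up.

```python
import math
from collections import Counter, defaultdict

BASES = "ACGT"


class ToyMarkovModel:
    """Stand-in for a genomic language model: order-k Markov chain."""

    def __init__(self, k: int = 2):
        self.k = k
        self.counts = defaultdict(Counter)

    def fit(self, sequences):
        for seq in sequences:
            for i in range(self.k, len(seq)):
                self.counts[seq[i - self.k:i]][seq[i]] += 1
        return self

    def log_likelihood(self, seq: str) -> float:
        total = 0.0
        for i in range(self.k, len(seq)):
            ctx = self.counts.get(seq[i - self.k:i], {})
            # Add-one smoothing keeps unseen transitions from producing log(0).
            p = (ctx.get(seq[i], 0) + 1) / (sum(ctx.values()) + len(BASES))
            total += math.log(p)
        return total


def apply_snv(ref: str, pos: int, alt: str) -> str:
    """Return ref with the base at 0-based position pos replaced by alt."""
    if ref[pos] == alt:
        raise ValueError("alt equals the reference base")
    return ref[:pos] + alt + ref[pos + 1:]


def delta_score(model, ref: str, pos: int, alt: str) -> float:
    """Log-likelihood of the variant minus that of the reference."""
    return model.log_likelihood(apply_snv(ref, pos, alt)) - model.log_likelihood(ref)


# Made-up training data: the toy model only ever sees ATG repeats.
model = ToyMarkovModel(k=2).fit(["ATGATGATGATG"] * 3)
reference = "ATGATGATG"
candidates = [(4, "C"), (4, "G"), (5, "A")]

# Most negative delta first: the variants this toy model finds least expected.
ranked = sorted(candidates, key=lambda v: delta_score(model, reference, *v))
for pos, alt in ranked:
    print(pos, alt, round(delta_score(model, reference, pos, alt), 3))
```

A strongly negative delta only says the variant looks unusual to the model. Turning such scores into benign or pathogenic calls requires labeled data and validation, which is where reported accuracy figures come from.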

Design Novel Genomes at Bacterial Scales

Perhaps the most striking capability: Evo 2 can generate genome-length sequences on the order of simple bacteria. In plain English, that means it can propose coherent, large-scale designs—not just snippets—while respecting the long-range constraints that make genomes function.

  • Think of it as “co-writing” with evolution’s style guide in mind.
  • Practical impact: This could greatly accelerate early-stage design ideation in synthetic biology. Lab validation and safety review remain essential steps outside the scope of the model itself.
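
Generating sequence in the style of known genomes can be pictured as autoregressive sampling: predict a distribution over the next base, sample one, append it, and repeat. The Python sketch below does this with a toy k-mer table rather than Evo 2 itself; fit_transitions and generate are hypothetical helpers, and the article does not specify Evo 2’s sampling procedure.

```python
import random
from collections import Counter, defaultdict

BASES = "ACGT"


def fit_transitions(sequences, k: int = 3):
    """Count which base follows each k-mer in the training sequences."""
    table = defaultdict(Counter)
    for seq in sequences:
        for i in range(k, len(seq)):
            table[seq[i - k:i]][seq[i]] += 1
    return table


def generate(table, prompt: str, length: int, k: int = 3,
             temperature: float = 1.0, seed: int = 0) -> str:
    """Extend prompt one base at a time until it reaches the requested length."""
    if len(prompt) < k:
        raise ValueError("prompt must contain at least k bases")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = random.Random(seed)
    out = list(prompt)
    while len(out) < length:
        counts = table.get("".join(out[-k:]), {})
        # Add-one smoothing, then temperature: values below 1 sharpen the
        # distribution toward the most frequently observed next base.
        weights = [(counts.get(b, 0) + 1) ** (1.0 / temperature) for b in BASES]
        out.append(rng.choices(BASES, weights=weights, k=1)[0])
    return "".join(out)


# Made-up training sequences, used only to give the table something to count.
table = fit_transitions(["ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"] * 5)
draft = generate(table, prompt="ATG", length=60, temperature=0.8, seed=42)
print(draft)
```

The toy only looks back three bases. The point of a million-nucleotide context is that each new base can be conditioned on far more of what came before, which is what lets a generated design stay coherent at genome length.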

Discover Patterns Hidden in Plain Sight

By training across bacterial, archaeal, phage, human, and plant genomes, Evo 2 can surface patterns that are often non-obvious:

  • Conserved motifs and regulatory logic across species
  • Long-distance dependencies spanning operons to chromosomal domains
  • Hypotheses for functional annotation in poorly characterized regions

This is where AI complements bench science: by narrowing the search space, spotlighting testable hypotheses, and pointing out candidate features that warrant deeper study.

Why Evo 2 Is a Big Deal for Genomics, Drug Discovery, and Synthetic Biology

It’s not just about “bigger is better.” Evo 2’s capabilities unlock workflows that were previously impractical.

  • End-to-end genomic reasoning: Long windows allow the model to reason across promoters, enhancers, silencers, and gene neighborhoods together, rather than in isolation.
  • Accelerated hypothesis generation: From variant effect predictions to putative regulatory elements, Evo 2 can rapidly generate ranked lists of candidates for validation.
  • Design at scale: Generative capabilities extend beyond short sequences to genome-length constructs, opening new frontiers in synthetic biology ideation.
  • Cross-domain transfer: Training across the tree of life teaches general rules of sequence-function relationships that can inform research in less-studied organisms.

For pharma and biotech R&D teams, this could streamline target discovery, functional genomics, and sequence design—potentially shaving months off early research cycles.

Open, Accessible, and Built for the Community

Arc has coupled Evo 2’s release with a suite of tools and resources to make the model easier to explore, study, and build upon.

Evo Designer: A User-Friendly Interface

To lower the barrier to entry, Arc is releasing Evo Designer—an interface designed to help researchers interact with the model without heavy engineering lift. It represents a step toward democratizing genomics-focused generative AI.

  • Idea exploration: Probe sequence designs or analyze variants in a guided environment
  • Visualization: Inspect outputs with context and metadata

Details and links are available from the Arc Institute announcement.

Open-Source Code, Data, and Weights

Arc is releasing fully open-source code, training data references, and model weights on GitHub. This level of transparency invites the community to audit, extend, and integrate Evo 2 across research pipelines. Check the announcement for the repository and documentation links.

  • Benefits: Reproducibility, community validation, and rapid iteration
  • Ecosystem impact: Lowers entry barriers for academic labs and startups alike

Integrated with NVIDIA BioNeMo

Evo 2 integrates with NVIDIA’s BioNeMo, a platform for life sciences foundation models. This helps teams deploy models within GPU-accelerated, enterprise-grade environments and connect to downstream tools.

  • Enterprise readiness: Scalability for large datasets and secure environments
  • Interoperability: Connects with the broader AI for biology ecosystem

Mechanistic Interpretability with Goodfire

Working with AI lab Goodfire, Arc is also releasing a mechanistic interpretability visualizer that helps researchers see what Evo 2 has actually learned—feature maps, motif detectors, and more. This is key for scientific trust and model debugging, moving beyond black-box predictions.

Under the Hood: Training Data, Compute, and Architecture

Let’s zoom in on the engineering milestone.

  • Data scope: 9.3 trillion nucleotides from 128,000+ whole genomes capture broad evolutionary diversity.
  • Compute: Months of training on 2,000+ NVIDIA H100 GPUs via NVIDIA DGX Cloud on AWS.
  • Architecture: StripedHyena 2 enables efficient long-context training and inference, with an 8× jump in context length and 30× more data relative to Evo 1.

In other words, the Evo 2 team fused cutting-edge compute with a model design purpose-built for genomic sequence lengths—and then trained it on an unprecedented corpus of biological data.
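
Here is a quick back-of-the-envelope check of these figures in Python. The inputs come from the article; the derived values are only what the stated ratios imply, not numbers reported by Arc.

```python
# Figures quoted in the article.
EVO2_NUCLEOTIDES = 9_300_000_000_000  # 9.3 trillion
EVO2_GENOMES = 128_000                # "128,000+", so treat as a lower bound
EVO2_CONTEXT = 1_000_000              # nucleotides per context window
DATA_RATIO = 30                       # "30x more data" than Evo 1
CONTEXT_RATIO = 8                     # "8x jump in context length"

# Values implied by the ratios above (derived, not reported).
implied_evo1_nucleotides = EVO2_NUCLEOTIDES / DATA_RATIO
implied_evo1_context = EVO2_CONTEXT / CONTEXT_RATIO

# Rough average genome size and the number of full windows in the corpus.
mean_genome_size = EVO2_NUCLEOTIDES / EVO2_GENOMES
full_windows_in_corpus = EVO2_NUCLEOTIDES / EVO2_CONTEXT

print(f"Implied Evo 1 training data: {implied_evo1_nucleotides:.2e} nucleotides")
print(f"Implied Evo 1 context:       {implied_evo1_context:,.0f} nucleotides")
print(f"Mean genome size in corpus:  {mean_genome_size:,.0f} nucleotides")
print(f"Full-length windows:         {full_windows_in_corpus:,.0f}")
```

The mean genome size is only a rough average. A corpus that mixes phages, bacteria, plants, and humans spans several orders of magnitude in genome length.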

Quotes That Frame the Moment

  • “Our development of Evo 1 and Evo 2 represents a key moment in generative biology, enabling machines to read, write, and think in the language of nucleotides,” said Patrick Hsu, Arc Co-Founder and Core Investigator.
  • NVIDIA’s Anthony Costa called Evo 2 a fundamental advance that’s poised to accelerate solutions for major health challenges.

You can read more in the official announcement here: Arc Institute: Evo 2

How Evo 2 Compares: Evo 1, Protein Models, and Other Approaches

Evo 2 isn’t the first AI for biology—but its scope and design set it apart.

Evo 1 vs. Evo 2

  • Data scale: Evo 2 trained on 30× more data
  • Context length: 8× longer sequences (up to 1 million nucleotides)
  • Architecture: Upgraded to StripedHyena 2 for long-range reasoning
  • Capability: From strong variant prediction to genome-length generative design

Genomic Models vs. Protein-Focused Models

Protein structure models like AlphaFold have transformed structural biology. Evo 2 plays in a different but complementary space:

  • Focus: Nucleotide sequences (DNA/RNA), not just proteins
  • Strength: Long-range genomic reasoning, regulatory logic, and generative genome design
  • Complementarity: Insights from Evo 2 can inform protein expression contexts and regulatory strategies upstream of protein function

High-Impact Use Cases on the Horizon

While Evo 2 is a research-stage model, its potential spans multiple domains:

  • Variant interpretation and prioritization: Speeding up functional genomics studies
  • Noncoding regulatory discovery: Mapping enhancers, silencers, and long-range interactions
  • Synthetic biology ideation: Generating candidate designs at genome scale (with downstream validation)
  • Microbial engineering: Exploring pathway organization and operon-level logic
  • Phage and viral research: Understanding host interactions and capsid/packaging constraints from sequence signals
  • Comparative genomics: Learning cross-species constraints and innovations
  • Tooling and platforms: Embedding Evo 2 into pipelines via BioNeMo for scalable, secure deployments

Responsible Innovation: Safety, Ethics, and Governance

With powerful generative models in biology, safety is non-negotiable.

  • Research context: Evo 2’s release aligns with open science values—code, data, and weights—paired with tooling that makes interpretability a first-class citizen.
  • Guardrails: Use of any generative biological model should follow institutional review, biosafety frameworks, and applicable regulations.
  • Clinical caution: Variant predictions are not clinical diagnoses. They require experimental validation and clinical interpretation.
  • Community governance: Open releases invite broader oversight and peer review, which historically improve robustness and safety outcomes.

Arc’s approach—pairing openness with interpretability, platform integrations, and collaborative partnerships—helps set a constructive precedent for the field.

Collaboration at Scale

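Evo 2 reflects a deep partnership network: Arc Institute led the work with collaborators at NVIDIA, Stanford, UC Berkeley, and UCSF, alongside Goodfire on interpretability tooling.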

Big science is increasingly a team sport, and Evo 2 is a prime example.

What’s Next for Evo 2 and Generative Biology

Expect rapid iteration and deeper validation:

  • Richer modalities: Integrating epigenomic marks, chromatin conformation, and transcriptomics to contextualize sequence function
  • Longer contexts and conditioned generation: Even bigger windows and controllable generation for targeted design objectives
  • Bench integration: Closed-loop cycles where in silico design and wet-lab validation iterate for faster discovery
  • Application-specific finetuning: Tailored variants for oncology, antimicrobial resistance, agriculture, and environmental genomics
  • Standards and best practices: Shared benchmarks for safety, interpretability, and performance

The pace of improvement in sequence modeling suggests Evo 2 is less a finish line and more a launchpad.

FAQs

What is Evo 2?

Evo 2 is a large-scale generative AI model from the Arc Institute designed to understand and design nucleotide sequences. It’s trained on 9.3 trillion nucleotides across 128,000+ genomes, with a one-million-nucleotide context window for long-range genomic reasoning.

Who built Evo 2?

Arc Institute led development with collaborators from NVIDIA, Stanford, UC Berkeley, and UCSF. Greg Brockman contributed to StripedHyena 2, Evo 2’s architecture, during a sabbatical.

Is Evo 2 open-source?

Yes. Arc is releasing open-source code, model weights, and training data references on GitHub, along with Evo Designer, a user-friendly interface. See the Evo 2 announcement for links.

How accurate is Evo 2 at predicting disease mutations?

In early results shared with the preprint, Evo 2 predicts the impact of mutations in genes like BRCA1 with over 90% accuracy. These findings are research-stage and require continued validation; the model is not a clinical diagnostic tool.

Can Evo 2 design whole genomes?

Evo 2 can generate genome-length sequences comparable to simple bacteria, capturing long-range constraints. Any use of generative sequences requires appropriate biosafety reviews and experimental validation—Evo 2’s outputs are hypotheses, not ready-to-deploy biological systems.

How is Evo 2 different from previous models?

Compared to Evo 1, Evo 2 trains on 30× more data, processes sequences 8× longer (up to one million nucleotides), and uses the StripedHyena 2 architecture optimized for long-range genomic reasoning.

How can researchers try Evo 2?

Arc is releasing Evo Designer for interactive use and making code and weights available on GitHub, with integration into NVIDIA BioNeMo for enterprise deployments. Start with the official announcement for access details.

What infrastructure powered Evo 2’s training?

Training ran for months on NVIDIA DGX Cloud via AWS, using over 2,000 NVIDIA H100 GPUs.

What about safety and ethics?

Responsible use is essential. Evo 2 is intended for research, and any generative design work should comply with institutional biosafety frameworks, regulations, and ethical guidelines. Arc’s inclusion of interpretability tools and open science practices supports transparency and community oversight.

How does Evo 2 relate to protein models like AlphaFold?

Evo 2 focuses on nucleotide sequences and genomic logic, while AlphaFold predicts protein structures. They are complementary: Evo 2 can inform upstream regulatory context; protein models address downstream structure and function.

The Bottom Line

Evo 2 is a watershed moment for generative biology: a model that can read, reason over, and design the code of life at biologically realistic scales. With a one-million-nucleotide context window, training across 128,000+ genomes, and integration into open and enterprise ecosystems, it promises to accelerate variant interpretation, functional genomics, and synthetic biology ideation.

It’s still early days—this is a research-stage model, and clinical or experimental adoption must follow rigorous validation and safety protocols. But the direction is clear. Evo 2 doesn’t just make biology faster; it changes what’s feasible to ask and answer. For teams working at the frontier of genomics, this feels like stepping into a larger lab—one where evolution’s patterns are visible, testable, and, increasingly, designable.

For details, resources, and the preprint, visit the official announcement: Arc Institute: Evo 2

Discover more at InnoVirtuoso.com
