How massive datasets generated at Broad are powering the latest AI models in biology
AlphaGenome model architecture, training regimes, and comprehensive evaluation performance. Credit: bioRxiv (2025). DOI: 10.1101/2025.06.25.661532

In June, Google DeepMind took the wraps off AlphaGenome, its latest machine learning model for biological discovery. While DeepMind's Nobel Prize-winning AlphaFold model focuses on proteins and how they fold, AlphaGenome predicts how genetic variants affect the processes that control when and where genes are turned on and off.
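At its core, a variant-effect model of this kind predicts a regulatory readout (expression, chromatin accessibility, and so on) for the reference sequence and for the same sequence carrying the variant, then compares the two. The minimal sketch below illustrates that reference-versus-alternate pattern in general terms; the `Variant` class, `score_variant` function, and `DummyModel` are illustrative stand-ins, not DeepMind's actual API.

```python
# Reference-vs-alternate scoring: the general pattern a sequence-to-function
# model uses to estimate a variant's regulatory effect. Everything here is an
# illustrative stand-in, not the AlphaGenome API.
from dataclasses import dataclass


@dataclass
class Variant:
    pos: int   # 0-based position inside the extracted sequence window
    ref: str   # reference allele expected at that position
    alt: str   # alternate allele to substitute


def apply_variant(window_seq: str, variant: Variant) -> str:
    """Return the window sequence with the alternate allele swapped in."""
    end = variant.pos + len(variant.ref)
    assert window_seq[variant.pos:end] == variant.ref, "ref allele mismatch"
    return window_seq[:variant.pos] + variant.alt + window_seq[end:]


def score_variant(model, window_seq: str, variant: Variant) -> float:
    """Predicted effect = model readout on alt sequence minus readout on ref."""
    return model.predict(apply_variant(window_seq, variant)) - model.predict(window_seq)


class DummyModel:
    """Toy stand-in for a trained model; pretends GC content drives the readout."""
    def predict(self, seq: str) -> float:
        return float(seq.count("G") + seq.count("C"))


if __name__ == "__main__":
    window = "ACGTACGTGGCC"
    var = Variant(pos=3, ref="T", alt="G")
    print(score_variant(DummyModel(), window, var))  # 1.0: the alt allele adds a G
```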

In their announcement and on bioRxiv, DeepMind cited two resources—largely created at the Broad Institute in the 2010s—as their main sources of training data for AlphaGenome: the Encyclopedia of DNA Elements (ENCODE) Consortium, which cataloged more than a million candidate regulatory elements across the genome, and the Genotype-Tissue Expression (GTEx) Project, which continues to map the gene expression patterns of human and primate tissues.
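In practice, "training data" from resources like these means pairing windows of genome sequence with the functional signals measured across them (chromatin accessibility, transcription factor binding, tissue-level expression) so a model can learn to predict the measurements from sequence alone. Below is a minimal sketch of building one such training example; the sequence and signal values are made up, and the one-hot encoding shown is a common convention rather than the specific pipeline described in the paper.

```python
# Building one (sequence window, measured signal) training example, the kind of
# pairing that turns ENCODE/GTEx-style measurements into model training data.
# The window and signal values below are made up for illustration.
import numpy as np

BASES = "ACGT"


def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA window as an (L, 4) matrix, a standard model input format."""
    index = {base: i for i, base in enumerate(BASES)}
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in index:              # ambiguous bases (e.g. N) stay all-zero
            encoded[i, index[base]] = 1.0
    return encoded


window_seq = "ACGTN" * 20                          # a toy 100-bp window
measured_signal = np.random.rand(len(window_seq))  # placeholder for an assay track
x = one_hot(window_seq)                            # model input, shape (100, 4)
y = measured_signal                                # model target, shape (100,)
print(x.shape, y.shape)
```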

Both resources have also been instrumental in revealing how the genome works and how noncoding genetic variants impact gene expression, and laid the groundwork for efforts like the NIH's Impact of Genomic Variation on Function Consortium, the Human Cell Atlas, and the Broad's Gene Regulation Observatory (GRO).

To learn more about how ENCODE, GTEx, and similar datasets are fueling science in the AI age, we spoke with Kristin Ardlie, an institute scientist at Broad and director of GTEx; and Brad Bernstein, an institute member, leader of the GRO, director of Broad's Epigenomics Program, and a leader of the ENCODE Consortium.

When they started, what were the goals of ENCODE and GTEx?

Bernstein: ENCODE's goal was to understand the language of the genome. When it started, only 1% to 2% of the genome could be explained. Nobody knew how much of the other 98% was functional, or how it impacted the regulation of the cell. With ENCODE, we realized that maybe 20% of the genome looked like it had regulatory or functional roles. It changed the idea that the noncoding part of the genome was just junk.

Ardlie: And that's what launched GTEx. Once we reached the point where human genetic studies were reliably finding variants associated with diseases and traits, we realized that most were in those unknown regions of the genome, and we had no idea how they functioned. GTEx was launched as a way to systematically measure whether those genetic variants might have regulatory roles that affect gene expression in the context of tissues and cells and disease.

What does the emergence of AlphaGenome and other large AI models say about the value of resources like ENCODE and GTEx?

Ardlie: These resources' legacy is enduring, in that more than a decade or two after we started building them, they're enabling developments that we couldn't have considered. They were designed to be community resources and to be as utilitarian as possible, with no constraints on their use. This latest development is a testament to the fact that they really are working as intended.

And I think that this success shows us the path forward. The last five or 10 years has seen a lot of effort put into building atlases of single cells, which is going to be remarkably powerful as well. To find new opportunities to impact disease, we need to understand how biology and disease work at a cellular level.

I keep thinking about sitting at the eye doctor, where they flip little lenses in front of your eyes to see what your eyeglass prescription should be so you can see things more clearly. The resolution these models have achieved is so much finer than before, but they need to be even finer and more powerful to really help us understand genome function.

To achieve that, we need more of these unbiased foundational resources. Their value as training data for models that could help define the systematic rules of the genome is truly remarkable.

What are some other ways in which AI is helping us understand genome regulation?

Bernstein: There's a number of labs just here at Broad that are applying machine learning to the regulatory code. Jason Buenrostro, who leads the GRO with me, is using deep learning to work out how regulatory elements close to genes, like promoters, change as cells develop.

Our colleague Anders Hansen is applying AI to how the genome folds in 3D, which is incredibly important for understanding long-range interactions between elements and how they control expression of both genes and entire genetic programs.

My own team collaborated with scientists at Google on a model of the genome's regulatory code that can be readily applied to any new cell type. There's a lot going on.

What do you want to see happen over the next five years to make AI as useful as possible for genomic discovery?

Ardlie: We need to continue developing resources that focus on perturbations in humans—biological changes that affect health. Take development. We go through many changes as we develop, and in a sense that's a big sort of perturbation. How do we study that process systematically and at scale, and what can it teach us? That's what the next phase of GTEx is working to discover.

Disease is another form of perturbation, one that we often look at just from the perspective of an endpoint. But really it's a process by which cells go from normal to not-normal. What's going on there? We need to gather data systematically across that continuum.

A better understanding of variants' functions will help us to interpret the results of genetic testing. When we screen a patient's genome, we often end up with variants whose significance we can't determine. Many of these might be regulatory variants that could be very consequential in disease, but which we can't yet interpret. We need these data resources and models like AlphaGenome to help us better interpret what these variants are doing.

Bernstein: We have a lot of data from ENCODE and other resources about where transcription factors and other things bind to DNA, which genes are turned on in which cell type, and whatnot. But we don't have massive amounts of data in human cells about variations and perturbations. I'd like to see more data that comes from picking individual cell types and mutagenizing the whole genome to help us decipher the genome's regulatory code.

It's a complicated question, though. What and how much data would we have to generate to power models that could fully understand long-range regulatory events, complex mechanisms, chromatin structures, and conformations that go well beyond transcription factor binding? It kind of boggles the mind.

If we get it right, however, models like AlphaGenome could help us settle a debate about how best to interpret variants' functions: should we drill down on variants one by one, or should we use AI models to explore the rules of the genome in an agnostic, holistic way? I'm excited about figuring that out.
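A computational counterpart to the whole-genome mutagenesis Bernstein describes is in-silico saturation mutagenesis: asking a trained sequence model to score every possible single-base substitution in a window, which yields the kind of dense effect map an experimental screen would measure. Here is a minimal sketch of that scanning loop, assuming a hypothetical model.predict(sequence) interface; it is not any published pipeline.

```python
# In-silico saturation mutagenesis: score every single-base substitution in a
# window with a trained model. `model.predict` is a hypothetical interface.
import numpy as np

BASES = "ACGT"


def saturation_scan(model, window_seq: str) -> np.ndarray:
    """Return an (L, 4) matrix of effects: entry [i, j] is the change in the
    model's readout when position i is mutated to BASES[j] (0 for the ref base)."""
    ref_score = model.predict(window_seq)
    effects = np.zeros((len(window_seq), 4), dtype=np.float32)
    for i, ref_base in enumerate(window_seq):
        for j, alt_base in enumerate(BASES):
            if alt_base == ref_base:
                continue
            mutated = window_seq[:i] + alt_base + window_seq[i + 1:]
            effects[i, j] = model.predict(mutated) - ref_score
    return effects


class GCDummy:
    """Toy model whose readout is GC fraction, for demonstration only."""
    def predict(self, seq: str) -> float:
        return (seq.count("G") + seq.count("C")) / len(seq)


effect_map = saturation_scan(GCDummy(), "ACGTACGTAC")
print(effect_map.shape)  # (10, 4); G/C substitutions at A/T positions score positive
```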

More information: Žiga Avsec et al., AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model, bioRxiv (2025). DOI: 10.1101/2025.06.25.661532
