New AI models enhance protein data analysis for medical research

Researchers have developed new AI models that can vastly improve accuracy and discovery within protein science. The models could assist the medical sciences in overcoming present challenges within personalized medicine, drug discovery, and diagnostics.
In the wake of the widespread availability of AI tools, most fields in the technical and natural sciences are advancing rapidly. This is particularly true in biotechnology, where AI models power breakthroughs in drug discovery, precision medicine, gene editing, food security, and many other research areas.
One sub-field is proteomics—the study of proteins on a large scale—where vast amounts of protein data are gathered in databases against which a sample can be compared. These databases enable scientists to discern which proteins—and, thereby, microorganisms—are present in a sample. They allow a doctor to diagnose diseases, monitor the effectiveness of a treatment, or identify pathogens present in a patient's sample.
Although these tools are useful and effective, there are limits to what they can do, says Timothy Patrick Jenkins, an Associate Professor at DTU Bioengineering and corresponding author:
"First off, no database includes everything, so you need to know which databases are relevant to your particular needs. Then deep searches are very time-consuming and demand a lot of computer power. And, finally, it's nearly impossible to identify proteins that haven't been registered yet."
For this reason, some groups have worked on so-called "de novo sequencing algorithms," which are meant to improve accuracy and lower computational costs as database size increases. Still, according to Jenkins and colleagues from DTU, Delft University in the Netherlands and the British AI company InstaDeep, the performance of these algorithms has remained "underwhelming."
Exceeding state-of-the-art
In a paper in Nature Machine Intelligence, they propose two novel AI models to help researchers, medical practitioners, and commercial entities find exactly the information they need in the vast amounts of data. The models are called InstaNovo and InstaNovo+ and are available to researchers through the InstaDeep website.
"Seen together, our models exceed state-of-the-art and are significantly more precise than currently available tools. Furthermore, as we show in the paper, our models are not specific to a particular research area. Instead, these tools could propel significant advances in all fields involving proteomics," says Kevin Michael Eloff, a research engineer at InstaDeep and co-first author of the paper.
To assess the usefulness of their models, the researchers have trained and tested them on several specific tasks within major areas of interest.
One investigation was conducted on wound fluid from patients with venous leg ulcers. Since venous leg ulcers are notoriously difficult to treat and often become chronic, knowing which microorganisms, such as bacteria, are present is crucial to treatment.
The models could map 10 times as many sequences as a database search, including those of E. coli and Pseudomonas aeruginosa—the latter being a multidrug-resistant bacterium.
Another test focused on small pieces of protein, called peptides, displayed on the surface of cells. These help the immune system recognize infections and diseases such as cancer. The InstaNovo models identified thousands of new peptides that were not found using traditional methods.
In personalized cancer treatments that empower the immune system, an approach known as immunotherapy, these peptides are all potential targets for attack.
"In combination, our tests of the model on complex cases, where, for example, unknown proteins are present, or where we have no prior knowledge of the organisms involved, show that they are suitable to improve our understanding significantly. That this bodes well for biomedicine is a given, since it can directly improve identification of our microbiome, as well as improve our efforts within personalized medicine and cancer immunology," says Konstantinos Kalogeropoulos, co-first author and Assistant Professor at DTU Bioengineering.
The paper provides six additional cases that demonstrate how these models improve therapeutic sequencing, discover novel peptides, detect unreported organisms, and significantly enhance proteomics searches. The implications of their results extend far beyond the medical sciences, says Timothy Patrick Jenkins:
"Looking at it from a purely technical, scientific perspective, it is also true that, with these tools, we can improve our understanding of the biological world as a whole, not only in terms of health care, but also in industry and academia.
"Within every field using proteomics—be it plant science, veterinary science, industrial biotech, environmental monitoring, or archaeology—we can gain insights into protein landscapes that have been inaccessible until now."
More information: Kevin Eloff et al, InstaNovo enables diffusion-powered de novo peptide sequencing in large-scale proteomics experiments, Nature Machine Intelligence (2025).
Journal information: Nature Machine Intelligence
Provided by Technical University of Denmark