Understanding us: Researchers apply algorithm to decode complex genome sequences

Error identification and curation analysis overview. Credit: Genome Biology (2025). DOI: 10.1186/s13059-025-03594-7

Over the last 10 years, breakthroughs in understanding the genetic instructions passed from parent to offspring have put researchers closer than ever before to efficiently decoding DNA with 100% accuracy. However, this analysis approach, called genome sequencing, still poses a challenge for certain regions of the genome.

Portions of the DNA with highly complex replications, combinations and variations still cannot be automatically sequenced with a high level of accuracy. Instead, they typically require time-consuming and expensive manual analysis.

A group of researchers, co-led by two faculty members from the Penn State School of Electrical Engineering and Computer Science, developed a tool to streamline the analysis of these complicated regions, specifically the ones that code for an organism's immune system.

They tested their algorithm, called "CloseRead," on 74 publicly available genome sequences and were able to identify errors in these curated assemblies with more accuracy than other existing verification tools, which are not specialized for these complex regions.

The team published their research in Genome Biology.

Consisting of billions of tiny fragments known as nucleotides, genomes are difficult to sequence fully and accurately, explained Anton Bankevich, assistant professor of computer science at Penn State and co-corresponding author of the paper. The process involves examining which nucleotides appear in the genome to extract the entirety of genetic information stored in the DNA molecule.

"You can imagine a genome like a page from a book with very tiny text on it, so small that you can't read without a magnifying glass," Bankevich said.

"Although we can use the magnifying glass to see the individual words on the page, it is difficult to see how all of the words fit together to make the whole. This is the issue we run into when trying to reconstruct the genome."

According to Bankevich, much of this reconstruction can now be done with algorithms that are trained to reconstruct a complete nucleotide sequence based on many smaller subsequences. However, different errors can occur during reconstruction, many of which can be easily missed by researchers verifying the assemblies, complicating the process.
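The idea of rebuilding one long sequence from many overlapping subsequences can be illustrated with a toy example. The sketch below is a deliberately simple greedy overlap merge, not the method used by production assemblers or by CloseRead; the function names `overlap` and `greedy_assemble` and the `min_len` parameter are hypothetical and chosen only for illustration.

```python
from typing import List


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads: List[str], min_len: int = 3) -> List[str]:
    """Repeatedly merge the pair of reads with the longest overlap.

    Stops when no pair overlaps by at least `min_len` characters and
    returns whatever contiguous pieces remain.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i == j:
                    continue
                k = overlap(a, b, min_len)
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing left to merge
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads


if __name__ == "__main__":
    print(greedy_assemble(["ACGTAC", "TACGGA", "GGATTC"]))
```

A greedy merge like this can join the wrong pieces when the same stretch of text appears in several places, which is one way to see why highly repetitive regions are hard to reconstruct automatically.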

Additionally, mammals are diploid organisms, meaning they carry two sets of genetic information, one from each parent. This adds another layer of complexity to their genome, which can be billions of nucleotides long.

The first human genome was sequenced in 2001 using low-resolution, short-read sequencing, which can only analyze small portions of a genome at a time. In recent years, however, scientists have made rapid developments in the application of long-read sequencing, allowing for much more genetic information to be sequenced at once with unprecedented accuracy.

The breakthroughs in long-read sequencing have caused an "explosion" of mammalian genome sequence data generation, according to Yana Safonova, assistant professor of computer science at Penn State and co-corresponding author of the paper.

Scientists, including Safonova, recently analyzed close relatives of humans and can now better examine the connection between an organism's genetic blueprint, or genotype, and how those genes translate into desirable traits.

While tools exist to help verify these sequences, Safonova explained how portions of the genome are still too complex to be accurately sequenced without extensive manual analysis. She specializes in studying the immunoglobulin (IG) loci, a complex portion of the genome responsible for the production of antibodies.

"The IG loci is responsible for your adaptive immune response, which helps your body recognize and deal with unfamiliar viruses and bacteria," Safonova said.

"This part of the genome is complex with many repeating pieces across the whole structure and is very divergent from individual to individual. This makes it hard to analyze, even for specialists."

The authors developed CloseRead to verify this complicated region in existing genome sequences. They examined the IG loci assembly in the public genome sequences of 61 mammals and 13 reptiles.

The tool scans each nucleotide individually, looking for instances where nucleotides do not perfectly align within the assembly—known as mismatches—and sections of the genome where data is entirely missing from the assembly—known as breaks in coverage. Additionally, CloseRead visualizes the sequences, highlighting any possible errors for researchers reviewing the data and simplifying the verification process.
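The two signals described above, mismatches and breaks in coverage, can be sketched with a small position-by-position scan. This is an illustrative sketch only and does not reproduce CloseRead's actual implementation or interface; the function name `find_flagged_regions`, the input lists, and the thresholds `min_coverage` and `max_mismatch_rate` are all assumptions made for the example.

```python
from typing import List, Tuple


def find_flagged_regions(
    coverage: List[int],
    mismatches: List[int],
    min_coverage: int = 1,
    max_mismatch_rate: float = 0.2,
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Scan an assembly position by position.

    coverage[i]   -- number of reads aligned over assembly position i
    mismatches[i] -- how many of those reads disagree with the assembly at i

    Returns (breaks, mismatch_positions), where breaks is a list of
    half-open intervals [start, end) with coverage below `min_coverage`,
    and mismatch_positions lists covered positions whose mismatch rate
    exceeds `max_mismatch_rate`.
    """
    if len(coverage) != len(mismatches):
        raise ValueError("coverage and mismatches must have the same length")

    breaks: List[Tuple[int, int]] = []
    mismatch_positions: List[int] = []
    start = None  # start of the coverage break currently being extended

    for i, (cov, mis) in enumerate(zip(coverage, mismatches)):
        if cov < min_coverage:
            if start is None:
                start = i
            continue
        if start is not None:
            breaks.append((start, i))
            start = None
        if cov > 0 and mis / cov > max_mismatch_rate:
            mismatch_positions.append(i)

    if start is not None:  # break runs to the end of the assembly
        breaks.append((start, len(coverage)))
    return breaks, mismatch_positions


if __name__ == "__main__":
    cov = [10, 12, 0, 0, 9, 8, 0]
    mis = [0, 6, 0, 0, 1, 0, 0]
    print(find_flagged_regions(cov, mis))
```

In this toy input, positions 2 to 3 and position 6 have no supporting reads, and half of the reads at position 1 disagree with the assembly, so those are the places a reviewer would be pointed to.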

"We found that, surprisingly, there is a lot of incompleteness in the IG loci region, with around 50% of the proposed assemblies appearing either incorrect or incomplete," Bankevich said.

"The most frequent error we found was that while one copy of the genetic material was assembled correctly, the other was assembled incorrectly or missing entirely in mammals. In the IG loci, there is so much complexity that a small error like this could have a big impact on your analysis."

Up until the recent long-read breakthroughs, accurate analysis of the IG loci had been infeasible due to the complexity of the region, which is responsible for resistance to diseases like hepatitis and heart disease or predisposition to auto-immune disorders.

According to Safonova, better understanding of the region will accelerate not just immunogenomics—the study of how the immune system reacts to diseases—and biomedical research, but genetics research and biology as a whole.

"The development of CloseRead is just another part of the collaborative research effort to make accurate genome sequences accessible," Safonova said.

"By comparing the genome of two organisms with their unique exhibited traits, we can better understand the connection between the genotype and the visible traits that appear in mammals. This has been one of the biggest questions in biology, and the data we have available now can really help scientists make these connections."

Additionally, deeper understanding of the IG loci region could shed light on the genetic history of species, Safonova said. The team conducted case studies on three randomly selected mammalian species, including the Greenland wolf, a subspecies of gray wolf native to Greenland.

"When we examined the Greenland wolf genome, we saw what looked like errors in the assembly," Safonova said. "Upon further review, however, we found that the assemblies were correct and served as evidence that the Greenland wolf actually crossbred with gray wolves a long time ago."

Although CloseRead was developed to specifically target the IG loci, with further development, it could be applied to other complex regions of genomes—like the Y-chromosome—that have eluded genetics researchers for years. While the hope is to eventually eliminate the need for manual review, Bankevich said the technology is not there quite yet.

"This is a cautionary story—we have incredible potential for genome sequence reconstruction, but we have to be careful with what we use in our analysis," Bankevich said.

"Genome assembly without fine curation is currently not perfect. CloseRead assists with verifying information in complex regions, but the data still needs to be analyzed, and we need to keep this in mind when reviewing the new genome sequences being published."

More information: Yixin Zhu et al, CloseRead: a tool for assessing assembly errors in immunoglobulin loci applied to vertebrate long-read genome assemblies, Genome Biology (2025).

Journal information: Genome Biology

Citation: Understanding us: Researchers apply algorithm to decode complex genome sequences (2025, June 12) retrieved 12 June 2025 from /news/2025-06-algorithm-decode-complex-genome-sequences.html