

For the first time, scientists have access to a comprehensive data set for identifying unknown compounds

To acquire the spectral library data, a pipetting robot is used to prepare mixtures of 10 chemical compounds in plates, and the mass spectrometer then analyzes each mixture for about 90 seconds. During this time, the spectrometer collects all the needed spectra and the analysis can move on to the next mixture of compounds. This efficient procedure makes it possible to collect spectra for about 3,000 substances per day. Credit: Nature Methods (2025). DOI: 10.1038/s41592-025-02813-0

Scientists from the laboratory of Dr. Tomáš Pluskal at IOCB Prague are helping colleagues around the world identify previously unknown compounds. They have created an extensive library called MSnLib, which contains several million records showing how small molecules "break apart" when measured by mass spectrometry.

Until now, comparable databases have expanded only very slowly, but thanks to a new approach developed at IOCB Prague, data on unknown molecules can now be obtained in a matter of minutes.

This opens the potential for faster drug discovery, better environmental monitoring, and further advances in artificial intelligence for biomedicine.

An article about the library has been published in the journal Nature Methods.

Credit: Institute of Organic Chemistry and Biochemistry of the CAS

Mass spectrometry reveals the composition of chemical substances and is a key tool in medicine, pharmacy, and environmental research. The instrument breaks a compound into smaller parts, and from these fragments scientists determine the structure of the original molecule.

Fragment spectra, which can be imagined as a fingerprint unique to each substance, are compared with already known spectra stored in libraries. However, existing databases have covered only a limited number of known compounds, making the search considerably more difficult.
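The library lookup described above can be illustrated with a minimal sketch. It assumes a simple binned cosine similarity as the comparison score, which is a common textbook choice and not necessarily the scoring used by the authors or by any particular software. The compound names, peak lists, and the `bin_width` parameter are hypothetical and exist only for illustration.

```python
import math

def bin_spectrum(peaks, bin_width=0.1):
    """Sum the intensities of (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = {}
    for mz, intensity in peaks:
        key = round(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned

def cosine_similarity(peaks_a, peaks_b, bin_width=0.1):
    """Cosine similarity of two binned spectra: 1.0 is identical, 0.0 shares no bins."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_match(query_peaks, library):
    """Return (name, score) of the most similar library entry, or None if the library is empty."""
    if not library:
        return None
    scored = ((name, cosine_similarity(query_peaks, peaks)) for name, peaks in library.items())
    return max(scored, key=lambda pair: pair[1])

# Hypothetical toy library: invented names and made-up peak lists, not real reference spectra.
TOY_LIBRARY = {
    "compound_A": [(77.0, 30.0), (105.0, 100.0), (122.0, 45.0)],
    "compound_B": [(91.0, 100.0), (119.0, 20.0)],
}

# Hypothetical query spectrum resembling compound_A with slightly different intensities.
TOY_QUERY = [(77.0, 28.0), (105.0, 100.0), (122.0, 50.0)]

name, score = best_match(TOY_QUERY, TOY_LIBRARY)
print(name, round(score, 3))
```

A query can only be matched to a compound that is already in the library, which is why limited library coverage makes identification so much harder.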

Pluskal and his team have moved the development of spectral libraries significantly forward. At the time they prepared their study for Nature Methods, they had compiled a catalog of thirty thousand compounds. For these, they recorded two million high-quality spectra, and they did not settle for a rough picture.

Through multistage fragmentation (MSn), i.e. repeated breaking of molecules, they obtained a more detailed view of their internal structure. Such a comprehensive data set is available to the scientific world for the first time.

Pluskal explains, "During the twenty years I've worked in this field, spectral libraries have not expanded much. We managed to change this practice and created the largest database currently in existence. Moreover, we've made it openly available to the global scientific community."

The researchers also substantially accelerated the analysis itself. They can measure ten compounds at once, and the entire process takes only a minute and a half. Because Pluskal's team is exceptionally well known and active in the global scientific community, they have received thousands of compounds as gifts from companies and institutions.

"Since writing the article in Nature Methods, we've advanced further. So far, we've processed about 70,000 compounds, and we have another 150,000 awaiting analysis. We continue uploading data online, and by the end of the year we'd like to reach 200,000 measured compounds. That's roughly 10 times more than has been available over the past 20 years," says the first author of the article, Dr. Corinna Brungs.
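The throughput figures quoted in this article can be checked with quick arithmetic. The sketch below assumes that mixtures are measured back to back with no overhead between runs and that the daily rate of about 3,000 compounds stays constant. Both are simplifying assumptions, and the function names are hypothetical.

```python
import math

COMPOUNDS_PER_MIXTURE = 10   # from the article: ten compounds are measured at once
SECONDS_PER_MIXTURE = 90     # from the article: about a minute and a half per mixture
COMPOUNDS_PER_DAY = 3000     # from the article: about 3,000 substances per day

def instrument_hours(n_compounds):
    """Acquisition time in hours, assuming back-to-back runs with no overhead."""
    mixtures = math.ceil(n_compounds / COMPOUNDS_PER_MIXTURE)
    return mixtures * SECONDS_PER_MIXTURE / 3600

def days_for_backlog(n_compounds):
    """Measurement days needed at the quoted daily rate, rounded up."""
    return math.ceil(n_compounds / COMPOUNDS_PER_DAY)

hours_per_day = instrument_hours(COMPOUNDS_PER_DAY)
backlog_days = days_for_backlog(150_000)
print(hours_per_day, backlog_days)
```

Under these assumptions, 3,000 compounds correspond to 300 mixtures and 7.5 hours of acquisition time per day, and the 150,000 compounds awaiting analysis would take about 50 measurement days.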

Pluskal and his colleagues are also using the enormous amount of new data to improve AI algorithms that autonomously recognize unknown chemical substances, from metabolites to compounds in plants and microorganisms.

Scientists "feed" the machine learning model with data from the chemical library. The more data it receives, the more accurately the model can predict, based on the supplied spectrum, what the molecule behind the spectrum might look like.

The spectral library was created using the open-source software "mzmine," which enabled automated processing of a vast number of measurements. As a result, the resource is not only extensive but also easily usable for further scientific projects worldwide.

More information: MSnLib: efficient generation of open multi-stage fragmentation mass spectral libraries, Nature Methods (2025). DOI: 10.1038/s41592-025-02813-0

Journal information: Nature Methods

Citation: For the first time, scientists have access to a comprehensive data set for identifying unknown compounds (2025, September 16) retrieved 16 September 2025 from /news/2025-09-scientists-access-comprehensive-unknown-compounds.html
