

Chemical language models don't need to understand chemistry, study demonstrates

Credit: Patterns (2025). DOI: 10.1016/j.patter.2025.101392

Language models are now also being used in the natural sciences. In chemistry, they are employed, for instance, to predict new biologically active compounds. Chemical language models (CLMs) must be extensively trained. However, they do not necessarily acquire knowledge of biochemical relationships during training. Instead, they draw conclusions based on similarities and statistical correlations, as a recent study by the University of Bonn demonstrates. The results have now been published in the journal Patterns.

Large language models are often astonishingly good at what they do, whether that's proving mathematical theorems, composing music, or drafting advertising slogans. But how do they arrive at their results? Do they actually understand what constitutes a symphony or a good joke? It is not so easy to answer that question. "All language models are a black box," emphasizes Prof. Dr. Jürgen Bajorath. "It's difficult to look inside their heads, metaphorically speaking."

Nevertheless, Bajorath, a cheminformatics scientist at the Lamarr Institute for Machine Learning and Artificial Intelligence at the University of Bonn, has attempted to do just that. Specifically, he and his team have focused on a special class of AI algorithm: transformer-based CLMs.

These models work in a similar way to ChatGPT, Google Gemini and Elon Musk's Grok, which are trained on vast quantities of text, enabling them to generate sentences independently. CLMs, on the other hand, are usually based on significantly less data. They acquire their knowledge from molecular representations and relationships, such as the so-called SMILES strings. These are character strings that represent molecules and their structure as a sequence of letters and symbols.
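To make this concrete, the snippet below shows how a SMILES string, here aspirin as an illustrative example, can be split into the individual tokens a language model would process. The tokenizer is a simplified sketch for illustration, not the one used in the study:

```python
import re

# Aspirin (acetylsalicylic acid) written as a SMILES string: plain text
# encoding the molecule's atoms, bonds and ring structure.
aspirin = "CC(=O)Oc1ccccc1C(=O)O"

# Simplified tokenizer: two-letter elements (Cl, Br) and bracketed atoms
# must be matched before single-letter atoms and structural symbols.
TOKEN_PATTERN = re.compile(r"Cl|Br|\[[^\]]+\]|[BCNOPSFIbcnops]|[=#()\d@+\-/\\%]")

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into the tokens a language model would see."""
    return TOKEN_PATTERN.findall(smiles)

print(tokenize_smiles(aspirin))
```

To the model, the ring-closure digits and parentheses are tokens just like the atom symbols; it sees a sequence of characters, not a molecule.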

Systematic manipulation of training data

In drug research, scientists often attempt to identify substances that can inhibit certain enzymes or block receptors. CLMs can be used to predict active molecules based on the amino acid sequences of target proteins. "We used sequence-based molecular design as a test system to better understand how transformers arrive at their predictions," explains Jannik Roth, a doctoral student working with Bajorath.

"After the training phase, if you introduce a new enzyme to such a model, it may produce a compound that can inhibit it. But does that mean that the AI has learned the biochemical principles behind such inhibition?"

CLMs are trained using pairs of amino acid sequences of target proteins and their respective known active compounds. In order to address their research question, the scientists systematically manipulated the training data.
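The training setup described above can be sketched as pairs of (target sequence, active compound). The sequences, compounds, and split function below are invented for illustration and are not the study's actual data or code:

```python
# Hypothetical sketch of sequence-to-compound training pairs:
# (amino acid sequence of the target protein, SMILES of a known active compound).
# Both sequences and compounds here are made up.
training_pairs = [
    ("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "CC(=O)Nc1ccc(O)cc1"),
    ("MVLSPADKTNVKAAWGKVGAHAGEYGAEALERM", "Cc1ccccc1NC(=O)c1ccccc1"),
]

def held_out_split(pairs, held_out_targets):
    """Split pairs so that selected targets are excluded from training.

    The study manipulated the training data by enzyme family; this toy
    function mimics that by holding out a set of target sequences.
    """
    train = [p for p in pairs if p[0] not in held_out_targets]
    test = [p for p in pairs if p[0] in held_out_targets]
    return train, test

train, test = held_out_split(training_pairs, {training_pairs[1][0]})
```

Holding out whole families at training time, then testing on their members, is what lets the researchers separate genuine generalization from pattern matching.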

"For example, we initially only fed the model specific families of enzymes and their inhibitors," explains Bajorath. "When we then used a new enzyme from the same family for testing purposes, the algorithm actually suggested a plausible inhibitor."

However, the situation was different when the researchers used an enzyme from a different family in the test, i.e., one that performs a different function in the body. In this case, the CLM failed to correctly predict active compounds.

Statistical rule of thumb

"This suggests that the model has not learned generally applicable chemical principles, i.e., how enzyme inhibition usually works chemically," says the scientist. Instead, the suggestions are based solely on statistical correlations, i.e., patterns in the data. For example, if the new enzyme resembles a training sequence, a similar inhibitor will probably be active. In other words, similar enzymes tend to interact with similar compounds.

"Such a rule of thumb based on statistically detectable similarity is not necessarily a bad thing," says Bajorath, who leads the area "AI in Life Sciences and Health" at the Lamarr Institute. "After all, it can also help to identify new applications for existing active substances."

However, the models used in the study lacked biochemical knowledge when estimating these similarities. They considered enzymes (or receptors and other proteins) to be similar if 50%–60% of their amino acid sequences matched and, accordingly, suggested similar inhibitors. The researchers could even randomize and scramble the sequences at will without changing the models' suggestions, as long as enough of the original amino acids were retained.
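The similarity threshold mentioned above can be illustrated with a position-wise percent-identity calculation. This is a deliberate simplification assuming pre-aligned sequences, not the alignment-based comparison a real analysis would use:

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Position-wise percent identity between two aligned sequences.

    Simplification: real comparisons align sequences first (allowing
    gaps); here we just compare positions up to the shorter length.
    """
    n = min(len(seq_a), len(seq_b))
    if n == 0:
        return 0.0
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return 100.0 * matches / n

# Two toy 10-residue sequences differing at two positions:
print(percent_identity("MKTAYIAKQR", "MKTGYIAQQR"))  # 80.0
```

A metric like this treats every position equally, which is exactly the blind spot the article describes: it cannot tell a catalytically essential residue from a structurally unimportant one.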

However, often only very specific parts of an enzyme are necessary for it to perform its task. A single amino acid change in such a region can render an enzyme dysfunctional. Other areas are more important for structural integrity and less relevant for specific functions. "During their training, the models did not learn to distinguish between functionally important and unimportant sequence parts," says Bajorath.

Models simply repeat what they have read before

The results of the study therefore show that the transformer CLMs trained for sequence-based compound design lack any deeper chemical understanding, at least for this test system. In other words, they merely recapitulate, with minor variations, what they have already picked up in a similar context during training.

"This does not mean that they are unsuitable for drug research," says Bajorath. "It is quite possible that they suggest drugs that actually block certain receptors or inhibit enzymes."

However, this is certainly not because they understand chemistry so well, but because they recognize similarities in text-based molecular representations and statistical correlations that remain hidden from us. This does not discredit their results. However, they should not be overinterpreted either.

More information: Jannik P. Roth and Jürgen Bajorath, Unraveling learning characteristics of transformer models for molecular design, Patterns (2025). DOI: 10.1016/j.patter.2025.101392

Journal information: Patterns

Provided by University of Bonn

Citation: Chemical language models don't need to understand chemistry, study demonstrates (2025, October 15) retrieved 15 October 2025 from /news/2025-10-chemical-language-dont-chemistry.html
This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.

