Chemistry LLM developed for faster drug discovery

Gaby Clark
scientific editor

Robert Egan
associate editor

Southwest Research Institute scientists and engineers have developed a custom large language model (LLM) to accelerate drug design and discovery.
A multidisciplinary team developed the Generative Approaches for Molecular Encodings (GAMES) LLM to generate Simplified Molecular Input Line Entry System (SMILES) strings. SMILES is an industry-standard notation that represents the structure of a molecule as a short string of text characters to facilitate storage, retrieval and modeling. Researchers trained GAMES to understand and generate valid new SMILES strings.
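To make the notation concrete, here is a minimal Python sketch of what a SMILES string looks like and how a generated string might be screened for obvious syntax errors. It is not the GAMES validation method, which the article does not describe. The function names looks_like_smiles and valid_fraction are hypothetical, and the check covers only balanced branches, closed bracket atoms and paired single-digit ring closures. A real pipeline would most likely rely on a cheminformatics toolkit that also checks valence and aromaticity.

```python
def looks_like_smiles(s: str) -> bool:
    """Rough syntactic screen for a SMILES string (hypothetical helper).

    Checks only that branch parentheses balance, bracket atoms are closed,
    and every ring-closure digit appears an even number of times. It does
    NOT verify valence, aromaticity, or element symbols, and it treats a
    two-digit '%nn' closure as two separate digits.
    """
    if not s:
        return False
    depth = 0            # current nesting depth of branch parentheses
    open_rings = set()   # ring-closure digits seen an odd number of times
    in_bracket = False   # inside a bracket atom such as [13CH4] or [NH4+]
    for ch in s:
        if in_bracket:
            # Digits inside brackets are isotopes, H counts, or charges.
            if ch == "]":
                in_bracket = False
            continue
        if ch == "[":
            in_bracket = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
        elif ch in "0123456789":
            open_rings ^= {ch}  # toggle: first sight opens, second closes
    return depth == 0 and not open_rings and not in_bracket


def valid_fraction(candidates):
    """Fraction of candidate strings that pass the rough screen."""
    if not candidates:
        return 0.0
    return sum(looks_like_smiles(c) for c in candidates) / len(candidates)


if __name__ == "__main__":
    # Well-known molecules written as SMILES.
    examples = {
        "ethanol": "CCO",
        "benzene": "c1ccccc1",
        "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    }
    for name, smiles in examples.items():
        print(name, smiles, looks_like_smiles(smiles))
    # An unclosed ring and an unclosed branch both fail the screen.
    print(looks_like_smiles("c1ccccc"), looks_like_smiles("CC(=O"))
```

A screen like this rejects malformed strings but cannot confirm that a string describes a chemically sensible molecule, so it is best read as an illustration of the text format rather than a validity test.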
"This project demonstrates a systematic way to build databases and networks of molecules for AI processing and comparison using only language," said Institute Scientist Dr. Jonathan Bohmann, lead developer of SwRI's Rhodium molecular docking software designed to virtually screen drug compounds.
Rhodium software uses descriptors along with graphical processing to visualize the chemical properties of compounds. Incorporating GAMES into the Rhodium workflow offers a faster generalized approach to drug discovery and design.
"Using LLMs, we can directly apply machine learning and AI to molecules via SMILES strings, because they appear as readable text characters and don't require translation into abstract representations," Bohmann said.
SwRI trained the GAMES model with classes of carbon-based molecules and other reference compounds to validate and fine-tune the SMILES strings it generated.

"This project showcases the power of training LLMs in highly technical scientific domains to focus on specific tasks," said SwRI Lead Computer Scientist Michael Hartnett. "In this case, we are working in the drug discovery domain, and our fine-tuning is focused on unlocking the most relevant knowledge."
GAMES combines LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) techniques to fine-tune LLMs efficiently, reducing the hardware and energy needed to run Rhodium models. The team hopes to extend this approach to other applications and domains across the Institute.
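The hardware savings come from how LoRA represents a weight update. Instead of retraining a full weight matrix, it trains two much smaller matrices, B and A, whose product scaled by alpha / rank is added to the frozen original. QLoRA additionally keeps the frozen base weights in a quantized low-precision form. The Python sketch below illustrates the arithmetic only. The layer size (4096 x 4096), the rank (8) and the function names are assumptions chosen for illustration, not details of the GAMES configuration.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_update_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA update: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


def lora_delta(B, A, alpha: float, rank: int):
    """Return the low-rank update (alpha / rank) * B @ A as nested lists."""
    scale = alpha / rank
    rows, cols = len(B), len(A[0])
    return [
        [scale * sum(B[i][k] * A[k][j] for k in range(rank)) for j in range(cols)]
        for i in range(rows)
    ]


if __name__ == "__main__":
    # Illustrative sizes only (assumed, not taken from the GAMES project).
    d_out, d_in, rank = 4096, 4096, 8
    full = full_update_params(d_out, d_in)
    lora = lora_update_params(d_out, d_in, rank)
    print(f"full update: {full:,} parameters")
    print(f"LoRA update: {lora:,} parameters ({full // lora}x fewer)")

    # Tiny rank-1 example: a 2 x 2 update built from a 2 x 1 and a 1 x 2 matrix.
    B = [[1.0], [2.0]]
    A = [[3.0, 4.0]]
    print(lora_delta(B, A, alpha=2.0, rank=1))
```

For these assumed sizes the adapter trains 65,536 parameters per layer instead of 16,777,216, which gives a sense of why the approach lowers memory and energy requirements during fine-tuning.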
"Using LLMs to generate accurate SMILES could transform the drug discovery process, especially when trained using specific datasets," said SwRI Research Scientist Daniel Hinojosa. "The fine-tuned techniques significantly improved performance, increasing the number of valid SMILES while reducing invalid outputs. Structured datasets and specific training techniques were key to this accomplishment."
Researchers hope GAMES will offer a powerful framework for ranking the compounds in chemical libraries by drug-likeness, a shorthand term for the combination of properties that makes a compound most likely to be approved as a safe drug. They also plan to explore chemical landscapes systematically through testing. Hinojosa and Bohmann plan to pursue additional internal funding to advance the next phase of the project.
"While we're in early stages of development, the results are already having a direct impact on ongoing research programs at SwRI," Bohmann said.
Provided by Southwest Research Institute