This article has been reviewed according to Science X's and . have highlighted the following attributes while ensuring the content's credibility:
fact-checked
peer-reviewed publication
trusted source
proofread
Prominent chatbots routinely exaggerate science findings, study shows

When summarizing scientific studies, large language models (LLMs) like ChatGPT and DeepSeek produce inaccurate conclusions in up to 73% of cases, according to a study by Uwe Peters (Utrecht University) and Benjamin Chin-Yee (Western University, Canada/University of Cambridge, UK). The researchers tested the most prominent LLMs and analyzed thousands of chatbot-generated science summaries, revealing that most models consistently produced broader conclusions than those in the summarized texts.
Surprisingly, prompts for accuracy increased the problem and newer LLMs performed worse than older ones.
The work is in the journal Royal Society Open Science.
Almost 5,000 LLM-generated summaries analyzed
The study evaluated how accurately ten leading LLMs, including ChatGPT, DeepSeek, Claude, and LLaMA, summarize abstracts and full-length articles from top science and medical journals (e.g., Nature, Science, and The Lancet). Testing LLMs over one year, the researchers collected 4,900 LLM-generated summaries.
Six of ten models systematically exaggerated claims found in the original texts, often in subtle but impactful ways; for instance, changing cautious, past-tense claims like "The treatment was effective in this study" to a more sweeping, present-tense version like "The treatment is effective." These changes can mislead readers into believing that findings apply much more broadly than they actually do.
Accuracy prompts backfired
Strikingly, when the models were explicitly prompted to avoid inaccuracies, they were nearly twice as likely to produce overgeneralized conclusions than when given a simple summary request.
"This effect is concerning," Peters said. "Students, researchers, and policymakers may assume that if they ask ChatGPT to avoid inaccuracies, they'll get a more reliable summary. Our findings prove the opposite."
Do humans do better?
Peters and Chin-Yee also directly compared chatbot-generated to human-written summaries of the same articles. Unexpectedly, chatbots were nearly five times more likely to produce broad generalizations than their human counterparts.
"Worryingly," said Peters, "newer AI models, like ChatGPT-4o and DeepSeek, performed worse than older ones."
Reducing the risks
The researchers recommend using LLMs such as Claude, which had the highest generalization accuracy, setting chatbots to lower "temperature" (the parameter fixing a chatbot's "creativity"), and using prompts that enforce indirect, past-tense reporting in science summaries.
Finally, "If we want AI to support science literacy rather than undermine it," Peters said, "we need more vigilance and testing of these systems in science communication contexts."
More information: Uwe Peters et al, Generalization bias in large language model summarization of scientific research, Royal Society Open Science (2025).
Journal information: The Lancet , Royal Society Open Science , Science , Nature
Provided by Utrecht University