

AI could one day replace tutors, but its reliability still lags

Datasets used to train AI algorithms may underrepresent older people. Credit: Pixabay/CC0 Public Domain

Artificial intelligence has become an integral part of many people's everyday lives. Large language models (LLMs) such as ChatGPT, Gemini or Copilot write letters and term papers for them, give tips for excursions on holiday or answer questions on every conceivable topic.

The use of AI has also long been routine in many areas at universities. But to what extent can LLMs support students in the natural sciences as unsupervised tutors? A research team at Julius-Maximilians-Universität Würzburg (JMU) has now investigated this question. The team's results are available on the arXiv preprint server.

A freely accessible evaluation tool

The research group from the Department of Physical Chemistry, which has so far mainly conducted research into the spectroscopy of nanomaterials, has now developed a tool that tests the thermodynamic understanding of modern LLMs—in particular, whether their skills go beyond mere factual knowledge. The tool, called UTQA (Undergraduate Thermodynamics Question Answering), is freely accessible and is intended to support teachers and researchers in evaluating LLMs in a fair and subject-specific way—and to make progress measurable.

"Our wish is that AI will one day be able to support us as an unsupervised partner in teaching—for example, in the form of competent chatbots that respond individually to the needs of each student in the preparation and follow-up of lectures. We're clearly not there yet, but the progress is breathtaking," says project manager Professor Tobias Hertel.

"With UTQA, we show where current language models are already convincing and where they systematically fail—this is exactly what lecturers need in order to be able to plan their use in teaching responsibly."

Born out of teaching

Hertel's team has been using LLMs in the thermodynamics lecture with over 150 students for weekly knowledge checks since the winter semester of 2023. Models such as ChatGPT-3.5 and ChatGPT-4 showed their strengths, but also clear weaknesses.

This led to the desire for a subject-specific benchmark: "UTQA therefore comprises 50 challenging single-choice tasks from the basic thermodynamics lecture—two thirds text-based, one third with diagrams and sketches, as is typical for didactic exercises," explains Hertel.
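The article does not describe UTQA's data format or scoring procedure. As a rough illustration only, a single-choice benchmark of this kind might be scored as follows; the item schema and the trivial stand-in "model" are assumptions for the sketch, not the actual UTQA code:

```python
# Sketch: scoring a single-choice benchmark (illustrative, not the UTQA implementation).

def score(items, answer_fn):
    """Return the fraction of items answered correctly.

    items: list of dicts with 'question', 'choices' (key -> text),
           and 'answer' (the key of the correct choice)
    answer_fn: callable mapping (question, choices) -> chosen key,
               standing in for a call to an LLM
    """
    correct = sum(
        1
        for item in items
        if answer_fn(item["question"], item["choices"]) == item["answer"]
    )
    return correct / len(items)


# Toy items and a trivial "model" that always answers "A"
items = [
    {
        "question": "Is entropy a state function?",
        "choices": {"A": "yes", "B": "no"},
        "answer": "A",
    },
    {
        "question": "Is heat a state function?",
        "choices": {"A": "yes", "B": "no"},
        "answer": "B",
    },
]

accuracy = score(items, lambda question, choices: "A")
print(f"accuracy = {accuracy:.0%}")  # prints "accuracy = 50%"
```

With 50 tasks, each question contributes two percentage points to the overall accuracy, which is why thresholds such as the group's 95% bar correspond to at most two or three errors across the whole set.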

The aim was not only to test factual knowledge and definitions, but also to test the language models' ability to link different boundary conditions in a targeted manner and to understand complex process sequences.

Results: Solid, but not (yet) reliable enough

According to Hertel, the test of the best-performing models of 2025 paints a clear picture: on UTQA, no model reached the accuracy of 95% that the research group requires for unsupervised use as an AI tutor. Even GPT-o3, the leading model in many benchmarks, achieved only 82% overall accuracy.

"Two weaknesses were noticeable: Firstly, the models consistently had difficulties with so-called irreversible processes, where the speed of the state change influences the outcome. Secondly, there were clear deficits in tasks that required image interpretation," says the scientist.

A look at history shows that this is not surprising. Around 100 years ago, the French physicist Pierre Duhem described reversibility as one of the most difficult concepts in thermodynamics. That LLMs have trouble interpreting diagrams is also unsurprising, since the perception and processing of visual content is among the outstanding cognitive strengths of humans.

Not good enough for unsupervised use yet

"In practice, this means that LLMs can already be very useful in teaching with or without supervision—but not yet enough to be used as unsupervised tutors," says Hertel. "At the same time, we have seen enormous progress in the last two years. We are therefore confident that—provided development does not suddenly come to a standstill—the expertise required for teaching assistants in our discipline can soon be achieved."

Hertel is particularly pleased that two student teachers were significantly involved in the research project, contributing their specialized didactic perspectives. Luca-Sophie Bien created an initial German version of many of the tasks; Anna Geißler translated and expanded the collection for international use.

Why thermodynamics?

According to Hertel, thermodynamics is ideal for testing the models' understanding and reasoning ability.

"It is fundamental to our understanding of nature, has compact basic laws, but in application requires a precise distinction between state and process variables, heat or work, and reversible or irreversible processes. This is precisely where reasoning ability is separated from mere memorization," says the physical chemist.

As a next step, the team is now planning to expand the tool to include real gases, mixtures, phase diagrams and standard cycles. The aim is to cover further concepts that are central to teaching.

"The better models can handle multimodal binding, i.e. the combination of text and images, as well as irreversible regimes, the closer we get to reliable, subject-sensitive AI tutorials," says Hertel.

More information: Anna Geißler et al, From Canonical to Complex: Benchmarking LLM Capabilities in Undergraduate Thermodynamics, arXiv (2025).

Journal information: arXiv

Citation: AI could one day replace tutors, but its reliability still lags (2025, September 6) retrieved 6 September 2025 from /news/2025-09-ai-day-reliability-lags.html
