LLMs can predict educational and psychological outcomes from childhood essays with remarkable accuracy

by Ingrid Fadelli, contributing writer; edited by Gaby Clark, scientific editor, and Robert Egan, associate editor

Large language models (LLMs), advanced artificial intelligence (AI) models trained to analyze and generate texts in different human languages, have become increasingly widespread over the past few years. Since the release of the conversational platform ChatGPT, which relies on different versions of an LLM called GPT, these models have become widely used by individuals worldwide, while also making their way into some professional and research settings.
Tobias Wolfram, a researcher with a Ph.D. in Sociogenomics from Bielefeld University, recently carried out a study assessing the extent to which LLMs can predict people's educational and psychological outcomes by analyzing essays they wrote during childhood. His findings, published in Communications Psychology, suggest that some computational models can predict these outcomes with an accuracy comparable to that of teacher assessments and substantially better than genetic data.
"During my undergraduate studies, I was already fascinated by any type of data that deviated from the standard survey questions common in the social and behavioral sciences as part of the back then ongoing computational social science revolution," Wolfram told Âé¶¹ÒùÔº.
"I conducted network analyses, scraped web data and eventually got into natural language processing. However, I soon realized how limited the tools available at the time were. That was around 2014–16, so long before large language models became a thing. I followed the progress over the coming years from a distance."
In 2020, when Wolfram started his Ph.D. in Sociogenomics, LLMs had only recently been introduced, following the public release of the GPT-2 and GPT-3 models. Around the same time, he also uncovered a dataset well suited to sociological research, containing educational and psychology-related information for a large cohort of individuals born in the 1950s.
"Thousands of participants, extensively surveyed for decades? That in itself was already exciting enough, but then finding the essays these people wrote at age 11, which at that time had just been digitized—I knew this was a one-of-a-kind chance," said Wolfram. "Just reading them, you immediately realize the tremendous variation in complexity and sophistication, in length, scope and correct spelling and grammar.
"To a human eye, it was immediately obvious, but how well could you quantify this? And what does it mean for your life? Does it predict other things we care about, like cognitive ability or education? Fortunately, I got incredibly generous support from my Ph.D. advisor, Felix Tropf, and Charles Rahal at Oxford, who convinced me to and helped me spend the actual time developing my first analyses into a full paper."
Inspired by the dataset he found, Wolfram started working on this recent study. First, he tried to determine whether he could quantify the information contained in the childhood essays using recently developed computational tools.
"My main approach was to use a large language model—specifically, a technology similar to what underlies tools like ChatGPT—to analyze the roughly 250-word essays the children wrote at age 11," explained Wolfram.
"I used a model to convert each essay into a complex numerical profile, known as a 'text embedding,' which captures its meaning and style across over 1,500 dimensions. I also extracted over 500 other metrics that measured things like lexical diversity, sentence complexity, readability, and even the number of grammatical errors."
After extracting this data from the essays, Wolfram trained a machine learning model to make predictions based on this extracted data. For the purpose of his study, he decided to employ an ensemble machine learning model known as a "SuperLearner."

"You can think of it as a master model that intelligently combines the predictions from several different algorithms—like Random Forest, Neural Networks, and Support Vector Machines, to produce the most accurate final prediction possible," said Wolfram. "To evaluate how well these models worked, I used 10-fold cross-validation, where I would train the model on one part of the data and test it on a part it had never seen before."
To assess the extent to which the machine learning models predicted educational and psychological outcomes, the author primarily relied on a metric referred to as "predictive holdout R²." This metric essentially quantifies how much of the variation in an outcome (e.g., a person's cognitive ability) a model can explain in new data, compared to just guessing an average value.
For example, a score of 0.6 on this metric would suggest that a model could explain 60% of the variance. Using this approach, Wolfram was able to assess a model's true predictive power, as opposed to how well it could summarize the data it was trained on.
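As a rough illustration, holdout R² can be computed as one minus the ratio of the model's squared errors to the squared errors of simply predicting the mean. The snippet below is a minimal sketch of that standard out-of-sample calculation, not the paper's exact evaluation code.

```python
# Minimal sketch of out-of-sample (holdout) R^2; not the paper's evaluation code.
import numpy as np
from sklearn.metrics import r2_score

def holdout_r2(y_true, y_pred):
    """1 - SS_residual / SS_total: the fraction of variance explained,
    relative to always guessing the mean of the held-out outcomes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# A value of 0.6 would mean 60% of the variance in unseen data is explained.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(holdout_r2(y_true, y_pred))   # 0.98
print(r2_score(y_true, y_pred))     # same value via scikit-learn
```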
"A natural benchmark for the model's performance seemed to be the set of relatively detailed teacher evaluations on all the participants that were given at the same time as the essays were written," said Wolfram.
"In comparison, it is truly surprising how much variation just these ultra-short essays are able to predict in cognition and education—they are basically on par with a survey assessment of an education professional who often knew these children for years. And again—these essays were on average just 250 words in length, written at age 11."
Overall, the findings of this recent study suggest that LLMs and other advanced machine learning models hold great potential for making accurate predictions based on textual data. In addition, they underscore the value of rich texts, such as essays and personal writings, showing that important information about the writer can be derived from them.
"This project took almost five years to be published, even though the main analyses were rather straightforward," said Wolfram.
"My actual Ph.D. was much more focused on questions at the intersection of social stratification research, differential psychology and genomics and since I defended my thesis last year and left academia, I will likely not have the opportunity to follow up on this project, but if I did, a natural extension (using the same dataset) would be to get access to the actual raw text scans.
"All I did was based on the digitized text, but given the powerful multimodal models available today, I would expect that incorporating things like handwriting, etc., might give us even further information."
Notably, at the time Wolfram carried out his study, LLMs and other machine learning models were not as advanced as they are today. Given how rapidly these models are improving, similar studies employing more recent computational models could yield even better predictions.
"The whole approach of the paper was still very much based in the 'old-school' machine learning and data science tradition of having a set of examples to train a model on and then validate on data the model did not see during training," added Wolfram.
"Nowadays, it would of course be natural to simply prompt an LLM using a chat interface without giving it any training data at all. I would not be surprised if such an approach outperforms the results in the paper, a testament to how quickly things are moving."
More information: Tobias Wolfram, Large language models predict cognition and education close to or better than genomics or expert assessment, Communications Psychology (2025).
Journal information: Communications Psychology
© 2025 Science X Network