AI-detection software isn't the solution to classroom cheating—assessment has to shift
Two years since the release of ChatGPT, teachers and institutions are still struggling with generative artificial intelligence (AI).
Some have banned AI outright. Others have embraced it, or have called for teachers to adapt their practice.
The result is a patchwork of responses, leaving many kindergarten to Grade 12 and post-secondary teachers to make decisions about AI use that may conflict with the teacher next door, institutional policies, or current research on what AI can and cannot do.
One response has been to adopt AI detection tools, which rely on algorithms to try to identify how a specific text was generated.
AI detection tools have obvious appeal. But they're an imperfect solution at best, and they do nothing to address the core validity problem of designing assessments where we can be confident in what students know and can do.
Teachers using AI detectors
One study, based on a survey of teachers in the United States, reported that 68% of teachers use AI detectors.
This practice has also found its way into some Canadian schools and post-secondary institutions.
AI detectors vary in their methods. Two common approaches are to check for qualities described as "burstiness," referring to alternating short and long sentences (the way humans tend to write), and complexity (or "perplexity"). If an assignment does not have the typical markers of human-generated text, the software may flag it as AI-generated, prompting the teacher to begin an investigation for academic misconduct.
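As an illustration of the general idea (and not any vendor's actual algorithm), here is a minimal sketch of a "burstiness"-style heuristic: it scores how much sentence lengths vary, on the theory that human prose mixes short and long sentences. Real detectors rely on far more sophisticated, model-based measures such as token-level perplexity.

```python
import statistics

def burstiness_score(text: str) -> float:
    """Naive burstiness proxy: variability of sentence lengths.

    Human prose tends to alternate short and long sentences, so a
    higher coefficient of variation suggests more "bursty" writing.
    This toy heuristic is illustrative only.
    """
    # Crude sentence split on terminal punctuation
    normalized = text.replace("!", ".").replace("?", ".")
    sentences = [s.strip() for s in normalized.split(".") if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0  # not enough sentences to measure variation
    return statistics.stdev(lengths) / statistics.mean(lengths)

uniform = "The cat sat down. The dog sat down. The bird sat down."
varied = "Stop. The storm rolled in fast and flattened the wheat across the valley. We ran."
print(f"uniform-looking text:    {burstiness_score(uniform):.2f}")  # low score
print(f"varied, human-like text: {burstiness_score(varied):.2f}")   # higher score
```

A toy score like this makes the weakness obvious: a student who simply varies sentence length, or edits AI output lightly, shifts the number, which is partly why accuracy varies so much in the studies below.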
To its credit, AI detection software is more reliable than human detection. Repeated studies show humans, experts and novices alike, are incapable of reliably distinguishing AI-generated text from human writing.
Accuracy of detectors varies
While some detectors perform poorly or inconsistently, others seem to be more successful. However, what success rates should really signal for educators is questionable.
Turnitin boasts that its AI detector has a 99% success rate vis-à-vis its false positive rate (that is, the number of human-generated submissions the tool incorrectly flags as AI-generated). This accuracy has been challenged by a recent study that found Turnitin detected only a portion of the AI-generated text it was shown.
The same study suggested how different factors could shape accuracy results. For example, GPTZero's accuracy declined, especially if students edit the output an AI tool generates. Yet a different study of the same detector reported widely varying results (for example, ranges of between 23% and 82% accuracy or between 74% and 100% accuracy).
Considering numbers in context
The value of a percentage depends on its context. In most courses, being correct 99% of the time is exceptional. It's above the most common threshold for statistical confidence, which is often set at 95%.
But a 99% success rate would be atrocious in air travel. There, with roughly 100,000 commercial flights taking off worldwide each day, a 99% success rate would mean on the order of 1,000 failed flights daily. That level of failure would be unacceptable.
To suggest what this could look like: at an institution like mine, the University of Winnipeg, roughly 10,000 students submit multiple assignments (we could ballpark five, for argument's sake) for around five courses every year.
That would be about 250,000 assignments every year. Even a 99% success rate means roughly 2,500 failures. That's 2,500 false positives where students did not use ChatGPT or other tools, but the AI detection software flags them for possible use of AI, potentially initiating hours of investigative work for teachers and administrators alongside stress for students who may be wrongly accused.
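To make the arithmetic explicit, here is a minimal sketch; the enrollment and assignment counts are the ballpark assumptions from above, not official figures.

```python
# Back-of-the-envelope false-positive arithmetic (ballpark figures)
students = 10_000               # assumed enrollment
assignments_per_course = 5      # ballpark, per the example above
courses_per_year = 5            # ballpark, per the example above
false_positive_rate = 0.01      # the flip side of a "99% success rate"

total_assignments = students * assignments_per_course * courses_per_year
expected_false_flags = total_assignments * false_positive_rate

print(f"Assignments per year: {total_assignments:,}")        # 250,000
print(f"Expected false flags: {expected_false_flags:,.0f}")  # 2,500
```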
Time wasted investigating false positives
While AI detection software merely flags possible problems, we've already seen that humans are unreliable detectors. We cannot tell which flagged assignments are among these 2,500 false positives, meaning cheaters will still slip through the cracks while precious teacher time is wasted investigating innocent students who did nothing wrong.
This is not a new problem. Ubiquitous AI has merely shone a spotlight on a long-standing weakness in how we assess.
When students can plagiarize, hire contract cheaters, rely on ChatGPT or have their friend or sister write the paper, relying on take-home assessments written outside class time without any teacher oversight is indefensible. I cannot presume that such forms of assessment represent the student's learning, because I cannot reliably discern if the student actually wrote them.
Need to change assessment
The solution to taller cheating ladders is not taller walls. The solution is to change how we are assessing, something assessment experts have been advocating for years.
Just as we don't spend thousands of dollars on "did-their-sister-write-this" detectors, schools should not rest easy simply because AI detection companies have a product to sell. If educators want to make valid inferences about what students know and can do, assessment practices are needed that emphasize the process of learning (like drafts, works-in-progress and repeated observations of student learning).
These need to be rooted in cultures that center academic integrity as a shared responsibility of students, teachers and system leaders, not just a mantra of "don't cheat and if we catch you we will punish you."
Let's spend less on flawed detection tools and more on supporting teachers to improve assessment across the board.
Provided by The Conversation
This article is republished from The Conversation under a Creative Commons license. Read the original article.