
UF study finds AI text detectors unreliable for schools

A University of Florida team testing five popular AI text detectors says false positives and false negatives swing wildly, and the tools can be broken further by prompts that add more complex vocabulary. The researchers argue the detectors are poorly suited fo

A university deadline passes, a paper gets flagged, and then, sometimes quietly and sometimes permanently, a researcher’s reputation takes damage. For institutions using AI text detectors to police submissions, a new study delivers an uncomfortable message: the tools don’t work as reliably as schools may believe.

In a paper presented at this week’s 2026 IEEE Symposium on Security and Privacy, researchers at the University of Florida conclude that commercially available AI-generated text detectors are “poorly suited for deployment in academic or high-stakes contexts.”

The work was led by Patrick Traynor, Ph.D., professor and interim chair of UF’s Department of Computer & Information Science & Engineering. His team tested the five most popular commercially available AI text detectors, and built the test so it wouldn’t rely on vague assumptions about what “AI writing” looks like.

They began with roughly 6,000 research papers submitted to top-tier security conferences before ChatGPT even arrived. Large language models were then used to create clones of those same papers. Both the original submissions and the LLM-created clones were run through the detectors.

The results varied enormously from one detector to the next. False positive rates, the share of human-written papers wrongly flagged as AI, ranged from 0.05% to 68.6%. False negative rates ran from 0.3% to 99.6%, meaning some detectors missed almost all AI-generated text. For Traynor, the problem isn’t academic trivia. “We really can’t use them to adjudicate these decisions. People’s careers are on the line here.”
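For readers who want to see how the two error rates are defined, here is a minimal Python sketch. It is not the researchers’ code: the function name `error_rates` and the sample verdicts are hypothetical, and it assumes each detector returns a simple yes/no flag per paper. The false positive rate is computed over the human-written originals, and the false negative rate over the LLM-created clones.

```python
def error_rates(human_flags, ai_flags):
    """Return (false_positive_rate, false_negative_rate).

    human_flags: detector verdicts for human-written texts (True = flagged as AI).
    ai_flags:    detector verdicts for LLM-generated texts (True = flagged as AI).
    Both names are illustrative; the study's actual data format is not described here.
    """
    if not human_flags or not ai_flags:
        raise ValueError("both lists of verdicts must be non-empty")

    # A false positive is a human-written text that the detector flags as AI.
    false_positives = sum(1 for flagged in human_flags if flagged)
    # A false negative is an AI-generated text that the detector fails to flag.
    false_negatives = sum(1 for flagged in ai_flags if not flagged)

    return false_positives / len(human_flags), false_negatives / len(ai_flags)


# Made-up verdicts for illustration only; these are not figures from the study.
human_flags = [False, False, True, False]
ai_flags = [True, False, True, True, False]

fpr, fnr = error_rates(human_flags, ai_flags)
print(f"False positive rate: {fpr:.1%}")
print(f"False negative rate: {fnr:.1%}")
```

The two rates use different denominators, which is why a detector can look excellent on one measure and fail badly on the other.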


The study also found that two of the five detectors performed well at first. But once the researchers asked the LLM to rewrite its outputs using more complex vocabulary—what the paper calls a lexical complexity attack—those detectors were “rendered largely useless.”

That fragility matters because the consequences of a detector flag aren’t limited to a lab report or a grade. An accusation of AI-generated writing in a submission can permanently damage a researcher’s reputation. Yet, as Traynor argues, institutions can’t rely on evidence drawn from measurements the tools themselves can’t make reliably. “For as many studies as we see claiming that a certain percentage of academic work is AI-generated, we actually don’t have tools to measure any of that,” he said.

Taken together, the findings point to a failure more systemic than a single bad tool. The researchers describe a breakdown in due diligence: institutions adopted AI detectors without demanding evidence that the systems are accurate enough for the high-stakes decisions they’re used to support.

In a setting where a verdict can follow someone long after the submission process ends, the study’s core warning is blunt: until institutions can measure AI involvement with dependable evidence, using these detectors as arbiters of authorship is the wrong kind of certainty for a question that demands proof.

