Hoover Institution (Stanford, CA) – For Eric Mitchell, who joined OpenAI after defending his PhD in computer science at Stanford last spring, the path to developing an effective AI-generated text detection system has been characterized by constant learning and experimentation, but also by a sense of urgency: so much, from the health of the information ecosystem to the future of education to the further development of large language models (LLMs) themselves, depends on it.

“It’s hard to get a sense of where the technology itself is as someone building it,” Mitchell, who previously interned at Google DeepMind, told the inaugural session of the Hoover Institution Seminar on Challenges and Safeguards against AI-Generated Disinformation. “It changes very quickly on a day-to-day or week-to-week basis. And when it comes to wanting to use these technologies as part of a policy toolkit to control or regulate them,” Mitchell emphasized, “it’s really slippery to get our hands around. What can we actually do reliably enough? And what can we not?”

Mitchell joined Hoover Fellow Sergey Sanovich and other experts at Hoover’s Annenberg Hall late last fall to discuss the complexities of using AI to identify AI-generated content and the hurdles of maintaining the reliability of detectors as models evolve. His presentation kicked off with DetectGPT, a “zero-shot” tool he developed at Stanford for detecting machine-generated text. “Zero-shot” means that, unlike in other approaches, the LLM undergoes no additional training before it is used to judge whether a piece of text was written by a human. Mitchell also highlighted the pressing challenges facing industries in need of detection and governments trying to make it part of their regulatory frameworks for AI, homing in on the fact that DetectGPT and similar tools are not perfect and that flaws in detection will persist.

The DetectGPT model was a big step forward compared with earlier detectors such as GPTZero, which shot into the headlines months after ChatGPT saw widespread adoption, particularly in schools. Mitchell used this opportunity to show the audience the nuances of AI detection, the risks of false yet convincing-sounding information generated by AI, and the constant arms race between detection tools and evasion tactics: obfuscation methods that make text sound “human,” which have already been used against DetectGPT and other detectors, and have even been offered as a service.

Detecting AI-Written Text

Throughout the seminar, it was apparent that the need for AI-generated text detection is more pressing than ever: recent studies show that nearly half of LLM-generated search results lack adequate citation support, and that between a quarter and a half of citations fail to substantiate their associated claims. "We’re tempted to use AI despite the risks," Mitchell pointed out, noting the challenges in distinguishing credible content from misleading or unsupported AI outputs.

On the opposite side of the spectrum is a flawless student essay that is too good to have been written by the student submitting it. In this case, it’s the high quality of the work that creates a problem. “Language models now generate text that often carries with it the tone and style we might consider a sign of a trusted or reliable source. So we can't always rely on usual cues to know if what we're reading is coming from someone that knows what they're talking about,” Mitchell explained.

Two primary detection methods were discussed. The first is training a dedicated machine learning (ML) classifier on samples of human-written and AI-generated text. This requires extensive data collection, yet the collected samples can never encompass all the domains even a single LLM can cover. As a result, such a classifier is likely to be unreliable because it overfits to the data it was trained on. “We can end up with a classifier that works really well to discriminate between AI- and human-generated texts that look like student essays. But when it comes to blog posts or social media posts, it might not work so well,” Mitchell explained.
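For illustration only, here is a toy sketch of this first, supervised approach. It is not any detector discussed at the seminar: the two labeled examples stand in for what would in practice be a large, domain-spanning corpus, and every signal such a classifier picks up is tied to the domains its training data happens to cover, which is exactly the overfitting risk Mitchell describes.

```python
# A toy sketch of the supervised-classifier approach (illustrative only).
# Assumes labeled examples: `texts` (strings) and `labels` (1 = AI-generated,
# 0 = human-written); a real detector would need far more, far broader data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "In this essay I will argue that the Industrial Revolution reshaped family life.",  # human-written
    "The Industrial Revolution was a period of profound transformation in society.",    # AI-generated
]
labels = [0, 1]

# Bag-of-words features plus a linear classifier: simple and fast, but every
# pattern it learns is specific to the domains present in `texts`.
detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit(texts, labels)

# Estimated probability that a new passage is AI-generated.
print(detector.predict_proba(["A blog post from a completely different domain."])[:, 1])
```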

The second approach, "zero-shot" detection, utilizes an LLM’s ability to detect its own output without additional training, “converting generators into detectors,” as Mitchell put it. To generalize across models and domains, DetectGPT takes advantage of distinctive patterns in the probabilities a model assigns to the text it is asked to evaluate. Specifically, DetectGPT focuses on the “curvature” of the model's log probability function, looking for the regions of negative curvature where machine-generated text often resides.

The core mechanism of DetectGPT involves generating slight variations, or “perturbations,” of the candidate passage to assess how minor rephrasings affect its log probability. DetectGPT operates on the hypothesis that machine-generated text tends to sit near a local maximum of the probability function, so that slight alterations consistently decrease the log probability, and do so more sharply than they would for human-written content.
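As a rough illustration of that idea, the sketch below scores a passage and a set of lightly rephrased variants with an off-the-shelf GPT-2 model and reports how much the rephrasings lower the average log probability. This is an assumption-laden toy, not Mitchell's implementation: producing the perturbed variants (DetectGPT uses a mask-filling model to rewrite small spans of the passage) is taken as given, and the decision threshold would still have to be calibrated on known examples.

```python
# A minimal sketch of the perturbation-discrepancy idea, not DetectGPT itself.
# GPT-2 (via Hugging Face transformers) stands in for the scoring model; the
# perturbed variants are assumed to be produced elsewhere (e.g., by a
# mask-and-refill model or a paraphraser).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def avg_log_prob(text: str) -> float:
    """Average per-token log probability of `text` under the scoring model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # out.loss is the mean negative log likelihood
    return -out.loss.item()

def perturbation_discrepancy(passage: str, perturbed: list[str]) -> float:
    """Gap between the passage's log probability and that of its perturbed variants.

    If the passage sits near a local maximum of the model's log probability,
    as hypothesized for machine-generated text, this gap tends to be large and
    positive; for human-written text it tends to stay closer to zero.
    """
    return avg_log_prob(passage) - sum(map(avg_log_prob, perturbed)) / len(perturbed)

# Usage: a score well above a threshold tuned on known examples suggests the
# passage was machine-generated.
```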

While DetectGPT was shown to be consistently discriminative across a range of texts, it is computationally intensive. Its accuracy drops with creatively rephrased content, where detection tools struggle to keep up. "Detection works best with more standardized text types, like news or Wikipedia-style articles, but when it comes to creatively rephrased content, accuracy takes a hit," Mitchell added.

Can AI Detectors Be Trusted?

The discussion then shifted to the limitations of AI detectors and the risks of false positives, where human-generated content is mistaken for machine-generated. Mitchell expressed concerns about the vulnerability of current AI detectors, particularly in the face of adversarial models crafted to evade detection. "AI detectors are already unreliable. What happens when people start training language models specifically to be difficult to detect?" he asked.

"False positives are a big problem," he continued, noting that "commercial AI detectors can be reduced to close to random-chance performance." This revelation casts doubt on the effectiveness of relying solely on AI to catch machine-generated content, especially when faced with adversarial models. Yet even in this case, AI can aid, selecting suspicious candidates for human review.

As Kirill Kalinin emphasized during the discussion, making this combination effective will become particularly important as more and more of the data used to train LLMs comes from LLMs themselves, creating a very real threat of a feedback loop that gradually erodes the quality of an originally good model, or even of complete model collapse. In addition to fooling people and algorithms (for example, search ranking), undetected, low-quality AI-generated texts thus risk fooling AI itself.

Ramifications and Future Considerations

At the end of the presentation, Mitchell opened the seminar for questions about his model and the role such methods can play in shaping AI governance and regulation. The most contentious issue in the discussion was the degree of optimism we should associate with tools like DetectGPT, given the ease with which adversarial models can degrade their effectiveness. Mitchell addressed this concern candidly, acknowledging that the AI detection race might be inherently asymmetrical: “evasion might just be easier than detection.”

Yet that doesn’t mean such tools can’t play a role, Mitchell explained: "AI tools like DetectGPT can sift through vast amounts of text efficiently," before human analysts focus their attention on the borderline cases. Importantly, he added, “if you want to detect if some content came from a particular language model, it's way easier to do that if you are able to monitor it at the point of access.” That, in turn, means that coordination between governments and the companies providing these services, in the areas identified as critical for safety and security, is essential to get the incentives right.

“That is exactly why the policy discussion must be forward-looking to keep up,” Sanovich emphasized. "The goal here is to understand what technology can do in the medium to long term, not just what’s currently available on the market. Only that will help us find the optimal point of regulation."

“Indeed,” Mitchell concurred, “The models are going to keep getting better. They are not plateauing.”
