
Investigating the Influence of Prompt-Specific Shortcuts in AI Generated Text Detection

arXiv 2024
Choonghyun Park, Hyuhng Joon Kim, Junyeob Kim, Youna Kim, Taeuk Kim, Hyunsoo Cho, Hwiyeol Jo, Sang-goo Lee, Kang Min Yoo

One-Line Summary

We reveal that AI-generated text detectors learn prompt-specific shortcut features rather than genuine AI writing patterns, and propose FAILOpt (Feedback-based Adversarial Instruction List Optimization) -- an attack that discovers deceptive instructions exploiting these shortcuts, which also doubles as a data augmentation strategy that dramatically improves detector robustness.

Figure 1. Detection failure due to prompt-specific shortcuts: When AI text is generated with different prompts than those used in the detector's training data, the detector fails because it relies on prompt-specific surface features rather than genuine AI writing characteristics.

Background & Motivation

The rapid advancement of large language models (LLMs) has raised serious concerns about misuse, including academic dishonesty and misinformation. AI-Generated Text (AIGT) detectors have emerged as a critical countermeasure, but they suffer from a fundamental yet overlooked vulnerability: standard AIGT datasets use only a small number of prompts for text generation, despite the countless prompt variations available to LLM users.

This narrow data collection introduces what the authors term "prompt-specific shortcut features" -- spurious correlations present in training data from a limited set of prompts that fail to represent genuine AI generation patterns across diverse instructions. Since LLMs exhibit high instruction-following capacity, different prompts can dramatically alter the stylistic characteristics of generated text, rendering detectors trained on single-prompt data fundamentally brittle.

Key Insight: Detectors were not actually learning to distinguish "AI-written text from human-written text." Instead, they were learning to recognize "text generated with a specific prompt" -- a much narrower and easily exploitable signal. This distinction has been largely overlooked in prior AIGT detection literature.

Proposed Method: FAILOpt

FAILOpt (Feedback-based Adversarial Instruction List Optimization) is an iterative optimization algorithm that automatically discovers deceptive instructions capable of fooling AIGT detectors. It runs two phases in each of 6 total iterations:

1. Candidate Generation
The system analyzes small batches of LLM outputs paired with human writings to identify approximately 10 distinguishing characteristics between the two styles. Each characteristic is converted into a specific instruction that guides the LLM to produce more human-like text without altering the core task requirements (e.g., "Include witty remarks and irony," "Use specific examples and technical terminology").
2. Instruction Selection & Refinement
Each candidate instruction is used to generate new text samples, which are scored against the target detector. The top-performing deceptive instructions -- those that reduce detection accuracy the most -- advance to the next iteration. Surviving instructions are also paraphrased, preserving their semantic content while exploring the instruction space more effectively.
3. Iterative Optimization
The algorithm iterates 6 times, selecting the top-2 instruction lists per iteration. Each iteration builds on the previous best instructions, progressively discovering more effective deceptive prompts. The final output is the instruction list demonstrating maximum evasion effectiveness.
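The procedure above amounts to a small beam search over instruction lists. A minimal sketch follows; the `propose`, `paraphrase`, and `detector_accuracy` callables stand in for the LLM analysis step and the target detector, and are assumptions of this sketch rather than the paper's actual implementation:

```python
def failopt(propose, paraphrase, detector_accuracy, n_iters=6, beam=2):
    """Sketch of the FAILOpt loop as beam search over instruction lists.

    propose(inst_list)      -> new candidate instructions (Phase 1, ~10 per call)
    paraphrase(instruction) -> reworded instruction (refinement step)
    detector_accuracy(lst)  -> detection accuracy on text generated with this
                               instruction list (lower = more evasive)
    """
    beams = [()]  # start from an empty instruction list
    for _ in range(n_iters):
        candidates = []
        for inst_list in beams:
            # Phase 1: distinguishing traits converted into new instructions
            for inst in propose(inst_list):
                candidates.append(inst_list + (inst,))
            # Refinement: paraphrase existing instructions to explore phrasing
            for i, inst in enumerate(inst_list):
                candidates.append(inst_list[:i] + (paraphrase(inst),) + inst_list[i + 1:])
        # Phase 2: keep the top-`beam` lists that minimize detector accuracy
        beams = sorted(candidates, key=detector_accuracy)[:beam]
    return beams[0]  # most evasive instruction list found
```

With a real detector, `detector_accuracy` would regenerate text under each candidate list and score it; here any callable returning a number works, which makes the control flow easy to dry-run.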

FAILOpt discovers instructions such as "Include witty remarks and irony" or "Use specific examples and technical terminology," which strip away the characteristic markers of AI text that detectors relied upon. Crucially, the same FAILOpt-generated texts can then be used as training-data augmentation to build more robust detectors -- turning the attack into a defense.
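Turning the attack into a defense amounts to relabeling the evasive generations as AI and mixing them back into the detector's training pool. A minimal sketch, with illustrative names not taken from the paper's code:

```python
import random

def build_augmented_trainset(human_texts, base_ai_texts, failopt_ai_texts, seed=0):
    """Mix FAILOpt-evasive generations (still labeled AI = 1) into the
    original human (0) / AI (1) training pool, then shuffle."""
    data = [(t, 0) for t in human_texts]        # human-written: label 0
    data += [(t, 1) for t in base_ai_texts]     # plain AI text: label 1
    data += [(t, 1) for t in failopt_ai_texts]  # attack texts become defense data
    random.Random(seed).shuffle(data)           # deterministic shuffle
    return data
```

The key point is only that the evasive texts keep the AI label, so the detector is forced to learn features that survive prompt variation.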

Experimental Results

Experiments were conducted on three datasets -- ELI5 (long-form QA), XSum (summarization), and SQuAD (reading comprehension) -- using GPT-3.5-turbo (versions 0301 and 0613) as generators. The target detector was a RoBERTa-base model fine-tuned on the HC3 dataset. FAILOpt was compared against four baselines: PARA (self-paraphrasing), DIPPER (T5-XXL paraphraser), SICO (adversarial in-context examples), and IP (instruction-prompt optimization).
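Both metrics reported below are easy to compute from raw detector scores. AUROC is the standard rank statistic; the ASR definition sketched here (share of attacked AI texts scored below the detector's decision threshold, i.e. labeled human) is an assumption, since papers vary in how they define attack success:

```python
def auroc(scores_ai, scores_human):
    """Rank-based AUROC: probability a random AI text receives a higher
    (more AI-like) detector score than a random human text; ties count 0.5."""
    wins = 0.0
    for a in scores_ai:
        for h in scores_human:
            wins += 1.0 if a > h else 0.5 if a == h else 0.0
    return wins / (len(scores_ai) * len(scores_human))

def attack_success_rate(scores_ai_after, threshold=0.5):
    """Assumed ASR: fraction of attacked AI texts the detector now
    scores below its threshold, i.e. misclassifies as human."""
    return sum(s < threshold for s in scores_ai_after) / len(scores_ai_after)
```

A perfectly fooled detector drives AUROC toward 0.5 (or below, if rankings invert) and ASR toward 100%.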

Figure 2. Results on the ELI5 dataset from FAILOpt's first iteration: The detector's AUROC drops dramatically as FAILOpt discovers effective deceptive instructions.

Attack Performance (ChatGPT Detector)

Model                 Dataset   AUROC (Before)   AUROC (After)   ASR
gpt-3.5-turbo-0301    ELI5      93.33%           78.17%          46.55%
gpt-3.5-turbo-0613    ELI5      98.23%           62.49%          95.72%
gpt-3.5-turbo-0613    XSum      86.16%           63.96%          55.75%
gpt-3.5-turbo-0613    SQuAD     91.84%           44.52%          90.93%

Defense via Data Augmentation

Detection AUROC under FAILOpt attack:

Setting               ELI5 (0613)   XSum (0613)   SQuAD (0613)
Original Detector     62.49%        63.96%        44.52%
Augmented Detector    100.00%       ~100%         ~100%
Figure 3. Defense effectiveness over training: Augmented training with FAILOpt data produces monotonic improvement against various attacks, while single-source training shows degradation over time.

Why It Matters

AI text detection is central to academic integrity and information trustworthiness, yet this study reveals that current detectors suffer from a fundamental vulnerability that has been largely overlooked: dependence on prompt-specific artifacts rather than genuine AI writing characteristics. The practical implications are twofold: detectors must be trained and evaluated on prompt-diverse data rather than narrow single-prompt corpora, and evasion methods like FAILOpt can be repurposed as augmentation to harden detectors against the very shortcuts they exploit.
