Investigating the Influence of Prompt-Specific Shortcuts in AI Generated Text Detection
arXiv 2024
Choonghyun Park, Hyuhng Joon Kim, Junyeob Kim, Youna Kim, Taeuk Kim, Hyunsoo Cho, Hwiyeol Jo, Sang-goo Lee, Kang Min Yoo
One-Line Summary
We reveal that AI-generated text detectors learn prompt-specific shortcut features rather than genuine AI writing patterns, and propose FAILOpt (Feedback-based Adversarial Instruction List Optimization) -- an attack that discovers deceptive instructions exploiting these shortcuts, which also doubles as a data augmentation strategy that dramatically improves detector robustness.
Figure 1. Detection failure due to prompt-specific shortcuts: When AI text is generated with different prompts than those used in the detector's training data, the detector fails because it relies on prompt-specific surface features rather than genuine AI writing characteristics.
Background & Motivation
The rapid advancement of large language models (LLMs) has raised serious concerns about misuse, including academic dishonesty and misinformation. AI-Generated Text (AIGT) detectors have emerged as a critical countermeasure, but they suffer from a fundamental yet overlooked vulnerability: standard AIGT datasets use only a small number of prompts for text generation, despite the countless prompt variations available to LLM users.
This narrow data collection introduces what the authors term "prompt-specific shortcut features" -- spurious correlations present in training data from a limited set of prompts that fail to represent genuine AI generation patterns across diverse instructions. Since LLMs exhibit high instruction-following capacity, different prompts can dramatically alter the stylistic characteristics of generated text, rendering detectors trained on single-prompt data fundamentally brittle.
Key Insight: Detectors were not actually learning to distinguish "AI-written text from human-written text." Instead, they were learning to recognize "text generated with a specific prompt" -- a much narrower and easily exploitable signal. This distinction has been largely overlooked in prior AIGT detection literature.
Proposed Method: FAILOpt
FAILOpt (Feedback-based Adversarial Instruction List Optimization) is an iterative optimization algorithm that automatically discovers deceptive instructions capable of fooling AIGT detectors. It operates through two phases per iteration over 6 total iterations:
1. Candidate Generation
The system analyzes small batches of LLM outputs paired with human writings to identify approximately 10 distinguishing characteristics between the two styles. Each characteristic is converted into a specific instruction that guides the LLM to produce more human-like text without altering the core task requirements (e.g., "Include witty remarks and irony," "Use specific examples and technical terminology").
2. Instruction Selection & Refinement
Candidate instructions generate new text samples that are evaluated against the target detector. The top-performing deceptive instructions -- those that minimize detection accuracy the most -- advance to the next iteration. Instructions also undergo paraphrasing to optimize phrasing while maintaining semantic content, exploring the instruction space more effectively.
3. Iterative Optimization
The algorithm iterates 6 times, selecting the top-2 instruction lists per iteration. Each iteration builds on the previous best instructions, progressively discovering more effective deceptive prompts. The final output is the instruction list demonstrating maximum evasion effectiveness.
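The loop described above can be sketched as follows. This is a hedged reconstruction, not the paper's implementation: `propose`, `generate`, and `detector_score` are hypothetical stand-ins for the LLM-based candidate generation (including paraphrased variants) and the target detector.

```python
def failopt(propose, generate, detector_score, n_iters=6, beam=2):
    """Sketch of the FAILOpt loop (function names are hypothetical).

    propose(inst_list)    -> new candidate instructions (~10 per call in the
                             paper, from LLM analysis plus paraphrasing)
    generate(inst_list)   -> texts generated with those instructions appended
    detector_score(texts) -> detection metric (e.g. AUROC); lower = better evasion
    """
    # Start from the empty instruction list and its baseline score.
    beams = [([], detector_score(generate([])))]
    for _ in range(n_iters):
        scored = []
        for inst_list, _ in beams:
            for inst in propose(inst_list):
                cand = inst_list + [inst]
                scored.append((cand, detector_score(generate(cand))))
        # Keep the top-`beam` instruction lists that most degrade detection.
        scored.sort(key=lambda pair: pair[1])
        beams = scored[:beam] or beams  # keep previous beams if nothing new
    return beams[0]  # (best instruction list, its detector score)
```

With a toy `detector_score` that drops as instructions accumulate, the loop greedily assembles the full instruction list, mirroring how each iteration builds on the previous best instructions.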
Examples of discovered effective instructions:
"Include witty remarks and irony"
"Provide structured and systematic answers"
"Use specific examples and technical terminology"
These instructions remove the characteristic markers of AI text that detectors relied upon. Crucially, the same FAILOpt-generated texts can then be used as training data augmentation to build more robust detectors -- turning the attack into a defense.
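The defense side is simple in spirit: texts generated under FAILOpt's deceptive instructions are added to the detector's training set as extra AI-labeled examples. A minimal sketch, with an assumed label convention (0 = human, 1 = AI):

```python
def augment_training_set(human_texts, ai_texts, failopt_texts):
    """Mix FAILOpt-evasive generations into the original training data,
    labeled as AI (1) alongside the standard AI examples."""
    data = [(t, 0) for t in human_texts]
    data += [(t, 1) for t in ai_texts]
    data += [(t, 1) for t in failopt_texts]  # the attack's outputs become defense data
    return data
```

Fine-tuning the detector on this augmented set is what the paper's defense experiments evaluate.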
Experimental Results
Experiments were conducted on three datasets -- ELI5 (long-form QA), XSum (summarization), and SQuAD (reading comprehension) -- using GPT-3.5-turbo (versions 0301 and 0613) as generators. The target detector was a RoBERTa-base model fine-tuned on the HC3 dataset. FAILOpt was compared against four baselines: PARA (self-paraphrasing), DIPPER (T5-XXL paraphraser), SICO (adversarial in-context examples), and IP (instruction-prompt optimization).
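For reference, the two reported metrics can be computed as below. AUROC is the standard rank statistic; the ASR formula here is an assumption matching the paper's usage (the share of previously detected AI texts that evade detection after the attack), and the 0.5 threshold is illustrative.

```python
def auroc(human_scores, ai_scores):
    """Probability that a random AI text receives a higher detector score
    than a random human text (ties count as half)."""
    wins = sum(1.0 if a > h else 0.5 if a == h else 0.0
               for a in ai_scores for h in human_scores)
    return wins / (len(ai_scores) * len(human_scores))

def attack_success_rate(scores_before, scores_after, threshold=0.5):
    """Fraction of AI texts detected before the attack (score >= threshold)
    whose post-attack score falls below the threshold."""
    pairs = [(b, a) for b, a in zip(scores_before, scores_after) if b >= threshold]
    if not pairs:
        return 0.0
    return sum(1 for _, a in pairs if a < threshold) / len(pairs)
```

An AUROC of 0.5 is chance level, which is why sub-50% values after the attack mean the detector's predictions are anti-correlated with the truth.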
Figure 2. Results on the ELI5 dataset from FAILOpt's first iteration: The detector's AUROC drops dramatically as FAILOpt discovers effective deceptive instructions.
Attack Performance (ChatGPT Detector)
| Model | Dataset | AUROC (Before) | AUROC (After) | ASR |
| --- | --- | --- | --- | --- |
| gpt-3.5-turbo-0301 | ELI5 | 93.33% | 78.17% | 46.55% |
| gpt-3.5-turbo-0613 | ELI5 | 98.23% | 62.49% | 95.72% |
| gpt-3.5-turbo-0613 | XSum | 86.16% | 63.96% | 55.75% |
| gpt-3.5-turbo-0613 | SQuAD | 91.84% | 44.52% | 90.93% |
Defense via Data Augmentation
| Setting (AUROC under FAILOpt) | ELI5 (0613) | XSum (0613) | SQuAD (0613) |
| --- | --- | --- | --- |
| Original Detector | 62.49% | 63.96% | 44.52% |
| Augmented Detector | 100.00% | ~100% | ~100% |
Figure 3. Defense effectiveness over training: Augmented training with FAILOpt data produces monotonic improvement against various attacks, while single-source training shows degradation over time.
Devastating Attack on Supervised Detectors: On SQuAD with gpt-3.5-turbo-0613, FAILOpt dropped AUROC below chance level (44.52%), meaning the detector would be better off flipping its predictions. On ELI5, 95.72% of previously detected AI texts escaped detection after the attack.
Metric-based Detectors Are More Robust: Perplexity-based detectors and DetectGPT degraded less, and less consistently, under FAILOpt, confirming that the attack specifically exploits prompt-specific shortcuts in learning-based detectors rather than universal text features.
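Metric-based detectors of this kind score text with statistics from a scoring language model rather than learned classifier features, which is why prompt-specific shortcuts affect them less. A minimal perplexity-thresholding sketch (the threshold value is illustrative, not from the paper):

```python
import math

def perplexity(token_logprobs):
    """exp of the negative mean token log-likelihood under a scoring LM."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def flag_as_ai(token_logprobs, threshold=20.0):
    # AI-generated text tends to be more probable under an LM (lower
    # perplexity) than human text, so low perplexity is the AI signal.
    return perplexity(token_logprobs) < threshold
```

DetectGPT refines this idea by comparing the text's likelihood against likelihoods of perturbed rewrites instead of using a fixed threshold.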
Near-Perfect Defense: Augmenting the detector's training data with FAILOpt-generated texts restored AUROC to nearly 100% across all datasets and both GPT-3.5 versions, demonstrating the dual-use nature of the attack.
Cross-Attack Generalization: The augmented detector showed improved robustness not only against FAILOpt but also against other attack methods (PARA, DIPPER, SICO), indicating that diverse prompt exposure helps detectors learn more generalizable features.
Why It Matters
AI text detection is central to academic integrity and information trustworthiness, yet this study reveals that current detectors suffer from a fundamental vulnerability that has been largely overlooked: dependence on prompt-specific artifacts rather than genuine AI writing characteristics. The practical implications are twofold:
For Detection Research: The findings establish that reliable AIGT detection requires training datasets with comprehensive prompt diversity, not merely varied content inputs. This reframes how the community should approach dataset construction.
Dual-Purpose Solution: FAILOpt demonstrates the security principle of "know your enemy to defend yourself" -- the same adversarial attack that exposes detector weaknesses also produces the training data needed to fix them, achieving near-perfect robustness through augmentation.
Broader Impact: As LLM-generated content becomes ubiquitous, understanding and addressing shortcut learning in detectors is essential for maintaining trust in written communication across education, journalism, and public discourse.