Most detectors return a score without explaining how they got there. This guide covers the real mechanics — perplexity, burstiness, model fingerprints — and why the AI family your text came from determines whether that score means anything.
AI detection is statistical pattern recognition trained on known examples of AI and human writing. Here is what happens under the hood, step by step.
When you paste text into an AI detector, it is first broken into tokens, small units of text roughly equivalent to word fragments. The detector then analyzes the statistical distribution of those tokens: how predictable each word choice is given the words before it, how consistent sentence length and structure are across the document, and how much variation exists from paragraph to paragraph.
Perplexity measures how surprising each word choice is given the tokens before it. AI models generate text by selecting statistically likely next tokens, so their outputs tend toward lower perplexity than human writing. Humans are messier: we choose unusual words, break conventions, and make stylistic decisions a probability model would not. A critical complication is that different AI model families have different perplexity profiles. Claude 4's distribution differs from GPT-5's, and Gemini 2.5 differs from both. A classifier calibrated on GPT perplexity ranges may misinterpret Claude's range entirely, producing false negatives specifically on Claude content.
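As a rough illustration of the perplexity signal, the sketch below computes perplexity as the exponential of the average negative log-probability per token. The probability lists are invented for illustration; a real detector would obtain them from a scoring language model, and nothing here is meant to represent any specific tool's implementation.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability per token)."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# Hypothetical per-token probabilities from some scoring model (made-up values).
predictable_text = [0.5, 0.5, 0.5, 0.5]    # every token was a likely choice
surprising_text = [0.5, 0.05, 0.5, 0.02]   # a few unexpected word choices

print(f"predictable: {perplexity(predictable_text):.2f}")
print(f"surprising:  {perplexity(surprising_text):.2f}")
```

The second list scores higher because two low-probability tokens pull the average log-probability down, which is the pattern the text associates with human writing.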
Burstiness measures variation in sentence length and structural complexity across a document. Human writing is bursty: short punchy sentences alternate with longer elaborations, and paragraph rhythm shifts throughout. AI outputs tend toward uniformity, with consistent sentence lengths, parallel constructions, and predictable transitions. Both metrics are analyzed together because a text can fool one while failing the other. Claude 4 shows higher apparent burstiness than GPT-5, which is why GPT-trained classifiers routinely misclassify Claude outputs as human-written.
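One simple way to put a number on burstiness is the coefficient of variation of sentence length (standard deviation divided by mean). This is only one possible proxy, chosen here for illustration; the sentence splitter is deliberately naive, and the two sample texts are invented.

```python
import re
import statistics


def sentence_lengths(text):
    """Split on ., !, ? and return the word count of each sentence."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def burstiness(text):
    """Coefficient of variation of sentence length (std dev / mean)."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0  # not enough sentences to measure variation
    return statistics.pstdev(lengths) / statistics.mean(lengths)


uniform = "The model writes text. The text looks very even. The rhythm stays quite flat."
varied = ("Stop. Then the writer wanders through a much longer thought "
          "before landing somewhere unexpected. Odd.")

print(f"uniform: {burstiness(uniform):.2f}")
print(f"varied:  {burstiness(varied):.2f}")
```

The uniform sample has sentence lengths of 4, 5, and 5 words, while the varied sample has 1, 13, and 1, so the second value comes out much larger.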
The detector's classifier, typically a fine-tuned language model or trained neural network, compares the statistical profile of your text against patterns it learned from training data. If your text profile closely resembles the AI outputs it was trained on, the score rises. This is the critical step: the classifier can only reliably recognize model families it has seen. A classifier trained exclusively on GPT data has no inherent ability to recognize Claude 4 or Gemini 2.5 patterns. It generalizes, often poorly, to unfamiliar model families, producing false negative rates of 21 to 32% on Claude and Gemini content across competing tools.
The output of any AI detector is a probability, not a verdict. A 94% AI score means the text's statistical profile is highly consistent with AI outputs in the training distribution, not that the text is definitively AI-generated. Most tools return a single document-level score by averaging signals across the entire text. GPTOne also operates at the sentence and paragraph level, highlighting the specific passages driving the overall score. This matters most for mixed documents where a human writer used AI to draft specific sections, because document-level averaging often hides the AI-generated paragraphs entirely.
Each AI model family produces text with distinct statistical and stylistic patterns, the result of different training data, different RLHF processes, and different output formatting defaults. Detection accuracy depends entirely on whether a classifier has seen those specific patterns during training.
Distribution shift is the core problem in multi-model AI detection. A classifier trained on GPT-3.5 data learns to recognize GPT-3.5 patterns. When it encounters Claude 4 or Gemini 2.5 text, it is operating outside its training distribution — and its predictions become unreliable in proportion to how different the new model's patterns are from what it learned.
GPT-family outputs are the best-understood by detectors because most training data historically came from these models. GPT-3.5 in particular produced highly recognizable patterns. GPT-5 and o3 outputs are more naturalistic but still carry statistical tells that well-trained classifiers identify reliably.
Anthropic's constitutional AI training produces outputs with noticeably different statistical patterns from GPT. Claude hedges more, varies its structure more deliberately, and produces paragraph rhythms that fall closer to high-quality human writing. GPT-only classifiers routinely misclassify Claude 4 text as human-written.
Google's Gemini models lean toward structured, list-adjacent formats in explanatory contexts. Gemini 2.5 Pro is the hardest version to detect — its prose outputs are notably more naturalistic than earlier versions, with fewer formatting tells. Version currency in training data matters significantly for Gemini detection.
Open-source model families are the fastest-growing category in active academic and enterprise use — and the least covered by existing detection tools. Each family carries distinct patterns: DeepSeek's reasoning-focused outputs differ from Grok's conversational style, and Llama 4 differs from Mistral's dense prose. Generic classifiers miss all of them at high rates.
GPTOne publishes separate accuracy benchmarks for each model family — the only free detector to do so. Figures reflect internal testing across 1,400 samples spanning academic, business, creative, and technical writing across all major model families and versions.
Other tools report a single blended accuracy figure that obscures performance on individual model families, which makes their coverage claims unverifiable. In comparative testing, GPTZero showed a 24% false negative rate on Claude content and ZeroGPT showed a 32% false negative rate on Gemini content.
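To see why a blended figure can hide weak coverage, here is a minimal sketch that computes false negative rates per model family alongside the blended rate. The evaluation set is invented for illustration (it is not benchmark data from any tool); it is sized so that the hypothetical Claude miss rate matches the 24% figure mentioned above.

```python
from collections import defaultdict


def false_negative_rates(results):
    """results: iterable of (model_family, flagged_as_ai) for known-AI samples."""
    totals = defaultdict(int)
    misses = defaultdict(int)
    for family, flagged_as_ai in results:
        totals[family] += 1
        if not flagged_as_ai:
            misses[family] += 1  # AI text the detector passed as human
    return {family: misses[family] / totals[family] for family in totals}


# Made-up evaluation set: every sample is AI-generated; True means detected.
samples = (
    [("gpt", True)] * 48 + [("gpt", False)] * 2
    + [("claude", True)] * 38 + [("claude", False)] * 12
)

rates = false_negative_rates(samples)
blended = sum(1 for _, hit in samples if not hit) / len(samples)

print(rates)     # per-family miss rates
print(blended)   # single blended miss rate
```

In this toy set the blended miss rate is 14%, which looks acceptable, while the per-family view shows 4% on GPT samples and 24% on Claude samples.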
How the major AI detectors compare on model coverage, transparency, and cost.
| Feature | GPTOne | GPTZero | ZeroGPT | Copyleaks |
|---|---|---|---|---|
| Claude 4 detection | ✓ 99.99% | Partial ~76% | Partial ~71% | Partial ~78% |
| Gemini 2.5 detection | ✓ 99.99% | Partial ~74% | Partial ~68% | Partial ~75% |
| DeepSeek and Grok detection | ✓ | ✗ | ✗ | ✗ |
| Open-source models (Llama, Mistral) | ✓ | ✗ | ✗ | Partial |
| Section-level highlighting | ✓ | ✓ | ✗ | ✓ |
| Per-model accuracy benchmarks | ✓ | ✗ | ✗ | ✗ |
| Free with no account | ✓ Always free | Limited | Limited | ✗ Paid only |
Understanding where detection fails is as important as understanding where it succeeds. These limitations apply to every tool on the market, including GPTOne.
Synonym substitution, sentence reordering, and light restructuring all disrupt the statistical fingerprints classifiers rely on. In GPTOne's testing, lightly edited AI outputs saw accuracy drop by 9 to 11 percentage points. Heavier rewriting reduces accuracy further.
Texts under 150 words provide insufficient statistical data for reliable classification. Cover letters, short responses, and brief messages are harder to classify accurately across all detectors. Longer texts of 300 words or more produce substantially more reliable results.
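A pre-check on word count can make this limitation explicit before a score is interpreted. The sketch below uses the 150-word and 300-word figures from the paragraph above; the function name and the three labels are hypothetical, and whitespace splitting is a crude word counter.

```python
def length_reliability(text, minimum=150, preferred=300):
    """Label a text by whether it is long enough for reliable classification."""
    words = len(text.split())
    if words < minimum:
        return "insufficient"   # too little statistical signal
    if words < preferred:
        return "reduced"        # usable, but treat the score with extra caution
    return "adequate"


cover_letter = "word " * 120    # stand-in for a short 120-word text
essay = "word " * 450           # stand-in for a 450-word text

print(length_reliability(cover_letter))
print(length_reliability(essay))
```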
Formal writing in a second language can resemble AI output statistically: consistent structure, careful word choice, low perplexity. That resemblance produces false positives on human-written text. GPTOne holds its false positive rate below 5% on non-native speaker samples, which is lower than competitors but not zero. This limitation is systemic across the entire field.
A document that is 40% AI-generated and 60% human may score below the flagging threshold on document-level averaging tools. GPTOne's section-level highlighting partially addresses this, but mixed document detection remains the least reliable category for every tool available today.
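The arithmetic behind this is easy to check. The sketch below assumes a 1,000-word document split into five equal paragraphs, with invented per-paragraph scores and an assumed flagging threshold of 0.50; none of these values come from a real detector.

```python
def document_score(paragraph_scores, word_counts):
    """Word-weighted average of per-paragraph AI probabilities."""
    total_words = sum(word_counts)
    weighted = sum(s * w for s, w in zip(paragraph_scores, word_counts))
    return weighted / total_words


def flagged_paragraphs(paragraph_scores, threshold):
    """Indices of paragraphs whose own score meets the threshold."""
    return [i for i, s in enumerate(paragraph_scores) if s >= threshold]


# Hypothetical document: 400 AI-drafted words, then 600 human-written words.
scores = [0.95, 0.95, 0.10, 0.10, 0.10]   # illustrative per-paragraph scores
words = [200, 200, 200, 200, 200]
THRESHOLD = 0.50                          # assumed flagging threshold

overall = document_score(scores, words)
print(f"document-level score: {overall:.2f}")
print(f"flagged at document level: {overall >= THRESHOLD}")
print(f"paragraphs flagged individually: {flagged_paragraphs(scores, THRESHOLD)}")
```

The weighted average is (0.95 × 400 + 0.10 × 600) / 1000 = 0.44, below the assumed threshold, even though the first two paragraphs each score 0.95 on their own.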
Claude Opus 4.7 writes differently from Claude 3.5 Sonnet. Gemini 2.5 Pro writes differently from Gemini 1.5. A detector trained on older model outputs and not updated will underperform on the versions your students and candidates are actually using today. Training data currency is a continuous maintenance requirement.
No current AI detector is appropriate as a standalone basis for disciplinary action, employment decisions, or legal proceedings. Detection scores should open a review process, not close one. Process evidence, human judgment, and direct conversation remain essential alongside any detection result.
The safest approach treats detection as a first signal that opens a review process — not a gate that renders a verdict automatically.
Define what AI assistance is and is not permitted before scanning anything.
Run submissions through GPTOne for multi-model coverage across GPT, Claude, Gemini and more in one scan.
Treat scores above your threshold as a prompt for deeper human review, not as a conclusion.
Ask for drafts, notes, version history, or a live demonstration of subject knowledge.
Record the score, the flagged sections, the evidence requested, and the final decision reached.
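The record-keeping step can be as simple as a small structured record per submission. The sketch below is one possible shape; the field names, the 0.80 review threshold, and the sample values are all hypothetical and should be adapted to your own policy.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ReviewRecord:
    """One audit entry per scanned submission (field names are illustrative)."""
    submission_id: str
    score: float
    flagged_sections: list
    evidence_requested: list = field(default_factory=list)
    decision: str = "pending"
    recorded_on: date = field(default_factory=date.today)


def needs_human_review(score, threshold=0.80):
    """A score at or above the threshold opens a review; it decides nothing."""
    return score >= threshold


record = ReviewRecord("essay-017", 0.91, ["paragraph 2", "paragraph 5"])
if needs_human_review(record.score):
    record.evidence_requested += ["drafts", "version history"]

print(record)
```

Keeping the decision field at "pending" until a person completes the review mirrors the point above: the score triggers the process, and the human reviewer closes it.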
Model-specific detection guides
Each guide covers model-specific fingerprints, detection rates, and accuracy benchmarks for a single model family cluster.
All GPT models through GPT-5, all Claude 4 models, and Gemini 2.5: accuracy benchmarks and model-by-model breakdowns.
DeepSeek V3, R1, R2 and Grok 1, 2, 3: the fastest-growing model families that most detectors were not built for.
Llama 4, Mistral Large 2, Qwen 3 and Microsoft Phi 4: the open-source families now embedded in academic and enterprise workflows.
Generate a short text in GPT-5, Claude 4, and Gemini 2.5 and paste all three into GPTOne. Multi-model detection in one scan, free, with no account required.