Methodology disclaimer: All benchmark figures in this case study reflect GPTOne's internal testing conducted across 1,400 anonymized, randomized text samples. Competitor figures for tools that do not publish model-specific benchmarks are derived from GPTOne's comparative testing and independently reported user studies. No tool was given any advantage in prompt design, text selection, or scoring conditions. AI detection is probabilistic; results should never be used as the sole basis for disciplinary, employment, or legal decisions.
Why this benchmark exists
Four AI detectors dominate the market: GPTOne, GPTZero, Copyleaks, and ZeroGPT. All four claim to detect AI-generated content. Far fewer are specific about which AI models they were actually trained and tested on.
That gap matters. Claude, ChatGPT, and Gemini are now among the AI writing tools most widely used by university students. Gemini Pro is embedded in Google Workspace products used by millions of professionals daily. A detector that performs well on ChatGPT but has never been calibrated against the Claude and Gemini model families is not a multi-model detector; it is a GPT detector with a broader marketing claim.
This benchmark answers a single, direct question: when you submit real Claude and Gemini outputs to these four tools, which ones actually detect them, and which ones quietly let them pass as human?
What the benchmark tested and how
Benchmark design overview
| Parameter | Detail |
|---|---|
| Total samples | 1,400 texts |
| Claude samples | 400 (Claude 3 Sonnet + Claude 3.5 Sonnet) |
| Gemini samples | 400 (Gemini 1.0 + Gemini 1.5 Pro) |
| Human-written samples | 400 |
| Mixed human + AI samples | 200 |
| Domains | Academic essays, business blog posts, professional emails, technical explainers |
| Average text length | 350 words per sample |
| Prompt design | 80 neutral topic prompts; no prompts designed to favor or disadvantage any detector |
| Human sample composition | 78% native English speakers, 22% non-native English speakers |
| Anonymization | All texts stripped of metadata and randomized before submission |
| Scoring threshold | Scores above 50% AI probability = AI-detected; 50% and below = classified as human |
| Tools tested | GPTOne, GPTZero, Copyleaks, ZeroGPT |
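For readers who want the scoring rule in concrete terms, here is a minimal Python sketch of the 50% threshold and how false positive and false negative rates fall out of it. The function names are ours for illustration, not any vendor's actual API.

```python
# Minimal sketch of the benchmark's scoring rule (illustrative only;
# these function names are not any detector's actual API).

def classify(ai_probability: float) -> str:
    """Scores above 50% count as AI-detected; 50% and below as human."""
    return "ai" if ai_probability > 50.0 else "human"

def error_rates(samples: list[tuple[float, str]]) -> tuple[float, float]:
    """samples: (detector score in %, ground truth 'ai' or 'human') pairs.
    Returns (false positive rate, false negative rate)."""
    humans = [s for s in samples if s[1] == "human"]
    ais = [s for s in samples if s[1] == "ai"]
    fp = sum(1 for score, _ in humans if classify(score) == "ai")
    fn = sum(1 for score, _ in ais if classify(score) == "human")
    return fp / len(humans), fn / len(ais)
```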
How AI samples were generated
Each of the 80 prompts was submitted to both model families, Claude (3 and 3.5 Sonnet) and Gemini (1.0 and 1.5 Pro), using default settings with no system prompt modifications. Prompts were selected to produce naturally varied outputs (argumentative essays, explanatory posts, instructional emails, and technical overviews) across four topic domains: environmental science, business strategy, history and social studies, and software development.
Outputs were not cherry-picked for quality or style. The full set of outputs per prompt was retained, providing variation in length, structure, and stylistic consistency across the sample pool.
How human samples were sourced
Human-written texts were drawn from three publicly available sources: a student essay repository with verified human authorship, a professional blogging platform where contributors had confirmed no AI assistance, and a set of anonymized business emails contributed voluntarily by professionals. The 22% non-native speaker inclusion was deliberate: this subgroup is disproportionately affected by false positive errors across all detection tools, and its inclusion stress-tests false positive rate claims.
How mixed samples were constructed
The 200 mixed documents were built by taking a human-written passage (200 to 250 words) and inserting one to two paragraphs of Claude or Gemini output (100 to 150 words each) into the middle or end of the document. This simulates the most common real-world AI use pattern: writers who generate specific sections with AI rather than producing entire documents.
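As a rough sketch of that construction (the study's exact splicing code is not published, so this is an assumption about the procedure, not a reproduction of it):

```python
import random

def build_mixed_document(human_text: str, ai_paragraphs: list[str]) -> str:
    """Splice one or two AI-written paragraphs into the middle or end
    of a human-written passage, mirroring the pattern described above."""
    paragraphs = human_text.split("\n\n")
    insert_at = random.choice([len(paragraphs) // 2, len(paragraphs)])  # middle or end
    spliced = (paragraphs[:insert_at]
               + ai_paragraphs[:random.randint(1, 2)]
               + paragraphs[insert_at:])
    return "\n\n".join(spliced)
```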
Claude 3 vs Claude 3.5: does model version matter?
Before presenting the full comparative results, one finding from within the Claude sample set deserves attention: model version matters for detection reliability, and the gap is larger than most vendors acknowledge.
Claude 3 Sonnet vs Claude 3.5 Sonnet detection rates (GPTOne)
| Model version | Detection accuracy | False negative rate |
|---|---|---|
| Claude 3 Sonnet | 95% | 5.0% |
| Claude 3.5 Sonnet | 91% | 9.0% |
Claude 3.5 Sonnet is harder to detect than Claude 3 Sonnet. Its outputs are more stylistically varied, its hedging patterns are more subtle, and its paragraph structures more closely resemble high-quality human writing. A detector trained only on Claude 3 data without updates for Claude 3.5 will underperform on the model that students and writers are actually using most in 2025.
GPTOne's training data includes both versions. Even so, the 4-point accuracy gap between them illustrates a general principle: model families are moving targets, and detection reliability degrades when training data is not kept current.
For the full comparative results below, Claude figures reflect the combined 400-sample pool (Claude 3 + Claude 3.5) to provide a representative benchmark across the model family as users encounter it in practice.
Full comparative results: Claude detection
How all four tools performed on 400 Claude samples
| Detector | Detection accuracy | False positive rate | False negative rate | Publishes Claude benchmark |
|---|---|---|---|---|
| GPTOne | 93% | 4.2% | 7.0% | Yes |
| GPTZero | 76% | 6.8% | 24.0% | No |
| Copyleaks | 79% | 5.1% | 21.0% | No |
| ZeroGPT | 71% | 8.4% | 29.0% | No |
What this means in practice: For every 100 Claude-generated submissions run through GPTZero, approximately 24 will receive a human classification. For ZeroGPT, that number rises to 29. In an institution processing 500 Claude-assisted submissions per semester, that translates to 120 to 145 submissions passing through without a flag, per tool.
GPTOne's 7% false negative rate on Claude means 7 in 100 Claude submissions receive a human classification. That is materially lower, and it is the result of explicit Claude training rather than GPT generalization applied to an unfamiliar model family.
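The institutional arithmetic behind those estimates is simple enough to check in a few lines; the 500-submission volume is the scenario from the paragraph above, not a measured number.

```python
# Expected undetected Claude submissions per semester, using the
# false negative rates from the table above.
submissions = 500
fn_rates = {"GPTOne": 0.07, "GPTZero": 0.24, "Copyleaks": 0.21, "ZeroGPT": 0.29}
for tool, rate in fn_rates.items():
    print(f"{tool}: ~{submissions * rate:.0f} missed")
# GPTOne: ~35, GPTZero: ~120, Copyleaks: ~105, ZeroGPT: ~145
```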
Paste a Claude sample directly into GPTOne at gptone.me; no account required.
Breaking down the Claude false positive rates
The false positive rates in this table warrant as much attention as the detection accuracy. ZeroGPT's 8.4% false positive rate means that for every 100 human-written essays submitted, more than 8 will be incorrectly flagged as AI-generated.
For the non-native English speaker subgroup within the human sample pool, false positive rates were higher across all tools:
| Detector | False positive rate (all human) | False positive rate (non-native speakers) |
|---|---|---|
| GPTOne | 4.2% | 6.1% |
| GPTZero | 6.8% | 10.3% |
| Copyleaks | 5.1% | 8.2% |
| ZeroGPT | 8.4% | 13.7% |
Non-native speaker writing consistently receives higher AI probability scores across all tools. This is a known limitation of perplexity-based and pattern-matching approaches: formal, structured writing in a second language can resemble AI output in ways that detectors are not always calibrated to distinguish.
GPTOne's non-native speaker false positive rate of 6.1% is lower than the other tools in this benchmark, but it is still above the 4.2% overall rate. No tool has solved this problem.
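To make the mechanism concrete: perplexity measures how predictable a text is under a language model, and detectors treat low perplexity as an AI signal. The toy unigram model below is a deliberate oversimplification (real detectors use far stronger models), but it shows why formal, conventional phrasing scores as more "predictable" than idiosyncratic prose.

```python
import math
from collections import Counter

def unigram_perplexity(text: str, reference_corpus: str) -> float:
    """Perplexity of `text` under an add-one-smoothed unigram model
    built from `reference_corpus`. Lower = more predictable, which
    perplexity-based detectors read as more AI-like."""
    counts = Counter(reference_corpus.lower().split())
    total, vocab = sum(counts.values()), len(counts)
    words = text.lower().split()
    log_prob = sum(math.log((counts[w] + 1) / (total + vocab)) for w in words)
    return math.exp(-log_prob / len(words))
```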
Full comparative results: Gemini detection
How all four tools performed on 400 Gemini samples
| Detector | Detection accuracy | False positive rate | False negative rate | Publishes Gemini benchmark |
|---|---|---|---|---|
| GPTOne | 89% | 4.7% | 11.0% | Yes |
| GPTZero | 72% | 7.1% | 28.0% | No |
| Copyleaks | 74% | 5.9% | 26.0% | No |
| ZeroGPT | 68% | 9.2% | 32.0% | No |
Gemini content is harder to detect consistently across all tools, including GPTOne. The 89% accuracy figure for GPTOne on Gemini is 4 points lower than its Claude result and 7 points lower than its GPT-family result. This gap reflects Gemini's structured, list-oriented output style, which creates genuine ambiguity for classifiers.
ZeroGPT's 32% false negative rate on Gemini means nearly one in three Gemini-generated texts receives a human classification. In a content moderation or academic integrity context, that is a systematic blind spot, not a statistical edge case.
Gemini 1.0 vs Gemini 1.5 Pro detection rates (GPTOne)
| Model version | Detection accuracy | False negative rate |
|---|---|---|
| Gemini 1.0 | 92% | 8.0% |
| Gemini 1.5 Pro | 86% | 14.0% |
The version gap is larger for Gemini than for Claude. Gemini 1.5 Pro produces noticeably more naturalistic text than Gemini 1.0. Its structured patterns are less pronounced in longer prose formats, and it handles tone variation more smoothly. This makes it harder to distinguish from human writing, even for a detector trained on Gemini data.
Test a Gemini 1.5 Pro sample in GPTOne and compare the score to your current tool; free at gptone.me.
Mixed document results: the hardest real-world case
Mixed documents (human writing with AI-generated sections inserted) represent the most common real-world AI use pattern and the most challenging detection scenario for every tool.
How all four tools handled 200 mixed documents
| Detector | Correctly flagged | Partially flagged (mixed signal) | Completely missed |
|---|---|---|---|
| GPTOne | 81% | 11% | 8% |
| GPTZero | 58% | 14% | 28% |
| Copyleaks | 63% | 12% | 25% |
| ZeroGPT | 54% | 13% | 33% |
The "partially flagged" category is worth examining. These are documents where the tool returned a score between 30% and 70% AI probability ChatGPT, a genuinely ambiguous zone where no confident classification is appropriate. GPTOne's section-level highlighting provides the most useful signal in this category: rather than averaging the document into a single ambiguous score, it surfaces which specific paragraphs are driving the AI probability upward.
A document that is 65% human and 35% Gemini-generated may produce an overall score of 45% AI probability on a whole-document averaging tool, falling below the 50% flag threshold and passing as human. GPTOne's paragraph-level output will still highlight the Gemini-written sections, giving a reviewer something specific to investigate.
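A toy calculation shows the dilution effect. The per-paragraph scores below are invented for illustration; the point is the arithmetic, not the specific numbers.

```python
# Whole-document averaging vs. paragraph-level flagging on a mixed text.
scores = [12, 8, 15, 88, 92, 10]   # hypothetical per-paragraph AI probabilities (%)
words  = [60, 55, 70, 65, 70, 80]  # paragraph lengths in words

doc_score = sum(s * w for s, w in zip(scores, words)) / sum(words)
print(f"document average: {doc_score:.0f}%")  # ~38%: below threshold, passes as human
print("paragraphs flagged:", [i for i, s in enumerate(scores) if s > 50])  # [3, 4]
```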
Key findings: what this benchmark reveals
Finding 1: GPT-focused training creates a systematic blind spot on Claude and Gemini. The three tools without confirmed Claude and Gemini training data (GPTZero, Copyleaks, and ZeroGPT) all showed false negative rates between 21% and 29% on Claude content and between 26% and 32% on Gemini content. This is not a random error. It is a predictable consequence of distribution shift: classifiers encountering text patterns they were not trained to recognize.
Finding 2: Model version matters, and the gap widens with newer releases. Claude 3.5 Sonnet and Gemini 1.5 Pro are harder to detect than their predecessors, by 4 and 6 percentage points respectively on GPTOne's own data. Detectors not updated to reflect newer model outputs will show even larger gaps. Training data currency is as important as training data coverage.
Finding 3: False positive rates on non-native speaker writing are a systemic problem. Every tool in this benchmark showed elevated false positive rates on non-native English speaker writing, with ZeroGPT reaching 13.7%. Any institution using AI detection as a disciplinary trigger without accounting for this pattern risks systematically disadvantaging international students and multilingual professionals.
Finding 4: No tool is reliable on mixed documents without additional review. Even GPTOne's 81% correct flagging rate on mixed documents means 19% of AI-assisted documents pass through without a confident flag. For the three other tools, between 25% and 33% of mixed documents received a complete miss. Mixed document detection requires section-level analysis and human review; a tool that averages the full document score will consistently underperform.
Finding 5: Publishing model-specific benchmarks is a signal of accountability. GPTOne is the only tool in this comparison that publishes separate accuracy figures for Claude and Gemini. The other three tools report blended overall accuracy, which buries model-level performance gaps. This makes it impossible for institutional buyers to verify coverage claims before adopting a tool for high-stakes decisions.
What this means for different audiences
For academic integrity teams
The data makes a clear operational recommendation: if your student population has access to Claude or Gemini (and in 2025 it does), a GPT-only detector leaves a 21 to 32% detection gap on those model families. GPTOne covers all three families with published benchmarks. Use it as a first-pass flag, require process evidence (drafts, notes, live demonstrations) before any formal finding, and never act on a detection score alone.
For hiring and HR teams
Gemini is embedded in Google Workspace. Cover letters and work samples drafted with Gemini 1.5 Pro will pass undetected through GPT-focused tools at a rate of 26 to 32%. GPTOne's 89% detection accuracy on Gemini, combined with a sub-5% false positive rate, makes it the most defensible free tool for initial screening. Supplement with skills-based assessments and live work samples before any hiring decision.
For publishers and content operations
Running multi-model contributor content through GPTOne gives editorial teams a more complete picture than GPT-only scanning. The section-level highlighting is particularly valuable for identifying AI-drafted paragraphs in otherwise human-written pieces, the mixed-document pattern that dominates real-world AI-assisted writing.
For tool evaluators and procurement teams
Demand model-specific benchmarks from any vendor you are evaluating. A single blended accuracy score does not tell you how a tool performs on Claude and Gemini specifically. If a vendor cannot provide separate precision, recall, and false positive rates for each model family they claim to support, treat that coverage claim as unverified.
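Concretely, the numbers to ask for per model family are precision, recall, and false positive rate from a labeled confusion matrix. The counts below are hypothetical, back-calculated to roughly match GPTOne's published Claude figures, purely to show the shape of the request.

```python
def family_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Per-model-family metrics from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),            # of texts flagged AI, share truly AI
        "recall": tp / (tp + fn),               # of AI texts, share flagged
        "false_positive_rate": fp / (fp + tn),  # of human texts, share flagged
    }

# Hypothetical Claude-family counts (400 AI + 400 human samples):
print(family_metrics(tp=372, fp=17, fn=28, tn=383))
# precision ≈ 0.956, recall = 0.93, false_positive_rate ≈ 0.043
```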
Replicate this benchmark yourself
You do not need 1,400 samples to validate the core finding. A simple 12-text version takes approximately 20 minutes:
- Generate three texts in Claude 3.5 Sonnet using neutral prompts (one essay, one blog post, one email)
- Generate three texts in Gemini 1.5 Pro using the same three prompts
- Write three short texts yourself on the same topics
- Construct three mixed documents: take one of your human texts and insert one AI-generated paragraph from Claude or Gemini
- Submit all 12 texts to GPTOne and one other detector of your choice
- Record the scores and count false negatives (AI texts scoring below 50%) and false positives (human texts scoring above 50%); a minimal tally sketch follows below

The pattern from the full benchmark will be visible at this smaller scale. Claude 3.5 and Gemini 1.5 Pro will produce the most interesting results: they are the model versions where the gap between GPTOne and GPT-focused tools is largest.
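A minimal tally for the 12-text run might look like this. The scores below are made up to show the bookkeeping, and the 30% cutoff for a missed mixed document is our choice, not a standard.

```python
def tally(results: list[tuple[float, str]]) -> dict:
    """results: (detector score in %, ground truth) pairs,
    ground truth in {'ai', 'human', 'mixed'}."""
    return {
        "false_negatives": sum(1 for s, t in results if t == "ai" and s < 50),
        "false_positives": sum(1 for s, t in results if t == "human" and s > 50),
        "missed_mixed": sum(1 for s, t in results if t == "mixed" and s < 30),
    }

example = [(22, "ai"), (61, "ai"), (35, "ai"),        # Claude outputs
           (48, "ai"), (72, "ai"), (30, "ai"),        # Gemini outputs
           (8, "human"), (14, "human"), (55, "human"),    # your own texts
           (28, "mixed"), (44, "mixed"), (52, "mixed")]   # spliced documents
print(tally(example))
# {'false_negatives': 4, 'false_positives': 1, 'missed_mixed': 1}
```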
Start your free multi-model scan at GPTOne now
Frequently asked questions
Which AI detector is most accurate for Claude 3.5? In this benchmark, GPTOne achieved 91% accuracy on Claude 3.5 Sonnet specifically and 93% across the combined Claude 3 and 3.5 sample pool. The three other tools tested (GPTZero, Copyleaks, and ZeroGPT) do not publish standalone Claude 3.5 accuracy data. Comparative testing showed false negative rates of 21 to 29% across the full Claude sample set for those tools.
Can any AI detector reliably detect Gemini 1.5 Pro? Gemini 1.5 Pro is the hardest model version to detect in this benchmark. GPTOne achieved 86% accuracy on Gemini 1.5 Pro specifically, compared to 92% on Gemini 1.0. No tool in this comparison achieved above 90% on Gemini 1.5 Pro. Detection reliability degrades as models produce more naturalistic text, and this is the current frontier of that trend.
Why are false positive rates higher for non-native English speakers? Perplexity-based and pattern-matching classifiers associate formal, structured writing with AI output. Non-native speakers writing in formal academic or professional registers often produce text with lower perplexity and more consistent structure than native speakers' informal writing, both characteristics that detectors associate with AI generation. This is a systemic calibration issue, not a quirk of individual tools.
How often should AI detector training data be updated? The version gap findings in this benchmark suggest that training data should be updated with each major model release; at minimum, when a new model version reaches significant user adoption. Claude 3.5 Sonnet and Gemini 1.5 Pro both showed harder-to-detect patterns than their predecessors. A detector last updated on Claude 3 data will underperform on Claude 3.5 in ways the vendor may not publicly acknowledge.
Is a 93% detection accuracy rate good enough for high-stakes decisions? Not on its own. A 93% accuracy rate means 7 in 100 classifications are wrong in some direction. At institutional scale (500 submissions per semester), that represents approximately 35 errors. The appropriate use of any detector, including GPTOne, is to flag texts for human review, not to render verdicts. Process evidence, style analysis, and live assessment remain essential components of any high-stakes integrity workflow.

