Methodology

How we detect and eliminate AI fingerprints.

AI detectors don't use magic. They measure specific statistical properties of text that differ between human writers and language models. This page explains exactly what we measure, why those measurements detect AI, and how our humanization pipeline eliminates those signals.

Live demo

Check your text for AI patterns right now.

Paste any text below to run our 8-dimension analysis. The tool runs entirely in your browser — nothing is sent to a server at this stage. You'll see an AI score and a breakdown of which patterns were detected.


Scoring system

Eight independent signals, one composite score.

Our AI score is a weighted composite of eight independent measurements. Each dimension targets a distinct property of AI-generated text. The weights are calibrated against labeled corpora — texts we know to be AI-generated versus texts written by humans.
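As an illustration, the composite can be sketched as a weighted sum of normalized signals. The weights below are hypothetical placeholders for the sake of the example, not the calibrated production values:

```python
# Hypothetical weights for illustration only; the calibrated values
# used in production are not listed on this page.
WEIGHTS = {
    "burstiness": 0.20,
    "ai_vocabulary": 0.20,
    "parallel_structure": 0.14,
    "transitions": 0.12,
    "passive_voice": 0.12,
    "short_sentences": 0.08,
    "em_dashes": 0.08,
    "contractions": 0.06,
}

def composite_score(signals):
    """signals maps each dimension name to a 0..1 value (1 = AI-like)."""
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
```

Because the weights sum to 1, the composite stays on the same 0..1 scale as the individual signals.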

01

Sentence Length Variance (Burstiness)

High signal

Human writers naturally mix very short and very long sentences. A paragraph might go: two words. Then a much longer sentence that develops an idea with subordinate clauses and qualifications. Then another short one. Language models produce sentences of predictably similar length — usually 18–25 words each. We measure this as the coefficient of variation (CV%) of sentence lengths. CV% below 40% strongly signals AI authorship.

AI PATTERN

"Each dimension is assessed with precision. The scoring system evaluates multiple factors. Results are presented in a clear format."

HUMAN PATTERN

"Short. Then a longer sentence that shows natural variation in rhythm and length. Then short again."
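The burstiness measurement can be sketched in a few lines, using a crude regex sentence splitter (the production tokenizer is more careful):

```python
import re
import statistics

def sentence_lengths(text):
    # Split on ., !, ? followed by whitespace; crude but fine for a demo.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]

def cv_percent(text):
    """Coefficient of variation of sentence lengths, as a percentage."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths) / statistics.mean(lengths) * 100

ai = ("Each dimension is assessed with precision. "
      "The scoring system evaluates multiple factors. "
      "Results are presented in a clear format.")
human = ("Short. Then a longer sentence that shows natural variation "
         "in rhythm and length. Then short again.")

print(cv_percent(ai), cv_percent(human))
```

Run on the two pattern examples above, the AI sample lands well under the 40% threshold and the human sample well above it.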

02

AI Vocabulary Detection

High signal

Language models are trained on internet text and develop strong statistical preferences for certain words. "Delve" appears in LLM output at roughly 100x the rate it appears in natural human writing. "Crucial," "straightforward," "tapestry," "foster," "leverage," "embark," "realm," and "utilize" follow the same pattern. We maintain a list of 60+ overused AI terms. High density of these terms is a strong detection signal.

AI PATTERN

"It is crucial to delve into this realm and leverage straightforward approaches to foster meaningful results."

HUMAN PATTERN

"You need to dig into this area and use simple methods to get real results."
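A minimal density check against a small sample of that vocabulary list (the full list described above runs to 60+ entries) might look like:

```python
import re

# A small sample of the overused-AI-term list, not the full 60+ entries.
AI_TERMS = {"delve", "crucial", "straightforward", "tapestry",
            "foster", "leverage", "embark", "realm", "utilize"}

def ai_term_density(text):
    """AI-term hits per 100 words."""
    words = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for w in words if w in AI_TERMS)
    return hits / max(len(words), 1) * 100

sample = ("It is crucial to delve into this realm and leverage "
          "straightforward approaches to foster meaningful results.")
print(ai_term_density(sample))  # 37.5: six flagged terms in sixteen words
```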

03

Transition Word Density

Medium signal

AI models overuse logical connectors — "furthermore," "moreover," "additionally," "in conclusion," "it is worth noting," "consequently." These words signal structured reasoning, which LLMs produce at a rate far above natural human writing. Humans connect ideas more implicitly, or simply start a new sentence without a connector. We count transition word frequency per 100 words. Above 5 per 100 words is a yellow flag.

AI PATTERN

"Furthermore, this approach is effective. Moreover, it has been tested extensively. Additionally, the results confirm the hypothesis."

HUMAN PATTERN

"This approach works well. We have tested it extensively and the results back it up."
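The transition-density count can be sketched with simple substring matching over a phrase list (the real detector's list is longer):

```python
import re

TRANSITIONS = ["furthermore", "moreover", "additionally",
               "in conclusion", "it is worth noting", "consequently"]

def transitions_per_100_words(text):
    lower = text.lower()
    hits = sum(lower.count(phrase) for phrase in TRANSITIONS)
    words = len(re.findall(r"[a-z']+", lower))
    return hits / max(words, 1) * 100

sample = ("Furthermore, this approach is effective. Moreover, it has been "
          "tested extensively. Additionally, the results confirm the hypothesis.")
print(transitions_per_100_words(sample))  # well above the 5-per-100 threshold
```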

04

Passive Voice Ratio

Medium signal

Passive voice is not an AI-specific pattern, but AI models overuse it relative to casual human writing. "The analysis was conducted" instead of "we ran the analysis." "Results were observed" instead of "we saw." Formal academic writing uses passive voice legitimately, but when passive constructions appear alongside other AI signals, the combination is diagnostic. We measure passive voice sentences as a percentage of total sentences.

AI PATTERN

"The experiment was conducted, results were analyzed, and conclusions were drawn by the research team."

HUMAN PATTERN

"The research team ran the experiment, analyzed the results, and drew their conclusions."
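A rough heuristic for passive constructions, a form of "to be" followed by a participle-like word, can be sketched as follows. Real detectors use a parser; this regex is only illustrative and misses irregular participles such as "drawn":

```python
import re

# Crude heuristic: a form of "to be" followed by a word ending in -ed/-en.
PASSIVE = re.compile(r"\b(?:is|are|was|were|been|being|be)\s+\w+(?:ed|en)\b",
                     re.IGNORECASE)

def passive_ratio(text):
    """Percentage of sentences containing a passive-looking construction."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    passive = sum(1 for s in sentences if PASSIVE.search(s))
    return passive / max(len(sentences), 1) * 100

ai = "The experiment was conducted. Results were analyzed. The report was written."
human = "The team ran the experiment and analyzed the results."
print(passive_ratio(ai), passive_ratio(human))  # 100.0 0.0
```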

05

Short Sentence Presence

Supporting signal

This dimension works in tandem with burstiness. We specifically check whether any sentences under 8 words exist in the text. Human writers naturally produce very short sentences — fragments, punchy one-liners, abrupt statements for emphasis. LLMs rarely produce sentences this short unless specifically prompted. Complete absence of short sentences is a weak but consistent AI signal.
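This check reduces to a single boolean, sketched here with the same crude sentence splitter:

```python
import re

def has_short_sentence(text, max_words=7):
    """True if any sentence is under 8 words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return any(len(s.split()) <= max_words for s in sentences)

print(has_short_sentence("This works. It really does."))  # True
print(has_short_sentence(
    "This sentence contains exactly nine words in total, truly."))  # False
```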

06

Em Dash Usage

Medium signal

Em dashes are a stylistic marker that some models — particularly GPT-4 and Claude — overuse dramatically. A human writer might use one or two em dashes per page. AI-generated text sometimes places em dashes in nearly every paragraph. We flag texts with high em dash frequency relative to sentence count. This dimension matters specifically for GPT-4 and Claude output, which show this pattern most strongly.
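Measuring em dash frequency relative to sentence count is a one-liner in spirit; a sketch:

```python
import re

def em_dash_rate(text):
    """Em dashes (U+2014) per sentence."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return text.count("\u2014") / max(len(sentences), 1)

sample = ("Some models \u2014 notably a few chat models \u2014 "
          "overuse this mark. Humans rarely do.")
print(em_dash_rate(sample))  # 1.0: an em dash for every sentence
```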

07

List and Parallel Structure

High signal

LLMs default to structured output — numbered lists, bullet points, symmetrical three-part constructions. "First... Second... Third..." patterns. "Not only X, but also Y" constructions. Balanced parallel clauses. These structures signal organized, machine-generated reasoning. Human writing is messier — ideas trail off, points get developed unevenly, lists are less common. High parallel structure density is one of the strongest AI detection signals we measure.

AI PATTERN

"First, analyze the data. Second, identify the patterns. Third, draw your conclusions based on the evidence gathered."

HUMAN PATTERN

"Look at the data, see what patterns show up, and go from there."
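Two of the parallel-structure markers mentioned above (enumerator openers and "not only ... but also") can be counted with a simple heuristic; the production detector looks at more constructions than this sketch does:

```python
import re

# Heuristic markers of parallel structure: enumerator openers and
# "not only ... but also" constructions.
ENUMERATORS = re.compile(r"\b(?:First|Second|Third|Fourth)\b[,.]?",
                         re.IGNORECASE)
NOT_ONLY = re.compile(r"\bnot only\b.*?\bbut also\b",
                      re.IGNORECASE | re.DOTALL)

def parallel_markers(text):
    return len(ENUMERATORS.findall(text)) + len(NOT_ONLY.findall(text))

sample = ("First, analyze the data. Second, identify the patterns. "
          "Third, draw your conclusions.")
print(parallel_markers(sample))  # 3
```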

08

Contraction Frequency

Supporting signal

Contractions (it's, don't, you'll, we're) signal casual register. LLMs writing in formal mode avoid contractions entirely, producing text that reads as unnaturally stiff. Low contraction frequency combined with other signals suggests AI authorship. Note: this dimension is register-dependent — academic writing legitimately avoids contractions — so we weight it lower and primarily use it as a tiebreaker when other signals are mixed.

AI PATTERN

"It is important to understand that these patterns do not appear in isolation. They are part of a larger system."

HUMAN PATTERN

"It's worth knowing these patterns don't appear alone. They're part of something bigger."
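Counting contractions per 100 words is straightforward; note that this crude sketch would also count possessive apostrophes, which a careful implementation would exclude:

```python
import re

def contractions_per_100_words(text):
    # Crude: any apostrophe-bearing word counts, so possessives
    # ("the model's output") would be counted too.
    words = re.findall(r"[A-Za-z']+", text)
    contractions = sum(1 for w in words if "'" in w)
    return contractions / max(len(words), 1) * 100

formal = "It is important to understand that these patterns do not appear in isolation."
casual = "It's worth knowing these patterns don't appear alone."
print(contractions_per_100_words(formal), contractions_per_100_words(casual))
```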

Key insight

Why burstiness is the most important signal.

Burstiness — the term researchers use for high variance in sentence length — is the single most reliable indicator of human authorship. The reason is structural: language models are trained to produce coherent, well-organized text. That training systematically pushes toward uniform sentence length. A human writer doesn't optimize for coherence at the sentence level — they write until the thought is expressed, which produces wildly variable lengths.

The coefficient of variation (CV%) measures this directly. CV% is the standard deviation of sentence lengths divided by the mean, expressed as a percentage. Human writing typically has a CV% between 50% and 90%. AI-generated text clusters between 20% and 40%. This single measurement alone can separate human-written from AI-authored text with roughly 80% accuracy.

Our humanization prompt specifically instructs the model to vary sentence length dramatically — mixing sentences under 8 words with sentences over 35 words in the same paragraph. The post-processing layer then checks the resulting CV% and flags any output that remains in the AI range.

Per-token probability

Why synonym swapping doesn't fool Turnitin.

Academic AI detectors like Turnitin use a different approach from pattern-based tools: they measure per-token perplexity, which is a measure of how "surprising" each word choice is relative to what a language model would predict. LLMs tend to choose highly probable tokens — words that are the obvious continuation of the phrase. Human writers make unexpected word choices more often.

Low perplexity = each word was predictable = likely machine-generated. High perplexity = word choices were surprising = likely human. This is why simple synonym swapping doesn't work: if you replace "utilize" with "use," you've changed one high-probability word for another high-probability word — the perplexity stays low.
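The arithmetic behind this is simple. Given per-token log-probabilities from some language model (the values below are invented for illustration), perplexity is the exponential of the negative mean log-probability:

```python
import math

def perplexity(logprobs):
    """Perplexity from per-token natural-log probabilities:
    exp of the negative mean log-probability."""
    return math.exp(-sum(logprobs) / len(logprobs))

# Hypothetical log-probs: predictable tokens sit near 0 (probability
# near 1), surprising tokens are strongly negative.
predictable = [-0.1, -0.2, -0.1, -0.3]   # each token was expected
surprising = [-2.0, -3.5, -1.0, -4.0]    # unusual word choices
print(perplexity(predictable), perplexity(surprising))
```

The predictable sequence yields a perplexity near 1; the surprising one is an order of magnitude higher, which is the direction a detector reads as "more human".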

Refrazr addresses this by rewriting entire sentence structures. When the grammatical construction changes, the conditional probability of each subsequent word changes too. Structural rewriting raises per-token entropy in a way that vocabulary substitution cannot. This is why we see consistent results against Turnitin, which specifically targets perplexity, while tools that only paraphrase do not.

The pipeline

Four steps from AI text to human text.

01

Pattern Analysis

Your text is scored across all 8 dimensions. The results are packed into the LLM prompt — not just "humanize this," but "this text has CV% of 28%, remove the 4 list constructions, and eliminate these specific transition phrases." The model receives precise instructions, not vague goals.
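Prompt assembly under this design might look like the sketch below; the field names and wording are illustrative inventions, not the production prompt:

```python
def build_prompt(text, analysis):
    # Hypothetical prompt assembly; the analysis keys are illustrative.
    instructions = [f"This text has a CV% of {analysis['cv_percent']}%."]
    if analysis["list_constructions"]:
        instructions.append(
            f"Remove the {analysis['list_constructions']} list constructions.")
    if analysis["transition_phrases"]:
        phrases = ", ".join(analysis["transition_phrases"])
        instructions.append(f"Eliminate these transition phrases: {phrases}.")
    return ("Rewrite the text below to read as human-written. "
            + " ".join(instructions) + "\n\n" + text)

analysis = {"cv_percent": 28, "list_constructions": 4,
            "transition_phrases": ["furthermore", "moreover"]}
print(build_prompt("Sample text.", analysis))
```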

02

LLM Deep Rewriting

The LLM (DeepSeek V3 via OpenRouter) rewrites the text with structural-level instructions. It is explicitly told not to: use transition phrases, construct parallel lists, maintain uniform sentence length, or use any word from our AI vocabulary list. It is told to: vary sentence length dramatically, use casual register, write imperfect non-symmetrical sentences.

03

60+ Post-Processing Rules

The LLM output runs through a deterministic post-processor. These rules catch what the probabilistic model misses — specific AI vocabulary that survived the rewrite, any em dash overuse, transition phrases that reappeared, symmetrical constructions that snuck back in. This layer is fast, reliable, and runs every time regardless of model quality.
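A deterministic rule pass of this kind amounts to an ordered list of pattern-replacement pairs. The three rules below are a tiny illustrative subset, not the actual 60+ rule set:

```python
import re

# A tiny illustrative subset of deterministic cleanup rules; the real
# post-processor described above has 60+ of these.
RULES = [
    (re.compile(r"\butilize\b", re.IGNORECASE), "use"),
    (re.compile(r"\bdelve into\b", re.IGNORECASE), "dig into"),
    (re.compile(r"\bFurthermore,\s*", re.IGNORECASE), ""),
]

def post_process(text):
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

print(post_process("Furthermore, we utilize this method to delve into the data."))
```

A real implementation would also repair sentence-initial capitalization after a deletion; this sketch only shows the substitution pass.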

04

Quality Scoring and Retry

The processed output is scored again. If the AI score is still high (above 30%), the pipeline retries — up to 4 attempts. When multiple attempts are made, the result with the lowest AI score is returned. In practice, the first pass usually achieves a score below 10%.
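The retry-and-keep-best logic can be sketched with stand-in functions; in production the rewrite step calls the LLM and the score step runs the 8-dimension analysis:

```python
def humanize_with_retry(text, rewrite, score,
                        threshold=30.0, max_attempts=4):
    """Retry the rewrite until the AI score drops to the threshold,
    keeping the best-scoring candidate seen so far."""
    best_text, best_score = None, float("inf")
    for _ in range(max_attempts):
        candidate = rewrite(text)
        s = score(candidate)
        if s < best_score:
            best_text, best_score = candidate, s
        if best_score <= threshold:
            break  # early exit once the target is reached
    return best_text, best_score

# Demo with stand-ins: a fake rewriter and a scorer whose canned
# values improve on each call.
scores = iter([55.0, 42.0, 18.0])
result, final = humanize_with_retry("draft", lambda t: t + "*",
                                    lambda t: next(scores))
print(final)  # 18.0: the third attempt crossed the 30% threshold
```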

Try it

Scroll up to test your own text.

Free, unlimited browser analysis. No account needed to run the 8-dimension breakdown.
