More often than you’d guess, when you think, “Wow, this sounds so polished,” you’re actually hearing a statistical pattern, not a person. A late‑night email from a manager, a student essay, a glowing product review: some of them are secretly stitched together by an algorithm. Can you tell which?
That polished feel isn’t an accident; it’s a side effect of how the system was trained. Models are rewarded for sounding smooth, safe, and “reasonable,” so their writing starts to fall into familiar grooves. You’ll see paragraphs that glide forward without really stopping to doubt themselves, hedge, or get oddly specific in the way real people do when they’re tired, rushed, or unsure.
In practice, that means a suspiciously consistent rhythm: sentences that seem to march in step, like a band playing at one steady tempo, rarely speeding up for excitement or slowing down to untangle a messy idea. The voice stays remarkably even too—little change in mood, energy, or level of detail, even when the topic should invite digressions, jokes, or oddly personal asides that most humans can’t resist slipping in.
What makes this tricky is that human writing isn’t just “messy” in a general way—it’s messy in patterns. We speed up when we’re excited, slow down when we’re thinking, and drop in oddly specific details the way a traveler collects random ticket stubs and receipts in their pocket. Detection tools lean on that contrast. They quantify how often sentence lengths change, how quickly vocabulary shifts, how sharply a writer zooms from big claims down to concrete facts. Then they compare that fingerprint to huge samples of confirmed human and AI text to see which it resembles most.
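To make the fingerprint idea concrete, here’s a toy Python sketch. Everything in it is illustrative: the two reference profiles are invented stand-ins for what a real tool would fit on millions of labeled samples.

```python
import math

def fingerprint(sentence_lengths):
    """Two crude style stats: mean sentence length and how much it varies."""
    mean = sum(sentence_lengths) / len(sentence_lengths)
    spread = (sum((n - mean) ** 2 for n in sentence_lengths) / len(sentence_lengths)) ** 0.5
    return (mean, spread)

# Invented reference profiles (mean words per sentence, spread); a real tool
# would fit these on huge samples of confirmed human and AI text.
HUMAN_PROFILE = (17.0, 9.0)   # humans vary a lot
AI_PROFILE = (21.0, 3.5)      # models tend to march in step

def closer_to(fp):
    """Label a fingerprint by whichever reference profile it sits nearer."""
    return "human-like" if math.dist(fp, HUMAN_PROFILE) < math.dist(fp, AI_PROFILE) else "AI-like"

print(closer_to(fingerprint([4, 28, 11, 19, 7])))  # varied lengths -> human-like
```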
Here’s where things get interesting: the cues that feel “obvious” to humans often aren’t the ones detectors actually use, and the cues detectors use are often invisible to us.
Start with the numbers. In OpenAI’s own 2023 tests, its detector got it wrong a lot even on neatly written paragraphs: it labeled about 9% of real human text as AI, and correctly flagged only about 26% of AI‑written text, missing the rest. That’s not just a rounding error; it means any yes/no label from a single tool is closer to a weather forecast than a lab result. It gives you a probability under certain conditions, not an absolute truth.
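A bit of Bayes’ rule shows just how forecast-like that is. The sketch below uses those two reported rates plus a base rate of AI text that is purely my assumption, since nobody knows the real one:

```python
def p_ai_given_flag(base_rate, tpr=0.26, fpr=0.09):
    """Bayes' rule: probability a flagged text really is AI."""
    flagged_ai = base_rate * tpr            # AI texts the detector catches
    flagged_human = (1 - base_rate) * fpr   # humans caught in the net
    return flagged_ai / (flagged_ai + flagged_human)

for base in (0.05, 0.25, 0.50):
    print(f"if {base:.0%} of texts are AI, a flag means AI with p = {p_ai_given_flag(base):.0%}")
```

If only 5% of submissions are AI‑written, a flag is wrong more often than it’s right; that alone is a good reason not to let a single verdict end the conversation.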
So how are these tools deciding? Under the hood, they lean on measures you rarely think about when you’re reading:
• Perplexity: How “surprised” a language model is by each next word. AI prose tends to keep surprise low; human writers spike it accidentally with odd phrasing, side comments, or niche references.
• Burstiness: How much the style bunches and stutters. People often produce clusters of short, blunt sentences and then a long, tangled one. AI tends to smooth those clumps out.
• Entropy: How evenly the text draws from its vocabulary. Humans lean harder on favorite words and phrases; models spread their bets more evenly across synonyms.
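Two of these can be roughly approximated with nothing but the text itself; perplexity is the exception, since it needs an actual language model to score each word. A minimal sketch of the other two:

```python
import math
import re
from collections import Counter

def burstiness(text):
    """Spread of sentence lengths relative to their mean; higher = clumpier, more human."""
    lengths = [len(s.split()) for s in re.split(r"[.!?]+", text) if s.strip()]
    mean = sum(lengths) / len(lengths)
    variance = sum((n - mean) ** 2 for n in lengths) / len(lengths)
    return math.sqrt(variance) / mean

def vocab_entropy(text):
    """Shannon entropy of word frequencies; higher = vocabulary spread more evenly."""
    counts = Counter(w.lower() for w in re.findall(r"[a-zA-Z']+", text))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

sample = "Short one. Then a much longer, winding sentence that wanders. Tiny."
print(f"burstiness {burstiness(sample):.2f}, entropy {vocab_entropy(sample):.2f} bits")
```

The absolute numbers mean little on their own; what matters is how two texts of similar length and topic compare.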
No single measure is decisive. That’s why more serious systems combine dozens of tiny signals. Some look at revision history: a student pasting in a full page in one second is different from someone typing steadily over an hour. Others check context: is this style wildly different from the person’s previous work, or does it match their past essays and emails?
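What might that combination look like? A deliberately tiny sketch, with made-up feature names and weights; a real system would learn both from labeled data rather than hard-coding them:

```python
# Hypothetical per-text measurements, each scaled to 0..1 by earlier steps.
features = {
    "low_perplexity": 0.8,
    "low_burstiness": 0.7,
    "pasted_in_one_edit": 1.0,        # from revision history, when available
    "style_shift_vs_past_work": 0.3,  # from comparison with prior essays
}
# Made-up weights; a production system would fit these on labeled examples.
weights = {
    "low_perplexity": 0.35,
    "low_burstiness": 0.25,
    "pasted_in_one_edit": 0.25,
    "style_shift_vs_past_work": 0.15,
}

score = sum(weights[name] * value for name, value in features.items())
print(f"combined suspicion score: {score:.2f} (a hint, not a verdict)")
```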
This is also why myths about “fooling” detectors are shaky. Sprinkling in typos or emojis can bump perplexity, but hybrid systems cross‑check many dimensions at once. Think of a doctor diagnosing an illness: changing your diet the day before a checkup might nudge one lab result, but it won’t erase a pattern of scans, symptoms, and history.
At scale, these patterns show up in the wild. Turnitin reported that, out of 38.5 million submissions in one six‑month window, close to 10% contained at least 20% AI‑written sentences. GPTZero advertises extremely high specificity, but independent tests usually find a more modest, though still useful, range. The takeaway isn’t “detectors are useless” or “detectors never lie,” but that they’re best treated as one piece of evidence among many—especially when stakes are high for real people.
Think about a long group chat. One friend writes in bursts—three half-sentences, a meme, then silence. Another drops tidy, complete thoughts every time. Now imagine a “chat detective” scrolling back through months of history, not caring what anyone said, only how they said it. Are the jokes unevenly spaced? Do inside references suddenly vanish? Does one person’s style flip overnight from messy to eerily consistent?
You can do a low-tech version of that with prose in front of you. Take two paragraphs you suspect:
• Read them out loud and mark where you naturally pause or stumble. Do all the pauses come at neat clause boundaries, or are there odd mid-sentence hesitations that feel more like someone thinking on the page?
• Scan for oddly specific artifacts: a half-remembered date, a misquoted lyric, a place name that doesn’t quite fit. Genuine writers leak these the way travelers come home with mismatched coins in their pockets.
Whenever everything lines up a bit too cleanly—tone, pacing, confidence—you’re not “catching” AI so much as noticing an absence of human clutter.
Soon, “who wrote this?” may become a standard metadata field, not an afterthought. Expect layered proof: cryptographic watermarks baked in at creation, plus public ledgers that store hashes of original drafts. Schools might pivot from hunting cheaters to auditing how students *collaborate* with AI, the way labs track which instrument produced which data. News outlets could quietly log AI involvement the way studios log session musicians—visible if you look, crucial when disputes arise.
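The “hashes of original drafts” part is the easiest piece to picture. A minimal sketch, assuming drafts live in ordinary files (the filename is hypothetical, and the ledger that would store the digest is out of scope):

```python
import hashlib

def draft_fingerprint(path):
    """SHA-256 digest of a draft file; editing even one character changes it."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# print(draft_fingerprint("essay_draft_v1.txt"))  # hypothetical filename
```

Publish that digest somewhere timestamped and you can later prove the draft existed at that moment without revealing a word of it.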
Instead of chasing a perfect “AI or not” verdict, treat each text like a song whose producer you’re trying to guess: zoom in on the mix, not just the melody. Your challenge this week: pick three things you read—a policy, a review, a homework answer—and jot why they *might* be AI. Don’t decide. Just practice noticing the seams.

