Uncanny Valley of Text
04 · AI Research & Forensics
Stylometry · Transformers · Forensic NLP · Classification · AI Attribution

The Uncanny Valley of Text

Every AI model has a statistical fingerprint that human readers cannot see. This tool reads that fingerprint to identify which model wrote a piece of text, which version, and what prompt structure likely produced it, all from the output alone.

Status: Building
Domain: AI Research
Method: Stylometry + NLI
Output: Model Attribution

The Concept

Every model writes the same way, every time.

GPT-4 has characteristic sentence-length distributions, punctuation preferences, hedging patterns, and transitional-phrase tendencies. Claude uses specific discourse markers. Gemini structures paragraphs in identifiable ways. These patterns persist across topics, languages, and styles because they are baked into the model weights themselves.

Humans can't see these patterns. Statistical models can. The Uncanny Valley of Text is a forensic tool that reads the fingerprint invisible to the naked eye and returns a verdict: which model, which version, which prompt archetype produced this text.
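
To make "fingerprint" concrete, here is a minimal sketch of how a few of the features named above (sentence length, hedging, punctuation) could be measured. It uses only the Python standard library. The hedge word list, the feature names, and the naive sentence splitter are illustrative assumptions, not the project's actual pipeline, which the stack lists as NLTK / spaCy.

```python
from __future__ import annotations

import re
from statistics import mean, pstdev

# Hypothetical hedge lexicon (assumption); a real one would be larger and tuned on data.
HEDGES = frozenset(
    {"may", "might", "could", "perhaps", "possibly", "generally", "typically", "likely"}
)

WORD_RE = re.compile(r"[A-Za-z']+")


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def extract_features(text: str) -> dict[str, float]:
    """Return a small stylometric feature vector for one text sample."""
    sentences = split_sentences(text)
    words = WORD_RE.findall(text.lower())
    lengths = [len(WORD_RE.findall(s)) for s in sentences]
    n_words = max(len(words), 1)  # avoid division by zero on empty input
    return {
        "avg_sentence_len": float(mean(lengths)) if lengths else 0.0,
        "sentence_len_sd": pstdev(lengths) if len(lengths) > 1 else 0.0,
        "hedge_rate": sum(w in HEDGES for w in words) / n_words,
        "comma_rate": text.count(",") / n_words,
        "semicolon_rate": text.count(";") / n_words,
    }


features = extract_features("This may work. It could fail, perhaps.")
```

Each rate is normalised by word count so that long and short samples stay comparable. A single short sample gives noisy values, so in practice these features would presumably be averaged over many samples per model.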

Statistical fingerprint analysis · Sample text
Features compared for each model: hedge frequency, average sentence length, enumerative style, disclaimer rate.
GPT-4o · 94% match
Claude 3 Sonnet · 61% match
Gemini 1.5 Pro · 28% match
Llama 3 70B · 41% match
Verdict → GPT-4o · High confidence (94%) · Prompt archetype: Instructional / Expository · System prompt detected: likely includes "be thorough"

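
The match percentages in the panel above add up to more than 100%, which suggests each one is an independent similarity score rather than a share of probability. That reading is an assumption. Under it, one plausible way to produce such scores is to measure how far a sample's feature vector sits from each model's stored fingerprint and map that distance onto a 0 to 1 scale. The sketch below does this with a standardised Euclidean distance. The model names, centroid values, and scales are placeholders invented for illustration, not measured data.

```python
from __future__ import annotations

import math


def match_score(
    sample: dict[str, float],
    centroid: dict[str, float],
    scale: dict[str, float],
) -> float:
    """Similarity in (0, 1]: 1.0 when the sample equals the centroid."""
    # Divide each difference by that feature's spread so no feature dominates.
    squared = [((sample[k] - centroid[k]) / scale[k]) ** 2 for k in centroid]
    distance = math.sqrt(sum(squared) / len(squared))
    return math.exp(-distance)


def rank_models(
    sample: dict[str, float],
    fingerprints: dict[str, dict[str, dict[str, float]]],
) -> list[tuple[str, float]]:
    """Score the sample against every fingerprint, best match first."""
    scores = {
        name: match_score(sample, fp["centroid"], fp["scale"])
        for name, fp in fingerprints.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder fingerprints (assumption): not real measurements of any model.
FINGERPRINTS = {
    "model_a": {
        "centroid": {"hedge_rate": 0.02, "avg_sentence_len": 18.0},
        "scale": {"hedge_rate": 0.01, "avg_sentence_len": 4.0},
    },
    "model_b": {
        "centroid": {"hedge_rate": 0.05, "avg_sentence_len": 24.0},
        "scale": {"hedge_rate": 0.01, "avg_sentence_len": 4.0},
    },
}

sample = {"hedge_rate": 0.02, "avg_sentence_len": 18.0}
ranked = rank_models(sample, FINGERPRINTS)
for name, score in ranked:
    print(f"{name}: {score:.0%} match")
```

Because each score depends only on the distance to one fingerprint, adding a new model does not change the scores of the existing ones.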
Features

What it detects

🔍 Model Attribution
Identifies which AI model produced a piece of text (GPT-4, Claude, Gemini, or Llama), with a confidence score for each candidate.

🕰️ Version Detection
Models change with version updates, and the fingerprint changes with them. This tracks that drift: GPT-4 and GPT-4o have measurably different stylistic distributions.

🎯 Prompt Archetype Inference
Infers the likely prompt structure behind the output (instructional, conversational, creative, or system-prompted) from the output's statistical shape.

📊 Feature Visualiser
Shows the stylometric features driving the verdict, such as sentence length, hedging, and discourse markers, so the decision is interpretable.

🔄 Continuous Model Updates
As new models are released, the fingerprint library expands. The system is built to be extended: not a fixed classifier, but a living forensic database.

🔬 Research Mode
Exposes raw feature vectors and statistical distances for researchers studying model behaviour, AI safety, and content authenticity.
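
Two of the features above, the extensible fingerprint library and the per-feature explanation, can be illustrated together. The sketch below is one possible design, not the project's implementation. The class name `FingerprintLibrary` and its methods are hypothetical, and the production store is listed as PostgreSQL rather than an in-memory dict.

```python
from __future__ import annotations

from statistics import mean, pstdev


class FingerprintLibrary:
    """In-memory store of per-model feature statistics (hypothetical design)."""

    def __init__(self) -> None:
        # model name -> feature name -> (mean, standard deviation)
        self._models: dict[str, dict[str, tuple[float, float]]] = {}

    def register(self, name: str, vectors: list[dict[str, float]]) -> None:
        """Add or replace a model's fingerprint from sample feature vectors."""
        if not vectors:
            raise ValueError("need at least one feature vector")
        stats = {}
        for key in vectors[0]:
            values = [v[key] for v in vectors]
            # Fall back to 1.0 when a feature never varies, to avoid dividing by zero.
            stats[key] = (mean(values), pstdev(values) or 1.0)
        self._models[name] = stats

    def explain(self, name: str, sample: dict[str, float]) -> dict[str, float]:
        """Per-feature z-scores: how far the sample sits from the model's mean."""
        return {
            key: (sample[key] - mu) / sd
            for key, (mu, sd) in self._models[name].items()
        }

    def models(self) -> list[str]:
        return sorted(self._models)


library = FingerprintLibrary()
# Placeholder vectors (assumption): not real measurements of any model.
library.register("model_a", [{"hedge_rate": 1.0}, {"hedge_rate": 3.0}])
library.register("model_b", [{"hedge_rate": 5.0}, {"hedge_rate": 5.0}])
deltas = library.explain("model_a", {"hedge_rate": 4.0})
```

Registering a new model only adds an entry and leaves existing fingerprints untouched, which is the property the "living forensic database" description relies on. The z-scores from `explain` are the kind of raw per-feature distances a visualiser or research mode could expose.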
Stack

How it's built

🐍 Python · Core Analysis
🔬 NLTK / spaCy · Stylometry
🤗 Hugging Face · Embeddings
🧮 scikit-learn · Classification
📊 Plotly · Visualisation
FastAPI · Analysis API
⚛️ React · UI
🐘 PostgreSQL · Fingerprint DB

Currently in development

Feature extraction pipeline built · Training data collection in progress · Initial classifier at 78% accuracy on 4 models
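
For context on the accuracy figure: with four candidate models and a balanced test set, random guessing would score 25%. A single accuracy number can also hide one model being much harder to identify than the others, so a per-model breakdown is a useful companion. The sketch below shows how both could be computed. The labels are placeholders, and nothing here reproduces the project's evaluation.

```python
from __future__ import annotations

from collections import Counter


def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Fraction of samples attributed to the correct model."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion(y_true: list[str], y_pred: list[str]) -> Counter:
    """Counts of (true model, predicted model) pairs."""
    return Counter(zip(y_true, y_pred))


def per_model_recall(y_true: list[str], y_pred: list[str]) -> dict[str, float]:
    """For each true model, the share of its samples identified correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {model: hits[model] / totals[model] for model in totals}


# Placeholder labels (assumption): "a" to "d" stand in for four models.
y_true = ["a", "a", "b", "b", "c", "d"]
y_pred = ["a", "b", "b", "b", "c", "a"]

overall = accuracy(y_true, y_pred)
recall = per_model_recall(y_true, y_pred)
pairs = confusion(y_true, y_pred)
```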
