The Journey
From “why does every AI conversation start from zero” to a research project about identity compression. Here’s what we built, what we learned, and what we got wrong.
The Question
Started with a simple frustration: every AI conversation begins from zero. Exported 1,892 ChatGPT conversations and asked: can we extract who someone IS from how they talk?
What we learned
- Raw conversation data is messy but rich. Behavioral patterns hide in thousands of exchanges
- Existing memory tools store facts. Nobody was modeling behavior
- The gap isn't retrieval. It's compression: turning signal into understanding
Building the Pipeline
Designed a multi-step extraction pipeline: parse conversations into structured facts, classify them by type and commitment depth, separate identity-tier patterns from noise. 47 constrained predicates. Tested local models (Qwen) and API models (Haiku, Sonnet).
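To make "constrained predicates" concrete, here is a minimal sketch of what such an extraction schema can look like. The predicate names, tiers, and fields are illustrative assumptions, not the project's actual 47-predicate schema; the point is that anything outside the allow-list gets rejected instead of summarized around.

```python
from dataclasses import dataclass

# Hypothetical predicate set and tiers; the real pipeline uses 47 predicates.
ALLOWED_PREDICATES = {"prefers", "avoids", "believes", "struggles_with", "commits_to"}
TIERS = ("identity", "context", "noise")

@dataclass
class Fact:
    subject: str     # always the anonymized speaker, e.g. "SUBJECT"
    predicate: str   # must come from the constrained predicate set
    obj: str         # short free-text object
    tier: str        # knowledge tier assigned by the classification pass
    source_id: str   # conversation/message id, for provenance

def validate(fact: Fact) -> Fact:
    """Reject anything outside the schema instead of letting extraction drift
    into open-ended summarization."""
    if fact.predicate not in ALLOWED_PREDICATES:
        raise ValueError(f"unknown predicate: {fact.predicate}")
    if fact.tier not in TIERS:
        raise ValueError(f"unknown tier: {fact.tier}")
    return fact

print(validate(Fact("SUBJECT", "avoids", "scheduling calls before noon",
                    "identity", "conv_0042/msg_7")))
```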
What we learned
- Structured extraction with constrained predicates beats open-ended summarization
- Knowledge tiering matters: only ~30% of facts are identity-relevant
- Local models work for extraction but fail at narrative generation
- Anonymization is essential. Models pattern-match to pre-training knowledge about named people
The Three-Layer Architecture
Designed a three-layer identity model: ANCHORS (epistemic axioms), CORE (communication guide), PREDICTIONS (behavioral triggers). Added a Collective review: four AI personas evaluating each layer. Built the full 14-step pipeline.
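A minimal sketch of how the three layers can be represented as data, with provenance attached to every claim. Field names are hypothetical illustrations, not the project's schema.

```python
from dataclasses import dataclass, field

@dataclass
class Anchor:                 # ANCHORS: epistemic axioms the subject reasons from
    claim: str
    supporting_facts: list[str]          # fact ids, so every axiom is traceable

@dataclass
class CoreGuide:              # CORE: how to communicate with the subject
    tone: str
    do: list[str]
    avoid: list[str]

@dataclass
class Prediction:             # PREDICTIONS: behavioral trigger -> likely response
    trigger: str
    expected_behavior: str
    supporting_facts: list[str]

@dataclass
class IdentityModel:
    anchors: list[Anchor] = field(default_factory=list)
    core: CoreGuide | None = None
    predictions: list[Prediction] = field(default_factory=list)

# Illustrative instance; claims and fact ids are made up.
model = IdentityModel(
    anchors=[Anchor("Evidence beats authority", ["fact_012", "fact_187"])],
    core=CoreGuide(tone="direct", do=["lead with data"], avoid=["small talk"]),
    predictions=[Prediction("asked to estimate without data",
                            "pushes back and asks for the source",
                            ["fact_044"])],
)
```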
What we learned
- Facts-only derivation prevents hallucination. Models can't invent what isn't in the data
- Blind regeneration (never showing prior output) reduces anchoring bias by 26%
- A Collective review by four AI personas seemed promising; ablation later showed it was ceremonial
- The pipeline grew to 14 steps, but we didn't yet know which ones mattered
N=10 Proof
Ran the pipeline on 10 diverse subjects: a founder's conversations, a philosopher's newsletters, personal journals, Benjamin Franklin's autobiography, Frederick Douglass, Mary Wollstonecraft, Theodore Roosevelt, patent filings, Warren Buffett's shareholder letters, Howard Marks' investment memos. All scored 73-82/100.
What we learned
- The same pipeline works on conversations, journals, autobiographies, letters, memos, and patents
- Document identity IS identity. Implicit worldview can be extracted from any text
- Provenance traceability (every claim traces to source facts) became a differentiator
- Single-domain corpora need different prompting than multi-domain conversations
Honest Evaluation
Built three evaluation frameworks: the BCB benchmark (4 metrics: 2 passed, 2 failed), a provenance-traced mechanical evaluation ($0, no LLM judge), and the Twin-2K external benchmark (100 participants, 71.83% accuracy at 18:1 compression, p=0.008). The brief matched full persona dumps at a fraction of the tokens.
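A minimal sketch of the mechanical-evaluation idea: score each brief claim by its best similarity against any source fact, with no LLM judge in the loop. Real runs would use vector embeddings; a bag-of-words cosine stands in here so the sketch has no dependencies, and the facts and claims are made up.

```python
import math
from collections import Counter

def vectorize(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def provenance_scores(claims: list[str], facts: list[str]) -> list[float]:
    """For each brief claim, take its best similarity against any source fact."""
    fact_vecs = [vectorize(f) for f in facts]
    return [max(cosine(vectorize(c), fv) for fv in fact_vecs) for c in claims]

facts = [
    "SUBJECT avoids scheduling calls before noon",
    "SUBJECT rewrites important emails at least three times",
]
claims = ["avoids calls before noon",       # supported: scores high
          "enjoys early-morning meetings"]  # unsupported: scores 0.0
print(provenance_scores(claims, facts))
```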
What we learned
- Compressed brief matches 130K-char persona dump at 1/18th the tokens (Twin-2K, p=0.008)
- BCB failures were interpretable: faithful compression increases adversarial vulnerability
- LLM-as-judge evaluation is circular. Mechanical metrics with vector similarity are auditable and free
- Effect sizes shrink on stronger models. Sonnet's baseline is so good the brief barely helps
Research Findings
Ran ablation studies on every dimension: which facts matter, which formats work, which pipeline steps are load-bearing. 14 conditions on Franklin (~$16), cross-validated on Sonnet and Qwen. The 14-step pipeline collapsed to 4 steps (Import, Extract, Author, Compose); the other 10 were ceremonial. Discovered compression saturation, temporal stability, and that behavioral facts outpredict biographical ones.
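A minimal sketch of the kind of subsampling ablation behind the saturation finding: keep an increasing fraction of facts, rebuild the brief, and score it downstream. `build_brief` and `score_brief` are placeholders for the real Author/Compose and evaluation steps, not the project's code.

```python
import random

def build_brief(facts: list[str]) -> str:
    """Placeholder for the real Author + Compose steps."""
    return " | ".join(facts)

def score_brief(brief: str) -> float:
    """Placeholder for the real downstream evaluation."""
    return min(1.0, len(set(brief.split())) / 200)

def saturation_curve(facts: list[str],
                     fractions=(0.05, 0.1, 0.2, 0.5, 1.0),
                     seed: int = 0) -> dict[float, float]:
    """Rebuild and score the brief from progressively larger fact samples."""
    rng = random.Random(seed)
    curve = {}
    for frac in fractions:
        kept = rng.sample(facts, max(1, int(len(facts) * frac)))
        curve[frac] = score_brief(build_brief(kept))
    return curve

# With dummy facts the placeholder metric just grows; with the real pipeline,
# a flat curve past ~0.2 is the saturation result described above.
print(saturation_curve([f"fact {i}" for i in range(200)]))
```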
What we learned
- Compression saturates at ~20% of facts. Throwing away the other 80% doesn't hurt
- Behavioral patterns are temporally stable: early facts predict late behavior and vice versa
- What you avoid and struggle with is more predictive than what you believe
- The annotated guide format beats the production brief by +24% on downstream tasks at 1/3 the length
- The full pipeline is worth ~4 points over a single Opus prompt with raw facts
- Scoring, classification, tiering, contradictions, consolidation, anchors extraction, and Collective review are all ceremonial, but the three-layer architecture IS load-bearing
V4 Validation
Locked the V4 compose prompt (false-positive guards plus tension-action pairs woven into prose). Re-authored and re-composed all 7 public subjects with the simplified pipeline; total cost: $2.18. Built automated data generation for the website. Full code review: a 7-phase audit found 13 bugs (including a quality-gate false positive on every compose run), the privacy scrub is complete, and there are 0 security blockers for public release.
What we learned
- V4 briefs are comparable across pipeline runs: same core patterns, different emphasis. Stability, not fragility
- A full re-author + re-compose costs $0.25–0.37 per subject, cheap enough to iterate freely
- Code review found a quality-gate bug that silently reported false COHERENCE gaps on every compose run
- Privacy scrub complete: 0 PII in scripts, 0 hardcoded secrets, subject data directories gitignored
- Paul Graham (28 essays, 272 facts extracted) is ready for authoring: the next case study
Where We Are Now
The 4-step pipeline became 5: during session 100 we discovered that provenance (tracing every claim to its source facts) was completely broken without vector embeddings, so we added an Embed step between Extract and Author. The step doesn't improve identity-model quality, but it makes the model inspectable and auditable. 47 predicates, 93 architectural decisions, 44 subjects live. Publishing everything: what works, what doesn't, what we don't know yet.
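A minimal sketch of what the Embed step buys: with a vector per fact, every claim in the brief can be linked back to its nearest supporting facts. The use of sentence-transformers, the model name, and the example data are assumptions for illustration, not necessarily the project's stack.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

facts = [
    "SUBJECT avoids scheduling calls before noon",
    "SUBJECT rewrites important emails at least three times",
]
claims = ["Protects unstructured mornings from meetings"]

# Embed once (the Embed step), then reuse the vectors for provenance audits.
fact_vecs = model.encode(facts, convert_to_tensor=True)
claim_vecs = model.encode(claims, convert_to_tensor=True)

# For each composed claim, keep its top supporting facts. This is what makes
# the final brief inspectable; it doesn't change the brief itself.
scores = util.cos_sim(claim_vecs, fact_vecs)
for i, claim in enumerate(claims):
    top = scores[i].argsort(descending=True)[:2]
    print(claim, "->", [facts[int(j)] for j in top])
```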
What we learned
- The contribution is the research findings, not the engineering
- Every agentic workflow is hollow if it doesn't understand who the human is behind the screen
- Compression is the hard problem. Memory is solved. Identity is not.
- We're not looking for validation. We're looking for feedback
By the Numbers
Open Questions
What we don’t know yet, and what we’re testing next.
Everything is open source. The pipeline, the findings, the failures. We’re not looking for validation. We’re looking for feedback.