The Journey
From “why does every AI conversation start from zero” to a research project about behavioral compression. Here’s what we built, what we learned, and what we got wrong.
The Question
Started with a simple frustration: every AI conversation begins from zero. Exported 1,892 ChatGPT conversations and asked — can we extract who someone IS from how they talk?
What we learned
- Raw conversation data is messy but rich — behavioral patterns hide in thousands of exchanges
- Existing memory tools store facts. Nobody was modeling behavior.
- The gap isn't retrieval. It's compression — turning signal into understanding.
Building the Pipeline
Designed a multi-step extraction pipeline: parse conversations into structured facts, classify them by type and commitment depth, separate identity-tier patterns from noise. 47 constrained predicates. Tested local models (Qwen) and API models (Haiku, Sonnet).
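The constrained-extraction step can be pictured as a closed predicate vocabulary plus a validator that rejects anything outside it. This is an illustrative sketch only: the predicate names below are invented stand-ins, not the pipeline's actual 47, and `Fact`/`validate` are hypothetical names.

```python
# Illustrative sketch of constrained-predicate extraction. The predicate
# vocabulary below is hypothetical -- not the project's actual 47 predicates.
from dataclasses import dataclass

ALLOWED_PREDICATES = {
    "believes", "values", "avoids",
    "struggles_with", "prefers", "works_on",
}

@dataclass(frozen=True)
class Fact:
    predicate: str   # must come from the closed vocabulary
    value: str       # free-text object of the predicate
    source_id: str   # provenance: which conversation produced this fact

def validate(fact: Fact) -> Fact:
    """Reject any fact whose predicate falls outside the vocabulary,
    so open-ended model output can't smuggle in invented relations."""
    if fact.predicate not in ALLOWED_PREDICATES:
        raise ValueError(f"unknown predicate: {fact.predicate!r}")
    return fact

ok = validate(Fact("avoids", "premature abstraction", "conv-0042"))
```

The `source_id` field is the part that matters later: carrying provenance on every fact is what makes claims traceable back to source conversations.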
What we learned
- Structured extraction with constrained predicates beats open-ended summarization
- Knowledge tiering matters — only ~30% of facts are identity-relevant
- Local models work for extraction but fail at narrative generation
- Anonymization is essential — models pattern-match to pre-training knowledge about named people
The Three-Layer Architecture
Designed a three-layer identity model: ANCHORS (epistemic axioms), CORE (communication guide), PREDICTIONS (behavioral triggers). Added Collective review — four AI personas evaluating each layer. Built the full 14-step pipeline.
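The three layers can be sketched as a simple container type. Only the layer names (ANCHORS, CORE, PREDICTIONS) come from the pipeline; the field shapes and the example contents are invented for illustration.

```python
# Hypothetical shape of the three-layer identity model. The example
# contents are invented; only the three layer names come from the pipeline.
from dataclasses import dataclass, field

@dataclass
class IdentityBrief:
    # ANCHORS: epistemic axioms the subject reasons from
    anchors: list[str] = field(default_factory=list)
    # CORE: a prose guide to how the subject communicates
    core: str = ""
    # PREDICTIONS: (trigger, expected behavior) pairs
    predictions: list[tuple[str, str]] = field(default_factory=list)

brief = IdentityBrief(
    anchors=["Distrusts credentials; trusts demonstrated work"],
    core="Terse, example-first, allergic to hedging.",
    predictions=[("asked to speculate without data",
                  "deflects to what is measurable")],
)
```

Keeping PREDICTIONS as explicit trigger/behavior pairs is what makes the brief testable: a downstream check can present the trigger and score whether the behavior appears.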
What we learned
- Facts-only derivation prevents hallucination — models can't invent what isn't in the data
- Blind regeneration (never showing prior output) reduces anchoring bias by 26%
- A Collective review of four AI personas seemed promising — later proved ceremonial by ablation
- The pipeline grew to 14 steps — but we didn't yet know which ones mattered
N=10 Proof
Ran the pipeline on 10 diverse subjects: a founder's conversations, a philosopher's newsletters, personal journals, Benjamin Franklin's autobiography, Frederick Douglass, Mary Wollstonecraft, Theodore Roosevelt, patent filings, Warren Buffett's shareholder letters, Howard Marks' investment memos. All scored 73-82/100.
What we learned
- The same pipeline works on conversations, journals, autobiographies, letters, memos, and patents
- Document identity IS identity — implicit worldview can be extracted from any text
- Provenance traceability (every claim traces to source facts) became a differentiator
- Single-domain corpora need different prompting than multi-domain conversations
Honest Evaluation
Built evaluation frameworks. BCB benchmark (4 metrics, 2 passed, 2 failed). Provenance-traced mechanical evaluation ($0, no LLM judge). Twin-2K external benchmark — 100 participants, 71.83% accuracy at 18:1 compression (p=0.008). The brief matched full persona dumps at a fraction of the tokens.
What we learned
- Compressed brief matches 130K-char persona dump at 1/18th the tokens (Twin-2K, p=0.008)
- BCB failures were interpretable — faithful compression increases adversarial vulnerability
- LLM-as-judge evaluation is circular. Mechanical metrics with vector similarity are auditable and free.
- Effect sizes shrink on stronger models — Sonnet's baseline is so good the brief barely helps
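The "mechanical metrics with vector similarity" point boils down to something this small. A minimal sketch, assuming embedding vectors come from whatever encoder you already run; the toy vectors stand in for real embedding output.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity -- deterministic, auditable, and free to
    run, unlike asking an LLM judge to score the same pair of texts."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embedding output from any encoder.
sim = cosine_similarity([0.9, 0.1, 0.0], [0.8, 0.2, 0.1])
```

The point is auditability: anyone can re-run the arithmetic and get the same number, which is what makes the $0, no-LLM-judge evaluation credible.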
Research Findings
Ran ablation studies on every dimension: which facts matter, which formats work, which pipeline steps are load-bearing. 14 conditions on Franklin (~$16). Cross-validated on Sonnet and Qwen. The 14-step pipeline collapsed to 4 steps (Import, Extract, Author, Compose) — 10 steps were ceremonial. Discovered compression saturation, temporal stability, and that behavioral facts outpredict biographical ones.
What we learned
- Compression saturates at ~20% of facts — throwing away 80% doesn't hurt
- Behavioral patterns are temporally stable — early facts predict late behavior and vice versa
- What you avoid and struggle with is more predictive than what you believe
- Annotated guide format beats production brief by +24% on downstream tasks at 1/3 the length
- The full pipeline is worth ~4 points over a single Opus prompt with raw facts
- Scoring, classification, tiering, contradictions, consolidation, anchors extraction, and Collective review are all ceremonial — but the three-layer architecture IS load-bearing
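The saturation finding reduces to a one-function compression step: rank facts, keep the top fraction, discard the rest. A sketch under the assumption that each fact already carries a relevance score from an upstream ranking step; the function name and data shape are illustrative.

```python
def compress(scored_facts: list[tuple[str, float]],
             keep_fraction: float = 0.2) -> list[str]:
    """Keep only the top-scoring fraction of facts. Per the saturation
    finding, dropping the other ~80% costs essentially nothing downstream.
    The scores are assumed to come from an upstream ranking step."""
    ranked = sorted(scored_facts, key=lambda sf: sf[1], reverse=True)
    k = max(1, round(len(ranked) * keep_fraction))
    return [fact for fact, _ in ranked[:k]]

kept = compress([
    ("avoids meetings before noon", 0.9),
    ("owns a blue car", 0.1),
    ("rewrites drafts at least twice", 0.7),
    ("visited Paris once", 0.2),
    ("asks for data before opining", 0.8),
])
```

With five scored facts and the default `keep_fraction=0.2`, only the single highest-scoring fact survives — which is exactly the bet the saturation result says you can afford to make.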
V4 Validation
Locked the V4 compose prompt (false-positive guards + tension-action pairs woven into prose). Re-authored and re-composed all 7 public subjects with the simplified pipeline. Total cost: $2.18 for all subjects. Built automated data generation for the website. Full code review: a 7-phase audit found 13 bugs (including a quality gate false positive on every compose run), the privacy scrub is complete, and there are 0 security blockers for public release.
What we learned
- V4 briefs are comparable across pipeline runs — same core patterns, different emphasis. Stability, not fragility.
- Full re-author + re-compose costs $0.25–0.37 per subject — cheap enough to iterate freely
- Code review found a quality gate bug that silently reported false COHERENCE gaps on every compose
- Privacy scrub complete — 0 PII in scripts, 0 hardcoded secrets, subject data directories gitignored
- Paul Graham (28 essays, 272 facts extracted) is ready for authoring — next case study
Where We Are Now
Open-sourcing the research. 4-step pipeline, 47 predicates, 76+ architectural decisions, 400+ tests, 10 subjects processed. The honest assessment: this is a research project with strong preliminary results, not a finished product. We're publishing everything — what works, what doesn't, what we don't know yet.
What we learned
- The contribution is the research findings, not the engineering
- Every agentic workflow is hollow if it doesn't understand who the human is behind the screen
- Compression is the hard problem. Memory is solved. Identity is not.
- We're not looking for validation — we're looking for feedback
Open Questions
What we don’t know yet — and what we’re testing next.
Everything is open source. The pipeline, the findings, the failures. We’re not looking for validation — we’re looking for feedback.