The Journey
From “why does every AI conversation start from zero” to a research project about behavioral compression. Here’s what we built, what we learned, and what we got wrong.
The Question
Started with a simple frustration: every AI conversation begins from zero. Exported 1,892 ChatGPT conversations and asked — can we extract who someone IS from how they talk?
What we learned
- Raw conversation data is messy but rich — behavioral patterns hide in thousands of exchanges
- Existing memory tools store facts. Nobody was modeling behavior.
- The gap isn't retrieval. It's compression — turning signal into understanding.
Building the Pipeline
Designed a multi-step extraction pipeline: parse conversations into structured facts, classify them by type and commitment depth, separate identity-tier patterns from noise. 47 constrained predicates. Tested local models (Qwen) and API models (Haiku, Sonnet).
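The constrained-extraction step can be pictured as a closed predicate vocabulary plus a validator that rejects anything outside it. This is an illustrative sketch only: the predicate names below are invented stand-ins, not the pipeline's actual 47, and `Fact`/`validate` are hypothetical names.

```python
# Illustrative sketch of constrained-predicate extraction. The predicate
# vocabulary below is hypothetical -- not the project's actual 47 predicates.
from dataclasses import dataclass

ALLOWED_PREDICATES = {
    "believes", "values", "avoids",
    "struggles_with", "prefers", "works_on",
}

@dataclass(frozen=True)
class Fact:
    predicate: str   # must come from the closed vocabulary
    value: str       # free-text object of the predicate
    source_id: str   # provenance: which conversation produced this fact

def validate(fact: Fact) -> Fact:
    """Reject any fact whose predicate falls outside the vocabulary,
    so open-ended model output can't smuggle in invented relations."""
    if fact.predicate not in ALLOWED_PREDICATES:
        raise ValueError(f"unknown predicate: {fact.predicate!r}")
    return fact

ok = validate(Fact("avoids", "premature abstraction", "conv-0042"))
```

The `source_id` field is the part that matters later: carrying provenance on every fact is what makes claims traceable back to source conversations.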
What we learned
- Structured extraction with constrained predicates beats open-ended summarization
- Knowledge tiering matters — only ~30% of facts are identity-relevant
- Local models work for extraction but fail at narrative generation
- Anonymization is essential — models pattern-match to pre-training knowledge about named people
The Three-Layer Architecture
Designed a three-layer identity model: ANCHORS (epistemic axioms), CORE (communication guide), PREDICTIONS (behavioral triggers). Added Collective review — four AI personas evaluating each layer. Built the full 14-step pipeline.
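The three layers can be sketched as a simple container type. Only the layer names (ANCHORS, CORE, PREDICTIONS) come from the pipeline; the field shapes and the example contents are invented for illustration.

```python
# Hypothetical shape of the three-layer identity model. The example
# contents are invented; only the three layer names come from the pipeline.
from dataclasses import dataclass, field

@dataclass
class IdentityBrief:
    # ANCHORS: epistemic axioms the subject reasons from
    anchors: list[str] = field(default_factory=list)
    # CORE: a prose guide to how the subject communicates
    core: str = ""
    # PREDICTIONS: (trigger, expected behavior) pairs
    predictions: list[tuple[str, str]] = field(default_factory=list)

brief = IdentityBrief(
    anchors=["Distrusts credentials; trusts demonstrated work"],
    core="Terse, example-first, allergic to hedging.",
    predictions=[("asked to speculate without data",
                  "deflects to what is measurable")],
)
```

Keeping PREDICTIONS as explicit trigger/behavior pairs is what makes the brief testable: a downstream check can present the trigger and score whether the behavior appears.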
What we learned
- Facts-only derivation prevents hallucination — models can't invent what isn't in the data
- Blind regeneration (never showing prior output) reduces anchoring bias by 26%
- A Collective review of four AI personas seemed promising — later proved ceremonial by ablation
- The pipeline grew to 14 steps — but we didn't yet know which ones mattered
N=10 Proof
Ran the pipeline on 10 diverse subjects: a founder's conversations, a philosopher's newsletters, personal journals, Benjamin Franklin's autobiography, Frederick Douglass, Mary Wollstonecraft, Theodore Roosevelt, patent filings, Warren Buffett's shareholder letters, Howard Marks' investment memos. All scored 73-82/100.
What we learned
- The same pipeline works on conversations, journals, autobiographies, letters, memos, and patents
- Document identity IS identity — implicit worldview can be extracted from any text
- Provenance traceability (every claim traces to source facts) became a differentiator
- Single-domain corpora need different prompting than multi-domain conversations
Honest Evaluation
Built evaluation frameworks. BCB benchmark (4 metrics, 2 passed, 2 failed). Provenance-traced mechanical evaluation ($0, no LLM judge). Twin-2K external benchmark — 100 participants, 71.83% accuracy at 18:1 compression (p=0.008). The brief matched full persona dumps at a fraction of the tokens.
What we learned
- Compressed brief matches 130K-char persona dump at 1/18th the tokens (Twin-2K, p=0.008)
- BCB failures were interpretable — faithful compression increases adversarial vulnerability
- LLM-as-judge evaluation is circular. Mechanical metrics with vector similarity are auditable and free.
- Effect sizes shrink on stronger models — Sonnet's baseline is so good the brief barely helps
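The "mechanical metrics with vector similarity" point boils down to something this small. A minimal sketch, assuming embedding vectors come from whatever encoder you already run; the toy vectors stand in for real embedding output.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity -- deterministic, auditable, and free to
    run, unlike asking an LLM judge to score the same pair of texts."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embedding output from any encoder.
sim = cosine_similarity([0.9, 0.1, 0.0], [0.8, 0.2, 0.1])
```

The point is auditability: anyone can re-run the arithmetic and get the same number, which is what makes the $0, no-LLM-judge evaluation credible.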
Research Findings
Ran ablation studies on every dimension: which facts matter, which formats work, which pipeline steps are load-bearing. 14 conditions on Franklin (~$16). Cross-validated on Sonnet and Qwen. The 14-step pipeline collapsed to 4 steps (Import, Extract, Author, Compose) — 10 steps were ceremonial. Discovered compression saturation, temporal stability, and that behavioral facts outpredict biographical ones.
What we learned
- Compression saturates at ~20% of facts — throwing away 80% doesn't hurt
- Behavioral patterns are temporally stable — early facts predict late behavior and vice versa
- What you avoid and struggle with is more predictive than what you believe
- Annotated guide format beats production brief by +24% on downstream tasks at 1/3 the length
- The full pipeline is worth ~4 points over a single Opus prompt with raw facts
- Scoring, classification, tiering, contradictions, consolidation, anchors extraction, and Collective review are all ceremonial — but the three-layer architecture IS load-bearing
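The saturation finding reduces to a one-function compression step: rank facts, keep the top fraction, discard the rest. A sketch under the assumption that each fact already carries a relevance score from an upstream ranking step; the function name and data shape are illustrative.

```python
def compress(scored_facts: list[tuple[str, float]],
             keep_fraction: float = 0.2) -> list[str]:
    """Keep only the top-scoring fraction of facts. Per the saturation
    finding, dropping the other ~80% costs essentially nothing downstream.
    The scores are assumed to come from an upstream ranking step."""
    ranked = sorted(scored_facts, key=lambda sf: sf[1], reverse=True)
    k = max(1, round(len(ranked) * keep_fraction))
    return [fact for fact, _ in ranked[:k]]

kept = compress([
    ("avoids meetings before noon", 0.9),
    ("owns a blue car", 0.1),
    ("rewrites drafts at least twice", 0.7),
    ("visited Paris once", 0.2),
    ("asks for data before opining", 0.8),
])
```

With five scored facts and the default `keep_fraction=0.2`, only the single highest-scoring fact survives — which is exactly the bet the saturation result says you can afford to make.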
V4 Validation
Locked the V4 compose prompt (false-positive guards + tension-action pairs woven into prose). Re-authored and re-composed all 7 public subjects with the simplified pipeline. Total cost: $2.18 for all subjects. Built automated data generation for the website. Full code review: a 7-phase audit found 13 bugs (including a quality gate false positive on every compose run), the privacy scrub is complete, and there are 0 security blockers for public release.
What we learned
- V4 briefs are comparable across pipeline runs — same core patterns, different emphasis. Stability, not fragility.
- Full re-author + re-compose costs $0.25–0.37 per subject — cheap enough to iterate freely
- Code review found a quality gate bug that silently reported false COHERENCE gaps on every compose
- Privacy scrub complete — 0 PII in scripts, 0 hardcoded secrets, subject data directories gitignored
- Paul Graham (28 essays, 272 facts extracted) is ready for authoring — next case study
Where We Are Now
Open-sourcing the research. 4-step pipeline, 47 predicates, 76+ architectural decisions, 400+ tests, 10 subjects processed. The honest assessment: this is a research project with strong preliminary results, not a finished product. We're publishing everything — what works, what doesn't, what we don't know yet.
What we learned
- The contribution is the research findings, not the engineering
- Every agentic workflow is hollow if it doesn't understand who the human is behind the screen
- Compression is the hard problem. Memory is solved. Identity is not.
- We're not looking for validation — we're looking for feedback
Open Questions
What we don’t know yet — and what we’re testing next.
Everything is open source. The pipeline, the findings, the failures. We’re not looking for validation — we’re looking for feedback.