
The Journey

From “why does every AI conversation start from zero” to a research project about behavioral compression. Here’s what we built, what we learned, and what we got wrong.

01 · Sessions 1–15

The Question

Started with a simple frustration: every AI conversation begins from zero. Exported 1,892 ChatGPT conversations and asked — can we extract who someone IS from how they talk?

What we learned

  • Raw conversation data is messy but rich — behavioral patterns hide in thousands of exchanges
  • Existing memory tools store facts. Nobody was modeling behavior.
  • The gap isn't retrieval. It's compression — turning signal into understanding.

02 · Sessions 16–40

Building the Pipeline

Designed a multi-step extraction pipeline: parse conversations into structured facts, classify them by type and commitment depth, separate identity-tier patterns from noise. 47 constrained predicates. Tested local models (Qwen) and API models (Haiku, Sonnet).

What we learned

  • Structured extraction with constrained predicates beats open-ended summarization
  • Knowledge tiering matters — only ~30% of facts are identity-relevant
  • Local models work for extraction but fail at narrative generation
  • Anonymization is essential — models pattern-match to pre-training knowledge about named people
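The extraction step above can be pictured as structured facts drawn from a closed predicate vocabulary. A minimal sketch — the predicate names, the `Fact` shape, and the tier labels are illustrative assumptions, not the project's actual schema:

```python
from dataclasses import dataclass

# A small, hypothetical slice of the constrained vocabulary.
# The real pipeline uses 47 predicates; these five are illustrative.
PREDICATES = {"values", "avoids", "struggles_with", "prefers", "believes"}

@dataclass(frozen=True)
class Fact:
    subject: str    # anonymized subject ID, never a real name
    predicate: str  # must come from the closed vocabulary
    obj: str        # free-text object extracted from the source text
    tier: str       # "identity" | "context" | "noise"

    def __post_init__(self):
        # Constrained extraction: reject anything outside the vocabulary.
        if self.predicate not in PREDICATES:
            raise ValueError(f"unknown predicate: {self.predicate}")

def identity_facts(facts):
    """Keep only identity-tier facts (~30% of the total, per the findings)."""
    return [f for f in facts if f.tier == "identity"]
```

The closed vocabulary is what makes extraction auditable: a model can only emit predicates the pipeline already knows how to score and tier.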

03 · Sessions 40–55

The Three-Layer Architecture

Designed a three-layer identity model: ANCHORS (epistemic axioms), CORE (communication guide), PREDICTIONS (behavioral triggers). Added Collective review — four AI personas evaluating each layer. Built the full 14-step pipeline.

What we learned

  • Facts-only derivation prevents hallucination — models can't invent what isn't in the data
  • Blind regeneration (never showing prior output) reduces anchoring bias by 26%
  • A Collective review of four AI personas seemed promising — later proved ceremonial by ablation
  • The pipeline grew to 14 steps — but we didn't yet know which ones mattered
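The three layers can be sketched as a simple container that composes into a prompt-ready brief. A sketch only — the field names and `render` format are assumptions, not the actual brief layout:

```python
from dataclasses import dataclass, field

@dataclass
class IdentityBrief:
    # ANCHORS: epistemic axioms the subject reasons from
    anchors: list[str] = field(default_factory=list)
    # CORE: how to communicate with (or as) the subject
    core: list[str] = field(default_factory=list)
    # PREDICTIONS: trigger -> expected behavior
    predictions: dict[str, str] = field(default_factory=dict)

    def render(self) -> str:
        """Compose the three layers into one brief."""
        lines = ["## Anchors", *self.anchors,
                 "## Core", *self.core,
                 "## Predictions"]
        lines += [f"- when {t}: {b}" for t, b in self.predictions.items()]
        return "\n".join(lines)
```

Per the later ablations, this three-layer split is the part of the architecture that turned out to be load-bearing.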

04 · Sessions 55–70

N=10 Proof

Ran the pipeline on 10 diverse subjects: a founder's conversations, a philosopher's newsletters, personal journals, Benjamin Franklin's autobiography, Frederick Douglass, Mary Wollstonecraft, Theodore Roosevelt, patent filings, Warren Buffett's shareholder letters, Howard Marks' investment memos. All scored 73–82/100.

What we learned

  • The same pipeline works on conversations, journals, autobiographies, letters, memos, and patents
  • Document identity IS identity — implicit worldview can be extracted from any text
  • Provenance traceability (every claim traces to source facts) became a differentiator
  • Single-domain corpora need different prompting than multi-domain conversations

05 · Sessions 70–77

Honest Evaluation

Built evaluation frameworks. BCB benchmark (4 metrics, 2 passed, 2 failed). Provenance-traced mechanical evaluation ($0, no LLM judge). Twin-2K external benchmark — 100 participants, 71.83% accuracy at 18:1 compression (p=0.008). The brief matched full persona dumps at a fraction of the tokens.

What we learned

  • Compressed brief matches 130K-char persona dump at 1/18th the tokens (Twin-2K, p=0.008)
  • BCB failures were interpretable — faithful compression increases adversarial vulnerability
  • LLM-as-judge evaluation is circular. Mechanical metrics with vector similarity are auditable and free.
  • Effect sizes shrink on stronger models — Sonnet's baseline is so good the brief barely helps
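A mechanical, judge-free metric of the kind described can be as simple as cosine similarity over embedding vectors — deterministic, auditable, and free to rerun. A minimal sketch in plain Python; the project's actual metric suite and embedding model are not specified here:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors.

    No LLM judge in the loop: the same inputs always produce the
    same score, so every number in an eval run can be audited.
    """
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```

This is also why the evaluation costs $0: embeddings are computed once, and every comparison afterward is arithmetic.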

06 · Sessions 77–80

Research Findings

Ran ablation studies on every dimension: which facts matter, which formats work, which pipeline steps are load-bearing. 14 conditions on Franklin (~$16). Cross-validated on Sonnet and Qwen. The 14-step pipeline collapsed to 4 steps (Import, Extract, Author, Compose) — 10 steps were ceremonial. Discovered compression saturation, temporal stability, and that behavioral facts outpredict biographical ones.

What we learned

  • Compression saturates at ~20% of facts — throwing away 80% doesn't hurt
  • Behavioral patterns are temporally stable — early facts predict late behavior and vice versa
  • What you avoid and struggle with is more predictive than what you believe
  • Annotated guide format beats production brief by +24% on downstream tasks at 1/3 the length
  • The full pipeline is worth ~4 points over a single Opus prompt with raw facts
  • Scoring, classification, tiering, contradictions, consolidation, anchors extraction, and Collective review are all ceremonial — but the three-layer architecture IS load-bearing
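The saturation finding implies a deliberately simple compression rule: rank facts by an identity-relevance score and keep only the top ~20%. A sketch under that assumption — the scoring function itself is the hard part and is not shown here:

```python
def compress(scored_facts, keep_ratio=0.2):
    """Keep the top `keep_ratio` of (fact, score) pairs by score.

    Per the ablations, compression saturates around 20% of facts:
    dropping the remaining ~80% barely hurts downstream prediction.
    """
    ranked = sorted(scored_facts, key=lambda sf: sf[1], reverse=True)
    k = max(1, round(len(ranked) * keep_ratio))
    return [fact for fact, _ in ranked[:k]]
```

Usage: feed in every extracted fact with its score, and the output is the small subset that actually carries the identity signal.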

07 · Session 81

V4 Validation

Locked the V4 compose prompt (false positive guards + tension-action pairs woven into prose). Re-authored and re-composed all 7 public subjects with the simplified pipeline. Total cost: $2.18 for all subjects. Built automated data generation for the website. Full code review: 7-phase audit found 13 bugs (including a quality gate false-positive on every compose run), completed privacy scrub, 0 security blockers for public release.

What we learned

  • V4 briefs are comparable across pipeline runs — same core patterns, different emphasis. Stability, not fragility.
  • Full re-author + re-compose costs $0.25–0.37 per subject — cheap enough to iterate freely
  • Code review found a quality gate bug that silently reported false COHERENCE gaps on every compose
  • Privacy scrub complete — 0 PII in scripts, 0 hardcoded secrets, subject data directories gitignored
  • Paul Graham (28 essays, 272 facts extracted) is ready for authoring — next case study

08 · Today

Where We Are Now

Open-sourcing the research. 4-step pipeline, 47 predicates, 76+ architectural decisions, 414 passing tests, 10 subjects processed. The honest assessment: this is a research project with strong preliminary results, not a finished product. We're publishing everything — what works, what doesn't, what we don't know yet.

What we learned

  • The contribution is the research findings, not the engineering
  • Every agentic workflow is hollow if it doesn't understand who the human is behind the screen
  • Compression is the hard problem. Memory is solved. Identity is not.
  • We're not looking for validation — we're looking for feedback

By the Numbers

  • 81+ sessions of development
  • 14→4 pipeline steps after ablation
  • 10 subjects tested, from diverse sources
  • 47 predicates for structured extraction
  • 18:1 compression vs the full persona dump
  • ~20% of facts needed before compression saturates
  • $2.18 full run cost across all 7 public subjects
  • 414 tests passing across pipeline + eval

Open Questions

What we don’t know yet — and what we’re testing next.

Everything is open source. The pipeline, the findings, the failures. We’re not looking for validation — we’re looking for feedback.

Get in Touch

Have questions or want to collaborate?

aarik@base-layer.ai