
Research

Seven studies across 81+ sessions. Every claim has numbers behind it, every number has a methodology, and every methodology is documented. Here’s what we tested and what we found.

01 · 20% of facts is enough

Compression saturates early. Throwing away 80% of extracted facts doesn’t hurt brief quality — and often improves it. Adding more content makes things worse.

Franklin: peak at 20%. Marks: ~50%. Coverage remediation: +0.15 (negligible).

02 · Avoidance predicts best

Behavioral facts — especially avoidance and struggle patterns — are the strongest predictors. Biographical facts (48% of total) are mediocre. Epistemic beliefs are middle-of-pack.

Avoidance predicates: 26.2 composite. Epistemic: 19.4. Cross-model confirmed.

03 · Patterns are temporally stable

Facts from early in someone’s life predict late behavior as well as late facts predict early behavior. Identity is more stable than expected.

Franklin Q4→Q1: 24.9 vs Q1→Q4: 21.9. Direction effect < 2%.

04 · Format matters more than content

The same information in annotated guide format dramatically outperforms narrative prose — at one-third the length. The production brief at 9,144 chars scored worst.

Annotated guide: 0.766 (+24%). Production baseline: 0.618. Optimal: ~1,000–2,500 chars.

05 · 4 steps beat 14

We ablated every pipeline step. Scoring, classification, tiering, contradiction detection, and Collective review are all ceremonial. The three-layer architecture IS load-bearing.

4-step: 87/100. Full 14-step: 83/100. Raw facts → single prompt: 80/100.

06 · Fidelity creates vulnerability

The more faithfully the brief captures someone, the more exploitable it becomes. The best downstream format has the worst adversarial resistance. This is correct behavior, not a bug.

Annotated guide: 60% adversarial resistance. Directive format: 100%.

Behavioral Drift

Format > model size

Note: this study used the V4 brief. V5 is the current version, produced by the Prompt Ablation study.
March 12, 2026

The experiment

We injected a single behavioral fact into a coding agent’s identity brief and measured whether the resulting behavioral change was targeted (changed the right dimension) or diffuse (changed everything equally). Tested across 4 models, 3 identity formats, 5 mechanical coding tasks. Total cost: ~$0.30.

Specificity Ratio by model and format

SR > 1.5 = targeted drift. SR ≈ 1.0 = diffuse. SR < 0.8 = missed.

| Model | Params | Cost | Brief (prose) | Axioms | Atomic (flat) |
|---|---|---|---|---|---|
| Phi-4 Mini | 3.8B | $0 | 1.25 | — | — |
| Qwen 2.5 | 7B | $0 | 2.62 | 2.55 | 1.00 |
| DeepSeek-R1 | 14B | $0 | 0.73 | 2.49 | 0.54 |
| Claude Sonnet | ~70B | ~$0.30 | 0.87 | 2.11 | 1.14 |
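The specificity ratio behind these numbers can be sketched in a few lines. This is our illustrative reading, not the study's published code: we assume SR is the absolute change on the targeted behavioral dimension divided by the mean absolute change across the other dimensions, so SR > 1.5 means the injected fact mostly moved the dimension it was aimed at.

```python
# Hypothetical sketch of a specificity ratio (SR) over per-dimension
# behavioral deltas measured before/after injecting one identity fact.
# Assumption: SR = |delta on target| / mean |delta| on all other dimensions.

def specificity_ratio(deltas: dict[str, float], target: str) -> float:
    """deltas maps behavior dimension -> observed change after injection."""
    others = [abs(v) for k, v in deltas.items() if k != target]
    if not others or sum(others) == 0:
        return float("inf")  # nothing else moved at all
    return abs(deltas[target]) / (sum(others) / len(others))

# Example: an "avoids over-engineering" fact should mostly move the
# "simplicity" dimension. Here the other dimensions barely shift.
deltas = {"simplicity": 0.50, "verbosity": 0.10, "test_coverage": 0.30}
print(specificity_ratio(deltas, "simplicity"))  # 2.5 -> targeted drift
```

With these toy numbers the target moved 0.50 while the others averaged 0.20, giving SR = 2.5 — comfortably in the "targeted" band.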

What we found

The core insight

An agent that understands WHY you avoid over-engineering routes new engineering lessons to the right place. An agent that just knows you “prefer simple code” can’t. The format of identity representation — not the model size — determines whether an AI can learn precisely from new information about you.

4 models tested · 3 identity formats · $0.30 total API cost · SR > 2.0 axiom targeting

Prompt Ablation

31 conditions → V5 brief

Note: this study produced the V5 brief, the current format (citation-stripped, cleaner prose).
March 11, 2026

What we did

31 prompt variations across 7 rounds, tested on 3 subjects (Franklin, Buffett, Aarik). We systematically ablated the composition prompt — the final step that determines what the consuming LLM actually sees — to find what makes a behavioral brief effective.

+99% V4 → V5 score gain · 56% size reduction · 31 conditions tested · +217% epistemic calibration gain

Novel contribution

Epistemic calibration — explicitly marking what the system cannot predict — is the study’s novel contribution. No comparable personalization system tells you where its behavioral model breaks down. An LLM that knows what it doesn’t know is more useful than one that’s confidently wrong everywhere.

* These are preliminary results from N=3 subjects, a single research group, and model-judged scoring (except the blind A/B). We present them as honest first results, not definitive conclusions. The rubric was redesigned mid-study — direct cross-rubric comparison is invalid. Replication on larger, diverse populations is needed.

Pipeline Ablation

Which steps matter?

Note: this study used the V4 brief. V5 is the current version, produced by the Prompt Ablation study.
March 8, 2026

What we did

We originally built a 14-step pipeline to go from raw text to identity brief. Before shipping, we tested every step: is it load-bearing or ceremony? We ran 14 conditions on Benjamin Franklin’s autobiography (~$16 total) and measured brief quality for each.

14 conditions tested · ~$16 total cost · 87 best score

Each condition produces a brief from the same source text. The score (0–100) measures how well the brief captures the subject’s behavioral patterns — rated by an independent model that compares the brief against the full source material. Higher = more of the subject’s real patterns are captured accurately.

| Condition | Description | Score | Note |
|---|---|---|---|
| C0 | Full 14-step pipeline | 83 | Baseline |
| C1 | Skip scoring | 83 | |
| C2 | Skip classification | 82 | |
| C3 | Skip tiering | 83 | |
| C4 | Skip contradictions | 82 | |
| C5 | Skip consolidation | 81 | |
| C6 | Skip anchors extraction | 83 | |
| C7 | Skip embedding | 83 | |
| C8 | Skip ANCHORS layer | 80 | |
| C9 | Skip CORE layer | 77 | |
| C10 | Skip PREDICTIONS layer | 79 | |
| C11 | Author + Compose (no review) | 87 | Best |
| C12 | Direct fact injection | 77 | Worst |
| C13 | Single layer (no 3-layer) | 83 | |

What it means

4 steps beat 14. The simplified pipeline (Import → Extract → Author → Compose) scores 87 vs the full 14-step pipeline at 83.
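The surviving four steps can be sketched as a simple linear pipeline. Every function name and heuristic below is an illustrative assumption, not the project's actual API; in the real system, Extract, Author, and Compose would each be LLM calls.

```python
# Illustrative sketch of the simplified 4-step pipeline:
# Import -> Extract -> Author -> Compose. (Import is just reading the
# source text.) Names and layer-splitting heuristics are hypothetical.

def extract_facts(text: str) -> list[str]:
    """Pull candidate behavioral facts out of raw text (an LLM call in practice)."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def author_layers(facts: list[str]) -> dict[str, list[str]]:
    """Group facts into the three layers the brief is built from."""
    return {
        "ANCHORS": facts[:3],       # identity-defining patterns
        "CORE": facts[3:10],        # recurring behaviors
        "PREDICTIONS": facts[10:],  # expected responses to new situations
    }

def compose_brief(layers: dict[str, list[str]]) -> str:
    """Render the layers as an annotated guide (the winning format)."""
    return "\n\n".join(
        f"[{name}]\n" + "\n".join(f"- {fact}" for fact in facts)
        for name, facts in layers.items()
    )

source = "Avoids over-engineering.\nPrefers written argument.\nDrops projects when bored."
print(compose_brief(author_layers(extract_facts(source))))
```

The ablation's point is that everything between Extract and Author in the 14-step version (scoring, classification, tiering, contradiction detection) can be dropped without losing brief quality.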

Compression & Format

How much data is enough?

Note: this study used the V4 brief. V5 is the current version, produced by the Prompt Ablation study.
March 8, 2026

What we did

We tested how much source data the pipeline actually needs, and whether the format of the output matters as much as the content. Cross-validated on two models (Sonnet API and Qwen local GPU). The consistent finding: less is more, and format matters more than content.

What it means

The pipeline’s value is in compression, not accumulation. The best brief is short, behavioral (not biographical), and formatted as an annotated guide rather than narrative prose.

Twin-2K-500

External validation

Note: this study used the V4 brief. V5 is the current version, produced by the Prompt Ablation study.
March 7, 2026

What we did

Can a compressed brief predict how someone will actually respond to survey questions? We used the Twin-2K dataset from Columbia/Virginia Tech — 100 real participants, each with detailed persona descriptions (~130K characters). We compressed each into a brief and tested whether models could predict their responses.

N=100 participants · 18:1 compression ratio · 71.83% brief accuracy · p=0.008 significance

GPT-4.1-mini

C2 (Base Layer brief): 71.83%
C1 (Full persona dump): 71.72%
C0 (No persona): 68.43%

C2 vs C1: p=0.008

Claude Sonnet

C2 (Base Layer brief): 75.07%
C1 (Full persona dump): 74.38%
C0 (No persona): 73.21%

C2 vs C1: +0.69% (borderline)

What it means

A compressed brief matches a full persona dump at 18:1 compression. On GPT-4.1-mini, the brief actually outperforms the full dump (p=0.008). Compression doesn’t lose signal — it concentrates it.
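As a hedged sketch of how a paired significance claim like p=0.008 can be checked mechanically: an exact two-sided sign test over the items where the two conditions disagree. The counts below are invented for illustration; they are not the study's per-item data, and we do not know which test the study actually used.

```python
# Hypothetical sketch: exact two-sided sign test for comparing two
# predictors on the same items. b = items only the brief (C2) got right,
# d = items only the full dump (C1) got right; ties carry no information.
from math import comb

def sign_test_p(b: int, d: int) -> float:
    """Two-sided exact binomial test on discordant pairs under H0: p = 0.5."""
    n = b + d
    if n == 0:
        return 1.0
    k = min(b, d)
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * lower_tail)

# A lopsided split of discordant items yields a small p-value.
print(sign_test_p(70, 35))
```

The intuition: if the brief and the full dump were equally good, the items they disagree on should split roughly 50/50; a 70/35 split is very unlikely under that null.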

BCB-0.1

Measuring brief quality

Note: this study used the V4 brief. V5 is the current version, produced by the Prompt Ablation study.
March 7, 2026

What we did

Five metrics measuring compression quality. Tested on Franklin’s autobiography. Two passed, two failed, one invalid. The failures are as informative as the passes.

| Metric | Name | Result | Status |
|---|---|---|---|
| CR | Claim Recoverability | 99.98% | PASS |
| SRS | Signal Retention Score | +0.350 | PASS |
| DRS | Drift Resistance Score | 0.567 | FAIL |
| CMCS | Cross-Model Consistency | 0.570 | FAIL |
| VRI | Variance Reduction Index | N/A | N/A |

What it means

Faithful briefs expose real contradictions in someone’s worldview — more useful AND more vulnerable to adversarial attack. DRS will always penalize fidelity. This is a feature, not a bug.

Provenance Evaluation

Mechanical, not opinion

Note: this study used the V4 brief. V5 is the current version, produced by the Prompt Ablation study.
March 7, 2026

What we did

Using an LLM to judge another LLM’s output is circular. We built an evaluation framework with four mechanical layers — no LLM judges, zero cost, and every result is human-auditable. The question: can we verify brief quality without relying on model opinions?

$0 evaluation cost · 4 mechanical layers · 2 subjects tested · 8/10 prompts where brief wins (BA)

Phase 1 Results — Howard Marks (74 investment memos)

Layer 1: Brief Activation (BA)

C1 mean similarity to brief: 0.4030
C5c mean similarity to brief: 0.4192
Delta: +0.016
Prompts where brief wins: 8/10

Layer 2: Provenance Coverage (PC)

C1 coverage (threshold 0.50): 20.4%
C5c coverage (threshold 0.50): 23.4%
Delta: +3.0%
Prompts where brief wins: 7/10

Consistent direction across all 7 similarity thresholds tested (0.40–0.70)
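A minimal sketch of what a mechanical metric in this spirit could look like. "Provenance Coverage" here is our assumed reading: the fraction of outputs whose best cosine similarity against the brief's claims clears a threshold. The toy 2-D vectors stand in for real embeddings, and the function names are ours, not the framework's.

```python
# Hypothetical sketch of a Provenance Coverage-style metric: the share of
# output vectors whose best cosine match among the brief's claim vectors
# clears a similarity threshold. Toy 2-D vectors stand in for embeddings.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def provenance_coverage(outputs, claims, threshold=0.50):
    """Fraction of outputs whose best match among claims >= threshold."""
    hits = sum(1 for o in outputs if max(cosine(o, c) for c in claims) >= threshold)
    return hits / len(outputs)

outputs = [(1.0, 0.1), (0.0, 1.0), (0.7, 0.7)]  # model responses
claims = [(1.0, 0.0), (0.6, 0.8)]               # brief claims
print(provenance_coverage(outputs, claims, threshold=0.90))  # 2 of 3 covered
```

Everything here is auditable by hand: each covered/uncovered decision reduces to a dot product and a threshold, with no model in the loop.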

What it means

Core principle: if a human can’t audit the claim, it’s not evidence. Every metric in this framework is verifiable without running a model.

Compose Variations

V4 prompt engineering

Note: this study used the V4 brief. V5 is the current version, produced by the Prompt Ablation study.
March 3, 2026

What we did

The compose step takes the same extracted facts and authored layers, and synthesizes them into a final brief. The composition prompt controls the output format. We tested six variations to find which format produces the most useful brief for downstream AI interactions.

What it means

Format changes alone improved downstream task performance by +24% (annotated guide vs narrative prose). The same information, restructured, is dramatically more useful to models.

Design Decisions

80 decisions, all public

The full decision log

Every architectural choice is documented with reasoning, alternatives considered, and status. 80 decisions across 81+ sessions. Here are the highlights — grouped by theme. The full log is published in the repository at docs/core/DECISIONS.md.

80 decisions logged · 81+ sessions · 47 constrained predicates · 414 tests passing

Themes: Architecture · Extraction & Quality · Evaluation Philosophy · What Didn't Work

Why publish this

Most projects publish their code. We also publish why the code looks the way it does — every wrong turn, every superseded idea, every decision that survived. The prompts are in the code. The reasoning is in the log. Nothing is hidden.