# Base Layer — Behavioral Compression Research

# Prompt Ablation Study: Optimizing Brief Composition

**Date:** March 2026
**Authors:** Aarik Gulaya
**Cost:** ~$18.40 total

---

## Abstract

We ran a prompt ablation study (31 conditions plus one final combined variant) to determine the best method for composing behavioral briefs: the final document injected into LLM context that shapes how the model interacts with a specific person. We tested architectural choices, prompt content, structural decisions, and evaluation rubric design across three subjects, and found that (1) single-pass composition outperforms multi-stage architectures, (2) false positive warnings are load-bearing but prone to fabrication, (3) telling the model what it is scored on is the strongest meta-strategy, and (4) explicitly marking what the system cannot predict is the study's novel contribution to AI personalization. The winning condition (C31, which became the V5 production brief) scored 83.7/90 on the redesigned rubric, a 99% improvement over the prior production brief (V4, 42.0/90) at 56% smaller size. A subsequent blind A/B evaluation by the system's creator confirmed V5 as the preferred brief, with one caveat: concrete facts lost during compression need to be retrieved dynamically at serving time.

## Background

Base Layer's pipeline extracts behavioral facts from text, organizes them into three authored layers (Anchors, Core, Predictions), then composes a unified brief. The brief is the only artifact the consuming LLM sees. Prior to this study, composition used a "V4" prompt refined through informal iteration but never systematically ablated. Session 78 research had already established that V4 was 3-9x longer than optimal and that format alone produces a 24% swing in downstream performance.

## Methodology

31 conditions (C0-C30) across 7 rounds, plus C31 as the final combined variant, tested on 3 subjects:

- **Benjamin Franklin** — historical figure, autobiography-derived
- **Warren Buffett** — public figure, shareholder letters and interviews
- **Aarik** — system creator, 80+ sessions of conversation history

All briefs were composed by Claude Opus and scored by Collective review (Opus with a 4-persona rubric assessment). Two rubric versions were used: the study redesigned its own evaluation criteria midway, when the original rubric was found to be misweighted.

## Results by Round

### Round 1 (C0-C7): Architecture

**Question:** Does a Planner-Executor (multi-stage) architecture outperform a single Opus pass?

**Finding:** No. Single-pass beats all multi-stage variants. Multi-stage added cost ($0.20+ vs $0.11), latency (~60s vs ~30s), and complexity without improving quality. Composition quality is bounded by prompt content, not planning depth.

### Round 2 (C8-C11): Prompt Content

**Finding 1:** Organizing the brief around "when NOT to apply this pattern" (false-positive-first) outperformed organizing around the patterns themselves. This aligns with the finding from the Compression Study that avoidance predicates are the most predictive behavioral facts.

**Finding 2:** When given complete freedom, the model independently chose the annotated guide format — the same format the Compression Study identified as optimal (+24%). Two independent experiments converging on the same answer.

### Round 3 (C12-C13): False Positive Warnings

FP warnings are load-bearing (+4.6 points when included). But a critical failure mode emerged: the model fabricated plausible-sounding FP warnings for patterns that had no FP conditions in the source layers. Confident-sounding constraints with no grounding in evidence are the most dangerous failure mode in behavioral compression.

### Round 4 (C14): Fixing Fabrication

A single instruction — "only include FP warnings where the source material explicitly provides them" — eliminated fabrication. The faithfulness problem was instructional, not architectural.

### Round 5 (C15-C23): Systematic Gap Closure

- **Completeness vs efficiency is a fundamental tension.** Exhaustive coverage drove briefs to 10,000+ characters — well above optimal.
- **Example phrasings are fabricated content.** Improved actionability but introduced faithfulness risk.
- **Rubric awareness is the strongest meta-strategy.** Including evaluation criteria in the prompt produced the best results. This made rubric design the highest-leverage activity.

### The Rubric Was Wrong

The original rubric over-weighted structural properties (traceability at 35%) and under-weighted purpose (actionability at 12%). A brief that perfectly traces every claim but doesn't change model behavior is useless. The rubric was redesigned from first principles.

**Old Rubric (/85):** Traceability 3x, Faithfulness 2x, Token Efficiency 1x, Completeness 1x, Actionability 1x, FP Grounding +5

**New Rubric (/90):** Provenance 3x (30), Behavioral Change 3x (30), Epistemic Calibration 2x (20), Signal Density 1x (10)

Key shift: Actionability was upgraded from 1x to 3x and renamed "Behavioral Change." A new dimension was added, Epistemic Calibration: does the brief mark what is uncertain or unpredictable? Each primitive is grounded in published research (XAI, Information Bottleneck theory, calibration literature, information theory).
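The weighting arithmetic behind both rubrics can be sketched as follows. This is a minimal illustration, assuming each dimension is scored 0-10 before its multiplier is applied (consistent with the /85 and /90 totals, but not stated explicitly in this write-up). The dictionary and function names are hypothetical.

```python
# Multipliers copied from the two rubric definitions above.
# ASSUMPTION: each dimension is scored 0-10 before weighting.
OLD_RUBRIC = {
    "traceability": 3,
    "faithfulness": 2,
    "token_efficiency": 1,
    "completeness": 1,
    "actionability": 1,
}
OLD_FP_BONUS_MAX = 5  # "FP Grounding +5"

NEW_RUBRIC = {
    "provenance": 3,
    "behavioral_change": 3,
    "epistemic_calibration": 2,
    "signal_density": 1,
}

MAX_RAW = 10  # assumed per-dimension ceiling


def max_total(weights, bonus=0):
    """Highest achievable score under a rubric."""
    return sum(w * MAX_RAW for w in weights.values()) + bonus


def weighted_score(weights, raw_scores):
    """Apply the multipliers to raw 0-10 scores."""
    return sum(weights[dim] * raw_scores[dim] for dim in weights)


def share(weights, dimension, bonus=0):
    """Fraction of the maximum total controlled by one dimension."""
    return weights[dimension] * MAX_RAW / max_total(weights, bonus)


print(max_total(OLD_RUBRIC, OLD_FP_BONUS_MAX), max_total(NEW_RUBRIC))
print(f"{share(OLD_RUBRIC, 'traceability', OLD_FP_BONUS_MAX):.0%}")
print(f"{share(OLD_RUBRIC, 'actionability', OLD_FP_BONUS_MAX):.0%}")
```

Under that assumption the old rubric gives traceability 30 of 85 points and actionability 10 of 85, which matches the 35% and 12% figures quoted above.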

### Round 6 (C24-C27): New Rubric

- C26 (rubric awareness) won at 73.7/90.
- C27 (no structural prescription) produced different formats per subject — tension-centered for Franklin, system-coherence for Buffett, imperative for Aarik. Format became signal, not template.
- Top reviewer suggestion across all conditions: include explicit epistemic gaps — places where the behavioral model breaks down.

### Round 7 (C28-C30): The Winner

| Condition | Description | Franklin | Buffett | Aarik | Average |
|---|---|---|---|---|---|
| **C28** | Rubric awareness + cannot predict + temporal markers | 89 | 86 | 86 | **87.0** |
| C29 | Rubric awareness + relational + agency | 73 | 68 | 87 | 76.0 |
| C30 | Full research synthesis (everything) | 86 | 82 | 88 | 85.3 |

C28 won at 87.0/90. More instructions create competing optimization targets — focused additions (2 features) outperform comprehensive ones (all features).
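For readers who want to check the Round 7 averages, a short sketch that recomputes them from the per-subject scores in the table. The variable names are illustrative only.

```python
from statistics import mean

# Per-subject scores (Franklin, Buffett, Aarik) copied from the Round 7 table.
ROUND_7 = {
    "C28": (89, 86, 86),
    "C29": (73, 68, 87),
    "C30": (86, 82, 88),
}

# Average each condition across the three subjects, to one decimal place.
averages = {cond: round(mean(scores), 1) for cond, scores in ROUND_7.items()}
winner = max(averages, key=averages.get)

print(averages, winner)
```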

### Rubric Calibration Bug

During scoring, faithful paraphrases of FP warnings were being scored as "fabricated." The fix was provenance-based evaluation: if the citation chain is valid (even when the wording is paraphrased), score 8-10; if there is no source, score 0-3. After correction, C28 (84.3) and C31 (83.7) were effectively tied, 0.6 points apart. The original gap was a measurement artifact.
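The corrected scoring rule can be expressed as a small guard on the reviewer's score. This is a sketch under stated assumptions: the function names and boolean inputs are hypothetical, and the write-up specifies only the two bands shown, so every other case is left undefined rather than guessed.

```python
from typing import Optional, Tuple

Band = Tuple[int, int]


def provenance_band(has_source: bool, chain_valid: bool) -> Optional[Band]:
    """Allowed score range for one FP warning under provenance-based evaluation.

    Only two cases are specified in the study; anything else returns None.
    """
    if not has_source:
        return (0, 3)   # no source at all: treat as fabricated
    if chain_valid:
        return (8, 10)  # valid citation chain, paraphrase allowed
    return None         # sourced but chain not valid: not specified


def clamp_to_band(score: int, band: Band) -> int:
    """Pull a reviewer score back inside the allowed band."""
    low, high = band
    return max(low, min(high, score))


# A faithful paraphrase that a reviewer scored 2 is lifted into the 8-10 band.
print(clamp_to_band(2, provenance_band(has_source=True, chain_valid=True)))
```

The point of the design is that wording similarity never enters the decision: only the existence and validity of the citation chain does.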

### C31: The Production Brief

C31 = C28 (rubric awareness + temporal awareness + cannot predict) + C27 (format freedom). Chosen unanimously by Collective review across all 3 subjects.

## V4 vs V5 Final Comparison

| Property | V4 (prior production) | V5 (C31) |
|---|---|---|
| Architecture | Detailed structural instructions | Rubric-aware, format-free, temporal-aware |
| Key features | FP guards + tension-action pairs | Rubric-as-prompt + CANNOT PREDICT + citations |
| Avg score (/90) | 42.0 | 83.7 |
| Avg size (chars) | 9,258 | 4,038 |
| Signal per char (avg score / avg chars) | 0.0045 | 0.0207 |

| Dimension | V4 | V5 | % Gain |
|---|---|---|---|
| Provenance (/30) | 16.7 | 28.3 | +69% |
| Behavioral Change (/30) | 15.3 | 27.3 | +78% |
| Epistemic Calibration (/20) | 6.0 | 19.0 | **+217%** |
| Signal Density (/10) | 4.0 | 9.0 | +125% |
| **Total (/90)** | **42.0** | **83.7** | **+99%** |

Epistemic Calibration showed the largest gain because V4 had no mechanism to express uncertainty. V5's CANNOT PREDICT section directly addresses this gap; to our knowledge, no comparable personalization system includes one.
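The derived figures in both tables (percentage gains, size reduction, signal per character) follow from the reported averages. The sketch below recomputes them; the names are illustrative. Note that the four V5 dimension averages sum to 83.6 rather than the reported 83.7, presumably because each was rounded independently, so the reported totals are used for the total row.

```python
# Per-dimension averages copied from the comparison tables above.
V4 = {"provenance": 16.7, "behavioral_change": 15.3,
      "epistemic_calibration": 6.0, "signal_density": 4.0}
V5 = {"provenance": 28.3, "behavioral_change": 27.3,
      "epistemic_calibration": 19.0, "signal_density": 9.0}
V4_TOTAL, V5_TOTAL = 42.0, 83.7   # reported totals (/90)
V4_CHARS, V5_CHARS = 9258, 4038   # reported average sizes in characters


def pct_gain(old: float, new: float) -> int:
    """Relative gain of `new` over `old`, rounded to a whole percent."""
    return round((new - old) / old * 100)


gains = {dim: pct_gain(V4[dim], V5[dim]) for dim in V4}
total_gain = pct_gain(V4_TOTAL, V5_TOTAL)
size_reduction = round((1 - V5_CHARS / V4_CHARS) * 100)
signal_per_char = {
    "V4": round(V4_TOTAL / V4_CHARS, 4),
    "V5": round(V5_TOTAL / V5_CHARS, 4),
}

print(gains, total_gain, size_reduction, signal_per_char)
```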

## V5 Innovations

**Citation Stripping:** V5 generates inline citations ([A1], [P3]) during compose, then strips them for the clean served version. Two files per subject: cited (audit) + clean (serve). Provenance for humans, clean prose for models.
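A minimal sketch of the stripping step, assuming citations always take the form of one uppercase letter followed by digits inside square brackets, as in the `[A1]` and `[P3]` examples. The real citation grammar is not specified here, and the function name is hypothetical.

```python
import re

# ASSUMPTION: a citation is "[" + one uppercase letter + digits + "]".
# Adjacent citations and the whitespace before them are removed together.
CITATION = re.compile(r"(?:\s*\[[A-Z]\d+\])+")


def strip_citations(cited_brief: str) -> str:
    """Return the clean (served) text for a cited (audit) brief."""
    return CITATION.sub("", cited_brief)


cited = "Audits before committing [A1]. Avoids hype [P3][C2]."
print(strip_citations(cited))
```

Because the pattern requires a letter followed by digits, other bracketed text such as `[0]` or `[Note]` is left untouched. The cited file stays the source of truth; the clean file is always regenerated from it rather than edited by hand.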

**Format Freedom:** No structural prescription. The model adapts format to each subject's behavioral signature: mode detection for Franklin, decision triggers for Buffett, trigger-response for Aarik.

**Rubric-as-Prompt:** Including evaluation criteria in the prompt makes the model optimize for them directly. Quality becomes self-enforcing.
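One way to picture rubric-as-prompt is a compose prompt that lists the scoring dimensions before the source layers. The sketch below is illustrative only: the actual V5 prompt wording is not reproduced in this write-up, so every prompt string here is a hypothetical stand-in, except the FP grounding rule, which is quoted from Round 4. Whether that rule appears verbatim in the production prompt is an assumption.

```python
# Dimensions and multipliers from the new rubric (/90).
RUBRIC = [
    ("Provenance", 3),
    ("Behavioral Change", 3),
    ("Epistemic Calibration", 2),
    ("Signal Density", 1),
]

# Quoted from Round 4; its presence in the production prompt is assumed.
FP_GROUNDING_RULE = (
    "Only include FP warnings where the source material explicitly provides them."
)


def build_compose_prompt(layers: str) -> str:
    """Assemble a single-pass compose prompt that exposes the rubric."""
    lines = [
        "Compose a behavioral brief from the authored layers below.",  # hypothetical
        "",
        "You will be scored on:",  # hypothetical wording
    ]
    for name, weight in RUBRIC:
        # ASSUMPTION: each dimension is scored 0-10 before weighting.
        lines.append(f"- {name} (weight {weight}x, max {weight * 10})")
    lines += [
        "",
        FP_GROUNDING_RULE,
        "Mark explicitly what cannot be predicted about this person.",  # hypothetical
        "",
        "AUTHORED LAYERS:",
        layers,
    ]
    return "\n".join(lines)


print(build_compose_prompt("ANCHORS: ...\nCORE: ...\nPREDICTIONS: ..."))
```

No format instructions appear in the prompt, which mirrors the format-freedom choice: the model is told what counts, not how to lay it out.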

## Blind A/B Evaluation

After the rubric-based study, we ran a 10-question blind test on the system's creator — the only subject with ground-truth validation. Each question received two paragraph-length responses: one shaped by V4 (detail-rich) and one by V5 (compressed).

**Result: V5 wins 5, V4 wins 2, ties 3.**

V5 won on reasoning structure, brevity, and asking the right questions. V4 won when concrete behavioral details (specific trading parameters, personal history) WERE the response. The pattern: V5 provides better behavioral steering, V4 provides better domain specificity.

Key qualitative findings:
- **Opening lines matter disproportionately.** "Your first instinct is to audit" (behavioral prediction, chosen) vs "This triggers defensibility anxiety" (emotional label, rejected).
- **Axiom labels hurt when cited as rules, work when woven into predictions.** "Your foundational-focus axiom says" breaks immersion. "Your coherence axiom won't let you dismiss this" works because it's embedded in a behavioral prediction.
- **Don't over-predict from single events.** One meeting override doesn't mean emotional disengagement — the brief must respect tier awareness.
- **The subject's final word:** "At my core, I'd rather have honesty than false confidence." The brief that knows what it can't predict is more trustworthy than the one that covers everything confidently.

## Diagnosis and Next Step

V5 is the correct production brief. The only loss is concrete facts compressed away during the compose step: the authored layers preserve them, but compose abstracts them into generic labels. The fix is not a new brief version but a change to the serving architecture: the V5 brief (behavioral steering) plus dynamic fact retrieval when domain context activates. Brief for shape, facts for substance.
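The proposed serving change could look roughly like the sketch below. It is not an implementation of the planned system: the retrieval mechanism is not specified in this study, so simple keyword triggers stand in for whatever "domain context activates" ends up meaning, and the `Fact` type, `assemble_context` function, and example facts are all hypothetical placeholders.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Fact:
    triggers: Tuple[str, ...]  # hypothetical: keywords that activate this fact
    text: str                  # concrete detail preserved in the authored layers


def assemble_context(brief: str, facts: List[Fact], user_message: str,
                     max_facts: int = 3) -> str:
    """Always serve the brief; append facts only when their domain is active."""
    words = set(re.findall(r"[a-z0-9]+", user_message.lower()))
    active = [f.text for f in facts if words.intersection(f.triggers)][:max_facts]
    if not active:
        return brief
    return brief + "\n\nRELEVANT FACTS:\n" + "\n".join(f"- {t}" for t in active)


# Placeholder data for illustration; not taken from any real subject.
facts = [
    Fact(("trading", "position"), "placeholder trading-parameter fact"),
    Fact(("career", "history"), "placeholder personal-history fact"),
]
print(assemble_context("BRIEF TEXT", facts, "How should I size this trading position?"))
```

Keeping the brief fixed and the facts conditional preserves V5's small size in the common case while restoring the domain specificity that V4 won on in the blind test.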

## Key Takeaways

1. **Single-pass beats multi-stage** for composition. Planning depth doesn't improve quality.
2. **False positive warnings are the highest-leverage single feature** (+4.6 avg, +6% downstream) but must be grounded in source material.
3. **Rubric awareness is the meta-strategy.** Tell the model what it's scored on and it optimizes accordingly. Rubric design is the highest-leverage activity.
4. **The rubric must derive from first principles.** Four primitives: Provenance, Behavioral Change, Epistemic Calibration, Signal Density.
5. **Epistemic calibration is the novel contribution.** An LLM that knows where its model breaks down is more useful than one that's confidently wrong.
6. **Different people need different formats.** No structural prescription — let the model adapt.
7. **Focused additions outperform comprehensive ones.** Two new constraints beat four.
8. **Format is a 24% variable.** How a brief is structured matters as much as what it contains.

## Limitations

- Model-judged, not human-judged (except the blind A/B).
- N=3 subjects — historical, public, and personal profiles represented, but limited coverage.
- Two rubric versions used across the study — direct cross-rubric comparison is invalid.
- Single compose model (Opus). May not transfer to other models.
- Scores near ceiling (83.7/90) may reflect reviewer leniency at high quality levels.