# Base Layer: Behavioral Compression Research

# Prompt Ablation Study: Optimizing Brief Composition

**Date:** March 2026
**Authors:** Aarik Gulaya
**Cost:** ~$18.40 total

---

## Abstract

We conducted a 31-condition prompt ablation study to determine the best method for composing behavioral briefs: the final document injected into LLM context that shapes how the model interacts with a specific person. Testing architectural choices, prompt content, structural decisions, and evaluation rubric design across three subjects, we found four things. Single-pass composition outperforms multi-stage architectures. False positive warnings are load-bearing but prone to fabrication. Telling the model what it's scored on is the strongest meta-strategy. Explicitly marking what the system cannot predict is the study's novel contribution to AI personalization. The winning condition (C31, the V5 brief) scored 83.7/90, a 99% improvement over the prior production brief (V4) at 56% smaller size. A subsequent blind A/B evaluation by the system's creator confirmed V5 as the preferred brief, with one caveat: concrete facts lost during compression need to be retrieved dynamically at serving time.

## Background

Base Layer's pipeline extracts behavioral facts from text, organizes them into three authored layers (Anchors, Core, Predictions), then composes a unified brief. The brief is the only artifact the consuming LLM sees. Prior to this study, composition used a "V4" prompt refined through informal iteration but never systematically ablated. Session 78 research had already established that V4 was 3-9x longer than optimal and that format alone produces a 24% swing in downstream performance.

## Methodology

31 conditions (C0-C30), plus C31 as the final combined variant, across 7 rounds, tested on 3 subjects:

- **Benjamin Franklin:** historical figure, autobiography-derived
- **Warren Buffett:** public figure, shareholder letters and interviews
- **Aarik:** system creator, 80+ sessions of conversation history

All briefs were composed by Claude Opus and scored by Collective review (Opus with a 4-persona rubric assessment). Two rubric versions were used: the study redesigned its own evaluation criteria midway, after the original rubric was found to be misweighted.

## Results by Round

### Round 1 (C0-C7): Architecture

**Question:** Does a Planner-Executor (multi-stage) architecture outperform a single Opus pass?

**Finding:** No. Single-pass beats all multi-stage variants. Multi-stage added cost ($0.20+ vs $0.11), latency (~60s vs ~30s), and complexity without improving quality. Composition quality is bounded by prompt content, not planning depth.

### Round 2 (C8-C11): Prompt Content

**Finding 1:** Organizing the brief around "when NOT to apply this pattern" (false-positive-first) outperformed organizing around the patterns themselves. This aligns with the finding from the Compression Study that avoidance predicates are the most predictive behavioral facts.

**Finding 2:** When given complete freedom, the model independently chose the annotated guide format, the same format the Compression Study identified as optimal (+24%). Two independent experiments converged on the same answer.

### Round 3 (C12-C13): False Positive Warnings

FP warnings are load-bearing (+4.6 points when included). But a critical failure mode emerged: the model fabricated plausible-sounding FP warnings for patterns that had no FP conditions in the source layers, producing confident-sounding constraints with no grounding in evidence. This is the most dangerous failure mode we observed in behavioral compression.

### Round 4 (C14): Fixing Fabrication

A single instruction ("only include FP warnings where the source material explicitly provides them") eliminated fabrication. The faithfulness problem was instructional, not architectural.
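The study fixed fabrication with an instruction, not with code. Still, the failure mode described above (a warning in the brief for a pattern with no FP conditions in the source layers) can be audited mechanically after the fact. The following is a minimal sketch under assumptions: the pattern IDs, the dictionary shapes, and the function name are all hypothetical and not taken from the Base Layer pipeline.

```python
def ungrounded_warnings(brief_warnings, source_fp_conditions):
    """Return pattern IDs whose brief warning has no FP condition in the source.

    brief_warnings: {pattern_id: warning text found in the composed brief}
    source_fp_conditions: {pattern_id: list of FP conditions in the source layers}
    Both shapes are assumptions made for this illustration.
    """
    return [
        pattern_id
        for pattern_id in brief_warnings
        # A missing key or an empty list both count as "no grounding".
        if not source_fp_conditions.get(pattern_id)
    ]


# Placeholder data: P2 has a warning in the brief but nothing in the source.
brief = {"P1": "warning text one", "P2": "warning text two"}
source = {"P1": ["condition from source layer"], "P2": []}
flagged = ungrounded_warnings(brief, source)
```

Here `flagged` contains only `P2`, the warning that would count as fabricated under the definition above.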

### Round 5 (C15-C23): Systematic Gap Closure

- **Completeness vs efficiency is a fundamental tension.** Exhaustive coverage drove briefs to 10,000+ characters, well above optimal.
- **Example phrasings are fabricated content.** Improved actionability but introduced faithfulness risk.
- **Rubric awareness is the strongest meta-strategy.** Including evaluation criteria in the prompt produced the best results. This made rubric design the highest-leverage activity.

### The Rubric Was Wrong

The original rubric over-weighted structural properties (traceability at 35%) and under-weighted purpose (actionability at 12%). A brief that perfectly traces every claim but doesn't change model behavior is useless. The rubric was redesigned from first principles.

**Old Rubric (/85):** Traceability 3x, Faithfulness 2x, Token Efficiency 1x, Completeness 1x, Actionability 1x, FP Grounding +5

**New Rubric (/90):** Provenance 3x (30), Behavioral Change 3x (30), Epistemic Calibration 2x (20), Signal Density 1x (10)

Key shift: Actionability was upgraded from 1x to 3x as "Behavioral Change." A new dimension was added: Epistemic Calibration, which measures whether the brief marks what is uncertain or unpredictable. Each primitive is grounded in published research (XAI, Information Bottleneck theory, calibration literature, information theory).
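The weighting arithmetic behind both rubrics can be checked in a few lines. This sketch assumes each dimension is scored 0-10 before weighting, which is inferred from entries such as "3x (30)" and is not stated outright; the helper names are illustrative.

```python
PER_DIMENSION_MAX = 10  # assumption: each dimension is scored 0-10 before weighting

OLD_RUBRIC = {
    "Traceability": 3,
    "Faithfulness": 2,
    "Token Efficiency": 1,
    "Completeness": 1,
    "Actionability": 1,
}
OLD_BONUS = 5  # FP Grounding +5

NEW_RUBRIC = {
    "Provenance": 3,
    "Behavioral Change": 3,
    "Epistemic Calibration": 2,
    "Signal Density": 1,
}


def max_score(weights, bonus=0):
    """Highest achievable total for a rubric."""
    return sum(weights.values()) * PER_DIMENSION_MAX + bonus


def share(weights, name, bonus=0):
    """Fraction of the maximum total controlled by one dimension."""
    return weights[name] * PER_DIMENSION_MAX / max_score(weights, bonus)


def weighted_total(weights, raw_scores):
    """Combine raw 0-10 scores into a weighted total."""
    return sum(weights[d] * raw_scores[d] for d in weights)


old_max = max_score(OLD_RUBRIC, OLD_BONUS)
new_max = max_score(NEW_RUBRIC)
traceability_share = share(OLD_RUBRIC, "Traceability", OLD_BONUS)
actionability_share = share(OLD_RUBRIC, "Actionability", OLD_BONUS)
```

Under that assumption the maxima come out to 85 and 90, and the old rubric's shares round to the 35% (traceability) and 12% (actionability) figures quoted above.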

### Round 6 (C24-C27): New Rubric

- C26 (rubric awareness) won at 73.7/90.
- C27 (no structural prescription) produced different formats per subject: tension-centered for Franklin, system-coherence for Buffett, imperative for Aarik. Format became signal, not template.
- Top reviewer suggestion across all conditions: include explicit epistemic gaps, places where the behavioral model breaks down.

### Round 7 (C28-C30): The Winner

| Condition | Description | Franklin | Buffett | Aarik | Average |
|---|---|---|---|---|---|
| **C28** | Rubric awareness + cannot predict + temporal markers | 89 | 86 | 86 | **87.0** |
| C29 | Rubric awareness + relational + agency | 73 | 68 | 87 | 76.0 |
| C30 | Full research synthesis (everything) | 86 | 82 | 88 | 85.3 |

C28 won at 87.0/90, ahead of C30's full synthesis at 85.3. More instructions create competing optimization targets: focused additions (2 features) outperformed comprehensive ones (all features).
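As a quick check on the Round 7 table, the averages and the winner can be recomputed from the per-subject scores. The scores are copied from the table; the variable names are illustrative.

```python
# Per-subject scores in table order: (Franklin, Buffett, Aarik)
round7 = {
    "C28": (89, 86, 86),
    "C29": (73, 68, 87),
    "C30": (86, 82, 88),
}

# Unweighted mean across the three subjects, rounded to one decimal place.
averages = {
    condition: round(sum(scores) / len(scores), 1)
    for condition, scores in round7.items()
}

# Condition with the highest average.
winner = max(averages, key=averages.get)
```

The recomputed averages match the table (87.0, 76.0, 85.3), and `winner` is `C28`.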

### Rubric Calibration Bug

During scoring, faithful paraphrases of FP warnings were being scored as "fabricated." The fix was provenance-based evaluation: if the citation chain is valid (even when paraphrased), score 8-10; if there is no source, score 0-3. After correction, C28 (84.3) and C31 (83.7, the combined variant described in the next section) were effectively tied, 0.6 points apart. The original gap was a measurement artifact.
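The provenance rule can be written down as a small banding function. This is a sketch, not the reviewer's actual logic: the two bands come from the text, the function names are hypothetical, and the case of a cited but broken chain is deliberately left unspecified because the study does not state a band for it.

```python
from typing import Optional, Tuple


def provenance_band(has_source: bool, chain_valid: bool) -> Optional[Tuple[int, int]]:
    """Return the allowed (low, high) score band for one FP warning."""
    if has_source and chain_valid:
        return (8, 10)  # valid citation chain; paraphrase is acceptable
    if not has_source:
        return (0, 3)   # nothing in the source layers backs the warning
    return None         # source cited but chain broken: band not given in the study


def clamp_to_band(score: int, band: Optional[Tuple[int, int]]) -> int:
    """Pull a reviewer score into the band; leave it alone if no band applies."""
    if band is None:
        return score
    low, high = band
    return max(low, min(high, score))


# A faithful paraphrase that was mis-scored as fabricated gets lifted to the band floor.
corrected = clamp_to_band(2, provenance_band(has_source=True, chain_valid=True))
```

In this example a mis-scored paraphrase moves from 2 to 8, the direction of correction the calibration fix describes.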

### C31: The Production Brief

C31 = C28 (rubric awareness + temporal awareness + cannot predict) + C27 (format freedom). Chosen unanimously by Collective review across all 3 subjects.

## V4 vs V5 Final Comparison

| Property | V4 (prior production) | V5 (C31) |
|---|---|---|
| Architecture | Detailed structural instructions | Rubric-aware, format-free, temporal-aware |
| Key features | FP guards + tension-action pairs | Rubric-as-prompt + CANNOT PREDICT + citations |
| Avg score (/90) | 42.0 | 83.7 |
| Avg size (chars) | 9,258 | 4,038 |
| Signal per char | 0.0045 | 0.0207 |

| Dimension | V4 | V5 | % Gain |
|---|---|---|---|
| Provenance (/30) | 16.7 | 28.3 | +69% |
| Behavioral Change (/30) | 15.3 | 27.3 | +78% |
| Epistemic Calibration (/20) | 6.0 | 19.0 | **+217%** |
| Signal Density (/10) | 4.0 | 9.0 | +125% |
| **Total (/90)** | **42.0** | **83.7** | **+99%** |

Epistemic Calibration showed the largest gain because V4 had no mechanism to express uncertainty. V5's CANNOT PREDICT section directly addresses this gap. We are not aware of a comparable personalization system that includes one.
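The derived figures in the two tables above (percentage gains, signal per character, size reduction) follow from the raw scores and sizes. The sketch below recomputes them. All inputs are copied from the tables, and "signal per char" is taken to mean total score divided by average character count, which is an inference from the reported values.

```python
v4 = {"Provenance": 16.7, "Behavioral Change": 15.3,
      "Epistemic Calibration": 6.0, "Signal Density": 4.0}
v5 = {"Provenance": 28.3, "Behavioral Change": 27.3,
      "Epistemic Calibration": 19.0, "Signal Density": 9.0}
V4_TOTAL, V5_TOTAL = 42.0, 83.7
V4_CHARS, V5_CHARS = 9258, 4038


def pct_gain(old, new):
    """Relative improvement of new over old, as a whole-number percentage."""
    return round((new - old) / old * 100)


gains = {dim: pct_gain(v4[dim], v5[dim]) for dim in v4}
total_gain = pct_gain(V4_TOTAL, V5_TOTAL)

# Assumed definition: total score divided by average brief size in characters.
signal_v4 = round(V4_TOTAL / V4_CHARS, 4)
signal_v5 = round(V5_TOTAL / V5_CHARS, 4)

size_reduction = round((1 - V5_CHARS / V4_CHARS) * 100)
```

The recomputed values agree with the tables: gains of 69, 78, 217, and 125 percent per dimension, 99 percent overall, and a 56 percent size reduction. The V5 dimension rows sum to 83.6, not 83.7, which is consistent with rounding of per-subject averages.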

## V5 Innovations

**Citation Stripping:** V5 generates inline citations ([A1], [P3]) during compose, then strips them for the clean served version. Two files per subject: cited (audit) + clean (serve). Provenance for humans, clean prose for models.
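A minimal sketch of the stripping step follows. It assumes citation tags are one layer letter (A, C, or P, for Anchors, Core, Predictions) followed by digits, as in [A1] and [P3]. Only the A and P forms appear in this write-up, so the C form and the function names are assumptions, and the sample sentence is made up for illustration.

```python
import re

# Assumed tag shape: [A1], [C2], [P3]. Leading whitespace is removed with the tag
# so that "acting [A1]." becomes "acting." and not "acting ."
CITATION = re.compile(r"\s*\[[ACP]\d+\]")
CITATION_ID = re.compile(r"\[([ACP]\d+)\]")


def strip_citations(cited: str) -> str:
    """Produce the clean (serve) version from the cited (audit) version."""
    return CITATION.sub("", cited)


def list_citations(cited: str) -> list[str]:
    """Collect citation IDs in order of appearance, for audit purposes."""
    return CITATION_ID.findall(cited)


# Invented sample text, not taken from any real brief.
cited = "Audits before acting [A1]. Rejects unearned praise [P3][A2]."
clean = strip_citations(cited)
ids = list_citations(cited)
```

Keeping the cited text as the source of truth and deriving the clean text from it means the two files cannot drift apart. A tag at the very start of a line would leave a leading space behind, so a real implementation would likely also trim each line.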

**Format Freedom:** No structural prescription. The model adapts format to each subject's behavioral signature: mode detection for Franklin, decision triggers for Buffett, trigger-response for Aarik.

**Rubric-as-Prompt:** Including evaluation criteria in the prompt makes the model optimize for them directly. Quality becomes self-enforcing.
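To make the idea concrete, here is a sketch of a compose prompt that embeds the rubric. The dimension names and point values come from the new rubric; the prompt wording and the function name are hypothetical and are not the actual C31 prompt.

```python
RUBRIC = [
    ("Provenance", 30),
    ("Behavioral Change", 30),
    ("Epistemic Calibration", 20),
    ("Signal Density", 10),
]


def build_compose_prompt(layers: str) -> str:
    """Assemble a compose prompt that tells the model how it will be scored."""
    rubric_lines = "\n".join(
        f"- {name} (up to {points} points)" for name, points in RUBRIC
    )
    # Illustrative wording only; the study does not publish the prompt text.
    return (
        "Compose a behavioral brief from the source layers below.\n"
        "Your output will be scored on these criteria:\n"
        f"{rubric_lines}\n\n"
        f"SOURCE LAYERS:\n{layers}"
    )


prompt = build_compose_prompt("<authored layers go here>")
```

Because the rubric lives in one list, a change to the evaluation criteria propagates to the prompt automatically, which fits the finding that rubric design is the highest-leverage activity.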

## Blind A/B Evaluation

After the rubric-based study, we ran a 10-question blind test on the system's creator, the only subject with ground-truth validation. Each question received two paragraph-length responses: one shaped by V4 (detail-rich) and one by V5 (compressed).

**Result: V5 wins 5, V4 wins 2, ties 3.**

V5 won on reasoning structure, brevity, and asking the right questions. V4 won when concrete behavioral details (specific trading parameters, personal history) WERE the response. The pattern: V5 provides better behavioral steering, V4 provides better domain specificity.

Key qualitative findings:
- **Opening lines matter disproportionately.** "Your first instinct is to audit" (behavioral prediction, chosen) vs "This triggers defensibility anxiety" (emotional label, rejected).
- **Axiom labels hurt when cited as rules, work when woven into predictions.** "Your foundational-focus axiom says" breaks immersion. "Your coherence axiom won't let you dismiss this" works because it's embedded in a behavioral prediction.
- **Don't over-predict from single events.** One meeting override doesn't mean emotional disengagement. The brief must respect tier awareness.
- **The subject's final word:** "At my core, I'd rather have honesty than false confidence." The brief that knows what it can't predict is more trustworthy than the one that covers everything confidently.

## Diagnosis and Next Step

V5 is the correct production brief. The only loss is concrete facts compressed away during the compose step. The authored layers preserve them, but compose abstracts them into generic labels. The fix is not a new brief version. It's a serving architecture change: V5 brief (behavioral steering) + dynamic fact retrieval when domain context activates. Brief for shape, facts for substance.
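One way the proposed serving change could look is sketched below. Everything here is an assumption: the study does not specify how domain context is detected or how facts are stored, so this uses a plain keyword match over a hypothetical domain-to-facts mapping.

```python
def serve_context(brief: str, facts_by_domain: dict[str, list[str]], user_message: str) -> str:
    """Return the V5 brief, plus stored facts for any domain the message mentions.

    facts_by_domain maps a domain keyword to facts preserved in the authored
    layers. Keyword matching stands in for whatever activation signal a real
    system would use.
    """
    message = user_message.lower()
    active = [domain for domain in facts_by_domain if domain.lower() in message]
    if not active:
        return brief  # no domain activated: behavioral steering only
    fact_lines = [f"- {fact}" for domain in active for fact in facts_by_domain[domain]]
    return brief + "\n\nRelevant facts:\n" + "\n".join(fact_lines)


# Placeholder data for illustration.
brief = "<V5 brief text>"
facts = {"trading": ["<fact preserved in the authored layers>"]}
with_facts = serve_context(brief, facts, "Can you review my trading plan?")
without_facts = serve_context(brief, facts, "How should I phrase this email?")
```

The brief is always served unchanged, so behavioral steering stays constant while concrete facts are added only when the conversation calls for them.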

## Key Takeaways

1. **Single-pass beats multi-stage** for composition. Planning depth doesn't improve quality.
2. **False positive warnings are the highest-leverage single feature** (+4.6 avg, +6% downstream) but must be grounded in source material.
3. **Rubric awareness is the meta-strategy.** Tell the model what it's scored on and it optimizes accordingly. Rubric design is the highest-leverage activity.
4. **The rubric must derive from first principles.** Four primitives: Provenance, Behavioral Change, Epistemic Calibration, Signal Density.
5. **Epistemic calibration is the novel contribution.** An LLM that knows where its model breaks down is more useful than one that's confidently wrong.
6. **Different people need different formats.** No structural prescription. Let the model adapt.
7. **Focused additions outperform comprehensive ones.** Two new constraints beat four.
8. **Format is a 24% variable.** How a brief is structured matters as much as what it contains.

## Limitations

- Model-judged, not human-judged (except the blind A/B).
- N=3 subjects. Historical, public, and personal profiles represented, but limited coverage.
- Two rubric versions used across the study. Direct cross-rubric comparison is invalid.
- Single compose model (Opus). May not transfer to other models.
- Scores near ceiling (83.7/90) may reflect reviewer leniency at high quality levels.
