# Base Layer — Behavioral Compression Research

# Compose Variations Study

**Date:** March 2026
**Authors:** Aarik Gulaya

---

## Abstract

We tested six brief formats to determine how the final composition step affects downstream task performance. All variations received identical input (same extracted facts, same 3-layer authored content) and differed only in output structure. The annotated guide format with false positive guards and tension-action pairs (V4) produced the best results, outperforming narrative prose (V1) by 24% on downstream behavioral tasks. In this study, format was a first-order variable in behavioral compression: with content held constant, how a brief was structured changed downstream performance by up to 24%.

## Methodology

Six composition formats were tested, each producing a brief from identical source material:

- **V1 — Narrative Prose:** Continuous paragraphs describing the subject's behavioral patterns. Standard essay format.
- **V2 — Bullet-Point:** Flat list of behavioral claims. No hierarchy or grouping.
- **V3 — Annotated Guide (no guards):** Structured sections with behavioral claims, supporting evidence annotations, and contextual notes. No false positive mitigation.
- **V4 — Annotated Guide (with guards):** Same as V3, plus false positive guards ("do not over-apply this pattern when...") and tension-action pairs that describe how the subject navigates internal contradictions.
- **V5 — Maximally Compressed:** Shortest possible brief. Telegram-style. Every word load-bearing.
- **V6 — Structured JSON:** Machine-readable format with typed fields for each behavioral dimension.
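
To make "identical input, different output structure" concrete, here is a minimal sketch of how one set of claims could be rendered as V2, V3, or V4. The `Claim` fields, the function names, and the section labels (`Evidence:`, `Guard:`, `Tension:`) are all hypothetical illustrations, not the pipeline's actual schema, and the claim text is an invented placeholder.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    """One behavioral claim plus optional annotations (hypothetical schema)."""
    text: str
    evidence: str = ""   # supporting evidence annotation (V3 and V4)
    guard: str = ""      # false positive guard (V4 only)
    tension: str = ""    # tension-action pair (V4 only)


def render_bullets(claims):
    """V2-style: flat list of claims, no hierarchy or annotations."""
    return "\n".join(f"- {c.text}" for c in claims)


def render_annotated_guide(claims, with_guards=True):
    """V3-style when with_guards is False, V4-style when it is True."""
    sections = []
    for c in claims:
        lines = [f"### {c.text}"]
        if c.evidence:
            lines.append(f"Evidence: {c.evidence}")
        if with_guards and c.guard:
            lines.append(f"Guard: {c.guard}")
        if with_guards and c.tension:
            lines.append(f"Tension: {c.tension}")
        sections.append("\n".join(lines))
    return "\n\n".join(sections)


# Invented placeholder content, used only to exercise the renderers.
claims = [
    Claim(
        text="Prefers direct feedback",
        evidence="placeholder annotation",
        guard="Do not over-apply this pattern when the context is technical.",
        tension="Values X but prioritizes Y when Z.",
    )
]

v2 = render_bullets(claims)
v3 = render_annotated_guide(claims, with_guards=False)
v4 = render_annotated_guide(claims, with_guards=True)
```

Under this sketch, V4 is a strict superset of V3: the same sections with guard and tension lines appended, which mirrors how the two formats are described above.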

Downstream evaluation measured three capabilities: behavioral prediction accuracy (given a scenario, predict the subject's choice), in-character response generation (produce a response the subject would plausibly write), and out-of-character detection (identify responses inconsistent with the subject's patterns).
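
The study does not state how the three task scores were combined into a single downstream score. As one possible reading, the sketch below assumes an unweighted mean and reports it as a percentage change relative to a baseline format. The task names, function names, and example numbers are assumptions made for illustration; the numbers are not results from the study.

```python
from statistics import mean

# Hypothetical keys for the three evaluated capabilities.
TASKS = ("prediction", "generation", "detection")


def aggregate(task_scores):
    """Unweighted mean over the three tasks (an assumption, not the study's stated method)."""
    return mean(task_scores[t] for t in TASKS)


def relative_to_baseline(score, baseline):
    """Relative change versus the baseline score, as a percentage."""
    return (score - baseline) / baseline * 100.0


# Made-up numbers purely to exercise the functions.
baseline_scores = {"prediction": 0.50, "generation": 0.50, "detection": 0.50}
candidate_scores = {"prediction": 0.60, "generation": 0.55, "detection": 0.50}

delta = relative_to_baseline(aggregate(candidate_scores), aggregate(baseline_scores))
print(f"{delta:+.1f}% relative to baseline")
```

Keeping the per-task dictionaries around, rather than only the aggregate, would also make it possible to check whether a format's advantage holds on each task separately.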

## Results

| Format | Downstream Score (Relative to V1) |
|--------|----------------------------------|
| V1 — Narrative Prose | Baseline |
| V2 — Bullet-Point | +8% |
| V3 — Annotated Guide (no guards) | +18% |
| V4 — Annotated Guide (with guards) | **+24%** |
| V5 — Maximally Compressed | +6% |
| V6 — Structured JSON | +11% |

## Key Findings

1. **False positive guards prevent over-application.** V4 differs from V3 only in its guards and tension-action pairs, so the 6-point gap between V3 (+18%) and V4 (+24%) is attributable to those two additions together; the study did not separate their individual contributions. Without guards, models over-apply behavioral patterns and treat tendencies as absolutes. Guards like "do not assume this applies in technical contexts" constrain how the model applies the brief.

2. **Tension-action pairs are directive.** Rather than listing isolated traits, V4 describes how the subject navigates contradictions: "values X but prioritizes Y when Z." This gives the consuming model actionable decision logic, not just descriptive labels.

3. **Narrative prose is the worst-performing format.** V1 scores below every alternative tested. Continuous prose buries behavioral signals in connective tissue. Models extract structured information more reliably from structured formats.

4. **Maximum compression overshoots.** V5 (+6%) underperforms V2 (+8%). Below a threshold, brevity sacrifices the contextual cues models need to correctly scope behavioral claims. There is an optimal compression level, and "as short as possible" is past it.

5. **JSON is decent but not optimal.** V6 (+11%) outperforms prose, bullet points, and the maximally compressed format but falls short of both annotated guides. Machine-readable structure helps models parse claims, but the lack of natural language context and guards limits appropriate application. JSON also has poor human utility, and briefs should be readable by both humans and models.

6. **Format is a 24% variable.** Identical content, reformatted, produces a 24% swing in downstream performance. This exceeds the effect of most content manipulations tested in the Compression and Format Study. Pipeline developers who optimize extraction and ignore composition are leaving significant performance on the table.

## Limitations

- Downstream evaluation aggregates three task types. Format effects may vary across tasks — V6 might excel at prediction but fail at generation.
- Tested on two subjects. Format preferences could interact with subject complexity in ways this study did not capture.
- V4's guards were hand-crafted based on known failure modes. Automated guard generation has not been tested and may not achieve the same quality.
- The +24% improvement was measured against V1 (prose), the lowest-scoring format tested, which may be an artificially low baseline. Against a stronger baseline, V4's margin would be smaller.
- Human evaluation of brief readability and utility was informal. A structured human evaluation would strengthen the V4 recommendation.
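
The baseline caveat in the list above can be made concrete with simple arithmetic on the Results table. The sketch below assumes the reported percentages are multiplicative gains over V1's score (the study does not say whether they are relative or absolute points) and re-expresses V4's gain against other formats. The `rebase` helper is a hypothetical name introduced here.

```python
# Relative improvements over V1 as reported in the Results table.
reported = {"V1": 0.00, "V2": 0.08, "V3": 0.18, "V4": 0.24, "V5": 0.06, "V6": 0.11}


def rebase(reported_gain, target, baseline):
    """Gain of `target` over `baseline`, assuming gains are multiplicative versus V1."""
    return (1 + reported_gain[target]) / (1 + reported_gain[baseline]) - 1


for base in ("V2", "V6", "V3"):
    print(f"V4 vs {base}: {rebase(reported, 'V4', base):+.1%}")
# V4 vs V2: +14.8%
# V4 vs V6: +11.7%
# V4 vs V3: +5.1%
```

Under that assumption, V4's advantage shrinks from 24% over prose to roughly 5% over the guard-free annotated guide, which is the comparison that isolates the guards and tension-action pairs.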