# Base Layer: Behavioral Compression Research

# Compression and Format Study

**Date:** March 2026
**Author:** Aarik Gulaya
**Models:** Claude Sonnet (API), Qwen (local GPU)

---

## Abstract

We investigated two questions: how much source material is needed to produce an effective behavioral brief, and how that brief should be formatted for downstream use. Cross-validating on Sonnet and Qwen, we found that 20% of extracted facts are sufficient for a simple subject (about 50% for a more complex one), that behavioral facts (15% of the total) are the strongest predictors, and that a format change alone improves downstream task performance by +21% to +24%. The optimal brief length is 1,000–2,500 characters; our production brief, at 9,144 characters, was 3–9x the optimal range. The most counterintuitive finding: adding more data consistently hurts performance.

## Methodology

**Compression experiments:** We varied the percentage of extracted facts included in brief generation (10%, 20%, 30%, 50%, 75%, 100%) across two subjects: Franklin (autobiography, lower complexity) and Marks (74 investment memos, higher complexity). We also tested temporal ordering: Q1 facts only, Q1+Q2, Q1+Q2+Q3, and all four quarters, evaluated against Q4 held-out predictions.
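
The two subset schemes above can be sketched as follows. This is a minimal illustration, not the study's pipeline: the `Fact` record, the function names, and the use of seeded random sampling are all assumptions, since the write-up does not say how facts were selected at each percentage.

```python
import random
from dataclasses import dataclass

# Hypothetical fact record; the field names are assumptions for illustration.
@dataclass(frozen=True)
class Fact:
    text: str
    quarter: int  # 1-4: which quarter of the source material the fact came from

PERCENTAGES = [10, 20, 30, 50, 75, 100]  # inclusion levels listed above

def sample_facts(facts, percent, seed=0):
    """Return a random subset holding `percent`% of the facts (at least one)."""
    k = max(1, round(len(facts) * percent / 100))
    return random.Random(seed).sample(facts, k)

def cumulative_quarters(facts, through):
    """Return facts from Q1 up to and including quarter `through`."""
    return [f for f in facts if f.quarter <= through]

# Synthetic stand-in data: 40 facts spread evenly over four quarters.
facts = [Fact(f"fact {i}", quarter=i % 4 + 1) for i in range(40)]

subsets = {p: sample_facts(facts, p) for p in PERCENTAGES}
temporal = {q: cumulative_quarters(facts, q) for q in (1, 2, 3, 4)}
```

Fixing the seed keeps each percentage level reproducible across the two models, so any difference in downstream scores comes from the model rather than from a different draw of facts.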

**Format experiments:** We tested six brief formats (detailed in the Compose Variations study) and measured downstream task performance: the ability of a brief-conditioned model to predict behavioral choices, generate in-character responses, and identify out-of-character statements.

**Fact type analysis:** Extracted facts were categorized by type (behavioral, biographical, positional, relational) and evaluated independently for predictive power.
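
As a rough sketch of the partitioning step, assuming each extracted fact already carries one of the four type labels: the helper names and the placeholder facts below are hypothetical, and the scoring of each group is omitted because the source does not specify it.

```python
from collections import defaultdict

# The four fact types named in the study.
FACT_TYPES = ("behavioral", "biographical", "positional", "relational")

def group_by_type(labeled_facts):
    """Partition (fact_type, text) pairs into one list of texts per type."""
    groups = defaultdict(list)
    for fact_type, text in labeled_facts:
        if fact_type not in FACT_TYPES:
            raise ValueError(f"unknown fact type: {fact_type}")
        groups[fact_type].append(text)
    return dict(groups)

def type_shares(groups):
    """Return each type's fraction of the total fact count."""
    total = sum(len(texts) for texts in groups.values())
    return {t: len(texts) / total for t, texts in groups.items()}

# Placeholder facts, invented for illustration only (not from the study's data).
labeled_facts = [
    ("behavioral", "placeholder behavioral fact"),
    ("biographical", "placeholder biographical fact A"),
    ("biographical", "placeholder biographical fact B"),
    ("positional", "placeholder positional fact"),
]

groups = group_by_type(labeled_facts)
shares = type_shares(groups)
```

Each group would then be used on its own to generate a brief, so that downstream accuracy can be attributed to a single fact type.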

All experiments were cross-validated on both Sonnet (cloud API) and Qwen (local GPU) to control for model-specific effects.

## Results

| Finding | Franklin | Marks |
|---------|----------|-------|
| Sufficient facts | 20% | ~50% |
| Optimal brief length | 1,200 chars | 2,100 chars |
| Best fact type | Behavioral | Behavioral |
| Worst fact type | Positional | Positional |
| Annotated guide vs prose | +24% | +21% |

**Temporal results (Franklin):**

| Training Set | Q4 Prediction Accuracy |
|-------------|----------------------|
| Q1 only | Highest |
| Q1+Q2 | Lower |
| Q1+Q2+Q3 | Lower |
| All quarters | Lowest |

## Key Findings

1. **More data hurts.** Q1 facts alone outperform Q1+Q2+Q3 for predicting Q4 behavior. Additional data introduces noise, contradictions, and contextual details that dilute core behavioral patterns. This held across both models.

2. **Behavioral facts are the best predictors.** Behavioral facts make up only 15% of extracted facts but produce the highest downstream accuracy. What someone does (their patterns of action, avoidance, and preference) is more predictive than what they know, believe, or have experienced.

3. **Avoidance predicates are strongest.** Within behavioral facts, avoidance patterns ("actively avoids," "refuses to," "never") are the most predictive single category. What someone consistently avoids reveals more stable behavioral patterns than what they pursue.

4. **Positional facts are consistently worst.** Opinions, stances, and declared positions are the least predictive fact type. Positions change; behaviors persist.

5. **Format matters more than content.** Changing the format alone, from narrative prose to annotated guide, produced a +24% improvement for Franklin and +21% for Marks, exceeding the improvement from any content manipulation we tested. How information is structured for the consuming model is at least as important as what information is included.

6. **Production briefs are too long.** Our production brief at 9,144 characters was 3–9x the optimal range. Longer briefs do not produce better downstream performance. This is consistent with the "more data hurts" finding: verbosity is a form of noise.

7. **Complexity scales the data requirement.** Franklin (single autobiography, coherent worldview) needs 20% of facts. Marks (74 memos spanning decades, evolving positions) needs ~50%. More complex subjects require more facts to capture behavioral variance, though two subjects are too few to establish the shape of that relationship.
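
The length comparison in finding 6 is plain arithmetic on the figures reported above. The sketch below reproduces it; the constant and function names are illustrative, not part of any production tooling.

```python
# Optimal brief length range reported in this study, in characters.
OPTIMAL_MIN_CHARS = 1_000
OPTIMAL_MAX_CHARS = 2_500

def length_ratio(brief_chars):
    """Return (ratio to the upper bound, ratio to the lower bound)."""
    return brief_chars / OPTIMAL_MAX_CHARS, brief_chars / OPTIMAL_MIN_CHARS

def within_budget(brief_chars):
    """True if the brief length falls inside the optimal range."""
    return OPTIMAL_MIN_CHARS <= brief_chars <= OPTIMAL_MAX_CHARS

low, high = length_ratio(9_144)  # production brief length
print(f"{low:.1f}x to {high:.1f}x")  # prints "3.7x to 9.1x"
```

The 3–9x figure quoted in this document is these two ratios rounded down.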

## Limitations

- Two subjects. The complexity-scaling relationship (20% vs 50%) is a two-point observation, not a curve.
- Qwen and Sonnet may share training data biases that inflate cross-model agreement.
- "Downstream task performance" aggregates multiple task types. Some formats may excel at prediction but fail at generation, or vice versa.
- The temporal finding (Q1 beats Q1+Q2+Q3) could reflect Franklin's autobiography structure rather than a general principle. Autobiographies are typically written in retrospect, meaning early chapters may already encode mature patterns.
- Optimal length range (1,000–2,500 chars) was determined empirically on two subjects. The range may shift for subjects with fundamentally different behavioral complexity.
