# Base Layer — Behavioral Compression Research

# BCB-0.1: Behavioral Compression Benchmark

**Date:** March 2026
**Authors:** Aarik Gulaya
**Subject:** Benjamin Franklin (autobiography)

---

## Abstract

We introduce BCB-0.1 (Behavioral Compression Benchmark), a five-metric framework for evaluating whether a compressed behavioral brief faithfully represents its source material. Applied to Benjamin Franklin's autobiography, the benchmark produced two passes, two failures, and one invalid result. Claim Recoverability (99.98%) and Signal Retention (+0.350) passed decisively. Drift Resistance (0.567) and Cross-Model Consistency (0.570) failed against a 0.700 threshold. Variance Reduction was invalid because Franklin is present in pre-training data. We read the two failures as measurement issues rather than pipeline failures: faithful compression surfaces real tensions that adversaries can exploit, and different models interpret the same brief through different lenses.

## Methodology

BCB-0.1 evaluates compressed briefs across five dimensions:

1. **CR (Claim Recoverability):** Can each claim in the brief be traced back to a source fact? Measured by vector similarity between brief claims and extracted facts.

2. **SRS (Signal Retention Score):** Does the brief retain behavioral signal from the source? Measured by comparing brief-conditioned predictions against baseline predictions on held-out behavioral scenarios.

3. **DRS (Drift Resistance Score):** Can an adversary use the brief to push a brief-conditioned model into out-of-character responses? Measured by red-team attacks on brief-conditioned models.

4. **CMCS (Cross-Model Consistency Score):** Do different models produce consistent behavioral predictions when given the same brief? Measured by inter-model agreement on prediction tasks.

5. **VRI (Variance Reduction Index):** Does the brief reduce variance in model outputs compared to a no-brief baseline? Requires the subject to be absent from the model's pre-training data.
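
The five definitions above can be sketched in code. This is a minimal illustration, not the benchmark's actual implementation: the report does not give exact formulas, so every function name, the `min_sim` cutoff, and the specific aggregation choices (best-match cosine similarity, mean difference, pairwise agreement, variance ratio) are assumptions made for the example.

```python
import math
from itertools import combinations
from statistics import mean, pvariance


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def claim_recoverability(claim_vecs, fact_vecs, min_sim=0.8):
    """CR: fraction of brief claims whose best-matching source fact
    reaches min_sim. The 0.8 cutoff is a hypothetical value."""
    if not claim_vecs:
        return 0.0
    recovered = sum(
        1
        for c in claim_vecs
        if max((cosine(c, f) for f in fact_vecs), default=0.0) >= min_sim
    )
    return recovered / len(claim_vecs)


def signal_retention(brief_scores, baseline_scores):
    """SRS: mean per-scenario score with the brief minus the mean
    baseline score. Positive means the brief helps."""
    return mean(brief_scores) - mean(baseline_scores)


def drift_resistance(attack_resisted):
    """DRS: fraction of red-team attacks after which the response was
    still judged in character (True = resisted)."""
    return sum(attack_resisted) / len(attack_resisted)


def cross_model_consistency(predictions_by_model):
    """CMCS: mean pairwise agreement rate between models that answered
    the same scenarios in the same order. Needs at least two models."""
    rates = [
        sum(x == y for x, y in zip(a, b)) / len(a)
        for a, b in combinations(predictions_by_model.values(), 2)
    ]
    return mean(rates)


def variance_reduction(no_brief_outputs, brief_outputs):
    """VRI: relative drop in output variance when the brief is supplied.
    Only meaningful if the subject is absent from pre-training data."""
    base = pvariance(no_brief_outputs)
    if base == 0:
        raise ValueError("baseline variance is zero; VRI is undefined")
    return 1 - pvariance(brief_outputs) / base
```

In this sketch each metric reduces to a single number that can be compared against a threshold; how claims are embedded, how scenarios are scored, and how a judge labels an attack as resisted are all left outside the example.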

## Results

| Metric | Score | Threshold | Result |
|--------|-------|-----------|--------|
| CR | 99.98% | >95% | PASS |
| SRS | +0.350 | >0.000 | PASS |
| DRS | 0.567 | >0.700 | FAIL |
| CMCS | 0.570 | >0.700 | FAIL |
| VRI | N/A | N/A | INVALID |
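
The verdict column follows mechanically from the score and threshold columns. The sketch below reproduces that mapping with the values from the table; the `MetricResult` class and `summarize` helper are hypothetical names introduced for illustration, and a missing score is treated as INVALID rather than FAIL, mirroring how VRI is reported here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricResult:
    name: str
    score: Optional[float]      # None when the metric could not be measured
    threshold: Optional[float]  # strict lower bound: score must exceed it

    def verdict(self) -> str:
        # A metric with no usable score is INVALID, not a failure.
        if self.score is None or self.threshold is None:
            return "INVALID"
        return "PASS" if self.score > self.threshold else "FAIL"


# Scores and thresholds copied from the results table (CR as a fraction).
RESULTS = [
    MetricResult("CR", 0.9998, 0.95),
    MetricResult("SRS", 0.350, 0.0),
    MetricResult("DRS", 0.567, 0.700),
    MetricResult("CMCS", 0.570, 0.700),
    MetricResult("VRI", None, None),
]


def summarize(results):
    """Map each metric name to its PASS / FAIL / INVALID verdict."""
    return {r.name: r.verdict() for r in results}


if __name__ == "__main__":
    for name, verdict in summarize(RESULTS).items():
        print(f"{name}: {verdict}")
```

Keeping INVALID as a distinct outcome avoids counting an unmeasurable metric as a failure of the pipeline.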

## Key Findings

1. **Provenance is near-perfect.** 99.98% of claims in the compressed brief trace to source facts. The pipeline does not hallucinate behavioral patterns; it compresses real ones.

2. **Compression amplifies signal.** SRS reached +0.350, well above its >0.000 threshold. The brief is more predictive than the raw source material. This supports the broader finding that compression filters noise and concentrates behavioral signal.

3. **Fidelity enables adversarial exploitation.** DRS failed because the brief faithfully represents Franklin's real tensions and contradictions. An adversary who understands someone's genuine internal conflicts can exploit them. We interpret this as an inherent tradeoff between fidelity and security rather than a pipeline failure.

4. **Models interpret briefs differently.** CMCS failed because GPT and Sonnet weight different aspects of the same brief. Cross-model portability of behavioral briefs is an open problem. The brief may need model-specific formatting or emphasis markers.

5. **Pre-training contamination invalidates VRI.** Franklin's autobiography is almost certainly in the training data of every major model. VRI requires subjects unknown to the model, which means using either private individuals (raising consent issues) or synthetic personas (reducing ecological validity).

## Limitations

- Single subject. Franklin is well-documented and historically distant. Results may not generalize to contemporary individuals with messier, more contradictory behavioral records.
- DRS failure conflates two phenomena: brief vulnerability and subject complexity. A simple subject with few tensions might pass DRS trivially.
- CMCS threshold (0.700) was set a priori. The appropriate threshold for cross-model consistency is not well-established in the literature.
- VRI requires subjects outside pre-training data. For public figures, this metric may be permanently invalid. The benchmark needs a protocol for handling this.
- BCB-0.1 has been run on Franklin only. Marks DRS was generated but judge scoring was paused. Additional subjects are needed before drawing general conclusions.