# Base Layer — Behavioral Compression Research

# Twin-2K-500 External Benchmark

**Date:** March 2026
**Authors:** Aarik Gulaya
**Dataset:** Columbia/Virginia Tech Twin-2K

---

## Abstract

We evaluated Base Layer's behavioral compression against the Twin-2K dataset, an external benchmark from Columbia University and Virginia Tech containing detailed persona descriptions for 500 participants. Using a random sample of N=100 participants, we tested whether a compressed brief (18:1 compression ratio) could match or exceed the predictive accuracy of the full persona description. On GPT-4.1-mini, the compressed brief achieved 71.83% accuracy versus 71.72% for the full persona (p=0.008). On Claude Sonnet, the brief scored 75.07% versus 74.38%. Compressed behavioral models match full persona dumps at a fraction of the token cost.

## Methodology

Each Twin-2K participant has approximately 130,000 characters of persona description covering demographics, preferences, behavioral tendencies, and personal history. We processed each through Base Layer's pipeline to produce compressed briefs, then evaluated three conditions:

- **C0 (No persona):** Model receives no persona information
- **C1 (Full persona):** Model receives the complete ~130K character description
- **C2 (Brief):** Model receives the Base Layer compressed brief

Evaluation used the Twin-2K behavioral prediction task: given a scenario, predict which of two options the participant would choose. We ran all conditions on two models (GPT-4.1-mini and Claude Sonnet 3.5) to test cross-model robustness. Statistical significance was assessed via paired comparison.

## Results

| Condition | GPT-4.1-mini | Claude Sonnet |
|-----------|-------------|---------------|
| C0 (No persona) | 68.43% | 73.21% |
| C1 (Full persona) | 71.72% | 74.38% |
| C2 (Brief) | **71.83%** | **75.07%** |

- **Compression ratio:** 18:1 (average ~130K chars to ~7K chars)
- **GPT C2 vs C1:** +0.11%, p=0.008
- **Sonnet C2 vs C1:** +0.69%, borderline significant

## Key Findings

1. **Compressed briefs match or beat full persona dumps.** On GPT-4.1-mini, the brief outperformed the full persona by a statistically significant margin at 18:1 compression. This is the core claim: compression does not lose predictive signal.

2. **Model upgrade dominates format.** Sonnet outperformed GPT across all conditions. The gap between C0-Sonnet (73.21%) and C2-GPT (71.83%) is larger than the gap between C1 and C2 on either model. A better model matters more than better persona information.

3. **Persona information helps, but modestly.** The C0-to-C2 lift was 3.40% on GPT and 1.86% on Sonnet. Persona context provides real but bounded improvement. Stronger base models have less room for persona lift.

4. **The full persona may introduce noise.** C1 underperforms C2 on both models. 130K characters of unstructured description likely includes irrelevant or contradictory information that dilutes the behavioral signal. Compression acts as a filter.

## Limitations

- N=100 from a 500-person dataset. Larger samples would tighten confidence intervals.
- Twin-2K personas are synthetic descriptions, not real conversation histories. Performance on naturalistic data may differ.
- The benchmark measures binary choice prediction. Real-world personalization involves open-ended generation where behavioral fidelity is harder to measure.
- Brief quality depends on extraction pipeline quality. Poor extraction on a given participant could degrade C2 without reflecting on the compression approach itself.
- Cross-model comparison uses different model generations. Direct architectural comparison is not possible.