# Base Layer — Behavioral Compression Research

# Provenance Evaluation Framework

**Date:** March 2026
**Author:** Aarik Gulaya
**Cost:** $0 (fully mechanical)

---

## Abstract

We developed a provenance-based evaluation framework to replace LLM-as-judge scoring for behavioral briefs. The core problem: using an LLM to judge whether a brief accurately represents a person is circular, because the judge has no ground truth and no way to verify claims. The framework defines four mechanical layers (Brief Activation, Provenance Coverage, Reasoning Chain Reconstruction, Priority Ordering) that trace claims in a brief back to source facts, measure coverage, and assess structural fidelity. Phase 1 (BA + PC) has been completed on two subjects; Phase 2 (RCR + PO) is designed but not yet implemented. The implemented layers cost $0 to run and produce fully human-auditable results.

## Methodology

The framework consists of four layers, each building on the previous:

**Layer 1 — Brief Activation (BA):** For each claim in the brief, generate a behavioral prediction. Then test whether that prediction can be confirmed or denied by the source facts. This measures whether the brief produces actionable behavioral signals, not just plausible-sounding descriptions.
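
As a rough illustration of what BA counts, the sketch below tallies how many claims yield a prediction that the source facts can either confirm or deny. It assumes the predictions and their verdicts already exist; how they are generated is not shown here. `ActivationRecord`, `activation_summary`, and the three verdict labels are hypothetical names chosen for this example, not part of the Base Layer pipeline.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical verdict labels; the real pipeline may use different ones.
VERDICTS = {"confirmed", "denied", "untestable"}


@dataclass
class ActivationRecord:
    claim: str        # a single claim taken from the brief
    prediction: str   # the behavioral prediction derived from that claim
    verdict: str      # outcome of checking the prediction against source facts


def activation_summary(records):
    """Count claims whose prediction is testable against the source facts.

    Assumption: a claim counts as "activated" when its prediction can be
    confirmed or denied. A denied prediction still counts, because BA
    measures activation, not accuracy.
    """
    for r in records:
        if r.verdict not in VERDICTS:
            raise ValueError(f"unknown verdict: {r.verdict!r}")
    counts = Counter(r.verdict for r in records)
    testable = counts["confirmed"] + counts["denied"]
    total = len(records)
    return {
        "total": total,
        "testable": testable,
        "activation_rate": testable / total if total else 0.0,
    }
```

Counting denied predictions as activated keeps the metric focused on whether a claim is specific enough to be checked at all, which matches the distinction between plausible-sounding descriptions and actionable signals.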

**Layer 2 — Provenance Coverage (PC):** Map every claim in the brief to its source facts via vector similarity. Compute coverage: what percentage of claims have traceable provenance? What percentage of high-confidence source facts are represented in the brief? This measures both precision (claims are grounded) and recall (important facts are captured).
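
The two coverage numbers can be sketched as follows. This is a minimal illustration, assuming claims and source facts have already been embedded as plain lists of floats and that `fact_vecs` has been filtered to high-confidence facts. The function names and the `0.8` similarity threshold are assumptions made for the example, not values from the pipeline.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def provenance_coverage(claim_vecs, fact_vecs, threshold=0.8):
    """Map each claim to supporting facts and compute both coverage ratios.

    Returns (provenance, claim_coverage, fact_coverage) where:
      provenance      maps claim index -> list of fact indices at or above
                      the similarity threshold
      claim_coverage  share of claims with at least one matched fact
                      (the precision-like number)
      fact_coverage   share of facts matched by at least one claim
                      (the recall-like number)
    """
    provenance = {
        ci: [fi for fi, fv in enumerate(fact_vecs) if cosine(cv, fv) >= threshold]
        for ci, cv in enumerate(claim_vecs)
    }
    grounded = sum(1 for facts in provenance.values() if facts)
    covered = {fi for facts in provenance.values() for fi in facts}
    claim_coverage = grounded / len(claim_vecs) if claim_vecs else 0.0
    fact_coverage = len(covered) / len(fact_vecs) if fact_vecs else 0.0
    return provenance, claim_coverage, fact_coverage
```

Returning the full `provenance` map alongside the ratios keeps the result auditable: a reviewer can open any claim and read the facts it was matched to, rather than trusting the aggregate score.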

**Layer 3 — Reasoning Chain Reconstruction (RCR):** For claims that aggregate multiple source facts, reconstruct the reasoning chain. Can a human follow the path from source facts to synthesized claim? This measures whether the compression is transparent or opaque. Not yet implemented.

**Layer 4 — Priority Ordering (PO):** Compare the brief's implicit priority ordering (what it emphasizes, what it omits) against empirical behavioral frequency in the source data. Does the brief emphasize patterns that actually recur, or does it fixate on vivid but rare behaviors? Not yet implemented.
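
Since this layer is not yet implemented, the following is only one possible way to make the comparison concrete. It treats the order in which patterns appear in the brief as a stand-in for emphasis (a simplifying assumption) and compares that order with frequency ranks using a Spearman-style rank correlation. All names here (`priority_agreement`, `brief_order`, `frequency`) are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 (largest) upward; ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def priority_agreement(brief_order, frequency):
    """Compare brief emphasis with observed frequency.

    brief_order: pattern names in order of appearance in the brief
                 (assumed here to reflect emphasis, first = strongest).
    frequency:   dict of pattern name -> occurrence count in source data.
    Returns (rho, omitted): rank correlation over the shared patterns, and
    patterns present in the data but absent from the brief, most frequent first.
    """
    shared = [p for p in brief_order if p in frequency]
    if len(shared) < 2:
        raise ValueError("need at least two shared patterns to correlate")
    brief_ranks = [float(i + 1) for i in range(len(shared))]
    freq_ranks = average_ranks([frequency[p] for p in shared])
    mentioned = set(brief_order)
    omitted = sorted((p for p in frequency if p not in mentioned),
                     key=lambda p: -frequency[p])
    return pearson(brief_ranks, freq_ranks), omitted
```

A value near 1 would suggest the brief leads with patterns that actually recur, while a value near -1 would suggest it leads with rare ones. The `omitted` list surfaces frequent patterns the brief leaves out entirely.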

## Results

Phase 1 results on two subjects:

| Metric | Marks (74 memos) | Aarik (1,892 conversations) |
|--------|-------------------|----------------------|
| BA — Claims producing predictions | High | High |
| PC — Claims with provenance | High | High |
| PC — Source fact coverage | Moderate | Moderate |

Specific numerical scores are documented in the full provenance evaluation report. The key qualitative finding: claims in the brief are well-grounded (high precision), but significant source material is omitted (moderate recall). This is expected — compression inherently discards information. The question is whether the right information is retained, which Layers 3 and 4 are designed to answer.

## Key Findings

1. **LLM-as-judge is circular for identity evaluation.** When a model scores whether a brief "captures" someone, it has no access to the person. It scores plausibility, not accuracy. This is the fundamental motivation for mechanical evaluation.

2. **Provenance tracing is tractable.** Claims in a Base Layer brief can be mapped to source facts via vector similarity, and in Phase 1 a high share of claims in both briefs had traceable provenance. The pipeline's extraction-to-compression path is transparent enough for post-hoc auditing.

3. **Precision is high, recall is moderate.** The brief does not make things up, but it leaves things out. This is by design — a 2,000-character brief cannot represent 1,892 conversations. The evaluation question shifts from "is the brief accurate?" to "does it omit anything important?"

4. **$0 evaluation is possible.** No API calls are required for the BA or PC layers. Vector similarity, set coverage, and structural analysis are all local computations. With no per-subject API cost, the implemented layers can be run on additional subjects at the cost of local compute only.

5. **Human auditability is the real standard.** If a human cannot trace a claim back to evidence, the claim is unverified, regardless of the score a judge model assigns it. This principle drives the entire framework design.

## Limitations

- Phase 2 (RCR + PO) is designed but not yet implemented. The framework is incomplete.
- Vector similarity is a proxy for semantic grounding. Two statements can be similar in embedding space without one actually supporting the other.
- The BA layer generates predictions from claims but does not validate those predictions against held-out behavioral data. It measures activation, not accuracy.
- The framework has been tested on only two subjects. One (Aarik) is the system developer, which may bias both what the pipeline extracts and how the evaluation results are interpreted.
- The framework evaluates brief content but not brief format. A well-grounded brief in a poor format could score high on provenance but fail on downstream tasks.
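
The vector-similarity limitation above can be illustrated with a toy example. The sketch uses bag-of-words cosine similarity as a deliberately crude stand-in for an embedding model. Real embeddings behave differently, but they may show the same failure mode where a statement and its negation score as close neighbors. The function name and both sentences are invented for illustration.

```python
import math
from collections import Counter


def bow_cosine(a, b):
    """Cosine similarity over lowercase word counts (toy embedding stand-in)."""
    counts_a, counts_b = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical claim and source fact that contradict each other.
claim = "replies to email within an hour"
fact = "never replies to email within an hour"

# The two share six of seven words, so the score is 6 / sqrt(42), about 0.93,
# even though the fact refutes the claim rather than supporting it.
score = bow_cosine(claim, fact)
```

A threshold-based provenance match would accept this pair as grounded, which is why a similarity score on its own should be read as a candidate link for human review rather than as proof of support.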