# Do Project Axioms Improve AI Engineering Performance?
## A Controlled Experiment on SWE-Bench Verified

**Date:** March 13, 2026 | **Status:** Complete — Null Result | **Cost:** $524 (actual API billing)

## Key Finding

Domain axioms extracted from a project's design rationale do not improve AI agent performance on that project's engineering tasks. The bare baseline (36.7%) outperformed every treatment condition. The most diagnostic finding: Django axioms and sklearn axioms (entirely wrong domain) produced identical performance on Django bugs. The content of what you inject doesn't matter. The injection mechanism doesn't work for this task class.

## What This Means

The behavioral drift study (E1) showed that axioms change HOW an agent approaches problems (SR = 3.30). This study tested whether that change carries over to WHETHER the agent solves them. It doesn't. These are orthogonal dimensions. Axioms change the surface of model behavior: style, framing, vocabulary. They don't change the depth of reasoning needed to locate and fix a specific bug in a test suite.

**This clarifies the product boundary honestly:** Base Layer's value is in human understanding across continuous interactions, not in making coding agents better at isolated engineering tasks. SWE-Bench tests decontextualized bug fixes. Base Layer is for relationships where accumulated understanding compounds over time.

## Experimental Setup

**Framework:** OpenHands (industry standard, same as ETH Zurich SWE-Bench study)
**Model:** Claude Haiku 4.5, temperature 0, reasoning_effort "none"
**Problems:** 30 hard Django problems from SWE-Bench Verified (avg 95 patch lines, 2.3 files)

### 7 Conditions

| ID | Name | Description |
|----|------|-------------|
| C0 | Bare baseline | Default OpenHands behavior |
| C1 | Generic expert | "You are an expert Django developer" |
| C2 | Django axioms (TREATMENT) | 5 causal axioms from Django design docs |
| C3 | Wrong-domain axioms | 5 sklearn axioms on Django tasks |
| C4 | Same info, flat bullets | C2 content without causal structure |
| C5 | Stacked (C1 + C2) | Generic prompt + axioms |
| C7 | Raw Django docs | Unstructured documentation text |

### Preregistered Hypotheses

- **H1 (primary):** C2 > C0 — axioms beat baseline (McNemar's exact test, α=0.05)
- **H2 (secondary):** C2 > C4 — causal format matters (Bonferroni α=0.025)
- **H3 (secondary):** C2 > C3 — domain specificity matters (Bonferroni α=0.025)

## Results

| Condition | Description | Solved | Rate |
|-----------|-------------|--------|------|
| **C0** | **Bare baseline** | **11/30** | **36.7%** |
| C1 | Generic expert prompt | 10/30 | 33.3% |
| C2 | Django axioms (TREATMENT) | 9/30 | 30.0% |
| C3 | Wrong-domain (sklearn) axioms | 9/30 | 30.0% |
| C4 | Same info, flat bullets | 8/30 | 26.7% |
| C5 | C1 + C2 stacked | 8/30 | 26.7% |
| C7 | Raw Django design docs | 10/30 | 33.3% |

### Hypothesis Tests

| Hypothesis | Comparison | p-value | Result |
|------------|------------|---------|--------|
| H1 (primary) | C2 vs C0 | 0.625 | NOT SIGNIFICANT |
| H2 (secondary) | C2 vs C4 | 1.0 | No format effect |
| H3 (secondary) | C2 vs C3 | 1.0 | No domain specificity |
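These p-values follow directly from the discordant-pair counts reported elsewhere in this writeup: 4 discordant C0/C2 pairs (the 3-vs-1 split is inferred from the 11 vs 9 solved totals, so treat it as an assumption) and 6 discordant C2/C3 pairs split 3 each way. A minimal sketch of the exact two-sided McNemar computation:

```python
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar test on discordant pairs.
    b = problems solved only by condition A, c = solved only by condition B.
    Concordant pairs (both solved or both failed) don't enter the test."""
    n = b + c
    if n == 0:
        return 1.0
    # Exact binomial tail under H0 (each discordant pair is a fair coin),
    # doubled for a two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# H1: C0 vs C2 — 4 discordant pairs, assumed 3 solved only by C0, 1 only by C2
print(mcnemar_exact(3, 1))  # 0.625, matching the reported H1 p-value
# H3: C2 vs C3 — 6 discordant pairs, 3 each way
print(mcnemar_exact(3, 3))  # 1.0, matching the reported H3 p-value
```

With only 4 discordant pairs, the smallest achievable two-sided p-value (a 4-0 split) is 0.125, so H1 could never have reached α=0.05 with this discordance rate.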

### Problem Distribution

- 17/30 (57%) unsolvable by any condition
- 4/30 (13%) solved by every condition
- 9/30 (30%) in the swing zone, where conditions differentiate
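The three buckets above can be computed mechanically from a per-condition map of solved problem IDs. A sketch with hypothetical toy data (the real per-problem T4 results are not reproduced here):

```python
def partition_problems(solved_by: dict[str, set[str]], problems: list[str]):
    """Split problems into solved-by-all, solved-by-none, and the swing zone
    (solved by some conditions but not others)."""
    solved_all, solved_none, swing = [], [], []
    for p in problems:
        hits = sum(p in solved for solved in solved_by.values())
        if hits == len(solved_by):
            solved_all.append(p)
        elif hits == 0:
            solved_none.append(p)
        else:
            swing.append(p)
    return solved_all, solved_none, swing

# Hypothetical toy data, not the real T4 results
solved_by = {"C0": {"a", "b"}, "C2": {"a"}}
print(partition_problems(solved_by, ["a", "b", "c"]))
# → (['a'], ['c'], ['b'])
```

Only the swing-zone problems carry statistical information: the 17 universally unsolved and 4 universally solved problems contribute nothing to McNemar's discordant-pair counts.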

### Subgroup Analysis (High Relevance)

Axioms hurt most precisely where they should have helped most: on high-relevance problems, C0 solved 40% vs C2's 30%. All five relevance-4 problems were unsolvable under every condition.

## Why Axioms Didn't Work: Conversation Log Analysis

We manually reviewed the four discordant pairs (problems where exactly one of C0 and C2 succeeded):

1. **Haiku doesn't reference axioms.** It acknowledges them in an opening message, then proceeds with default code-reading behavior.
2. **Word overlaps are incidental.** When axiom vocabulary (e.g., "migration") appears in reasoning, it's because the code uses that word, not because the axiom guided behavior.
3. **Axioms create narration overhead.** On problems C2 lost, it spent turns generating "axiom-alignment summaries" instead of reading code.

## Reconciling with E1

E1 measured behavioral drift: style, vocabulary, framing, priorities. T4 measured task outcomes: pass/fail on a test runner. These dimensions are orthogonal:

- **Behavioral compression affects:** style, framing, what the model emphasizes, how it describes reasoning
- **Behavioral compression does not affect:** whether a specific code change passes a test suite

Axioms change behavior. Behavior change doesn't always change outcomes. Both facts are true.

## The H3 Finding

Django axioms and sklearn axioms produced identical performance (30.0%) on Django bugs. They solved different problems: 6 discordant pairs, 3 each way. The model is equally unaffected by relevant and irrelevant domain knowledge. This is a clean falsification of the domain-specificity hypothesis.

## Cost Discovery

| Source | Amount |
|--------|--------|
| OpenHands reported | $146.52 |
| **Actual API billing** | **$524** |
| Gap (prompt caching) | ≈$377 |

OpenHands doesn't track prompt caching costs. For any study using Anthropic models: budget against raw API billing, not framework dashboards.
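As a budgeting rule of thumb, the gap can be reconstructed from the two figures above. The per-run estimate assumes all 7 × 30 = 210 condition-problem runs were billed roughly equally, which is an assumption:

```python
reported = 146.52   # OpenHands dashboard total
actual = 524.00     # raw Anthropic API billing (rounded in the report)
runs = 7 * 30       # 7 conditions x 30 problems

print(f"gap: ${actual - reported:.2f}")        # untracked (mostly cache) spend
print(f"multiplier: {actual / reported:.1f}x") # dashboard underreports ~3.6x
print(f"per run: ${actual / runs:.2f}")
```

In other words, real spend was about 3.6× the framework's reported figure, roughly $2.50 per condition-problem run.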

## Limitations

1. **Model-specific.** Haiku 4.5 may lack the capability to integrate axioms into its reasoning. A frontier model might show different results.
2. **Task type mismatch.** SWE-Bench problems are mechanical bug fixes. Architecture decisions, API design, and code review (the tasks where design philosophy should matter most) were not tested.
3. **Sample size.** N=30, with only 4 discordant C0/C2 pairs. The design was powered to detect effects of roughly 20 percentage points; smaller effects are undetectable.
4. **Axiom source.** Extracted from official docs, not actual developer behavior (code reviews, commit messages).

## What We Would Do Differently

1. Test on design-heavy tasks (architecture reviews, API design, naming decisions)
2. Use a more capable model (Sonnet 3.5+)
3. Extract axioms from actual developer behavior, not official documentation
4. Increase N to 100+ for tighter confidence intervals
5. Budget against API billing from the start

## Connection to Base Layer

The product thesis is unaffected. Base Layer's core claim, "every agentic workflow is hollow if it doesn't understand who the human is behind the screen," is about the human-AI interface, not about making coding agents better at isolated engineering tasks. T4 tested a legitimate adjacent hypothesis. It doesn't hold. That's worth knowing and worth saying clearly.

---

*Base Layer Research | Study T4 | March 2026*
*Research direction: Aarik Gulaya | Execution: Claude Haiku 4.5 via OpenHands | Analysis: Claude Sonnet 4.6*
