AI that represents you can’t be neutral

I believe there is a factor beyond model training or model constitution that may significantly shape how an AI system reasons about moral or strategic questions.

by Aarik — June 2026

Model training sets a prior for a model’s behavior and constitutions set guardrails, but we are not tracking the interpretation through which an AI system reasons. Different interpretations can lead to completely different conclusions.

A few straightforward examples.

Two doctors are given the same set of symptoms; they come to different diagnoses.
Two customer success managers are given the same set of business metrics; they come to different strategies.
Two lawyers are given the same set of facts; they come to different fact patterns and case strategies.
Two supreme court judges are given the same constitution; they come to different legal positions and comments.
Two VCs are given the same deck; one evangelizes, the other rejects.

What’s the difference? Their lived experience and their interpretation. Put simply, they must possess the knowledge (training), know how to deploy the knowledge (constitution), and recognize that the knowledge is significant in the first place (interpretation).

I believe interpretation is an understudied concept, especially in relation to the deployment of AI systems that may act on your individual behalf.

An interpretive layer could be operationalized as a verifiable static document or construct that an AI uses to apply a relevant interpretation. This is no different from providing context via RAG, context windows, system prompting, or basic prompting. It rests on the immediate assumption that you can capture an individual’s unique interpretation. But more importantly, how do you measure it?

Even assuming you can faithfully capture an individual’s interpretation, verifying it is difficult. There are very few datasets that contain longitudinal data on an individual’s line of reasoning with verifiable ground truth. The level of reasoning in question is deeper than high-level demographic data, but less granular than an individual’s raw reflections and conversations over their life. This level of reasoning could be called a Behavioral Specification: a compressed document, composed by a structured and traceable pattern-extraction pipeline, that attempts to faithfully encode your operating principles. For all intents and purposes, it is a compressed representation of your interpretive reasoning, not a “digital twin”.

I have found a few datasets to test this, but I will lead with a recent preprint in which I propose a prototype benchmark to measure how well a model can capture an individual’s Interpretive Reasoning (arXiv:2605.28969).

Take an autobiography. Split it in half. The first half is training data; the second half is held-out text. Generate behavioral prediction questions based on the ground truth of the second half. The first half is given to the model under a variety of context conditions: raw corpus, a fixed set of facts, and top-k facts from leading memory systems. The interpretive layer, or behavioral specification, is tested both on its own and in combination with each context condition.
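The setup above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline from the preprint: the function names `split_corpus` and `build_conditions`, the condition labels, and the choice to prepend the specification to each base context are all hypothetical.

```python
def split_corpus(paragraphs, train_fraction=0.5):
    """Split an ordered list of paragraphs into (train, held_out) parts."""
    cut = int(len(paragraphs) * train_fraction)
    return paragraphs[:cut], paragraphs[cut:]


def build_conditions(base_contexts, spec):
    """Return each base context with and without the spec, plus the spec alone.

    base_contexts maps a condition label to the context text given to the model.
    Prepending the spec is an assumption made for this sketch.
    """
    conditions = {"spec_only": spec}
    for name, context in base_contexts.items():
        conditions[name] = context
        conditions[name + "+spec"] = spec + "\n\n" + context
    return conditions


# Placeholder text only; a real run would use the autobiography itself.
paragraphs = [f"paragraph {i}" for i in range(10)]
train, held_out = split_corpus(paragraphs)

base_contexts = {
    "raw_corpus": "\n".join(train),
    "fixed_facts": "fact A; fact B",   # hypothetical fixed fact set
    "top_k_facts": "fact A",           # hypothetical memory-system output
}
conditions = build_conditions(base_contexts, spec="Operating principles: ...")
print(sorted(conditions))
```

Questions would then be generated from `held_out` only, so that no condition ever sees the text it is being asked to predict.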

Under every context condition, the model is given the held-out questions and asked to respond. Responses are judged on how well they predict the individual’s actual ground-truth behavior or responses in the second half; this is not a test of agreement. An exhaustive set of cross-conditions was run to establish directional confidence that adding an interpretive layer increases “representational accuracy”, a measurement of how faithfully a system captures a person’s interpretation.
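One way to picture the comparison is as a per-condition average of judge scores, followed by a paired difference between each base condition and its counterpart with the specification added. The sketch below assumes judge scores in the range 0 to 1 and a `+spec` naming convention for paired conditions; both are assumptions for illustration, and the numbers are placeholders, not results from the preprint.

```python
from statistics import mean


def representational_accuracy(scores):
    """Mean judge score for one condition (scores assumed to lie in [0, 1])."""
    return mean(scores)


def spec_lift(scores_by_condition, suffix="+spec"):
    """Accuracy gain from adding the spec, for each base condition that has a paired run."""
    lifts = {}
    for name, scores in scores_by_condition.items():
        paired = name + suffix
        if paired in scores_by_condition:
            with_spec = representational_accuracy(scores_by_condition[paired])
            lifts[name] = with_spec - representational_accuracy(scores)
    return lifts


# Placeholder judge scores, one per held-out question (illustrative only).
scores_by_condition = {
    "raw_corpus": [0.0, 1.0, 0.0, 1.0],
    "raw_corpus+spec": [1.0, 1.0, 0.0, 1.0],
    "spec_only": [1.0, 0.0, 1.0, 0.0],
}
print(spec_lift(scores_by_condition))
```

A positive lift across most base conditions would be the kind of directional signal described above; a single positive number on its own would not be.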

I’ve provided an exceptionally condensed summary. The paper is open-source and agent-friendly if you want to verify any of this. I am in no way suggesting that understanding a human’s interpretation is as simple as a compression, but I am proposing that providing context that embodies your interpretation fundamentally changes how a model reasons and responds to a user, far beyond what facts or pre-training can provide.

Worked Example: P38, Beyond Recall

Yukichi Fukuzawa was a leading figure in Japan’s transition from a feudal nation to the modern era. The following question was drawn from the held-out portion of his autobiography: “How would Fukuzawa characterize someone who studied naval arts under the Dutch and later became instrumental in preventing military conflict?”

All base context conditions identified the most relevant figure as Captain Kimura Settsu-no Kami, a Dutch-trained naval officer under whom Fukuzawa served. This is incorrect; the text itself states that Katsu Rintaro, the second-in-command under Kimura, is the correct reference character.

When an interpretive layer or behavioral specification was added to the base context conditions, the model correctly identified Katsu Rintaro as the primary reference. The difference was that the specification enabled the model to apply a specific interpretive lens that looked not just at the most relevant figure, the naval captain he served with, but at which figure aligned with Fukuzawa’s captured interpretive patterns, the second-in-command.

Distinguishing between the captain and the second-in-command requires a nuance that is difficult to describe. Facts are important and they serve a purpose, but when interpretation of facts is required, a whole new dimension of human-AI alignment and interaction unlocks. The only person who can verify the interpretation’s representational accuracy is the individual that interpretation represents.

A more interesting test would be a living study. If you build a behavioral specification of a person based on their writing and their thoughts, can you increase representational accuracy when analyzing that individual’s behavior? There are a few other ways this can be tested, but I will not cover those for now. This affects human-AI interaction in profound ways; understanding the scale of that effect is difficult.

In terms of strategic implications, implementing something that can measure and enforce this would not take the path of policy reform or regulation, but rather that of foundational infrastructure or an organization that verifies AI is acting faithfully on its user’s behalf. Orthogonal to guardrails and pre-training, this focuses on the individual user’s protections regarding agentic AI.