
Research

Seven studies across 81+ sessions. Every claim has numbers behind it, every number has a methodology, and every methodology is documented. Here’s what we tested and what we found.

01 · 20% of facts is enough

Compression saturates early. Throwing away 80% of extracted facts doesn’t hurt brief quality — and often improves it. Adding more content makes things worse.

Franklin: peak at 20%. Marks: ~50%. Coverage remediation: +0.15 (negligible).

02 · Avoidance predicts best

Behavioral facts — especially avoidance and struggle patterns — are the strongest predictors. Biographical facts (48% of total) are mediocre. Epistemic beliefs are middle-of-pack.

Avoidance predicates: 26.2 composite. Epistemic: 19.4. Cross-model confirmed.

03 · Patterns are temporally stable

Facts from early in someone’s life predict late behavior as well as late facts predict early behavior. Identity is more stable than expected.

Franklin Q4→Q1: 24.9 vs Q1→Q4: 21.9. Direction effect < 2%.

04 · Format matters more than content

The same information in annotated guide format dramatically outperforms narrative prose — at one-third the length. The production brief at 9,144 chars scored worst.

Annotated guide: 0.766 (+24%). Production baseline: 0.618. Optimal: ~1,000–2,500 chars.

05 · 4 steps beat 14

We ablated every pipeline step. Scoring, classification, tiering, contradiction detection, and Collective review are all ceremonial. The three-layer architecture IS load-bearing.

4-step: 87/100. Full 14-step: 83/100. Raw facts → single prompt: 80/100.

06 · Fidelity creates vulnerability

The more faithfully the brief captures someone, the more exploitable it becomes. The best downstream format has the worst adversarial resistance. This is correct behavior, not a bug.

Annotated guide: 60% adversarial resistance. Directive format: 100%.

Behavioral Drift

Format > model size

Note: this study used the V4 brief. V5 is the current version, produced by the Prompt Ablation study.
March 12, 2026

The experiment

We injected a single behavioral fact into a coding agent’s identity brief and measured whether the resulting behavioral change was targeted (changed the right dimension) or diffuse (changed everything equally). Tested across 4 models, 3 identity formats, 5 mechanical coding tasks. Total cost: ~$0.30.

Specificity Ratio by model and format

SR > 1.5 = targeted drift. SR ≈ 1.0 = diffuse. SR < 0.8 = missed.

| Model | Params | Cost | Brief (prose) | Axioms | Atomic (flat) |
|---|---|---|---|---|---|
| Phi-4 Mini | 3.8B | $0 | 1.25 | — | — |
| Qwen 2.5 | 7B | $0 | 2.62 | 2.55 | 1.00 |
| DeepSeek-R1 | 14B | $0 | 0.73 | 2.49 | 0.54 |
| Claude Sonnet | ~70B | ~$0.30 | 0.87 | 2.11 | 1.14 |
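The specificity ratio behind these numbers can be sketched in a few lines. This is our illustrative reading, not the study's published code: we assume SR is the absolute change on the targeted behavioral dimension divided by the mean absolute change across the other dimensions, so SR > 1.5 means the injected fact mostly moved the dimension it was aimed at.

```python
# Hypothetical sketch of a specificity ratio (SR) over per-dimension
# behavioral deltas measured before/after injecting one identity fact.
# Assumption: SR = |delta on target| / mean |delta| on all other dimensions.

def specificity_ratio(deltas: dict[str, float], target: str) -> float:
    """deltas maps behavior dimension -> observed change after injection."""
    others = [abs(v) for k, v in deltas.items() if k != target]
    if not others or sum(others) == 0:
        return float("inf")  # nothing else moved at all
    return abs(deltas[target]) / (sum(others) / len(others))

# Example: an "avoids over-engineering" fact should mostly move the
# "simplicity" dimension. Here the other dimensions barely shift.
deltas = {"simplicity": 0.50, "verbosity": 0.10, "test_coverage": 0.30}
print(specificity_ratio(deltas, "simplicity"))  # 2.5 -> targeted drift
```

With these toy numbers the target moved 0.50 while the others averaged 0.20, giving SR = 2.5 — comfortably in the "targeted" band.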

What we found

The core insight

An agent that understands WHY you avoid over-engineering routes new engineering lessons to the right place. An agent that just knows you “prefer simple code” can’t. The format of identity representation — not the model size — determines whether an AI can learn precisely from new information about you.

4 models tested · 3 identity formats · $0.30 total API cost · SR > 2.0 axiom targeting

Prompt Ablation

31 conditions → V5 brief

Note: this study produced the V5 brief, the current format (citation-stripped, cleaner prose).
March 11, 2026

What we did

31 prompt variations across 7 rounds, tested on 3 subjects (Franklin, Buffett, Aarik). We systematically ablated the composition prompt — the final step that determines what the consuming LLM actually sees — to find what makes a behavioral brief effective.

+99% V4 → V5 score gain · 56% size reduction · 31 conditions tested · +217% epistemic calibration gain

Novel contribution

Epistemic calibration — explicitly marking what the system cannot predict — is the study’s novel contribution. No comparable personalization system tells you where its behavioral model breaks down. An LLM that knows what it doesn’t know is more useful than one that’s confidently wrong everywhere.

* These are preliminary results from N=3 subjects, a single research group, and model-judged scoring (except the blind A/B). We present them as honest first results, not definitive conclusions. The rubric was redesigned mid-study — direct cross-rubric comparison is invalid. Replication on larger, diverse populations is needed.

Pipeline Ablation

Which steps matter?

Note: this study used the V4 brief. V5 is the current version, produced by the Prompt Ablation study.
March 8, 2026

What we did

We originally built a 14-step pipeline to go from raw text to identity brief. Before shipping, we tested every step: is it load-bearing or ceremony? We ran 14 conditions on Benjamin Franklin’s autobiography (~$16 total) and measured brief quality for each.

14 conditions tested · ~$16 total cost · 87 best score

Each condition produces a brief from the same source text. The score (0–100) measures how well the brief captures the subject’s behavioral patterns — rated by an independent model that compares the brief against the full source material. Higher = more of the subject’s real patterns are captured accurately.

| Condition | Description | Score | Note |
|---|---|---|---|
| C0 | Full 14-step pipeline | 83 | Baseline |
| C1 | Skip scoring | 83 | |
| C2 | Skip classification | 82 | |
| C3 | Skip tiering | 83 | |
| C4 | Skip contradictions | 82 | |
| C5 | Skip consolidation | 81 | |
| C6 | Skip anchors extraction | 83 | |
| C7 | Skip embedding | 83 | |
| C8 | Skip ANCHORS layer | 80 | |
| C9 | Skip CORE layer | 77 | |
| C10 | Skip PREDICTIONS layer | 79 | |
| C11 | Author + Compose (no review) | 87 | Best |
| C12 | Direct fact injection | 77 | Worst |
| C13 | Single layer (no 3-layer) | 83 | |

What it means

4 steps beat 14. The simplified pipeline (Import → Extract → Author → Compose) scores 87 vs the full 14-step pipeline at 83.
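The surviving four steps can be sketched as a simple linear pipeline. Every function name and heuristic below is an illustrative assumption, not the project's actual API; in the real system, Extract, Author, and Compose would each be LLM calls.

```python
# Illustrative sketch of the simplified 4-step pipeline:
# Import -> Extract -> Author -> Compose. (Import is just reading the
# source text.) Names and layer-splitting heuristics are hypothetical.

def extract_facts(text: str) -> list[str]:
    """Pull candidate behavioral facts out of raw text (an LLM call in practice)."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def author_layers(facts: list[str]) -> dict[str, list[str]]:
    """Group facts into the three layers the brief is built from."""
    return {
        "ANCHORS": facts[:3],       # identity-defining patterns
        "CORE": facts[3:10],        # recurring behaviors
        "PREDICTIONS": facts[10:],  # expected responses to new situations
    }

def compose_brief(layers: dict[str, list[str]]) -> str:
    """Render the layers as an annotated guide (the winning format)."""
    return "\n\n".join(
        f"[{name}]\n" + "\n".join(f"- {fact}" for fact in facts)
        for name, facts in layers.items()
    )

source = "Avoids over-engineering.\nPrefers written argument.\nDrops projects when bored."
print(compose_brief(author_layers(extract_facts(source))))
```

The ablation's point is that everything between Extract and Author in the 14-step version (scoring, classification, tiering, contradiction detection) can be dropped without losing brief quality.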

Compression & Format

How much data is enough?

Note: this study used the V4 brief. V5 is the current version, produced by the Prompt Ablation study.
March 8, 2026

What we did

We tested how much source data the pipeline actually needs, and whether the format of the output matters as much as the content. Cross-validated on two models (Sonnet API and Qwen local GPU). The consistent finding: less is more, and format matters more than content.

What it means

The pipeline’s value is in compression, not accumulation. The best brief is short, behavioral (not biographical), and formatted as an annotated guide rather than narrative prose.

Twin-2K-500

External validation

Note: this study used the V4 brief. V5 is the current version, produced by the Prompt Ablation study.
March 7, 2026

What we did

Can a compressed brief predict how someone will actually respond to survey questions? We used the Twin-2K dataset from Columbia/Virginia Tech — 100 real participants, each with detailed persona descriptions (~130K characters). We compressed each into a brief and tested whether models could predict their responses.

N=100 participants · 18:1 compression ratio · 71.83% brief accuracy · p=0.008 significance

GPT-4.1-mini

C2 (Base Layer brief): 71.83%
C1 (Full persona dump): 71.72%
C0 (No persona): 68.43%

C2 vs C1: p=0.008

Claude Sonnet

C2 (Base Layer brief): 75.07%
C1 (Full persona dump): 74.38%
C0 (No persona): 73.21%

C2 vs C1: +0.69% (borderline)

What it means

A compressed brief matches a full persona dump at 18:1 compression. On GPT-4.1-mini, the brief actually outperforms the full dump (p=0.008). Compression doesn’t lose signal — it concentrates it.
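As a hedged sketch of how a paired significance claim like p=0.008 can be checked mechanically: an exact two-sided sign test over the items where the two conditions disagree. The counts below are invented for illustration; they are not the study's per-item data, and we do not know which test the study actually used.

```python
# Hypothetical sketch: exact two-sided sign test for comparing two
# predictors on the same items. b = items only the brief (C2) got right,
# d = items only the full dump (C1) got right; ties carry no information.
from math import comb

def sign_test_p(b: int, d: int) -> float:
    """Two-sided exact binomial test on discordant pairs under H0: p = 0.5."""
    n = b + d
    if n == 0:
        return 1.0
    k = min(b, d)
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * lower_tail)

# A lopsided split of discordant items yields a small p-value.
print(sign_test_p(70, 35))
```

The intuition: if the brief and the full dump were equally good, the items they disagree on should split roughly 50/50; a 70/35 split is very unlikely under that null.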

BCB-0.1

Measuring brief quality

Note: this study used the V4 brief. V5 is the current version, produced by the Prompt Ablation study.
March 7, 2026

What we did

Five metrics measuring compression quality. Tested on Franklin’s autobiography. Two passed, two failed, one invalid. The failures are as informative as the passes.

| Metric | Name | Result | Status |
|---|---|---|---|
| CR | Claim Recoverability | 99.98% | PASS |
| SRS | Signal Retention Score | +0.350 | PASS |
| DRS | Drift Resistance Score | 0.567 | FAIL |
| CMCS | Cross-Model Consistency | 0.570 | FAIL |
| VRI | Variance Reduction Index | N/A | N/A |

What it means

Faithful briefs expose real contradictions in someone’s worldview — more useful AND more vulnerable to adversarial attack. DRS will always penalize fidelity. This is a feature, not a bug.

Provenance Evaluation

Mechanical, not opinion

Note: this study used the V4 brief. V5 is the current version, produced by the Prompt Ablation study.
March 7, 2026

What we did

Using an LLM to judge another LLM’s output is circular. We built an evaluation framework with four mechanical layers — no LLM judges, zero cost, and every result is human-auditable. The question: can we verify brief quality without relying on model opinions?

$0 evaluation cost · 4 mechanical layers · 2 subjects tested · 8/10 prompts where brief wins (BA)

Phase 1 Results — Howard Marks (74 investment memos)

Layer 1: Brief Activation (BA)

C1 mean similarity to brief: 0.4030
C5c mean similarity to brief: 0.4192
Delta: +0.016
Prompts where brief wins: 8/10

Layer 2: Provenance Coverage (PC)

C1 coverage (threshold 0.50): 20.4%
C5c coverage (threshold 0.50): 23.4%
Delta: +3.0%
Prompts where brief wins: 7/10

Consistent direction across all 7 similarity thresholds tested (0.40–0.70)
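A minimal sketch of what a mechanical metric in this spirit could look like. "Provenance Coverage" here is our assumed reading: the fraction of outputs whose best cosine similarity against the brief's claims clears a threshold. The toy 2-D vectors stand in for real embeddings, and the function names are ours, not the framework's.

```python
# Hypothetical sketch of a Provenance Coverage-style metric: the share of
# output vectors whose best cosine match among the brief's claim vectors
# clears a similarity threshold. Toy 2-D vectors stand in for embeddings.
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def provenance_coverage(outputs, claims, threshold=0.50):
    """Fraction of outputs whose best match among claims >= threshold."""
    hits = sum(1 for o in outputs if max(cosine(o, c) for c in claims) >= threshold)
    return hits / len(outputs)

outputs = [(1.0, 0.1), (0.0, 1.0), (0.7, 0.7)]  # model responses
claims = [(1.0, 0.0), (0.6, 0.8)]               # brief claims
print(provenance_coverage(outputs, claims, threshold=0.90))  # 2 of 3 covered
```

Everything here is auditable by hand: each covered/uncovered decision reduces to a dot product and a threshold, with no model in the loop.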

What it means

Core principle: if a human can’t audit the claim, it’s not evidence. Every metric in this framework is verifiable without running a model.

Compose Variations

V4 prompt engineering

Note: this study used the V4 brief. V5 is the current version, produced by the Prompt Ablation study.
March 3, 2026

What we did

The compose step takes the same extracted facts and authored layers, and synthesizes them into a final brief. The composition prompt controls the output format. We tested six variations to find which format produces the most useful brief for downstream AI interactions.

What it means

Format changes alone improved downstream task performance by +24% (annotated guide vs narrative prose). The same information, restructured, is dramatically more useful to models.

Design Decisions

80 decisions, all public

The full decision log

Every architectural choice is documented with reasoning, alternatives considered, and status. 80 decisions across 81+ sessions. Here are the highlights — grouped by theme. The full log is published in the repository at docs/core/DECISIONS.md.

80 decisions logged · 81+ sessions · 47 constrained predicates · 414 tests passing

Themes: Architecture · Extraction & Quality · Evaluation Philosophy · What Didn't Work

Why publish this

Most projects publish their code. We also publish why the code looks the way it does — every wrong turn, every superseded idea, every decision that survived. The prompts are in the code. The reasoning is in the log. Nothing is hidden.