Version Control for Beliefs

AI memory has the same fundamental problem as code: it regresses. And almost nobody is treating it that way.

In 2005, Linus Torvalds built Git because code has a fundamental problem: it regresses.

You change one function. Something unrelated breaks. Without version history, you can’t trace what happened, when it happened, or who made the decision that caused it. Every meaningful concept in Git — commits, blame, bisect, branching, revert — exists to answer a single question: what changed, and what did it break?

Thirty years of software engineering discipline has been built around this reality. We version everything. We track provenance. We write tests that catch regressions before they ship. We don’t do this because it’s convenient. We do it because the alternative — deploying code with no history, no accountability, and no way to roll back — is engineering malpractice.

AI memory has the same fundamental problem. And almost nobody is treating it that way.

Memory Regresses

When an AI system updates its understanding of a topic — a client’s processing volume changed, a recommendation became outdated, a market condition shifted — that’s a commit. The system’s beliefs have changed.

When two contradictory claims coexist in memory with no mechanism to determine which is current — that’s a merge conflict. The system doesn’t know which version of reality to serve.

When the system retrieves a stale recommendation and presents it with the same confidence as a fresh one — that’s a regression. Something that was once correct has silently become wrong, and nothing in the system caught it.

If you built software without version control, you’d be laughed out of the room. But right now, the entire AI memory ecosystem operates this way. Beliefs get overwritten or appended with no history. Contradictions accumulate with no resolution mechanism. Stale information gets served with no provenance trail. And when something goes wrong, there’s no git log to consult — because nobody built one.

This isn’t a missing feature. It’s a missing engineering discipline.

The Accountability Gap

Here’s why this matters beyond technical elegance.

When a human analyst makes a recommendation based on stale data, there’s a name attached to that mistake. There are professional consequences. The analyst can be asked: what were you looking at? When did you last verify this? Why didn’t you catch the change? That accountability — imperfect, sometimes unfair, but real — is what keeps the system honest. Not the accuracy. The traceability.

When you automate that same process — and the trajectory of the industry says we will, across financial services, legal analysis, healthcare, operations — the accountability doesn’t transfer automatically. It just disappears.

The system serves a stale recommendation. The client acts on it. When it goes wrong, there’s no trail, no versioning, no way to reconstruct what the system believed at the time of the decision and why. The mistake is untraceable. And an untraceable mistake made by a system that’s trusted more than a human — because it’s “the AI,” because it seems authoritative, because it doesn’t hedge the way a nervous junior analyst might — is categorically worse than a traceable mistake made by a person.

Automation without accountability trails isn’t efficiency. It’s an abdication of responsibility that we’ll regret at scale.

The Multi-Agent Blast Radius

This problem compounds dramatically when you move from single-agent to multi-agent systems — which is where the industry is heading and where billions of dollars are currently pointed.

Here’s the failure mode nobody has solved cleanly: Agent A is tasked with research. It hallucinates a data point — not egregiously, just a minor confabulation. It passes its output to Agent B, which is tasked with analysis. Agent B treats Agent A’s output as ground truth, because it has no mechanism to evaluate provenance. Agent B builds a structured analysis on that foundation and passes it to Agent C for action planning.

By the time you’re three agents deep, a minor confabulation in step one has become a confident, detailed, internally consistent action plan built on something that was never true. Each agent in the chain added structure, detail, and coherence to the original error. The output looks more authoritative the further it gets from the source — because fluent language and logical structure are exactly what language models are good at producing, regardless of whether the underlying claims are grounded.

That’s the blast radius problem. It’s not linear. It’s exponential. Every agent that acts on unverified output adds a layer of false confidence. And there’s no circuit breaker — no point in the chain where an agent stops and asks: what is the provenance of what I just received? How confident should I be in this input? Has this information been verified against any ground truth?
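The missing circuit breaker is conceptually simple. As a sketch only — the field names (`provenance`, `confidence`) and the threshold are illustrative assumptions, not a description of any existing framework — a handoff gate between agents might look like this:

```python
def circuit_breaker(message: dict, min_confidence: float = 0.7):
    """Handoff gate: refuse to treat an incoming message as ground truth
    unless it carries provenance and clears a confidence threshold.

    NOTE: field names and the 0.7 threshold are hypothetical examples.
    """
    if not message.get("provenance"):
        return False, "no provenance: refusing to treat input as ground truth"
    confidence = message.get("confidence", 0.0)
    if confidence < min_confidence:
        return False, f"confidence {confidence:.2f} below threshold"
    return True, "ok"
```

An agent that ran every inbound message through a gate like this would stop the chain at step two instead of amplifying the error to step three.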

This is why most serious multi-agent deployments are stuck in demo mode. Not because the individual agents aren’t capable. They demonstrably are. But because there’s no shared epistemic layer — no common substrate of verified, attributed, confidence-scored claims — that lets agents evaluate what they’re working with before they act on it.

The orchestration layer exists. CrewAI, AutoGen, LangGraph — the industry has built sophisticated systems for routing tasks between agents. What it hasn’t built is the infrastructure that makes those handoffs trustworthy. We’ve built the highways but skipped the traffic lights.

What Version Control for Beliefs Actually Looks Like

If you accept the premise — that AI memory needs the same engineering discipline we gave code — then the requirements become clear.

Commits, not overwrites. When a belief changes, the old version doesn’t get deleted. It gets linked. The new claim carries a reference to the old one. The old claim carries a reference forward. The full chain is walkable. You can trace how the system’s understanding of any topic evolved, through every version of every belief it has held. In code, this is git log. In memory, it’s a supersession chain.
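A supersession chain can be sketched in a few lines. This is an illustrative minimal model, not our implementation — the class and field names are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Claim:
    """One version of a belief. Old versions are linked, never deleted."""
    id: str
    text: str
    supersedes: Optional[str] = None      # id of the claim this one replaces
    superseded_by: Optional[str] = None   # set when a newer version commits

class BeliefStore:
    def __init__(self):
        self.claims = {}

    def commit(self, new: Claim) -> None:
        """Like `git commit`: link the new version to the old, keep both."""
        if new.supersedes:
            self.claims[new.supersedes].superseded_by = new.id
        self.claims[new.id] = new

    def log(self, claim_id: str) -> list:
        """Like `git log`: walk the supersession chain back to the origin."""
        chain, cur = [], self.claims.get(claim_id)
        while cur is not None:
            chain.append(cur)
            cur = self.claims.get(cur.supersedes) if cur.supersedes else None
        return chain
```

The point of the sketch is the bidirectional link: the new claim points back, the old claim points forward, and the whole history stays walkable.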

Blame and provenance. Every claim in the system knows where it came from — which source document, which conversation, which data feed. When a downstream decision goes wrong, you don’t search through logs hoping to find the root cause. You trace the claim back to its origin. In code, this is git blame. In memory, it’s source attribution with artifact linking.
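Source attribution is just a mandatory provenance record on every claim. A minimal sketch, with hypothetical field names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Provenance:
    source_id: str    # e.g. a document, conversation, or data-feed identifier
    locator: str      # where in the source: page, turn, record
    ingested_at: str  # ISO timestamp of ingestion

@dataclass(frozen=True)
class AttributedClaim:
    id: str
    text: str
    provenance: Provenance  # required: a claim cannot exist without an origin

def blame(claim: AttributedClaim) -> str:
    """Like `git blame`: answer 'where did this belief come from?'"""
    p = claim.provenance
    return f"{claim.id}: from {p.source_id} @ {p.locator} (ingested {p.ingested_at})"
```

Making the provenance field non-optional is the whole discipline: the system structurally cannot hold a belief it can’t account for.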

Test coverage for knowledge. The system doesn’t just return results — it classifies every query as covered, partially covered, or uncovered. It explicitly reports what it doesn’t know. This is the equivalent of code coverage reporting: not a guarantee that what’s covered is correct, but a clear signal of where the gaps are. Most AI systems return their best guess and let confidence scores do the hedging. That’s like shipping code with no test suite and hoping the type system catches everything.
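The coverage classification itself is a small decision, however the topics are extracted. A toy sketch using plain topic sets — real systems would derive these from retrieval, but the three-way verdict is the point:

```python
from enum import Enum

class Coverage(Enum):
    COVERED = "covered"
    PARTIAL = "partially covered"
    UNCOVERED = "uncovered"

def classify_coverage(query_topics: set, known_topics: set) -> Coverage:
    """Report what the system knows versus what the query asks for,
    instead of returning a best guess with a confidence score."""
    hits = query_topics & known_topics
    if not hits:
        return Coverage.UNCOVERED
    if hits == query_topics:
        return Coverage.COVERED
    return Coverage.PARTIAL
```

The value isn’t in the set arithmetic; it’s in forcing every answer to carry an explicit verdict about its own gaps.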

Regression detection. Claims have time-to-live values. When they expire, they become ineligible for recall — not deleted, but gated. The system forgets on purpose, on schedule, with a record of what it forgot and why. Stale information doesn’t silently persist. It ages out through a mechanism that’s inspectable and auditable. In code, this is automated deprecation. In memory, it’s TTL-based lifecycle management.
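TTL gating is a filter at recall time, not a delete. A sketch with hypothetical claim records:

```python
from datetime import datetime, timedelta, timezone

def recall(claims: list, now: datetime = None):
    """TTL gate: expired claims become ineligible for recall, not deleted.

    Returns (live, aged_out) so there is always a record of what the
    system forgot and when. Claim shape here is an illustrative dict
    with an `expires_at` timestamp.
    """
    now = now or datetime.now(timezone.utc)
    live, aged_out = [], []
    for claim in claims:
        (live if claim["expires_at"] > now else aged_out).append(claim)
    return live, aged_out
```

Because expired claims land in an inspectable bucket instead of vanishing, the forgetting itself becomes auditable.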

Merge conflict resolution. When contradictory claims exist, the system doesn’t silently pick one. The contradiction is visible, the supersession chain documents the resolution, and the confidence differential between competing claims is explicit. In code, Git forces you to resolve merge conflicts before committing. Memory systems should do the same.
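The resolution rule can be made explicit rather than implicit. A sketch, assuming hypothetical claim dicts with confidence scores and an illustrative margin threshold:

```python
def resolve_contradiction(a: dict, b: dict, min_margin: float = 0.2):
    """Resolve two contradictory claims only when the confidence
    differential is explicit and large enough; otherwise surface
    the conflict instead of silently picking a winner.

    NOTE: the 0.2 margin is an illustrative threshold, not a standard.
    """
    margin = abs(a["confidence"] - b["confidence"])
    if margin < min_margin:
        # Like an unresolved merge conflict: block, don't guess
        return {"status": "conflict", "claims": [a["id"], b["id"]], "margin": margin}
    winner, loser = (a, b) if a["confidence"] > b["confidence"] else (b, a)
    return {"status": "resolved", "kept": winner["id"],
            "superseded": loser["id"], "margin": margin}
```

Either outcome leaves a record: a documented resolution with its margin, or a visible conflict demanding attention — never a silent overwrite.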

The Gap Between Chat and Engineering

There’s a spectrum of AI interaction that the industry hasn’t clearly articulated yet.

On one end, you have frontier chat models. They’re impressive. They’re stateless. You can’t manage context. You have limited control over memory handling. You have zero visibility into why the system said what it said. For consumer use cases — asking questions, drafting content, exploring ideas — this is fine. The stakes are low and the cost of a wrong answer is negligible.

On the other end — where the industry needs to go for anything consequential — you have AI systems operated with genuine engineering discipline. Versioned memory. Provenance tracking. Confidence scoring. Coverage reporting. Accountability trails. Domain isolation. Regression detection.

The gap between these two ends is where most of the value and most of the risk currently lives. Businesses are making real decisions — financial, operational, strategic — based on AI systems that operate closer to the chat end of the spectrum than the engineering end. They’re writing production code in a text editor with no version control and hoping it works out.

For a consumer asking recipe questions, that’s acceptable. For a consultancy making recommendations that affect a client’s financial decisions, it’s negligent. For multi-agent systems handling complex workflows with real-world consequences, it’s a liability that hasn’t materialized at scale yet — but will.

The Governance Problem Nobody Wants to Talk About

There’s a growing class of AI consultancy that deploys multi-model pipelines, agentic workflows, and client-facing automation without having invested in any of the infrastructure described above. Client data — including personally identifiable information, financial records, operational details — gets fed into frontier models with no domain isolation, no provenance tracking, no retention controls, and no audit trail.

The consultants building these systems often have no concept of what “blast radius” means in this context. They’re optimizing for capability (what the system can do) without engineering for accountability (what happens when it does the wrong thing). The assumption, usually unstated, is that the model is reliable enough that governance can come later.

That assumption will be tested. And when it is, the organizations that fed client PII into uncontrolled pipelines won’t be able to reconstruct what data went where, what the system did with it, or how a bad output got generated. The accountability trail doesn’t exist because nobody built one. The blast radius is unknown because nobody measured it. The data governance is a policy document, not an architecture.

We made a different choice. We’re investing in the governance and integrity layers — domain isolation, PII locality, sanitization pipelines, supersession chains, coverage classification — before deploying agentic capabilities. Not because we can’t build the flashy stuff. Because deploying the flashy stuff without the foundation underneath it is how you create liabilities that compound silently until they become crises.

This is slower. It means we have less to show on a demo screen right now. It also means that when we do deploy, we can trace every belief the system holds, explain where it came from, prove that client data stayed where it was supposed to, and reconstruct the reasoning behind any output. That’s not a feature. That’s the minimum standard for responsible deployment — and almost nobody is meeting it.

Where We Are

We’re building this. Not theoretically — practically, for our own operations, as the epistemic foundation for every AI capability we plan to deploy.

We’re not naive about the difficulty. Memory regression is a harder problem than code regression because beliefs are fuzzier than functions, confidence is harder to measure than test pass rates, and the equivalent of “this build is broken” is much less obvious when the output is natural language instead of a stack trace.

But the alternative — deploying AI systems that can’t trace their own reasoning, can’t tell you what they used to believe, can’t report on what they don’t know, and can’t contain the blast radius of a single hallucination across a multi-agent chain — isn’t an acceptable long-term position.

The industry will arrive at this conclusion eventually. The question is whether it arrives there before or after a consequential failure makes the lesson unavoidable.

The Nugget: “Automation without accountability trails isn’t efficiency. It’s an abdication of responsibility that we’ll regret at scale.”

We’d rather build the discipline now than explain why we didn’t later.


For the technical implementation details of how we’re building this, read Build Note 02: A System That Knows What It Knows →

Hatim Zavery is the founder of Telos One, a Canadian consultancy specializing in payment processing, cybersecurity, and web development.