RESEARCH

Agent observability is not governance.

The research thread that shaped this framework: why log files don't satisfy regulators, why rubber-stamping is the failure mode that matters, and what changes when you treat the brief as the artifact.

The problem we set out to solve

In late 2025 we interviewed operators at six mid-market companies running production agent systems — supply chain, underwriting, claims adjudication, programmatic media buying. Every team had observability. LangSmith traces, dashboards, error budgets, the works. Every team had also been asked, at least once, to explain a decision to a regulator, auditor, or board. Not one of them found their observability stack useful for that conversation.

The mismatch was structural. Observability is built for the engineer debugging an agent at 3 AM: it surfaces the path the agent took. A regulator asks different questions. Was a human aware of this decision at the time it was made? Did that human exercise judgment? Is there an artifact that proves it?

Rubber-stamping is the failure mode

In every team we spoke to, the pattern was the same. An agent surfaced a proposal in a Slack thread or a dashboard card. An operator clicked Approve. The approval was logged. Three months later, somebody looked at the record and realised the operator had clicked Approve in less than two seconds — less time than it takes to read the brief.

No regulator who asks the next question is satisfied with "the system recorded their approval". The question is whether a human exercised judgment. That requires a system that makes not exercising judgment harder than exercising it. Hence the gate.

Figure: proposal card with the Approve button disabled until the Strategic Context block is opened. The gate: Approve is disabled on every consequential card until the user has opened the Strategic Context block. The dot on CONSEQUENTIAL is amber. The red 'Required to enable approve' label is the visible promise the system makes.
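The gating rule described above can be sketched in a few lines. This is a minimal illustration, not the Accountability Interface's actual code: the names `ProposalCard`, `open_strategic_context`, `approve_enabled`, and `GateClosedError` are hypothetical, and the sketch assumes that non-consequential cards are not gated, which the text implies but does not state.

```python
from dataclasses import dataclass


class GateClosedError(Exception):
    """Raised when approval is attempted before the context block was opened."""


@dataclass
class ProposalCard:
    # All field names are hypothetical, chosen for illustration only.
    proposal_id: str
    consequential: bool
    context_opened: bool = False

    def open_strategic_context(self) -> None:
        # Record that the operator opened the Strategic Context block.
        self.context_opened = True

    @property
    def approve_enabled(self) -> bool:
        # Assumption: only consequential cards are gated.
        return (not self.consequential) or self.context_opened

    def approve(self, operator: str) -> dict:
        # Enforce the gate at decision time, not only in the UI.
        if not self.approve_enabled:
            raise GateClosedError("Required to enable approve: open Strategic Context")
        return {"proposal_id": self.proposal_id, "approved_by": operator}


card = ProposalCard(proposal_id="p-001", consequential=True)
print(card.approve_enabled)    # False until the context block is opened
card.open_strategic_context()
print(card.approve_enabled)    # True
```

Checking the gate inside `approve` as well as exposing `approve_enabled` to the UI means a disabled button is not the only thing standing between a click and a logged approval.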

The brief is the artifact

The thing audit needs six years later isn't the agent's logs. It's the brief — the exact framing the human was looking at when they decided. What the agent proposed. What it claimed were the alternatives. Which risk flags it surfaced. What confidence it gave itself.

In the Accountability Interface, that brief is captured verbatim and sealed with a SHA-256 hash at the moment of decision, so any later change to the stored bytes is detectable. Six years later, the record renders the same page the operator saw. Not a reconstruction. Not a summary. The exact bytes.
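Sealing and later verifying a brief can be sketched with Python's standard `hashlib`. The function names `seal_brief` and `verify_brief` and the sample brief text are hypothetical; the source specifies only that the brief is captured verbatim and hashed with SHA-256.

```python
import hashlib
import hmac


def seal_brief(brief_bytes: bytes) -> str:
    """Return the SHA-256 hex digest of the exact bytes the operator saw."""
    return hashlib.sha256(brief_bytes).hexdigest()


def verify_brief(brief_bytes: bytes, sealed_hash: str) -> bool:
    """Check that the stored brief still matches the hash sealed at decision time."""
    return hmac.compare_digest(seal_brief(brief_bytes), sealed_hash)


# Hypothetical brief content, for illustration only.
brief = "Proposal: switch supplier. Alternatives: none. Confidence: 0.8".encode("utf-8")
sealed = seal_brief(brief)

print(verify_brief(brief, sealed))                  # True: bytes unchanged
print(verify_brief(brief + b" (edited)", sealed))   # False: any edit breaks the seal
```

The hash is taken over bytes, not over a parsed or re-rendered structure, which is what lets the record claim "the exact bytes" rather than a reconstruction.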

The four-manifest principle

An agent's reasoning is only as defensible as the rules it reasoned against. Records that don't carry the rule-version at decision time aren't defensible — you can always argue that the rules were different then.

Every decision record in this system carries manifest_version_at_decision. When the procurement manifest moves from v2026.04 to v2026.05, the decisions made under v2026.04 stay attached to v2026.04. No retroactive reinterpretation.
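One way to picture this is an immutable record that pins the manifest version at write time. Only the field name `manifest_version_at_decision` comes from the text above; the `DecisionRecord` class, its other fields, and the `decisions_under` helper are assumptions made for this sketch.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class DecisionRecord:
    # decision_id and outcome are hypothetical fields for illustration.
    decision_id: str
    outcome: str
    manifest_version_at_decision: str


def decisions_under(records, version):
    """Return the records decided under a given manifest version."""
    return [r for r in records if r.manifest_version_at_decision == version]


records = [
    DecisionRecord("d-1", "approved", "v2026.04"),
    DecisionRecord("d-2", "rejected", "v2026.04"),
    DecisionRecord("d-3", "approved", "v2026.05"),
]

# Decisions made under v2026.04 stay attached to v2026.04.
print([r.decision_id for r in decisions_under(records, "v2026.04")])

# A frozen dataclass refuses retroactive changes to the pinned version.
try:
    records[0].manifest_version_at_decision = "v2026.05"
except FrozenInstanceError:
    print("record is immutable")
```

Freezing the record in memory only illustrates the intent; a real store would also need write-once guarantees at the persistence layer.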

Multi-system attestation

In a typical agent stack, multiple systems contribute to a single decision. TX-1 detects the infeasibility. SS-1 confirms the manifest version. Swarm Lite rejects an alternative the other two systems didn't consider.

Most observability stacks attribute the decision to whichever system surfaced the final card. The Accountability Interface records all three with their hashes — so an auditor can verify independently that each system co-signed the proposal at the moment it was presented.
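The independent check an auditor might run can be sketched as follows. The text does not say what each system's hash covers, so this sketch assumes every system hashes the same proposal bytes with SHA-256; the `Attestation` class and the `attest` and `verify_attestations` functions are hypothetical. A plain hash shows agreement on content but does not authenticate the signer, so a production design would likely add digital signatures.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Attestation:
    system: str          # e.g. "TX-1", "SS-1", "Swarm Lite"
    proposal_hash: str   # assumption: SHA-256 of the proposal as presented


def attest(system: str, proposal_bytes: bytes) -> Attestation:
    """Record that a system saw exactly these proposal bytes."""
    return Attestation(system, hashlib.sha256(proposal_bytes).hexdigest())


def verify_attestations(proposal_bytes, attestations, required_systems) -> bool:
    """True only if every required system attested to these exact bytes."""
    expected = hashlib.sha256(proposal_bytes).hexdigest()
    matching = {a.system for a in attestations if a.proposal_hash == expected}
    return set(required_systems) <= matching


systems = ["TX-1", "SS-1", "Swarm Lite"]
proposal = b"hypothetical proposal payload"
attestations = [attest(name, proposal) for name in systems]

print(verify_attestations(proposal, attestations, systems))               # True
print(verify_attestations(b"altered payload", attestations, systems))     # False
```

Storing one attestation per contributing system, rather than crediting whichever system surfaced the final card, is what allows each contribution to be checked on its own.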

What we still don't know

We don't yet know how the framework behaves at the volumes real production systems generate — tens of thousands of proposals per day. We don't yet know what the right authority-tier escalation rules look like for industries outside procurement. We don't yet know whether the brief-snapshot approach holds when proposals reference real-time data that may have changed by the time audit looks.

The Phase 3 MVP is built to expose these questions, not to hide them.