Last updated: February 28, 2026

How AI can work in regulated environments

Why 'it worked' isn't enough in regulated industries — and what end-to-end traceability looks like for AI agents

A healthcare provider processes 500 insurance claims per day. Ninety percent of them are routine: the patient ID matches records, coverage is active, and the claim amount is within bounds. An agent handles those. The remaining 10 percent (the edge cases, the odd diagnosis, the claim that looks suspicious) go to a human, who reviews each one and either approves or escalates it.

Now imagine something goes wrong. A claim is denied that should have been approved. The patient appeals. The provider's compliance team needs to prove what happened. Why was the decision made? Who made it—the agent or a human? What information was available at the time? Did the system behave as intended?

That's not a hypothetical. It's the regulatory reality that leads most organizations to talk themselves out of using AI for anything important.

Because here's the hard part: proving an AI decision isn't like proving a human decision. When a human reviewer approves a claim, you can ask them to explain their reasoning. When an agent approves a claim, you need to show, step by step, what data it saw, what rules it applied, and what internal reasoning led to that decision. The audit trail is everything.

The difference between deterministic and probabilistic, and why it matters

Traditional automation—RPA, workflow rules, business logic engines—is deterministic. You feed it input X, it follows rule Y, it produces output Z, every single time. If something goes wrong, you can trace the exact rule that fired, check whether the rule was correct, and fix it.
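To make "trace the exact rule that fired" concrete, here is a minimal sketch of a first-match rule engine. The rule IDs, claim fields, and outcomes are hypothetical and chosen only for illustration. The point is that each decision carries the identifier of the rule that produced it.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    rule_id: str
    description: str
    applies: Callable[[dict], bool]
    outcome: str


# Hypothetical rules, evaluated in order; the first match wins.
RULES = [
    Rule("R1", "coverage inactive", lambda c: not c["coverage_active"], "deny"),
    Rule("R2", "amount over limit", lambda c: c["amount"] > c["limit"], "escalate"),
    Rule("R3", "routine claim", lambda c: True, "approve"),
]


def decide(claim: dict) -> tuple[str, str]:
    """Return (outcome, rule_id) so every decision names the rule that fired."""
    for rule in RULES:
        if rule.applies(claim):
            return rule.outcome, rule.rule_id
    raise ValueError("no rule matched")


claim = {"coverage_active": True, "amount": 1200, "limit": 5000}
print(decide(claim))  # ('approve', 'R3')
```

Running the same claim again returns the same pair, which is the property the rest of this article tries to approximate for agents.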

Computer-use AI is probabilistic. It is built on large language models (LLMs), which excel at handling ambiguity. They pick up on the context a human would notice, like an unusual diagnosis that might warrant closer review. But by their nature, they can hallucinate, and the same input can produce different outputs. They trade the rigid if-then certainty of traditional rules for flexibility and judgment, and that flexibility means you can't simply trace a decision back to "rule 47 fired."

In regulated industries—healthcare, financial services, insurance, government—that probabilistic nature creates a problem. Regulators want "repeatability." If I run the same claim through the system twice, I should get the same decision. LLMs don't guarantee that.
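One way to put a number on repeatability is to run the same claim through a decision function several times and measure how often the outputs agree. The sketch below is illustrative only: `make_sampled_agent` is a hypothetical stand-in that uses a random number generator in place of a real LLM call, and the 80% approval rate is an arbitrary assumption.

```python
import random
from collections import Counter
from typing import Callable


def repeatability(decide: Callable[[dict], str], claim: dict, runs: int = 20) -> float:
    """Fraction of runs that agree with the most common decision (1.0 = fully repeatable)."""
    counts = Counter(decide(claim) for _ in range(runs))
    return counts.most_common(1)[0][1] / runs


def rule_based(claim: dict) -> str:
    # Deterministic: the same input always yields the same output.
    return "approve" if claim["amount"] <= claim["limit"] else "escalate"


def make_sampled_agent(seed: int) -> Callable[[dict], str]:
    # Hypothetical stand-in for a sampled LLM call; it ignores the claim
    # and approves about 80% of the time.
    rng = random.Random(seed)

    def agent(claim: dict) -> str:
        return "approve" if rng.random() < 0.8 else "escalate"

    return agent


claim = {"amount": 1200, "limit": 5000}
print(repeatability(rule_based, claim))             # 1.0
print(repeatability(make_sampled_agent(7), claim))  # usually below 1.0
```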

So the question becomes: how do you use AI's judgment while maintaining the auditability that regulators demand?

Golden data: the audit trail's foundation

One term that's starting to emerge in compliance circles is "golden data"—data that is unfiltered, unadulterated, and complete. Not summarized. Not pre-processed. Not filtered through someone's assumptions about what matters. The raw event logs, the full context, the complete decision artifact.

The reason this matters: if an agent makes a decision and later that decision is questioned, you need to reconstruct the exact information the agent had at that moment. Not information from three months later when someone updated the system. Not a summary of the case. The actual, timestamped, immutable record of what the agent saw and what it chose to do.
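A minimal sketch of a timestamped, tamper-evident record, assuming a simple hash chain over JSON-serializable inputs. The field names and the in-memory list are illustrative assumptions. A production system would also need write-once storage and access controls, which this does not provide.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # placeholder "previous hash" for the first record


def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys) so the same content always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_record(log: list, inputs: dict, decision: str, actor: str) -> dict:
    """Append a record whose hash covers its content and the previous record's hash."""
    body = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        # Snapshot the inputs so later changes by the caller don't alter the record.
        "inputs": json.loads(json.dumps(inputs)),
        "decision": decision,
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    record = {**body, "hash": _digest(body)}
    log.append(record)
    return record


def verify(log: list) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev = GENESIS
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        if record["prev_hash"] != prev or record["hash"] != _digest(body):
            return False
        prev = record["hash"]
    return True
```

Editing any stored field after the fact, for example changing a claim amount in an earlier record, causes `verify` to return `False`. That gives an auditor a cheap check that the record still matches what the agent saw.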

This is where most organizations currently fail. They run agents in production, they get results, and they keep minimal records. When an audit happens, they realize they can't actually explain what the system did.

Confidence scoring as part of the chain

Here's an emerging best practice: agents shouldn't just make decisions. They should rate their own confidence.

When an agent approves a claim, it should emit not just "approved" but "approved with 94% confidence." When it sees something unusual, it should say "this request is outside my normal pattern; 47% confidence; flagging for review." This confidence score becomes part of the audit chain.

This serves two purposes. First, it's a practical signal for routing. Decisions at 95% confidence or higher can be processed automatically. Decisions from 70% up to 95% go to a human reviewer. Anything below 70% is escalated. The system routes based on its own assessment of risk.
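The routing rule described above can be sketched as a small function. The thresholds come from the text. The route names and the choice to treat exactly 95% as automatic and exactly 70% as human review are assumptions.

```python
AUTO_THRESHOLD = 0.95    # at or above: process automatically
REVIEW_THRESHOLD = 0.70  # at or above (but below AUTO_THRESHOLD): human review


def route(confidence: float) -> str:
    """Map an agent's self-reported confidence (0.0 to 1.0) to a handling route."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence >= AUTO_THRESHOLD:
        return "auto_process"
    if confidence >= REVIEW_THRESHOLD:
        return "human_review"
    return "escalate"


print(route(0.94))  # human_review
print(route(0.47))  # escalate
```

Keeping the thresholds as named constants means the routing policy itself can be versioned and shown to an auditor alongside the decisions it produced.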

Second, it's an audit artifact. Three years from now, when someone questions why a decision was made, you can show: "The agent reviewed this with standard inputs, assessed high confidence, and processed it accordingly."

The gap between "it worked" and "we can prove it worked"

Many organizations deploying agents are currently operating in a blind spot. They measure success by operational metrics: cases processed, time saved, error rate. Those are real benefits. But "it worked operationally" is not the same as "we can defend this decision in a regulatory audit."

The companies winning here are the ones separating two concerns. Operational success is one layer. Auditability is another. Both require deliberate design.

Continuous auditing replaces the annual checklist

Traditionally, compliance happens in bursts. You run an annual audit. An external firm reviews a sample of decisions. You either pass or you don't. Then you wait another year.

With agents, that model breaks. You're making decisions constantly. Your audit can't be annual—it has to be continuous. Every decision gets logged, every decision is subject to review, every decision creates an artifact.

This is actually powerful if you design for it. Instead of an annual "did we mess up?" moment, you have continuous "here's what happened and here's why." Anomalies surface automatically.
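As a rough illustration of anomalies surfacing automatically, the sketch below checks every logged decision against a few policy rules instead of sampling once a year. The record fields, the route names, and the three checks are hypothetical. A real deployment would define its own.

```python
from typing import Iterable, Iterator

REQUIRED_FIELDS = ("claim_id", "actor", "decision", "confidence", "route")
AUTO_THRESHOLD = 0.95  # assumed minimum confidence for automatic processing


def audit(records: Iterable[dict]) -> Iterator[tuple[str, str]]:
    """Yield (claim_id, finding) for every record that violates a policy check."""
    for r in records:
        missing = [f for f in REQUIRED_FIELDS if f not in r]
        if missing:
            yield r.get("claim_id", "<unknown>"), "missing fields: " + ", ".join(missing)
            continue
        if r["route"] == "auto_process" and r["confidence"] < AUTO_THRESHOLD:
            yield r["claim_id"], "auto-processed below confidence threshold"
        if r["actor"] == "agent" and r["route"] != "auto_process":
            yield r["claim_id"], "agent finalized a decision routed to a human"


records = [
    {"claim_id": "C1", "actor": "agent", "decision": "approve", "confidence": 0.97, "route": "auto_process"},
    {"claim_id": "C2", "actor": "agent", "decision": "approve", "confidence": 0.81, "route": "auto_process"},
    {"claim_id": "C3", "actor": "human", "decision": "approve"},
]
for claim_id, finding in audit(records):
    print(claim_id, "->", finding)
```

Run on a schedule or on every write, a check like this turns the audit into a stream of findings rather than a yearly event.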

Building for auditability now prevents crises later

If you're planning to run agents in regulated environments, this isn't something to figure out when auditors show up. It's a design requirement from day one.

That means deciding what "golden data" means for your domain. It means instrumenting your agents so they produce audit artifacts: not just outcomes but reasoning, not just decisions but confidence. Plan for continuous auditing, and build systems that assume every decision will be scrutinized.
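One possible shape for that instrumentation is a decorator that refuses to return an agent result unless it includes a decision, a confidence, and a reasoning note, and that records all three with the inputs. Everything here is a hypothetical sketch: `toy_agent` stands in for a real agent, and the in-memory `AUDIT_LOG` stands in for durable storage.

```python
import functools
from datetime import datetime, timezone

AUDIT_LOG: list[dict] = []  # stand-in for durable, append-only storage
REQUIRED_KEYS = ("decision", "confidence", "reasoning")


def audited(agent_fn):
    """Wrap an agent so every call leaves an audit artifact behind."""

    @functools.wraps(agent_fn)
    def wrapper(case: dict) -> dict:
        result = agent_fn(case)
        for key in REQUIRED_KEYS:
            if key not in result:
                raise ValueError(f"agent result missing '{key}'")
        AUDIT_LOG.append({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "agent": agent_fn.__name__,
            "inputs": dict(case),  # shallow copy of what the agent saw
            **{key: result[key] for key in REQUIRED_KEYS},
        })
        return result

    return wrapper


@audited
def toy_agent(case: dict) -> dict:
    # Hypothetical logic; a real agent would call a model here.
    ok = case["coverage_active"] and case["amount"] <= case["limit"]
    return {
        "decision": "approve" if ok else "escalate",
        "confidence": 0.94 if ok else 0.47,
        "reasoning": "coverage active and amount within limit" if ok else "outside normal pattern",
    }


toy_agent({"coverage_active": True, "amount": 1200, "limit": 5000})
print(AUDIT_LOG[-1]["decision"], AUDIT_LOG[-1]["confidence"])  # approve 0.94
```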

The paradox: agents that are built with auditability in mind are often easier to operate at scale, because you catch problems earlier. The ones built to move fast and ask questions later are the ones that tend to blow up.

