
The Governance Gap: Why AI Ethics Frameworks Don't Survive Contact with Production

Dan M · 13 May 2025 · 14 min read

Every enterprise we studied had an AI ethics framework. None had successfully translated it into production-level decision-making. The gap isn't cynicism — it's structural. Ethics frameworks and production systems operate on different planes.

Frameworks everywhere, governance nowhere

The AI ethics industry is booming. Every consultancy publishes principles. Every major enterprise has adopted a framework. Responsible AI, Trustworthy AI, Ethical AI — the labels vary, the structure is similar: a set of principles (fairness, transparency, accountability, privacy), a governance committee, a review process, and a set of documentation requirements.

On paper, the frameworks look comprehensive. In production, they’re largely inert. We studied AI governance practices across 12 enterprises with published AI ethics frameworks. In every case, the framework existed. In no case could we identify a production decision that had turned out differently because of it.

This isn’t because the organisations are cynical about ethics. Most of the people we spoke with genuinely cared about responsible AI. The problem is structural: ethics frameworks and production systems operate at different levels of abstraction, at different speeds, and in different organisational locations.

The three structural gaps

1. The abstraction gap

Ethics frameworks operate at the principle level: “our AI systems should be fair,” “decisions should be explainable,” “bias should be mitigated.” These are meaningful commitments. But they’re not actionable by the engineer writing a feature, the data scientist training a model, or the product manager prioritising a backlog.

The translation from “AI should be fair” to “this specific model, trained on this specific data, deployed in this specific context, should produce outputs that satisfy these specific fairness criteria, measured in this specific way” requires an enormous amount of contextual judgment. And the framework doesn’t help with any of it.

The gap between the principle and the implementation is where governance actually happens. Most organisations haven’t built anything to fill it.
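
To make the gap concrete, here is a minimal sketch of what the implementation end of one such translation might look like. The four-fifths-rule threshold of 0.8, the group labels, and the helper functions are illustrative assumptions, not criteria drawn from any of the frameworks we studied.

```python
# A hedged sketch of one possible translation of "AI should be fair" into a
# concrete, testable check. The 0.8 threshold (the "four-fifths rule"), the
# group labels, and the data shapes are illustrative assumptions.
import numpy as np

def selection_rate(predictions: np.ndarray) -> float:
    """Fraction of positive decisions in a set of binary predictions."""
    return float(np.mean(predictions == 1))

def disparate_impact_ratio(preds_a: np.ndarray, preds_b: np.ndarray) -> float:
    """Ratio of selection rates between two groups (lower rate / higher rate)."""
    rate_a, rate_b = selection_rate(preds_a), selection_rate(preds_b)
    low, high = min(rate_a, rate_b), max(rate_a, rate_b)
    return low / high if high > 0 else 1.0

def fairness_gate(preds_a: np.ndarray, preds_b: np.ndarray, threshold: float = 0.8) -> bool:
    """Return True if the model clears this specific fairness criterion."""
    return disparate_impact_ratio(preds_a, preds_b) >= threshold

# Example: model outputs for two demographic groups in one deployment context.
group_a = np.array([1, 0, 1, 1, 0, 1])
group_b = np.array([1, 0, 0, 0, 0, 1])
print(fairness_gate(group_a, group_b))  # False: 0.33 / 0.67 = 0.5, below 0.8
```

Every choice in that sketch, which groups to compare, which metric to use, which threshold counts as acceptable, is exactly the contextual judgment the framework leaves unmade.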

2. The velocity gap

Ethics reviews operate on committee cadence: monthly meetings, quarterly assessments, annual audits. Production operates on sprint cadence: daily deployments, weekly iterations, continuous model updates. By the time the ethics committee reviews a model, the model may have been retrained, the training data may have changed, and the deployment context may have shifted.

We observed a pattern we call “governance by snapshot” — the ethics review captures the state of the system at a point in time, approves it, and moves on. The system continues to evolve. The approval doesn’t. The governance applies to a version of the system that may no longer exist.
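
A minimal sketch of why the snapshot goes stale: if the approval record pins the artefact the committee actually reviewed (the record fields and hashes below are hypothetical), a trivial comparison at deploy time shows whether the approval still describes the system that is running.

```python
# Illustrative sketch of "governance by snapshot": the approval record pins the
# exact artefact the committee reviewed; the running system has moved on.
# The record fields and hashes are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class EthicsApproval:
    model_hash: str              # hash of the model artefact the committee reviewed
    training_data_version: str
    approved_on: str             # date of the quarterly review

def approval_still_applies(approval: EthicsApproval,
                           deployed_model_hash: str,
                           current_data_version: str) -> bool:
    """True only if the approved snapshot matches what is running now."""
    return (approval.model_hash == deployed_model_hash
            and approval.training_data_version == current_data_version)

approval = EthicsApproval("sha256:ab12", "2025-Q1", "2025-03-14")
# Three retrains and one data refresh later:
print(approval_still_applies(approval, "sha256:9f3c", "2025-Q2"))  # False
```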

3. The location gap

Ethics committees typically sit outside the engineering organisation — in legal, compliance, or a dedicated responsible AI function. The decisions they need to influence happen inside the engineering organisation — in code reviews, architecture decisions, data selection, threshold settings.

The people making production decisions and the people making governance decisions are separated by organisational boundaries, reporting lines, and professional vocabularies. The ethics committee speaks in principles. The engineering team speaks in code. Neither has built the translation layer between them.

The governance gap isn’t a gap in intentions. It’s a gap in organisational architecture. The ethics framework and the production system exist in different parts of the organisation, operate at different speeds, and speak different languages.

What production-grade governance looks like

The two organisations in our sample that had effective AI governance shared structural characteristics that the others lacked:

Embedded governance. Rather than centralising ethics review in a committee, they embedded governance-trained engineers in production teams. These people participated in daily standups, reviewed pull requests, and influenced architecture decisions in real time. Governance happened in the flow of work, not in a separate review cycle.

Operationalised principles. They translated each principle into specific, testable criteria for each deployment context. “Fairness” became a set of statistical tests applied to model outputs before each deployment. “Transparency” became a documentation standard that was part of the definition of done for every model release. “Accountability” became a logging requirement that created an audit trail for every automated decision.
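
As one hedged illustration of the accountability translation, the sketch below wraps an automated decision so that every call leaves an audit record. The field names, the JSON-lines destination and the model identifiers are assumptions for illustration, not the logging standard either organisation used.

```python
# Hedged sketch of "accountability as a logging requirement": every automated
# decision leaves an audit record. Field names, the JSON-lines file, and the
# model identifier are illustrative assumptions.
import json
import uuid
from datetime import datetime, timezone

AUDIT_LOG_PATH = "decision_audit.jsonl"  # hypothetical destination

def audited_decision(model_id: str, model_version: str, features: dict,
                     score: float, threshold: float = 0.5) -> bool:
    """Make a threshold decision and append an audit record for it."""
    decision = score >= threshold
    record = {
        "decision_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_id": model_id,
        "model_version": model_version,
        "inputs": features,
        "score": score,
        "threshold": threshold,
        "decision": decision,
    }
    with open(AUDIT_LOG_PATH, "a") as f:
        f.write(json.dumps(record) + "\n")
    return decision

# Example: one automated decision, one audit line.
audited_decision("credit-risk", "v42", {"income": 54000, "tenure_months": 18}, score=0.61)
```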

Continuous monitoring. They treated governance as a runtime concern, not a design-time concern. Models were monitored in production for drift, bias emergence, and performance degradation. Alerts triggered reviews. The governance didn’t apply to the model that was approved — it applied to the model that was running.
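
A sketch of what a runtime drift check might look like, assuming a population stability index (PSI) comparison between training-time and production distributions; the 0.2 alert threshold and the binning scheme are common heuristics, not figures reported by the organisations we studied.

```python
# Hedged sketch of governance as a runtime concern: a population stability
# index (PSI) check between training-time and production feature distributions.
# The 0.2 alert threshold and the binning are assumed for illustration.
import numpy as np

def population_stability_index(reference: np.ndarray, production: np.ndarray,
                               bins: int = 10) -> float:
    """PSI between two samples of one feature; higher means more drift."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    prod_pct = np.histogram(production, bins=edges)[0] / len(production)
    ref_pct = np.clip(ref_pct, 1e-6, None)    # avoid log(0)
    prod_pct = np.clip(prod_pct, 1e-6, None)
    return float(np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct)))

def drift_alert(reference: np.ndarray, production: np.ndarray,
                threshold: float = 0.2) -> bool:
    """True if drift exceeds the threshold and a governance review should be triggered."""
    return population_stability_index(reference, production) > threshold

rng = np.random.default_rng(0)
training_sample = rng.normal(0.0, 1.0, 5_000)
live_sample = rng.normal(0.6, 1.0, 5_000)    # the population has shifted
print(drift_alert(training_sample, live_sample))  # True: triggers a review
```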

Failure budgets. They adopted the concept of error budgets from site reliability engineering. Each AI system had an explicit tolerance for governance-relevant failures — a threshold below which issues were addressed in normal workflow, and above which deployment was paused. This made governance quantitative and actionable rather than aspirational.
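
A minimal sketch of how such a governance failure budget could be tracked; the 0.1% budget, the window, and the class names are illustrative assumptions rather than figures from the study.

```python
# Hedged sketch of a governance failure budget, borrowed from SRE error budgets.
# The budget of 0.1% of decisions per window is an illustrative number.
from dataclasses import dataclass

@dataclass
class GovernanceBudget:
    allowed_failure_rate: float   # e.g. 0.001 = 0.1% of automated decisions
    decisions_in_window: int = 0
    governance_failures: int = 0  # e.g. fairness-test breaches, unexplained decisions

    def record(self, decisions: int, failures: int) -> None:
        self.decisions_in_window += decisions
        self.governance_failures += failures

    @property
    def budget_exhausted(self) -> bool:
        if self.decisions_in_window == 0:
            return False
        return (self.governance_failures / self.decisions_in_window) > self.allowed_failure_rate

    def deployment_allowed(self) -> bool:
        """Below budget: handle issues in normal workflow. Above: pause deployments."""
        return not self.budget_exhausted

budget = GovernanceBudget(allowed_failure_rate=0.001)
budget.record(decisions=120_000, failures=95)   # within budget so far
budget.record(decisions=30_000, failures=110)   # a bad week pushes it over
print(budget.deployment_allowed())  # False: deployments pause until the budget recovers
```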

The common thread: governance was built into the production system, not layered on top of it. It operated at the same speed, in the same location, and in the same language as the engineering decisions it was meant to influence. This required investment in people, tooling, and process — not just a framework document and a committee.