
Dark Factories: How the Five Levels of AI Automation Apply Beyond Software
Dark factories - where a specification goes in and finished work comes out autonomously - started in software. This article explains the concept and maps how the same pattern could apply to audit, banking operations, legal, and compliance.
A three-person company called StrongDM is shipping production software - thousands of lines of Rust and Go, tested and deployed - without anyone on the team writing a line of it. No one reviews it either. Their principle is simple: code must not be written by humans, and code must not be reviewed by humans. Three engineers write specifications and evaluate the outcomes. Machines build, test, and ship everything in between.
This model is called a dark factory. The term comes from manufacturing, where a lights-out factory runs with no workers on the floor. In software, it means a specification goes in and working, tested software comes out - autonomously.
The dark factory concept, and the five-level framework that underpins it, was developed for software. In this article we look at how the same pattern could apply to audit, banking operations, legal, and compliance.
Five levels of AI-assisted work
Entrepreneur Dan Shapiro developed a framework for software development describing five levels of AI autonomy, ranging from level zero, where AI merely autocompletes as the human types, to level five, where the AI operates entirely independently. The framework gives a precise way to locate where an organisation actually is, rather than relying on vague claims about being "AI-enabled".
Level zero - AI suggests the next line as you type. The developer is still writing the software; the AI reduces keystrokes.
Level one - The developer gives AI a discrete task and reviews what comes back.
Level two - AI handles changes across multiple files, understanding how parts of the codebase connect. The developer still reads all the output.
Level three - The developer is no longer writing code. They direct AI and review its output at the feature level. Most developers hit their ceiling here, finding it difficult to stop reviewing line by line.
Level four - The developer writes a specification and steps away. AI builds the software and tests it against criteria it can see. The developer reviews outcomes, not code.
Level five - The dark factory. AI is evaluated against scenarios it has never seen - stored separately so it cannot optimise for passing them during development. The evaluation is genuinely independent. No human writes or reviews code.
Most organisations sit between levels one and three. Very few operate at level four or five.
Why most organisations using AI tools are getting slower
A 2025 randomised controlled trial by METR found that experienced developers using AI tools took 19% longer to complete tasks than those working without them. The same developers estimated AI had made them 24% faster. They were wrong about both the direction and the magnitude.
The reason is not that AI tools do not work. Adding a new tool to an existing workflow creates friction before it creates speed. Developers spent time evaluating suggestions, correcting almost-right code, and debugging errors that looked correct but were not. The tool changed. The workflow did not.
Researchers call this the J-curve: productivity falls before it rises. Organisations seeing gains of 25 to 30% or more are not the ones that installed a tool and ran a training day. They redesigned their process around AI from the ground up - how they write specifications, how they review output, how they catch new categories of errors. That kind of change is slow and expensive, which is why it remains uncommon.
StrongDM - the first dark factory
StrongDM built their factory around two principles. First, the specification has to be complete - it is the only instruction the machine receives, and any ambiguity in it produces ambiguity in the output. Second, the evaluation has to be independent. If the AI can see the test criteria while it is building, it will optimise for passing them rather than building something that genuinely works - the same problem as teaching to the test.
To solve this they store their evaluation scenarios separately from the codebase. The AI never sees them during development. When the software is complete, the scenarios test it from the outside, the same way a holdout set in machine learning verifies a model has genuinely learned rather than memorised. They also built simulated versions of every external service the software interacts with - a fake Slack, a fake Jira, a fake Google Drive - so agents can run realistic end-to-end tests without touching real systems or data.
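The build/evaluate separation can be sketched as a holdout check: scenarios live outside the artefact being built and are only applied once the build is complete. A minimal illustration in Python - the spec, scenarios, and build step here are all hypothetical stand-ins, not StrongDM's actual harness:

```python
# Illustrative holdout evaluation: scenarios are stored apart from the code
# under development and only applied after the build is finished.
# All names and scenarios here are hypothetical.

def build_from_spec(spec):
    """Stand-in for the AI build step; returns a trivial implementation."""
    # The "agent" only ever sees the spec, never the scenarios below.
    if spec == "return the sum of a list of numbers":
        return lambda xs: sum(xs)
    raise ValueError("unrecognised spec")

# Holdout scenarios, kept outside the build step. In a real factory these
# would live somewhere the agent cannot read during development.
HOLDOUT_SCENARIOS = [
    {"input": [1, 2, 3], "expected": 6},
    {"input": [], "expected": 0},
    {"input": [-5, 5], "expected": 0},
]

def evaluate(artifact, scenarios):
    """Test the finished artifact from the outside, like an ML holdout set."""
    failures = [s for s in scenarios if artifact(s["input"]) != s["expected"]]
    return len(failures) == 0, failures

artifact = build_from_spec("return the sum of a list of numbers")
passed, failures = evaluate(artifact, HOLDOUT_SCENARIOS)
print("passed" if passed else f"failed: {failures}")  # prints "passed"
```

The point of the structure is that nothing in `build_from_spec` can see `HOLDOUT_SCENARIOS`, so the build cannot optimise for the test.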
A dark factory like StrongDM has no sprints, no standups, no project management boards. Specifications go in, outcomes come out. The coordination layer is absent not because it was cut, but because it no longer serves a purpose.
Companies exhibiting dark factory characteristics
At Anthropic, 90% of the code in Claude Code was written by Claude Code itself. OpenAI's Codex 5.3 was involved in creating itself: earlier builds analysed training logs, flagged failures, and suggested fixes to the process that produced the next version. The tools are improving the tools.
Cursor, the AI code editor, passed $500m in annual revenue with a team of a few dozen people. Midjourney reached a similar revenue level with around 100 employees. Lovable scaled to hundreds of millions in revenue within months of launch. These companies were built around AI from the start - small teams, minimal coordination overhead, and revenue per employee several times the industry average.
Software: the first industry to change
Software is furthest along because the work is already digital, feedback loops are fast, and the signal for whether something works is unambiguous. But there is an important distinction between new systems and existing ones.
Building from scratch is relatively straightforward - the specification describes what needs to exist and the agent builds it. Existing systems are harder. Most production codebases carry a decade of undocumented decisions, workarounds, and institutional knowledge held by the people who wrote them. There is no complete specification because the system itself is the specification - the only full record of what it does is what it does. Before AI agents can take over development of those systems, someone has to reconstruct what exists and why. That work is human, and it is harder than building something new.
As AI handles more implementation, coordination roles lose their purpose. The engineering manager whose job is keeping a team in sync, the scrum master, the release manager, the technical programme manager - these roles were built for a world where humans are doing the building, and human work requires coordination. Remove the humans from implementation and the coordination overhead largely goes with them.
What becomes more valuable is the ability to define clearly what needs to be built, and to evaluate whether what was built is correct. The bottleneck moves from execution to specification. Writing a specification precise enough for an AI agent to execute correctly - without a human filling gaps through conversation - requires a deep understanding of the problem that traditional workflows distributed across many people and many meetings. That understanding now needs to exist, explicitly, before the work begins.
What this means for other industries
The same pattern applies to any industry where structured inputs are processed according to known rules to produce a defined output.
Approval chains, review cycles, and management layers all exist because humans are doing the implementation. As AI handles more of that, the question of which structures remain necessary - and which exist only because of limitations that no longer apply - becomes relevant for every knowledge-work organisation.
Audit
Junior audit work follows the same apprenticeship structure as junior software development. Associates test transactions against controls, vouch source documents, perform analytical procedures, and draft working papers. Seniors review, coach, and sign off. Repeat that across three or four busy seasons and you produce a manager. Repeat it again and you produce a partner. The audit opinion at the top is built on thousands of hours of that structured, procedural work at the bottom.
Audit has one structural difference from software development. A human auditor must sign the opinion. Regulators and professional standards require it - the audit report is a licensed professional's attestation, not a software output. But that requirement does not protect the work that produces the opinion. A partner can sign a report while the testing, evidence-gathering, and documentation underneath it is largely automated. The signature requirement protects the final judgement. It does not require the work below it to be done by people.
The major firms are already moving in this direction. Deloitte has integrated agentic AI into its Omnia audit platform, with machine-learning tools reported to verify 100% of vendor invoices against ledgers rather than relying on sampling - a fundamental shift from the statistical sampling that has defined substantive testing for decades. KPMG uses AI to scan millions of accounting entries and flag anomalies for human review, extending coverage far beyond what a team of associates could achieve manually. EY's AI is reported to assist 80,000 tax professionals across more than 3 million compliance cases annually.
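The shift from sampling to full-population testing can be sketched as matching every invoice against the ledger and collecting the exceptions, rather than inspecting a statistical sample. A minimal sketch - field names, tolerance, and data are hypothetical, not any firm's actual tooling:

```python
# Illustrative full-population matching: every vendor invoice is checked
# against the ledger rather than a statistical sample. Field names,
# tolerance, and data are hypothetical.

LEDGER = {
    ("ACME", "INV-001"): 1200.00,
    ("ACME", "INV-002"): 450.50,
    ("GLOBEX", "INV-104"): 9800.00,
}

INVOICES = [
    {"vendor": "ACME", "number": "INV-001", "amount": 1200.00},
    {"vendor": "ACME", "number": "INV-002", "amount": 455.50},    # amount mismatch
    {"vendor": "GLOBEX", "number": "INV-105", "amount": 310.00},  # not in ledger
]

def match_invoices(invoices, ledger, tolerance=0.01):
    """Test 100% of invoices; return only the ones that fail to match."""
    exceptions = []
    for inv in invoices:
        key = (inv["vendor"], inv["number"])
        if key not in ledger:
            exceptions.append({**inv, "reason": "no ledger entry"})
        elif abs(ledger[key] - inv["amount"]) > tolerance:
            exceptions.append({**inv, "reason": "amount mismatch"})
    return exceptions

for e in match_invoices(INVOICES, LEDGER):
    print(e["vendor"], e["number"], "->", e["reason"])
```

Coverage is total, and human attention goes only to the exception list - the inversion of the sampling model.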
A dark factory model for audit applies most directly to the procedural work: controls testing, transaction sampling, document vouching, analytical procedures, and working paper preparation. The audit team specifies the audit programme - the client's risk profile, the applicable standards, the materiality thresholds, the control framework, and the areas of focus. AI agents execute the testing, assemble the evidence, and prepare the documentation. The scenarios test whether the output meets the standard: does this controls test provide sufficient appropriate evidence? Does this analytical procedure explanation address the variance adequately? Does this working paper support the conclusion reached?
The constraint is the quality of the risk assessment that drives the specification. Audit standards require the auditor to understand the entity, its environment, and the risks of material misstatement. That understanding - of how the client's business actually works, where the pressures on management are, which balances carry the most judgement - is what determines whether the factory is testing the right things. A well-specified audit programme produces reliable output. A poorly specified one produces documented work that misses the real risks.
What stays outside the dark factory is the same judgement layer that stays outside legal and banking operations. Assessing whether a management estimate is reasonable, evaluating the implications of an unexpected audit finding, deciding whether a control deficiency is significant - these require the kind of contextual reasoning that cannot be reduced to a test.
The apprenticeship problem is acute. Associates have historically developed audit judgement by doing the procedural work - the repetition of testing transactions, reviewing documents, and seeing what passes and what does not builds an intuition for where risk sits and what good evidence looks like. As AI handles that work, the path from graduate to partner loses several of its rungs. The firms that figure out how to rebuild that development pipeline will have a structural advantage - not just in cost, but in the quality of judgement at the top.
Banking operations
Banking operations has been subject to automation for longer than most industries - basic payments processing, trade settlement, and reconciliation have used rules-based systems for decades. What is changing is the scope of what can be handled without human involvement, and in particular how far into the exception queue AI can now reach.
The volume and complexity of the underlying work is easy to underestimate. An operations analyst at a large bank might process thousands of transactions a day across multiple asset classes and jurisdictions. A single cross-border payment involves checking the beneficiary and originator against multiple sanctions lists - OFAC, the UN, the EU, and others simultaneously - applying cut-off times for the relevant currency, routing through the appropriate correspondent banking chain, applying any jurisdiction-specific restrictions on the transaction type, and reconciling the bank's nostro account position at the end of the day. A securities trade involves matching the trade details against the counterparty's confirmation, monitoring for settlement failure, managing collateral movements, and reporting the position to the relevant trade repository. Each of these steps has documented rules. Most of them can already be handled by AI at level three or four.
A dark factory model for banking operations has substantial scope. The operations team specifies the processing rules, the sanctions screening logic, the settlement instructions, the reconciliation tolerances, and the escalation criteria. AI agents process the transaction flow, apply the rules, generate the audit trail, and surface exceptions for human review. The scenarios test whether the output is correct: does this payment route correctly given these counterparty and jurisdiction characteristics? Does this reconciliation break fall within tolerance or require investigation? Does this trade confirm match within the agreed parameters? For standard transaction flows, most of this is automatable today.
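The specify-the-rules / surface-the-exceptions split described above can be sketched in a few lines. The screening here is exact-match only - real systems use fuzzy name matching and far richer rule sets - and every list, corridor, and name is hypothetical:

```python
# Illustrative payment processing: documented rules are applied to each
# payment; anything that trips a rule goes to an exception queue for human
# review. Exact-match screening only; real systems use fuzzy matching.
# All names, lists, and rules are hypothetical.

SANCTIONS_LISTS = {
    "OFAC": {"BLOCKED TRADING LLC"},
    "EU":   {"BLOCKED TRADING LLC", "SHELL HOLDINGS SA"},
}
RESTRICTED_CORRIDORS = {("US", "XX")}  # originator/beneficiary country pairs

def screen(payment):
    """Apply the documented rules; return the reasons, if any, to escalate."""
    flags = []
    for list_name, names in SANCTIONS_LISTS.items():
        if payment["beneficiary"].upper() in names:
            flags.append(f"sanctions hit ({list_name})")
    if (payment["orig_country"], payment["benef_country"]) in RESTRICTED_CORRIDORS:
        flags.append("restricted corridor")
    return flags

def process(payments):
    """Split the flow into straight-through payments and exceptions."""
    straight_through, exceptions = [], []
    for p in payments:
        flags = screen(p)
        (exceptions if flags else straight_through).append((p, flags))
    return straight_through, exceptions

payments = [
    {"beneficiary": "Acme GmbH", "orig_country": "US", "benef_country": "DE"},
    {"beneficiary": "Blocked Trading LLC", "orig_country": "US", "benef_country": "DE"},
]
ok, held = process(payments)
print(len(ok), "straight-through,", len(held), "for review")
```

Note that a single payment can trip several rules at once - the multi-rule edge cases discussed below are exactly the ones the specification has to anticipate.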
The constraint, as with legal, is specification completeness. The rules interact with each other, change frequently as regulations evolve, and produce edge cases that sit at the intersection of multiple rule sets simultaneously. A transaction that triggers both a sanctions flag and a jurisdiction restriction and involves a correspondent bank under regulatory scrutiny requires a specification that has anticipated all three conditions applying at once. Most operations manuals have not been written to that level of precision, because historically a human could fill the gap.
What does not automate easily is genuine novelty. A sanctions alert where the matched name is shared by thousands of individuals - a common occurrence with names from certain regions - requires contextual judgement about the specific transaction, the counterparty's profile, and the jurisdiction's risk classification. A settlement failure involving a counterparty showing signs of financial distress requires someone who understands the bilateral relationship, the broader market context, and the potential for contagion. A payment instruction that appears valid but has characteristics consistent with a new fraud typology requires reasoning about what is being seen, not application of a known rule.
Historically, operations teams were large because the volume of standard transaction processing required significant headcount. As AI handles that volume, the function does not disappear - it concentrates. The roles that remain are defined by the ability to handle situations the specification did not anticipate, and to maintain the specification well enough that those situations arise as rarely as possible.
Legal
AI automation in legal applies most directly to transactional work - the structured, document-intensive work that makes up the bulk of a junior lawyer's early years.
In a large commercial transaction, a team of associates might spend weeks reviewing hundreds of contracts in a data room - a secure online repository where a company being acquired or financed shares its key documents with the other side's legal team. They check each contract against a checklist of key terms: change of control clauses, assignment restrictions, termination rights, liability caps. They draft standard clauses from a precedent library, produce first drafts of share purchase agreements or facility agreements based on an agreed term sheet, and compile due diligence reports summarising their findings. AI is already capable of all of this. Contract review tools can process a data room in hours, extracting and flagging key terms across hundreds of documents simultaneously. AI drafting tools produce first drafts of standard transaction documents from a structured set of inputs. These are in use at larger firms now.
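The checklist review described above can be sketched as a scan of each contract for key-term patterns. Production contract review tools use trained models rather than keyword search, and every clause and document here is invented for illustration:

```python
import re

# Illustrative checklist review: each contract is scanned for key terms
# from the diligence checklist. Real contract review tools use trained
# models, not keyword search; clauses and documents here are hypothetical.

CHECKLIST = {
    "change of control": r"change\s+of\s+control",
    "assignment restriction": r"may\s+not\s+.*assign",
    "liability cap": r"liability.*(?:shall\s+not\s+exceed|capped)",
}

def review(contracts):
    """Return, per contract, which checklist items were found."""
    report = {}
    for name, text in contracts.items():
        hits = [item for item, pattern in CHECKLIST.items()
                if re.search(pattern, text, re.IGNORECASE)]
        report[name] = hits
    return report

contracts = {
    "supply_agreement.txt": (
        "Either party may terminate upon a Change of Control. "
        "Supplier may not transfer or assign this Agreement."
    ),
    "licence.txt": "Licensor's aggregate liability shall not exceed the fees paid.",
}
for doc, hits in review(contracts).items():
    print(doc, "->", hits)
```

The output is the raw material of a due diligence report: for each document, which checklist items appeared and therefore need a lawyer's attention.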
A dark factory model for transactional legal work has significant scope. The deal team specifies the transaction structure, the applicable jurisdiction, the client's key commercial positions, and the negotiating parameters. AI agents review the data room, produce the due diligence report, draft the transaction documents, and flag deviations from standard positions. The scenarios test whether the output is complete and correct: does the due diligence report cover all required categories? Does the draft agreement reflect the agreed commercial terms? Does it comply with the jurisdiction's mandatory requirements? Most of a standard transaction could in principle run through a well-specified factory.
The constraint is the quality of the specification. Legal documents interact with jurisdiction-specific rules, deal-specific commercial context, and client-specific risk tolerance in ways that require explicit choices to be made upfront. Judgements that experienced lawyers currently make in the course of a transaction - whether a liability cap should be linked to deal value or set as a fixed figure, how aggressively to push back on an indemnity, where a client's real commercial priorities lie - need to be captured in the specification before the factory can run. That specification work is skilled, and in most firms it is currently not written down anywhere.
What stays outside the dark factory is the advisory layer. Understanding what a client is actually trying to achieve - as opposed to what they have asked for - requires conversation and experience of how similar deals have played out. Assessing litigation risk on an ambiguous clause, advising on a novel regulatory structure, or navigating a dispute with no clear precedent requires reasoning that cannot be reduced to a checklist.
Historically, trainees developed their understanding of deal structures through two or three years of contract review and due diligence work - the repetition of seeing hundreds of agreements built an intuition for what normal looks like, what matters, and where risk sits. As AI handles that work, that learning mechanism disappears. The question for firms is how they replace it - because the judgement that sits above what the factory can do still needs to be developed somewhere.
Compliance
Compliance in financial services follows a clear logic: identify regulatory requirements, map them to controls, gather evidence those controls are working, and report. Take transaction monitoring as an example. A compliance analyst receives alerts generated by the bank's systems flagging transactions that match predefined patterns - unusual amounts, high-risk jurisdictions, structuring behaviour. They review each alert, assess whether it represents genuine suspicious activity, document their rationale, and either close it or escalate it to a suspicious activity report. The inputs are structured, the rules are documented, and the output is a decision with an evidence trail.
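The alert flow described above can be sketched as a rule set run over a transaction stream, with each alert carrying its documented rationale - the evidence trail the analyst would otherwise write by hand. Thresholds, jurisdictions, and rules here are all hypothetical:

```python
# Illustrative transaction monitoring: predefined rules flag transactions
# and each alert records its rationale, preserving the evidence trail.
# Thresholds, jurisdictions, and rules are hypothetical.

HIGH_RISK_JURISDICTIONS = {"XX", "YY"}
LARGE_AMOUNT = 10_000
STRUCTURING_BAND = (9_000, 10_000)  # just under the reporting threshold

def monitor(transactions):
    """Run each transaction through the rule set; return alerts with rationale."""
    alerts = []
    for tx in transactions:
        reasons = []
        if tx["amount"] >= LARGE_AMOUNT:
            reasons.append(f"amount {tx['amount']} at or above {LARGE_AMOUNT}")
        if STRUCTURING_BAND[0] <= tx["amount"] < STRUCTURING_BAND[1]:
            reasons.append("amount just below reporting threshold (possible structuring)")
        if tx["country"] in HIGH_RISK_JURISDICTIONS:
            reasons.append(f"high-risk jurisdiction {tx['country']}")
        if reasons:
            alerts.append({"id": tx["id"], "rationale": reasons})
    return alerts

transactions = [
    {"id": "T1", "amount": 500, "country": "DE"},
    {"id": "T2", "amount": 9_500, "country": "DE"},
    {"id": "T3", "amount": 12_000, "country": "XX"},
]
for a in monitor(transactions):
    print(a["id"], "->", "; ".join(a["rationale"]))
```

Because every input, rule, and rationale is explicit, each closure decision can later be tested against an expected outcome - which is what makes this kind of work testable against known criteria.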
The 2008 financial crisis triggered a wave of new regulation across financial services - Basel III, Dodd-Frank, MiFID II, EMIR, and enhanced AML requirements among others. Banks and financial institutions responded by substantially expanding their compliance functions to meet the new obligations, creating large teams to handle the increased volume of monitoring, reporting, and documentation. Much of that headcount is engaged in exactly the kind of structured, rule-governed work described above.
In principle, this work maps well onto a dark factory model. The specification would define the regulatory ruleset - which transaction characteristics trigger a review, what evidence is required to close an alert, what constitutes a reportable event, what a completed SAR must contain. The scenarios would test whether the AI's decisions are correct: does this alert closure decision match the expected outcome given these transaction characteristics? Does this regulatory return reconcile to the source data? Does this SAR meet all required fields and thresholds? These are all testable against known criteria, which is exactly what makes compliance a strong candidate for higher levels of automation.
The harder question is who, if anyone, needs to review the output. Currently, most regulatory frameworks require a named, accountable individual to attest to returns and certain decisions. The FCA, PRA, and equivalent bodies hold specific senior managers personally liable under frameworks like the Senior Managers and Certification Regime. That is not going away soon. But personal liability for the process does not necessarily mean the person has to perform the underlying analytical work - it means they need to be satisfied the process was sound and the outputs are correct.
The dark factory model for compliance therefore looks more like level four than level five: AI executes the monitoring, documentation, and reporting; scenarios test whether outputs meet the regulatory criteria; and a compliance officer reviews exception flags, handles genuinely ambiguous cases, and attests to the overall process. Whether regulators eventually accept AI-generated outputs with human attestation based on process rather than individual review is an open question - but the direction of travel in regulatory thinking, particularly around model risk and algorithmic decision-making, suggests it is a question they are already starting to grapple with.
Writing the specification for this is not straightforward. Regulatory rules interact with each other, change frequently, and contain deliberate ambiguity - legislators cannot anticipate every edge case, so rules are written with judgement in mind. A specification precise enough for an AI agent to execute correctly would need to resolve that ambiguity explicitly, which requires deep regulatory expertise and a willingness to make calls that compliance teams currently leave to individual judgement.
The skill that changes most
Across all of these industries, the most important shift is not learning specific AI tools. It is the ability to write specifications precise enough for a machine to execute correctly, without a human asking a clarifying question.
When humans work with other humans, gaps get filled through conversation and judgement. An AI agent builds exactly what it is told. If the description is ambiguous, the output reflects that ambiguity - and the agent will not flag it.
The traditional apprenticeship model never required junior-level precision in instructions, because a human on the other end could always ask. That is changing. Specification writing is becoming a core professional skill - and it demands a depth of understanding of the process, the risk, and the client that the apprenticeship model usually developed over years.
The work that gets automated first is the work that required the least judgement. What remains requires more, not less.
References
Dark factories and the five levels framework
Dan Shapiro - The Five Levels: from Spicy Autocomplete to the Dark Factory (January 2026) https://www.danshapiro.com/blog/2026/01/the-five-levels-from-spicy-autocomplete-to-the-software-factory/
Simon Willison - How StrongDM's AI team build serious software without even looking at the code (February 2026) https://simonwillison.net/2026/Feb/7/software-factory/
AI productivity research
METR - Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity (2025) https://metr.org/blog/2025-07-10-early-2025-ai-development/
AI in audit and professional services
Deloitte - Omnia Audit Technology and Agentic AI https://www.deloitte.com/us/en/services/audit-assurance/about/audit-technology-solutions.html
Auditbrew - How the Big 4 Are Navigating Modern Auditing Standards in 2025 https://www.auditbrew.com/post/how-the-big-4-are-navigating-modern-auditing-standards-in-2025
Olivier Khatib - AI and the Collapse of the Big Four? (August 2025) https://olikhatib.substack.com/p/ai-and-the-collapse-of-the-big-four
AI in banking and financial services
Morgan Stanley - AI Could Help Eliminate 200K EU Banking Jobs By 2030, reported by PYMNTS (December 2025) https://www.pymnts.com/artificial-intelligence-2/2025/ai-could-help-eliminate-200k-eu-banking-jobs-by-2030/
Banking Exchange - Compliance for AI Agents: What Financial Services Organizations Need to Know (November 2025) https://www.bankingexchange.com/news-feed/item/10465-compliance-for-ai-agents-what-financial-services-organizations-need-to-know
AI-native company benchmarks
Figures for Cursor, Midjourney, and Lovable revenue per employee are widely reported across technology press as of early 2026 and reflect the period in which this article was written.

