You can't saturate your way into chaos

Who this is for: Engineers and technical leaders designing multi-agent systems and wondering when to parallelise and when to stay sequential.

Key points:

Parallel agents hit diminishing returns faster than parallel compute does, because coordination costs compound at the model level.
The Universal Scalability Law quantifies this: throughput peaks, then falls. The drop-off is predictable if you measure contention and coherency overhead.
Google Research found that multi-agent systems on software engineering tasks improve results up to a point, then plateau or regress due to conflicting edits and context fragmentation.
The pattern that actually scales: one agent per independent task, not N agents on the same task.

The instinct when you first see an AI agent work quickly is to add more of them. If one agent can analyse a codebase in ten minutes, surely ten agents can do it in one. This is the same reasoning that led engineers to throw more threads at serialised workloads and wonder why performance got worse. The reasoning is wrong in the same way, for the same reasons.

Multi-agent parallelisation is real and useful. Stripe produces over 1,300 automated pull requests per week using one agent per independent task.^[1] The key phrase is "independent task". The moment agents share state, coordinate on overlapping work, or need to agree on intermediate outputs, the cost structure changes completely. The overhead is not linear. It compounds. And it compounds faster for language model agents than it does for traditional distributed systems, because the coordination layer itself runs on expensive, latency-sensitive inference.

What the data shows

A Google Research paper published in December 2025 reports a systematic study of multi-agent systems on software engineering benchmarks.^[2] The results are instructive. Multi-agent approaches improved task completion rates on SWE-bench from roughly 65% (single agent) to 72.2% when agents were assigned non-overlapping subtasks with clear interfaces. That is a real gain. The paper also documents the failure mode: when agents were given overlapping scope on the same codebase, performance degraded below the single-agent baseline. More agents, worse outcome. The inflection point was earlier than the researchers expected.

The failure mode has a name in distributed systems theory: coherency cost. Amdahl's Law captures the serialisation constraint (the sequential fraction of work bounds your speedup). Neil Gunther's Universal Scalability Law goes further, adding a coherency term that accounts for the overhead of keeping shared state consistent across N workers.^[3] In traditional systems, coherency overhead is a cache invalidation problem. In multi-agent AI systems, it is a context problem: agents need to know what other agents have done, decided, or changed. That context doesn't pass for free.
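For reference, the usual way of writing the two laws side by side is below. The symbols are the conventional ones rather than anything specific to this article: N is the number of workers (here, agents), C(N) is throughput relative to a single worker, σ is the contention coefficient and κ is the coherency coefficient.

```latex
% Universal Scalability Law: relative throughput with N workers
C(N) = \frac{N}{1 + \sigma (N - 1) + \kappa N (N - 1)}

% Setting \kappa = 0 recovers Amdahl's Law with serial fraction \sigma
C_{\text{Amdahl}}(N) = \frac{N}{1 + \sigma (N - 1)}

% For \kappa > 0, throughput peaks at
N^{*} = \sqrt{\frac{1 - \sigma}{\kappa}}
```

The κ term grows with N(N − 1), roughly the number of pairs of workers that may need to stay consistent with each other, which is why it eventually dominates however small κ is.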

Anthropic's multi-agent guidance published in early 2026 makes the same point operationally: the systems that perform best assign agents to tasks that can be completed independently, with human-readable handoff points between them, not continuous shared state.^[4] The architecture that scales is closer to a pipeline than a swarm.

Figure: Universal Scalability Law curve for a representative multi-agent workload. Throughput peaks around 5 agents and declines as coherency overhead dominates. The exact peak depends on your contention (σ) and coherency (κ) parameters, but whenever κ is above zero the shape is the same: throughput rises, peaks, then falls.
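A curve of this shape is easy to reproduce. The sketch below evaluates the standard USL formula for 1 to 10 agents. The values of `SIGMA` and `KAPPA` are assumptions picked only so that the peak lands at five agents; they are not the parameters behind the figure and not measurements from any real system.

```python
import math


def usl_throughput(n, sigma, kappa):
    """Relative throughput C(N) under the Universal Scalability Law."""
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))


def usl_peak(sigma, kappa):
    """Worker count at which C(N) is largest (only defined for kappa > 0)."""
    return math.sqrt((1 - sigma) / kappa)


# Illustrative values only (assumed, not measured): chosen so the peak is at 5.
SIGMA = 0.10   # contention: share of work that is effectively serialised
KAPPA = 0.036  # coherency: cost of keeping agents consistent with each other

if __name__ == "__main__":
    for n in range(1, 11):
        print(f"{n:2d} agents -> {usl_throughput(n, SIGMA, KAPPA):.2f}x")
    print(f"peak at about {usl_peak(SIGMA, KAPPA):.1f} agents")
```

With these assumed values, ten agents deliver less relative throughput than four do, which is the regression the curve is meant to illustrate.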

What actually works

The systems that scale well with multiple agents share a structural property: the agents are not coordinating during execution, they are coordinating at handoff points. Stripe's architecture assigns each agent a complete, bounded task: write and test one pull request. The agent doesn't share a working branch with other agents. It doesn't need to know what other agents are doing. The parallelism is real because the independence is real. When the task is done, a human reviews it. That review gate is not overhead; it is the coordination mechanism that keeps coherency cost near zero.

The same principle holds for the scanner learning loop on this platform. Five ops agents run on separate EventBridge schedules, each writing to a separate partition in DynamoDB. They do not call each other. They do not share intermediate state. The aggregation happens at read time, not write time. This is not an elegant architectural preference; it is the only way to keep the system predictable at low cost. A design where the cost agent queries the SRE agent during its own run would add latency, coupling, and failure modes for marginal benefit. Independence is the feature.
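The shape of that design can be sketched in a few lines. This is not the platform's code: a plain dictionary stands in for the DynamoDB table, a thread pool stands in for the separate schedules, and the agent names other than `cost` and `sre` are placeholders. What it illustrates is that each agent writes only to its own partition and that combining results is left to the reader.

```python
from concurrent.futures import ThreadPoolExecutor

# "cost" and "sre" appear in the text above; the remaining names are placeholders.
AGENTS = ["cost", "sre", "ops-3", "ops-4", "ops-5"]


def run_agent(name, partition):
    """Do one agent's work, writing only to the partition it owns."""
    for i in range(2):
        # Hypothetical finding records; a real agent would write its own results.
        partition.append({"agent": name, "finding": f"{name}-{i}"})


def run_all():
    """Run every agent independently; no agent sees another agent's partition."""
    store = {name: [] for name in AGENTS}  # one partition per agent
    with ThreadPoolExecutor(max_workers=len(AGENTS)) as pool:
        futures = [pool.submit(run_agent, name, store[name]) for name in AGENTS]
        for future in futures:
            future.result()  # surface any exception raised inside an agent
    return store


def read_report(store):
    """Aggregate at read time: count findings per agent across all partitions."""
    return {name: len(items) for name, items in store.items()}


if __name__ == "__main__":
    print(read_report(run_all()))
```

Because `run_agent` receives only its own partition, there is nothing for one agent to wait on or invalidate in another, so adding a sixth agent adds no coordination work to the existing five.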

Figure: The architecture that scales (right) eliminates in-flight coordination. Agents work on independent tasks and hand off clean outputs. The architecture that doesn't (left) compounds coherency cost with every agent added.

The honest caveat

The USL curve above uses parameters that produce a pessimistic-but-realistic shape for AI agent workloads. The actual inflection point in your system depends on your specific contention and coherency values, which you have to measure, not assume. Some workloads genuinely parallelise well beyond five agents: document processing where each document is fully independent, code generation where each file is isolated, analysis tasks with no shared output requirements. The law doesn't say parallelism is bad; it says the relationship between agents and throughput is not linear, and the deviation from linearity arrives earlier than most people expect. Measure it in your system before you design for it.
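One way to do that measurement, sketched here under assumptions: run the same workload with several different agent counts, record throughput for each, and fit the USL to the results. The function below uses a common linearisation of the law. With x = N − 1 and y = N / C(N) − 1, the USL becomes y = (σ + κ)x + κx², which is an ordinary least-squares fit through the origin. The function name and input format are illustrative, and it needs a measurement at N = 1 plus at least two other agent counts.

```python
def fit_usl(measurements):
    """Estimate (sigma, kappa) from [(n_agents, throughput), ...].

    Throughput can be in any unit (tasks per hour, say); it is normalised
    against the single-agent measurement, which must be present.
    """
    baseline = dict(measurements)[1]
    sxx = sx3 = sx4 = sxy = sx2y = 0.0
    for n, throughput in measurements:
        if n == 1:
            continue  # x = 0 and y = 0 here; the point adds nothing to the fit
        relative = throughput / baseline  # C(N)
        x = n - 1
        y = n / relative - 1
        sxx += x * x
        sx3 += x ** 3
        sx4 += x ** 4
        sxy += x * y
        sx2y += x * x * y
    # Normal equations for y = b*x + a*x^2, solved with Cramer's rule.
    det = sxx * sx4 - sx3 * sx3
    b = (sxy * sx4 - sx2y * sx3) / det
    a = (sxx * sx2y - sx3 * sxy) / det
    return b - a, a  # sigma = b - kappa, kappa = a


if __name__ == "__main__":
    # Synthetic numbers generated from sigma=0.10, kappa=0.036; not real data.
    synthetic = [
        (n, 12.0 * n / (1 + 0.10 * (n - 1) + 0.036 * n * (n - 1)))
        for n in (1, 2, 4, 6, 8)
    ]
    sigma, kappa = fit_usl(synthetic)
    print(f"sigma={sigma:.3f} kappa={kappa:.3f}")
```

Real measurements are noisy, so the fit will not be exact. Treat the resulting σ and κ as estimates and check the predicted peak against an actual run near that agent count.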

The design principle

Structure determines whether parallelism helps or hurts before you write a line of agent code. If tasks are independent and outputs are separable, parallelism is free throughput. If tasks share state, coordinate on intermediate work, or need consensus, you are building a distributed system with language models as the coordination mechanism. That system will behave exactly as distributed systems theory predicts: coherency overhead will dominate at scale, performance will peak earlier than you expect, and adding more agents will make things worse, not better. The fix is not better agents. It is task decomposition that makes coordination unnecessary.
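A rough way to check this before building anything is to write down what each task would touch and see which tasks overlap. The sketch below assumes you can list the files (or tables, or branches) each task needs; the function name and the example task names are hypothetical. Tasks that share any resource, directly or through a chain of other tasks, end up in the same group, and each group is a candidate for one agent rather than several.

```python
def group_by_shared_state(tasks):
    """Group tasks that touch a common resource.

    `tasks` maps a task name to the set of resources it reads or writes.
    Returns a sorted list of groups; tasks in different groups share nothing.
    """
    parent = {name: name for name in tasks}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    owner = {}  # resource -> first task seen touching it
    for name, resources in tasks.items():
        for resource in resources:
            if resource in owner:
                parent[find(name)] = find(owner[resource])
            else:
                owner[resource] = name

    groups = {}
    for name in tasks:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    # Hypothetical tasks and files, for illustration only.
    example = {
        "fix-login": {"auth.py"},
        "add-export": {"report.py"},
        "rename-columns": {"report.py", "schema.py"},
        "migrate-schema": {"schema.py"},
    }
    print(group_by_shared_state(example))
```

In the example, three of the four tasks are linked through `report.py` and `schema.py`, so the useful parallelism is two agents, not four.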

References

[1] "Minions: How Stripe scales automated pull requests to 1,300 per week using one agent per task". ByteByteGo / InfoQ, 2025.
[2] "Scaling LLM Test-Time Compute Does Not Improve Performance on Multi-step Agentic Tasks". Google Research, December 2025. arxiv.org/abs/2512.08296
[3] "A Simple Capacity Model for Massively Parallel Transaction Systems". Neil J. Gunther, CMG Conference Proceedings, 1993. Extended as the Universal Scalability Law. perfdynamics.com
[4] "Building effective agents: design principles for multi-agent systems". Anthropic, 2026. anthropic.com
