Platform architecture patterns

Summary

Who it's for: CTOs, principal engineers, and technical architects deciding which direction to take a growing platform, or evaluating whether to migrate an existing one. Also useful for engineering leaders being asked to explain a past architectural decision to a board or investor.

Key observations

Monolith-first is not a compromise. It is the correct starting point for most products, and the teams that skip it pay for it later in distributed-system complexity without the user base to justify the cost.
Strangler fig is the least glamorous pattern and the one most likely to actually work in a large existing system. It works precisely because it does not require a big-bang migration.
Cell-based architecture is a scaling pattern for the operationally mature. Teams that adopt it early, before they have the blast-radius problem it solves, create complexity without benefit.
Composable mesh is the right answer for a narrow set of situations. It is routinely over-applied by organisations that have read the case studies without the org structure to support them.

The pattern conversation usually starts with a migration. A monolith has become genuinely painful: deployments take 40 minutes, a change to the billing module requires a full regression suite on the recommendation engine, and three teams are blocked waiting for one team's release. The question becomes: what do we move to?

What follows is not a textbook overview. It is an account of what these patterns actually feel like to operate, and what the inflection points look like when they stop working. The interactive diagrams below show each pattern's component structure. The text is about the decisions behind them.

Monolith-first

All domains in one codebase, one database, one deployment pipeline. Fast to build, fast to change early on.

Best for: 0 to product-market fit
Strength: Fast iteration, low ops cost
Failure mode: Coupling growth outpaces team growth
Migration trigger: Deployment contention across teams

Strangler fig

A facade routes traffic. Domains migrate incrementally. The legacy system shrinks as new services absorb its responsibilities.

Best for: Migrating large existing systems
Strength: No big-bang cutover, reversible
Failure mode: Facade becomes permanent complexity
Migration trigger: Legacy system is the constraint

Cell-based architecture

Identical cells serve partitioned customer sets. A failure in Cell A cannot affect Cell C. The control plane handles routing and capacity.

Best for: Multi-tenant SaaS at scale
Strength: Blast radius containment
Failure mode: Operational overhead before scale justifies it
Migration trigger: One tenant's failure affecting others

Composable mesh

Capability nodes communicate via an event bus. The API gateway composes responses. Each node owns its data and publishes domain events.

Best for: Mature platforms with distinct domains
Strength: Independent deployment, clear ownership
Failure mode: Distributed data consistency, org must match
Migration trigger: Teams need independent release cadences

Pattern one: monolith-first

The monolith-first argument is simple and frequently ignored. You do not know which parts of your system will be the scaling bottleneck, the performance hotspot, or the most-changed module, until users tell you. Building microservices before you have that signal means drawing service boundaries based on guesswork, which is exactly how you get the wrong boundaries. Changing service boundaries after the fact is harder than changing module boundaries in a monolith.

The monolith-first pattern works because it is cheap to iterate. A product assumption turns out to be wrong: you change the module, redeploy the single artefact, done. In a distributed system, the same change requires coordinated API versioning, backward-compatibility windows, and potentially a migration script for data that is now in the wrong service's database. The early-stage product team that spends two sprints on service decomposition when it should be testing whether anyone wants the product at all is making a costly trade-off, not a principled architectural decision.

The monolith starts to fail when teams own different parts of it and start blocking each other on deployments. That is the signal, not a line count threshold or a module count. The fix at that point is not necessarily a full decomposition. A modular monolith, with strict internal boundaries and isolated testing, can sustain a surprisingly large engineering organisation if the deployment pipeline is fast. The key question is: are teams actually blocked, or is "microservices" being invoked as an aspiration?
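One way to make "strict internal boundaries" concrete is a check that fails the build when one domain imports another it is not allowed to depend on. The following is a minimal sketch, not a recommended tool: the domain names and the `ALLOWED` map are hypothetical, and it assumes each domain is a top-level Python package inside the monolith.

```python
import ast

# Hypothetical layout: each top-level package is one domain.
# ALLOWED maps a domain to the other domains it may import from.
ALLOWED = {
    "billing": {"shared"},
    "recommendations": {"shared", "catalog"},
    "catalog": {"shared"},
    "shared": set(),
}


def imported_domains(source: str) -> set[str]:
    """Return the top-level package names imported by a piece of source code."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return found


def boundary_violations(domain: str, source: str) -> set[str]:
    """Domains that `domain` imports but is not allowed to depend on."""
    used = imported_domains(source) & set(ALLOWED)  # ignore stdlib and third-party
    return used - ALLOWED[domain] - {domain}


billing_source = (
    "import os\n"
    "from shared.money import Money\n"
    "from recommendations.ranker import rank\n"
)
print(boundary_violations("billing", billing_source))  # the forbidden dependency
```

Run in the pipeline over every file in a domain, a check like this keeps the billing module from quietly acquiring a dependency on the recommendation engine, which is the coupling that forces the full regression suite.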

Pattern two: strangler fig

The strangler fig pattern is named after a fig that grows around a host tree, gradually replacing it. In architecture terms: a facade receives all traffic, routes part of it to the legacy system, and part to newly extracted services. As each domain is extracted, the route for it moves from legacy to new. Eventually the legacy system handles nothing and is decommissioned.

This is the pattern most likely to actually work for a large, live system with users on it. The reason is that it does not require a big-bang cutover. At any point, the facade can route traffic back to the legacy system if a newly extracted service has a problem. The migration is incremental and reversible, which makes it fundable: you can show progress at each stage, and you can stop if the business decides the investment is better spent elsewhere.

The failure mode is also well-documented: the facade becomes permanent complexity. Teams add routing logic, feature flags, and conditional behaviour to the facade over time, and it becomes a second monolith, this one hosting routing decisions rather than business logic. The discipline required is to keep the facade thin: it routes, it does not transform. Any business logic that ends up in the facade belongs in one of the services, and moving it later is expensive.
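A thin facade can be little more than a routing table. The sketch below is illustrative only: the `Facade` class, the path prefixes, and the backend names are assumptions, and a real facade would sit in a reverse proxy or API gateway rather than in application code. The point is what it leaves out: no payload transformation and no business rules.

```python
from dataclasses import dataclass, field

LEGACY = "legacy"


def _matches(prefix: str, path: str) -> bool:
    """Match on whole path segments, so /billing does not match /billingx."""
    return path == prefix or path.startswith(prefix.rstrip("/") + "/")


@dataclass
class Facade:
    """Routes by path prefix only. It routes; it does not transform."""
    routes: dict[str, str] = field(default_factory=dict)  # prefix -> backend name

    def migrate(self, prefix: str, backend: str) -> None:
        """Point a domain's traffic at its newly extracted service."""
        self.routes[prefix] = backend

    def roll_back(self, prefix: str) -> None:
        """Send a domain's traffic back to the legacy system."""
        self.routes.pop(prefix, None)

    def backend_for(self, path: str) -> str:
        # Longest matching prefix wins; anything unmatched stays on legacy.
        matches = [p for p in self.routes if _matches(p, path)]
        return self.routes[max(matches, key=len)] if matches else LEGACY


facade = Facade()
facade.migrate("/billing", "billing-service")
print(facade.backend_for("/billing/invoices/7"))
print(facade.backend_for("/orders/1"))
facade.roll_back("/billing")
print(facade.backend_for("/billing/invoices/7"))
```

Migration and rollback are both a single change to the table, which is what makes each step reversible. If a change to the facade needs more than a new row, that logic probably belongs in a service.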

A second failure mode: the legacy system never fully goes away. The team extracts the easy services and declares success. The remaining legacy functionality is too risky, too poorly understood, or too entangled to touch. You end up running two systems indefinitely. This is a resourcing and prioritisation failure more than an architecture failure, but the pattern creates the conditions for it.

Pattern three: cell-based architecture

Cell-based architecture is a response to a specific problem: in multi-tenant SaaS at scale, a noisy or failing tenant degrades everyone else. The solution is to partition tenants into isolated cells, each with its own stack, compute, and data. A failure in one cell cannot propagate to another. You can roll out changes to one cell before rolling to all of them. You can isolate large tenants into dedicated cells for compliance or performance reasons.

The operational cost of cell-based architecture is high. You are not running one system; you are running N systems that happen to be identical. That means N databases to back up, N certificate renewals, N upgrade pipelines, N monitoring streams. Without significant automation and a mature platform engineering capability, this creates more risk than it removes.

The inflection point where cell-based architecture pays off: when a single tenant's workload can cause a service degradation that appears in your SLA reports. Until you have reached that scale, or until compliance requirements force tenant isolation, the operational overhead is not justified. Teams that adopt cell-based architecture as an aspiration, before they have the blast-radius problem it solves, spend years building infrastructure complexity for a problem they have not yet encountered.

One practical note: the control plane is the part that is hardest to get right. Routing, capacity management, cell creation, and health checking all live there. Building this well is a substantial engineering investment. There are managed platforms that provide it; whether to use one or build your own is a version of the same build-vs-buy question that applies to everything else.
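The routing part of a control plane can be sketched in a few lines, which shows how much of the real work lies elsewhere. Everything here is hypothetical: the `ControlPlane` class, the cell names, and the choice to assign tenants by hashing. A production system would more likely store an explicit tenant-to-cell assignment, because modulo hashing reshuffles tenants whenever a cell is added.

```python
import hashlib
from dataclasses import dataclass, field


class CellUnavailable(Exception):
    """Raised when a tenant's home cell is marked unhealthy."""


@dataclass
class ControlPlane:
    cells: list[str]                                      # shared cells
    pinned: dict[str, str] = field(default_factory=dict)  # tenant -> dedicated cell
    unhealthy: set[str] = field(default_factory=set)

    def cell_for(self, tenant_id: str) -> str:
        """Deterministic home cell: pinned tenants first, then a stable hash."""
        if tenant_id in self.pinned:
            return self.pinned[tenant_id]
        digest = hashlib.sha256(tenant_id.encode()).digest()
        return self.cells[int.from_bytes(digest[:8], "big") % len(self.cells)]

    def route(self, tenant_id: str) -> str:
        cell = self.cell_for(tenant_id)
        if cell in self.unhealthy:
            # Assumed policy: fail this tenant rather than spill its load into
            # a healthy cell, which would reintroduce the shared blast radius.
            raise CellUnavailable(f"{cell} is unhealthy for tenant {tenant_id}")
        return cell


plane = ControlPlane(
    cells=["cell-a", "cell-b", "cell-c"],
    pinned={"big-bank": "cell-dedicated-1"},
)
print(plane.route("big-bank"))
print(plane.route("tenant-42"))
```

Capacity management, cell creation, and health checking are absent from this sketch, and they are where most of the engineering investment goes.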

Pattern four: composable mesh

The composable mesh is the end state that most modern architecture diagrams converge on: discrete capability nodes, each owning its data, communicating via events, composed at the API gateway layer. This pattern works well. It also requires organisational conditions that many teams underestimate.

The technical requirement is well-understood: strong service boundaries, event-driven integration, robust observability, distributed tracing, and a mature approach to data consistency across service boundaries. These are real skills that take time to develop. Teams migrating from a monolith typically need twelve to eighteen months before they are genuinely productive in this model, not because they are slow, but because the failure modes are different and the debugging patterns are unfamiliar.

The organisational requirement is less often discussed. The composable mesh works best when team ownership maps cleanly to service ownership: one team, one bounded context, one deployment pipeline. In practice, many organisations have team structures that do not map to domain boundaries, and migrating both the architecture and the team structure simultaneously is a significant change management challenge. Getting the architecture right while the org is wrong tends to produce services with unclear ownership, which degrade over time.

The specific failure mode to watch for: distributed data consistency. Events are asynchronous. A business process that spans multiple services, such as placing an order, charging for it, and reserving inventory, requires careful design to remain correct when any of those services fails mid-flow. The patterns exist (sagas, compensating transactions, outbox pattern), but they are not trivial to implement correctly, and the consequences of getting them wrong are subtle. The monolith had a database transaction. The mesh has a distributed coordination problem.
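The order example can be sketched as a saga with compensating transactions. This is an in-process stand-in under stated assumptions: `run_saga` and the step names are hypothetical, each action would really be a call to another service, and the sketch ignores the hard cases, such as a compensation that itself fails or a coordinator that crashes mid-flow.

```python
from typing import Callable

# (name, action, compensation); all three are illustrative placeholders.
Step = tuple[str, Callable[[], None], Callable[[], None]]


def run_saga(steps: list[Step]) -> bool:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    completed: list[Step] = []
    for step in steps:
        _, action, _ = step
        try:
            action()
        except Exception:
            for _, _, compensate in reversed(completed):
                compensate()
            return False
        completed.append(step)
    return True


log: list[str] = []


def reserve_inventory() -> None:
    raise RuntimeError("out of stock")


succeeded = run_saga([
    ("place_order", lambda: log.append("order placed"), lambda: log.append("order cancelled")),
    ("charge", lambda: log.append("charged"), lambda: log.append("refunded")),
    ("reserve", reserve_inventory, lambda: log.append("reservation released")),
])
print(succeeded, log)
```

When the reservation fails, the charge is refunded and the order cancelled, in that order. Unlike a database rollback, the intermediate states were visible to other services while the saga ran, which is where the subtle bugs come from.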

Rough decision flow. Starting position: monolith. Triggers determine which pattern to move toward. Patterns are not mutually exclusive at different scales.
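The same flow can be written down as a function from observed triggers to candidate patterns. This is one reading of the triggers described in this article, not a formal model: the `Signals` fields and the `recommend` function are hypothetical names, and real decisions depend on context that a handful of booleans cannot capture.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    teams_blocked_on_deploys: bool = False
    legacy_system_is_the_constraint: bool = False
    tenant_failures_affect_others: bool = False
    teams_need_independent_releases: bool = False
    team_ownership_matches_domains: bool = False


def recommend(s: Signals) -> list[str]:
    """Return candidate patterns; they are not mutually exclusive."""
    moves: list[str] = []
    if s.teams_need_independent_releases and s.team_ownership_matches_domains:
        moves.append("composable mesh")
    elif s.teams_blocked_on_deploys or s.teams_need_independent_releases:
        # The org structure is not ready for a mesh yet: tighten boundaries first.
        moves.append("modular monolith")
    if s.legacy_system_is_the_constraint:
        moves.append("strangler fig")
    if s.tenant_failures_affect_others:
        moves.append("cell-based isolation")
    return moves or ["stay on the monolith"]


print(recommend(Signals()))
print(recommend(Signals(teams_blocked_on_deploys=True)))
print(recommend(Signals(legacy_system_is_the_constraint=True,
                        tenant_failures_affect_others=True)))
```

With no triggers present, the function returns the starting position. That default is the point: each move away from the monolith needs an observed constraint to justify it.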

Migrating between patterns

The most important thing to understand about these patterns is that migration between them is costly. Every migration requires the team to think carefully about two systems simultaneously, which is cognitive and operational overhead on top of normal product delivery. The teams that do this well are the ones that treat the migration as a first-class engineering project with its own roadmap, resourcing, and success criteria, not as background activity alongside the normal sprint.

The question of when to migrate is not purely technical. Continuing to invest in a system that is architecturally suboptimal but operationally stable and well understood by the team is often a better use of time than an architectural migration. The migration is worth pursuing when the current architecture is the primary constraint on delivery velocity, reliability, or team scalability, not before.

One pattern that reliably creates problems: migrating to a more complex architecture as a retention strategy or a modernisation signal rather than a response to a real constraint. Teams that build microservices because they want to attract engineers who build microservices, rather than because the architecture solves a real problem, tend to end up with the operational costs of a distributed system without the organisational benefits.

The conversation with the board

When architecture comes up with a board or non-technical leadership team, the framing that works is constraints, not patterns. "We are moving to microservices" is a technology answer. "Our current architecture means three engineering teams are blocked on every deployment, which is costing us approximately two sprints of capacity per quarter" is a business problem with a cost attached.

The investment case for a migration needs a current-state constraint, a future-state outcome, a cost range, a timeline, and a decision point. The same capital allocation discipline that applies to AI investment applies here. Architecture migrations that cannot articulate a measurable outcome tend to expand indefinitely, which is how you end up with a three-year modernisation programme that consumes half the engineering headcount and produces no user-visible value until year two.

The honest summary: Start with a monolith. Extract when teams are genuinely blocked, not when the pattern is fashionable. Use strangler fig when you have a large live system that cannot afford a cutover. Move to composable mesh when team ownership and service ownership can align. Add cell-based isolation when tenant blast radius is a real problem, not a hypothetical one. The pattern is not the goal. Delivery velocity, operational reliability, and team autonomy are the goals. The pattern serves those goals, or it does not belong.
