The single-model assumption — "pick the best model and use it for everything" — made sense when there was one clearly dominant model. That era is over. Today, different models have meaningfully different strengths, weaknesses, latency profiles, and cost points. Using the right model for each sub-task is not premature optimisation; it's basic engineering.
Multi-model reasoning is the practice of routing tasks to models based on their demonstrated competence for that task type, and using model disagreement as a signal to escalate rather than silently proceeding.
## Why models disagree — and why that's valuable
Two well-calibrated models, given the same input, should reach the same conclusion on unambiguous questions. When they disagree, at least one of three things is true:
- The question is ambiguous and both interpretations are defensible
- One model has outdated or incorrect training data on this topic
- One model is better calibrated for this task type
In all three cases, disagreement is a useful signal — it means the answer deserves more scrutiny before being acted on. A quorum pattern (majority rules, minority dissent surfaced to a human) is more reliable than any single model on tasks with real-world consequences.
## The quorum pattern
The quorum pattern, as we implement it, works like this:
- Submit the same question to N models (typically 3)
- Parse structured responses (JSON, not prose, to enable programmatic comparison)
- If N-1 or N agree: proceed with the majority answer, log the dissent
- If models are evenly split: surface to a human with both positions and rationale
```python
# Simplified quorum implementation
import asyncio
from collections import Counter
from typing import Callable, TypeVar

T = TypeVar('T')

async def quorum(
    prompt: str,
    models: list[str],
    parse: Callable[[str], T],
    threshold: float = 2 / 3,  # 0.67 would reject a 2-of-3 majority (2/3 ≈ 0.667)
) -> tuple[T, list[T], bool]:
    """
    Run prompt across multiple models, return majority answer.

    Returns (majority_answer, all_answers, consensus_reached).
    """
    # call_model is assumed to wrap each provider's async API and return raw text
    responses = await asyncio.gather(*[call_model(m, prompt) for m in models])
    parsed = [parse(r) for r in responses]
    # Count occurrences of each unique answer
    counts = Counter(str(p) for p in parsed)
    majority_str, majority_count = counts.most_common(1)[0]
    consensus = (majority_count / len(models)) >= threshold
    majority_answer = next(p for p in parsed if str(p) == majority_str)
    return majority_answer, parsed, consensus
```
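To make the control flow concrete, here is a runnable sketch. `call_model` is stubbed with canned JSON responses (a real implementation would wrap each provider's async API), and `quorum` is repeated in condensed form so the snippet runs standalone:

```python
import asyncio
import json
from collections import Counter

# Stub standing in for a real provider call; returns canned JSON per "model".
async def call_model(model: str, prompt: str) -> str:
    canned = {
        "model-a": '{"verdict": "approve"}',
        "model-b": '{"verdict": "approve"}',
        "model-c": '{"verdict": "reject"}',
    }
    return canned[model]

# Condensed quorum, repeated here only so this snippet runs on its own.
async def quorum(prompt, models, parse, threshold=2 / 3):
    responses = await asyncio.gather(*[call_model(m, prompt) for m in models])
    parsed = [parse(r) for r in responses]
    majority_str, majority_count = Counter(str(p) for p in parsed).most_common(1)[0]
    consensus = (majority_count / len(models)) >= threshold
    majority = next(p for p in parsed if str(p) == majority_str)
    return majority, parsed, consensus

answer, all_answers, consensus = asyncio.run(
    quorum(
        "Should this change be approved?",
        ["model-a", "model-b", "model-c"],
        parse=lambda r: json.loads(r)["verdict"],
    )
)
print(answer, consensus)  # approve True — 2 of 3 agree; the dissent stays in all_answers
```

Note that a 2-of-3 split reaches consensus here, while a 1-1-1 three-way split would not, which is exactly the escalation boundary described above.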
## Task routing by model strength
Beyond quorum voting, routing tasks to the right model for that task type reduces cost and latency without sacrificing quality. The general pattern:
| Task type | Characteristics | Model preference |
|---|---|---|
| Code generation | Requires precise syntax, knowledge of APIs | Strong coding models (Claude Sonnet, GPT-4o) |
| Code review | Adversarial, needs to find flaws | Different model from generator (avoids blind spots) |
| Summarisation | High throughput, cost-sensitive | Smaller, faster models (Haiku, GPT-4o-mini) |
| Reasoning / planning | Multi-step, needs chain-of-thought | Reasoning-specialised models (o3, claude-3-7-sonnet) |
| Classification | Structured output, few-shot | Any capable model; smaller is fine |
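A routing layer for the table above can be as simple as a lookup keyed on task type. This is a minimal sketch; the model identifiers are illustrative placeholders, not recommendations:

```python
# Illustrative routing table; model names are placeholders for real model IDs.
ROUTES: dict[str, str] = {
    "code_generation": "strong-coder",
    "code_review": "strong-coder-alt",  # deliberately different from the generator
    "summarisation": "small-fast",
    "reasoning": "reasoning-specialised",
    "classification": "small-fast",
}

def route(task_type: str, default: str = "general-purpose") -> str:
    """Pick a model for a task type, falling back to a general-purpose model."""
    return ROUTES.get(task_type, default)

print(route("summarisation"))  # small-fast
print(route("unknown-task"))   # general-purpose
```

Keeping the table in one place also makes the generator/reviewer separation auditable: a test can assert the review model differs from the generation model.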
## Structured outputs: the foundation of multi-model pipelines
Multi-model pipelines only work reliably if each model returns structured, parseable output. Asking a model to "answer in JSON" is fragile — models add preamble, vary key names, and occasionally produce invalid JSON under load.
The robust approach is to use structured output mode (available in the OpenAI, Anthropic, and Google APIs) with a defined schema. Every model in the pipeline returns an instance of the same Pydantic/dataclass schema, validated before downstream use.
```python
from enum import Enum

from pydantic import BaseModel

class Verdict(str, Enum):
    APPROVE = "approve"
    REJECT = "reject"
    DEFER = "defer"

class GovernanceDecision(BaseModel):
    verdict: Verdict
    confidence: float  # 0.0 – 1.0
    rationale: str
    concerns: list[str]  # Specific issues that informed the verdict
```
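Validating a raw model response then becomes one call. This sketch uses Pydantic v2's `model_validate_json`, and repeats the schema so it runs standalone; off-schema JSON raises a `ValidationError` instead of flowing downstream:

```python
from enum import Enum

from pydantic import BaseModel, ValidationError

# Schema as defined above, repeated so this snippet runs on its own.
class Verdict(str, Enum):
    APPROVE = "approve"
    REJECT = "reject"
    DEFER = "defer"

class GovernanceDecision(BaseModel):
    verdict: Verdict
    confidence: float
    rationale: str
    concerns: list[str]

raw = '{"verdict": "defer", "confidence": 0.55, "rationale": "Unclear cost impact.", "concerns": ["cost"]}'
decision = GovernanceDecision.model_validate_json(raw)  # Pydantic v2 API
print(decision.verdict.value)  # defer

try:
    GovernanceDecision.model_validate_json('{"verdict": "maybe"}')
except ValidationError:
    print("rejected off-schema response")
```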
## When to use a reasoning model
Reasoning-specialised models (those trained with extended chain-of-thought) excel at tasks with:
- Multiple interdependent constraints (e.g. "design an architecture that meets these 8 requirements")
- Problems where the wrong answer is confidently wrong (high-stakes classification)
- Long multi-step derivations where errors compound
They are slower and more expensive than standard models. Use them for planning and decision-making; use standard models for generation and summarisation. A pipeline that uses a reasoning model for architecture decisions and a standard model for generating the code that implements those decisions is both fast and accurate.
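That split can be sketched as a two-stage pipeline. Both model calls are stubbed here and the model names are hypothetical; a real version would call the providers' async APIs:

```python
import asyncio

# Stub in place of real provider calls; model names are hypothetical.
async def call_model(model: str, prompt: str) -> str:
    if model == "reasoning-model":
        return "Plan: expose a /health endpoint returning build metadata."
    return "def health():\n    return {'status': 'ok'}"

async def plan_then_generate(requirements: str) -> tuple[str, str]:
    # Slow, expensive reasoning model makes the architectural decision...
    plan = await call_model("reasoning-model", f"Plan an architecture for: {requirements}")
    # ...then a fast standard model generates the code that implements it.
    code = await call_model("standard-model", f"Implement this plan:\n{plan}")
    return plan, code

plan, code = asyncio.run(plan_then_generate("a service health check"))
print(plan)
```

The reasoning model is invoked once per decision, while the cheaper generation step can run as many times as needed against the fixed plan.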
## Multi-model reasoning in the ticketyboo scanner
The scanner's finding analysis uses a single model for the initial scan pass (optimised for throughput — analysing potentially hundreds of files). The remediation recommendation generation uses a second, more capable model with a longer context window to reason about the full file before suggesting a fix.
For the governance decision tree tool on this site, a lightweight model classifies the decision type and routes it to the appropriate GATEKEEP checklist. The full reasoning model is only invoked if the decision touches security or cost — the two categories where confident wrong answers have real consequences.