The single-model assumption — "pick the best model and use it for everything" — made sense when there was one clearly dominant model. That era is over. Today, different models have meaningfully different strengths, weaknesses, latency profiles, and cost points. Using the right model for each sub-task is not premature optimisation; it's basic engineering.

Multi-model reasoning is the practice of routing tasks to models based on their demonstrated competence for that task type, and using model disagreement as a signal to escalate rather than silently proceeding.

Why models disagree — and why that's valuable

Two well-calibrated models, given the same input, should reach the same conclusion on unambiguous questions. When they disagree, at least one of three things is usually true:

  1. The question is ambiguous and both interpretations are defensible
  2. One model has outdated or incorrect training data on this topic
  3. One model is better calibrated for this task type

In all three cases, disagreement is a useful signal — it means the answer deserves more scrutiny before being acted on. A quorum pattern (majority rules, minority dissent surfaced to a human) is more reliable than any single model on tasks with real-world consequences.

The quorum pattern

The quorum pattern, as we implement it, works like this:

  1. Submit the same question to N models (typically 3)
  2. Parse structured responses (JSON, not prose, to enable programmatic comparison)
  3. If N-1 or N agree: proceed with the majority answer, log the dissent
  4. If models are evenly split: surface to a human with both positions and rationale

# Simplified quorum implementation
from collections import Counter
from typing import TypeVar, Callable
import asyncio

T = TypeVar('T')

async def quorum(
    prompt: str,
    models: list[str],
    parse: Callable[[str], T],
    threshold: float = 2 / 3,   # with N=3, two models must agree
) -> tuple[T, list[T], bool]:
    """
    Run prompt across multiple models, return majority answer.
    Returns (majority_answer, all_answers, consensus_reached).
    Assumes a call_model(model, prompt) coroutine that wraps your
    provider SDKs and returns the raw response text.
    """
    responses = await asyncio.gather(*[call_model(m, prompt) for m in models])
    parsed = [parse(r) for r in responses]

    # Count occurrences of each unique answer
    counts = Counter(str(p) for p in parsed)
    majority_str, majority_count = counts.most_common(1)[0]

    consensus = (majority_count / len(models)) >= threshold
    majority_answer = next(p for p in parsed if str(p) == majority_str)

    return majority_answer, parsed, consensus
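
The function above assumes a call_model coroutine that wraps your provider SDKs. A self-contained sketch of the same flow, with call_model stubbed to return canned JSON (the model names and responses are invented for illustration):

```python
import asyncio
import json
from collections import Counter

# Stub: a real call_model would hit a provider API. The model names
# and canned responses here are invented for illustration.
CANNED = {
    "model-a": '{"verdict": "approve"}',
    "model-b": '{"verdict": "approve"}',
    "model-c": '{"verdict": "reject"}',
}

async def call_model(model: str, prompt: str) -> str:
    return CANNED[model]

def parse(raw: str) -> str:
    return json.loads(raw)["verdict"]

async def run_quorum(prompt: str, models: list[str], threshold: float = 2 / 3):
    responses = await asyncio.gather(*[call_model(m, prompt) for m in models])
    answers = [parse(r) for r in responses]
    majority, count = Counter(answers).most_common(1)[0]
    consensus = count / len(models) >= threshold
    return majority, answers, consensus

majority, answers, consensus = asyncio.run(
    run_quorum("Approve this change?", ["model-a", "model-b", "model-c"])
)
# Two of three models agree, so consensus is reached at the 2/3 threshold
```

Note the dissenting answer survives in `answers`: the pattern requires logging it, not discarding it.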

Task routing by model strength

Beyond quorum voting, routing tasks to the right model for that task type reduces cost and latency without sacrificing quality. The general pattern:

Task type            | Characteristics                            | Model preference
Code generation      | Requires precise syntax, knowledge of APIs | Strong coding models (Claude Sonnet, GPT-4o)
Code review          | Adversarial, needs to find flaws           | Different model from the generator (avoids shared blind spots)
Summarisation        | High throughput, cost-sensitive            | Smaller, faster models (Haiku, GPT-4o-mini)
Reasoning / planning | Multi-step, needs chain-of-thought         | Reasoning-specialised models (o3, claude-3-7-sonnet)
Classification       | Structured output, few-shot                | Any capable model; smaller is fine
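
The table above reduces to a small routing function. The model identifiers below are placeholders, not real model IDs; substitute whatever your providers expose:

```python
# Placeholder model identifiers; substitute the concrete model IDs
# your providers expose.
ROUTES = {
    "code_generation": "strong-coding-model",
    "code_review": "strong-coding-model-b",  # deliberately different from the generator
    "summarisation": "small-fast-model",
    "reasoning": "reasoning-model",
    "classification": "small-fast-model",
}

def model_for(task_type: str) -> str:
    """Route a task to its preferred model tier; fall back to the
    small, cheap model for unrecognised task types."""
    return ROUTES.get(task_type, "small-fast-model")
```

Keeping the routing table as data rather than branching logic makes it easy to audit and to update as models change.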

Structured outputs: the foundation of multi-model pipelines

Multi-model pipelines only work reliably if each model returns structured, parseable output. Asking a model to "answer in JSON" is fragile — models add preamble, vary key names, and occasionally produce invalid JSON under load.
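
When structured output mode is not available, a common stopgap is a defensive parser that strips preamble and trailing prose before handing the payload to json.loads. A minimal sketch, using naive brace matching (it will misfire on braces inside JSON string values):

```python
import json

def extract_json(raw: str) -> dict:
    """Pull the first top-level {...} object out of a response that may
    contain preamble or trailing prose. Naive: does not account for
    braces inside JSON string values."""
    start = raw.find("{")
    if start == -1:
        raise ValueError("no JSON object found in response")
    depth = 0
    for i, ch in enumerate(raw[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(raw[start : i + 1])
    raise ValueError("unbalanced braces in response")

result = extract_json('Sure! Here is the JSON: {"verdict": "approve"} Hope that helps!')
# → {'verdict': 'approve'}
```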

The robust approach is to use structured output mode (available in the OpenAI, Anthropic, and Google APIs) with a defined schema. Every model in the pipeline returns an instance of the same Pydantic/dataclass schema, validated before downstream use.

from pydantic import BaseModel
from enum import Enum

class Verdict(str, Enum):
    APPROVE = "approve"
    REJECT  = "reject"
    DEFER   = "defer"

class GovernanceDecision(BaseModel):
    verdict: Verdict
    confidence: float       # 0.0 – 1.0
    rationale: str
    concerns: list[str]     # Specific issues that informed the verdict

Practical note: start with two models, not three. The marginal reliability gain from a third model is real but small, while cost and latency scale linearly with the number of models. For most production use cases, a generator model plus a reviewer model is the right starting point.
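
The generator-plus-reviewer pairing can be sketched as a two-step pipeline. generate_with and review_with are hypothetical wrappers around your provider SDKs, stubbed here so the control flow is visible:

```python
import asyncio

async def generate_with(model: str, prompt: str) -> str:
    # Hypothetical provider wrapper, stubbed for illustration.
    return f"draft produced by {model} for: {prompt}"

async def review_with(model: str, draft: str) -> bool:
    # Hypothetical reviewer wrapper; returns True if the draft passes review.
    return bool(draft.strip())

async def generate_then_review(prompt: str) -> tuple[str, bool]:
    """Generator and reviewer are deliberately different models, so the
    reviewer does not share the generator's blind spots."""
    draft = await generate_with("generator-model", prompt)
    approved = await review_with("reviewer-model", draft)
    return draft, approved

draft, approved = asyncio.run(generate_then_review("implement pagination"))
```

A rejected draft would loop back to the generator with the reviewer's concerns appended to the prompt, or escalate to a human after a bounded number of retries.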

When to use a reasoning model

Reasoning-specialised models (those trained with extended chain-of-thought) excel at multi-step tasks where an early mistake compounds through later steps.

They are slower and more expensive than standard models. Use them for planning and decision-making; use standard models for generation and summarisation. A pipeline that uses a reasoning model for architecture decisions and a standard model for generating the code that implements those decisions is both fast and accurate.

Multi-model reasoning in the ticketyboo scanner

The scanner's finding analysis uses a single model for the initial scan pass (optimised for throughput — analysing potentially hundreds of files). The remediation recommendation generation uses a second, more capable model with a longer context window to reason about the full file before suggesting a fix.

For the governance decision tree tool on this site, a lightweight model classifies the decision type and routes it to the appropriate GATEKEEP checklist. The full reasoning model is only invoked if the decision touches security or cost — the two categories where confident wrong answers have real consequences.

Related tools and articles

→ Governance decision tree (interactive)
→ Agentic development patterns
→ AI-assisted remediation
→ Scan your repository