The single-model assumption — "pick the best model and use it for everything" — made sense when there was one clearly dominant model. That era is over. Today, different models have meaningfully different strengths, weaknesses, latency profiles, and cost points. Using the right model for each sub-task is not premature optimisation; it's basic engineering.
Multi-model reasoning is the practice of routing tasks to models based on their demonstrated competence for that task type, and using model disagreement as a signal to escalate rather than silently proceeding.
## Why models disagree — and why that's valuable
Two well-calibrated models, given the same input, should reach the same conclusion on unambiguous questions. When they disagree, at least one of three things is true:
- The question is ambiguous and both interpretations are defensible
- One model has outdated or incorrect training data on this topic
- One model is better calibrated for this task type
In all three cases, disagreement is a useful signal — it means the answer deserves more scrutiny before being acted on. A quorum pattern (majority rules, minority dissent surfaced to a human) is more reliable than any single model on tasks with real-world consequences.
## The quorum pattern
The quorum pattern, as we implement it, works like this:
- Submit the same question to N models (typically 3)
- Parse structured responses (JSON, not prose, to enable programmatic comparison)
- If N-1 or N agree: proceed with the majority answer, log the dissent
- If models are evenly split: surface to a human with both positions and rationale
```python
# Simplified quorum implementation
import asyncio
from collections import Counter
from typing import Callable, TypeVar

T = TypeVar('T')

async def quorum(
    prompt: str,
    models: list[str],
    parse: Callable[[str], T],
    threshold: float = 2 / 3,  # 0.67 would reject a 2-of-3 majority (2/3 ≈ 0.667)
) -> tuple[T, list[T], bool]:
    """
    Run prompt across multiple models, return majority answer.

    Returns (majority_answer, all_answers, consensus_reached).
    """
    # call_model is assumed to wrap each provider's async API and return raw text
    responses = await asyncio.gather(*[call_model(m, prompt) for m in models])
    parsed = [parse(r) for r in responses]
    # Count occurrences of each unique answer
    counts = Counter(str(p) for p in parsed)
    majority_str, majority_count = counts.most_common(1)[0]
    consensus = (majority_count / len(models)) >= threshold
    majority_answer = next(p for p in parsed if str(p) == majority_str)
    return majority_answer, parsed, consensus
```
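To make the control flow concrete, here is a runnable sketch. `call_model` is stubbed with canned JSON responses (a real implementation would wrap each provider's async API), and `quorum` is repeated in condensed form so the snippet runs standalone:

```python
import asyncio
import json
from collections import Counter

# Stub standing in for a real provider call; returns canned JSON per "model".
async def call_model(model: str, prompt: str) -> str:
    canned = {
        "model-a": '{"verdict": "approve"}',
        "model-b": '{"verdict": "approve"}',
        "model-c": '{"verdict": "reject"}',
    }
    return canned[model]

# Condensed quorum, repeated here only so this snippet runs on its own.
async def quorum(prompt, models, parse, threshold=2 / 3):
    responses = await asyncio.gather(*[call_model(m, prompt) for m in models])
    parsed = [parse(r) for r in responses]
    majority_str, majority_count = Counter(str(p) for p in parsed).most_common(1)[0]
    consensus = (majority_count / len(models)) >= threshold
    majority = next(p for p in parsed if str(p) == majority_str)
    return majority, parsed, consensus

answer, all_answers, consensus = asyncio.run(
    quorum(
        "Should this change be approved?",
        ["model-a", "model-b", "model-c"],
        parse=lambda r: json.loads(r)["verdict"],
    )
)
print(answer, consensus)  # approve True — 2 of 3 agree; the dissent stays in all_answers
```

Note that a 2-of-3 split reaches consensus here, while a 1-1-1 three-way split would not, which is exactly the escalation boundary described above.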
## Task routing by model strength
Beyond quorum voting, routing tasks to the right model for that task type reduces cost and latency without sacrificing quality. The general pattern:
| Task type | Characteristics | Model preference |
|---|---|---|
| Code generation | Requires precise syntax, knowledge of APIs | Strong coding models (Claude Sonnet, GPT-4o) |
| Code review | Adversarial, needs to find flaws | Different model from generator (avoids blind spots) |
| Summarisation | High throughput, cost-sensitive | Smaller, faster models (Haiku, GPT-4o-mini) |
| Reasoning / planning | Multi-step, needs chain-of-thought | Reasoning-specialised models (o3, claude-3-7-sonnet) |
| Classification | Structured output, few-shot | Any capable model; smaller is fine |
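A routing layer for the table above can be as simple as a lookup keyed on task type. This is a minimal sketch; the model identifiers are illustrative placeholders, not recommendations:

```python
# Illustrative routing table; model names are placeholders for real model IDs.
ROUTES: dict[str, str] = {
    "code_generation": "strong-coder",
    "code_review": "strong-coder-alt",  # deliberately different from the generator
    "summarisation": "small-fast",
    "reasoning": "reasoning-specialised",
    "classification": "small-fast",
}

def route(task_type: str, default: str = "general-purpose") -> str:
    """Pick a model for a task type, falling back to a general-purpose model."""
    return ROUTES.get(task_type, default)

print(route("summarisation"))  # small-fast
print(route("unknown-task"))   # general-purpose
```

Keeping the table in one place also makes the generator/reviewer separation auditable: a test can assert the review model differs from the generation model.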
## Structured outputs: the foundation of multi-model pipelines
Multi-model pipelines only work reliably if each model returns structured, parseable output. Asking a model to "answer in JSON" is fragile — models add preamble, vary key names, and occasionally produce invalid JSON under load.
The robust approach is to use structured output mode (available in the OpenAI, Anthropic, and Google APIs) with a defined schema. Every model in the pipeline returns an instance of the same Pydantic/dataclass schema, validated before downstream use.
```python
from enum import Enum

from pydantic import BaseModel

class Verdict(str, Enum):
    APPROVE = "approve"
    REJECT = "reject"
    DEFER = "defer"

class GovernanceDecision(BaseModel):
    verdict: Verdict
    confidence: float  # 0.0 – 1.0
    rationale: str
    concerns: list[str]  # Specific issues that informed the verdict
```
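Validating a raw model response then becomes one call. This sketch uses Pydantic v2's `model_validate_json`, and repeats the schema so it runs standalone; off-schema JSON raises a `ValidationError` instead of flowing downstream:

```python
from enum import Enum

from pydantic import BaseModel, ValidationError

# Schema as defined above, repeated so this snippet runs on its own.
class Verdict(str, Enum):
    APPROVE = "approve"
    REJECT = "reject"
    DEFER = "defer"

class GovernanceDecision(BaseModel):
    verdict: Verdict
    confidence: float
    rationale: str
    concerns: list[str]

raw = '{"verdict": "defer", "confidence": 0.55, "rationale": "Unclear cost impact.", "concerns": ["cost"]}'
decision = GovernanceDecision.model_validate_json(raw)  # Pydantic v2 API
print(decision.verdict.value)  # defer

try:
    GovernanceDecision.model_validate_json('{"verdict": "maybe"}')
except ValidationError:
    print("rejected off-schema response")
```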
## When to use a reasoning model
Reasoning-specialised models (those trained with extended chain-of-thought) excel at tasks with:
- Multiple interdependent constraints (e.g. "design an architecture that meets these 8 requirements")
- Problems where the wrong answer is confidently wrong (high-stakes classification)
- Long multi-step derivations where errors compound
They are slower and more expensive than standard models. Use them for planning and decision-making; use standard models for generation and summarisation. A pipeline that uses a reasoning model for architecture decisions and a standard model for generating the code that implements those decisions is both fast and accurate.
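That split can be sketched as a two-stage pipeline. Both model calls are stubbed here and the model names are hypothetical; a real version would call the providers' async APIs:

```python
import asyncio

# Stub in place of real provider calls; model names are hypothetical.
async def call_model(model: str, prompt: str) -> str:
    if model == "reasoning-model":
        return "Plan: expose a /health endpoint returning build metadata."
    return "def health():\n    return {'status': 'ok'}"

async def plan_then_generate(requirements: str) -> tuple[str, str]:
    # Slow, expensive reasoning model makes the architectural decision...
    plan = await call_model("reasoning-model", f"Plan an architecture for: {requirements}")
    # ...then a fast standard model generates the code that implements it.
    code = await call_model("standard-model", f"Implement this plan:\n{plan}")
    return plan, code

plan, code = asyncio.run(plan_then_generate("a service health check"))
print(plan)
```

The reasoning model is invoked once per decision, while the cheaper generation step can run as many times as needed against the fixed plan.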
## Multi-model reasoning in the ticketyboo scanner
The scanner's finding analysis uses a single model for the initial scan pass (optimised for throughput — analysing potentially hundreds of files). The remediation recommendation generation uses a second, more capable model with a longer context window to reason about the full file before suggesting a fix.
For the governance decision tree tool on this site, a lightweight model classifies the decision type and routes it to the appropriate GATEKEEP checklist. The full reasoning model is only invoked if the decision touches security or cost — the two categories where confident wrong answers have real consequences.