The governance persona pattern is well established: you give an AI agent a name, a role, and a set of constraints, and it reviews decisions through that lens. A security persona flags risk. A cost persona flags spend. An architecture persona flags drift from the design. This is useful.
What I noticed over time is that personas in a chat interface do a lot of answering and not very much doing. You ask them to review a Terraform plan and they tell you what's wrong. You then go and fix it yourself. The persona's output is advisory — a second opinion, not a completed task.
Brood is the execution engine I built for that: a distributed worker system that executes tasks in parallel, with governance baked into every stage. In production it delivered a 35x speedup over sequential execution across 125 tasks — not because the AI got faster, but because the work stopped being serialised through a single human bottleneck.
Where Brood fits: the Trinity stack
Brood is one piece of a three-part platform I call Trinity: Gatekeep (governance rules and review personas), Brood (distributed task execution with those personas as workers), and ticketyboo.dev (the scanner and articles you're reading now). They work together, but each is useful independently.
Above Brood sits Hudson — the orchestrator that breaks large work packages into individual tasks, submits them to the Brood queue, and monitors progress. Hudson is the "white glove" layer: it handles task prioritisation, retry logic, and reporting back to whoever commissioned the work. Brood doesn't know or care about orchestration; it just picks up tasks and executes them. This separation is intentional — it means you can swap out the orchestrator without changing the worker architecture.
The queue model
The core idea is simple: a task queue where each task has a type, a payload, and a designated worker persona. The persona that picks up the task isn't chosen at random — it's matched to the task type. Security audit tasks go to the security worker. Cost review tasks go to the cost worker. PR review tasks go to the engineering worker.
This sounds obvious, but it changes how you think about agent design. Instead of "what can I ask this agent?", the question becomes "what types of work does this agent own?" Ownership changes accountability. A worker that owns security audit tasks is responsible for producing consistent, structured output on every task — not just a thoughtful paragraph when prompted.
| Task type | Worker persona | Expected output |
|---|---|---|
| security-audit | Security | Findings list with severity, rule ID, suggested fix |
| terraform-plan-review | Cost + Security | Approval/rejection with rationale per gate |
| pr-review | Engineering | Review comments, approval status |
| health-check | SRE | Health score, anomalies, recommended actions |
| cost-report | Cost | Spend breakdown, budget variance, alerts |
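The routing in the table above can be sketched as a plain lookup table. This is a minimal illustration, not the real Brood code — the `ROUTING` dict, the `Task` dataclass, and `personas_for` are hypothetical names:

```python
from dataclasses import dataclass, field

# Hypothetical routing table: task type → owning worker persona(s).
# The mappings mirror the table above; the names are illustrative.
ROUTING = {
    "security-audit": ["security"],
    "terraform-plan-review": ["cost", "security"],
    "pr-review": ["engineering"],
    "health-check": ["sre"],
    "cost-report": ["cost"],
}

@dataclass
class Task:
    id: str
    type: str
    payload: dict = field(default_factory=dict)

def personas_for(task: Task) -> list[str]:
    """Match a task to the persona(s) that own its type; reject unknown types."""
    try:
        return ROUTING[task.type]
    except KeyError:
        raise ValueError(f"No persona owns task type {task.type!r}")
```

The point of the explicit table is that an unowned task type is an error at submission time, not a task that drifts to whichever agent happens to be free.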
Workers, not chat sessions
Each worker is a long-running process, not a stateless function. It holds its persona definition in memory, maintains a conversation thread per task (so multi-step reasoning works correctly), and writes structured output to a results store when the task completes.
The difference from a Lambda function or a one-shot API call is that the worker accumulates context over a run. If the security worker processes twenty audit tasks in sequence, by the end of the batch it has seen patterns across all of them. You can ask it to produce a summary of the session's findings — something a stateless function can't do without going back to the database for everything.
This is deliberate. One of the things I wanted to test was whether agents that process batches produce better outputs than agents that process items in isolation. Early results suggest yes — pattern recognition across a batch catches things that per-item analysis misses.
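The batch-accumulation idea looks roughly like this. A sketch only — `SecurityWorker`, the findings shape, and `session_summary` are illustrative stand-ins for the real worker:

```python
from collections import Counter

class SecurityWorker:
    """Sketch of a long-running worker that accumulates context across a batch."""

    def __init__(self):
        self.session_findings = []  # persists across tasks within one run

    def execute(self, task: dict) -> dict:
        # Stand-in for the real persona call; returns structured findings.
        findings = task["payload"].get("findings", [])
        self.session_findings.extend(findings)
        return {"task_id": task["id"], "findings": findings}

    def session_summary(self) -> dict:
        # The cross-batch view a stateless function can't produce
        # without re-reading everything from storage.
        by_rule = Counter(f["rule_id"] for f in self.session_findings)
        return {"total": len(self.session_findings), "by_rule": dict(by_rule)}
```

After twenty audit tasks, `session_summary()` is where "this rule fired in fourteen of twenty repos" becomes visible as a pattern rather than fourteen isolated findings.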
Infrastructure choices
The queue itself is straightforward SQS: a main task queue with 15-minute visibility timeout (enough time for a slow task), a dead-letter queue for tasks that fail repeatedly, and a DynamoDB table for results and task state.
Workers run on Spot instances rather than Lambda. This is a cost choice (Spot is much cheaper for sustained compute) and a capability choice (some task types benefit from local model inference rather than API calls). The tradeoff is that Spot can be interrupted — which is fine for async work but means task design needs to handle partial progress gracefully.
```python
# Worker loop — simplified
while True:
    task = queue.receive(visibility_timeout=900)
    if not task:
        time.sleep(poll_interval)
        continue
    try:
        result = worker.execute(task)
        results_table.put(task.id, result)
        queue.delete(task)
    except RetryableError as e:
        logger.warning("Task %s retryable: %s", task.id, e)
        # Don't delete: the visibility timeout expires and the task returns to the queue
    except FatalError as e:
        logger.error("Task %s fatal: %s", task.id, e)
        queue.delete(task)
        results_table.put(task.id, {"status": "failed", "error": str(e)})
```
The governance binding
What makes a Brood worker a governance worker rather than just a task processor is the persona binding. Each worker type loads a persona definition at startup: the role, the constraints, the review criteria, the output schema. Here's an SRE agent definition illustrating the pattern (on ticketyboo.dev the ops agents have since migrated to pure Lambda functions, but the declarative shape is the same):
```json
{
  "id": "sre",
  "role": "Site Reliability Engineer",
  "schedule": "hourly",
  "goals": [
    "Monitor CloudWatch alarms for Lambda errors, CloudFront cache ratio, DynamoDB capacity",
    "File tickets via GitHub Issues when anomalies are detected",
    "Escalate to CTO Agent if anomaly severity is critical"
  ],
  "skills": ["aws-health-check"],
  "anomaly_thresholds": {
    "lambda_error_rate_pct": 5,
    "cloudfront_cache_ratio_pct": 80,
    "dynamodb_consumed_capacity_pct": 80
  }
}
```
That definition is the persona binding. The SRE worker applies these thresholds every time — not because the task payload tells it to, but because the constraints are baked in at the process level. A heartbeat that finds Lambda error rate at 6% always results in a ticket, regardless of which agent happens to be processing the run. The constraints don't drift based on the phrasing of the input.
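Applying those thresholds is mechanical, which is the point. A minimal sketch, assuming the thresholds above; `check_anomalies` and the metrics dict shape are illustrative, not the real Brood interface:

```python
# Thresholds from the persona definition, baked in at process start.
THRESHOLDS = {
    "lambda_error_rate_pct": 5,
    "cloudfront_cache_ratio_pct": 80,
    "dynamodb_consumed_capacity_pct": 80,
}

def check_anomalies(metrics: dict) -> list[str]:
    """Return the metrics that breach their thresholds on this heartbeat."""
    anomalies = []
    # Error rate and consumed capacity are "too high" checks;
    # cache ratio is a "too low" check.
    if metrics["lambda_error_rate_pct"] > THRESHOLDS["lambda_error_rate_pct"]:
        anomalies.append("lambda_error_rate_pct")
    if metrics["cloudfront_cache_ratio_pct"] < THRESHOLDS["cloudfront_cache_ratio_pct"]:
        anomalies.append("cloudfront_cache_ratio_pct")
    if metrics["dynamodb_consumed_capacity_pct"] > THRESHOLDS["dynamodb_consumed_capacity_pct"]:
        anomalies.append("dynamodb_consumed_capacity_pct")
    return anomalies
```

A 6% Lambda error rate breaches the 5% threshold every time, regardless of how the task payload is phrased — that determinism is what the binding buys you.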
What I'm still learning
I want to be honest about where this pattern is still rough. Worker-level context accumulation is useful, but it's also a potential failure mode: a worker that's seen a lot of unusual patterns can start seeing them where they don't exist. Resetting worker state between runs (or at batch boundaries) is something I'm still tuning.
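One reset policy under test can be sketched like this — clear accumulated context at a fixed batch boundary so stale patterns don't bias the next run. `BatchedWorker` and the batch size are illustrative assumptions:

```python
class BatchedWorker:
    """Sketch: hard-reset accumulated context at batch boundaries."""

    def __init__(self, batch_size: int = 20):
        self.batch_size = batch_size
        self.context = []    # accumulated per-batch observations
        self.processed = 0

    def observe(self, finding) -> None:
        self.context.append(finding)
        self.processed += 1
        if self.processed % self.batch_size == 0:
            self.context.clear()  # boundary reached: drop the batch's context
```

The open question is where the boundary should sit — too small and you lose the cross-batch pattern recognition, too large and the worker's priors drift.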
The persona binding also means errors are consistent — if the persona definition has a blind spot, every task it processes will have that blind spot. Human review of worker outputs isn't optional; it's how you catch systematic drift before it compounds across a large batch.
I think of Brood as scaffolding for AI ops work that I haven't fully figured out yet. It's working well enough that I keep building on it, but not well enough that I'd describe it as solved. That feels like an honest place to be.