What Paperclip was
Paperclip was a hosted AI agent orchestrator. It ran on a Lightsail instance, managed agent definitions, scheduled heartbeats, routed LLM calls, and maintained a ticket system for agent-created work items. The ticketyboo.dev ops team — CTO, SRE, Security, and Cost agents — ran through it.
The architecture looked like this:
EventBridge schedule
→ Lightsail (Paperclip API)
→ OpenRouter (LLM call)
→ Paperclip ticket system
→ Lambda proxy (team_proxy.py)
→ Browser (team dashboard)
Five links. Each one a failure point. The Lightsail instance needed SSH access for debugging. The Paperclip API needed its own auth tokens. The LLM calls needed an OpenRouter key and had variable latency. The ticket system was Paperclip-internal, so board data couldn't be accessed without the API being up. The Lambda proxy had to translate Paperclip's response format into the frontend contract.
What failed
Nothing dramatic. The system just wasn't reliable enough for what it was doing. The Lightsail instance would occasionally become unreachable. LLM calls would time out or return unexpected formats. The Paperclip API had its own deployment cycle that didn't align with the platform's. When any link in the chain broke, the team dashboard showed stale data or nothing at all.
The deeper problem: the ops agents don't need LLMs. The SRE agent checks CloudWatch metrics against thresholds. The Cost agent queries Cost Explorer and compares against a budget. The Security agent scans IAM policies for known anti-patterns. These are deterministic operations. Wrapping them in an LLM call added latency, cost, and unpredictability without adding capability.
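The point is easy to see in code. A check like the SRE heartbeat reduces to a pure threshold comparison — the output is fully determined by the inputs, so there is nothing for a model to reason about. The metric names and thresholds below are illustrative, not the platform's real values:

```python
# Hypothetical threshold table for the SRE-style checks described above.
THRESHOLDS = {
    "lambda_error_rate": 0.01,    # alert above 1% errors
    "p95_latency_ms": 800,        # alert above 800 ms
    "cache_hit_ratio_min": 0.85,  # alert below 85% hits
}

def evaluate_metrics(metrics: dict) -> dict:
    """Compare observed metrics against fixed thresholds.

    Deterministic: same inputs always produce the same verdicts.
    """
    return {
        "lambda_error_rate": (
            "ok" if metrics["lambda_error_rate"] <= THRESHOLDS["lambda_error_rate"] else "alert"
        ),
        "p95_latency_ms": (
            "ok" if metrics["p95_latency_ms"] <= THRESHOLDS["p95_latency_ms"] else "alert"
        ),
        "cache_hit_ratio": (
            "ok" if metrics["cache_hit_ratio"] >= THRESHOLDS["cache_hit_ratio_min"] else "alert"
        ),
    }
```

Wrapping a function like this in an LLM call can only make it slower and less predictable; it cannot make it more correct.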
What replaced it
EventBridge schedule
→ Lambda (agent function)
→ DynamoDB (team-activity table)
→ Lambda proxy (team_proxy.py)
→ Browser
Three links instead of five. No Lightsail. No LLM. No external ticket system. Each agent is a Lambda function that runs on a schedule, does its checks, and writes results directly to DynamoDB. The proxy reads from DynamoDB and GitHub Issues (for the board). Done.
The agent functions share a base module (agent_base.py) that handles DynamoDB writes, TTL management, and the agent ID registry. Each agent implements a single handler function:
# agents/sre_handler.py — simplified
from agents.agent_base import write_run, write_activity, write_status

def handler(event, context):
    """SRE heartbeat: check CloudWatch, write results."""
    checks = run_health_checks()  # CloudWatch, Lambda errors, cache ratio
    for service, result in checks.items():
        write_status(service, result["status"], result.get("details"))

    status = "succeeded" if all_healthy(checks) else "failed"
    write_run("sre", status, source="scheduled", summary=summarise(checks))
    write_activity("sre", "heartbeat.invoked")
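The write helpers themselves can be sketched roughly like this — the table name, key schema, and field names below are assumptions for illustration, not the real agent_base.py:

```python
# Sketch of an agent_base-style write helper (schema names are hypothetical).
import time
import uuid

TTL_DAYS = 30  # items auto-expire via the table's DynamoDB TTL attribute

def build_run_item(agent: str, status: str, source: str, summary: str) -> dict:
    """Build the DynamoDB item for one agent run, stamped with a 30-day TTL."""
    now = int(time.time())
    return {
        "pk": f"run#{agent}",
        "sk": str(uuid.uuid4()),
        "agent": agent,
        "status": status,
        "source": source,
        "summary": summary,
        "created_at": now,
        # Epoch seconds; DynamoDB's TTL feature deletes the item after this.
        "ttl": now + TTL_DAYS * 24 * 3600,
    }

def write_run(agent, status, source="scheduled", summary=""):
    """Persist one run record to the team-activity table."""
    import boto3
    table = boto3.resource("dynamodb").Table("team-activity")
    table.put_item(Item=build_run_item(agent, status, source, summary))
```

Keeping the item construction separate from the boto3 call makes the TTL logic trivially testable without AWS credentials.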
What stayed the same
The frontend contract didn't change: team.js still calls /api/team/runs, /api/team/activity, etc.
The field allowlist in team_proxy.py still filters every response.
The agent UUIDs were preserved so the dashboard's agent registry didn't need updating.
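The allowlist filter amounts to only a few lines; the field names here are illustrative rather than the proxy's real set:

```python
# Sketch of a team_proxy.py-style field allowlist (names are assumptions).
RUN_FIELDS = {"agent", "status", "source", "summary", "created_at"}

def filter_fields(items: list, allowed: set) -> list:
    """Drop any field not on the allowlist before the response reaches the browser."""
    return [{k: v for k, v in item.items() if k in allowed} for item in items]
```

Because the filter sits in the proxy, swapping Paperclip for DynamoDB behind it changed nothing the frontend could observe.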
The governance model survived intact. The approval gates, escalation rules, and auto-reject conditions are the same declarative JSON they always were. The rules don't care whether a Paperclip process or a Lambda function evaluates them — they define what's allowed, not how it's enforced.
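To make that concrete, here is a minimal sketch of a declarative rule and an evaluator. The rule shape and field names are invented for illustration; the real governance JSON will differ, but the separation is the same — the JSON says what's allowed, and any runtime can evaluate it:

```python
import json

# A hypothetical governance rule in the declarative style described above.
RULE = json.loads("""
{
  "action": "scale_service",
  "auto_approve_if": {"max_instances": 3},
  "reject_if": {"environment": "production", "requires_approval": true}
}
""")

def evaluate(rule: dict, request: dict) -> str:
    """Return 'approve', 'reject', or 'escalate' for a request against one rule."""
    reject = rule.get("reject_if", {})
    if reject and all(request.get(k) == v for k, v in reject.items()):
        return "reject"
    auto = rule.get("auto_approve_if", {})
    if request.get("instances", 0) <= auto.get("max_instances", 0):
        return "approve"
    return "escalate"  # anything else goes to a human
```

Nothing in the evaluator cares whether it runs inside an orchestrator or a Lambda function, which is exactly why the rules survived the migration untouched.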
Board data moved from Paperclip's internal ticket system to GitHub Issues. This is better: issues are visible without the ops infrastructure being up, they have a public URL, and they integrate with the existing PR workflow.
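Mapping issues onto board columns is a small, pure transformation. A sketch of the idea, assuming GitHub's REST API issue shape (state, labels, html_url) and an invented "in-progress" label convention:

```python
# Sketch: group GitHub Issues into board columns (label names are assumptions).
def issues_to_board(issues: list) -> dict:
    """Map GitHub issue dicts onto three board columns by state and label."""
    board = {"todo": [], "in_progress": [], "done": []}
    for issue in issues:
        labels = {label["name"] for label in issue.get("labels", [])}
        if issue.get("state") == "closed":
            column = "done"
        elif "in-progress" in labels:
            column = "in_progress"
        else:
            column = "todo"
        board[column].append({"title": issue["title"], "url": issue["html_url"]})
    return board
```

Because the input is plain issue JSON, the board renders from a single unauthenticated API call even when everything else is down.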
What it costs
Before: Lightsail instance ($3.50/month, the cheapest tier) plus OpenRouter LLM calls ($2–5/month depending on heartbeat frequency).
After: Lambda invocations and DynamoDB writes, both well within Free Tier. The four agents run a combined ~150 invocations/day. At 128MB memory and sub-second execution, that's nowhere near the 1M free requests/month. DynamoDB writes with 30-day TTL auto-expire, keeping the table small.
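The headroom is easy to verify with back-of-envelope arithmetic, using the figures above and a conservative 1 s per invocation:

```python
# Free Tier headroom check (invocation count from the paragraph above;
# 1 second per invocation is a deliberately pessimistic assumption).
invocations = 150 * 30                          # ~4,500 requests/month
gb_seconds = invocations * 1.0 * (128 / 1024)   # ~562.5 GB-s/month of compute
request_headroom = 1_000_000 / invocations      # vs the 1M free requests/month
compute_headroom = 400_000 / gb_seconds         # vs the 400K free GB-s/month
```

Roughly 200x under the request cap and 700x under the compute cap, so even a large increase in heartbeat frequency stays free.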
Net saving: ~$5–8/month. Not life-changing, but on a platform that's committed to Free Tier, every dollar matters.
The pattern
If your AI ops agents are doing deterministic work — checking metrics, scanning configs, comparing values against thresholds — you probably don't need an agent orchestrator. You need scheduled functions and a results table. The orchestrator adds value when agents need to reason, plan multi-step actions, or coordinate with each other through natural language. For everything else, it's overhead.
The declarative governance model (rules as JSON, approval gates, escalation conditions) is worth keeping regardless of execution engine. Separate the rules from the runtime. The rules are the valuable part.
The old configuration lives on in tools/paperclip-config/ — agent definitions, governance rules, skills. They're useful as documentation of the declarative pattern even though the runtime that consumed them is gone.
If the articles or tools have been useful, a coffee helps keep things running.