Teaching a scanner to learn from its mistakes

The ticketyboo.dev scanner checks GitHub repositories for governance, dependency, security, and IaC issues. Like most rule-based scanners, it started with fixed heuristics: if no README exists, raise a finding. If a .env file is committed, raise a finding. Reliable, but static. Once written, the accuracy is frozen.

The problem with frozen accuracy is that its cost compounds over time. A scanner that flags "Missing CODEOWNERS file" for every single-contributor hobby repo will train users to ignore it. A scanner that misses hardcoded AWS account IDs in Terraform because that pattern wasn't in the original rules will keep missing them. The signal degrades until no one trusts it.

This demo adds a learning loop. It has four components.

The four components

Figure: Four-component learning loop. The dashed arrow is the feedback cycle: lessons from yesterday's aggregation improve tomorrow's scans.

Component 1: Confidence scores on findings

The scanner's Finding dataclass gets a confidence field (0.0 to 1.0, default 0.7). On each scan, after findings are deduplicated, the scanner applies confidence scores based on what it knows from the lessons document: suppressed patterns get 0.3, boosted patterns get 0.95, everything else gets category-level defaults (security 0.85, governance 0.75, etc.).
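A minimal sketch of that scoring step, assuming a simplified Finding with only the fields needed here. The function name apply_confidence, the field names other than confidence, the fallback for unlisted categories, and the rule that a boost wins over a suppress are all assumptions for illustration; only the numeric values come from the description above.

```python
from dataclasses import dataclass

# Security and governance defaults are from the article. Other categories
# fall back to 0.7 here, which is an assumption.
CATEGORY_DEFAULTS = {"security": 0.85, "governance": 0.75}
FALLBACK_CONFIDENCE = 0.7
SUPPRESSED_CONFIDENCE = 0.3
BOOSTED_CONFIDENCE = 0.95


@dataclass
class Finding:
    category: str
    file: str
    title: str
    confidence: float = 0.7


def apply_confidence(findings, suppress_titles, boost_titles):
    """Annotate findings in place. Nothing is filtered out."""
    for finding in findings:
        if finding.title in boost_titles:
            # Assumption: a boost takes precedence if a title is in both lists.
            finding.confidence = BOOSTED_CONFIDENCE
        elif finding.title in suppress_titles:
            finding.confidence = SUPPRESSED_CONFIDENCE
        else:
            finding.confidence = CATEGORY_DEFAULTS.get(
                finding.category, FALLBACK_CONFIDENCE
            )
    return findings
```

The function returns the same number of findings it was given, which matches the point that confidence annotates rather than filters.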

Confidence doesn't filter findings. It annotates them. A finding with 0.3 confidence still appears, but the UI can surface it differently. More importantly, it becomes a signal for the feedback loop: if a low-confidence finding is consistently marked accurate, that's a boost candidate.

Component 2: Human feedback API

One endpoint: POST /api/scan/{scan_id}/feedback. Body contains the finding ID (a stable hash of category + file + title), a verdict (accurate or false_positive), and an optional note. Rate-limited to prevent abuse. No authentication: the scanner is already public.

The finding ID is deterministic. The same finding (same category, file, title) will produce the same ID across different scans of the same repo. This makes it possible to correlate feedback across scan runs without storing links between them.
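One way to build such an ID, as a sketch: the article only says the ID is a stable hash of category, file, and title, so the choice of SHA-256, the separator, and the 16-character truncation are assumptions.

```python
import hashlib


def finding_id(category: str, file: str, title: str) -> str:
    """Stable ID derived only from category + file + title."""
    # A separator that is unlikely to appear in the fields avoids collisions
    # such as ("ab", "c", ...) versus ("a", "bc", ...).
    key = "\x1f".join((category, file, title))
    # Truncation to 16 hex characters is an assumption for readability.
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
```

Because nothing scan-specific (scan ID, timestamp) goes into the hash, two scans of the same repo yield the same ID for the same finding.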

Feedback is stored in a separate DynamoDB table (scanner-feedback) with a GSI on verdict type and a 90-day TTL. The aggregator queries both verdict types via that GSI and merges them.
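A sketch of what a stored feedback item might look like. The attribute names (scan_id, finding_id, created_at, expires_at) and the function name are hypothetical; the verdict values and the 90-day expiry are from the text. DynamoDB TTL works on a Number attribute holding an epoch timestamp in seconds, so the expiry is computed that way.

```python
import time

FEEDBACK_TTL_DAYS = 90
VALID_VERDICTS = {"accurate", "false_positive"}


def build_feedback_item(scan_id, finding_id, verdict, note=None, now=None):
    """Build a dict shaped for a DynamoDB put. Attribute names are assumptions."""
    if verdict not in VALID_VERDICTS:
        raise ValueError(f"verdict must be one of {sorted(VALID_VERDICTS)}")
    created = int(now if now is not None else time.time())
    item = {
        "scan_id": scan_id,
        "finding_id": finding_id,
        "verdict": verdict,  # the GSI key in this sketch
        "created_at": created,
        # TTL attribute: epoch seconds after which the item may be expired.
        "expires_at": created + FEEDBACK_TTL_DAYS * 86400,
    }
    if note:
        item["note"] = note  # the note is optional, so it is omitted when empty
    return item
```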

Component 3: Nightly aggregator

An EventBridge cron triggers a Lambda at 02:00 UTC daily. It reads all feedback from the last 30 days, calculates accuracy metrics, identifies patterns, calls a mid-tier model to synthesise a lessons-learned document, writes it to S3, and writes the day's accuracy and false positive rates to a metrics table.
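The metrics step can be sketched as follows, assuming each feedback record is a dict with a verdict key. The record shape and the function name daily_metrics are illustrative, not taken from the scanner.

```python
def daily_metrics(feedback):
    """Return accuracy and false-positive rates for a list of feedback dicts."""
    total = len(feedback)
    if total == 0:
        # No feedback yet: report None rather than a misleading 0% or 100%.
        return {"total": 0, "accuracy_rate": None, "false_positive_rate": None}
    accurate = sum(1 for item in feedback if item["verdict"] == "accurate")
    false_positive = sum(
        1 for item in feedback if item["verdict"] == "false_positive"
    )
    return {
        "total": total,
        "accuracy_rate": accurate / total,
        "false_positive_rate": false_positive / total,
    }
```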

The aggregator logic is explicit about what it looks for:

Suppress candidates:   categories where FP rate > 60% (at least 5 samples)
Boost candidates:      findings with confidence < 0.6 that were marked accurate
Watch-for patterns:    recurring findings across multiple repos (from reviewer notes)
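The first two rules translate directly into code. This sketch assumes each feedback record has already been joined with its finding's category, title, and the confidence it was shown with; that record shape and both function names are assumptions. The thresholds are the ones listed above.

```python
from collections import defaultdict

FP_RATE_THRESHOLD = 0.60
MIN_SAMPLES = 5
LOW_CONFIDENCE = 0.6


def suppress_candidates(feedback):
    """Categories with FP rate above 60% and at least 5 samples."""
    counts = defaultdict(lambda: [0, 0])  # category -> [false_positives, total]
    for item in feedback:
        counts[item["category"]][1] += 1
        if item["verdict"] == "false_positive":
            counts[item["category"]][0] += 1
    return {
        category: fp / total
        for category, (fp, total) in counts.items()
        if total >= MIN_SAMPLES and fp / total > FP_RATE_THRESHOLD
    }


def boost_candidates(feedback):
    """Titles of low-confidence findings that a human marked accurate."""
    return sorted(
        {
            item["title"]
            for item in feedback
            if item["verdict"] == "accurate" and item["confidence"] < LOW_CONFIDENCE
        }
    )
```

The minimum sample count matters: a category with two false positives out of two has a 100% FP rate but is not yet evidence of anything.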

The model call takes roughly 4K tokens in, 1K out. At current pricing that's about $0.01. With CloudWatch and DynamoDB, the full daily run costs under $0.02.

The prompt instructs the model to produce structured markdown with three sections: Suppress, Boost, Watch for. The structure is rigid by design. The scanner's _apply_confidence() function parses these sections line by line to extract finding titles for the suppress and boost lists.
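A line-by-line parser for that structure might look like the sketch below. It is not the scanner's actual _apply_confidence() code: the function name parse_lessons, the regular expressions, and the rule of preferring a quoted title and otherwise taking the text before the first parenthesis are assumptions.

```python
import re

SECTION_RE = re.compile(r"^##\s+(Suppress|Boost|Watch for)\b", re.IGNORECASE)
QUOTED_RE = re.compile(r'"([^"]+)"')


def parse_lessons(markdown: str) -> dict:
    """Extract bullet titles from the Suppress, Boost, and Watch for sections."""
    sections = {"suppress": [], "boost": [], "watch for": []}
    current = None
    for raw in markdown.splitlines():
        line = raw.strip()
        header = SECTION_RE.match(line)
        if header:
            current = header.group(1).lower()
            continue
        if line.startswith("#"):
            current = None  # any other heading ends the current section
            continue
        if current and line.startswith("- "):
            text = line[2:].strip()
            quoted = QUOTED_RE.search(text)
            # Prefer a quoted finding title; otherwise keep the text before " (".
            title = quoted.group(1) if quoted else text.split(" (")[0]
            sections[current].append(title)
    return sections
```

Anything outside the three known sections is ignored, so a model that adds an extra heading or a preamble does not corrupt the suppress and boost lists.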

Component 4: RAG injection

At the start of each scan, the scanner calls _load_lessons(), which does a single S3 GetObject for scanner/lessons-learned.md. If the file doesn't exist, the call returns None and the scan proceeds with defaults. Graceful degradation is the whole point: the learning loop is additive, not required.
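A sketch of that load step, assuming a boto3 S3 client is passed in. The function signature and the bucket argument are assumptions; the key is the one named above. The client is injected rather than created inside the function so the fallback path can be exercised without AWS.

```python
def load_lessons(s3_client, bucket, key="scanner/lessons-learned.md"):
    """Return the lessons document as text, or None if it does not exist."""
    try:
        response = s3_client.get_object(Bucket=bucket, Key=key)
    except s3_client.exceptions.NoSuchKey:
        # Missing document: the scan proceeds with default confidence values.
        return None
    return response["Body"].read().decode("utf-8")
```

With boto3 this would be called as load_lessons(boto3.client("s3"), bucket_name), where bucket_name is whatever bucket the aggregator writes to. One caveat worth checking in a real deployment: if the caller's role lacks s3:ListBucket on the bucket, S3 reports a missing key as an access-denied error rather than NoSuchKey, and this sketch would not catch it.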

The lessons document is passed into ScanContext as lessons_context. The scanner's deep analysis layers (SAST, secret detection, IaC) can optionally inject the relevant section into their LLM prompts. Even for shallow scans, the confidence scoring logic reads the lessons document to adjust per-finding confidence values.
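As an illustration of how a deep analysis layer could use lessons_context, here is a sketch. Only the names ScanContext and lessons_context come from the text; the repo field, the build_prompt helper, and the prompt wording are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScanContext:
    repo: str
    lessons_context: Optional[str] = None  # None when no lessons document exists


def build_prompt(ctx: ScanContext, task: str) -> str:
    """Prepend the lessons document to an analysis prompt when one is available."""
    parts = []
    if ctx.lessons_context:
        parts.append("Lessons from previous scans:\n" + ctx.lessons_context.strip())
    parts.append(f"Repository: {ctx.repo}")
    parts.append(task)
    return "\n\n".join(parts)
```

When lessons_context is None the prompt is simply the repository line and the task, so the layer behaves exactly as it did before the learning loop existed.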

What the lessons document looks like

# Scanner Lessons Learned
Generated: 2026-03-27 02:00 UTC
Based on: 47 scans, 89 feedback items

## Suppress (commonly false positive)
- "Missing CODEOWNERS file" in repos with < 3 contributors (FP rate: 78%)
- "No branch protection" on personal/demo repos (FP rate: 85%)

## Boost (commonly missed but accurate)
- Hardcoded AWS account IDs in terraform (accuracy: 92%, often missed)
- Missing rate limiting on public API endpoints (accuracy: 88%)

## Watch for (emerging patterns)
- Repos using deprecated Node.js 16 runtime (seen in 4 of last 10 scans)
- Missing .gitignore entries for .env files (seen in 6 of last 10 scans)

This text is injected verbatim into the scan prompt of any analysis layer that uses it. The model sees it as context before making any decisions. The cost of the S3 read is negligible. The benefit compounds over weeks as patterns accumulate.

What this isn't

This is not a training loop. The model weights don't change. The scanner doesn't get smarter in a deep learning sense. What changes is the context it reasons from. That's a much smaller, cheaper, more controllable mechanism, and for a tool at this scale, it's the right mechanism.

It's also not a fully automated loop. The feedback requires a human to mark findings. That's intentional. Fully automated feedback (using one model's output to improve another's input) collapses into an echo chamber faster than you'd expect. A human signal, even occasional, anchors the loop to reality.

Figure: Simulated accuracy improvement. Real numbers will vary, but the trend is the point.

The accuracy dashboard

The scanner-learning demo page shows the metrics written by the aggregator: accuracy rate trend, false positive rate, and the current lessons document. It also provides the feedback form for submitting verdicts directly.

With 30 days of feedback, the suppression and boost lists become meaningful. Before that, the aggregator runs but produces a mostly empty document. That's fine. The loop doesn't need to produce value on day one.

Patterns at work

This demo implements five patterns from the agentic design pattern taxonomy:

Memory (8): scan history and feedback persisted in DynamoDB, lessons in S3
Learning and adaptation (9): nightly aggregator improves future scan prompts from feedback
Human-in-the-loop (13): humans mark findings; that signal anchors the learning cycle
RAG (14): lessons document injected as context before scanning
Evaluation and monitoring (19): accuracy and false positive rates tracked daily in DynamoDB

Pattern taxonomy from Antonio Gulli, "Agentic Design Patterns: A Hands-On Guide to Building Intelligent Systems" (Springer, 2025). All examples are original implementations from the ticketyboo.dev platform.
