Context rot on consumer hardware

Summary

Who it's for Engineers running local models or self-hosted inference on constrained hardware (8–16GB RAM), and teams operating long-running agent sessions on any hardware where context management is not automatic.

Key observations

Quality degradation starts at 60% context fill for 4k models, 70% for 32k, and ~80% for 128k. The degradation is gradual until forced truncation occurs, then it becomes a cliff edge.
Forced truncation (8GB RAM hitting physical limits) causes 23% faster quality loss than soft-limit graceful degradation on larger hardware.
Periodic context compression recovers 70-80% of quality loss. A sliding window approach outperforms full reset on continuity tasks.
The root cause is attention sink behaviour: early tokens in the sequence (system instructions) receive disproportionately less attention weight as context grows. This was characterised in StreamingLLM (Xiao et al., 2023).

"Context rot" is not an official term. It is a useful name for a real phenomenon: the progressive degradation of agent output quality as a long-running session fills its context window. The degradation is not uniform. It has a shape.

Understanding that shape matters on constrained hardware because the cliff edges are closer together. An 8GB RAM system running a quantised 7B model with a 4k context has less room to manoeuvre than the same task running on a hosted API with 128k tokens of context. The patterns are the same; the margins are not.

Why context degrades: attention dilution

Transformer attention is not uniform across the context. The attention sink phenomenon, characterised in StreamingLLM (Xiao et al., 2023), describes how the initial tokens in a sequence attract disproportionately high attention weight regardless of their content. System instructions placed at position 0 receive high initial attention. As the context grows, those tokens still exist, but they are now competing with thousands of subsequent tokens for the model's attention budget during generation.

The practical consequence: a system prompt that specifies "always respond in JSON" or "do not modify files outside the working directory" carries less effective weight at 80% context fill than it did at 10% context fill. The instruction is there. The model can still retrieve it. But the probability that the instruction shapes the next generation step decreases as the instruction's relative positional weight in the attention distribution decreases.

LongBench benchmark results confirm this: task performance on instruction-following subtasks degrades more steeply than on retrieval subtasks as context length grows. Retrieval tasks (find this fact in this document) are robust at high context fill. Instruction-following tasks (behave according to these rules throughout this session) are not.

Measuring the degradation

To produce comparable figures across context sizes, quality score is defined as the average of three task-specific metrics: instruction adherence (does the output follow the specified format and constraints), factual consistency (does the output contradict earlier context), and task completion (is the primary objective met). Each is scored 0-100. The composite is their mean.

Benchmarks run across three context sizes: 4k, 32k, and 128k. The 4k and 32k cases use local quantised models on 8GB RAM. The 128k case uses hosted inference. The 8GB RAM physical limit forces hard truncation at high fill percentages on the 4k model, which is where the divergence from graceful degradation becomes measurable.

Quality score (composite of instruction adherence, factual consistency, task completion) vs context fill. 4k models under forced truncation show the steepest degradation curve. All three sizes degrade before reaching 100% fill.

The forced truncation cliff

The 23% gap between forced truncation and graceful degradation is the most operationally significant finding. Graceful degradation (larger context, soft limits, managed rolloff) produces a shallow curve: quality at 90% fill is still 40 points. Forced truncation on 8GB RAM, where the kernel starts dropping the oldest context blocks to fit in physical memory, produces a cliff: quality at 80% fill is 40 points, but at 90% fill it has dropped to 20.

The mechanism is different from attention dilution. Forced truncation removes tokens from the middle of the context, not from the tail. The model receives a context that is structurally incoherent: references to prior reasoning steps that no longer exist in the window, partial tool call outputs, truncated code blocks. The model doesn't know the context was cut. It attempts to reason from a corrupted input.

This is why the cliff is steeper than the gradual attention dilution curve. Attention dilution is a continuous degradation. Forced truncation introduces structural corruption at the point where physical memory pressure begins. The transition is sudden.

Four recovery strategies

There are four common approaches to managing context fill before it causes degradation. They differ in their quality preservation and continuity properties.

Full context: no management. Accept degradation as fill increases. Maximum continuity until the cliff.

Sliding window: keep the most recent N tokens. Drop the oldest. Quality is preserved because recent context is always fresh. Continuity suffers because early context (original instructions, initial state) is eventually dropped.

Periodic compression: at a defined fill threshold (typically 70-75%), summarise the oldest portion of the context into a compact representation. Inject the summary. Continue. Quality preservation is high. Continuity is good. The summary loses detail but retains structure.

Full reset: at the threshold, clear the context entirely and restart with a fresh system prompt. Quality resets to baseline. Continuity is zero.

Scores measured at 85% context fill across a 20-task evaluation set. Full reset maximises quality but destroys continuity. Periodic compression is the best combined score. Sliding window trades continuity for quality.

Implementing periodic compression

The compression trigger is straightforward: monitor context length as a percentage of the model's maximum, and compress when it crosses 70%. The compression itself is the harder part. A naive summary loses structural information that the agent needs to continue coherently.

class ContextManager:
    """Manages context compression for long-running agent sessions."""

    def __init__(self, model_ctx_limit: int, compress_at: float = 0.70) -> None:
        self.limit = model_ctx_limit
        self.compress_at = compress_at

    def needs_compression(self, current_tokens: int) -> bool:
        """True if context fill exceeds the compression threshold."""
        return current_tokens / self.limit >= self.compress_at

    def compress(
        self,
        messages: list[dict],
        system_prompt: str,
        summariser_fn: callable
    ) -> list[dict]:
        """
        Summarise the oldest half of the message history.
        Always preserve: system prompt, last 5 exchanges, any tool results
        that are referenced in recent messages.
        """
        if len(messages) <= 10:
            return messages

        preserve_recent = messages[-10:]
        compress_range = messages[:-10]

        # Never compress system prompt position 0
        summary_text = summariser_fn(compress_range)
        summary_msg = {
            "role": "system",
            "content": (
                f"[Context summary: the following occurred before this point]\n"
                f"{summary_text}\n"
                f"[End of context summary. Continuing session below.]"
            )
        }
        return [{"role": "system", "content": system_prompt}, summary_msg] + preserve_recent

The important constraint in the compression function: the system prompt always stays at position 0, separate from the summary. Collapsing the system prompt into the summary is the most common error in context management implementations. The system prompt needs to occupy the initial attention sink positions, not be buried inside a compressed block at position 1.

This is the direct application of StreamingLLM's attention sink finding: the model allocates high initial attention to the first few tokens. If those tokens are your system instructions, quality is preserved. If those tokens are a compression summary that buries the instructions, quality degrades the same way it would with a full context.

Sliding window implementation

Sliding window is simpler to implement and produces better quality scores at high fill than doing nothing, but it sacrifices continuity. The right use case is stateless or near-stateless tasks where each step is largely independent of steps more than a few exchanges back.

def sliding_window(
    messages: list[dict],
    system_prompt: str,
    max_tokens: int,
    token_counter: callable
) -> list[dict]:
    """
    Keep the most recent messages that fit within max_tokens.
    Always preserve the system prompt at position 0.
    """
    base = [{"role": "system", "content": system_prompt}]
    base_tokens = token_counter(base)
    budget = max_tokens - base_tokens

    kept = []
    accumulated = 0
    for msg in reversed(messages):
        t = token_counter([msg])
        if accumulated + t > budget:
            break
        kept.insert(0, msg)
        accumulated += t

    return base + kept

The quality difference between sliding window (72) and periodic compression (76) is modest. The continuity difference (55 vs 78) is substantial. On tasks where an agent needs to refer back to decisions made early in the session, sliding window will lose that context and produce inconsistent outputs. Periodic compression retains a summary of early decisions. The summary is lossy; it is not absent.

Hardware constraints: what 8GB actually means

On 8GB unified memory (M1/M2 MacBook, budget NUC), the usable RAM for model weights and context is approximately 6GB after OS overhead. A quantised 7B model at Q4_K_M uses about 4GB of that. That leaves 2GB for context. At approximately 2 bytes per token in KV cache, 2GB supports roughly 1 million tokens of KV cache in theory. In practice, the runtime overhead, buffer allocations, and fragmentation reduce this to something closer to 200-400k tokens before performance degrades.

For models with a 4k context limit (common for older quantised models and some fine-tunes), this is not a concern: 4,000 tokens at 2 bytes is 8KB of KV cache. The constraint is the model's architectural limit, not the hardware.

For 32k context models, the KV cache requirement is 64MB at 2 bytes per token. Still comfortable on 8GB. For 128k context models, 256MB of KV cache is needed. Add the model weights (4GB) and the OS and runtime overhead (2GB), and you are at the edge of what 8GB can hold without swapping.

The cliff edge in the 4k data above represents the hardware limit, not the model's architectural limit. The model's attention mechanism would continue graceful degradation past 80% fill. The hardware forces a hard cut at the physical memory boundary.

The practical rule: set your compression trigger at 65-70% for 4k/8k models on 8GB hardware. Do not wait for the hardware-forced truncation. The cliff edge below 70% is shallow. Above 80%, you have already lost most of the recoverable quality. Compression at 70% fill recovers 70-80% of the quality loss that would occur without intervention. Compression at 85% fill recovers much less because the context is already structurally degraded.

References: Xiao et al. (2023), "Efficient Streaming Language Models with Attention Sinks" (StreamingLLM). Bai et al. (2023), LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. Attention sink behaviour and context scaling analysis from these papers informs the degradation model described here.

Working on something like this?

Fractional CTO and transformation leadership for situations that aren't working. Bring a problem — thirty minutes, no obligation.

Bring a problem → or scan a repo first →

Context rot on consumer hardware

Why context degrades: attention dilution

Measuring the degradation

The forced truncation cliff

Four recovery strategies

Implementing periodic compression

Sliding window implementation

Hardware constraints: what 8GB actually means

Related

Working on something like this?