"Context rot" is not an official term. It is a useful name for a real phenomenon: the progressive degradation of agent output quality as a long-running session fills its context window. The degradation is not uniform. It has a shape.
Understanding that shape matters on constrained hardware because the cliff edges are closer together. An 8GB RAM system running a quantised 7B model with a 4k context has less room to manoeuvre than the same task running on a hosted API with 128k tokens of context. The patterns are the same; the margins are not.
Why context degrades: attention dilution
Transformer attention is not uniform across the context. The attention sink phenomenon, characterised in StreamingLLM (Xiao et al., 2023), describes how the initial tokens in a sequence attract disproportionately high attention weight regardless of their content. System instructions placed at position 0 receive high initial attention. As the context grows, those tokens still exist, but they are now competing with thousands of subsequent tokens for the model's attention budget during generation.
The practical consequence: a system prompt that specifies "always respond in JSON" or "do not modify files outside the working directory" carries less effective weight at 80% context fill than it did at 10% context fill. The instruction is there. The model can still retrieve it. But the probability that the instruction shapes the next generation step decreases as the instruction's relative positional weight in the attention distribution decreases.
LongBench benchmark results confirm this: task performance on instruction-following subtasks degrades more steeply than on retrieval subtasks as context length grows. Retrieval tasks (find this fact in this document) are robust at high context fill. Instruction-following tasks (behave according to these rules throughout this session) are not.
Measuring the degradation
To produce comparable figures across context sizes, quality score is defined as the average of three task-specific metrics: instruction adherence (does the output follow the specified format and constraints), factual consistency (does the output contradict earlier context), and task completion (is the primary objective met). Each is scored 0-100. The composite is their mean.
Benchmarks run across three context sizes: 4k, 32k, and 128k. The 4k and 32k cases use local quantised models on 8GB RAM. The 128k case uses hosted inference. The 8GB RAM physical limit forces hard truncation at high fill percentages on the 4k model, which is where the divergence from graceful degradation becomes measurable.
The forced truncation cliff
The 23% gap between forced truncation and graceful degradation is the most operationally significant finding. Graceful degradation (larger context, soft limits, managed rolloff) produces a shallow curve: quality at 90% fill is still 40 points. Forced truncation on 8GB RAM, where the kernel starts dropping the oldest context blocks to fit in physical memory, produces a cliff: quality at 80% fill is 40 points, but at 90% fill it has dropped to 20.
The mechanism is different from attention dilution. Forced truncation removes tokens from the middle of the context, not from the tail. The model receives a context that is structurally incoherent: references to prior reasoning steps that no longer exist in the window, partial tool call outputs, truncated code blocks. The model doesn't know the context was cut. It attempts to reason from a corrupted input.
This is why the cliff is steeper than the gradual attention dilution curve. Attention dilution is a continuous degradation. Forced truncation introduces structural corruption at the point where physical memory pressure begins. The transition is sudden.
Four recovery strategies
There are four common approaches to managing context fill before it causes degradation. They differ in their quality preservation and continuity properties.
Full context: no management. Accept degradation as fill increases. Maximum continuity until the cliff.
Sliding window: keep the most recent N tokens. Drop the oldest. Quality is preserved because recent context is always fresh. Continuity suffers because early context (original instructions, initial state) is eventually dropped.
Periodic compression: at a defined fill threshold (typically 70-75%), summarise the oldest portion of the context into a compact representation. Inject the summary. Continue. Quality preservation is high. Continuity is good. The summary loses detail but retains structure.
Full reset: at the threshold, clear the context entirely and restart with a fresh system prompt. Quality resets to baseline. Continuity is zero.
Implementing periodic compression
The compression trigger is straightforward: monitor context length as a percentage of the model's maximum, and compress when it crosses 70%. The compression itself is the harder part. A naive summary loses structural information that the agent needs to continue coherently.
class ContextManager:
"""Manages context compression for long-running agent sessions."""
def __init__(self, model_ctx_limit: int, compress_at: float = 0.70) -> None:
self.limit = model_ctx_limit
self.compress_at = compress_at
def needs_compression(self, current_tokens: int) -> bool:
"""True if context fill exceeds the compression threshold."""
return current_tokens / self.limit >= self.compress_at
def compress(
self,
messages: list[dict],
system_prompt: str,
summariser_fn: callable
) -> list[dict]:
"""
Summarise the oldest half of the message history.
Always preserve: system prompt, last 5 exchanges, any tool results
that are referenced in recent messages.
"""
if len(messages) <= 10:
return messages
preserve_recent = messages[-10:]
compress_range = messages[:-10]
# Never compress system prompt position 0
summary_text = summariser_fn(compress_range)
summary_msg = {
"role": "system",
"content": (
f"[Context summary: the following occurred before this point]\n"
f"{summary_text}\n"
f"[End of context summary. Continuing session below.]"
)
}
return [{"role": "system", "content": system_prompt}, summary_msg] + preserve_recent
The important constraint in the compression function: the system prompt always stays at position 0, separate from the summary. Collapsing the system prompt into the summary is the most common error in context management implementations. The system prompt needs to occupy the initial attention sink positions, not be buried inside a compressed block at position 1.
This is the direct application of StreamingLLM's attention sink finding: the model allocates high initial attention to the first few tokens. If those tokens are your system instructions, quality is preserved. If those tokens are a compression summary that buries the instructions, quality degrades the same way it would with a full context.
Sliding window implementation
Sliding window is simpler to implement and produces better quality scores at high fill than doing nothing, but it sacrifices continuity. The right use case is stateless or near-stateless tasks where each step is largely independent of steps more than a few exchanges back.
def sliding_window(
messages: list[dict],
system_prompt: str,
max_tokens: int,
token_counter: callable
) -> list[dict]:
"""
Keep the most recent messages that fit within max_tokens.
Always preserve the system prompt at position 0.
"""
base = [{"role": "system", "content": system_prompt}]
base_tokens = token_counter(base)
budget = max_tokens - base_tokens
kept = []
accumulated = 0
for msg in reversed(messages):
t = token_counter([msg])
if accumulated + t > budget:
break
kept.insert(0, msg)
accumulated += t
return base + kept
The quality difference between sliding window (72) and periodic compression (76) is modest. The continuity difference (55 vs 78) is substantial. On tasks where an agent needs to refer back to decisions made early in the session, sliding window will lose that context and produce inconsistent outputs. Periodic compression retains a summary of early decisions. The summary is lossy; it is not absent.
Hardware constraints: what 8GB actually means
On 8GB unified memory (M1/M2 MacBook, budget NUC), the usable RAM for model weights and context is approximately 6GB after OS overhead. A quantised 7B model at Q4_K_M uses about 4GB of that. That leaves 2GB for context. At approximately 2 bytes per token in KV cache, 2GB supports roughly 1 million tokens of KV cache in theory. In practice, the runtime overhead, buffer allocations, and fragmentation reduce this to something closer to 200-400k tokens before performance degrades.
For models with a 4k context limit (common for older quantised models and some fine-tunes), this is not a concern: 4,000 tokens at 2 bytes is 8KB of KV cache. The constraint is the model's architectural limit, not the hardware.
For 32k context models, the KV cache requirement is 64MB at 2 bytes per token. Still comfortable on 8GB. For 128k context models, 256MB of KV cache is needed. Add the model weights (4GB) and the OS and runtime overhead (2GB), and you are at the edge of what 8GB can hold without swapping.
The cliff edge in the 4k data above represents the hardware limit, not the model's architectural limit. The model's attention mechanism would continue graceful degradation past 80% fill. The hardware forces a hard cut at the physical memory boundary.
If the articles or tools have been useful, a coffee helps keep things running.
☕ buy me a coffeeReferences: Xiao et al. (2023), "Efficient Streaming Language Models with Attention Sinks" (StreamingLLM). Bai et al. (2023), LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. Attention sink behaviour and context scaling analysis from these papers informs the degradation model described here.
ticketyboo brings governed AI development to your pull request workflow. 5 governance runs free, one-time welcome grant. No card required.
View pricing Start free →