// Working with AI · A field guide

Why your AI gets worse the longer you talk to it.

A long chat quietly degrades, even inside the model's stated limit. The cause has a name, context rot, and a handful of plain habits fix most of it without any new tools.

Cameron Carmody · Sydney · June 2026 · ~6 min read · field guide

// at a glance

The problem

Long AI chats get less reliable as they grow, even well inside the model's token limit.

Why

Models don't read all their context equally. The middle gets skimmed, and clutter competes for attention.

The fix

Fresh chats, tight scope, restate the goal, and put the key instruction first or last.

The tooling

Dynamic Workflows automates the discipline: it delegates to fresh subagents and keeps only the answer.

You open a fresh chat and the first answer is sharp. You keep going, refining, pasting in more detail, asking follow-ups. By message twenty the thing has quietly become muddled. It forgets a constraint you set ten minutes ago. It contradicts something it told you on the same screen. It hands back a confident answer that's subtly wrong, and you can't say when the wheels came off.

It isn't you, and it isn't a dud model. It's a known failure mode, it has a name, and once you can see it you can work around most of it the same afternoon: no new software, no plugins, no prompt-engineering course.

01 · Context rot

The name for it.

The failure mode is called context rot: an AI's answers get less reliable as the conversation gets longer, even when you stay comfortably inside its stated limit. (A token, the unit these limits are counted in, is roughly a chunk of a word.) The research lab Chroma put hard numbers on it in July 2025, in a report called "Context Rot" that tested 18 leading AI models from Anthropic, OpenAI, Google and Alibaba, including the frontier ones at the time. Every single family degraded as the input grew. As Chroma put it, models are "typically presumed to process context uniformly," but "in practice, this assumption does not hold." Their conclusion is blunter: performance "grows increasingly unreliable as input length grows," and it shows up even on trivial tasks.

This is how today's models work. Switching to another product won't help; every leading family does it. The long chat doesn't fail because the model is bad. It fails because it's long.

02 · The mechanism

Why a long chat gets muddled.

Two mechanisms drive it.

The first is "lost in the middle." Models pay most attention to the start and the end of what you give them, and skim the bit in between. Researchers mapped this in 2023 and found a U-shaped accuracy curve: performance is highest when the relevant information sits at the beginning or end of the input, and "significantly degrades when models must access relevant information in the middle." On one test, accuracy dropped more than 20 percentage points purely because the answer was buried in the middle rather than at an edge. The single sentence that matters, placed in the centre of a long thread, is the one most likely to be ignored.

Figure 1 Lost in the middle. Models read the edges and skim the centre, so the one sentence that matters is most likely to be ignored when it sits mid-thread. The numbers illustrate the 2023 finding: a U-curve, highest at the edges, with a drop of over 20 points when the fact is buried in the middle.

The second is the gap between what you can pile on the desk and what the model can actually read. The advertised context window, the 200,000 tokens or the million, is the size of the pile you can stack on the desk. What the model keeps in front of it and reasons about is much smaller. NVIDIA's RULER benchmark found that a model scoring 96.6 out of 100 at 4,000 tokens fell to 81.2 by 128,000; the advertised window runs far past the length at which accuracy actually holds. A 2025 study from Adobe's research team, called NoLiMa, put the gap starkly: for many models the effective memory, the length at which they still work reliably, sits around 2,000 tokens, even when the advertised window is two million. You don't keep re-considering what you had for breakfast in 2019; the model has no such filter, so every token you've pasted in competes for the same finite attention. Anthropic's own engineers describe context as "a finite resource with diminishing marginal returns," an attention budget where "every new token introduced depletes this budget."
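The two gaps quoted above are easy to check by hand. A minimal sketch of the arithmetic in Python, using only the figures already in this section; the variable names are mine, not the benchmarks'.

```python
import math

# RULER figures quoted above: score out of 100 at two input lengths.
score_at_4k = 96.6
score_at_128k = 81.2
ruler_drop = round(score_at_4k - score_at_128k, 1)  # points lost

# NoLiMa figures quoted above: effective vs advertised length, in tokens.
effective_tokens = 2_000
advertised_tokens = 2_000_000
gap_ratio = advertised_tokens // effective_tokens
orders_of_magnitude = round(math.log10(gap_ratio))

print(f"RULER drop: {ruler_drop} points")
print(f"Advertised window is {gap_ratio}x the effective length "
      f"({orders_of_magnitude} orders of magnitude)")
```

That is a 15.4-point drop on RULER, and an advertised window a thousand times larger than the length at which the model still works reliably.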

Figure 2 The pile versus the desktop. The advertised window is how much you can stack on the desk; effective memory is the much smaller amount the model actually reasons over, around 2,000 tokens in the NoLiMa benchmark (2025). The x-axis is log scale, so the three-order-of-magnitude gap reads at a glance.

That's the reframe that changes how you work.

The context window is not a memory that helps you. It's a budget you spend.

03 · The trap

Why pasting in more backfires.

The obvious defence is to give the model more background: the full history, every relevant document, and trust the big window to sort it out. The evidence says it backfires.

The big window does work for the easy case. Drop one distinctive fact into a million tokens of filler and ask for it back, and a capable model will retrieve it with well over 99 per cent recall. But that's a keyword match, not reasoning, and the realistic case is harder. Chroma found that even a single distractor, text that looks relevant but isn't the answer, measurably hurt performance against a clean baseline. Add more and it compounds. Stranger still, models did better on a shuffled, incoherent pile of text than on the same facts arranged in a tidy, logical order. The reason is debated; the practical lesson holds. The volume of text is only half of it. The other half is how much of that text competes for the model's attention.

The cleanest demonstration came from a memory test where models were given the same questions with two versions of the input: the full chat history of about 113,000 tokens, or a focused version of roughly 300 tokens holding only the relevant bit. Every model family scored significantly higher on the 300-token version. A tight prompt beat the full history across the board. Restating the one thing that matters beats making the model dig for it.

Figure 3 Restating the one thing that matters beats making the model dig for it. A ~300-token prompt outscored the ~113,000-token full history across every family tested. Scores are representative; the direction held for all families.

04 · The habits

The playbook.

None of this requires new tools. It requires a few habits.

Start fresh chats far more often than feels natural. The product coach Teresa Torres offers three plain triggers: start a new chat when you change topic, after a bad answer you want to retry, and once a conversation runs long. She puts a number on it: more than fifteen messages. If the thread holds something worth keeping, ask the model to summarise the conversation, then paste that summary into a clean chat. You carry the conclusion across and leave the clutter behind.
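If it helps to see the rule written down, the three triggers fit in a few lines of Python. This is a sketch of the habit, not a feature of any chat product: `ChatState`, `should_start_fresh` and `handoff_prompt` are names invented here, and the threshold of fifteen is the rule of thumb above, not a measured limit.

```python
from dataclasses import dataclass

# Threshold from the rule of thumb above; tune it to taste.
MAX_MESSAGES = 15


@dataclass
class ChatState:
    message_count: int
    topic_changed: bool = False
    retrying_bad_answer: bool = False


def should_start_fresh(state: ChatState) -> bool:
    """True when any of the three triggers fires."""
    return (
        state.topic_changed
        or state.retrying_bad_answer
        or state.message_count > MAX_MESSAGES
    )


def handoff_prompt(summary: str, goal: str) -> str:
    """Opening message for the clean chat: goal first, then the summary."""
    return f"Goal: {goal}\n\nSummary of earlier work:\n{summary}"
```

A sixteen-message thread trips the rule; a fifteen-message one on the same topic does not. The handoff message carries the conclusion across and nothing else.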

Restate the goal in the message that matters. Don't rely on the model remembering the brief you set fifteen exchanges ago; it's the worst-placed, most-diluted instruction in the whole thread. Say what you want again, now, with only the context this step needs.

The key instruction belongs first or last. Lead with it or close on it, never bury it mid-message. The middle is where instructions go to be forgotten.
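For anyone assembling prompts in code, the same habit looks roughly like this in Python. `build_message` is a hypothetical helper, and repeating the instruction at both edges is my assumption about a safe default; the advice above only asks for one or the other.

```python
def build_message(instruction: str, context: str, repeat_at_end: bool = True) -> str:
    """Put the key instruction at the edges, with supporting context in between."""
    instruction = instruction.strip()
    parts = [instruction, context.strip()]
    if repeat_at_end:
        # Close on the instruction as well, so it never sits mid-message.
        parts.append(f"Reminder: {instruction}")
    return "\n\n".join(parts)


message = build_message(
    "Answer in under 100 words.",
    "Background notes the model needs for this step go here.",
)
print(message)
```

The background sits in the middle, where a skim does the least damage.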

Keep durable context in a document or a Project the model loads on demand, rather than re-pasting it into an ever-growing chat. Re-pasting grows the very thing that's hurting you. A loaded document keeps the source of truth outside the conversation, where it isn't compounding.

One catch: in the browser versions of ChatGPT and Claude you get no gauge showing how full the window is. You're flying blind. That's exactly why a crude message-count rule, fifteen then start again, is the most useful lever you've got.

05 · The tooling

The engineered version of the same discipline.

Everything above is you doing context management by hand. On 28 May 2026, Anthropic shipped the version that does it for you. Alongside Claude Opus 4.8 came Dynamic Workflows, a research preview in Claude Code.

Rather than stuffing a long job into one ballooning context, Claude writes a small program on the fly, a JavaScript orchestration script, that splits the work and fans it out to parallel subagents: separate helper sessions, each with its own clean context window. The script, ordinary code, holds the busywork: the repeated steps, the branching choices, the half-finished results. Claude's own session holds only the final answer. The heavy reading happens in disposable sub-contexts that get thrown away; the main session stays lean. As Anthropic put it, "intermediate results stay in script variables instead of landing in Claude's context."

That boundary is the whole trick. The code holds the mess: the loop, the half-finished results, the ten documents nobody needs to see again. The model's context holds only what it needs to reason about right now. It's the same instinct as starting a fresh chat and pasting in a 300-token summary, except the program runs the discipline for you, every step, without forgetting. Anthropic showed an earlier multi-agent version of this pattern beat the equivalent single agent by 90.2 per cent on its own internal research eval. Each subagent acts as an "intelligent filter," reading widely, then condensing only what mattered for the agent above it.
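The boundary can be sketched in a few lines. This is not Anthropic's API, and it's written in Python rather than the JavaScript the preview generates: `run_subagent` and `orchestrate` are hypothetical names, and the subagent is a stub that returns a document's first sentence so the sketch runs with no model behind it.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(document: str, question: str) -> str:
    """Hypothetical stand-in for a fresh model session with a clean context.

    A real version would call a model; this stub just returns the first
    sentence so the sketch runs without any API.
    """
    first_sentence = document.split(".")[0].strip()
    return f"{first_sentence}."


def orchestrate(documents: list[str], question: str) -> str:
    # Fan the reading out: one disposable helper per document.
    with ThreadPoolExecutor(max_workers=4) as pool:
        findings = list(pool.map(lambda doc: run_subagent(doc, question), documents))
    # The raw documents and per-document findings live only in these script
    # variables; the caller sees just the condensed result.
    return " ".join(findings)


if __name__ == "__main__":
    docs = [
        "Invoices are approved by the finance lead. Further detail follows.",
        "Refunds over $500 need a second sign-off. More policy text here.",
    ]
    print(orchestrate(docs, "Who approves what?"))
```

Everything each helper read stays inside `orchestrate`. The caller gets one short string back, which is the only thing that would land in the main session's context.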

Figure 4 Dynamic Workflows: the heavy reading happens in three disposable sub-contexts that get thrown away, each handing back a filtered result rather than raw context, so the main session keeps only the final answer. Anthropic measured an earlier multi-agent version of this pattern at 90.2 per cent better than a single agent on its internal research eval.

06 · The catch

The big gotcha.

Delegating across fresh subagents costs roughly fifteen times the tokens of a normal chat, against about four times for a single agent, and you pay per token. Clean, separate contexts also cut both ways. Walden Yan, an engineer at Cognition, argued in "Don't Build Multi-Agents" that splitting tightly coupled work across subagents loses the implicit decisions each one made: his example is one subagent building a Mario-style background while another builds a bird that doesn't match it. Each had a clean context. Neither knew what the other chose. Delegation is for genuinely hard, genuinely parallel work where the value justifies the spend. It is not the default for everything.
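To put the multipliers in money terms, here is a rough Python sketch. The 4x and 15x figures are the ones quoted above; the job size and the price per million tokens are made up for illustration, and `estimated_cost` is a hypothetical helper, not a billing formula from any provider.

```python
# Multipliers quoted above, relative to a normal chat.
SINGLE_AGENT_MULTIPLIER = 4
MULTI_AGENT_MULTIPLIER = 15


def estimated_cost(chat_tokens: int, multiplier: int, price_per_million: float) -> float:
    """Rough spend for a job that would take `chat_tokens` as a plain chat."""
    return chat_tokens * multiplier / 1_000_000 * price_per_million


# Hypothetical numbers, for illustration only.
chat_tokens = 20_000
price = 10.0  # dollars per million tokens, assumed

chat = estimated_cost(chat_tokens, 1, price)
single = estimated_cost(chat_tokens, SINGLE_AGENT_MULTIPLIER, price)
multi = estimated_cost(chat_tokens, MULTI_AGENT_MULTIPLIER, price)
print(f"chat ${chat:.2f}, single agent ${single:.2f}, multi-agent ${multi:.2f}")
```

Under those assumptions the same job costs about 20 cents as a chat, 80 cents with a single agent and three dollars fanned out across subagents. The ratios are what matter; your own prices will differ.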

Closing

Brief it like a contractor.

It helps to think of it like hiring a contractor. You brief a contractor on the context and the goal for the job in front of them, and you don't assume they've read everything that came before. Hire a second one and you brief them from scratch, because they weren't there for the first one's work. Treat every AI task the same way. Whether you do it by hand with fresh chats and tight prompts, or let Dynamic Workflows do it for you, the rule holds: give each job a clean, complete brief, the context and the goal, and never assume the last conversation carried over.

Sources

Chroma 2025 · "Context Rot: How Increasing Input Tokens Impacts LLM Performance." July 2025. trychroma.com/research/context-rot
Lost in the Middle · Liu et al. "Lost in the Middle: How Language Models Use Long Contexts." 2023 (TACL 2024). arXiv:2307.03172
RULER · Hsieh et al., NVIDIA. "RULER: What's the Real Context Size of Your Long-Context Language Models?" 2024. arXiv:2404.06654
NoLiMa · Modarressi et al., Adobe Research. "NoLiMa: Long-Context Evaluation Beyond Literal Matching." Feb 2025. arXiv:2502.05167
Anthropic · "Effective context engineering for AI agents." 29 September 2025.
Anthropic · "How we built our multi-agent research system." 13 June 2025.
Anthropic · "Claude Opus 4.8 and Dynamic Workflows." 28 May 2026.
Product Talk · Teresa Torres. "Context Rot: Why AI Gets Worse the Longer You Chat." producttalk.org/context-rot
Cognition · Walden Yan. "Don't Build Multi-Agents." June 2025.

Cameron Carmody builds AI-assisted internal tools and workflow automation for small firms in Sydney: document intake, client portals, reporting and approval flows where humans stay in control. For consulting enquiries, reach out via the contact page.