Recursive Language Models: Why Bigger Context Isn't the Fix
MIT's Recursive Language Models load the prompt into a Python REPL the model queries with code. RLM(GPT-5-mini) beats GPT-5 on long-context benchmarks at similar cost. The argument: the context-window arms race has been solving the wrong problem.
I've been losing arguments with Claude Code, Anthropic's coding agent, at hour four for months. Not because the model gets a question wrong. Because somewhere around the 200,000-token mark of an honest debugging session, it starts contradicting decisions it made an hour earlier and forgetting what file it's in. The model didn't crash. The window didn't overflow. It just got noticeably worse at being itself. That failure mode has a name now, and the name is context rot.[1]
Three researchers at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) just published the cleanest argument I've seen for what to do about it. Alex Zhang, Tim Kraska, and Omar Khattab put a paper on arXiv on December 31, 2025, titled Recursive Language Models.[2] The pitch is small to state and big to live with. Stop feeding long prompts into the model's context window. Load the prompt into a Python REPL as a variable, and let the model write code to peek at it, search it with pattern matching, chunk it, and call itself recursively on pieces. The benchmark numbers say a small cheap model wrapped in this harness beats a frontier model raw, at roughly the same cost. The non-obvious claim, and the reason this matters past the benchmarks, is that the long-context arms race has been solving the wrong problem.
Three Splits, Not One Paradigm
Every meaningful jump in large language model (LLM) capability over the last four years wasn't a smarter model. It was someone noticing the model had been asked to do two jobs at once, and splitting them apart.
Chain of Thought, the Google paper from January 2022, split the answer from the reasoning that produced it. Before CoT, you asked the model a question and it returned an answer. After, you asked and it produced intermediate tokens first, then the answer. Same model, same forward pass, but the model was no longer trying to do arithmetic and write a sentence in the same breath.[3] ReAct, Princeton plus Google, October 2022, did the next split. It separated the model's internal reasoning from external action. Thought tokens interleave with tool calls; the world's response flows back into context. The model isn't pretending to know what an application programming interface (API) returned. It calls the API, then reasons over the result.[4]
Recursive Language Models do the third version of this move. They split context storage from context reasoning. The model that reasons about your answer is small and focused. The 500,000-token prompt lives somewhere else, in a Python read-eval-print loop (REPL), the interactive shell programmers use to run code one snippet at a time, which the model can query. That's a division of labor at the prompting layer, applied to the input itself.
Inference-time scaling, by what got split
Each jump was a division of labor, not a capacity bump
- Jan 2022
Chain of Thought (Wei et al., Google)
Splits the answer from the reasoning. Same forward pass, but the model writes its work first.[3]
- Oct 2022
ReAct (Yao et al., Princeton + Google)
Splits internal reasoning from the external world. Tool calls interleave with thought tokens.[4]
- Sep 2024
o1-style reasoning models (OpenAI)
Orthogonal axis. Scales compute per token via reinforcement learning (RL) on long CoT traces. Industry converged on this in 2025 (DeepSeek R1, Claude extended thinking, Gemini 2.5).[5] Doesn't touch the context-storage problem.
- Dec 2025
Recursive Language Models (Zhang, Kraska, Khattab, MIT CSAIL)
Splits context storage from context reasoning. Prompt becomes a variable the model queries with code instead of input the model attends over.[2]
Takeaway
o1-style reasoning is the dominant 2025 paradigm by adoption, but it scales compute per token. RLMs scale how the model accesses its own context. Different axes, both real.
I want to be careful here, because calling RLMs “the third paradigm” would be overreach. CoT and ReAct are foundational works with five-figure citation counts. RLMs is a December 2025 paper with 3,500 GitHub stars.[6] But the structural move is the right shape, and it has an even older precedent. In 2014 and 2016, Alex Graves and his team at DeepMind built Neural Turing Machines and Differentiable Neural Computers, neural controllers explicitly separated from an external addressable memory the controller learned to read and write.[7] Same instinct. Reasoner distinct from memory substrate. The architectural version didn't stick. Retrieval-augmented generation (RAG), the approach of pre-indexing documents and looking up the relevant chunks at query time, ate its lunch by being simpler. RLMs are the DNC idea reborn at the prompting layer, on top of frozen LLMs, and the reason to care is that the prompting-layer version finally produces numbers that make people pay attention.
What an RLM Actually Does
Strip the framing and the mechanism is simple. The user's long prompt never enters the root model's context window directly. It enters a Python notebook environment as a string variable. The root model gets the user's query plus a short message that amounts to: “there's a variable called prompt in the REPL, here are its first 500 characters and its length, write code to interact with it, wrap your final answer in FINAL().”[2] Then it runs in a loop. The model writes Python. The REPL executes it. The output (truncated if huge) goes back to the model. Repeat until the model emits a FINAL.
Zhang's blog catalogues the strategies the model develops on its own, with no fine-tuning required.[8] They look like what a competent engineer would do at a Python prompt.
The RLM inference loop
The prompt is a variable in the environment, not input to the model.
What goes in
- User query: a few hundred tokens, fits anywhere
- prompt = "...": loaded as a REPL variable, can be millions of tokens
- Metadata only: length + first 500 chars given to the root LM
Root LM + Python REPL
Loop: write code → execute → see truncated output → repeat
What the model can do
- Peek: print(prompt[:2000])
- Grep: re.findall(r"...", prompt)
- Chunk + map: split, then sub-call self on each
- FINAL(answer): terminate with the result
The model decides how to inspect its own input. Nothing about the access pattern is hand-coded by the framework.
Here's what an iteration looks like in practice. This is illustrative, not the literal API of the official alexzhang13/rlm package, but it captures the loop.
# The root LM sees only the query + a metadata stub.
# The 500,000-token prompt lives in the REPL as: prompt = "..."
# Turn 1: peek at structure
code = "print(prompt[:2000])"
# Turn 2: grep for what looks promising
code = """
import re
matches = re.findall(r'.*revenue.*', prompt)
print(matches[:20])
"""
# Turn 3: chunk and recursively sub-call self on each piece
code = """
chunks = [prompt[i:i+50000] for i in range(0, len(prompt), 50000)]
summaries = [llm_call(f'Summarize for revenue figures: {c}') for c in chunks]
result = '\n'.join(summaries)
"""
# Turn 4: produce the final answer
code = "FINAL('Q4 revenue was $42B, up 16% YoY.')"
Takeaway
The model isn't handed a retrieval index, a chunk strategy, or a decomposition tree. It writes them. The framework's entire job is exposing the prompt as a variable and running the loop.
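To make the loop concrete, here is a minimal, runnable sketch of a depth-1 driver. The `llm` callable, the `FINAL` sentinel, and the transcript format are my own illustrative assumptions, not the API of the official alexzhang13/rlm package.

```python
# Minimal depth-1 RLM loop sketch (illustrative, not the official API).
import io
import contextlib

def run_rlm(query, long_prompt, llm, max_turns=10, output_cap=2000):
    """Drive the loop: model sees metadata, writes code, sees truncated output."""
    env = {"prompt": long_prompt}  # the long prompt is a variable, not context
    final = {}
    env["FINAL"] = lambda ans: final.setdefault("answer", ans)

    transcript = (
        f"Query: {query}\n"
        f"A variable `prompt` (len={len(long_prompt)}) is in the REPL. "
        f"First 500 chars: {long_prompt[:500]!r}\n"
        "Write Python to inspect it; call FINAL(answer) when done.\n"
    )
    for _ in range(max_turns):
        code = llm(transcript)              # model emits a code snippet
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(code, env)                 # the REPL executes it
        if "answer" in final:
            return final["answer"]
        # Truncated output flows back into the root model's transcript.
        transcript += f"\n>>> {code}\n{buf.getvalue()[:output_cap]}"
    return None
```

Swapping `llm` for a real chat-completion call turns this into the loop the paper describes; the only state the root model ever attends over is the query, the metadata stub, and its own code-plus-output history.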
The Numbers
The headline results are where this gets interesting. Across four long-context benchmarks the paper reports, the gap between a frontier model with the prompt stuffed into its window, and the same model wrapped in an RLM, is large. Not 5%. Not 15%. Whole-table large.
BrowseComp-Plus, an OpenAI benchmark from April 2025 that tests persistent multi-hop document browsing (following links across many pages to piece together an answer), has GPT-5 alone scoring 0% on the contexts the paper evaluates.[9] Wrapped in an RLM, the same model scores 91.3%. OOLONG, a benchmark out of Carnegie Mellon University from November 2025, is designed to defeat retrieval by forcing the model to read every chunk and combine findings across them. All three frontier 1M-token models (GPT-5, Claude Sonnet 4, Gemini 2.5 Pro) score under 50% on it at 128K tokens.[10] The most striking number in the paper comes from OOLONG-Pairs, a 20-task set the authors built themselves: GPT-5 alone gets an F1 score of 0.1 (F1 is a standard accuracy metric; higher is better, 100 is perfect). The same GPT-5 wrapped in an RLM gets 58.0.
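For readers who want the metric pinned down: F1 is the harmonic mean of precision and recall. A toy computation on a 0-to-100 scale, matching the 0.1 and 58.0 figures above (my own illustration, not the paper's scoring code):

```python
def f1(tp, fp, fn):
    """F1 on a 0-100 scale from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall, scaled to 0-100.
    return 100 * 2 * precision * recall / (precision + recall)

# A model that finds 7 of 10 true pairs while emitting 3 false positives:
# precision = 0.7, recall = 0.7, F1 = 70.0.
```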
The cost story matters more than the accuracy story, because cost is what lets a small model in an RLM wrapper compete with a frontier model raw. On the OOLONG benchmark at 132K tokens, Zhang reports RLM(GPT-5-mini) outperforming GPT-5 by over 34 points (about a 114% relative gain) at roughly the same total API cost per query.[8] That is the kind of number that gets investors to ask which layer of the stack is going to capture the value created.
“If tomorrow the best frontier LM can reasonably handle 10M tokens of context, then an RLM can reasonably handle 100M tokens of context, maybe at half the cost too.”
What ‘Recursive’ Is Hiding
The paper's title is doing rhetorical work the math doesn't fully support, and the practitioner reaction picked up on it immediately. The Hacker News thread on the paper from January 2026 had the sharpest comments tagging the obvious tension. Legend2440, the top reply: “Isn't this just subagents? You call another LLM to go read a file and extract some piece of information, so that you don't clutter up the main context with the whole file. Neat idea, but not a new idea.”[11] Another commenter, seeknotfind, looked closer: “I can't find any evidence that more is done than calling the model repeatedly.”[11]
The fair version of that critique is this. The paper's default configuration uses a maximum recursion depth of one. The authors say so plainly in the limitations section.[2] Sub-calls are a single layer deep. The thing called Recursive Language Models, in every experiment in the paper, recurses exactly once. Daren Wang, an independent researcher, took this seriously enough to publish a reproduction paper in March 2026. He ran the framework with DeepSeek v3.2 and Kimi K2 at depth 1 and depth 2. The depth-2 results are unkind.[12]
“Deeper recursion causes models to overthink. Applying deeper recursion or using RLMs on simple retrieval tasks paradoxically degrades performance and exponentially inflates execution time, e.g., from 3.6s to 344.5s.”
It gets more honest. The paper's own ablation table (the table that shows what happens when you strip out individual pieces of the method) tells on itself. For one of the two frontier models tested (Qwen3-Coder-480B, Alibaba's open-weight coding model), the version of the framework that disables recursive sub-calls entirely beats the full RLM on two of the four benchmarks. CodeQA: 66.0% with no sub-calls, 56.0% with them. BrowseComp+: 46.0% versus 44.7%. The paper acknowledges this: “this ablation is able to outperform the RLM by 17.9% and 3% respectively.”[2] The thing in the title isn't always the thing carrying the result.
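The quoted 17.9% and 3% are relative gains of the no-sub-call ablation over the full RLM, and they reproduce directly from the table numbers:

```python
def relative_gain(ablation, full):
    """Relative improvement of the ablation over the full RLM, in percent."""
    return 100 * (ablation - full) / full

# CodeQA: 66.0 (no sub-calls) vs 56.0 (full RLM) -> about 17.9%
codeqa_gain = relative_gain(66.0, 56.0)
# BrowseComp+: 46.0 vs 44.7 -> about 2.9%, the paper's "3%"
browsecomp_gain = relative_gain(46.0, 44.7)
```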
A jump from 0.1 to 58 on a 20-task benchmark of your own construction is not the same as a jump on an established external benchmark, and the paper would be stronger if it framed it that way. The Qwen3-Coder version of the framework also needs an extra line in its system prompt warning it not to make too many sub-calls, because without that warning it “will try to perform a subcall on everything, leading to thousands of LM subcalls for basic tasks.”[2] The cost-parity claim assumes blocking sequential sub-calls; an asynchronous implementation that the authors describe but didn't build would change the cost story.
Side note
“Just subagents” is a fair description of the working configuration. What the paper adds isn't recursion at depth ten. It's two specific moves that earlier subagent work didn't combine. First, the prompt itself lives in the environment, addressable as a string the model can slice and regex over. Prior work in this lineage (Context Folding, AgentFold, MemWalker) folded agent trajectory history or pre-summarized the document offline. The RLM keeps the prompt as a live string variable in the REPL, which is a structural difference, not just a different prompt template.[2] Second, the authors post-trained an 8-billion-parameter model (RLM-Qwen3-8B) on a thousand filtered recordings of itself doing this, and got a 28.3% average lift over the base Qwen3-8B.[2] That says how a model interacts with its own context is learnable, which is the part that should make a serious researcher pay attention.
What Actually Changes
Set aside the naming fight. If you assume the working configuration (call it depth-1 RLMs, or subagents-with-a-REPL, the label doesn't matter) is the real artifact, four things downstream of it change. None of them are speculative; the precursors are already shipping.
Long-horizon agents become viable. The reason Claude Code or Cursor sessions get worse over hours isn't that the underlying model got dumber. It's that the same model is now both reasoning about your bug and holding 200,000 tokens of conversation history in working memory. Splitting those jobs is exactly what RLMs propose. Expect the next generation of coding agents to look like a small focused root model writing code against a REPL that holds the codebase, the chat history, and the file diffs as variables. Not as context.
Cheap models get to compete. A small model in an RLM wrapper beating a frontier model raw, at the same dollar cost, breaks the assumption that frontier capability requires frontier inference cost. If that result holds in production (and Wang's reproduction is a real warning that it doesn't hold uniformly), then the right architecture is often a cheap open-weight root like Qwen3 or DeepSeek with selective frontier sub-calls on the hard chunks. The big-lab moats narrow. So does Nvidia's per-query revenue, which is its own conversation.
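The cheap-root-with-selective-frontier-sub-calls architecture can be sketched in a few lines. Everything here is hypothetical: `cheap_llm` and `frontier_llm` stand in for any two model endpoints, and the yes/no triage prompt is a toy heuristic, not anything from the paper.

```python
def answer_with_tiered_subcalls(query, prompt, cheap_llm, frontier_llm,
                                chunk_size=50_000):
    """Sketch: a cheap root model triages chunks; only chunks it flags as
    relevant get routed to the (expensive) frontier model."""
    chunks = [prompt[i:i + chunk_size] for i in range(0, len(prompt), chunk_size)]
    notes = []
    for chunk in chunks:
        triage = cheap_llm(f"Is this chunk relevant to: {query}? Chunk: {chunk}")
        if triage.strip().lower().startswith("yes"):
            # Frontier tokens are spent only on the hard/relevant chunks.
            notes.append(frontier_llm(f"Answer {query} from: {chunk}"))
    # The cheap model synthesizes the per-chunk notes into one answer.
    return cheap_llm(f"Combine into one answer to {query}: " + " | ".join(notes))
```

The economics follow directly: frontier spend scales with the number of relevant chunks, not with total prompt length.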
Post-training shifts target. RLM-Qwen3-8B is the proof of concept, not the punchline. The punchline is that you can train a model with reinforcement learning to be specifically good at calling itself. The next round of small open-weight models will almost certainly include native RLM training as a step, and Hugging Face (the public repository where most open-weight models get released) will have a hub category for it by Q3 2026.
The context-window race becomes less central. Anthropic's 1M-token Claude. Gemini 2.5 Pro's 1M tokens. GPT-5.4's ~1.05M with 2x billing above 272K.[14] All of them still fall below 50% on aggregation tasks at 128K, per the OOLONG numbers.[10] The right question stops being “whose window is bigger” and starts being “whose model is best at managing its own access pattern.” That question doesn't favor the labs with the most GPUs. It favors the labs whose models are best at writing a five-line Python script under pressure.
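The 2x-billing detail implies a simple piecewise input cost. A sketch of that structure, with a placeholder per-million rate (not OpenAI's actual price):

```python
def input_cost(tokens, rate_per_m=1.25, threshold=272_000, multiplier=2.0):
    """Piecewise input cost: tokens above the threshold bill at a multiplier.
    rate_per_m is a placeholder rate in dollars per million tokens."""
    base = min(tokens, threshold)
    overflow = max(tokens - threshold, 0)
    return (base + multiplier * overflow) * rate_per_m / 1_000_000
```

At these placeholder numbers, a 1M-token prompt costs over six times a 272K one, which is the structural reason an RLM that keeps the root model's context small changes the economics even before accuracy enters the picture.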
If I were building agent infrastructure today, I'd treat this paper as a real result on a specific class of problems (synthesis-heavy tasks where retrieval can't shortcut the work) and a brittle prototype on the rest. Wang's 344-second depth-2 latency is real. The Qwen3 ablation is real. But so is the 0%-to-91.3% on BrowseComp-Plus, and so is the underlying move: the prompt belongs in the environment, not in the network.
What the paper actually shows isn't recursion. It's that the trajectory of how a model interacts with its own input is a training target, like any other, and the training works. That's the bit most of the walkthroughs missed. The naming will get rewritten. The split won't.
Sources and further reading
- [1] Primary: Hong, Troynikov, Huber (Chroma Research), "Context Rot: How Context Degradation Affects LLM Performance". 2025. Coined the term "context rot" the RLM paper builds on. Documents the empirical observation that LLMs degrade as input grows even inside the advertised window.
- [2] Primary: Zhang, Kraska, Khattab (MIT CSAIL), "Recursive Language Models". arXiv 2512.24601, submitted December 31, 2025; revised January 28, 2026. The paper. All benchmark numbers, the algorithm, the FINAL/FINAL_VAR mechanism, the depth=1 limitation, the Qwen3 ablation, the OOLONG-Pairs construction.
- [3] Primary: Wei et al. (Google), "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models". arXiv 2201.11903, January 2022. The CoT paper. Splits intermediate reasoning from the final answer.
- [4] Primary: Yao et al. (Princeton + Google), "ReAct: Synergizing Reasoning and Acting in Language Models". arXiv 2210.03629, October 2022. The ReAct paper. Splits internal reasoning from external action.
- [5] Primary: OpenAI, "Learning to reason with LLMs (o1)". September 2024. The o1 model card and announcement. Established RL-trained long-CoT reasoning as the dominant 2025 inference-time scaling axis.
- [6] Primary: alexzhang13/rlm, official Recursive Language Models implementation. MIT-licensed, maintained by the paper authors. ~3,500 stars as of April 2026. Supports local, Docker, Modal, Prime Intellect, Daytona, and E2B sandboxes.
- [7] Primary: Graves, Wayne, Danihelka (DeepMind), "Neural Turing Machines". arXiv 1410.5401, October 2014. Earlier architectural attempt to separate a neural controller from external addressable memory. The 2016 Differentiable Neural Computer (Nature) extended this. Same instinct as RLMs, at the architectural rather than prompting layer.
- [8] Primary: Alex Zhang, "Recursive Language Models" (blog). October 2025. The blog post that introduced the idea before the formal paper. Source for the emergent strategy taxonomy (peek, grep, partition+map), the OOLONG cost-parity comparison, and Zhang's own framing.
- [9] Primary: Wei et al. (OpenAI), "BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents". arXiv 2504.12516, April 2025. The browsing-agent benchmark RLMs use. 1,266 questions requiring persistent multi-hop document retrieval.
- [10] Primary: Bertsch, Pratapa, Neubig, Gormley (CMU), "OOLONG: Aggregation over Long Contexts". arXiv 2511.02817, November 4, 2025. The benchmark designed to defeat retrieval by requiring atomic per-chunk analysis plus aggregation. GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro all score under 50% at 128K tokens on it.
- [11] Reporting: Hacker News, "Recursive Language Models" thread (paper submission). January 2026. 161 points, 23 comments. Submitter: schmuhblaster. Source for the "just subagents" critique (Legend2440) and the "calling the model repeatedly" objection (seeknotfind).
- [12] Primary: Daren Wang, "Think, But Don't Overthink: Reproducing Recursive Language Models". arXiv 2603.02615, March 3, 2026. Independent reproduction with DeepSeek v3.2 and Kimi K2. Source for the depth-2 latency blowup (3.6s to 344.5s) and the finding that depth-2 RLMs degrade accuracy on S-NIAH and OOLONG.
- [13] Reporting: Prime Intellect, "Recursive Language Models: the paradigm of 2026". January 2026. Friendly take that calls RLMs "the paradigm of 2026" and still concedes they "underperform on shorter, tool-light problems" and "increase execution time substantially."
- [14] Primary: OpenAI, GPT-5 / GPT-5.4 model documentation. GPT-5 ships with a 400K context window. GPT-5.4 extends to ~1.05M tokens with 2x billing above the 272K threshold. Used for the context-window race comparison.
Written by
Tech Talk News Editorial
Tech Talk News covers engineering, AI, and tech investing for people who build and invest in technology.