Exploring Recursive Language Models
I’ve been exploring Recursive Language Models (RLM), a new inference paradigm from MIT’s OASYS lab. The core idea is compelling: instead of forcing LLMs to process massive contexts in a single forward pass, give them a Python REPL and let them programmatically interact with the data.
The Problem RLM Solves
Even with 200K+ context windows, LLMs struggle with certain long-context tasks. The issue isn’t just attention - it’s that some tasks require computation over data, not just pattern matching.
Consider: “Count how many items in this 100K-token dataset have label ‘incorrect’.”
A baseline LLM response:
```
I need to classify each of the 10 literal interpretations as either
correct or incorrect.

1. "The n... [truncated - never reaches answer]
```
The model starts explaining what it would do instead of doing it.
The RLM Paradigm
RLM replaces llm.completion(prompt) with a REPL-based interaction loop:
- Context as Variable: The input is loaded as a Python variable the model can access
- Code Execution: Model writes Python to examine/process the data
- Recursive Calls: Model can invoke `lm(prompt)` for sub-queries on chunks
- Explicit Termination: `FINAL(answer)` signals completion
```python
from rlm import RLM

rlm = RLM(
    backend="anthropic",
    backend_kwargs={"model_name": "claude-sonnet-4-5-20250929"},
    environment="local",
    max_depth=1,
    max_iterations=30,
)

result = rlm.completion(context, root_prompt="Find the secret code")
print(result.response)  # Direct answer, not explanation
```
The `max_depth` parameter controls recursion. At depth 0, the model can spawn sub-LM calls. Those sub-calls (depth 1) become regular LLM calls without further recursion.
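To make the recursion concrete, here is a sketch of the kind of code a depth-0 model might write inside its REPL. The `context` variable, `lm()` helper, and `FINAL()` call are the conventions described above; the chunk size and prompt wording are my own illustrative choices, not RLM's defaults.

```python
# Illustrative depth-0 REPL code (chunk size and prompt wording are made up).
chunk_size = 20_000
chunks = [context[i:i + chunk_size] for i in range(0, len(context), chunk_size)]

# Each lm() call runs as a depth-1 sub-call; at max_depth=1 it cannot recurse further.
answers = [
    lm(f"Return the secret code if it appears in this text, otherwise say 'none':\n{chunk}")
    for chunk in chunks
]

secret = next((a for a in answers if a.strip().lower() != "none"), "not found")
FINAL(secret)
```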
Empirical Findings: OOLONG Benchmark
I ran experiments on the OOLONG benchmark (long-context understanding from HuggingFace) comparing RLM vs direct prompting with Claude Sonnet 4.5.
Results (10 tasks):
| Metric | Baseline | RLM |
|---|---|---|
| Real Accuracy | 0% | 100% |
| Avg Latency | 7.5s | 30s |
| Input Tokens | 7.5K | 162K |
| Output Tokens | 3.6K | 10.4K |
The baseline reported 70% accuracy, but every “correct” answer was a false positive - the model was explaining the task, not answering it, and evaluation was matching keywords in explanations.
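To see why keyword matching misfires, consider a toy check (not the actual OOLONG scorer): the baseline's explanation mentions the expected label, so naive substring matching counts it as correct.

```python
# Toy illustration of substring-based scoring (not the real evaluation harness).
expected = "incorrect"
baseline_output = (
    "I need to classify each of the 10 literal interpretations "
    "as either correct or incorrect..."
)

print(expected in baseline_output)  # True - scored as a hit even though no answer was given
```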
Every baseline response followed this pattern:
```
I need to classify each of the 10 literal interpretations...
[proceeds to explain methodology, never produces answer]
```
Every RLM response produced structured output:
```
Label: incorrect
Answer: 7
User: 44106
```
The REPL forces execution over explanation.
How RLM Trajectories Work
Looking at the execution logs, a typical RLM trajectory:
Iteration 1: Model examines context structure
```python
lines = context.split('\n')
print(f"Found {len(lines)} lines")
print(lines[0])  # Sample first line
```
Iteration 2-3: Model parses and classifies
```python
results = []
for line in lines:
    parts = line.split(' <--> ')
    # ... classification logic (elided in the logs) produces `label` ...
    results.append(label)
```
Iteration 4: Model aggregates and answers
```python
count = sum(1 for r in results if r == 'incorrect')
FINAL(f"Answer: {count}")
```
The model operates as a programmer, not an oracle.
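Putting the iterations together, one plausible end-to-end trajectory delegates the classification step to `lm()` sub-calls. This is a hypothetical reconstruction: the separator and aggregation mirror the logs above, but the classification prompt is an assumption.

```python
# Hypothetical consolidated trajectory; the classification step is an assumption.
lines = [line for line in context.split('\n') if ' <--> ' in line]

results = []
for line in lines:
    original, interpretation = line.split(' <--> ', 1)
    label = lm(
        f"Is this a correct literal interpretation of the sentence?\n"
        f"Sentence: {original}\nInterpretation: {interpretation}\n"
        f"Answer with exactly 'correct' or 'incorrect'."
    )
    results.append(label.strip().lower())

count = sum(1 for r in results if r == 'incorrect')
FINAL(f"Answer: {count}")
```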
Model Behavior Differences: Haiku vs Sonnet
Testing with Claude Haiku on the S-NIAH (needle-in-a-haystack) benchmark revealed a striking, if somewhat expected, pattern:
S-NIAH Results (10 tasks):
| Metric | Baseline | RLM |
|---|---|---|
| Accuracy | 100% | 40% |
| Avg Latency | 1.2s | 7.5s |
Wait - RLM made Haiku worse? Yes. The baseline Haiku directly answered simple retrieval questions. But RLM Haiku couldn’t follow the REPL format.
Expected RLM output:
```
162934
```
Actual Haiku RLM output:
```
# The Special Number
The special number is **162934**.

## Why It's Special
According to the text...
```
Haiku defaulted to its conversational training - producing markdown-formatted explanations with headers, bold text, and prose. It found the answer but wrapped it in unparseable formatting.
More examples from the same run:
```
# Hidden Message Analysis
## Key Finding
There **is** a hidden message explicitly stated in the te...
```

```
# Puzzle Identification and Answer
**Puzzle Mentioned:**
"The answer to the puzzle is: constellatio...
```
The model knows the answer but can’t resist explaining it. Every wrong answer contained the correct value buried in prose.
Sonnet reliably:
- Wrote valid, executable Python
- Used the `context` variable correctly
- Produced parseable structured outputs
- Followed the system prompt’s REPL conventions
RLM requires the model to operate as a “code agent” - understanding it’s in a programmatic environment, not a conversational one. Haiku’s stronger conversational instincts override the REPL context, making it paradoxically worse at simple tasks when given programmatic tools.
Technical Deep-Dive: Implementation Bugs
Exploring the codebase revealed several edge cases:
1. FINAL pattern in code blocks
The original regex matched `FINAL()` anywhere in the model's output:

```python
final_pattern = r"FINAL\((.*?)\)"  # unanchored - matches even inside comments and strings
```
This caused premature termination when models wrote code referencing the pattern:
```python
# Prepare for FINAL_VAR(result)
result = process_data()
```
Fix: anchor to line start:
```python
final_pattern = r"^\s*FINAL\((.*?)\)"
```
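A quick way to see the difference between the two patterns (this assumes the extractor scans the model's message with `re.MULTILINE`; the actual harness code may differ):

```python
import re

loose = re.compile(r"FINAL\((.*?)\)")
anchored = re.compile(r"^\s*FINAL\((.*?)\)", re.MULTILINE)

# A model message that mentions FINAL() in a comment before actually calling it.
message = (
    "# I'll compute the count, then report it via FINAL(count)\n"
    "count = sum(1 for r in results if r == 'incorrect')\n"
    "FINAL(count)\n"
)

print(loose.findall(message))     # ['count', 'count'] - the comment also triggers termination
print(anchored.findall(message))  # ['count'] - only the standalone call matches
```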
2. Streaming required for Anthropic
Anthropic’s API requires streaming for outputs exceeding ~21K tokens - a non-obvious constraint that causes silent failures without proper handling.
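With the Anthropic Python SDK, this roughly translates to using the streaming helper instead of a plain `create` call. A minimal sketch, with parameter values that are illustrative rather than RLM's internals:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
prompt = "..."  # the (potentially very long) REPL prompt

# Stream the response so large-output requests don't fail or time out.
with client.messages.stream(
    model="claude-sonnet-4-5-20250929",
    max_tokens=32_000,
    messages=[{"role": "user", "content": prompt}],
) as stream:
    message = stream.get_final_message()

print(message.content[0].text)
```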
The Token Economics
RLM’s accuracy improvement comes at a cost:
- 4x latency - REPL iteration overhead
- 21x input tokens - Recursive context reprocessing
- 3x output tokens - Code generation + execution
For tasks where baseline accuracy is effectively 0%, the tradeoff is justified. The baseline doesn’t just perform worse - it can’t answer at all.
When to Use RLM
RLM shines when tasks require:
- Counting/aggregation over large datasets
- Multi-step computation that can’t be done in-head
- Structured extraction from noisy contexts
- Verification of intermediate results
RLM is overkill for:
- Simple retrieval (needle-in-haystack with one needle)
- Summarization tasks
- Anything a baseline LLM can answer directly
Key Insights
- Baseline LLMs explain; RLM executes - The fundamental difference. Direct prompting on complex tasks triggers analysis mode.
- Instruction-following is critical - The model must maintain REPL context across iterations. Weaker models lose this frame.
- Evaluation metrics lie - Substring matching in explanations creates false positives. Always inspect actual outputs.
- Programmers, not oracles - RLM reframes LLMs as agents that write code to answer questions, rather than pattern-matching to produce answers.
The paradigm shift is subtle but significant: instead of asking “what is the answer?”, RLM asks “what code would compute the answer?” For tasks requiring computation over data, this is the right abstraction.
*RLM: github.com/alexzhang13/rlm | Paper: arXiv:2512.24601*