Scaffold Engineering: Why the Code Around Your LLM Decides More Than the Model

Frontier coding LLMs have converged within 0.8 points on SWE-bench Pro. The scaffold around them is now the dominant performance variable. Here is what scaffold engineering means, the data behind the claim, and what I found running it locally on Qwen3.6-35B-A3B.
On SWE-bench Pro, the top five coding LLMs sit within 0.8 points of each other. That is roughly the noise floor of the benchmark. The model is no longer the variable that decides whether your agent works in production. The code wrapping it is.
Frontier Models Converged, Now What?
Until late 2025, the dominant question in coding agents was "which model is smartest". That question has gotten boring. The 2026 leaderboards tell a different story: at the frontier, models have converged. The scaffold around them, meaning the system prompt, the tool definitions, the retry logic, the file-handling rules, the context management, now decides more about agent performance than swapping the model does.
First, a definition. The scaffold is everything that surrounds the LLM. Some researchers split it further: scaffold for what is assembled before the run (system prompt, tool schemas, subagent registry), and harness for the runtime orchestration (tool dispatch, context management, safety enforcement). Most papers use the two interchangeably. In this post I use "scaffold" for the whole wrapper.
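To make the split concrete, here is a minimal Python sketch of a toy agent wrapper. Every name in it (`ToolSchema`, `Scaffold`, `Harness`, `dispatch`, `max_context_chars`) is hypothetical and chosen for illustration; none of it is taken from the papers or projects cited in this post.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ToolSchema:
    """Static description of one tool, assembled before the run."""
    name: str
    description: str
    parameters: Dict[str, str]  # parameter name -> type label


@dataclass(frozen=True)
class Scaffold:
    """Everything that is fixed before the first model call."""
    system_prompt: str
    tool_schemas: List[ToolSchema]
    subagents: List[str] = field(default_factory=list)


class Harness:
    """Runtime orchestration: tool dispatch, a crude context budget, a safety check."""

    def __init__(
        self,
        scaffold: Scaffold,
        handlers: Dict[str, Callable[..., str]],
        max_context_chars: int = 8000,
    ) -> None:
        self.scaffold = scaffold
        self.handlers = handlers
        self.max_context_chars = max_context_chars
        self.history: List[str] = []

    def dispatch(self, tool_name: str, **kwargs: str) -> str:
        # Safety enforcement: only tools declared in the scaffold may run.
        declared = {t.name for t in self.scaffold.tool_schemas}
        if tool_name not in declared or tool_name not in self.handlers:
            result = f"error: unknown tool {tool_name!r}"
        else:
            result = self.handlers[tool_name](**kwargs)
        self.history.append(result)
        self._trim()
        return result

    def _trim(self) -> None:
        # Context management: drop the oldest results until under budget.
        while (
            len(self.history) > 1
            and sum(len(h) for h in self.history) > self.max_context_chars
        ):
            self.history.pop(0)
```

The point of the split is that `Scaffold` can be serialized and diffed between runs, while `Harness` holds the state that only exists at runtime.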
What you will learn
Holding the Model Fixed: The Scaffold Lever
The cleanest evidence comes from experiments that hold the model fixed and change only the scaffold around it. Four examples from 2025 to 2026, ordered from the smallest controlled change to the largest practical swing.
Scaffold-only swaps at fixed model
| Setup | Score change | Source |
|---|---|---|
| Same model, basic vs optimized scaffold (SWE-bench Pro) | +22 pts swing | SWE-bench Pro (Scale AI) |
| Grok Code Fast, edit format change ONLY | 6.7% → 68.3% (~10×) | Grok engineering notes |
| little-coder on Qwen 9B (small-model scaffold) | 19.1% → 45.6% (+26.5 pts) | little-coder repo |
| little-coder on Qwen3.6-35B-A3B (Aider Polyglot 225) | 78.67% | little-coder v0.0.5 |
In the Grok Code Fast row, the only change was the edit format: diff patches instead of full file rewrites. A roughly 10× swing from a single line of scaffold configuration.
Performance = Model × Scaffold
The mental model that comes out of these results is simple: agent performance is a product, not a sum. The model contributes one factor, the scaffold contributes the other, and the outcome is their interaction. A frontier model in a weak scaffold can lose to a mid-tier model in a strong scaffold. There is already a published example of exactly this: a Sonnet-class model in a custom scaffold scored 52.7% on a coding benchmark where an Opus-class model in its default scaffold scored 52.0%. The smaller model won because the wrapper around it was better.
[Figure: Scaffold Swap on Fixed Weights. Qwen 9B on Aider Polyglot 225, only the wrapper code changes.]
The chart above shows the cleanest reproducible case I trust: the small-model scaffold from little-coder on a 9B Qwen, on the standard Aider Polyglot benchmark. The weights and the evaluation are held constant. The only variable is the code wrapping the model. The scaffold is doing more than 26 points of work.
Anatomy of a Scaffold: Five Loop Primitives
A scaffold is not a monolith. The Inside the Scaffold survey (Rombaut et al., 2026) catalogued 13 production agents and found that they compose five loop primitives in various combinations. Eleven of the thirteen agents mix at least two. Knowing the primitives is enough to read any scaffold you encounter.
Loops are not the whole scaffold
The edit format (full file rewrites versus str_replace patches) is treated as a separate axis. Five of the thirteen agents in the survey independently converged on str_replace. The Grok example earlier (6.7% to 68.3% on a single format swap) shows how heavy this one lever is, even before you touch the loop.
The Scaling Law: Small Models Need More
The Meta-Harness paper from Stanford (arXiv:2603.28052) measured scaffold impact across model scales and found a clean pattern: scaffold helps every model, but the gain is proportionally larger for smaller models.
Scaffold gain by model capability (Meta-Harness on Terminal-Bench 2.0)
| Model class | Meta-Harness gain |
|---|---|
| GPT-5.4-nano (weakest) | +8.7 pts |
| Gemini-3.1-Flash-Lite | +6.3 pts |
| Gemini-3-Flash | +3.7 pts |
| GPT-OSS-20B (strongest in test) | +3.0 pts |
A good way to read this: frontier models self-regulate many things that smaller models cannot. The smaller the model, the more the scaffold has to become explicit infrastructure for behaviors the larger model produces for free.
Frontier model self-regulates
- Reasoning length: Decides on its own when to stop thinking
- Tool call formatting: Produces well-formed calls reliably
- Edit vs rewrite: Picks the right surgical change
- Codebase exploration: Knows what to look at next
- Self-correction: Recovers after a failed attempt
Small model needs infrastructure
- Thinking budget cap: Hard limit, retry without thinking on overflow
- Malformed-output parser: Repair common JSON and tool-call mistakes
- Write guard: Force surgical edits, refuse full rewrites
- Workspace discovery injection: Pre-feed the model a file map
- Retry with test output: Inject the failing test back into the prompt
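As an illustration of the malformed-output parser item, here is a minimal sketch of a repair pass for JSON tool calls. The function name `repair_tool_call` and the requirement that a call carries a `"name"` key are assumptions made for this example, not the behavior of any scaffold named in this post.

```python
import json
import re
from typing import Optional


def repair_tool_call(raw: str) -> Optional[dict]:
    """Try to recover a JSON tool call from imperfect model output.

    Returns the parsed call, or None if it cannot be repaired.
    """
    text = raw.strip()
    # 1. Keep only the outermost {...} span. This drops chatter and
    #    code-fence markers the model may have put around the call.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    text = text[start:end + 1]
    # 2. Remove trailing commas before a closing brace or bracket.
    #    Limitation: this also rewrites ",}" inside string values.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return None
    # 3. Accept only objects that name a tool (assumed schema).
    return call if isinstance(call, dict) and "name" in call else None
```

A harness would typically call this before giving up on a turn, and fall back to a retry prompt only when it returns `None`.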
This is also why "little-coder style" scaffolds exist as a distinct family. Their entire design philosophy fits in one line from the project: where a frontier-model scaffold can assume the model will self-regulate, for small models each of those behaviors must become infrastructure.
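The write guard item from the list above can be sketched in a few lines. This is an assumed design, not little-coder's actual code: `EditRejected`, `apply_str_replace`, and `write_guard` are hypothetical names, and the exactly-one-match rule is one common way to keep str_replace edits unambiguous.

```python
class EditRejected(Exception):
    """Raised when the guard refuses an edit; the message goes back to the model."""


def apply_str_replace(source: str, old: str, new: str) -> str:
    """Apply one surgical edit. `old` must match exactly once in `source`."""
    if not old:
        raise EditRejected("empty search string")
    count = source.count(old)
    if count != 1:
        raise EditRejected(
            f"search string matched {count} times, expected exactly 1"
        )
    return source.replace(old, new, 1)


def write_guard(path_exists: bool, is_full_rewrite: bool) -> None:
    """Refuse whole-file writes to files that already exist."""
    if path_exists and is_full_rewrite:
        raise EditRejected(
            "file exists: send a str_replace edit instead of a rewrite"
        )
```

Raising with a readable message matters here: the rejection text is what the model sees on its next turn, so it doubles as the instruction for how to retry.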
When More Scaffold Hurts
A natural intuition, once you accept the scaffold thesis, is that more scaffold should be better. The data refuses to confirm this. The NL Agent Harnesses ablation (arXiv:2603.25723) measured the marginal impact of common scaffold components on SWE-bench Verified at N=125 runs.
Component-level ablation on SWE-bench Verified (NL Agent Harnesses)
| Component | Marginal impact | Direction |
|---|---|---|
| Self-Evolution (acceptance-gated retry) | +4.8 pts | Best positive |
| File-Backed State | +1.6 pts | Positive |
| Verifier | −0.8 / −8.4 pts | Negative |
| Multi-Candidate Search | −2.4 / −5.6 pts | Negative |
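For readers who have not seen the pattern, here is one plausible reading of "acceptance-gated retry" as a sketch. It is not the paper's implementation: `generate` and `run_tests` are hypothetical callables standing in for the model call and the test runner, and the retry prompt wording is invented for the example.

```python
from typing import Callable, Optional, Tuple


def acceptance_gated_retry(
    generate: Callable[[str], str],
    run_tests: Callable[[str], Tuple[bool, str]],
    task: str,
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Retry until a candidate passes the acceptance gate or attempts run out.

    Returns (accepted_candidate_or_None, attempts_used).
    """
    prompt = task
    for attempt in range(1, max_attempts + 1):
        candidate = generate(prompt)
        passed, output = run_tests(candidate)
        if passed:
            # The gate is the test run itself: nothing is accepted on
            # the model's own say-so.
            return candidate, attempt
        # Feed the failing output back so the next attempt is informed.
        prompt = (
            f"{task}\n\nPrevious attempt failed with:\n{output}\nFix the code."
        )
    return None, max_attempts
```

Note the contrast with the Verifier and Multi-Candidate Search rows: this loop adds no second model judgment and no parallel candidates, only an external pass/fail signal.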
The 'more is better' trap
I Ran This on Qwen3.6-35B-A3B Locally
The published numbers are useful, but I wanted my own data point on a model I can actually run on a Mac Studio. I built a local benchmark stack that vendors Aider's own Polyglot evaluator and drives it against Qwen3.6-35B-A3B through Ollama. Aider Polyglot is 225 Exercism exercises across six languages, the same benchmark the little-coder reference numbers in this post are reported on.
The evaluation pipeline
A first smoke run on a 36-exercise subset (go, python, rust), with the generate-test-repair scaffold and thinking enabled, looked like this:
First in-house smoke run (r0_smoke_thinkon_pgr, 2026-05-03)
| Metric | Value |
|---|---|
| Model | qwen3.6:35b-a3b-q8_0 (Ollama) |
| Evaluation | Aider Polyglot subset, 36 exercises (go / python / rust) |
| Config | generate-test-repair, thinking ON, edit_format = whole (auto) |
| pass@1 | 44.4% |
| pass@2 | 63.9% |
| Errors | 0 / 36 (pipeline clean) |
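The two pass rates are simple counts over the 36 exercises, sketched below. Two assumptions are baked in: pass@2 is cumulative (solved within two tries), and the counts 16 and 23 are back-derived from the percentages in the table, since they are the only integers out of 36 that round to 44.4% and 63.9%. The `tries_to_pass` encoding is hypothetical.

```python
from typing import Iterable, Optional


def pass_rate(tries_to_pass: Iterable[Optional[int]], within: int) -> float:
    """Percentage of exercises solved in at most `within` tries.

    Each entry is the try on which the exercise first passed,
    or None if it never passed.
    """
    results = list(tries_to_pass)
    solved = sum(1 for t in results if t is not None and t <= within)
    return round(100.0 * solved / len(results), 1)


# Back-derived, not reported: 16 solved on try 1, 7 more on try 2, 13 unsolved.
results = [1] * 16 + [2] * 7 + [None] * 13

print(pass_rate(results, within=1))  # 44.4
print(pass_rate(results, within=2))  # 63.9
```

At N=36, one exercise is worth about 2.8 points, which is worth keeping in mind when reading any delta from a smoke run of this size.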
One caveat: the run used edit_format = whole, which Aider auto-selected for this model. Every public reference number in this post was measured under diff. Given what we saw earlier (Grok 6.7% to 68.3% on this single lever), my 63.9% is plausibly depressed by the format, not by the model or the scaffold. Until I rerun under matched conditions, no cross-config delta is interpretable. That rerun, with N=3 and matched diff, is the next step.
What the smoke proved
Takeaways
If you build with coding agents in 2026, the biggest single lever you control is the scaffold, not the model.
What I take from this
Sources