Scaffold Engineering: Why the Code Around Your LLM Decides More Than the Model

🧱 Scaffold Engineering

On SWE-bench Pro, the top five coding LLMs sit within 0.8 points of each other. That is roughly the noise floor of the benchmark. The model is no longer the variable that decides whether your agent works in production. The code wrapping it is.

🎯 Frontier Models Converged, Now What?

Until late 2025, the dominant question in coding agents was "which model is smartest". That question has gotten boring. The 2026 leaderboards tell a different story: at the frontier, models have converged. The scaffold around them, meaning the system prompt, the tool definitions, the retry logic, the file-handling rules, the context management, now decides more about agent performance than swapping the model does.

First, a definition. The scaffold is everything that surrounds the LLM. Some researchers split it further: scaffold for what is assembled before the run (system prompt, tool schemas, subagent registry), and harness for the runtime orchestration (tool dispatch, context management, safety enforcement). Most papers use the two interchangeably. In this post I use "scaffold" for the whole wrapper.
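To make the scaffold/harness split concrete, here is a minimal sketch in Python. The class names, fields, and the `max_tool_calls` limit are all hypothetical and are not taken from any of the papers or projects cited in this post. The sketch only separates what is assembled before the run from what enforces rules at runtime.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scaffold:
    """Assembled before the run: static configuration handed to the model."""
    system_prompt: str
    tool_schemas: dict[str, dict]        # tool name -> schema-like description
    subagents: tuple[str, ...] = ()      # names in the subagent registry


@dataclass
class Harness:
    """Runtime orchestration: dispatches tool calls and enforces limits."""
    scaffold: Scaffold
    tools: dict[str, Callable[..., str]]
    max_tool_calls: int = 20             # hypothetical safety budget
    calls_made: int = 0

    def dispatch(self, name: str, **kwargs) -> str:
        # Safety enforcement: only tools declared in the scaffold may run.
        if name not in self.scaffold.tool_schemas:
            return f"error: unknown tool {name!r}"
        if self.calls_made >= self.max_tool_calls:
            return "error: tool-call budget exhausted"
        self.calls_made += 1
        return self.tools[name](**kwargs)


scaffold = Scaffold(
    system_prompt="You are a coding agent.",
    tool_schemas={"read_file": {"path": "string"}},
)
harness = Harness(scaffold, tools={"read_file": lambda path: f"<contents of {path}>"})
print(harness.dispatch("read_file", path="main.py"))
print(harness.dispatch("delete_everything"))
```

Errors come back as strings rather than exceptions so that the model can read them on its next turn. That is a design choice of this sketch, not a claim about how any particular agent behaves.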

What you will learn

Why scaffold matters more than model at the frontier, how to read any scaffold as a composition of five loop primitives, the scaling law that makes scaffold gains bigger for smaller models, the failure mode where more scaffold actually hurts, and what I found running this stack on Qwen3.6-35B-A3B on my own machine.

📊 Holding the Model Fixed: The Scaffold Lever

The cleanest evidence comes from experiments that hold the model fixed and change only the scaffold around it. Four examples from 2025 to 2026, ordered from the smallest controlled change to the largest practical swing.

Scaffold-only swaps at fixed model

Setup	Score change	Source
🧠 Same model, basic vs optimized scaffold (SWE-bench Pro)	+22 pts swing	SWE-bench Pro (Scale AI)
🪄 Grok Code Fast, edit format change ONLY	6.7% → 68.3% (~10×)	Grok engineering notes
🐣 little-coder on Qwen 9B (small-model scaffold)	19.1% → 45.6% (+26.5 pts)	little-coder repo
🚀 little-coder on Qwen3.6-35B-A3B (Aider Polyglot 225)	78.67%	little-coder v0.0.5

💡 The single most striking number: Grok Code Fast went from 6.7% to 68.3% on a coding benchmark by changing nothing but the edit format. Same weights, same prompt, same tools. The model was asked to deliver code as diff patches instead of full file rewrites. A roughly 10× swing from a single line of scaffold configuration.

✖️ Performance = Model × Scaffold

The mental model that comes out of these results is simple: agent performance is a product, not a sum. The model contributes one factor, the scaffold contributes the other, and the outcome is their interaction. A frontier model in a weak scaffold can lose to a mid-tier model in a strong scaffold. There is already a published example of exactly this: a Sonnet-class model in a custom scaffold scored 52.7% on a coding benchmark where an Opus-class model in its default scaffold scored 52.0%. The smaller model won because the wrapper around it was better.

[Chart: Scaffold Swap on Fixed Weights. Qwen 9B on Aider Polyglot 225, only the wrapper code changes.]

The chart above shows the cleanest reproducible case I trust: the small-model scaffold from little-coder on a 9B Qwen, on the standard Aider Polyglot benchmark. The weights and the evaluation are held constant. The only variable is the code wrapping the model. The scaffold is doing more than 26 points of work.

🧰 Anatomy of a Scaffold: Five Loop Primitives

A scaffold is not a monolith. The "Inside the Scaffold" survey (Rombaut et al., 2026) catalogued 13 production agents and found that they compose five loop primitives in various combinations. Eleven of the thirteen agents mix at least two. Knowing the primitives is enough to read any scaffold you encounter.

[Diagram: the five loop primitives: ReAct (think-act-observe), Generate-Test-Repair, Plan-Execute (phase split), Multi-Attempt Retry (best-of-N), Tree Search (flat to MCTS).]

ReAct: The classic thought-action-observation loop. The model thinks about what to do, takes an action, observes the result, repeats. Used by 7 of 13 surveyed agents.

Generate-Test-Repair: The model generates a candidate, the harness runs tests or linters, and the failing output is fed back for a repair pass.

Plan-Execute: A distinct planning phase produces a structured plan, then execution runs with different (often more restricted) tool access.

Multi-Attempt Retry: The harness runs several independent attempts and selects the best. Cheap to add, surprisingly effective when verification is reliable.

Tree Search: From flat sampling to depth-first to full Monte Carlo Tree Search. The most expensive primitive, used when the search space rewards exploration.
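As an illustration of how two of these primitives compose, the sketch below wraps a Generate-Test-Repair loop inside a Multi-Attempt Retry. Everything here is an assumption made for the example: the `Model` and `TestRunner` signatures, the prompt layout, and the stub model are hypothetical, and no surveyed agent is claimed to work this way. With a binary verifier, "select the best" reduces to "return the first attempt that passes".

```python
from typing import Callable, Optional

# Hypothetical signatures: a model maps a prompt to candidate code, and a test
# runner returns (passed, failure_output).
Model = Callable[[str], str]
TestRunner = Callable[[str], tuple[bool, str]]


def generate_test_repair(model: Model, run_tests: TestRunner, task: str,
                         max_repairs: int = 2) -> tuple[Optional[str], int]:
    """Generate a candidate, test it, and feed failures back for repair.

    Returns (passing_code_or_None, number_of_model_calls).
    """
    prompt = task
    calls = 0
    for _ in range(1 + max_repairs):
        candidate = model(prompt)
        calls += 1
        passed, output = run_tests(candidate)
        if passed:
            return candidate, calls
        # Repair pass: the failing output becomes part of the next prompt.
        prompt = f"{task}\n\nPrevious attempt:\n{candidate}\n\nTest output:\n{output}"
    return None, calls


def multi_attempt(attempt: Callable[[], Optional[str]], n: int) -> Optional[str]:
    """Run up to n independent attempts and return the first verified one."""
    for _ in range(n):
        result = attempt()
        if result is not None:
            return result
    return None


def stub_model(prompt: str) -> str:
    # Stand-in for an LLM: it only gets the answer right after seeing test output.
    if "Test output" in prompt:
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"


def run_tests(code: str) -> tuple[bool, str]:
    namespace: dict = {}
    exec(code, namespace)  # acceptable for a toy; a real harness would sandbox this
    got = namespace["add"](2, 3)
    if got == 5:
        return True, ""
    return False, f"add(2, 3) returned {got}, expected 5"


task = "Write add(a, b)."
code, calls = generate_test_repair(stub_model, run_tests, task)
print(calls, code)
result = multi_attempt(
    lambda: generate_test_repair(stub_model, run_tests, task, max_repairs=0)[0], n=3)
print(result)
```

The stub is deterministic, so the retry wrapper with `max_repairs=0` never succeeds. Independent attempts only help when sampling actually varies between them, and only when the verifier can be trusted.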

📐 Loops are not the whole scaffold

The five primitives describe how the iteration runs. The same survey lists at least one other dimension with massive practical impact: the edit format. How the model is asked to deliver code (full file rewrite vs surgical diff vs str_replace patches) is treated as a separate axis. Five of the thirteen agents in the survey independently converged on str_replace. The Grok example earlier (6.7% to 68.3% on a single format swap) shows how heavy this one lever is, even before you touch the loop.
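For readers who have not seen a str_replace edit format, here is a minimal sketch of the idea in Python. The function name, its exactly-one-match rule, and the error messages are assumptions made for illustration, not the specification used by any surveyed agent. The model sends an old snippet and a new snippet instead of the whole file, and the scaffold refuses the edit unless the old snippet can be located unambiguously.

```python
def str_replace(source: str, old: str, new: str) -> str:
    """Apply one surgical edit: `old` must occur exactly once in `source`."""
    if not old:
        raise ValueError("old string must not be empty")
    count = source.count(old)
    if count == 0:
        raise ValueError("old string not found; re-read the file and try again")
    if count > 1:
        raise ValueError(f"old string is ambiguous ({count} matches); add more context")
    return source.replace(old, new, 1)


source = "def area(r):\n    return 3.14 * r * r\n"
print(str_replace(source, "3.14", "math.pi"))
```

In this sketch the error messages are written for the model: they go back into the context so that the next attempt can include more surrounding lines.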

⚖️ The Scaling Law: Small Models Need More

The Meta-Harness paper from Stanford (arXiv:2603.28052) measured scaffold impact across model scales and found a clean pattern: scaffold helps every model, but the gain is proportionally larger for smaller models.

Scaffold gain by model capability (Meta-Harness on Terminal-Bench 2.0)

Model class	Meta-Harness gain
GPT-5.4-nano (weakest)	+8.7 pts
Gemini-3.1-Flash-Lite	+6.3 pts
Gemini-3-Flash	+3.7 pts
GPT-OSS-20B (strongest in test)	+3.0 pts

A good way to read this: frontier models self-regulate many things that smaller models cannot. The smaller the model, the more the scaffold has to become explicit infrastructure for behaviors the larger model produces for free.

Frontier model self-regulates

Reasoning length: Decides on its own when to stop thinking
Tool call formatting: Produces well-formed calls reliably
Edit vs rewrite: Picks the right surgical change
Codebase exploration: Knows what to look at next
Self-correction: Recovers after a failed attempt

Small model needs infrastructure

Thinking budget cap: Hard limit, retry without thinking on overflow
Malformed-output parser: Repair common JSON and tool-call mistakes
Write guard: Force surgical edits, refuse full rewrites
Workspace discovery injection: Pre-feed the model a file map
Retry with test output: Inject the failing test back into the prompt
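To give one of these items a concrete shape, the following is a rough sketch of a malformed-output parser. The expected call layout (`name` plus `arguments`) and the two repairs are assumptions chosen for the example, not the behavior of little-coder or any other project named here. It drops chatter and code fences around the JSON object, removes trailing commas, and rejects anything that still does not parse.

```python
import json
import re
from typing import Optional


def parse_tool_call(raw: str) -> Optional[dict]:
    """Best-effort parse of a model's tool call; returns None if unrecoverable."""
    # Repair 1: keep only the outermost {...} span. This drops surrounding
    # chatter and any Markdown code fence wrapped around the JSON.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None
    text = raw[start:end + 1]
    # Repair 2: remove trailing commas before a closing brace or bracket.
    # Known limitation: this would also touch ",}" inside a string value.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    try:
        call = json.loads(text)
    except json.JSONDecodeError:
        return None
    # Minimal shape check: a string "name" and a dict of "arguments".
    if not isinstance(call, dict) or not isinstance(call.get("name"), str):
        return None
    if not isinstance(call.get("arguments", {}), dict):
        return None
    call.setdefault("arguments", {})
    return call


raw = 'Sure!\n```json\n{"name": "read_file", "arguments": {"path": "a.py",},}\n```'
print(parse_tool_call(raw))
print(parse_tool_call("I cannot do that."))
```

Returning `None` instead of raising leaves the decision to the harness, which can then re-prompt, retry without thinking, or give up.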

This is also why "little-coder style" scaffolds exist as a distinct family. Their entire design philosophy fits in one line from the project: where a frontier-model scaffold can assume the model will self-regulate, for small models each of those behaviors must become infrastructure.
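As one illustration of a behavior turned into infrastructure, a write guard could look roughly like the sketch below. The 0.5 threshold, the line-level similarity measure, and the function name are all hypothetical. I have not checked how little-coder implements its guard, so treat this as a stand-in for the idea rather than a port.

```python
import difflib


def guard_write(old: str, new: str, max_changed_ratio: float = 0.5) -> bool:
    """Return True if the proposed write looks like a surgical edit.

    Compares the files line by line and rejects writes that change more than
    `max_changed_ratio` of the content (a hypothetical threshold).
    """
    if not old:
        return True  # creating a new file is always allowed
    matcher = difflib.SequenceMatcher(
        None, old.splitlines(), new.splitlines(), autojunk=False)
    return (1.0 - matcher.ratio()) <= max_changed_ratio


original = "\n".join(f"line {i}" for i in range(10))
surgical = original.replace("line 3", "line three")
rewrite = "\n".join(f"other {i}" for i in range(10))
print(guard_write(original, surgical))
print(guard_write(original, rewrite))
```

A harness would typically pair a rejected write with a message telling the model to resubmit the change as a targeted edit.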

⚠️ When More Scaffold Hurts

A natural intuition, once you accept the scaffold thesis, is that more scaffold should be better. The data refuses to confirm this. The NL Agent Harnesses ablation (arXiv:2603.25723) measured the marginal impact of common scaffold components on SWE-bench Verified at N=125 runs.

Component-level ablation on SWE-bench Verified (NL Agent Harnesses)

Component	Marginal impact	Direction
Self-Evolution (acceptance-gated retry)	+4.8 pts	🟢 Best positive
File-Backed State	+1.6 pts	🟢 Positive
Verifier	−0.8 / −8.4 pts	🔴 Negative
Multi-Candidate Search	−2.4 / −5.6 pts	🔴 Negative

The 'more is better' trap

Two of the four common scaffold components measured were net negative. Verifier and Multi-Candidate Search both pulled scores down. The mechanism flagged by the authors is alignment failure: components that look locally correct can still misalign with the evaluation criteria, and the more confidently they vote, the more they steer the agent wrong. Scaffold engineering is not "add another module". It is a design problem with real failure modes.

🧪 I Ran This on Qwen3.6-35B-A3B Locally

The published numbers are useful, but I wanted my own data point on a model I can actually run on a Mac Studio. I built a local benchmark stack that vendors Aider's own Polyglot evaluator and drives it against Qwen3.6-35B-A3B through Ollama. Aider Polyglot is 225 Exercism exercises across six languages, the same benchmark the little-coder reference numbers in this post are reported on.

[Diagram: The evaluation pipeline]

A first smoke run on a 36-exercise subset (go, python, rust), with the generate-test-repair scaffold and thinking enabled, looked like this:

First in-house smoke run (r0_smoke_thinkon_pgr, 2026-05-03)

Metric	Value
Model	qwen3.6:35b-a3b-q8_0 (Ollama)
Evaluation	Aider Polyglot subset, 36 exercises (go / python / rust)
Config	generate-test-repair, thinking ON, edit_format = whole (auto)
pass@1	44.4%
pass@2	63.9%
Errors	0 / 36 (pipeline clean)
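The two pass rates can be reproduced with a few lines of Python, assuming pass@k here means "solved within the first k attempts", which is how I read Aider's two-try reporting. This is not the sampling-based pass@k estimator from the code-generation literature. The per-exercise outcomes below are illustrative, not the raw run log: 16 first-try passes and 7 second-try passes are the only counts out of 36 that round to the reported percentages.

```python
from typing import Optional, Sequence


def pass_at(k: int, solved_on: Sequence[Optional[int]]) -> float:
    """Fraction of exercises solved within the first k attempts.

    solved_on[i] is the 1-based attempt at which exercise i passed its tests,
    or None if it never passed.
    """
    solved = sum(1 for attempt in solved_on if attempt is not None and attempt <= k)
    return solved / len(solved_on)


# Illustrative outcomes reconstructed from the reported percentages.
outcomes = [1] * 16 + [2] * 7 + [None] * 13
print(f"pass@1 = {pass_at(1, outcomes):.1%}")  # pass@1 = 44.4%
print(f"pass@2 = {pass_at(2, outcomes):.1%}")  # pass@2 = 63.9%
```

On a 36-exercise subset, one exercise is worth about 2.8 points, which is one more reason not to read small deltas into a single smoke run.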

🪤 The open confound I am being honest about: My run used edit_format = whole, which Aider auto-selected for this model. Every public Aider Polyglot reference number in this post was measured under diff. Given what we saw earlier (Grok 6.7% to 68.3% on this single lever), my 63.9% is plausibly depressed by the format, not by the model or the scaffold. Until I rerun under matched conditions, no cross-config delta is interpretable. That re-run, with N=3 and matched diff, is the next step.

What the smoke proved

The pipeline is solid: Zero errors across 36 exercises, clean JSON output, deterministic aggregation.

The depressed baseline is consistent with the scaffold thesis but does not yet test it: pass@2 reaches 63.9% on a model for which I have no published Aider baseline; the little-coder-style scaffold target is 78.67%.

The matched-conditions run is the falsifiable next test: If my little-coder-style port lands within ±10 pts of 78.67%, the setup is validated. If not, the diagnosis itself becomes useful.

🎯 Takeaways

If you build with coding agents in 2026, the biggest single lever you control is the scaffold, not the model.

What I take from this

Frontier model swaps are the small lever: On SWE-bench Pro the top models cluster within roughly the noise floor.

Scaffold swaps are the big lever: +22 pts same-model swing on SWE-bench Pro, 10× on Grok by edit format alone, +26.5 pts on Qwen 9B with little-coder.

Read scaffolds as five loop primitives: ReAct, Generate-Test-Repair, Plan-Execute, Multi-Attempt Retry, Tree Search. Production agents compose them.

Smaller models need more explicit infrastructure: Thinking budget caps, write guards, output parsers, retry-with-test, workspace discovery.

More scaffold is not better: Verifier and Multi-Candidate Search were both net negative in the NLAH ablation. Scaffold is a design problem.

Run it yourself before you trust the headline: My 63.9% is honest data with an open confound (edit_format = whole). The matched-conditions run is the next step.

🚀 What to do next: Pick one coding agent you already use. Identify which of the five loop primitives it composes. Then ask one question: what single scaffold component would most likely move its score? That is your biggest move for the week.

📚 Sources

Meta-Harness (Stanford), arXiv:2603.28052
Inside the Scaffold (Rombaut et al., 2026)
NL Agent Harnesses (Tsinghua), arXiv:2603.25723
little-coder repo