Three teams, three scores, one model

Imagine three teams running SWE-bench Verified — the standard agent coding benchmark — with the same Anthropic Opus 4.5 weights. With different scaffolds, they'd land somewhere around 47%, 54%, 60% — a spread of 13 points on identical model capability.

Same model. Same prompts. Same problems.

The variable was the agent harness. And the spread tells you almost everything about why most A/B tests in agent development are confounded — the result you measure isn't the comparison you think you're running.

What a harness is ?

Every engineer has written a test harness — the code that wraps a system, feeds it inputs, captures outputs, and grades them. An agent harness is the same idea, pointed at a model.

A model produces tokens. A harness turns those tokens into actions. When you read "Opus 4.5 ran on SWE-bench," what actually ran wasn't the weights — it was the weights wrapped in a harness. The harness reads the benchmark task, calls the model, parses the response, executes the suggested actions in a sandbox, captures results, decides whether to call the model again, and submits a candidate solution. The weights only generate text. The harness does everything else.

Concrete harnesses you'd see on a Scale AI leaderboard:

  • Claude Code — Anthropic's coding harness with built-in tools, filesystem access, and bash

  • Mini-SWE-Agent — a minimal harness built for reproducible benchmark runs

  • SWE-Agent — the original Princeton harness for SWE-bench

  • Custom ReAct loops — your team's own wrapper around model responses

Each is a different harness for the same underlying model. Different harness, different result. The harness is the variable — and most teams treat it as fixed.

Framing and tightening your harness

Building a trustworthy harness — the tools, loop, budgets, and context handling wrapped around the model — is like standing up a CI pipeline. First you frame it: decide which steps run. Then you tighten it: pin the versions, lock the runner image, make every run reproducible. Skip the tightening and the green checkmark means nothing.

Two phases. Frame, then tighten.

Frame is the initial design. Decide what the harness can do.

  • Tools. What can the agent call? File search, test execution, patch writing, bash. Each tool adds capability and adds variance.

  • Budgets. How many turns before the harness gives up? How much cost per task? How much wall-clock time?

  • Context strategy. What happens when the conversation grows — summarize, truncate, or slide the window?

  • Retry policy. What failures trigger a retry? Transient errors only, or logic errors too?

You frame once, at project start. The frame defines the harness's possibility space.

Tighten is the iterative phase. Run, observe, lock.

  1. Run the harness with the frame as drafted.

  2. Re-run it a few times. Identify what varies between runs — temperature drift, retry behavior, context-eviction order, tool-call ordering.

  3. Lock one variable at a time. Pin temperature=0. Fix the random seed. Force deterministic context management. Make tool-call order explicit.

  4. Re-run. Confirm the variance you saw is gone.

  5. Repeat until run-to-run variance falls below the benchmark's noise floor.

Most teams skip tightening because the harness "works" — it produces a score and doesn't crash. That score is confounded. You can use it as a vibe check, not as an A/B comparison. Which is what most agent eval is right now.

The harness is the planner

A close software-engineering analogy: same SQL query, different query planner, very different execution time. The planner makes execution decisions the SQL doesn't — index choice, join order, parallelism. Those decisions don't show up in your SQL, but they show up in your results.

For agents, the harness is the planner. You'll also hear it called the scaffold — same thing, different word; "scaffold" is the term people reach for when they're talking about the harness's effect on a benchmark score. Until you lock it, your benchmark is measuring harness engineering, not model capability. Published SWE-bench Pro scaffold analyses (Scale AI's SEAL, 2026) document this directly: same model weights, different harness, five to twenty-two point swings on SWE-bench across reported configurations.

Locking the harness so an A/B test compares what you think it compares — that's scaffold parity. The rest of this issue is the checklist.

The 6 variables

These are the levers most likely to differ across two A/B-tested agent runs. Lock all six and you're comparing prompts or models. Skip even one and you're rolling dice with extra steps.

# scaffold_parity.yaml
# Lock all six before running any A/B comparison.

iteration_budget: 250          # max turns the agent can take
tool_availability:             # tools exposed, in fixed order
  - patch_writer
  - test_executor
  - file_search
reflection_loops:              # self-critique passes per turn
  count: 1
  prompt_id: "review_then_act"
repository_navigation:         # how the agent finds context
  strategy: "ripgrep_first"
  fallback: "ast_walk"
retry_policy:
  retry_on: ["transient_error", "rate_limit"]
  max_retries: 3
context_management:
  window: 200000
  summarize_after_turns: 50
  eviction: "oldest_tool_call"

iteration_budget is the maximum number of turns. A scaffold with 50 turns gives up where one with 250 turns succeeds. On SWE-bench, the gap between 100 and 250 turns is often double-digit points.

tool_availability is which tools the agent can call. Adding a test_executor so the agent can run tests itself changes its strategy. So does removing it. Order matters — agents tend to pick the first usable tool when they're uncertain.

reflection_loops is whether the agent reviews its own work before committing. Even one reflection pass with a prompt like "review your patch before submitting" can shift scores 3–5 points.

repository_navigation is how the agent finds relevant code. A ripgrep-based search heuristic finds different things than an AST walk, even given the same query.

retry_policy decides what counts as a recoverable error. Retrying on transient network errors is harmless; retrying on logic errors is masking. Both happen in scaffolds, and both produce different scores.

context_management is how the scaffold handles long traces. A sliding window with summarization behaves differently from truncation. Once a scaffold starts evicting tool calls to save tokens, the agent's "memory" of its own actions changes — and so does its next decision.

What this won't catch

Scaffold parity doesn't fix everything. Three sources of variance survive even a fully-locked scaffold:

  • Random seeds. Same scaffold, same prompt, different runs can vary by 1–3 percentage points from sampling alone. Use a fixed seed.

  • Model-side temperature. If you forget to set temperature=0, you're sampling. Set it explicitly.

  • Version drift. "Opus 4.5" today and "Opus 4.5" in three months may not be byte-identical. Pin the model snapshot in your config.

Six scaffold variables plus those three give you nine things to lock. Below nine, you're guessing about how much of your A/B delta is signal versus noise.

What to do this week

Pull your last agent A/B test — the one where prompt B beat prompt A by 4 points, or model X beat model Y by 6. Open the configs side by side. Check all six variables above. If even one was different between runs, the result is unreliable in a specific way: you have no idea how much of the delta was the change you tested versus the change you didn't notice.

If you find a parity gap, redo the comparison with the gap closed. If the result reverses or flattens, you've just learned something useful. If the result holds, you've earned higher confidence in it.

Don't ship a model swap or a prompt change based on an unlocked-scaffold A/B. The cost of redoing is small. The cost of acting on a confounded result is the next quarter of work pointed at the wrong thing.

What did you think of today's email?
Your feedback helps me create better emails for you! comment down 👇
Loved It 😊
It was ok 🙂
Could be better 🤔

Until next time - Teja Derangula,
The gap between thinking and building has shrunk — take advantage.

Reply

Avatar

or to participate

Keep Reading