Evaluating Model/Harness Pairs: Strengths, Weaknesses, Improvement Areas, and Startup Checklist
A practical framework for testing a specific model inside a specific coding harness, identifying where the pair fails, and deciding what to improve first.
Executive summary
A coding agent is not just a model. It is a model/harness pair.
The same model can behave very differently depending on:
- the edit protocol,
- the tool schema,
- the amount and shape of context,
- the validation loop,
- the stopping rules,
- the sandbox,
- the repo instructions,
- the recovery strategy after errors.
So you should not evaluate a model in the abstract. Evaluate the pair:
model + harness + repo + task type + validation environment
The goal is to answer four questions:
- What is this pair good at?
- Where does it fail?
- Are failures model-limited, harness-limited, repo-limited, or validation-limited?
- What is the cheapest next harness change likely to improve real outcomes?
The basic loop is:
run tasks -> collect traces -> label failures -> map failures to harness surfaces -> change one thing -> rerun
The unit of evaluation is the pair, not the model
A model can look weak in one harness and strong in another. Reasons include:
- Some models follow structured tool protocols well; others need simpler tools.
- Some models are good at broad reasoning but bad at precise patch formats.
- Some models use long context well; others perform better with compressed state.
- Some models stop early unless the harness blocks premature completion.
- Some models over-edit unless the harness keeps diff scope visible.
- Some models recover from failing tests; others need stronger failure summaries.
Therefore, record results as:
model: gpt-x / claude-y / qwen-z / local-model
harness: hermes / codex / aider / custom
harness_version: commit or config hash
repo: target repo
repo_state: commit sha
task_type: bugfix | feature | refactor | test | research | review
validation_available: yes | partial | no
Do not say “model A is bad at coding” until you know whether the harness gave it a fair interface.
What you are trying to measure
A useful evaluation should measure more than pass/fail.
Outcome metrics
These answer whether the task got done.
- Task solved: yes/no/partial/blocked.
- Tests passed: yes/no/not run/not available.
- CI passed: yes/no/not run.
- Human accepted diff: yes/no/modified/reverted.
- Final answer truthful: yes/no.
Process metrics
These explain how the task got done.
- Time to first useful edit.
- Number of tool calls.
- Number of shell commands.
- Number of validation attempts.
- Number of repeated failed attempts.
- Tokens/cost.
- Human interventions.
- Files read vs files changed.
- Diff size.
- Unrelated diff size.
Reliability metrics
These catch dangerous behavior.
- False success rate: claimed done but validation failed or never ran.
- Premature stop rate.
- Unrelated modification rate.
- Test avoidance rate.
- Hallucinated file/API rate.
- Permission/sandbox violation rate.
- Stuck loop rate.
- Context drift rate.
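As one illustration, the false success rate can be computed directly from saved run records. This is a minimal sketch: the `claimed_done` field is a hypothetical addition (it is not part of any particular harness's output), and the denominator is assumed to be all runs rather than only runs that claimed completion.

```python
from typing import Iterable, Mapping


def false_success_rate(traces: Iterable[Mapping]) -> float:
    """Fraction of runs that claimed done while validation failed or never ran.

    Assumes each trace has a boolean `claimed_done` (hypothetical field) and a
    `validation_status` of passed | failed | blocked | not_run.
    """
    traces = list(traces)
    if not traces:
        return 0.0
    false_successes = sum(
        1
        for trace in traces
        if trace["claimed_done"] and trace["validation_status"] != "passed"
    )
    return false_successes / len(traces)
```

The other rates in this list can follow the same shape: a predicate over one trace, counted over the suite.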
Usability metrics
These matter in real work even if tests pass.
- Was the plan understandable?
- Was the final summary useful?
- Were risks disclosed?
- Did it ask good clarification questions?
- Did it require too much babysitting?
- Did it produce a reviewable diff?
Build a task suite by capability
Do not start with one giant benchmark. Build a ladder of tasks that isolates capabilities.
Level 0: Harness sanity checks
These verify that the harness itself works.
- Can it read files?
- Can it search files?
- Can it edit one file?
- Can it run a command and observe the exit code?
- Can it inspect git diff?
- Can it refuse to submit before validation?
- Can it report a blocked validation honestly?
If these fail, do not blame the model yet.
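The "run a command and observe the exit code" check can be exercised outside the agent first. The sketch below uses only the Python standard library; `run_and_observe` is a hypothetical helper name, and the deliberately failing command stands in for a real test command.

```python
import subprocess
import sys


def run_and_observe(argv: list[str], timeout: float = 30.0) -> dict:
    """Run a command without a shell and return what the harness should see."""
    completed = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout
    )
    return {
        "command": argv,
        "exit_code": completed.returncode,
        "stdout": completed.stdout,
        "stderr": completed.stderr,
    }


# Sanity check: a deliberately failing command must be reported as failing.
result = run_and_observe([sys.executable, "-c", "import sys; sys.exit(3)"])
assert result["exit_code"] == 3
```

If the harness reports success for a command like this one, the problem is in the harness, not the model.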
Level 1: Atomic coding tasks
Tiny tasks with obvious success criteria.
Examples:
- Fix a typo in a string and run one test.
- Change a function return value and update one assertion.
- Add a missing null check.
- Add one small unit test.
- Rename a local variable without changing behavior.
These catch edit protocol, basic file navigation, and validation discipline.
Level 2: Local bugfix tasks
Tasks requiring 2–5 files and one relevant test command.
Examples:
- Fix parser behavior and add a regression test.
- Fix a route handler and update API tests.
- Fix a CLI flag bug and update help text.
- Fix an off-by-one error where the failing test already exists.
These reveal whether the pair can localize code, reason through a small bug, and close the loop.
Level 3: Cross-cutting changes
Tasks that touch multiple modules and require architecture awareness.
Examples:
- Add a field to a data model, API response, and UI display.
- Change an internal interface and update callers.
- Introduce a new validation rule across worker and API paths.
- Split a module without changing behavior.
These reveal context handling, planning, diff discipline, and regression risk.
Level 4: Ambiguous product tasks
Tasks where the agent must ask questions, propose a plan, and make tradeoffs.
Examples:
- Improve onboarding UX for a confusing setup flow.
- Add observability for task failures.
- Make error messages actionable.
- Reduce cost of a recurring workflow.
These reveal planning, user intent modeling, and restraint.
Level 5: Adversarial or failure-mode tasks
Tasks designed to expose bad behavior.
Examples:
- A test fails for an unrelated reason; does the agent notice?
- The obvious file is a generated file; does it avoid editing it?
- Validation requires a missing service; does it report blocked instead of lying?
- The user asks for a broad refactor; does it over-edit?
- The task conflicts with repo instructions; does it follow policy?
These are where harness improvements usually come from.
The core trace schema
Every run should produce a structured trace. At minimum:
{
  "run_id": "...",
  "model": "...",
  "harness": "...",
  "harness_config": "...",
  "repo": "...",
  "repo_commit": "...",
  "task_id": "...",
  "task_type": "bugfix",
  "prompt": "...",
  "acceptance_criteria": [],
  "tool_calls": [],
  "files_read": [],
  "files_changed": [],
  "commands_run": [
    {
      "command": "pytest tests/test_foo.py -q",
      "exit_code": 0,
      "summary": "passed"
    }
  ],
  "diff_summary": "...",
  "validation_status": "passed|failed|blocked|not_run",
  "final_status": "success|partial|failed|blocked|false_success|user_intervened",
  "failure_labels": [],
  "cost": null,
  "duration_seconds": null
}
The trace is more important than the final answer. You need it to diagnose whether the model failed, the harness failed, or the task was underspecified.
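A trace is only useful for diagnosis if it is well-formed, so it can help to check each one as it is saved. The following is a sketch of such a check against the schema above; which keys count as required, and the rule that a "passed" status needs at least one logged command, are assumptions made for illustration.

```python
# Keys treated as mandatory here are an assumption; adjust to your schema.
REQUIRED_KEYS = {
    "run_id", "model", "harness", "repo_commit", "task_id",
    "commands_run", "validation_status", "final_status", "failure_labels",
}
VALIDATION_STATUSES = {"passed", "failed", "blocked", "not_run"}
FINAL_STATUSES = {
    "success", "partial", "failed", "blocked", "false_success", "user_intervened",
}


def trace_problems(trace: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the trace looks sane."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(trace))]
    if trace.get("validation_status") not in VALIDATION_STATUSES:
        problems.append("unknown validation_status")
    if trace.get("final_status") not in FINAL_STATUSES:
        problems.append("unknown final_status")
    # A "passed" validation with no logged commands cannot be verified later.
    if trace.get("validation_status") == "passed" and not trace.get("commands_run"):
        problems.append("validation_status is 'passed' but no commands were logged")
    return problems
```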
Strength and weakness dimensions
Evaluate the pair on these dimensions.
1. Task understanding
Questions:
- Did it restate the goal correctly?
- Did it identify acceptance criteria?
- Did it ask clarification only when needed?
- Did it avoid solving a different problem?
Failure signs:
- Starts editing before understanding.
- Optimizes for a nearby but wrong issue.
- Ignores constraints in the prompt.
- Treats examples as requirements.
Likely improvement surfaces:
- Better planning step.
- Explicit acceptance criteria extraction.
- Clarification policy.
- Original-goal reminder in context state.
2. Repo navigation and localization
Questions:
- Did it find the right files quickly?
- Did it use search effectively?
- Did it understand call sites and data flow?
- Did it avoid irrelevant files?
Failure signs:
- Reads many files without narrowing.
- Edits the first matching file without checking callers.
- Misses existing tests or patterns.
- Hallucinates files or APIs.
Likely improvement surfaces:
- Repo map or symbol index.
- Better search tools.
- File relevance scoring.
- “Find existing pattern before editing” rule.
- Smaller, targeted context bundles.
3. Edit quality
Questions:
- Is the diff minimal?
- Does it follow existing style?
- Does it preserve public interfaces unless asked?
- Does it avoid placeholders and TODOs?
- Does it introduce unrelated formatting churn?
Failure signs:
- Large diff for small task.
- Rewrites working code unnecessarily.
- Adds speculative abstractions.
- Leaves comments like “implementation goes here.”
- Breaks style conventions.
Likely improvement surfaces:
- Model-specific edit format.
- Diff-size warnings.
- Scope guard.
- Repo style instructions.
- Post-edit self-review.
4. Tool use
Questions:
- Does it use read/search/edit tools correctly?
- Does it run shell commands at the right time?
- Does it inspect command output accurately?
- Does it retry intelligently after tool failure?
Failure signs:
- Repeats the same failing command.
- Ignores nonzero exit codes.
- Uses shell where a safer file tool exists.
- Fails to quote paths.
- Gives up after one transient failure.
Likely improvement surfaces:
- Tool descriptions.
- Safer command wrappers.
- Repetition detector.
- Error summarizer.
- Tool-call examples per model.
5. Validation discipline
Questions:
- Did it discover the right validation command?
- Did it run targeted tests after edits?
- Did it interpret failures correctly?
- Did it distinguish test failures from environment failures?
- Did it avoid claiming success without evidence?
Failure signs:
- “Tests should pass.”
- Runs no tests.
- Runs irrelevant tests.
- Ignores failing tests.
- Claims success despite blocked validation.
Likely improvement surfaces:
- Validation command discovery.
- Hard submit gate.
- Stop hook.
- Test-output summarizer.
- Blocked-validation final state.
6. Recovery behavior
Questions:
- After failure, does it change strategy?
- Does it localize the new error?
- Does it roll back bad edits?
- Does it ask for help when genuinely blocked?
Failure signs:
- Same patch repeated.
- Same command repeated.
- Random edits after test failure.
- Context spirals into irrelevant debugging.
Likely improvement surfaces:
- Max repeated-error threshold.
- Reflection step after N failures.
- Checkpoint/rollback.
- Critic pass.
- Alternative-strategy prompt.
7. Stopping behavior
Questions:
- Does it stop only after satisfying acceptance criteria?
- Does it disclose risks and validation gaps?
- Does it avoid premature final answers?
- Does it avoid endless loops?
Failure signs:
- Final answer before tests.
- Final answer with open todos.
- Keeps working after success.
- Keeps trying after repeated identical failure.
Likely improvement surfaces:
- Submit gate.
- Todo gate.
- Done checklist.
- Loop detector.
- “Blocked” state.
8. Context durability
Questions:
- Does it remember the original task after many turns?
- Does it preserve decisions after compaction?
- Does it keep validation failures visible?
- Does it maintain a current plan?
Failure signs:
- Solves an old subproblem after the goal changed.
- Forgets user constraints.
- Loses failing test output.
- Re-opens resolved questions.
Likely improvement surfaces:
- Structured task state.
- Context compaction rules.
- Persistent plan/todo block.
- Diff and validation summaries.
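One way to picture "structured task state" is a small record that compaction is never allowed to drop. The sketch below is illustrative only: `TaskState` and `compact` are hypothetical names, and real harnesses summarize transcripts in more sophisticated ways than keeping the last few turns.

```python
from dataclasses import dataclass, field


@dataclass
class TaskState:
    """State that must survive compaction: goal, constraints, plan, todos, failures."""
    goal: str
    constraints: list[str] = field(default_factory=list)
    plan: list[str] = field(default_factory=list)
    todos: list[str] = field(default_factory=list)
    last_failure: str = ""

    def render(self) -> str:
        lines = [f"GOAL: {self.goal}"]
        lines += [f"CONSTRAINT: {c}" for c in self.constraints]
        lines += [f"PLAN {i}: {step}" for i, step in enumerate(self.plan, 1)]
        lines += [f"TODO: {t}" for t in self.todos]
        if self.last_failure:
            lines.append(f"LAST FAILURE: {self.last_failure}")
        return "\n".join(lines)


def compact(transcript: list[str], state: TaskState, keep_last: int = 2) -> list[str]:
    """Drop old turns but always re-inject the structured state first."""
    recent = transcript[-keep_last:] if keep_last > 0 else []
    return [state.render()] + recent
```

The design point is that the goal and the latest failure live outside the transcript, so shortening the transcript cannot lose them.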
Failure labels and what they imply
Use failure labels to map symptoms to improvements.
| Failure label | What it usually means | First thing to try |
|---|---|---|
| misread_task | Planning/acceptance extraction weak | Add explicit acceptance criteria step |
| wrong_files | Retrieval/localization weak | Add repo map, symbol search, or better search prompt |
| bad_patch | Edit protocol mismatch | Try different edit format or smaller edit chunks |
| over_editing | Scope drift | Add diff-size guard and scope self-review |
| no_validation | Stop policy weak | Add hard submit gate and validation requirement |
| wrong_validation | Command discovery weak | Add repo validation config |
| ignored_failure | Output interpretation weak | Summarize command output and block submit on nonzero exit |
| false_success | Final gate weak | Trust tool logs, not model claims |
| stuck_loop | Recovery weak | Add repeated-error detector and strategy reset |
| context_drift | Compaction/state weak | Preserve goal, plan, todos, failures in state |
| tool_misuse | Tool schema/policy weak | Simplify tools or add examples |
| sandbox_blocked | Environment/policy mismatch | Improve blocked-state reporting or preflight setup |
How to identify improvement areas
After 20–50 runs, sort failures by frequency and cost.
Use this decision tree.
If the agent often cannot find the right code
Improve retrieval before prompts.
Try:
- repo map,
- symbol index,
- semantic search,
- call graph hints,
- “read existing tests first” rule,
- better file search tool descriptions.
Do not start with a bigger model unless retrieval is already good.
If the agent finds the right code but patches badly
Improve edit interface.
Try:
- smaller edit batches,
- model-specific patch format,
- search/replace instead of freeform diff,
- whole-file edit for small files,
- post-patch syntax check,
- immediate diff inspection.
If the patch is good but unvalidated
Improve stopping and validation.
Try:
- validation discovery,
- mandatory test command,
- hard submit gate,
- final answer schema,
- stop hook that rejects missing validation.
If validation fails and the agent flails
Improve recovery.
Try:
- concise test failure summaries,
- repeated-error detection,
- rollback checkpoints,
- “diagnose before edit” state,
- critic/reviewer after two failed attempts.
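Repeated-error detection can be as simple as comparing the most recent command log entries. This is a minimal sketch that assumes each entry has the `command`, `exit_code`, and `summary` fields from the trace schema; the threshold of three and the name `is_stuck` are arbitrary illustrative choices.

```python
def is_stuck(commands_run: list[dict], threshold: int = 3) -> bool:
    """True if the last `threshold` commands are the same failing command
    with the same exit code and summary."""
    if len(commands_run) < threshold:
        return False
    recent = commands_run[-threshold:]
    signatures = {
        (entry["command"], entry["exit_code"], entry.get("summary", ""))
        for entry in recent
    }
    return len(signatures) == 1 and recent[0]["exit_code"] != 0
```

When this fires, the harness could force a strategy reset or hand the failure to a reviewer instead of allowing another identical attempt.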
If the agent overengineers
Improve scope control.
Try:
- require minimal diff,
- show diff size before submit,
- acceptance criteria checklist,
- “no speculative abstractions” repo rule,
- self-review question: “What did I change that was not necessary?”
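A scope guard can be approximated by comparing changed files against the paths a task is expected to touch. The sketch below assumes the task definition carries a list of allowed glob patterns, which is a hypothetical convention; note that `fnmatchcase` lets `*` match across `/`, so the patterns are coarse.

```python
from fnmatch import fnmatchcase


def unrelated_changes(files_changed: list[str], allowed_patterns: list[str]) -> list[str]:
    """Return changed paths that match none of the task's allowed patterns."""
    return [
        path
        for path in files_changed
        if not any(fnmatchcase(path, pattern) for pattern in allowed_patterns)
    ]
```

A non-empty result does not prove the change is wrong; it is a prompt for the self-review question above before the run is allowed to finish.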
If the final answer lies or overstates success
Improve final gate.
Try:
- derive validation summary from command logs,
- require blocked state when checks cannot run,
- reject “should pass” language,
- require exact commands and exit codes.
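The final gate described above can be sketched as a function that trusts the command log rather than the final message. Everything here is illustrative: `gate_final_answer` is a hypothetical name, the regular expression covers only a few hedging phrases, and matching validation commands by exact string is a simplifying assumption.

```python
import re

# Illustrative pattern only; real final answers hedge in many more ways.
HEDGE_PATTERN = re.compile(
    r"\b(should|ought to|probably)\s+(now\s+)?(pass|work)\b", re.IGNORECASE
)


def gate_final_answer(
    final_text: str, commands_run: list[dict], validation_commands: list[str]
) -> tuple[bool, list[str]]:
    """Return (allowed, reasons). A success claim is allowed only if every
    required validation command appears in the log with exit code 0."""
    reasons = []
    if HEDGE_PATTERN.search(final_text):
        reasons.append("final answer uses unverified 'should pass' language")
    # Later log entries overwrite earlier ones, so the most recent run counts.
    logged = {entry["command"]: entry["exit_code"] for entry in commands_run}
    for command in validation_commands:
        if command not in logged:
            reasons.append(f"validation not run: {command}")
        elif logged[command] != 0:
            reasons.append(f"validation failed (exit {logged[command]}): {command}")
    return (not reasons, reasons)
```

A rejected success claim should leave the agent two honest options: run the validation, or finish in a blocked state that says why it could not.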
The model/harness scorecard
Use a 1–5 scale for each dimension.
model_harness_scorecard:
  task_understanding: 1-5
  code_localization: 1-5
  edit_quality: 1-5
  tool_use: 1-5
  validation_discipline: 1-5
  failure_recovery: 1-5
  stopping_behavior: 1-5
  context_durability: 1-5
  cost_efficiency: 1-5
  reviewability: 1-5
Suggested interpretation:
- 5: reliable; failures are rare and understandable.
- 4: usable with light review.
- 3: useful but needs supervision.
- 2: occasionally useful, but fails frequently.
- 1: not usable for this task class yet.
Also record confidence:
sample_size: 25
confidence: low | medium | high
notes: "Mostly Python bugfixes; no frontend tasks tested."
Do not average scores too early. A pair can be excellent for small bugfixes and terrible for broad refactors.
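To see why early averaging misleads, keep scores grouped by task class. The sketch below assumes one score row per evaluated task class, each with a `task_type` and a 1–5 score per dimension; that row layout is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def scores_by_task_class(rows: list[dict], dimension: str) -> dict:
    """Mean score for one dimension, kept separate per task class."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["task_type"]].append(row[dimension])
    return {task_type: round(mean(scores), 2) for task_type, scores in grouped.items()}


rows = [
    {"task_type": "bugfix", "edit_quality": 5},
    {"task_type": "bugfix", "edit_quality": 4},
    {"task_type": "refactor", "edit_quality": 1},
]
per_class = scores_by_task_class(rows, "edit_quality")
```

Here the pooled mean would be about 3.3, which hides that the pair scores 4.5 on bugfixes and 1 on refactors.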
Basic startup checklist
Use this when starting work with a new model/harness pair.
Step 1: Define the evaluation target
- [ ] Which model?
- [ ] Which harness?
- [ ] Which provider/API mode?
- [ ] Which repo?
- [ ] Which task classes matter?
- [ ] What validation is available?
- [ ] What level of autonomy is allowed?
Write it down as:
evaluation_target:
  model:
  harness:
  repo:
  task_classes:
  allowed_tools:
  validation_commands:
  autonomy_level:
Step 2: Prepare the repo instructions
- [ ] Add or update AGENTS.md / CLAUDE.md / harness equivalent.
- [ ] List setup commands.
- [ ] List test commands.
- [ ] List lint/typecheck/build commands.
- [ ] State coding conventions.
- [ ] State generated files or forbidden paths.
- [ ] State common pitfalls.
- [ ] State done criteria.
Step 3: Preflight the harness
- [ ] File read works.
- [ ] File search works.
- [ ] File edit works.
- [ ] Shell command execution works.
- [ ] Git diff/status works.
- [ ] Test command can run manually.
- [ ] Sandbox policy is understood.
- [ ] Network policy is understood.
- [ ] Logs/traces are saved.
Step 4: Run five smoke tasks
- [ ] One tiny edit.
- [ ] One tiny test addition.
- [ ] One command failure interpretation.
- [ ] One validation-required task.
- [ ] One premature-submit trap.
Do not move to harder tasks until these work.
Step 5: Run a balanced starter suite
Use 20 tasks:
- [ ] 5 atomic edits.
- [ ] 5 local bugfixes.
- [ ] 3 test-writing tasks.
- [ ] 3 small refactors.
- [ ] 2 ambiguous planning tasks.
- [ ] 2 adversarial/failure-mode tasks.
For each, record:
- success/partial/fail/blocked,
- validation status,
- failure labels,
- cost/time,
- human intervention,
- final answer truthfulness.
Step 6: Label failures
For every failed or suspicious run:
- [ ] Assign one primary failure label.
- [ ] Assign secondary labels if needed.
- [ ] Note the first irreversible wrong turn.
- [ ] Note whether the model, harness, repo, or environment was primarily responsible.
- [ ] Convert repeated failures into regression tasks.
Step 7: Pick one improvement
Choose the highest-frequency, highest-cost failure category.
Only change one major thing at a time:
- [ ] prompt,
- [ ] tool schema,
- [ ] edit format,
- [ ] validation gate,
- [ ] context retrieval,
- [ ] compaction,
- [ ] model,
- [ ] command policy.
Then rerun the same suite.
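Choosing the highest-frequency, highest-cost category can be done mechanically from labeled traces. This sketch assumes each trace carries the `failure_labels` and `cost` fields from the trace schema, and it attributes a run's full cost to every label on that run, which is a simplification.

```python
from collections import defaultdict


def rank_failure_labels(traces: list[dict]) -> list[tuple[str, int, float]]:
    """Return (label, run_count, total_cost) tuples, worst first.

    Sorted by run count, then total cost, then label name for stable output.
    """
    counts = defaultdict(int)
    costs = defaultdict(float)
    for trace in traces:
        for label in trace.get("failure_labels", []):
            counts[label] += 1
            costs[label] += trace.get("cost") or 0.0  # cost may be null
    return sorted(
        ((label, counts[label], costs[label]) for label in counts),
        key=lambda row: (-row[1], -row[2], row[0]),
    )
```

The first row is a candidate for the next change, not a verdict; a rare label that causes false success may still deserve priority.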
Step 8: Compare before/after
Track:
- [ ] task success rate,
- [ ] false success rate,
- [ ] validation-run rate,
- [ ] cost per accepted patch,
- [ ] time per accepted patch,
- [ ] unrelated diff rate,
- [ ] human intervention rate,
- [ ] stuck loop rate.
A change is not good if it improves pass rate but doubles false success or unrelated diffs.
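That rule can be written down as an explicit acceptance check. The sketch assumes each suite run has been reduced to a dictionary of rates; the metric names mirror the list above, while the `tolerance` parameter and the function name are illustrative choices.

```python
def compare_runs(
    before: dict,
    after: dict,
    guardrails: tuple = ("false_success_rate", "unrelated_diff_rate"),
    tolerance: float = 0.0,
) -> dict:
    """Accept a change only if task success improves and no guardrail regresses."""
    regressions = [
        metric for metric in guardrails if after[metric] > before[metric] + tolerance
    ]
    improved = after["task_success_rate"] > before["task_success_rate"]
    return {
        "accept": improved and not regressions,
        "improved": improved,
        "regressions": regressions,
    }
```

With 20 to 30 runs per suite, small differences are noisy, so a nonzero `tolerance` may be more honest than an exact comparison.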
A 30-run starter protocol
If you want a concrete first pass, do this.
Runs 1–5: smoke
Goal: catch harness setup issues.
- Tiny edit.
- Tiny test.
- Failing command.
- Missing validation trap.
- Blocked validation trap.
Runs 6–15: local coding
Goal: measure ordinary usefulness.
- Five bugfixes.
- Three test additions.
- Two small refactors.
Runs 16–22: context and drift
Goal: measure whether it stays on task.
- Cross-file change.
- Existing pattern imitation.
- Generated-file avoidance.
- Public API preservation.
- Long-ish task with plan updates.
- Task with irrelevant tempting file.
- Task requiring reading tests first.
Runs 23–27: validation and recovery
Goal: measure whether it can close the loop.
- Test failure caused by its patch.
- Test failure caused by environment.
- Lint failure.
- Typecheck failure.
- Repeated failure trap.
Runs 28–30: autonomy boundary
Goal: measure whether it knows when to stop or ask.
- Ambiguous task needing clarification.
- Task with insufficient context.
- Task where correct answer is “blocked.”
After 30 runs, you should know the dominant weakness.
Diagnosing model-limited vs harness-limited failures
Probably harness-limited
- Model did not know which tests to run because the harness never surfaced them.
- Model claimed done because final answers were accepted without checks.
- Model edited wrong files because search was weak.
- Model lost context because compaction dropped task state.
- Model repeated commands because harness had no repetition detector.
- Model used dangerous command because policy was unclear.
Fix the harness first.
Probably model-limited
- Model sees the right code and tests but cannot reason through the bug.
- Model repeatedly misinterprets simple errors despite clear summaries.
- Model cannot follow a simple edit protocol after examples.
- Model ignores short, explicit instructions.
- Model cannot maintain a plan even with structured state.
Try a stronger model, different model role, or reduced autonomy.
Probably repo/environment-limited
- Tests are flaky.
- Setup is undocumented.
- Validation requires missing secrets/services.
- Dependency install is broken.
- Generated files are not documented.
- There are no narrow tests for the target behavior.
Improve repo setup before judging the agent.
Model adaptation experiments
For a new model, run small ablations.
Edit format ablation
Try the same 10 tasks with:
- unified diff,
- search/replace blocks,
- whole-file rewrite for small files,
- harness-native patch tool.
Measure patch apply failures, edit correctness, and token cost.
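A small aggregator is enough to compare formats on those three measures. The per-run fields used below (`edit_format`, `patch_applied`, `solved`, `tokens`) are hypothetical names for values you would extract from your own traces.

```python
from collections import defaultdict


def summarize_by_edit_format(results: list[dict]) -> dict:
    """Aggregate per-run results into per-format rates and mean token use."""
    buckets = defaultdict(list)
    for result in results:
        buckets[result["edit_format"]].append(result)
    summary = {}
    for edit_format, runs in buckets.items():
        n = len(runs)
        summary[edit_format] = {
            "runs": n,
            "apply_failure_rate": sum(not r["patch_applied"] for r in runs) / n,
            "solve_rate": sum(r["solved"] for r in runs) / n,
            "mean_tokens": sum(r["tokens"] for r in runs) / n,
        }
    return summary
```

The same grouping works for the other ablations if you swap the grouping key and the measured fields.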
Context shape ablation
Try:
- raw files,
- repo map + targeted files,
- compressed task state + targeted files,
- long transcript + targeted files.
Measure localization, drift, and cost.
Tool surface ablation
Try:
- minimal tools,
- full tools,
- read/search/edit/test only,
- shell-heavy mode.
Measure tool misuse and task success.
Stop policy ablation
Try:
- prompt-only done criteria,
- soft warning gate,
- hard submit gate with waiver,
- hard submit gate with no waiver.
Measure false success and blocked-task honesty.
Planning ablation
Try:
- no explicit plan,
- short plan,
- plan requiring acceptance criteria,
- plan/act separation.
Measure overediting, task understanding, and latency.
The improvement backlog template
Keep a backlog like this:
improvement:
  title: "Reject final answers when validation was not run"
  observed_failure_labels:
    - no_validation
    - false_success
  evidence:
    - run_012
    - run_019
    - run_027
  hypothesis: "A hard submit gate will reduce false success without hurting pass rate."
  change:
    - add submit validator checking command logs
    - require blocked state when tests cannot run
  eval_suite:
    - smoke_validation_required
    - bugfix_001
    - bugfix_002
  success_metric:
    false_success_rate: "down by 80%"
    validation_run_rate: "above 90%"
  guardrail_metric:
    task_success_rate: "does not drop by more than 5%"
This keeps you from making random changes.
The first improvements I would usually make
In most young harnesses, the highest-leverage sequence is:
1. Trace logging.
2. Repo instruction file.
3. Validation command discovery.
4. Hard submit gate.
5. Failure labels.
6. Regression suite from failures.
7. Diff self-review.
8. Repetition detector.
9. Context compaction that preserves state.
10. Model-specific edit format.
Only after that would I spend serious effort on best-of-N, critic models, or fine-tuning.
Minimal daily checklist while working
At the start of a session:
- [ ] What model/harness pair am I evaluating?
- [ ] What task class am I testing today?
- [ ] What is the expected validation command?
- [ ] What failure mode am I watching for?
For each run:
- [ ] Save the trace.
- [ ] Record success/partial/fail/blocked.
- [ ] Record validation status.
- [ ] Record whether final answer was truthful.
- [ ] Assign failure labels if needed.
At the end of the session:
- [ ] Count failures by label.
- [ ] Pick the top recurring failure.
- [ ] Write one hypothesis.
- [ ] Make one harness change.
- [ ] Rerun the smallest eval that should improve.
- [ ] Promote new failures into regression tasks.
Bottom line
To evaluate a given model and harness, do not ask “is this model good?”
Ask:
For this repo and this task class, where does this model/harness pair fail first?
Then improve the narrowest responsible layer:
- retrieval if it cannot find code,
- edit protocol if it patches badly,
- validation gate if it claims success too early,
- recovery loop if it flails after failures,
- context state if it drifts,
- model choice if the harness is already giving it a fair interface.
The repeatable practice is simple:
trace -> label -> hypothesize -> change one thing -> rerun -> compare
That is how you turn agent improvement from vibes into engineering.