Evaluating Model/Harness Pairs: Strengths, Weaknesses, Improvement Areas, and Startup Checklist

A practical framework for testing a specific model inside a specific coding harness, identifying where the pair fails, and deciding what to improve first.

Executive summary

A coding agent is not just a model. It is a model/harness pair.

The same model can behave very differently depending on:

So you should not evaluate a model in the abstract. Evaluate the pair:

model + harness + repo + task type + validation environment

The goal is to answer four questions:

  1. What is this pair good at?
  2. Where does it fail?
  3. Are failures model-limited, harness-limited, repo-limited, or validation-limited?
  4. What is the cheapest next harness change likely to improve real outcomes?

The basic loop is:

run tasks -> collect traces -> label failures -> map failures to harness surfaces -> change one thing -> rerun

The unit of evaluation is the pair, not the model

A model can look weak in one harness and strong in another. Reasons include:

Therefore, record results as:

model: gpt-x / claude-y / qwen-z / local-model
harness: hermes / codex / aider / custom
harness_version: commit or config hash
repo: target repo
repo_state: commit sha
task_type: bugfix | feature | refactor | test | research | review
validation_available: yes | partial | no

Do not say “model A is bad at coding” until you know whether the harness gave it a fair interface.
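
One way to keep results attached to the pair is to key every outcome on the record fields above. The sketch below is illustrative only: `PairKey` and `success_rate_by_pair` are hypothetical names, and the field subset chosen for the key is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PairKey:
    """Identifies what was evaluated; field names follow the record above."""
    model: str
    harness: str
    harness_version: str
    repo: str
    task_type: str


def success_rate_by_pair(runs):
    """Take (PairKey, succeeded) tuples and return the success rate per key."""
    totals = defaultdict(lambda: [0, 0])  # key -> [successes, runs]
    for key, succeeded in runs:
        totals[key][0] += int(succeeded)
        totals[key][1] += 1
    return {key: wins / count for key, (wins, count) in totals.items()}
```

Because the key includes the harness and task type, the same model shows up as several separate rows rather than one blended number.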

What you are trying to measure

A useful evaluation should measure more than pass/fail.

Outcome metrics

These answer whether the task got done.

Process metrics

These explain how the task got done.

Reliability metrics

These catch dangerous behavior.

Usability metrics

These matter in real work even if tests pass.

Build a task suite by capability

Do not start with one giant benchmark. Build a ladder of tasks that isolates capabilities.

Level 0: Harness sanity checks

These verify that the harness itself works.

If these fail, do not blame the model yet.

Level 1: Atomic coding tasks

Tiny tasks with obvious success criteria.

Examples:

These catch edit protocol, basic file navigation, and validation discipline.

Level 2: Local bugfix tasks

Tasks requiring 2–5 files and one relevant test command.

Examples:

These reveal whether the pair can localize code, reason through a small bug, and close the loop.

Level 3: Cross-cutting changes

Tasks that touch multiple modules and require architecture awareness.

Examples:

These reveal context handling, planning, diff discipline, and regression risk.

Level 4: Ambiguous product tasks

Tasks where the agent must ask questions, propose a plan, and make tradeoffs.

Examples:

These reveal planning, user intent modeling, and restraint.

Level 5: Adversarial or failure-mode tasks

Tasks designed to expose bad behavior.

Examples:

These are where harness improvements usually come from.

The core trace schema

Every run should produce a structured trace. At minimum:

{
  "run_id": "...",
  "model": "...",
  "harness": "...",
  "harness_config": "...",
  "repo": "...",
  "repo_commit": "...",
  "task_id": "...",
  "task_type": "bugfix",
  "prompt": "...",
  "acceptance_criteria": [],
  "tool_calls": [],
  "files_read": [],
  "files_changed": [],
  "commands_run": [
    {
      "command": "pytest tests/test_foo.py -q",
      "exit_code": 0,
      "summary": "passed"
    }
  ],
  "diff_summary": "...",
  "validation_status": "passed|failed|blocked|not_run",
  "final_status": "success|partial|failed|blocked|false_success|user_intervened",
  "failure_labels": [],
  "cost": null,
  "duration_seconds": null
}

The trace is more important than the final answer. You need it to diagnose whether the model failed, the harness failed, or the task was underspecified.
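
As a rough illustration of diagnosing from the trace rather than from the final answer, the sketch below derives a few failure labels from the logged fields. `audit_trace` is a hypothetical helper, and treating the last logged command as the final validation run is an assumption, not something the schema guarantees.

```python
def audit_trace(trace):
    """Return failure labels supported by the logged evidence in a trace dict."""
    labels = []
    commands = trace.get("commands_run", [])
    claimed_success = trace.get("final_status") == "success"

    # No commands at all, or validation explicitly marked as not run.
    if not commands or trace.get("validation_status") == "not_run":
        labels.append("no_validation")

    # Assumption: the last logged command is the final validation attempt.
    if commands and commands[-1].get("exit_code") != 0 and claimed_success:
        labels.append("ignored_failure")

    # A success claim that the validation status does not back up.
    if claimed_success and trace.get("validation_status") != "passed":
        labels.append("false_success")

    return labels
```

A run that reports `"final_status": "success"` with `"validation_status": "not_run"` and no commands would come back labeled `no_validation` and `false_success`.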

Strength and weakness dimensions

Evaluate the pair on these dimensions.

1. Task understanding

Questions:

Failure signs:

Likely improvement surfaces:

2. Repo navigation and localization

Questions:

Failure signs:

Likely improvement surfaces:

3. Edit quality

Questions:

Failure signs:

Likely improvement surfaces:

4. Tool use

Questions:

Failure signs:

Likely improvement surfaces:

5. Validation discipline

Questions:

Failure signs:

Likely improvement surfaces:

6. Recovery behavior

Questions:

Failure signs:

Likely improvement surfaces:

7. Stopping behavior

Questions:

Failure signs:

Likely improvement surfaces:

8. Context durability

Questions:

Failure signs:

Likely improvement surfaces:

Failure labels and what they imply

Use failure labels to map symptoms to improvements.

| Failure label | What it usually means | First thing to try |
|---|---|---|
| misread_task | Planning/acceptance extraction weak | Add explicit acceptance criteria step |
| wrong_files | Retrieval/localization weak | Add repo map, symbol search, or better search prompt |
| bad_patch | Edit protocol mismatch | Try different edit format or smaller edit chunks |
| over_editing | Scope drift | Add diff-size guard and scope self-review |
| no_validation | Stop policy weak | Add hard submit gate and validation requirement |
| wrong_validation | Command discovery weak | Add repo validation config |
| ignored_failure | Output interpretation weak | Summarize command output and block submit on nonzero exit |
| false_success | Final gate weak | Trust tool logs, not model claims |
| stuck_loop | Recovery weak | Add repeated-error detector and strategy reset |
| context_drift | Compaction/state weak | Preserve goal, plan, todos, failures in state |
| tool_misuse | Tool schema/policy weak | Simplify tools or add examples |
| sandbox_blocked | Environment/policy mismatch | Improve blocked-state reporting or preflight setup |

How to identify improvement areas

After 20–50 runs, sort failures by frequency and cost.
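
A minimal sketch of that sort, assuming each trace carries the `failure_labels` and `cost` fields from the trace schema. `rank_failure_labels` is a hypothetical helper, and ranking by run count first and summed cost second is one possible ordering among several.

```python
from collections import defaultdict


def rank_failure_labels(traces):
    """Rank failure labels by affected-run count, then by total cost of those runs."""
    stats = defaultdict(lambda: {"runs": 0, "cost": 0.0})
    for trace in traces:
        # set() so a label repeated within one run is counted once for that run.
        for label in set(trace.get("failure_labels", [])):
            stats[label]["runs"] += 1
            stats[label]["cost"] += trace.get("cost") or 0.0  # cost may be null
    return sorted(
        stats.items(),
        key=lambda item: (item[1]["runs"], item[1]["cost"]),
        reverse=True,
    )
```

The first entry in the returned list is the candidate to investigate first; runs with no failure labels are ignored.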

Use this decision tree.

If the agent often cannot find the right code

Improve retrieval before prompts.

Try:

Do not start with a bigger model unless retrieval is already good.

If the agent finds the right code but patches badly

Improve the edit interface.

Try:

If the patch is good but unvalidated

Improve stopping and validation.

Try:

If validation fails and the agent flails

Improve recovery.

Try:

If the agent overengineers

Improve scope control.

Try:

If the final answer lies or overstates success

Improve the final gate.

Try:

The model/harness scorecard

Use a 1–5 scale for each dimension.

model_harness_scorecard:
  task_understanding: 1-5
  code_localization: 1-5
  edit_quality: 1-5
  tool_use: 1-5
  validation_discipline: 1-5
  failure_recovery: 1-5
  stopping_behavior: 1-5
  context_durability: 1-5
  cost_efficiency: 1-5
  reviewability: 1-5

Suggested interpretation:

Also record confidence:

sample_size: 25
confidence: low | medium | high
notes: "Mostly Python bugfixes; no frontend tasks tested."

Do not average scores too early. A pair can be excellent for small bugfixes and terrible for broad refactors.
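
One way to avoid premature averaging is to keep a separate scorecard per task class and report only the weak dimensions in each. In the sketch below, `weakest_dimensions` is a hypothetical helper and the cutoff of 2 is an arbitrary assumption.

```python
def weakest_dimensions(scorecards, threshold=2):
    """Map each task class to the dimensions scoring at or below the threshold."""
    report = {}
    for task_class, scores in scorecards.items():
        weak = sorted(dim for dim, score in scores.items() if score <= threshold)
        if weak:
            report[task_class] = weak
    return report


# Illustrative scores only, not measured results.
cards = {
    "small_bugfix": {"edit_quality": 5, "context_durability": 4},
    "broad_refactor": {"edit_quality": 2, "context_durability": 1},
}
print(weakest_dimensions(cards))
```

With these made-up numbers, only `broad_refactor` is flagged, which a single averaged `edit_quality` score across both classes would hide.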

Basic startup checklist

Use this when starting work with a new model/harness pair.

Step 1: Define the evaluation target

Write it down as:

evaluation_target:
  model:
  harness:
  repo:
  task_classes:
  allowed_tools:
  validation_commands:
  autonomy_level:

Step 2: Prepare the repo instructions

Step 3: Preflight the harness

Step 4: Run five smoke tasks

Do not move to harder tasks until these work.

Step 5: Run a balanced starter suite

Use 20 tasks:

For each, record:

Step 6: Label failures

For every failed or suspicious run:

Step 7: Pick one improvement

Choose the highest-frequency, highest-cost failure category.

Only change one major thing at a time:

Then rerun the same suite.

Step 8: Compare before/after

Track:

A change is not good if it improves pass rate but doubles the false-success rate or the volume of unrelated diffs.
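
The acceptance rule can be sketched as a comparison of two runs of the same suite. `summarize` and `accept_change` are hypothetical helpers; they read only the `final_status` field from the trace schema, and the strict "no increase in false success" rule is an assumption you may want to loosen.

```python
def summarize(traces):
    """Compute success and false-success rates from final_status values."""
    if not traces:
        raise ValueError("need at least one trace")
    n = len(traces)
    return {
        "success_rate": sum(t["final_status"] == "success" for t in traces) / n,
        "false_success_rate": sum(
            t["final_status"] == "false_success" for t in traces
        ) / n,
    }


def accept_change(before, after):
    """Accept only if success improved and false success did not get worse."""
    b, a = summarize(before), summarize(after)
    improved = a["success_rate"] > b["success_rate"]
    regressed = a["false_success_rate"] > b["false_success_rate"]
    return improved and not regressed
```

A guardrail on unrelated diffs could be added the same way once the trace records a diff-size measure.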

A 30-run starter protocol

If you want a concrete first pass, do this.

Runs 1–5: smoke

Goal: catch harness setup issues.

Runs 6–15: local coding

Goal: measure ordinary usefulness.

Runs 16–22: context and drift

Goal: measure whether it stays on task.

Runs 23–27: validation and recovery

Goal: measure whether it can close the loop.

Runs 28–30: autonomy boundary

Goal: measure whether it knows when to stop or ask.

After 30 runs, you should know the dominant weakness.
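
If it helps to tag traces with their phase, the protocol above can be encoded as data. This is only a convenience sketch; `STARTER_PROTOCOL` and `phase_for_run` are hypothetical names, while the run ranges and goals are taken from the headings above.

```python
# (run numbers, phase name, goal), following the five phases listed above.
STARTER_PROTOCOL = [
    (range(1, 6), "smoke", "catch harness setup issues"),
    (range(6, 16), "local coding", "measure ordinary usefulness"),
    (range(16, 23), "context and drift", "measure whether it stays on task"),
    (range(23, 28), "validation and recovery", "measure whether it can close the loop"),
    (range(28, 31), "autonomy boundary", "measure whether it knows when to stop or ask"),
]


def phase_for_run(run_number):
    """Return (phase, goal) for a 1-based run number in the 30-run protocol."""
    for runs, phase, goal in STARTER_PROTOCOL:
        if run_number in runs:
            return phase, goal
    raise ValueError(f"run {run_number} is outside the 30-run protocol")
```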

Diagnosing model-limited vs harness-limited failures

Probably harness-limited

Fix the harness first.

Probably model-limited

Try a stronger model, a different model role, or reduced autonomy.

Probably repo/environment-limited

Improve repo setup before judging the agent.

Model adaptation experiments

For a new model, run small ablations.

Edit format ablation

Try the same 10 tasks with:

Measure patch apply failures, edit correctness, and token cost.

Context shape ablation

Try:

Measure localization, drift, and cost.

Tool surface ablation

Try:

Measure tool misuse and task success.

Stop policy ablation

Try:

Measure false success and blocked-task honesty.

Planning ablation

Try:

Measure overediting, task understanding, and latency.

The improvement backlog template

Keep a backlog like this:

improvement:
  title: "Reject final answers when validation was not run"
  observed_failure_labels:
    - no_validation
    - false_success
  evidence:
    - run_012
    - run_019
    - run_027
  hypothesis: "A hard submit gate will reduce false success without hurting pass rate."
  change:
    - add submit validator checking command logs
    - require blocked state when tests cannot run
  eval_suite:
    - smoke_validation_required
    - bugfix_001
    - bugfix_002
  success_metric:
    false_success_rate: "down by 80%"
    validation_run_rate: "above 90%"
  guardrail_metric:
    task_success_rate: "does not drop by more than 5%"

This keeps you from making random changes.

The first improvements I would usually make

In most young harnesses, the highest-leverage sequence is:

  1. Trace logging.
  2. Repo instruction file.
  3. Validation command discovery.
  4. Hard submit gate.
  5. Failure labels.
  6. Regression suite from failures.
  7. Diff self-review.
  8. Repetition detector.
  9. Context compaction that preserves state.
  10. Model-specific edit format.

Only after that would I spend serious effort on best-of-N, critic models, or fine-tuning.
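
As an example of how small some of these items can be, a repetition detector (item 8) might look like the sketch below. It assumes command entries shaped like `commands_run` in the trace schema; `detect_stuck_loop` and the threshold of three repeats are hypothetical choices.

```python
from collections import Counter


def detect_stuck_loop(commands_run, max_repeats=3):
    """Return the (command, summary) failure seen max_repeats times, else None."""
    failures = Counter(
        (cmd["command"], cmd["summary"])
        for cmd in commands_run
        if cmd["exit_code"] != 0
    )
    for signature, count in failures.items():
        if count >= max_repeats:
            return signature
    return None
```

When this returns a signature, the harness can force a strategy reset instead of letting the agent retry the same failing command.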

Minimal daily checklist while working

At the start of a session:

For each run:

At the end of the session:

Bottom line

To evaluate a given model and harness, do not ask “is this model good?”

Ask:

For this repo and this task class, where does this model/harness pair fail first?

Then improve the narrowest responsible layer:

The repeatable practice is simple:

trace -> label -> hypothesize -> change one thing -> rerun -> compare

That is how you turn agent improvement from vibes into engineering.