Skip to content

Benchmark methodology

This document captures the methodology used by benches/agent-success. It exists so the numbers in the README survive "how did you measure?" questions.

Corpus

  • BIRD-mini (recommended): 100 questions across 11 databases, available from the BIRD project. We use the dev split.
  • Spider-dev (alternate): 1034 questions across 200 databases.

We do not redistribute the dataset; benches/agent-success/load_corpus.py is a fetcher.

Agents

We compare two agents over the same model:

  • Baseline — the agent receives one block of information_schema at startup. Reset each question.
  • Treatment — the agent receives describe_for_agent(max_tokens=2048, hint=question) plus a resolve_column tool.

Both agents share:

  • The same model and temperature.
  • The same maximum tool-call budget (8).
  • The same SQL executor that returns either rows or psycopg error strings.

Metrics

Metric Definition
SQL success rate Fraction of questions where the agent emits a query whose result set matches the gold answer.
Retry count Number of tool_call cycles before final answer.
Schema-discovery latency Wall clock from question received to first SQL emitted.

Running

cd benches/agent-success
python load_corpus.py
python baseline.py corpus.json
python treatment.py corpus.json $DB_URL
python compare.py

Reporting policy

Whatever the numbers are, we publish them. If schemadex doesn't help, the README says so.

Synthetic adversarial corpus

The full BIRD/Spider harness needs an LLM API key. To give honest numbers for the resolution path without one, we ship a synthetic corpus and a deterministic chooser.

  • Corpus: benches/agent-success/synthetic_corpus.json, 38 records covering case flips (Id, ID), underscore drift (firstname vs first_name), plural drift (regions, currencies), and semantic near-misses (state for status, product_name for name).
  • Schema: benches/agent-success/synthetic_schema.sql, five tables, ~30 columns.
  • Baseline chooser: literal lookup — the candidate succeeds iff it is exactly a column on the table.
  • Treatment chooser: SchemaCache.resolve(table, candidate), accept if confidence >= 0.85.

Reproduce:

python benches/agent-success/synthetic_corpus.py
python benches/agent-success/run_synthetic.py

Output (committed under benches/agent-success/out/synthetic_*.json):

Metric Baseline Treatment
n 38 38
success_rate 0.000 0.947
median_retries 1.000 0.000
median_latency_ms 0.007 0.003
p95_latency_ms 0.012 0.012

What this measures and doesn't

This isolates the resolution-path contribution. Two of the 38 records still miss because their confidence is below the 0.85 floor (statestatus, bodyreview_body are semantic-only matches that Jaro-Winkler scores lower than the lexical near-misses). Lower the floor and treatment success climbs further, at the cost of false-positive risk on totally unrelated names.

It does not measure: end-to-end LLM SQL accuracy, token-budget contributions, sentinel-flag contributions, cache-hit contributions. Those need the live-LLM harness below.

Ollama harness

benches/agent-success/run_ollama.py runs the same corpus through a local LLM via the Ollama HTTP API.

  • Baseline prompt: raw table(col1, col2, ...) dump, no ranking, no samples. Model emits {"table": ..., "column": ...}.
  • Treatment prompt: describe_for_agent(hint=question, max_tokens=1024) dump, followed by resolve_column post-correction at confidence ≥ 0.85.

Run:

ollama serve &
ollama pull qwen2.5-coder:3b
python benches/agent-success/synthetic_corpus.py
python benches/agent-success/run_ollama.py --model qwen2.5-coder:3b --n 0

Result on qwen2.5-coder:3b (38 records, Apple M2 Max):

Metric Baseline Treatment
n 38 38
success_rate 0.974 1.000
median_retries 0.000 0.000
median_latency_ms 283.158 294.852
p95_latency_ms 331.242 446.319

The single baseline miss was "What is the orderid on shipments?" — the model picked id (the shipments PK) instead of order_id (the FK). resolve_column corrected it.

Caveats

  • A 3B code model is good at unscrambling typos when it sees the whole schema; +2.6 pp is the headroom left for resolve_column on top.
  • Treatment adds ~10 ms p50 / ~115 ms p95 over baseline. That's the cost of the longer prompt (richer per-column metadata) plus the resolve step. On a real agent loop where each retry is a full second, +12% latency to drop the retry-loop entirely is a clear win.
  • The full BIRD/Spider harness is still required for end-to-end SQL accuracy numbers; this Ollama harness only measures column-naming, not query construction.