The interesting question about open-weight models is no longer "are they good." It is "are they good at the specific thing you are asking them to do." Benchmark leaderboards answer the first question. They tell you almost nothing about the second.

So we ran our own. We have a working agent in production: it turns plain-English business questions into validated, executable queries against a large analytics store, runs them, and returns the answer. It already runs on a proprietary model (gpt-5-mini). The agent stays fixed. We unplugged its model, dropped in thirteen open-weight alternatives one at a time, and ran every one through the same benchmark: same tools, same retrieval, same retry budget, same scoring.

No prompt tuning per model. No per-model scaffolding. We wanted to measure the models, not our ability to babysit each one. It is a harsh test, and a fair one.

What we found. An open-weight model tied the proprietary baseline: kimi-k2.6 matched gpt-5-mini at 84.6%, and one of its only two misses was an infrastructure error, not a modelling one. Two more open models (gpt-oss-120b, deepseek-v4-flash) scored 76.9%, beating our own fast preset (69.2%). Robustness beat size and recency: a 27B model beat its larger 35B sibling on accuracy and used less than half the tokens. And a code-specialist model landed near the bottom. A multi-step analytics agent rewards reasoning and source judgment, not raw query fluency.

The system under test

Before the scores make sense, you need to know what the model is actually doing. "Turn a question into a query" undersells it. The model doesn't just emit one query and stop. It runs the show inside a multi-step agent: it plans, acts, checks its own work, and recovers when something breaks.

At the top, the model plans and acts. It reads the conversation, works out what the user wants, and reaches for the right capability: fetch data, interpret a result, ask a clarifying question when the request is vague, or hand back an answer. It chooses the steps. Nothing scripts them for it.

Underneath, when data is needed, the model enters a guarded query loop. It drafts a query, the system checks it before anything runs, and if the check fails the model gets handed the problem and has to repair its own work. The same model reasons in both roles. A single question can put it through several rounds of decisions before a word reaches the user, and a weak link anywhere breaks the chain.
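The guarded loop described above can be sketched in a few lines. This is a minimal illustration, not the production agent: `guarded_query_loop`, `draft`, `validate`, the toy stubs and the budget of three attempts are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LoopResult:
    query: Optional[str]  # the validated query, or None if the budget ran out
    attempts: int         # number of drafts produced, repairs included


def guarded_query_loop(
    draft: Callable[[str, Optional[str]], str],  # (question, last_error) -> query text
    validate: Callable[[str], Optional[str]],    # query -> error message, None if valid
    question: str,
    max_attempts: int = 3,                       # assumed retry budget, for illustration
) -> LoopResult:
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        # The same model drafts and repairs: the last error is fed back in.
        query = draft(question, error)
        # Nothing runs until the check passes.
        error = validate(query)
        if error is None:
            return LoopResult(query=query, attempts=attempt)
    return LoopResult(query=None, attempts=max_attempts)


# Toy stand-ins for the model and the validator.
def toy_draft(question: str, last_error: Optional[str]) -> str:
    return "SELECT total FROM orders" if last_error else "SELECT totl FROM orders"


def toy_validate(query: str) -> Optional[str]:
    return None if "total" in query else "unknown field: totl"


def stubborn_draft(question: str, last_error: Optional[str]) -> str:
    return "SELECT totl FROM orders"  # never reads its own error


recovered = guarded_query_loop(toy_draft, toy_validate, "What was total revenue?")
exhausted = guarded_query_loop(stubborn_draft, toy_validate, "What was total revenue?")
print(recovered)
print(exhausted)
```

The second call shows the spiral in miniature: a drafter that ignores the error burns every attempt and returns nothing.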

The model fills two roles: planner, and query author and fixer. The self-repair path is where weaker reasoners spiral.

The data underneath is a large, real-world analytics store, the kind any operations-heavy business ends up with. Only one property of it matters for this study: the same business quantity can often be computed more than one legitimate way, at more than one level of detail. Picking the right one is a reasoning problem, not a syntax problem. That distinction does most of the work below.

Why this is a brutal test for a model

A chatbot benchmark asks a model for one good answer. This system asks for a chain of correct decisions, and any link can break the whole thing. To drive it end-to-end, a model has to clear every one of these bars, not on average but on the specific question in front of it:

Reliable tool use. The model has to call capabilities cleanly. One that emits malformed calls, invents capability names, or narrates instead of acting never gets off the ground.
Clean structured output. Several steps expect a strict machine-readable shape. Wrap it in prose or a stray code fence and the step fails.
Valid query on the first try. Drafts are checked before they run. Broken syntax burns a recovery attempt, and recovery attempts are finite.
Semantic grounding. The hard part isn't syntax. It's choosing the right source and the right level of detail when several look plausible. That's reasoning, and it's where most points are won and lost.
Multi-step recovery. When a check fails, the model is handed the problem and asked to fix its own work. A model that can't read its own mistake keeps producing new ones until the attempts run out.
Output discipline. A model that writes a novel in its scratchpad on every step runs slower, costs more, and loses the thread across a long chain of decisions.

Four of those six have nothing to do with the query language itself. They are about behaving like a dependable component while under reasoning load. That is the gap between a model that demos well and one you can wire into a production pipeline, and it is the gap this comparison was built to expose.

How we measured it

One swap, everything else frozen

Swapping the model is a one-line change. Everything around it stays fixed: same retrieval, same capabilities, same prompts, same recovery budget, same reasoning settings. The model is the only moving part. Open models ran through a standard routed endpoint, the baseline through its own provider.

Why this matters: every result below is attributable to the model alone. We did not rewrite a prompt to rescue a struggling model, and we didn't hand the leaders an advantage. If a model needed special treatment to behave, not getting it is the finding.
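One way to make "only the model moves" hard to violate by accident is to freeze the run configuration and derive each candidate from the baseline. The sketch below assumes a hypothetical `AgentConfig` with made-up field names and values; the study's real configuration is not shown in this post.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)  # frozen: a run cannot mutate its settings mid-benchmark
class AgentConfig:
    model: str
    retrieval: str = "default"    # hypothetical setting names and values
    prompt_version: str = "v1"
    max_repair_attempts: int = 3


BASELINE = AgentConfig(model="gpt-5-mini")
CANDIDATES = ["kimi-k2.6", "gpt-oss-120b", "deepseek-v4-flash"]

# Each run is the baseline with exactly one field changed.
runs = [replace(BASELINE, model=name) for name in CANDIDATES]


def differing_fields(a: AgentConfig, b: AgentConfig) -> set:
    """Names of the fields whose values differ between two configs."""
    return {key for key in vars(a) if vars(a)[key] != vars(b)[key]}


for run in runs:
    print(run.model, differing_fields(BASELINE, run))
```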

What "pass" means

Scoring is execution-based, not text-based. We don't diff the generated query against a reference string, because two correct queries can look nothing alike. Instead we run both and compare the results:

result_match (0/1): does the generated query produce the same answer as the reference query on the live data store? The comparison is set-based (it ignores the order of rows and fields) with a small numeric tolerance for rounding. This is the headline metric.
useful_behavior (0/1): a wider rubric that also credits a graceful clarification on a genuinely ambiguous question.
clarification_quality: scored only when the agent chooses to clarify. It checks that the agent offered real, plain-English options instead of leaking internal jargon at the user.
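To make the headline metric concrete, here is a minimal sketch of an order-insensitive comparison with a numeric tolerance. It is an illustration under assumptions, not the scorer used in the study: rows are taken to be plain tuples of values, the tolerance of 1e-6 is a placeholder, and all function names are hypothetical.

```python
import math
from typing import Any, Callable, Sequence


def values_match(a: Any, b: Any, tol: float) -> bool:
    """Numbers compare within an absolute tolerance; everything else exactly."""
    numeric = (int, float)
    if (isinstance(a, numeric) and isinstance(b, numeric)
            and not isinstance(a, bool) and not isinstance(b, bool)):
        return math.isclose(a, b, rel_tol=0.0, abs_tol=tol)
    return a == b


def unordered_match(xs: Sequence, ys: Sequence, same: Callable[[Any, Any], bool]) -> bool:
    """True if xs and ys can be paired off one-to-one under `same`, ignoring order."""
    if len(xs) != len(ys):
        return False
    remaining = list(ys)
    for x in xs:
        for i, y in enumerate(remaining):
            if same(x, y):
                del remaining[i]  # each item may be matched only once
                break
        else:
            return False
    return True


def result_match(generated: Sequence[tuple], reference: Sequence[tuple],
                 tol: float = 1e-6) -> int:
    """1 if both result sets hold the same rows, ignoring row and field order."""
    def same_value(a: Any, b: Any) -> bool:
        return values_match(a, b, tol)

    def same_row(r: tuple, s: tuple) -> bool:
        return unordered_match(r, s, same_value)

    return int(unordered_match(generated, reference, same_row))


reference = [("north", 10.0), ("south", 2.5)]
reordered = [(2.5000001, "south"), (10, "north")]   # different order, rounding noise
wrong = [("north", 11.0), ("south", 2.5)]           # one value is off

print(result_match(reordered, reference), result_match(wrong, reference))
```

The pairing is greedy, which is fine for a sketch; a production scorer would want to think harder about near-ties inside the tolerance band.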

Thirteen questions, run end-to-end through the agent, one repetition each, the identical set for every model. The questions are deliberately hard: unit economics, cost attribution, profit-and-loss at different levels of detail. The sort of analytics that usually sends people to a data team.

The leaderboard

Fifteen configurations, ranked by execution accuracy. The two baseline bars are the proprietary model (full and "quick" presets). Everything else is open-weight, running for free on routed endpoints.

Pass = the generated query produced the reference answer on the live data store.

A few things stand out. kimi-k2.6 ties the proprietary baseline at 84.6%, an open-weight first for this system, and since one of its two misses was an upstream API error, its real modelling accuracy sits a notch above the number shown. Three open models clear the baseline's fast preset (69.2%), the preset we ship for latency-sensitive traffic: kimi-k2.6 at 84.6%, and gpt-oss-120b and deepseek-v4-flash at 76.9%. No question was failed by all 15; every one is solvable by some model. And the open models cost $0 on routed free-tier endpoints versus $0.17 for the baseline run. The bottom of the pack is not where parameter counts would put it.

Full results

The complete matrix: accuracy, latency, token economy and infrastructure errors for every configuration. Baseline presets are marked; the top open-weight model is kimi-k2.6.

Model                          Acc.   P / F   p50 (s)  p95 (s)  tok mean  errored
-----------------------------  -----  ------  -------  -------  --------  -------
gpt-5-mini (full) · baseline   84.6%  11 / 2  63.3     129.5    29,418    0
kimi-k2.6                      84.6%  11 / 2  338.2    856.8    35,697    1
gpt-oss-120b                   76.9%  10 / 3  109.0    212.6    36,135    1
deepseek-v4-flash              76.9%  10 / 3  63.0     618.2    69,240    0
gpt-5-mini (quick) · baseline  69.2%  9 / 4   29.4     107.6    31,349    0
gemma-4-26b                    69.2%  9 / 4   218.4    1135.3   31,956    0
qwen3.6-27b                    61.5%  8 / 5   196.7    1693.1   43,679    0
hy3-preview                    61.5%  8 / 5   55.3     210.9    40,287    0
nemotron-3-super-120b          53.8%  7 / 6   126.3    675.7    47,949    0
glm-5.1                        53.8%  7 / 6   47.1     178.8    39,651    0
qwen3.6-35b-a3b                46.2%  6 / 7   111.0    1778.9   92,786    2
mimo-v2-flash                  46.2%  6 / 7   73.7     459.8    92,389    0
minimax-m2.7                   38.5%  5 / 8   101.8    463.8    56,862    0
devstral-2512                  23.1%  3 / 10  20.8     116.2    51,979    0
mistral-medium-3.5             23.1%  3 / 10  83.0     934.5    57,844    5

Three lenses the leaderboard hides

Accuracy alone would tell you to ship kimi-k2.6 and stop reading. But a model that is right slowly, or right expensively, or right most of the time and then hangs for fifteen minutes, is a very different thing to run in production. Three other lenses change the picture.

Lens 1: accuracy vs. speed

Plot accuracy against median latency and you get a Pareto frontier: the models on the upper-left edge are the ones nothing else beats on both axes at once. Bubble size is mean tokens per question, so bigger bubbles cost more to run on a paid endpoint.

Up and to the left is better. Big bubbles are token-hungry.

kimi-k2.6 pays for its top accuracy with the slowest median in the field, 338 seconds. Nobody watches a spinner for five minutes, so you'd only ship it behind streaming. deepseek-v4-flash matches the baseline's p50 (63s) at 76.9% accuracy, the best speed-to-accuracy trade among the newcomers. Then look at its bubble: it is one of the biggest.
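The frontier can be recomputed from the accuracy and p50 columns of the results table. The sketch below uses the standard definition of dominance (another model is at least as good on both axes and strictly better on one); the function name and data layout are illustrative choices, not the tooling behind the chart.

```python
# (model, accuracy %, p50 seconds), copied from the results table
RESULTS = [
    ("gpt-5-mini (full)", 84.6, 63.3),
    ("kimi-k2.6", 84.6, 338.2),
    ("gpt-oss-120b", 76.9, 109.0),
    ("deepseek-v4-flash", 76.9, 63.0),
    ("gpt-5-mini (quick)", 69.2, 29.4),
    ("gemma-4-26b", 69.2, 218.4),
    ("qwen3.6-27b", 61.5, 196.7),
    ("hy3-preview", 61.5, 55.3),
    ("nemotron-3-super-120b", 53.8, 126.3),
    ("glm-5.1", 53.8, 47.1),
    ("qwen3.6-35b-a3b", 46.2, 111.0),
    ("mimo-v2-flash", 46.2, 73.7),
    ("minimax-m2.7", 38.5, 101.8),
    ("devstral-2512", 23.1, 20.8),
    ("mistral-medium-3.5", 23.1, 83.0),
]


def pareto_frontier(rows):
    """Models that no other model matches or beats on both accuracy and p50."""
    frontier = []
    for name, acc, p50 in rows:
        dominated = any(
            other_acc >= acc and other_p50 <= p50
            and (other_acc > acc or other_p50 < p50)
            for _, other_acc, other_p50 in rows
        )
        if not dominated:
            frontier.append(name)
    return frontier


frontier = pareto_frontier(RESULTS)
print(frontier)
```

On these numbers the frontier is the two baseline presets, deepseek-v4-flash and devstral-2512. deepseek-v4-flash stays on it only because its 63.0s median edges out the full baseline's 63.3s, so a single re-run could plausibly change that.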

Lens 2: token economy

Tokens are the real cost on a paid endpoint, and a decent proxy for how much a model thrashes. Two models stand out for the wrong reason.

Right side = expensive. The two far-right points spend ~92k tokens for middling accuracy.

qwen3.6-35b-a3b and mimo-v2-flash both burn around 92,000 tokens per question, roughly triple the leaders, and still land at 46.2%. It's a scratchpad problem: the model reasons in circles, re-derives context it already has, and pads every step. On a metered endpoint they're the most expensive models in the field and among the least accurate at the same time. The leaders (gpt-5-mini, kimi-k2.6, gpt-oss-120b) sit at 29-36k. In this run, the disciplined models were also the accurate ones.

Lens 3: tail risk

Median latency is the happy path. The p95 is what your unluckiest user feels. Plotting p50 against p95 shows which models stay predictable and which have a long, ugly tail where the agent gets stuck retrying.

Far up the chart = a long, ugly tail where the worst case dwarfs the typical case.

glm-5.1 has one of the tightest tails among the open models (47s median, 179s p95) but only 53.8% accuracy. hy3-preview is the best-behaved of the genuinely useful models: 55s median, 211s tail, 61.5% accuracy. At the other end, qwen3.6-27b and qwen3.6-35b-a3b post p95s near 1,700-1,800 seconds: half-hour outliers where the model never converges and the retry budget runs out. A good median doesn't save a tail like that.
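A complementary way to read the same two columns is the ratio of p95 to p50, which says how much worse the unlucky case is than the typical one regardless of absolute speed. This is an additional lens computed from the table for a handful of models, not a metric the study itself reports; `tail_ratio` is a hypothetical helper.

```python
# (model, p50 seconds, p95 seconds), copied from the results table
LATENCY = [
    ("gpt-oss-120b", 109.0, 212.6),
    ("glm-5.1", 47.1, 178.8),
    ("hy3-preview", 55.3, 210.9),
    ("qwen3.6-27b", 196.7, 1693.1),
    ("qwen3.6-35b-a3b", 111.0, 1778.9),
]


def tail_ratio(p50: float, p95: float) -> float:
    """How many times slower the 95th-percentile run is than the median run."""
    return p95 / p50


# Most predictable first, longest relative tail last.
ranked = sorted(LATENCY, key=lambda row: tail_ratio(row[1], row[2]))

for model, p50, p95 in ranked:
    print(f"{model:<18} p95/p50 = {tail_ratio(p50, p95):.1f}x")
```

Absolute p95 and the ratio answer different questions: the first is what a user actually waits, the second is how far the worst case drifts from what the median promised.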

The hidden variable: errors aren't model quality

mistral-medium-3.5 scored 23.1%, but five of its thirteen runs never completed: the routed endpoint threw credit-cap and server errors mid-run. That is an infrastructure result, not a capability one, so we treat the score as a floor rather than a verdict. The chart below splits clean answers from upstream failures, so you can tell which low scores are real and which are noise.
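The "floor, not verdict" reading is simple arithmetic on the table. The sketch below recomputes accuracy over completed runs only, for the two models whose infrastructure failures the text discusses. It assumes every errored run was counted as a fail in the headline number, which is what the text implies for both; the function names are illustrative.

```python
# (model, passes, total runs, errored runs), copied from the results table
RUNS = [
    ("kimi-k2.6", 11, 13, 1),
    ("mistral-medium-3.5", 3, 13, 5),
]


def headline_accuracy(passes: int, total: int) -> float:
    """Accuracy as reported: errored runs count as failures."""
    return round(100 * passes / total, 1)


def accuracy_on_completed(passes: int, total: int, errored: int) -> float:
    """Accuracy over runs that finished (assumes errored runs were all fails)."""
    return round(100 * passes / (total - errored), 1)


summary = {
    model: (headline_accuracy(p, n), accuracy_on_completed(p, n, e))
    for model, p, n, e in RUNS
}

for model, (headline, completed) in summary.items():
    print(f"{model:<20} headline {headline}%  completed-only {completed}%")
```

With eight or twelve completed runs the adjusted figures are even noisier than the headline ones, which is another reason to re-run rather than reinterpret.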

Red segments are upstream API failures, not wrong answers. Watch mistral-medium-3.5.

Why some models fit and others don't

This is the part a leaderboard can't show you. When you hold the system constant and vary only the model, the winners and losers split along traits that have little to do with their headline benchmark scores, and everything to do with the six bars above.

Robustness beats size and recency

The clearest example: qwen3.6-27b (61.5%, 44k tokens) beat its own larger sibling qwen3.6-35b-a3b (46.2%, 93k tokens) on accuracy while using less than half the tokens. Same family, bigger model, worse result. The larger variant produced runaway scratchpads, and a multi-step loop punishes that: more tokens meant more chances to drift off the right source, not more correct answers. Across the field, neither size nor recency tracked accuracy. What tracked accuracy was whether a model stayed coherent and decisive across a long chain of dependent steps. On a system like this, robustness is the capability that matters, and it doesn't ship with parameter count or release date.

A coding model is not an analytics agent

You would expect a code-tuned model to win a job that ends in a query. It didn't. devstral-2512, a code specialist, was the fastest model in the field (21s median) and one of the least accurate (23.1%). The reason is structural. A multi-step analytics agent spends most of its effort deciding what to compute and from where: reading the request, choosing the canonical source, attributing cost correctly. Very little of it is exotic syntax. Code fluency answers a question nobody here was struggling with. The bottleneck is reasoning, and reasoning isn't a byproduct of training on more code. On a complex architecture, a strong general reasoner beats a strong coder.

Active parameters, not total

Several of the strong performers are mixture-of-experts models with modest active parameter counts; the leaders route through a fraction of their nominal size per token. What separated the good MoE models from the bad ones wasn't how big they were. It was whether their per-step output stayed tight enough to keep the agent's context coherent over a long chain of steps. The ones that did well got to the point in a few steps. The ones that didn't kept thinking out loud.

The right source beats clever queries

Every model in the field can write a non-trivial aggregation, and almost none of the syntax failed. What failed, over and over, was source selection: pulling revenue from a daily aggregate when the question needed raw event-level records, or filtering on the wrong date field. The models that topped the board share one habit. On a per-unit-economics question they anchor both the numerator and the denominator on the same raw source, rather than mixing a curated aggregate with a raw measure. That's a grounding skill, and it's the biggest single divider between the 84% tier and the 46% tier.

Recovery separates the usable from the fragile

When the validator rejects a query, the model gets the error and a chance to fix it. The strong models read the error, notice they referenced a field that doesn't exist or built a join that fans out and explodes the row count, and correct it inside the budget. The weak ones produce a confident, wrong query and then "fix" it into a different wrong one. devstral-2512 is the clearest case: fast at 21s median, 23.1% accurate. Being fast doesn't help much if you can't recover.

Put it together: the models that fit this system behave like reliable components, with tight output, a valid query on the first try, honest error-reading, and good source instincts. Size and public-benchmark rank predicted almost none of that.

The failure taxonomy

We labelled every miss. Eight patterns cover nearly all of them, and how they're distributed carries the main lesson of the study.

F1 · Wrong revenue source. Gross vs. net, or a cost-source snapshot vs. the canonical revenue source.
F2 · Cost mis-attribution. Rolling shared per-asset cost up to the parent and double-counting it.
F3 · Fan-out / missing key. Joining two large sources with no shared date key; the row count explodes silently.
F4 · Hallucinated fields. Invented field names or non-distinct mapping joins.
F5 · Wrong grain for unit economics. Daily aggregate used where raw event-level records were required.
F6 · Date-field mismatch. Filtering on the business-date field instead of the event-time field.
F7 · Token bloat. Runaway scratchpad, leading to a slow tail and sometimes a provider credit cap.
F8 · Asked instead of guessed. On a genuinely ambiguous question, the model correctly asked rather than answered.
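F3 is the easiest of these patterns to show in miniature. The toy rows below are invented for illustration and have nothing to do with the study's schema: joining on the asset alone, without the date key, pairs every revenue row with every cost row for that asset and silently doubles the total.

```python
# Hypothetical toy rows: (asset_id, date, amount)
revenue = [("a1", "2024-01-01", 100.0), ("a1", "2024-01-02", 120.0)]
cost = [("a1", "2024-01-01", 40.0), ("a1", "2024-01-02", 50.0)]


def join(left, right, use_date_key: bool):
    """Inner join on asset_id, optionally also on date."""
    return [
        (rev, cst)
        for rev in left
        for cst in right
        if rev[0] == cst[0] and (not use_date_key or rev[1] == cst[1])
    ]


def total_revenue(pairs) -> float:
    """Sum the revenue amount across joined rows, as a naive query would."""
    return sum(rev[2] for rev, _ in pairs)


keyed = join(revenue, cost, use_date_key=True)
fanned_out = join(revenue, cost, use_date_key=False)

print(len(keyed), total_revenue(keyed))            # one row per day
print(len(fanned_out), total_revenue(fanned_out))  # every day paired with every day
```

Both queries run without error and both return a number, which is why this failure only shows up under execution-based scoring.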

The difficulty chart ranks the thirteen questions by how many of the fifteen models got them wrong. The pattern isn't random.

The hardest questions cluster on F1, F2 and F5: source and grain selection, not query mechanics.

The two hardest questions are both per-unit-economics problems (11/15 and 10/15 fail). They are not hard because the query is exotic. They are hard because the same business concept can be computed more than one legitimate way, and only one is right for what was asked. Sorting that out is reasoning under ambiguity: read the intent, weigh the candidate sources, commit to one. The models that topped the board are the ones that made that call correctly. The rest reached for the convenient source and produced a plausible, wrong number. So the hardest questions are precisely the ones that test the trait this study is about.

Rows are questions (abstracted), columns are models in leaderboard order. Green = pass, red = fail.

The takeaway: the failures cluster on judgment rather than mechanics, on which source, which grain and which date to trust. Robust models resolve the ambiguity and stay correct. Fragile ones grab the easy source and return a confident wrong answer. That is why the ranking holds across question types: the trait that wins the easy questions is the same one that decides the hard ones.

Picks, and what "fit" actually means

If you had to choose today

Top open-weight pick: kimi-k2.6. Ties the proprietary baseline at 84.6%, and its true number is probably higher once the one infra miss is cleared. The cost is latency, a 338s median, so only ship it behind a streaming UI.
Best balance: gpt-oss-120b. 76.9% accuracy, a tight 213s tail, lean tokens. The most component-like model in the field, predictable on every axis.
Best p50 among the accurate models: deepseek-v4-flash. Same 76.9% accuracy at a 63s median, matching the baseline's speed. Watch its 618s tail and heavy token mean if you're paying per token.
Latency-first, accuracy-tolerant: glm-5.1 or hy3-preview. Among the tightest tails of the open models, at 54-62% accuracy. Fine for low-stakes, high-volume queries, but not for the questions that feed a decision.
Skip for now: devstral-2512, minimax-m2.7, mimo-v2-flash, and the token-bloated qwen3.6-35b-a3b. mistral-medium-3.5 deserves a clean re-run before any verdict, since its score is an infrastructure artifact.

The portable lesson

The model leaderboard for your system won't look like the public ones, because the public ones don't make the model behave like a component inside a control loop. With the agent held constant, the traits that predicted success were reliable tool use, output discipline, valid queries on the first try, honest error recovery, and a good instinct for which source is canonical. Parameter count predicted almost nothing. A 27B model beat its 35B sibling, and an open model tied a proprietary one.

Test on your own loop, and hold everything else still. The model is only one variable, and often not the one with the most headroom.

Method notes. Fifteen configurations over a fixed 13-question benchmark, one repetition each, scored by execution match against reference answers. The model was the only variable; the surrounding agent was held constant. Open models ran on a standard routed endpoint at free tier; the baseline (gpt-5-mini) on its own provider. Domain, schema, client, and system internals are abstracted throughout by design; only the model-facing difficulty of the task is described. All numbers come from the run's machine-readable results matrix.


Same brain, different model: 15 LLMs on one production agent