The interesting question about open-weight models is no longer "are they good." It is "are they good at the specific thing you are asking them to do." Benchmark leaderboards answer the first question. They tell you almost nothing about the second.
So we ran our own. We have a working agent in production: it turns plain-English business questions into validated, executable queries against a large analytics store, runs them, and returns the answer. It already runs on a proprietary model (gpt-5-mini). The agent stays fixed. We unplugged its model, dropped in thirteen open-weight alternatives one at a time, and ran every one through the same benchmark: same tools, same retrieval, same retry budget, same scoring.
No prompt tuning per model. No per-model scaffolding. We wanted to measure the models, not our ability to babysit each one. It is a harsh test, and a fair one.
What we found. An open-weight model tied the proprietary baseline: kimi-k2.6 matched gpt-5-mini at 84.6%, and one of its only two misses was an infrastructure error, not a modelling one. Two more open models (gpt-oss-120b, deepseek-v4-flash) scored 76.9%, beating our own fast preset (69.2%). Robustness beat size and recency: a 27B model beat its larger 35B sibling on accuracy and used less than half the tokens. And a code-specialist model landed near the bottom. A multi-step analytics agent rewards reasoning and source judgment, not raw query fluency.
The system under test
Before the scores make sense, you need to know what the model is actually doing. "Turn a question into a query" undersells it. The model doesn't just emit one query and stop. It runs the show inside a multi-step agent: it plans, acts, checks its own work, and recovers when something breaks.
At the top, the model plans and acts. It reads the conversation, works out what the user wants, and reaches for the right capability: fetch data, interpret a result, ask a clarifying question when the request is vague, or hand back an answer. It chooses the steps. Nothing scripts them for it.
Underneath, when data is needed, the model enters a guarded query loop. It drafts a query, the system checks it before anything runs, and if the check fails the model gets handed the problem and has to repair its own work. The same model reasons in both roles. A single question can put it through several rounds of decisions before a word reaches the user, and a weak link anywhere breaks the chain.
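A minimal sketch of that guarded loop, under stated assumptions: the budget of three attempts is a guess, the SQL-looking strings are stand-ins (the query language is not named here), and `draft`, `validate`, and `LoopResult` are hypothetical names, not the production interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LoopResult:
    query: Optional[str]  # the query that passed the check, or None
    attempts: int         # how many drafts were produced


def guarded_query_loop(
    draft: Callable[[str, Optional[str]], str],
    validate: Callable[[str], Optional[str]],
    question: str,
    max_attempts: int = 3,  # assumed budget; the real value is not published
) -> LoopResult:
    """Draft a query, check it before it runs, and feed any error back."""
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        query = draft(question, error)  # the model sees its previous error, if any
        error = validate(query)         # None means the check passed
        if error is None:
            return LoopResult(query=query, attempts=attempt)
    return LoopResult(query=None, attempts=max_attempts)


# Toy stand-ins so the sketch runs without a model or a data store.
def toy_draft(question: str, error: Optional[str]) -> str:
    return "SELECT total FROM orders" if error else "SELECT totl FROM orders"


def toy_validate(query: str) -> Optional[str]:
    return None if "total" in query else "unknown field: totl"


result = guarded_query_loop(toy_draft, toy_validate, "What was total revenue?")
print(result)
```

The point of the shape is that the same callable drafts and repairs. A model that cannot use the error message gets no further than one that never saw it.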

The data underneath is a large, real-world analytics store, the kind any operations-heavy business ends up with. Only one property of it matters for this study: the same business quantity can often be computed more than one legitimate way, at more than one level of detail. Picking the right one is a reasoning problem, not a syntax problem. That distinction does most of the work below.
Why this is a brutal test for a model
A chatbot benchmark asks a model for one good answer. This system asks for a chain of correct decisions, and any link can break the whole thing. To drive it end-to-end, a model has to clear every one of these bars, not on average but on the specific question in front of it:
- Reliable tool use. The model has to call capabilities cleanly. One that emits malformed calls, invents capability names, or narrates instead of acting never gets off the ground.
- Clean structured output. Several steps expect a strict machine-readable shape. Wrap it in prose or a stray code fence and the step fails.
- Valid query on the first try. Drafts are checked before they run. Broken syntax burns a recovery attempt, and recovery attempts are finite.
- Semantic grounding. The hard part isn't syntax. It's choosing the right source and the right level of detail when several look plausible. That's reasoning, and it's where most points are won and lost.
- Multi-step recovery. When a check fails, the model is handed the problem and asked to fix its own work. A model that can't read its own mistake keeps producing new ones until the attempts run out.
- Output discipline. A model that writes a novel in its scratchpad on every step runs slower, costs more, and loses the thread across a long chain of decisions.
Four of those six have nothing to do with the query language itself. They are about behaving like a dependable component while under reasoning load. That is the gap between a model that demos well and one you can wire into a production pipeline, and it is the gap this comparison was built to expose.
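To make the "clean structured output" bar concrete, here is a small sketch of a strict parser. It assumes the machine-readable shape is a JSON object, which is our assumption for illustration; `parse_strict` and the key names `capability` and `arguments` are hypothetical.

```python
import json
from typing import Any, Dict, Set


def parse_strict(raw: str, required_keys: Set[str]) -> Dict[str, Any]:
    """Accept only a bare JSON object that carries every required key."""
    text = raw.strip()
    if not (text.startswith("{") and text.endswith("}")):
        # Prose around the payload, or a code fence, fails here.
        raise ValueError("output is not a bare JSON object")
    payload = json.loads(text)  # malformed JSON raises json.JSONDecodeError
    missing = required_keys - set(payload)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return payload


KEYS = {"capability", "arguments"}
clean = '{"capability": "fetch_data", "arguments": {"metric": "revenue"}}'
fence = "`" * 3
fenced = fence + "json\n" + clean + "\n" + fence  # same payload, wrapped in a fence

print(parse_strict(clean, KEYS))
try:
    parse_strict(fenced, KEYS)
except ValueError as exc:
    print("rejected:", exc)
```

A lenient parser that strips fences would rescue some models. In a study that refuses per-model scaffolding, the strict version is the honest one.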
How we measured it
One swap, everything else frozen
Swapping the model is a one-line change. Everything around it stays fixed: same retrieval, same capabilities, same prompts, same recovery budget, same reasoning settings. The model is the only moving part. Open models ran through a standard routed endpoint, the baseline through its own provider.
Why this matters: every result below is attributable to the model alone. We did not rewrite a prompt to rescue a struggling model, and we didn't hand the leaders an advantage. If a model needed special treatment to behave, not getting it is the finding.
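One way to enforce "the model is the only moving part" is to freeze the configuration and derive every run from the baseline. This is a sketch, not our harness: `AgentConfig` and every field value except the model names are placeholders.

```python
from dataclasses import dataclass, fields, replace
from typing import List


@dataclass(frozen=True)
class AgentConfig:
    model: str
    retrieval: str = "retrieval-v1"    # placeholder value
    prompt_version: str = "prompts-v1" # placeholder value
    recovery_budget: int = 3           # placeholder value
    reasoning_effort: str = "medium"   # placeholder value


def changed_fields(a: AgentConfig, b: AgentConfig) -> List[str]:
    """List the field names whose values differ between two configs."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


baseline = AgentConfig(model="gpt-5-mini")
candidates = ["kimi-k2.6", "gpt-oss-120b", "deepseek-v4-flash"]
runs = [replace(baseline, model=name) for name in candidates]

for run in runs:
    # Fail loudly if anything other than the model drifted.
    assert changed_fields(baseline, run) == ["model"]
```

Because the dataclass is frozen, a run cannot quietly mutate its own settings halfway through a benchmark.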
What "pass" means
Scoring is execution-based, not text-based. We don't diff the generated query against a reference string, because two correct queries can look nothing alike. Instead we run both and compare the results:
- result_match (0/1): does the generated query produce the same answer as the reference query on the live data store? The comparison is set-based (it ignores the order of rows and fields) with a small numeric tolerance for rounding. This is the headline metric.
- useful_behavior (0/1): a wider rubric that also credits a graceful clarification on a genuinely ambiguous question.
- clarification_quality: scored only when the agent chooses to clarify. It checks that the agent offered real, plain-English options instead of leaking internal jargon at the user.
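A sketch of what an order-insensitive, tolerance-aware comparison can look like. The assumptions are ours: rows are dicts keyed by field name, duplicate rows count, and the tolerance values are illustrative. The production scorer may differ on all three.

```python
import math
from typing import Any, Dict, List

Row = Dict[str, Any]


def values_match(a: Any, b: Any, rel_tol: float, abs_tol: float) -> bool:
    """Compare numbers with a tolerance and everything else exactly."""
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
    return a == b


def rows_match(a: Row, b: Row, rel_tol: float, abs_tol: float) -> bool:
    """Dict keys carry no order, so field order is ignored by construction."""
    return a.keys() == b.keys() and all(
        values_match(a[k], b[k], rel_tol, abs_tol) for k in a
    )


def result_match(generated: List[Row], reference: List[Row],
                 rel_tol: float = 1e-6, abs_tol: float = 1e-9) -> int:
    """Return 1 if both results hold the same rows in any order, else 0."""
    if len(generated) != len(reference):
        return 0
    unmatched = list(reference)
    for row in generated:
        for i, candidate in enumerate(unmatched):
            if rows_match(row, candidate, rel_tol, abs_tol):
                del unmatched[i]  # each reference row can be matched once
                break
        else:
            return 0  # no reference row matches this generated row
    return 1


reference = [{"region": "north", "margin": 0.1234567},
             {"region": "south", "margin": 0.2}]
generated = [{"margin": 0.2000000001, "region": "south"},
             {"margin": 0.1234567, "region": "north"}]
print(result_match(generated, reference))
```

The greedy matching is quadratic, which is fine for the small result sets an analytics answer usually returns.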
Thirteen questions, run end-to-end through the agent, one repetition each, the identical set for every model. The questions are deliberately hard: unit economics, cost attribution, profit-and-loss at different levels of detail. The sort of analytics that usually sends people to a data team.
The leaderboard
Fifteen configurations, ranked by execution accuracy. The two baseline bars are the proprietary model (full and "quick" presets). Everything else is open-weight, running for free on routed endpoints.

A few things stand out. kimi-k2.6 ties the proprietary baseline at 84.6%, an open-weight first for this system, and since one of its two misses was an upstream API error, its real modelling accuracy sits a notch above the number shown. Three open models clear the baseline's fast preset (69.2%): alongside kimi-k2.6, gpt-oss-120b and deepseek-v4-flash (both 76.9%) beat the preset we ship for latency-sensitive traffic. No question was failed by all 15; every one is solvable by some model. And the open models cost $0 on routed free-tier endpoints versus $0.17 for the baseline run. The bottom of the pack is not where parameter counts would put it.
Full results
The complete matrix: accuracy, latency, token economy and infrastructure errors for every configuration. Baseline presets are marked; the top open-weight model is kimi-k2.6.
Model                           Acc.   P / F  p50 (s)  p95 (s)  tok mean  errored
-----------------------------  -----  ------  -------  -------  --------  -------
gpt-5-mini (full) · baseline   84.6%  11 / 2     63.3    129.5    29,418        0
kimi-k2.6                      84.6%  11 / 2    338.2    856.8    35,697        1
gpt-oss-120b                   76.9%  10 / 3    109.0    212.6    36,135        1
deepseek-v4-flash              76.9%  10 / 3     63.0    618.2    69,240        0
gpt-5-mini (quick) · baseline  69.2%   9 / 4     29.4    107.6    31,349        0
gemma-4-26b                    69.2%   9 / 4    218.4   1135.3    31,956        0
qwen3.6-27b                    61.5%   8 / 5    196.7   1693.1    43,679        0
hy3-preview                    61.5%   8 / 5     55.3    210.9    40,287        0
nemotron-3-super-120b          53.8%   7 / 6    126.3    675.7    47,949        0
glm-5.1                        53.8%   7 / 6     47.1    178.8    39,651        0
qwen3.6-35b-a3b                46.2%   6 / 7    111.0   1778.9    92,786        2
mimo-v2-flash                  46.2%   6 / 7     73.7    459.8    92,389        0
minimax-m2.7                   38.5%   5 / 8    101.8    463.8    56,862        0
devstral-2512                  23.1%  3 / 10     20.8    116.2    51,979        0
mistral-medium-3.5             23.1%  3 / 10     83.0    934.5    57,844        5
Three lenses the leaderboard hides
Accuracy alone would tell you to ship kimi-k2.6 and stop reading. But a model that is right slowly, or right expensively, or right most of the time and then hangs for fifteen minutes, is a very different thing to run in production. Three other lenses change the picture.
Lens 1: accuracy vs. speed
Plot accuracy against median latency and you get a Pareto frontier: the models on the upper-left edge are the ones nothing else beats on both axes at once. Bubble size is mean tokens per question, so bigger bubbles cost more to run on a paid endpoint.
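The frontier can be recomputed from the results table. The sketch below takes the definition literally: a configuration is dropped only if some other one is strictly better on both accuracy and median latency. `pareto_frontier` is a name chosen here for illustration.

```python
from typing import List, Tuple

# (configuration, accuracy %, median latency in seconds), copied from the results table.
RESULTS: List[Tuple[str, float, float]] = [
    ("gpt-5-mini (full)", 84.6, 63.3),
    ("kimi-k2.6", 84.6, 338.2),
    ("gpt-oss-120b", 76.9, 109.0),
    ("deepseek-v4-flash", 76.9, 63.0),
    ("gpt-5-mini (quick)", 69.2, 29.4),
    ("gemma-4-26b", 69.2, 218.4),
    ("qwen3.6-27b", 61.5, 196.7),
    ("hy3-preview", 61.5, 55.3),
    ("nemotron-3-super-120b", 53.8, 126.3),
    ("glm-5.1", 53.8, 47.1),
    ("qwen3.6-35b-a3b", 46.2, 111.0),
    ("mimo-v2-flash", 46.2, 73.7),
    ("minimax-m2.7", 38.5, 101.8),
    ("devstral-2512", 23.1, 20.8),
    ("mistral-medium-3.5", 23.1, 83.0),
]


def pareto_frontier(rows: List[Tuple[str, float, float]]) -> List[str]:
    """Keep every row that no other row beats on accuracy AND latency at once."""
    frontier = []
    for name, acc, p50 in rows:
        beaten = any(o_acc > acc and o_p50 < p50 for _, o_acc, o_p50 in rows)
        if not beaten:
            frontier.append(name)
    return frontier


print(pareto_frontier(RESULTS))
```

On the table's numbers this keeps five configurations: both baseline presets, kimi-k2.6, deepseek-v4-flash, and devstral-2512. The last one sits on the edge only because nothing is faster, which is a reminder that the frontier describes trade-offs, not quality.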

kimi-k2.6 pays for its top accuracy with the slowest median in the field, 338 seconds. Nobody watches a spinner for five minutes, so you'd only ship it behind streaming. deepseek-v4-flash matches the baseline's p50 (63s) at 76.9% accuracy, the best speed-to-accuracy trade among the newcomers. Then look at its bubble: it is one of the biggest.
Lens 2: token economy
Tokens are the real cost on a paid endpoint, and a decent proxy for how much a model thrashes. Two of them stand out for the wrong reason.

qwen3.6-35b-a3b and mimo-v2-flash both burn roughly 92,000-93,000 tokens per question, about triple the leaders, and still land at 46.2%. It's a scratchpad problem: the model reasons in circles, re-derives context it already has, and pads every step. On a metered endpoint they're the most expensive models in the field and among the least accurate at the same time. The leaders (gpt-5-mini, kimi-k2.6, gpt-oss-120b) sit at 29-36k. In this run, the disciplined models were also the accurate ones.
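One way to fold accuracy into the token picture is tokens spent per correct answer. This is a derived figure, not a column in our results, and it assumes "tok mean" averages over all thirteen questions, failed ones included.

```python
QUESTIONS = 13

# (model, mean tokens per question, questions passed), copied from the results table.
TOKEN_RUNS = [
    ("gpt-5-mini (full)", 29_418, 11),
    ("kimi-k2.6", 35_697, 11),
    ("gpt-oss-120b", 36_135, 10),
    ("qwen3.6-35b-a3b", 92_786, 6),
    ("mimo-v2-flash", 92_389, 6),
]


def tokens_per_correct_answer(mean_tokens: int, passed: int,
                              questions: int = QUESTIONS) -> float:
    """Total tokens across the benchmark divided by the number of passes."""
    return mean_tokens * questions / passed


for model, mean_tokens, passed in TOKEN_RUNS:
    print(f"{model:<20} {tokens_per_correct_answer(mean_tokens, passed):>10,.0f}")
```

Under that reading the baseline spends about 35k tokens per correct answer and qwen3.6-35b-a3b about 201k, close to six times as much.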
Lens 3: tail risk
Median latency is the happy path. The p95 is what your unluckiest user feels. Plotting p50 against p95 shows which models stay predictable and which have a long, ugly tail where the agent gets stuck retrying.

glm-5.1 has one of the tightest tails in the field (47s median, 179s p95) but only 53.8% accuracy. hy3-preview is the best-behaved of the genuinely useful models: 55s median, 211s tail, 61.5% accuracy. At the other end, qwen3.6-27b and qwen3.6-35b-a3b post p95s near 1,700-1,800 seconds: half-hour outliers where the model never converges and the retry budget runs out. A good median doesn't save a tail like that.
The hidden variable: errors aren't model quality
mistral-medium-3.5 scored 23.1%, but five of its thirteen runs never completed: the routed endpoint threw credit-cap and server errors mid-run. That is an infrastructure result, not a capability one, so we treat the score as a floor rather than a verdict. The chart below splits clean answers from upstream failures, so you can tell which low scores are real and which are noise.
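The split is simple arithmetic. The sketch assumes every errored run was scored as a fail, which the text confirms for kimi-k2.6 and which we take as given for the rest. The "clean" figure is an optimistic reading, not a corrected score.

```python
from typing import Tuple

QUESTIONS_PER_MODEL = 13

# (model, questions passed, runs that errored upstream), copied from the results table.
ERRORED_RUNS = [
    ("kimi-k2.6", 11, 1),
    ("gpt-oss-120b", 10, 1),
    ("qwen3.6-35b-a3b", 6, 2),
    ("mistral-medium-3.5", 3, 5),
]


def accuracy_views(passed: int, errored: int,
                   questions: int = QUESTIONS_PER_MODEL) -> Tuple[float, float]:
    """Return (headline accuracy, accuracy over runs that completed)."""
    headline = passed / questions
    completed = questions - errored
    clean = passed / completed if completed else float("nan")
    return headline, clean


for model, passed, errored in ERRORED_RUNS:
    headline, clean = accuracy_views(passed, errored)
    print(f"{model:<20} headline {headline:.1%}  completed-only {clean:.1%}")
```

For mistral-medium-3.5 that is 3 of 13 overall but 3 of the 8 runs that finished, which is why the score reads as a floor.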

Why some models fit and others don't
This is the part a leaderboard can't show you. When you hold the system constant and vary only the model, the winners and losers split along traits that have little to do with their headline benchmark scores, and everything to do with the six bars above.
Robustness beats size and recency
The clearest example: qwen3.6-27b (61.5%, 44k tokens) beat its own larger sibling qwen3.6-35b-a3b (46.2%, 93k tokens) on accuracy while using less than half the tokens. Same family, bigger model, worse result. The larger variant produced runaway scratchpads, and a multi-step loop punishes that: more tokens meant more chances to drift off the right source, not more correct answers. Across the field, neither size nor recency tracked accuracy. What tracked accuracy was whether a model stayed coherent and decisive across a long chain of dependent steps. On a system like this, robustness is the capability that matters, and it doesn't ship with parameter count or release date.
A coding model is not an analytics agent
You would expect a code-tuned model to win a job that ends in a query. It didn't. devstral-2512, a code specialist, was the fastest model in the field (21s median) and one of the least accurate (23.1%). The reason is structural. A multi-step analytics agent spends most of its effort deciding what to compute and from where: reading the request, choosing the canonical source, attributing cost correctly. Very little of it is exotic syntax. Code fluency answers a question nobody here was struggling with. The bottleneck is reasoning, and reasoning isn't a byproduct of training on more code. On a complex architecture, a strong general reasoner beats a strong coder.
Active parameters, not total
Several of the strong performers are mixture-of-experts models with modest active parameter counts; the leaders route through a fraction of their nominal size per token. What separated the good MoE models from the bad ones wasn't how big they were. It was whether their per-step output stayed tight enough to keep the agent's context coherent over a long chain of steps. The ones that did well got to the point in a few steps. The ones that didn't kept thinking out loud.
The right source beats clever queries
Every model in the field can write a non-trivial aggregation, and almost none of the syntax failed. What failed, over and over, was source selection: pulling revenue from a daily aggregate when the question needed raw event-level records, or filtering on the wrong date field. The models that topped the board share one habit. On a per-unit-economics question they anchor both the numerator and the denominator on the same raw source, rather than mixing a curated aggregate with a raw measure. That's a grounding skill, and it's the biggest single divider between the 84.6% tier and the 46.2% tier.
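A toy illustration of why mixing sources goes wrong. Every number below is invented, and the "net of refunds" aggregate is a hypothetical stand-in for whatever curation a real daily table applies.

```python
# Toy raw event-level records: (day, revenue, units). All values are invented.
events = [
    ("mon", 10.0, 2),
    ("mon", 20.0, 4),
    ("tue", 30.0, 6),
]

# Toy curated daily aggregate. Its revenue is net of refunds, so it no
# longer sums to the raw events above.
daily_net_revenue = {"mon": 27.0, "tue": 27.0}

raw_revenue = sum(revenue for _, revenue, _ in events)
raw_units = sum(units for _, _, units in events)

# Numerator and denominator anchored on the same raw source.
consistent = raw_revenue / raw_units

# Numerator from the curated aggregate, denominator from raw events.
mixed = sum(daily_net_revenue.values()) / raw_units

print(f"consistent revenue per unit: {consistent:.2f}")
print(f"mixed revenue per unit:      {mixed:.2f}")
```

Both queries run, both return a plausible number, and nothing in the syntax says which one answers the question that was asked.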
Recovery separates the usable from the fragile
When the validator rejects a query, the model gets the error and a chance to fix it. The strong models read the error, notice they referenced a field that doesn't exist or built a join that fans out and explodes the row count, and correct it inside the budget. The weak ones produce a confident, wrong query and then "fix" it into a different wrong one. devstral-2512 is the clearest case: fast at 21s median, 23.1% accurate. Being fast doesn't help much if you can't recover.
Put it together: the models that fit this system behave like reliable components - tight output, a valid query on the first try, honest error-reading, good source instincts. Size and public-benchmark rank predicted almost none of that.
The failure taxonomy
We labelled every miss. Eight patterns cover nearly all of them, and how they're distributed carries the main lesson of the study.
- F1 · Wrong revenue source. Gross vs. net, or a cost-source snapshot vs. the canonical revenue source.
- F2 · Cost mis-attribution. Rolling shared per-asset cost up to the parent and double-counting it.
- F3 · Fan-out / missing key. Joining two large sources with no shared date key; the row count explodes silently.
- F4 · Hallucinated fields. Invented field names or non-distinct mapping joins.
- F5 · Wrong grain for unit economics. Daily aggregate used where raw event-level records were required.
- F6 · Date-field mismatch. Filtering on the business-date field instead of the event-time field.
- F7 · Token bloat. Runaway scratchpad, leading to a slow tail and sometimes a provider credit cap.
- F8 · Asked instead of guessed. On a genuinely ambiguous question, the model correctly asked rather than answered.
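F3 is the easiest of these to reproduce. The sketch below joins two toy sources on the asset alone, then on asset plus date. The rows, field layout, and the `join` helper are all invented for illustration.

```python
from collections import defaultdict
from typing import Callable, Hashable, List, Tuple

Row = Tuple[str, str, float]  # (asset_id, date, value)

# Toy data: one asset, two days, one revenue row and one cost row per day.
revenue: List[Row] = [("a1", "day-1", 100.0), ("a1", "day-2", 120.0)]
cost: List[Row] = [("a1", "day-1", 40.0), ("a1", "day-2", 45.0)]


def join(left: List[Row], right: List[Row],
         key: Callable[[Row], Hashable]) -> List[Tuple[Row, Row]]:
    """Inner join: pair every left row with every right row sharing its key."""
    index = defaultdict(list)
    for row in right:
        index[key(row)].append(row)
    return [(l, r) for l in left for r in index.get(key(l), [])]


by_asset = join(revenue, cost, key=lambda row: row[0])
by_asset_and_date = join(revenue, cost, key=lambda row: (row[0], row[1]))

true_cost = sum(value for _, _, value in cost)
fanned_cost = sum(r[2] for _, r in by_asset)

print(len(by_asset), len(by_asset_and_date))
print(true_cost, fanned_cost)

# A cheap guard: a join that should be one-to-one must not add rows.
fans_out = len(by_asset) > len(revenue)
```

Without the date key each revenue row pairs with both cost rows, so the row count doubles and the summed cost doubles with it. No error is raised, which is what makes the failure silent.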
The difficulty chart ranks the thirteen questions by how many of the fifteen models got them wrong. The pattern isn't random.

The two hardest questions are both per-unit-economics problems (11/15 and 10/15 fail). They are not hard because the query is exotic. They are hard because the same business concept can be computed more than one legitimate way, and only one is right for what was asked. Sorting that out is reasoning under ambiguity: read the intent, weigh the candidate sources, commit to one. The models that topped the board are the ones that made that call correctly. The rest reached for the convenient source and produced a plausible, wrong number. So the hardest questions are precisely the ones that test the trait this study is about.

The takeaway: the failures cluster on judgment rather than mechanics - on which source, which grain and which date to trust. Robust models resolve the ambiguity and stay correct. Fragile ones grab the easy source and return a confident wrong answer. That is why the ranking holds across question types: the trait that wins the easy questions is the same one that decides the hard ones.
Picks, and what "fit" actually means
If you had to choose today
- Top open-weight pick: kimi-k2.6. Ties the proprietary baseline at 84.6%, and its true number is probably higher once the one infra miss is cleared. The cost is latency, a 338s median, so only ship it behind a streaming UI.
- Best balance: gpt-oss-120b. 76.9% accuracy, a tight 213s tail, lean tokens. The most component-like model in the field, predictable on every axis.
- Best p50: deepseek-v4-flash. Same 76.9% accuracy at a 63s median, matching the baseline's speed. Watch its 618s tail and heavy token mean if you're paying per token.
- Latency-first, accuracy-tolerant: glm-5.1 or hy3-preview. Among the tightest tails in the field at 54-62% accuracy. Fine for low-stakes, high-volume queries, but not for the questions that feed a decision.
- Skip for now: devstral-2512, minimax-m2.7, mimo-v2-flash, and the token-bloated qwen3.6-35b-a3b. mistral-medium-3.5 deserves a clean re-run before any verdict, since its score is an infrastructure artifact.
The portable lesson
The model leaderboard for your system won't look like the public ones, because the public ones don't make the model behave like a component inside a control loop. With the agent held constant, the traits that predicted success were reliable tool use, output discipline, valid queries on the first try, honest error recovery, and a good instinct for which source is canonical. Parameter count predicted almost nothing. A 27B model beat its 35B sibling, and an open model tied a proprietary one.
Test on your own loop, and hold everything else still. The model is only one variable, and often not the one with the most headroom.
Method notes. Fifteen configurations over a fixed 13-question benchmark, one repetition each, scored by execution match against reference answers. The model was the only variable; the surrounding agent was held constant. Open models ran on a standard routed endpoint at free tier; the baseline (gpt-5-mini) on its own provider. Domain, schema, client, and system internals are abstracted throughout by design; only the model-facing difficulty of the task is described. All numbers come from the run's machine-readable results matrix.
Written by
CosX AI
Published
June 15, 2026
Duration
15 min read