DiffusionGemma hits 0.35s median in the WebBrain local planner bench
We pulled four new local candidates into the WebBrain bench: Gemma 4 12B Coder Fable5 Composer 2.5, Cohere North-Mini-Code 1.0, DiffusionGemma-26B-A4B-it, and VibeThinker-3B. The practical result: Gemma and North both completed the frozen legacy first-tool-call run at Q4, DiffusionGemma could not use the normal llama.cpp server path but did complete the harness through vLLM at very high speed, and VibeThinker matched its own caveat: it is not trained for tool-calling or autonomous agents. DiffusionGemma is promising, but not yet something I would put in WebBrain's default path.
What we ran
This was not a leaderboard run. It was a local-serving reality check: can these models sit behind WebBrain as a first-action browser planner?
For the comparable runs, we used the frozen first-tool-call harness:
node test/llm/run-llamacpp.mjs \
--freeze test/llm/freeze/baseline-2026-05-23.json \
--chat-template-compat alternating \
--concurrency 1
That means:
- 100 single-turn browser-agent prompts.
- The May 23 Claude Sonnet 4.6 WebBrain system prompt and 41-tool schema, frozen.
- Legacy text-call compatibility: no native OpenAI
toolsfield is sent. - One active request at a time.
Q4_K_MGGUFs where a GGUF path exists and can run.
The strict numbers below score only the first model action. Exact means tool name and args match the expected first call. Name means the tool name matches but args differ. Parsed calls measures format reliability, not correctness.
Results
| Model | Serving path | Parsed calls | Exact | Name | Median | p95 | Observed VRAM | Status |
|---|---|---|---|---|---|---|---|---|
| Gemma 4 12B Coder Fable5 Composer 2.5 | Q4_K_M GGUF, llama.cpp, 32k ctx | 94/100 | 9% | 26% | 1.9s | 2.8s | ~8.8-9.0 GB | Clean text run |
| Cohere North-Mini-Code 1.0 | Q4_K_M GGUF, llama.cpp b9714, 32k ctx | 93/100 | 9% | 24% | 3.2s | 4.0s | ~19.5 GB | Clean run with parser workaround |
| DiffusionGemma-26B-A4B-it | NVFP4, vLLM, 32k ctx | 79/100 | 10% | 26% | 0.35s | 0.65s | ~32.0 GB reserved | Very fast; vision probe works |
| VibeThinker-3B | BF16, vLLM, 64k ctx, 8 server seqs | 84/100 | 2% | 19% | 5.0s | 15.1s | ~26.8 GB reserved | Scored, but not recommended |
Two VRAM caveats:
- The llama.cpp numbers are observed desktop readings on an RTX 5090, not isolated lab measurements. They include 32k context, full GPU offload, flash attention where available, q4 KV, and one slot.
- The vLLM numbers are not minimum model memory needs. The DiffusionGemma endpoint reserved almost the whole RTX 5090, and the VibeThinker server was configured with 64k context and 8 concurrent sequences.
Gemma 4 12B Coder
Model: yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF
Local file:
G:\llama\models\gemma4-12b-coder\gemma4-coding-Q4_K_M.gguf
Gemma was the cleanest llama.cpp run in this batch.
| Metric | Value |
|---|---|
| Completed cases | 100/100 |
| Transport errors | 0 |
| Parsed calls | 94/100 |
| Exact first-call match | 9/100 |
| Tool-name match | 26/100 |
| Average latency | 2.34s |
| Median latency | 1.91s |
| p95 latency | 2.81s |
| Slowest case | 40.5s |
| Observed VRAM | ~8.8-9.0 GB |
The good part is format reliability. Under the frozen legacy path, Gemma emitted parseable calls in 94 cases without transport errors. It was also quick: most requests landed around two seconds.
The weakness is first-action selection. It called get_accessibility_tree 57 times out of 100. That is often safe behavior in a real browser session, but this benchmark asks for the expected first action. When the user explicitly asks to go somewhere, search something, or open a known page, "inspect the current blank tab" is usually too conservative.
We also ran test/vision-probe.mjs against the same server. The probe failed with:
image input is not supported - hint: if this is unexpected, you may need to provide the mmproj
So this particular GGUF is a text planner candidate only in the current setup.
North Mini Code
Model: CohereLabs/North-Mini-Code-1.0, Q4 GGUF from bartowski/North-Mini-Code-1.0-GGUF
Local file:
G:\llama\models\north-mini-code-1.0\North-Mini-Code-1.0-Q4_K_M.gguf
North loaded successfully in the staged current llama.cpp build:
G:\llama\llama-b9714-cuda13\llama-server.exe
But the first smoke test hit a serving-layer problem. With the native template, North generated a sensible action:
[
{
"tool_call_id": "0",
"tool_name": "navigate",
"parameters": {
"url": "about:addons"
}
}
]
llama.cpp's OpenAI endpoint rejected it before returning a normal response:
The model produced output that does not match the expected peg-native format
For the comparable run, we restarted the server with --skip-chat-parsing and added a narrow raw-action extractor for North/Cohere-style JSON in the runner. The prompt remained the same frozen legacy prompt, and no structured OpenAI tools were sent.
| Metric | Value |
|---|---|
| Completed cases | 100/100 |
| Transport errors | 0 |
| Parsed calls | 93/100 |
| Exact first-call match | 9/100 |
| Tool-name match | 24/100 |
| Average latency | 3.30s |
| Median latency | 3.19s |
| p95 latency | 4.02s |
| Slowest case | 5.16s |
| Observed VRAM | ~19.5 GB |
North was slower and much heavier than Gemma, but more stable than the first template failure suggested. Like Gemma, it strongly preferred get_accessibility_tree: 59 of its 93 parsed calls used that tool. It had 7 no-tool answers.
The vision probe failed the same way as Gemma:
image input is not supported - hint: if this is unexpected, you may need to provide the mmproj
So North is also text-only in this local run.
DiffusionGemma
Model: google/diffusiongemma-26B-A4B-it, vLLM endpoint rooted at nvidia/diffusiongemma-26B-A4B-it-NVFP4. We also tested the Q4 GGUF from unsloth/diffusiongemma-26B-A4B-it-GGUF for local llama.cpp compatibility.
Local file:
G:\llama\models\diffusiongemma-26b-a4b-it\diffusiongemma-26B-A4B-it-Q4_K_M.gguf
DiffusionGemma was the headline surprise.
The expectation going in was simple: if the model can preserve roughly the same action quality as Gemma 4 26B-A4B while running much faster, it changes the local-agent tradeoff. That is mostly what happened. It did not beat the old Gemma 26B run on strict action quality, but it got close enough to be interesting and was dramatically faster once the vLLM server was warm.
| Metric | Value |
|---|---|
| Completed cases | 100/100 |
| Transport errors | 0 |
| Parsed calls | 79/100 |
| Exact first-call match | 10/100 |
| Tool-name match | 26/100 |
| Average latency | 0.38s |
| Median latency | 0.35s |
| p95 latency | 0.65s |
| Slowest case | 0.88s |
| Observed VRAM | ~32.0 GB reserved by vLLM |
The first one-case smoke took 8.8s and picked about:extensions instead of Firefox's expected about:addons. After that warmup, the full 100-case run finished in 37.8s wall time, with most cases between 200ms and 650ms. That is the flashy part: it behaves like a large local model, but the warmed vLLM path felt closer to a lightweight router.
Quality was more mixed. Compared with the earlier Gemma 4 26B-A4B saved run, DiffusionGemma had:
- better exact score by a small amount than Gemma 4 12B Coder and North Mini Code,
- the same 26% tool-name score as Gemma 4 12B Coder,
- lower parseability than Gemma 26B, with 21 no-tool answers,
- much better latency than every other large-class local run in this table.
The caveat is important: DiffusionGemma does not yet feel reliable enough for WebBrain's default planner. In interactive use it can sometimes behave as if an action has already happened when it has not actually been taken yet. Our working hypothesis is that the diffusion generation path is part of that failure mode: it is less stepwise-deterministic than a normal autoregressive planner, and browser automation punishes that kind of state drift quickly. The speed is real; the action-state calibration still needs work.
The other surprise is vision. Unlike the Gemma 4 12B Coder and North Q4 GGUFs, the vLLM DiffusionGemma endpoint accepted the vision-probe.mjs screenshot and produced a useful structured caption of the Google password-error page. It correctly identified the page purpose, the visible text, the focused password input, and the "Enter a password" error state. That makes it more than a text planner curiosity.
There is still a serving split to be aware of. Standard llama-server and llama-cli did not load the Q4 GGUF in three builds we checked:
- the existing April-era
G:\llamabuild, G:\llama-b9286,- the staged official Windows CUDA
b9714build.
All failed with the same architecture error:
unknown model architecture: 'diffusion-gemma'
That is expected right now. DiffusionGemma is not a normal autoregressive model, and the GGUF notes point to the DiffusionGemma llama.cpp PR and the dedicated llama-diffusion-cli runner. The Hugging Face discussion for the Unsloth GGUF also notes that this path is CLI-only for now, with no llama-server support yet.
So we built the PR branch locally:
G:\llama-diffusiongemma\build\bin\llama-diffusion-cli.exe
Smoke command:
llama-diffusion-cli.exe \
-m G:\llama\models\diffusiongemma-26b-a4b-it\diffusiongemma-26B-A4B-it-Q4_K_M.gguf \
-p "Explain promises in JavaScript in one sentence." \
-n 64 \
-ngl 99 \
--diffusion-steps 24
That worked. It produced a coherent one-sentence answer and peaked around 21.3 GB total 5090 memory in the smoke run. The CLI reported about 1.3s total generation time for the 256-token canvas, using entropy-bound early stopping.
But the WebBrain score above comes from vLLM, not from the GGUF CLI path. For local llama.cpp users, this is still not drop-in. For vLLM users with enough VRAM, it is suddenly one of the most interesting candidates in the batch.
VibeThinker
Model: WeiboAI/VibeThinker-3B
We tried VibeThinker two ways.
First, we pulled a Q4_K_M GGUF and ran it through llama.cpp. Those numbers are discarded. The Q4 run showed repetitive answers, never-ending responses, high latency, and weak tool-call behavior. For a 3B model, this quantization may simply be too lossy for this use case.
Then we reran the BF16 model through vLLM on port 8000. That gave a cleaner data point:
| Metric | Value |
|---|---|
| Completed cases | 100/100 |
| Transport errors | 0 |
| Parsed calls | 84/100 |
| Exact first-call match | 2/100 |
| Tool-name match | 19/100 |
| Average latency | 6.20s |
| Median latency | 4.97s |
| p95 latency | 15.13s |
| Slowest case | 24.4s |
| Observed VRAM | ~26.8 GB reserved by vLLM |
This cleaner run still supports the upstream warning. The VibeThinker model card says it was not trained on tool-calling or agent-based programming data and does not recommend it for function calling, API orchestration, or autonomous coding agents. It recommends competitive-programming-style tasks instead.
That showed up here. VibeThinker emitted calls most of the time, but the first-action routing was weak, no-tool answers were common, and the model often selected generic reading/inspection behavior where the harness expected a concrete browser action.
So the fair interpretation is narrow: VibeThinker may still be interesting for reasoning or competitive-programming prompts, but these WebBrain planner results should not be used as evidence for or against its intended use case.
What I would keep testing
The practical local candidates from this batch are DiffusionGemma on vLLM, Gemma 4 12B Coder, and North Mini Code.
Gemma is lighter and faster in the llama.cpp lane. North is heavier, slightly slower, and needed a parser workaround, but once raw actions were allowed through it completed the full run cleanly. Both have the same exact-match score, and both overuse get_accessibility_tree.
That suggests the next useful experiment is prompt-side, not model-side:
- make direct-navigation instructions stronger,
- discourage inspection when the current tab is obviously irrelevant,
- keep the frozen legacy parser path for apples-to-apples comparison,
- rerun Gemma and North after the prompt change.
DiffusionGemma is the speed track. It is the most unusual model in the batch, and the vLLM run showed why people are excited about it: large-model behavior with sub-second warmed routing latency. The remaining question is whether prompt changes can reduce its no-tool rate without losing that speed.
VibeThinker should stay out of the browser-agent planner table unless a tool-trained variant appears.
Comparison with earlier planner runs
For context, here is the same strict first-tool comparison across the saved runs we have tested before. This is not the same scoring lens as the May benchmark post, which compared models against consensus and Sonnet. This table replays each saved result against the current expected/NNN.json ideal first call: exact is name plus args, name is tool name only.
| Model | Parsed calls | Exact | Name | Median latency |
|---|---|---|---|---|
| MiniMax M2.7 | 88/100 | 23% | 36% | 3.0s |
| Claude Sonnet 4.6 | 92/100 | 19% | 41% | 2.8s |
| Qwen 3.6 35B-A3B | 90/100 | 18% | 38% | 10.3s |
| Qwen 3.6 27B | 92/100 | 18% | 37% | 10.2s |
| Nemotron Omni 30B | 93/100 | 16% | 36% | 2.5s |
| Qwen 3.5 9B | 90/100 | 15% | 35% | 0.91s |
| Gemma 4 E4B | 87/100 | 14% | 35% | 4.5s |
| Intel Gemma 4 31B int4 | 88/100 | 14% | 34% | 0.63s |
| Gemma 4 26B-A4B | 87/100 | 13% | 30% | 1.4s |
| Browser-Use Qwen 30B-A3B Q4 | 93/100 | 12% | 35% | 0.47s |
| Qwen 3.5 4B | 82/100 | 12% | 33% | 5.4s |
| Gemma 4 E2B | 76/100 | 12% | 31% | 3.8s |
| DiffusionGemma 26B-A4B vLLM | 79/100 | 10% | 26% | 0.35s |
| Gemma 4 12B Coder Q4 | 94/100 | 9% | 26% | 1.9s |
| North Mini Code Q4 | 93/100 | 9% | 24% | 3.2s |
| Qwen 3.5 0.8B | 90/100 | 7% | 15% | 0.44s |
| LFM 2.5 | 83/100 | 4% | 23% | 5.9s |
| Qwen 3.5 2B | 89/100 | 4% | 7% | 0.78s |
| VibeThinker 3B BF16 | 84/100 | 2% | 19% | 5.0s |
The more useful comparison is by class.
Mid dense: Gemma 4 12B Coder vs Qwen 3.5 9B
Gemma 4 12B Coder is closest to Qwen 3.5 9B in spirit: mid-sized dense-ish local planners that should fit comfortably below the huge-model tier.
| Model | Parsed calls | Exact | Name | Median latency |
|---|---|---|---|---|
| Qwen 3.5 9B | 90/100 | 15% | 35% | 0.91s |
| Gemma 4 12B Coder Q4 | 94/100 | 9% | 26% | 1.9s |
Gemma wins on parseability, but Qwen 3.5 9B is still the better first-action router in this saved strict replay. Gemma's weakness is not format. It is action choice: too many safe-but-generic get_accessibility_tree calls where Qwen more often picks the expected tool directly.
Mini class: VibeThinker 3B vs Gemma 4 E4B vs Qwen 3.5 4B
The small-model comparison is where VibeThinker has to prove it can transfer reasoning into browser-agent tool use. Against the earlier mini-class runs, it does not.
| Model | Parsed calls | Exact | Name | Median latency |
|---|---|---|---|---|
| Gemma 4 E4B | 87/100 | 14% | 35% | 4.5s |
| Qwen 3.5 4B | 82/100 | 12% | 33% | 5.4s |
| VibeThinker 3B BF16 | 84/100 | 2% | 19% | 5.0s |
This makes the VibeThinker conclusion clearer. Even with BF16/vLLM instead of the bad Q4 GGUF run, it trails both Gemma 4 E4B and Qwen 3.5 4B by a wide margin on exact and name-only matching. That lines up with the model-card warning: VibeThinker may be a competitive-programming/reasoning model, but it is not a browser-agent planner model.
Large MoE class: North vs Qwen 3.6 35B-A3B vs Gemma 4 26B
North belongs in the large local tier, where the question is whether a heavier model gives better routing than the older large MoE candidates.
| Model | Parsed calls | Exact | Name | Median latency |
|---|---|---|---|---|
| Qwen 3.6 35B-A3B | 90/100 | 18% | 38% | 10.3s |
| Gemma 4 26B-A4B | 87/100 | 13% | 30% | 1.4s |
| DiffusionGemma 26B-A4B vLLM | 79/100 | 10% | 26% | 0.35s |
| North Mini Code Q4 | 93/100 | 9% | 24% | 3.2s |
North is operationally promising because it loaded, fit, and produced parseable actions after the --skip-chat-parsing workaround. But it does not beat the older large local models on first-action quality. Qwen 3.6 35B-A3B remains the large-tier action-selection reference. Gemma 4 26B-A4B still has better strict routing quality than DiffusionGemma in this saved replay, but DiffusionGemma is the speed outlier: 0.35s median versus 1.4s for Gemma 26B and 10.3s for Qwen 35B.
That Qwen latency should not be read as the final word on the model. At roughly ten seconds it is much slower than Gemma 4 26B's one-second-ish run and DiffusionGemma's sub-second warmed vLLM path, but I still think there is room to improve Qwen 3.6 35B-A3B with a better vLLM configuration. The quality signal is strong enough that the serving stack deserves more tuning before we treat the latency gap as permanent.
The new models did not beat the earlier leaders on strict first-action accuracy. Their stronger signal is operational: DiffusionGemma is genuinely fast through vLLM and vision-capable, Gemma 4 12B Coder is light and parseable, and North Mini Code is heavy but also parseable once its native action format is allowed through. All three need prompt work before they can challenge the older Qwen, MiniMax, Sonnet, or Nemotron runs on action selection.