June 20, 2026 · 7 min read · ← All posts

Gemma 4 31B QAT quietly becomes the best local Gemma planner we have tested

We ran google/gemma-4-31B-it-qat-w4a16-ct through the same frozen WebBrain first-tool-call harness we use for local planner comparisons. The result is small on branding and big in practice: the QAT build improves over the older Gemma 4 31B int4 result, narrowly edges Qwen 3.6 27B on strict first-action quality, and does it with sub-second median latency.

What we ran

The endpoint was a local vLLM server exposing:

model id: gemma-4-31b
root: google/gemma-4-31B-it-qat-w4a16-ct
max context reported by vLLM: 65536

We used the same frozen legacy planner harness as the latest local-model post:

node test/llm/run-llamacpp.mjs \
  --base http://127.0.0.1:8000 \
  --model gemma-4-31b \
  --tag 2026-06-20-vllm-legacy \
  --concurrency 1 \
  --timeout 180000 \
  --no-save-request \
  --freeze test/llm/freeze/baseline-2026-05-23.json \
  --chat-template-compat alternating

That means 100 single-turn browser-agent prompts, the May 23 Claude Sonnet 4.6 WebBrain system prompt and 41-tool schema, legacy text-call compatibility, no native OpenAI tools field, and one active request at a time.

The result files are in:

test/llm/results/2026-06-20-vllm-legacy_chrome_gemma-4-31b_frozen

The result

Metric	Gemma 4 31B QAT w4a16
Completed cases	100/100
Transport errors	0
Parsed calls	95/100
Exact first-call match	19/100
Tool-name match	37/100
Average latency	0.67s
Median latency	0.55s
p95 latency	1.07s
Slowest case	5.01s
Observed GPU memory	~30.1 GB used on RTX 5090

This is the best Gemma-family local planner result we have saved so far.

The older Gemma 4 31B int4 AutoRound run was already useful, but the QAT build improves every first-action metric that matters here:

Model	Parsed calls	Exact	Name	Median	p95
Intel Gemma 4 31B int4 AutoRound	88/100	14%	34%	0.63s	1.46s
Gemma 4 31B QAT w4a16	95/100	19%	37%	0.55s	1.07s

The improvement is not just a formatting cleanup. The QAT model reduced no-tool answers from 12 to 5, raised exact first-call matches by 5 points, and slightly reduced the over-cautious get_accessibility_tree habit from 45 calls to 41. It still inspects pages often, but it is less likely to stall out before making a usable first move.

The Qwen 3.6 27B comparison

This is the part that makes the run interesting beyond the Gemma family.

Model	Parsed calls	Exact	Name	Median latency
Qwen 3.6 27B	92/100	18%	37%	10.2s
Gemma 4 31B QAT w4a16	95/100	19%	37%	0.55s

So yes: in this strict first-action replay, Gemma 4 31B QAT narrowly beats Qwen 3.6 27B on quality. It is not a blowout. It is +3 parsed calls, +1 exact point, and a tied tool-name score. But the latency difference is not narrow at all: 0.55s median versus 10.2s.

That gives it a very different profile from the usual "larger local model is smarter but slower" tradeoff. Here, the QAT Gemma run is both slightly better and dramatically faster in this vLLM setup.

The naming is also refreshingly literal. This feels like the kind of improvement people might have called "Gemma 4.1 31B" if it had shipped as a conventional checkpoint refresh. Google named the mechanism instead: QAT. That is probably more honest. The model is not claiming to be a new generation; it is a quantization-aware serving-oriented variant that behaves like a real planner upgrade.

The analogy to the Qwen 3.5 to Qwen 3.6 jump is useful, but with one caveat: we do not have a saved Qwen 3.5 27B run in this WebBrain result set. So this is not a measured Qwen-style before/after. It is a behavioral analogy: the same base family suddenly feels more useful for agent routing because the deployed variant changed the practical quality-speed frontier.

Sonnet 4.6 as the reference

The strict ideal-first-call replay is useful, but the older WebBrain benchmark used a different reference: treat Claude Sonnet 4.6 as the truth and ask how often each model picked the same first tool.

Here is that view updated with the newer runs. The original May rows come from the published Sonnet export. The newer rows were scored against the same 100 Sonnet tool-name picks from that export, so this table is about first-tool alignment, not argument equality.

#	Model	Match all	Match when Sonnet tooled	Tool-call rate	Valid-name rate	Median
ref	Claude Sonnet 4.6	100%	100%	92%	92%	2.8s
1	Gemma 4 31B QAT w4a16	77.0%	78.3%	95%	95%	0.55s
2	Qwen 3.6 27B	77.0%	77.2%	92%	92%	10.2s
3	MiniMax M2.7	77.0%	76.1%	88%	88%	3.1s
4	Intel Gemma 4 31B int4 AutoRound	74.0%	72.8%	88%	87%	0.63s
5	Qwen 3.5 4B	73.0%	71.7%	82%	82%	5.5s
6	Gemma 4 26B-A4B	71.0%	70.7%	87%	87%	1.4s
7	Qwen 3.6 35B-A3B	70.0%	70.7%	90%	90%	10.3s
8	Qwen 3.5 9B	70.0%	69.6%	90%	89%	0.91s
9	Gemma 4 E4B	68.0%	68.5%	87%	87%	4.5s
10	Nemotron Omni 30B	67.0%	68.5%	93%	93%	2.6s
11	DiffusionGemma 26B-A4B	67.0%	64.1%	79%	79%	0.35s
12	Gemma 4 E2B	63.0%	60.9%	76%	75%	4.0s
13	Gemma 4 12B Coder Fable5 Composer 2.5	61.0%	62.0%	94%	94%	1.9s
14	Cohere North-Mini-Code 1.0	59.0%	58.7%	93%	93%	3.2s
15	Browser-Use Qwen 30B-A3B Q4	43.0%	45.7%	93%	93%	0.48s
16	LFM 2.5	40.0%	38.0%	83%	83%	5.9s
17	Qwen 3.5 0.8B	37.0%	34.8%	90%	90%	0.45s
18	Qwen 3.5 2B	36.0%	34.8%	89%	89%	0.78s
19	VibeThinker 3B BF16	33.0%	32.6%	84%	81%	5.0s
20	Molmo2 8B	8.0%	0.0%	2%	1%	1.7s

This is the strongest version of the QAT result. In the Sonnet-reference table, Gemma 4 31B QAT ties the best all-case score from the earlier benchmark and takes the tiebreaker on the 92 prompts where Sonnet actually called a tool. It is only a small quality lead, but it comes with a huge latency gap: 0.55s median for QAT versus 10.2s for Qwen 3.6 27B.

The newer small and specialty runs also make more sense in this lens. DiffusionGemma remains a speed story, not a top Sonnet-alignment story. Gemma 4 12B Coder and North are highly parseable but less Sonnet-like in first-tool choice. VibeThinker stays near the bottom, which matches its own model-card warning about tool-calling and autonomous-agent use. Molmo2 8B is effectively not a text tool-calling planner in this harness.

Vision probe

The vision probe also passed.

Against the Google password verification screenshot, Gemma 4 31B QAT correctly identified:

the Google account password verification page,
the visible account text,
the password field and empty value,
the "Show password" checkbox,
the "Enter a password" error state.

Timing was also solid:

Vision probe metric	Value
TTFT	1.31s
Total time	3.50s
Prompt tokens	595
Completion tokens	152

One small miss: the model listed "Blockers: None" even though an authentication prompt is obviously a blocker for continuing. For WebBrain, that is the kind of detail a follow-up prompt tweak can usually tighten. The important part is that the endpoint accepted image input and produced a structured caption usable by the planner.

What this changes

For the current WebBrain local planner table, Gemma 4 31B QAT now belongs above the older Gemma 31B int4 run and just above Qwen 3.6 27B on strict first-action quality.

The broader leaderboard still has stronger action routers. MiniMax M2.7, Sonnet 4.6, and Qwen 3.6 35B-A3B still beat it on exact or name-only matching in the saved runs. But Gemma 4 31B QAT has the best balance we have seen from a local Gemma: high parseability, competitive first-action selection, working vision, and latency that keeps the browser loop feeling interactive.

The clean takeaway: this is not just a smaller file or a faster serving trick. In this harness, QAT made Gemma 4 31B a better browser-agent planner.

Written by Emre Sokullu. WebBrain is MIT-licensed and open on GitHub.