Degradation Harness — what it is

The core idea

Not two accuracy numbers — a statistical diff

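A large MoE deployment is non-deterministic even at temperature 0 (expert-routing jitter), so two identical runs disagree on ~35–40% of generated text. Comparing two headline scores such as "84.2 vs 83.9" is therefore uninformative. Instead, the harness aligns baseline and candidate item by item and runs McNemar's test on the flip asymmetry, that is, on the items where exactly one of the two runs is correct. Random noise flips items in both directions at roughly equal rates and cancels out; real degradation flips predominantly one way.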

fp16 baseline (cached once) + candidate quant (runs live) → align items → McNemar on flips → verdict
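The flip-asymmetry test can be sketched in a few lines. This is a minimal illustration of an exact (binomial) McNemar test on aligned per-item correctness, not the harness's own code; the function name `mcnemar_exact` and the boolean-list input format are assumptions made for the example.

```python
from math import comb

def mcnemar_exact(baseline_correct, candidate_correct):
    """Two-sided exact McNemar test on paired per-item correctness.

    Hypothetical helper: both inputs are lists of booleans, aligned so that
    index i refers to the same benchmark item in both runs.
    Returns (regressions, improvements, p_value).
    """
    if len(baseline_correct) != len(candidate_correct):
        raise ValueError("items must be aligned one-to-one")
    pairs = list(zip(baseline_correct, candidate_correct))
    # b: baseline right, candidate wrong (regressions)
    b = sum(1 for x, y in pairs if x and not y)
    # c: baseline wrong, candidate right (improvements)
    c = sum(1 for x, y in pairs if not x and y)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no flips at all: nothing to test
    # Under the null hypothesis each flip is a fair coin toss, so the
    # smaller flip count follows Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

Items that both runs get right, or both get wrong, never enter the statistic. Only the discordant pairs count, which is why symmetric run-to-run noise does not move the verdict.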

Design choices

What makes it different

🎯 Real operating point

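Samples at the official Qwen params rather than greedy decoding: thinking uses temperature 0.6 / top_p 0.95 / top_k 20, non-thinking uses temperature 0.7 / top_p 0.8 / top_k 20 plus presence penalty 1.5. These are the same params the quants are served at in production, so degradation is measured where the model actually runs.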
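As a config sketch, the two operating points could be written as plain dictionaries. This reads the slash-separated triples as temperature / top_p / top_k and "pp" as presence penalty, which is an assumption; the names `SAMPLING` and `sampling_for` are hypothetical and not taken from the harness.

```python
# Assumed reading of the card: temperature / top_p / top_k, "pp" = presence penalty.
SAMPLING = {
    "think": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "no_think": {
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "presence_penalty": 1.5,
    },
}

def sampling_for(think, seed=0):
    """Return a copy of the preset for the requested mode, with the seed attached."""
    params = dict(SAMPLING["think" if think else "no_think"])
    params["seed"] = seed
    return params
```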

📡 Tier-0 KL probe

The lead tripwire. Teacher-forced logprobs on fixed text are ~25× more reproducible than generation (top-1 agreement ≈ 0.985). Catches distributional collapse before any benchmark even runs.
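A rough sketch of what such a probe computes, assuming each position of the fixed text yields a token → logprob mapping from both the reference and the candidate. The names `position_kl` and `probe`, the input format, and the `floor` logprob for tokens the candidate did not report are all assumptions for illustration. With truncated top-k logprobs the KL value is only an approximation.

```python
import math

def position_kl(ref_logprobs, cand_logprobs, floor=-20.0):
    """Approximate KL(ref || cand) at one position.

    Sums only over tokens the reference reported; tokens missing from the
    candidate get an assumed floor logprob.
    """
    kl = 0.0
    for tok, lp in ref_logprobs.items():
        lq = cand_logprobs.get(tok, floor)
        kl += math.exp(lp) * (lp - lq)
    return kl

def probe(ref_positions, cand_positions):
    """Mean KL and top-1 agreement over aligned teacher-forced positions."""
    if len(ref_positions) != len(cand_positions) or not ref_positions:
        raise ValueError("need the same non-zero number of positions")
    pairs = list(zip(ref_positions, cand_positions))
    kls = [position_kl(r, c) for r, c in pairs]
    # Top-1 agreement: fraction of positions where both runs rank the same token first.
    agree = sum(max(r, key=r.get) == max(c, key=c.get) for r, c in pairs)
    return {"mean_kl": sum(kls) / len(kls), "top1_agreement": agree / len(kls)}
```

Because the text is fixed and nothing is sampled, both numbers depend only on the model's distribution at each position, not on which continuation a sampler happened to pick.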

🔒 Comparability contract

A candidate is only graded if tier, seed, think, samples, temp & max_tokens all match the cached baseline. Mismatch prints ⚠ NOT COMPARABLE instead of a misleading number.
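The contract amounts to an equality check over a fixed set of run-metadata keys. A minimal sketch, assuming the metadata is stored as a flat dictionary; the key spellings, `contract_mismatches`, and `grade_or_refuse` are hypothetical names chosen for the example.

```python
# Assumed key names, mirroring the fields listed in the contract.
CONTRACT_KEYS = ("tier", "seed", "think", "samples", "temp", "max_tokens")

def contract_mismatches(baseline_meta, candidate_meta):
    """Return the contract keys that are missing or differ between the two runs."""
    bad = []
    for key in CONTRACT_KEYS:
        if key not in baseline_meta or key not in candidate_meta:
            bad.append(key)  # a missing field is treated as a mismatch
        elif baseline_meta[key] != candidate_meta[key]:
            bad.append(key)
    return bad

def grade_or_refuse(baseline_meta, candidate_meta):
    """Refuse to grade when any contract field disagrees with the cached baseline."""
    bad = contract_mismatches(baseline_meta, candidate_meta)
    if bad:
        return "⚠ NOT COMPARABLE (mismatch: " + ", ".join(bad) + ")"
    return "comparable"
```

Treating a missing field as a mismatch is a deliberate choice in this sketch: it fails closed, so an older baseline file without a field cannot silently pass.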

💾 Reference cached once

The fp16 baseline is collected one time (on a rented box), committed to the repo, then every variant on your own hardware is a fast diff against disk — the expensive box is long gone.

🧠 Thinking is first-class

--think toggles reasoning. Off = fast, isolates knowledge + format. On = real reasoning behavior plus free signal (reasoning token-count). Math benches need it to score above zero.

🚨 Truncation as a canary

A bench's cap-hit rate climbing above the baseline's trunc% is itself a degradation signal — a damaged model loops and rambles into the cap instead of answering.
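One way to express the canary, assuming each completion carries an OpenAI-style `finish_reason` where `"length"` means the token cap was hit. The function names and the `margin` tolerance are assumptions for illustration; the source does not state how far above the baseline counts as an alarm.

```python
def cap_hit_rate(finish_reasons):
    """Fraction of completions that stopped because they hit max_tokens."""
    if not finish_reasons:
        return 0.0
    return sum(r == "length" for r in finish_reasons) / len(finish_reasons)

def truncation_canary(baseline_trunc, finish_reasons, margin=0.02):
    """Flag a bench whose cap-hit rate exceeds the baseline's by more than `margin`.

    `margin` is a hypothetical tolerance, not a value from the harness.
    """
    rate = cap_hit_rate(finish_reasons)
    return {
        "trunc": rate,
        "baseline_trunc": baseline_trunc,
        "alarm": rate > baseline_trunc + margin,
    }
```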

Cost vs. resolution

Three tiers

Standard is the default and the working baseline tier. It runs every cheap bench on its whole split and replaces the one expensive bench (mmlu_pro) with a stratified ~30% slice whose per-category results are reweighted back to a full-split estimate, at about 3× lower cost.
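The reweighting step is the standard stratified estimator: each category's sampled accuracy is weighted by that category's share of the full split. A minimal sketch with hypothetical names and input shapes; the category counts in the usage note are made up for illustration.

```python
def stratified_accuracy(sample_results, population_sizes):
    """Estimate full-split accuracy from a stratified sample.

    sample_results:   {category: (n_correct, n_sampled)}
    population_sizes: {category: number of items in the full split}
    """
    total = sum(population_sizes.values())
    estimate = 0.0
    for category, size in population_sizes.items():
        correct, sampled = sample_results[category]
        if sampled == 0:
            raise ValueError(f"no samples drawn for category {category!r}")
        # Weight each category by its share of the full split.
        estimate += (size / total) * (correct / sampled)
    return estimate
```

For example, with 300 math items and 100 law items in the full split, and 50 sampled from each scoring 45/50 and 25/50, the flat sample mean is 0.70 while the reweighted estimate is 0.75 × 0.9 + 0.25 × 0.5 = 0.80. When the slice is drawn in proportion to category size, the two coincide; the weights matter when it is not.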

Tier	Runtime	Sample	Resolves
smoke	10–20 min	n = 150	Catches catastrophic failures + large degradation.
standard (default)	~3 hrs	cheap = full · mmlu_pro = 3600 strat.	Moderate degradation + per-category damage. Better category resolution than a flat sample.
full	~5.5 hrs	whole split	Only earns its keep resolving sub-1% absolute mmlu_pro differences.

The yardstick · collected 2026-06-03

fp16-397B canonical baseline

The true full-precision reference every candidate is graded against. Collected once on a one-time vast rental, then the box was torn down. The table below lists the fp16 accuracy; Δ compares against the GPTQ-Int4 dev reference.

Qwen3.5-397B-A17B · bf16

tier standard · seed 0 · official sampling · c=64

4× B300 SXM6 · vLLM 0.22 · TP=4
~752 GB weights · ~2000 tok/s
~2.5 h · ≈ $80 total

Bench	fp16	Δ vs Int4 dev	mode
gsm8k	0.972	+.003	no-think
mmlu_pro	0.860	−.005	no-think
gpqa	0.806	+.020	no-think
ifeval	0.904	−.012	no-think
humaneval	0.939	.000	no-think
aime_2026	0.967	.000	think · maj@4
hmmt_feb_2026	0.667	+.031	think · maj@4
tradequiz	0.620	−.002	no-think

Headline: fp16 ≈ GPTQ-Int4 on every bench — all within noise. The 4-bit quant is an essentially lossless stand-in, which retroactively validates every degradation diff ever run against the Int4 dev reference.

Beyond the standard battery

Two custom judgment benches

Multi-turn, infra-coupled benches that don't fit the one-shot chat()→score() contract. Run locally against the exposed endpoint.

GODMODE

90/90

Grounded reasoning, 3 scenarios × 30 questions. The model writes bash into a sandbox against a price parquet, up to 30 turns. Ceiling is ~89–90 (one stable denominator-convention miss). The bigger finding: apparent fp16 nondeterminism came from the serving config, not the model. Outputs are bit-exact once chunked prefill is off and batch size is 1.

trader-traits

96%

v3, 39 scenarios across 6 traits, think-off, 3 seeds, blind Opus judge (98% agreement). Per-trait gate: T1 100 · T2 100 · T3 89 · T4 100 · T5 80 · T6 100. Identical to the Int4 stand-in — lossless on the judgment exam too.

The standard battery

What's measured

Knowledge & format: gsm8k · mmlu_pro · gpqa (gated) · ifeval
Code: humaneval (sandboxed exec)
Math reasoning: aime_2026 · hmmt_feb_2026 (maj@4, thinking)
Domain: tradequiz (trading/exchange certification)
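For the math benches, maj@4 scores an item as correct when the most common of four sampled final answers matches the reference. A minimal sketch, assuming answers are already extracted and normalized to strings, with `None` for a sample that produced no parseable answer. The function name and the tie-break (first answer seen wins) are assumptions, not the harness's documented behavior.

```python
from collections import Counter

def maj_at_k(answers, reference):
    """Score one item by majority vote over k sampled answers.

    answers:   list of extracted final answers (None = no answer parsed)
    reference: the ground-truth answer in the same normalized form
    """
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return False  # every sample failed to produce an answer
    # most_common(1) returns the highest-count answer; on a tie the answer
    # that appeared first among the samples is kept (assumed tie-break).
    winner, _count = votes.most_common(1)[0]
    return winner == reference
```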