Degradation Harness

How much did the model actually get worse?

A measurement harness that detects whether a quantized, pruned, or distilled model has degraded against its full-precision baseline — built for sensitivity, stability, and speed, not for reproducing a leaderboard.

baseline · Qwen3.5-397B-A17B fp16
8 core benches + 2 custom
McNemar · KL probe · official sampling

The core idea

Not two accuracy numbers — a statistical diff

A large MoE deployment is non-deterministic even at temperature 0 (expert-routing jitter), so two identical runs disagree on ~35–40% of generated text. Comparing two accuracy numbers such as "84.2 vs 83.9" therefore tells you little. Instead, the harness aligns items between the two runs and applies McNemar's test to the flip asymmetry: random noise flips items in both directions at roughly equal rates and cancels out, while real degradation flips predominantly one way.

Pipeline: fp16 baseline (✓ cached once) + candidate quant (runs live) → align items → McNemar on flips → verdict
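
The flip test can be sketched in a few lines. This is a minimal illustration using the exact (binomial) form of McNemar's test on paired per-item correctness, not the harness's own code; the function name and input shape are assumptions.

```python
from math import comb

def mcnemar_exact(baseline_correct, candidate_correct):
    """Exact two-sided McNemar test on paired per-item correctness flags.

    Returns (regressions, gains, p_value). Only items whose outcome differs
    between the two runs (the "flips") carry information.
    """
    if len(baseline_correct) != len(candidate_correct):
        raise ValueError("runs must be aligned item-for-item")
    pairs = list(zip(baseline_correct, candidate_correct))
    regressions = sum(1 for b, c in pairs if b and not c)  # right -> wrong
    gains = sum(1 for b, c in pairs if c and not b)        # wrong -> right
    n = regressions + gains
    if n == 0:
        return regressions, gains, 1.0
    # Under the null hypothesis every flip is a fair coin toss, so the
    # smaller flip count follows Binomial(n, 0.5); double the tail, cap at 1.
    k = min(regressions, gains)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return regressions, gains, min(1.0, 2 * tail)
```

Eight regressions against zero gains gives p ≈ 0.008, while four flips each way gives p = 1.0, even though both cases have the same number of disagreements.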
Design choices

What makes it different

🎯 Real operating point

Samples at the official Qwen params, not greedy — thinking 0.6/0.95/20, non-thinking 0.7/0.8/20 + pp1.5. The same params the quants are served at in production, so degradation is measured where the model actually runs.
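
Written out as configuration, the two operating points might look like the sketch below. It assumes the shorthand reads as temperature / top_p / top_k and that "pp" stands for presence penalty; the dictionary and helper names are hypothetical, and the keys follow the field names common to vLLM-style sampling parameters.

```python
# Assumed expansion of "0.6/0.95/20" and "0.7/0.8/20 + pp1.5":
# temperature / top_p / top_k, with pp = presence_penalty.
SAMPLING = {
    "think": {"temperature": 0.6, "top_p": 0.95, "top_k": 20},
    "no_think": {
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "presence_penalty": 1.5,
    },
}

def sampling_params(think: bool) -> dict:
    """Return a copy of the sampling settings for the requested mode."""
    return dict(SAMPLING["think" if think else "no_think"])
```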

📡 Tier-0 KL probe

The lead tripwire. Teacher-forced logprobs on fixed text are ~25× more reproducible than generation (top-1 ≈0.985). Catches distributional collapse before any benchmark even runs.
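
One way such a probe could be scored is sketched below. It assumes each model returns, for every teacher-forced position, a dictionary of top-k token logprobs. The KL is computed over the baseline's returned tokens after renormalizing, and tokens the candidate did not return get a floor logprob; the floor value, function name, and input shape are all illustrative assumptions.

```python
import math

def kl_probe(base_positions, cand_positions, floor_logprob=-20.0):
    """Mean KL(baseline || candidate) and top-1 agreement over fixed text.

    Each argument is a list with one non-empty {token: logprob} dict per
    position.
    """
    if not base_positions or len(base_positions) != len(cand_positions):
        raise ValueError("need the same non-empty positions from both models")
    total_kl, top1_hits = 0.0, 0
    for base, cand in zip(base_positions, cand_positions):
        # Restrict the candidate to the baseline's support, flooring misses.
        cand_on_support = {tok: cand.get(tok, floor_logprob) for tok in base}
        # Renormalize both truncated distributions so each sums to 1.
        log_z_base = math.log(sum(math.exp(lp) for lp in base.values()))
        log_z_cand = math.log(sum(math.exp(lp) for lp in cand_on_support.values()))
        for tok, lp in base.items():
            log_p = lp - log_z_base
            log_q = cand_on_support[tok] - log_z_cand
            total_kl += math.exp(log_p) * (log_p - log_q)
        # Top-1 agreement: do both models rank the same token first?
        top1_hits += max(base, key=base.get) == max(cand, key=cand.get)
    n = len(base_positions)
    return total_kl / n, top1_hits / n
```

A collapsed candidate shows up as a jump in mean KL and a drop in top-1 agreement well before task accuracy is measured.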

🔒 Comparability contract

A candidate is only graded if tier, seed, think, samples, temp & max_tokens all match the cached baseline. Mismatch prints ⚠ NOT COMPARABLE instead of a misleading number.
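
A minimal sketch of such a gate follows, assuming each run stores its settings as a flat metadata dictionary keyed by the six field names listed above. The function names and the metadata layout are hypothetical.

```python
CONTRACT_KEYS = ("tier", "seed", "think", "samples", "temp", "max_tokens")
_MISSING = object()  # sentinel so an absent key never counts as a match

def contract_mismatches(baseline_meta, candidate_meta):
    """Return the contract fields on which the two runs differ."""
    return [
        key
        for key in CONTRACT_KEYS
        if baseline_meta.get(key, _MISSING) is _MISSING
        or baseline_meta[key] != candidate_meta.get(key, _MISSING)
    ]

def verdict_header(baseline_meta, candidate_meta):
    """Refuse to grade when any contract field differs."""
    bad = contract_mismatches(baseline_meta, candidate_meta)
    if bad:
        return "⚠ NOT COMPARABLE (mismatch: " + ", ".join(bad) + ")"
    return "comparable"
```

Treating a missing field as a mismatch keeps an incomplete metadata file from silently passing the gate.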

💾 Reference cached once

The fp16 baseline is collected one time (on a rented box), committed to the repo, then every variant on your own hardware is a fast diff against disk — the expensive box is long gone.

🧠 Thinking is first-class

--think toggles reasoning. Off = fast, isolates knowledge + format. On = real reasoning behavior plus free signal (reasoning token-count). Math benches need it to score above zero.

🚨 Truncation as a canary

A bench's cap-hit rate climbing above the baseline's trunc% is itself a degradation signal — a damaged model loops and rambles into the cap instead of answering.
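
The canary could be checked with a one-sided binomial tail, as in the sketch below. It treats the baseline trunc% as a fixed, known rate and asks how surprising the candidate's cap-hit count would be under it. The `alpha` threshold, the clamping epsilon, and the function name are assumptions for illustration.

```python
import math

def truncation_alarm(cand_truncated, cand_total, baseline_rate, alpha=0.01):
    """Return (candidate_rate, p_value, alarm) for the cap-hit canary.

    p_value is P(X >= cand_truncated) for X ~ Binomial(cand_total, baseline_rate),
    computed in log space so large benches do not overflow.
    """
    if not 0 <= cand_truncated <= cand_total or cand_total == 0:
        raise ValueError("need 0 <= cand_truncated <= cand_total, cand_total > 0")
    # Clamp away from 0 and 1 so the logarithms below stay finite (assumption).
    p0 = min(max(baseline_rate, 1e-9), 1 - 1e-9)
    n = cand_total

    def log_pmf(i):
        return (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p0) + (n - i) * math.log1p(-p0)
        )

    tail = sum(math.exp(log_pmf(i)) for i in range(cand_truncated, n + 1))
    p_value = min(1.0, tail)
    return cand_truncated / n, p_value, p_value < alpha
```

With a 2% baseline rate, 2 cap hits in 100 items is unremarkable, while 20 in 100 trips the alarm.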

Cost vs. resolution

Three tiers

Standard is the default and the working baseline tier — it runs every cheap bench on its whole split and replaces the one expensive bench (mmlu_pro) with a stratified ~30% slice that reweights back to the full number, for ~3× less cost.

smoke · 10–20 min · n = 150
Catches catastrophic failures + large degradation.

standard (default) · ~3 hrs · cheap = full · mmlu_pro = 3600 strat.
Moderate degradation + per-category damage. Better category resolution than a flat sample.

full · ~5.5 hrs · whole split
Only earns its keep resolving sub-1% absolute mmlu_pro differences.
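
The stratified slice used by the standard tier, and the reweighting back to a full-split estimate, can be sketched as follows. This is an illustration under assumed names and input shapes, not the harness's own sampler: items are `(item_id, category)` pairs, and each category's sampled accuracy is weighted by its share of the full split.

```python
import random
from collections import defaultdict

def stratified_slice(items, fraction, seed=0):
    """Sample roughly `fraction` of each category; items are (item_id, category).

    Returns (sample_by_category, full_size_by_category).
    """
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for item_id, category in items:
        by_cat[category].append(item_id)
    sample = {}
    for category, ids in sorted(by_cat.items()):
        k = max(1, round(len(ids) * fraction))  # keep every category represented
        sample[category] = rng.sample(ids, k)
    full_sizes = {category: len(ids) for category, ids in by_cat.items()}
    return sample, full_sizes

def reweighted_accuracy(correct_by_cat, sampled_by_cat, full_by_cat):
    """Estimate full-split accuracy from per-category sampled accuracy."""
    total = sum(full_by_cat.values())
    return sum(
        (full_by_cat[c] / total) * (correct_by_cat[c] / sampled_by_cat[c])
        for c in full_by_cat
    )
```

For example, 27/30 correct in a 300-item category and 15/30 correct in a 100-item category reweights to 0.80, whereas the flat sample accuracy would be 42/60 = 0.70.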
The yardstick · collected 2026-06-03

fp16-397B canonical baseline

The true full-precision reference every candidate is graded against. Collected once on a rented vast box, which was then torn down. The figures below show fp16 accuracy; Δ compares against the GPTQ-Int4 dev reference.

Qwen3.5-397B-A17B · bf16
tier standard · seed 0 · official sampling · c=64
4× B300 SXM6 · vLLM 0.22 · TP=4
~752 GB weights · ~2000 tok/s
~2.5 h · ≈ $80 total
Bench            fp16 accuracy   Δ vs Int4 dev   mode
gsm8k            0.972           +.003           no-think
mmlu_pro         0.860           −.005           no-think
gpqa             0.806           +.020           no-think
ifeval           0.904           −.012           no-think
humaneval        0.939           .000            no-think
aime_2026        0.967           .000            think · maj@4
hmmt_feb_2026    0.667           +.031           think · maj@4
tradequiz        0.620           −.002           no-think

Headline: fp16 ≈ GPTQ-Int4 on every bench — all within noise. The 4-bit quant is an essentially lossless stand-in, which retroactively validates every degradation diff ever run against the Int4 dev reference.
Beyond the standard battery

Two custom judgment benches

Multi-turn, infra-coupled benches that don't fit the one-shot chat() → score() contract. Run locally against the exposed endpoint.

GODMODE

90/90

Grounded reasoning, 3 scenarios × 30 Q. The model writes bash into a sandbox against a price parquet, up to 30 turns. Ceiling is ~89–90 (one stable denominator-convention miss). The bigger finding: apparent fp16 nondeterminism was serving config, not the model — bit-exact once chunked-prefill is off and bs=1.

trader-traits

96%

v3, 39 scenarios across 6 traits, think-off, 3 seeds, blind Opus judge (98% agreement). Per-trait gate: T1 100 · T2 100 · T3 89 · T4 100 · T5 80 · T6 100. Identical to the Int4 stand-in — lossless on the judgment exam too.

The standard battery

What's measured

Knowledge & format: gsm8k · mmlu_pro · gpqa (gated) · ifeval
Code: humaneval (sandboxed exec)
Math reasoning: aime_2026 · hmmt_feb_2026 (maj@4, thinking)
Domain: tradequiz (trading/exchange certification)
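
For the maj@4 math benches, scoring by majority vote can be sketched as below. It assumes answers have already been extracted and normalized to strings, with `None` for a sample that produced no parseable answer; the function names and the tie-breaking rule (first-seen answer wins) are assumptions.

```python
from collections import Counter

def majority_answer(answers):
    """Most common non-empty answer among the samples, or None if there is none.

    Ties go to the answer that appeared first, which is how Counter.most_common
    orders equal counts.
    """
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None

def maj_at_k_accuracy(samples_per_item, gold):
    """Fraction of items whose majority answer matches the gold answer."""
    if len(samples_per_item) != len(gold) or not gold:
        raise ValueError("need one non-empty list of samples per gold answer")
    hits = sum(majority_answer(s) == g for s, g in zip(samples_per_item, gold))
    return hits / len(gold)
```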