A measurement harness that detects whether a quantized, pruned, or distilled model has degraded against its full-precision baseline — built for sensitivity, stability, and speed, not for reproducing a leaderboard.
A served large MoE model is non-deterministic even at temperature 0 (expert-routing jitter), so two identical runs disagree on ~35–40% of generated text. Comparing headline scores such as "84.2 vs 83.9" is therefore not enough. Instead, the harness aligns items across runs and applies McNemar's test to the flip asymmetry: random noise flips items in both directions at roughly equal rates and cancels out, while real degradation flips them predominantly one way.
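As a minimal sketch of the flip-asymmetry idea, the snippet below runs an exact two-sided McNemar test on per-item correctness flags. The function name `mcnemar_exact` and the list-of-booleans input are assumptions made for illustration; the harness's own implementation may differ (for example, it may use a chi-square approximation).

```python
from math import comb

def mcnemar_exact(baseline_correct, candidate_correct):
    """Exact two-sided McNemar test on item-aligned correctness flags.

    Returns (b, c, p_value), where
      b = items the baseline got right and the candidate got wrong,
      c = items the baseline got wrong and the candidate got right.
    Items where both runs agree carry no information and are ignored.
    """
    if len(baseline_correct) != len(candidate_correct):
        raise ValueError("runs must be aligned item-for-item")
    b = sum(1 for x, y in zip(baseline_correct, candidate_correct) if x and not y)
    c = sum(1 for x, y in zip(baseline_correct, candidate_correct) if not x and y)
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under the null hypothesis each flip is a fair coin, so b ~ Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Symmetric flips (pure noise): 5 one way, 5 the other.
noise = mcnemar_exact([True] * 5 + [False] * 5, [False] * 5 + [True] * 5)
# One-directional flips: 10 items lost, none gained.
damage = mcnemar_exact([True] * 10, [False] * 10)
```

With symmetric flips the p-value is 1.0; with ten one-directional flips it drops to 2/1024, even though both cases change the same number of items.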
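Samples at the official Qwen parameters rather than greedy decoding: thinking mode uses
temperature 0.6, top_p 0.95, top_k 20; non-thinking mode uses temperature 0.7, top_p 0.8,
top_k 20, plus presence penalty 1.5. These are the same parameters the quants are served
with in production, so degradation is measured where the model actually runs.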
The lead tripwire. Teacher-forced logprobs on fixed text are ~25× more reproducible than generation (top-1 ≈0.985). Catches distributional collapse before any benchmark even runs.
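One way to read the top-1 figure is as the fraction of positions where two teacher-forced passes over the same text pick the same most-likely token. The sketch below assumes each pass has already been reduced to a list of top-1 token ids and a list of logprobs for the reference tokens; both helper names are hypothetical and not taken from the harness.

```python
def top1_agreement(top1_a, top1_b):
    """Fraction of positions where both passes rank the same token first."""
    if len(top1_a) != len(top1_b) or not top1_a:
        raise ValueError("passes must cover the same non-empty token positions")
    return sum(a == b for a, b in zip(top1_a, top1_b)) / len(top1_a)

def mean_abs_logprob_gap(logprobs_a, logprobs_b):
    """Mean absolute difference of reference-token logprobs between passes."""
    if len(logprobs_a) != len(logprobs_b) or not logprobs_a:
        raise ValueError("passes must cover the same non-empty token positions")
    return sum(abs(a - b) for a, b in zip(logprobs_a, logprobs_b)) / len(logprobs_a)

agreement = top1_agreement([11, 42, 7, 99], [11, 42, 7, 13])
gap = mean_abs_logprob_gap([-0.1, -0.5], [-0.2, -0.5])
```

Because the text is fixed, both metrics compare like with like at every position, which is why this kind of check is far less noisy than comparing free-running generations.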
A candidate is only graded if tier, seed, think,
samples, temp & max_tokens all match the cached
baseline. Mismatch prints ⚠ NOT COMPARABLE instead of a misleading number.
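The guard can be pictured as a key-by-key comparison of run configs before any delta is reported. This is a sketch under assumptions: the config is taken to be a flat dict, and `mismatched_keys` and `grade_or_refuse` are illustrative names, not the harness's API.

```python
# Keys that must match between candidate and cached baseline (from the text above).
COMPARABILITY_KEYS = ("tier", "seed", "think", "samples", "temp", "max_tokens")

def mismatched_keys(baseline_cfg, candidate_cfg):
    """Return the comparability keys whose values differ between the two runs."""
    return [k for k in COMPARABILITY_KEYS if baseline_cfg.get(k) != candidate_cfg.get(k)]

def grade_or_refuse(baseline_cfg, candidate_cfg, delta):
    """Report the delta only when the runs are comparable."""
    bad = mismatched_keys(baseline_cfg, candidate_cfg)
    if bad:
        return "⚠ NOT COMPARABLE (mismatch: " + ", ".join(bad) + ")"
    return f"delta = {delta:+.3f}"

baseline = {"tier": "standard", "seed": 0, "think": False,
            "samples": 1, "temp": 0.7, "max_tokens": 4096}
same = dict(baseline)
other_seed = dict(baseline, seed=1)

ok_line = grade_or_refuse(baseline, same, -0.005)
refused_line = grade_or_refuse(baseline, other_seed, -0.005)
```

The values in `baseline` are placeholders for the example, not the harness's actual settings.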
The fp16 baseline is collected one time (on a rented box), committed to the repo, then every variant on your own hardware is a fast diff against disk — the expensive box is long gone.
--think toggles reasoning. Off = fast, isolates knowledge + format.
On = real reasoning behavior plus free signal (reasoning token-count). Math benches need
it to score above zero.
A bench's cap-hit rate climbing above the baseline's trunc% is itself a degradation signal — a damaged model loops and rambles into the cap instead of answering.
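A rough sketch of that check: compute the share of completions that stopped because they hit `max_tokens`, and flag the candidate when it exceeds the baseline by more than a margin. The `"length"` finish reason follows the common OpenAI-style convention and is an assumption about the endpoint; the `margin` default is an arbitrary illustrative value, not a threshold from the harness.

```python
def cap_hit_rate(finish_reasons):
    """Share of completions truncated by the token cap."""
    if not finish_reasons:
        raise ValueError("need at least one completion")
    return sum(1 for r in finish_reasons if r == "length") / len(finish_reasons)

def cap_hit_regressed(baseline_reasons, candidate_reasons, margin=0.02):
    """True when the candidate truncates noticeably more often than the baseline."""
    return cap_hit_rate(candidate_reasons) > cap_hit_rate(baseline_reasons) + margin

baseline_reasons = ["stop"] * 98 + ["length"] * 2    # 2% truncated
candidate_reasons = ["stop"] * 90 + ["length"] * 10  # 10% truncated
flagged = cap_hit_regressed(baseline_reasons, candidate_reasons)
```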
Standard is the default and the working baseline tier — it runs every cheap bench on its whole split and replaces the one expensive bench (mmlu_pro) with a stratified ~30% slice that reweights back to the full number, for ~3× less cost.
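The reweighting step can be sketched as follows: sample a fixed fraction within each category, score the slice per category, then weight each category's accuracy by its share of the full split. The function names, the `(item_id, category)` input shape, and per-category sampling with a seeded RNG are assumptions for illustration; the harness's actual stratification scheme is not specified here.

```python
import random
from collections import defaultdict

def stratified_slice(items, fraction, seed=0):
    """Pick ~fraction of item ids from every category.

    items: iterable of (item_id, category) pairs.
    Returns (picked_ids_by_category, full_size_by_category).
    """
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for item_id, cat in items:
        by_cat[cat].append(item_id)
    picked = {cat: rng.sample(ids, max(1, round(len(ids) * fraction)))
              for cat, ids in by_cat.items()}
    sizes = {cat: len(ids) for cat, ids in by_cat.items()}
    return picked, sizes

def reweighted_accuracy(correct_by_cat, full_sizes):
    """Estimate full-split accuracy from per-category slice results."""
    total = sum(full_sizes.values())
    return sum(full_sizes[cat] / total * (sum(flags) / len(flags))
               for cat, flags in correct_by_cat.items())

items = [(i, "math") for i in range(10)] + [(100 + i, "law") for i in range(20)]
picked, sizes = stratified_slice(items, 0.3)

# Slice results: math 3/3 correct, law 1/2 correct; full split is 100 math, 300 law.
estimate = reweighted_accuracy({"math": [True, True, True], "law": [True, False]},
                               {"math": 100, "law": 300})
```

In the last call the estimate is 0.25 × 1.0 + 0.75 × 0.5 = 0.625, so a category that is over- or under-sampled in the slice still contributes in proportion to its size in the full split.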
The true full-precision reference every candidate is graded against. Collected once on a one-time vast rental, then the box was torn down. Bars below show the fp16 accuracy; Δ compares against the GPTQ-Int4 dev reference.
standard · seed 0 · official sampling · c=64

| Bench | fp16 accuracy | Δ vs Int4 dev | mode |
|---|---|---|---|
| gsm8k | 0.972 | +.003 | no-think |
| mmlu_pro | 0.860 | −.005 | no-think |
| gpqa | 0.806 | +.020 | no-think |
| ifeval | 0.904 | −.012 | no-think |
| humaneval | 0.939 | .000 | no-think |
| aime_2026 | 0.967 | .000 | think · maj@4 |
| hmmt_feb_2026 | 0.667 | +.031 | think · maj@4 |
| tradequiz | 0.620 | −.002 | no-think |
Multi-turn, infra-coupled benches that don't fit the one-shot
chat()→score() contract. Run locally against the exposed endpoint.
Grounded reasoning, 3 scenarios × 30 Q. The model writes bash into a sandbox against a price parquet, up to 30 turns. Ceiling is ~89–90 (one stable denominator-convention miss). The bigger finding: apparent fp16 nondeterminism was serving config, not the model — bit-exact once chunked-prefill is off and bs=1.
v3, 39 scenarios across 6 traits, think-off, 3 seeds, blind Opus judge (98% agreement). Per-trait gate: T1 100 · T2 100 · T3 89 · T4 100 · T5 80 · T6 100. Identical to the Int4 stand-in — lossless on the judgment exam too.