Finding · local inference

Dense is byte-reproducible; MoE flips at near-ties

I was tuning a Qwen3.5-397B-A17B deployment on local hardware, chasing decode speed. I ended up chasing something else entirely: why the same benchmark, run twice at temperature 0, never spent the same number of tokens twice. Here is what I found about determinism in quantized Mixture-of-Experts inference — a result I haven’t seen written down anywhere.

The symptom

My benchmark is a chartist answering 30 questions about a price chart across an agentic, multi-turn workflow — three scenarios on different timeframes. On a 4-bit (NVFP4) 397B quant the model nailed the task every time: 99–100% on the bench. But the token spend was all over the place run to run, at temperature 0, single stream. Same prompt, same weights, different trajectory.

I did the obvious thing first and blamed the serving stack. I stripped SGLang parameters one at a time. Still nondeterministic. Official SGLang — same. vLLM — same. A 6-bit llama.cpp build, by contrast, was byte-identical every run. So I was about to write this off as “vLLM/SGLang are nondeterministic by design” (both even ship flags that claim to force determinism) and move on.

Then I took a break, and realised I’d only changed the serving stack. I hadn’t changed the model.

The actual variable: architecture × precision

So I swept it properly. The pattern that fell out:

model                     | precision         | what’s quantized | bit-exact | divergence
27B dense                 | fp16              | -                | 3/3       | none
27B dense                 | AWQ-INT4          | all              | 3/3       | none
264B-REAP (prune of 397B) | FP8, experts only | experts          | 6/6       | none observed
122B MoE                  | bf16              | -                | 2/3       | low-rate near-tie
122B MoE                  | FP8 (official)    | experts + more   | 2/3       | genuine near-tie flip
122B MoE                  | NVFP4             | all              | 0/3       | first-token-block flip
397B MoE                  | NVFP4             | all              | 0/3       | first-token-block flip

Two clean conclusions:

  1. Dense models are byte-reproducible at every precision I tried. fp16, fp8, AWQ-INT4, NVFP4: all 3/3 identical (the table lists only the fp16 and AWQ-INT4 rows). Quantization changes the trajectory (a 4-bit model is terser) but adds no run-to-run noise.
  2. MoE models flip — and the quantization aggressiveness is the rate dial. bf16/fp8 MoE flips rarely (it needs to hit a genuine decision point). NVFP4 MoE flips constantly — in the very first block of generated tokens, every scenario, before any tool even runs.
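
A minimal sketch of how repeated temperature-0 runs can be compared for byte-identity, and where they first part ways. It is not the harness used for the table above; `compare_runs` and `first_divergence` are hypothetical names, and the runs are assumed to be token-id lists (or raw strings) collected from the same prompt.

```python
from hashlib import sha256


def first_divergence(a, b):
    """Return the index of the first differing token, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One run is a strict prefix of the other: they diverge where the shorter ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def compare_runs(runs):
    """Compare every run against the first one.

    `runs` is a list of token-id lists (or strings) from repeated
    temperature-0 generations of the same prompt.
    """
    reference = runs[0]
    # Short digests make it easy to eyeball which runs match in a log.
    digests = [sha256(repr(list(r)).encode()).hexdigest()[:12] for r in runs]
    divergences = [first_divergence(reference, r) for r in runs[1:]]
    return {
        "bit_exact": all(d is None for d in divergences),
        "digests": digests,
        "first_divergence": divergences,
    }
```

The position of the first divergence is what separates the two behaviours described above: an index inside the first block of generated tokens points at the NVFP4-style flip, while a late index after a long identical prefix looks like a rare near-tie.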

The mechanism

The fused MoE expert path (top-k gate → grouped GEMM → weighted expert-combine reduction) is sensitive in two ways: its floating-point sums are non-associative, so the result depends on accumulation order, and its top-k selection is a discrete choice that a near-tie can tip either way. It’s the one path a dense model never runs. Whether a given run flips depends on whether it hits a near-tie along that path, and precision sets how often that happens:

  • dense — no such path, so rate 0 at any precision.
  • bf16 / fp8 MoE — low rate. You need a real near-tie: an actual decision point in the trajectory. Most generation stays byte-identical; the occasional tie tips.
  • nvfp4 MoE — high rate. Quant noise tips a tie in the first token block, every time, regardless of trajectory.
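
The two ingredients can be shown in a toy example: summing the same numbers in a different order gives a different float, and a discrete choice sitting on a near-tie turns that last-bit difference into a different pick. This is an illustration in plain Python float64, not the real kernel; the contribution values, the `rival` logit, and the helper names are made up for the demonstration.

```python
def combine(contributions):
    """Sum weighted expert outputs strictly left to right."""
    total = 0.0
    for c in contributions:
        total += c
    return total


def argmax(values):
    """Index of the largest value; the lowest index wins an exact tie."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical weighted expert contributions to the score of candidate 1.
experts = [0.1, 0.2, 0.3]
# Candidate 0 has a competing score sitting right at the near-tie.
rival = 0.6

order_a = combine(experts)            # (0.1 + 0.2) + 0.3
order_b = combine(reversed(experts))  # (0.3 + 0.2) + 0.1

pick_a = argmax([rival, order_a])
pick_b = argmax([rival, order_b])
print(order_a == order_b, pick_a, pick_b)  # prints: False 1 0
```

Same inputs, same arithmetic, different accumulation order, different winner. A kernel that fixes its reduction order always lands on the same side of the tie; one that lets the order vary between runs does not.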

A useful control: I took a 35%-REAP prune of the 397B and quantized only the experts to FP8, leaving everything else fp16. Six runs, byte-identical. That’s suggestive — but honest caveat: those trajectories were short and clean and may simply never have hit a near-tie, rather than being immune. N=3 per corner (6 for the REAP model); the rates here are qualitative, not measured.
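
To put a number on how little a handful of clean runs proves, here is a rough bound, assuming (a simplification) that each run flips independently with the same probability. `flip_rate_upper_bound` is a hypothetical helper, not something measured in this sweep.

```python
def flip_rate_upper_bound(n_runs, confidence=0.95):
    """Upper bound on the per-run flip probability after n_runs with no flip.

    If each run flips independently with probability p, the chance of
    seeing zero flips in n_runs is (1 - p) ** n_runs. Setting that equal to
    1 - confidence and solving for p gives the bound.
    """
    if n_runs <= 0:
        raise ValueError("n_runs must be positive")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_runs)


for n in (3, 6, 30, 300):
    print(n, round(flip_rate_upper_bound(n), 3))
```

Under that assumption, six clean runs only rule out flip rates above roughly 39%, and three clean runs only rule out rates above roughly 63%. That is why "none observed" is the right label for the REAP row, not "none".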

One trap worth flagging: harness-injected nondeterminism

Early on, one scenario showed a dramatic 47% token-count swing that looked like violent MoE nondeterminism. It wasn’t. My agentic harness ran ls -la in the model’s workspace, and the long-format listing carried the live mtime of a file the model had just written. That wall-clock timestamp changes run to run, lands in the model’s context, and the agentic recovery loop amplifies it. One leaked timestamp, blown up into a cascade.

After stripping volatile fields, the swing collapsed from 47% to 2.4% — but the underlying model still flipped at one genuine decision point. The lesson: any agentic eval that feeds ls -l, timestamps, PIDs, or other volatile host state into the context will manufacture “nondeterminism” the loop then amplifies. Strip those before you judge the model.
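
A minimal sketch of that stripping step, applied to tool output before it reaches the model's context. The patterns below are assumptions about what volatile state looks like (long-format `ls` mtimes, ISO-style timestamps, PIDs), not the exact filter used in this harness, and `scrub` is a hypothetical name.

```python
import re

# Assumed patterns for volatile host state; extend these for your own tools.
_MONTHS = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
_VOLATILE = [
    # `ls -l` mtime: "Jan  5 12:34" (recent file) or "Jan  5  2024" (older file)
    (re.compile(rf"\b{_MONTHS}\s+\d{{1,2}}\s+(?:\d{{2}}:\d{{2}}|\d{{4}})\b"), "<MTIME>"),
    # ISO-8601 style timestamps, e.g. 2025-01-05T12:34:56Z or 2025-01-05 12:34:56.123
    (
        re.compile(
            r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?"
        ),
        "<TIMESTAMP>",
    ),
    # Process ids as printed by many tools, e.g. "pid 4242" or "pid=4242"
    (re.compile(r"\bpid[ =:]\s*\d+\b", re.IGNORECASE), "pid=<PID>"),
]


def scrub(tool_output: str) -> str:
    """Replace volatile fields with stable placeholders before the model sees them."""
    for pattern, placeholder in _VOLATILE:
        tool_output = pattern.sub(placeholder, tool_output)
    return tool_output
```

Two listings that differ only in mtime now scrub to the same string. File sizes are deliberately left alone: they reflect what the model actually wrote, so a difference there is real signal rather than host noise.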

Practical takeaways

  • Want byte-reproducibility? Use a dense model (any precision) — or accept that any MoE on these kernels can flip at a near-tie.
  • bf16/fp8 is not a determinism fix for an MoE. It lowers the flip rate vs NVFP4; it does not zero it.
  • NVFP4 MoE is the worst case — it flips in the first token block.
  • The fully deterministic option remains kernels with a fixed reduction order, as in the 6-bit llama.cpp build that was byte-identical every run. The kernel’s reduction order is the real lever, not the bit-width.

The open question I’m chasing next: on 4× Blackwell, does a 35%-REAP + FP8-experts 397B beat the 4-bit quants on reliability and quality, not just determinism? That’s the REAP work.


Setup: faith, 4× RTX 6000 Pro (Blackwell), vLLM, TP=4, temp 0, think-off. Models: Qwen3.6-27B dense; Qwen3.5-122B-A10B MoE; my Qwen3.5-264B-A17B REAP-35% FP8-experts-only; nvidia Qwen3.5-397B-A17B-NVFP4.