Dense is byte-reproducible; MoE flips at near-ties

I was tuning a Qwen3.5-397B-A17B deployment on local hardware, chasing decode speed. I ended up chasing something else entirely: why the same benchmark, run twice at temperature 0, never spent the same number of tokens twice. Here is what I found about determinism in quantized Mixture-of-Experts inference — a result I haven’t seen written down anywhere.

The symptom

My benchmark is a chartist answering 30 questions about a price chart across an agentic, multi-turn workflow — three scenarios on different timeframes. On a 4-bit (NVFP4) 397B quant the model nailed the task every time: 99–100% on the bench. But the token spend was all over the place run to run, at temperature 0, single stream. Same prompt, same weights, different trajectory.

I did the obvious thing first and blamed the serving stack. I stripped SGLang parameters one at a time. Still nondeterministic. Official SGLang — same. vLLM — same. A 6-bit llama.cpp build, by contrast, was byte-identical every run. So I was about to write this off as “vLLM/SGLang are nondeterministic by design” (both even ship flags that claim to force determinism) and move on.

Then I took a break, and realised I’d only changed the serving stack. I hadn’t changed the model.

The actual variable: architecture × precision

So I swept it properly. The pattern that fell out:

model	precision	what’s quantized	bit-exact	divergence
27B dense	fp16	—	3/3	none
27B dense	AWQ-INT4	all	3/3	none
264B-REAP (prune of 397B)	FP8, experts only	experts	6/6	none observed
122B MoE	bf16	—	2/3	low-rate near-tie
122B MoE	FP8 (official)	experts + more	2/3	genuine near-tie flip
122B MoE	NVFP4	all	0/3	first-token-block flip
397B MoE	NVFP4	all	0/3	first-token-block flip

Two clean conclusions:

Dense models were byte-reproducible at every precision I tested. fp16, fp8, AWQ-INT4, NVFP4: all 3/3 identical (the table above lists only the fp16 and AWQ-INT4 rows). Quantization changes the trajectory (a 4-bit model is terser) but added zero run-to-run noise.
MoE models flip — and the quantization aggressiveness is the rate dial. bf16/fp8 MoE flips rarely (it needs to hit a genuine decision point). NVFP4 MoE flips constantly — in the very first block of generated tokens, every scenario, before any tool even runs.
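
For anyone who wants to reproduce the "bit-exact" and "divergence" columns, here is a minimal Python sketch. It assumes the fraction counts scenarios whose repeated runs all produced the same token sequence, which is one plausible reading of the table rather than something stated above. The function names and the token IDs in the example are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def first_divergence(a: List[int], b: List[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One sequence is a strict prefix of the other.
        return min(len(a), len(b))
    return None


def bit_exact_fraction(scenarios: Dict[str, List[List[int]]]) -> Tuple[int, int]:
    """Count scenarios in which every repeated run matches the first run."""
    exact = sum(
        all(first_divergence(runs[0], r) is None for r in runs[1:])
        for runs in scenarios.values()
    )
    return exact, len(scenarios)


# Hypothetical token IDs: two scenarios reproduce, one flips at index 2.
example = {
    "scenario_1": [[5, 9, 2, 7], [5, 9, 2, 7]],
    "scenario_2": [[1, 1, 4], [1, 1, 4]],
    "scenario_3": [[8, 3, 6, 6], [8, 3, 0, 6]],
}
exact, total = bit_exact_fraction(example)
flip_at = first_divergence(*example["scenario_3"])
print(f"bit-exact {exact}/{total}, scenario_3 first diverges at token {flip_at}")
```

Comparing token IDs rather than decoded text avoids treating two different tokenizations of the same string as identical.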

The mechanism

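The fused MoE expert path (top-k gate → grouped GEMM → weighted expert-combine reduction) accumulates in floating point, where addition is non-associative, and it feeds a top-k selection that is sensitive to ties. It’s the one path a dense model never runs. If the reduction order differs between runs, the low-order bits of the result differ too; whether that changes the output depends on whether the run hits a near-tie in that reduction, and precision sets how often that happens: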

dense — no such path, so rate 0 at any precision.
bf16 / fp8 MoE — low rate. You need a real near-tie: an actual decision point in the trajectory. Most generation stays byte-identical; the occasional tie tips.
nvfp4 MoE — high rate. Quant noise tips a tie in the first token block, every time, regardless of trajectory.
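
The near-tie effect described above can be shown with a toy example in plain Python. This is an illustration in float64 with made-up numbers, not a model of the actual fused kernel. `terms`, `rival_logit`, and both function names are hypothetical. Real bf16 or 4-bit paths have far coarser rounding steps, so ties are easier to hit there.

```python
def sum_in_order(terms, order):
    """Accumulate terms left to right in the given index order."""
    acc = 0.0
    for i in order:
        acc += terms[i]
    return acc


def top1(logits):
    """Index of the largest logit; the first index wins an exact tie."""
    return max(range(len(logits)), key=lambda i: logits[i])


terms = [0.1, 0.2, 0.3]   # hypothetical partial sums feeding one expert's gate logit
rival_logit = 0.6         # hypothetical logit of a competing expert

logit_run_a = sum_in_order(terms, [0, 1, 2])  # (0.1 + 0.2) + 0.3
logit_run_b = sum_in_order(terms, [1, 2, 0])  # (0.2 + 0.3) + 0.1

pick_run_a = top1([rival_logit, logit_run_a])
pick_run_b = top1([rival_logit, logit_run_b])

print(logit_run_a, logit_run_b)   # differ in the last bit
print(pick_run_a, pick_run_b)     # different expert selected
```

The two sums differ by one unit in the last place, and that is enough to route the token to a different expert. Every token after that point sees a different context.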

A useful control: I took a 35%-REAP prune of the 397B and quantized only the experts to FP8, leaving everything else fp16. Six runs, byte-identical. That’s suggestive — but honest caveat: those trajectories were short and clean and may simply never have hit a near-tie, rather than being immune. N=3 per corner (6 for the REAP model); the rates here are qualitative, not measured.
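
To put a number on how little N=3 or N=6 can rule out, here is a small sketch. It assumes each trial is independent and flips with the same probability p, which is a simplification and not something I measured. If n trials all come back identical, the largest p still consistent with that at a given confidence solves (1 - p)^n = 1 - confidence. The function name is hypothetical.

```python
def flip_rate_upper_bound(n_clean: int, confidence: float = 0.95) -> float:
    """Largest per-trial flip probability consistent with n_clean flip-free trials.

    Solves (1 - p) ** n_clean = 1 - confidence for p.
    """
    if n_clean <= 0:
        raise ValueError("need at least one clean trial")
    return 1.0 - (1.0 - confidence) ** (1.0 / n_clean)


for n in (3, 6):
    bound = flip_rate_upper_bound(n)
    print(f"{n} clean trials: flip rate could still be as high as {bound:.2f}")
```

Under these assumptions, three clean trials only bound the flip rate below roughly 0.63, and six clean trials below roughly 0.39. That is why the REAP result reads as suggestive rather than proof of immunity.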

One trap worth flagging: harness-injected nondeterminism

Early on, one scenario showed a dramatic 47% token-count swing that looked like violent MoE nondeterminism. It wasn’t. My agentic harness ran ls -la in the model’s workspace, and the long-format listing carried the live mtime of a file the model had just written. That wall-clock timestamp changes run to run, lands in the model’s context, and the agentic recovery loop amplifies it. One leaked timestamp, blown up into a cascade.

After stripping volatile fields, the swing collapsed from 47% to 2.4% — but the underlying model still flipped at one genuine decision point. The lesson: any agentic eval that feeds ls -l, timestamps, PIDs, or other volatile host state into the context will manufacture “nondeterminism” the loop then amplifies. Strip those before you judge the model.
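
A minimal sketch of that kind of scrubbing, assuming the tool output looks like a typical GNU `ls -l` listing. The regex, the `<mtime>` placeholder, and the function name are my own illustration, not the harness's actual code. Other volatile fields (PIDs, temp paths) would need their own rules.

```python
import re

# Matches a typical long-format line: mode, link count, owner, group, size,
# then the date columns ("Jan  5 12:34" or "Jan  5  2024"), then the name.
_LS_LONG = re.compile(
    r"^(?P<head>[-dlcbps][-rwxsStT]{9}\S*\s+\d+\s+\S+\s+\S+\s+\d+)\s+"
    r"(?P<mtime>[A-Za-z]{3}\s+\d{1,2}\s+(?:\d{1,2}:\d{2}|\d{4}))\s+"
    r"(?P<name>.+)$"
)


def scrub_ls_long(listing: str) -> str:
    """Replace the mtime columns of each long-format line with a fixed placeholder."""
    cleaned = []
    for line in listing.splitlines():
        m = _LS_LONG.match(line)
        if m:
            line = f"{m['head']} <mtime> {m['name']}"
        cleaned.append(line)  # non-matching lines (e.g. "total 8") pass through
    return "\n".join(cleaned)


# Two hypothetical runs that differ only in wall-clock time.
run_1 = "total 8\n-rw-r--r-- 1 agent agent 1234 Jan  5 12:34 notes.md"
run_2 = "total 8\n-rw-r--r-- 1 agent agent 1234 Jan  5 12:41 notes.md"
print(scrub_ls_long(run_1))
print(scrub_ls_long(run_1) == scrub_ls_long(run_2))
```

Scrubbing in the harness, before the text reaches the model's context, keeps the tool itself unchanged and makes the fix easy to audit.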

Practical takeaways

Want byte-reproducibility? Use a dense model (any precision) — or accept that any MoE on these kernels can flip at a near-tie.
bf16/fp8 is not a determinism fix for an MoE. It lowers the flip rate vs NVFP4; it does not zero it.
NVFP4 MoE is the worst case — it flips in the first token block.
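The one fully deterministic option I found remains fixed-reduction kernels (the 6-bit llama.cpp build). The kernel’s reduction order is the real lever, not the bit-width: reduction order decides whether two runs can differ at all, and bit-width only sets how often a near-tie is close enough to tip.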
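
To illustrate why reduction order is the lever, here is a toy sketch of an expert-combine step in plain Python. It is not how llama.cpp, vLLM, or SGLang implement it. The expert IDs, the contribution values, and both function names are hypothetical. The point is only that summing in a canonical order removes the dependence on arrival order.

```python
def combine_in_arrival_order(contributions, arrival_order):
    """Sum weighted expert outputs in whatever order they happen to arrive."""
    acc = 0.0
    for expert_id in arrival_order:
        acc += contributions[expert_id]
    return acc


def combine_in_fixed_order(contributions, arrival_order):
    """Sum the same values, but always in ascending expert-id order."""
    acc = 0.0
    for expert_id in sorted(arrival_order):
        acc += contributions[expert_id]
    return acc


# Hypothetical weighted outputs of three selected experts for one token.
contributions = {3: 0.1, 7: 0.2, 12: 0.3}
run_a = [3, 7, 12]   # arrival order in one run
run_b = [7, 12, 3]   # arrival order in another run

free_a = combine_in_arrival_order(contributions, run_a)
free_b = combine_in_arrival_order(contributions, run_b)
fixed_a = combine_in_fixed_order(contributions, run_a)
fixed_b = combine_in_fixed_order(contributions, run_b)

print(free_a == free_b)    # arrival-order sums disagree in the last bit
print(fixed_a == fixed_b)  # fixed-order sums agree
```

A fixed order does not make the result more accurate. It only makes it the same every run, which is all byte-reproducibility needs.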

The open question I’m chasing next: on 4× Blackwell, does a 35%-REAP + FP8-experts 397B beat the 4-bit quants on reliability and quality, not just determinism? That’s the REAP work.

Setup: faith, 4× RTX 6000 Pro (Blackwell), vLLM, TP=4, temp 0, think-off. Models: Qwen3.6-27B dense; Qwen3.5-122B-A10B MoE; my Qwen3.5-264B-A17B REAP-35% FP8-experts-only; nvidia Qwen3.5-397B-A17B-NVFP4.

I was tuning a Qwen3.5-397B-A17B deployment on local hardware. Here’s what I found about determinism in quantized MoE inference that I haven’t seen documented elsewhere.

So this morning I was online looking for ways to beef up the decoding speed of the Qwen 397B model, and I found this: https://github.com/local-inference-lab/rtx6kpro/blob/master/models/qwen35-397b.md . It is a specially tuned Docker container setup that gets 180 tok/s out of an NVFP4 quant. So guess what, I had to try it. The second 4-bit quant they measured ran at 150 tok/s: slower, but the quant itself claimed to be more accurate, so I picked that one. I built my own benchmark, basically a chartist answering 30 unique questions about a price chart in an agentic multi-turn workflow, 3 scenarios in total on different timeframes. I ran it and saw decode speeds (MTP enabled) in the range of 200 to 300 tok/s, averaging somewhere around 220 tok/s. I also tested different kinds of input to see how the performance changes: throw a made-up language at the model, or just use a language like Estonian lol, and it floors at 140 tok/s. Single-stream performance was around 130 tok/s without MTP. So awesome, right? I was thrilled until I noticed that every run of my test produced a different token spend. Qwen scored 99 to 100% on the bench itself, basically nailing it every time, but the token spend was all over the place.

Long story short, I stripped SGLang parameters one by one: nothing helped, still nondeterministic. Then I tried official SGLang: same. vLLM: same. 6-bit llama.cpp produced deterministic results, identical every time. So I was about to write it off as vLLM and SGLang having some sort of nondeterminism in them by design; the tools even have specific flags to force determinism. I took a break and figured I had not tried a different model, so I fired up the NVFP4 quant instead of the AWQ one. Same nondeterminism. So it must be something vLLM does differently internally, right? So I fired up the dense Qwen 27B model, starting with fp16, then fp8, then AWQ and NVFP4: fully deterministic, identical results. So it is a model issue, specifically something to do with MoE. I tested the 122B at fp16, fp8, and 4-bit: nondeterministic, less so at fp16, and worse with the 4-bit quants.

The next thing I wanted to see was how the 397B behaves at fp16. vastai? Well, I did not really get to it lol. I found a 35% REAP-pruned version on Hugging Face. There was an FP8 version too, but it did not work, so I quantized the fp16 REAP myself: only the experts got the FP8 treatment, since the experts are where the space savings really live, and I left everything else at fp16. I ran it, and it scored as expected: deterministic, identical results every time. So this led me to a question: is it worth chasing what made the 4-bit 397B quants nondeterministic? Do I need to? Given all that, what I want to know now is: would 35% REAP + FP8 beat the 4-bit quants on reliability or quality, on 4x 96GB Blackwells? …