▍ Finding · local inference
Dense is byte-reproducible; MoE flips at near-ties
I was tuning a Qwen3.5-397B-A17B deployment on local hardware, chasing decode speed. I ended up chasing something else entirely: why the same benchmark, run twice at temperature 0, never spent the same number of tokens twice. Here is what I found about determinism in quantized Mixture-of-Experts inference — a result I haven’t seen written down anywhere.
The symptom
My benchmark is a chartist answering 30 questions about a price chart across an agentic, multi-turn workflow — three scenarios on different timeframes. On a 4-bit (NVFP4) 397B quant the model nailed the task every time: 99–100% on the bench. But the token spend was all over the place run to run, at temperature 0, single stream. Same prompt, same weights, different trajectory.
I did the obvious thing first and blamed the serving stack. I stripped SGLang parameters one at a time. Still nondeterministic. Official SGLang: same. vLLM: same. A 6-bit llama.cpp build, by contrast, was byte-identical every run. So I was about to write this off as “vLLM/SGLang are nondeterministic by design” (both even ship flags that claim to force determinism) and move on.
Then I took a break, and realised I’d only changed the serving stack. I hadn’t changed the model.
The actual variable: architecture × precision
So I swept it properly. The pattern that fell out:
| model | precision | what’s quantized | bit-exact | divergence |
|---|---|---|---|---|
| 27B dense | fp16 | — | 3/3 | none |
| 27B dense | AWQ-INT4 | all | 3/3 | none |
| 264B-REAP (prune of 397B) | FP8, experts only | experts | 6/6 | none observed |
| 122B MoE | bf16 | — | 2/3 | low-rate near-tie |
| 122B MoE | FP8 (official) | experts + more | 2/3 | genuine near-tie flip |
| 122B MoE | NVFP4 | all | 0/3 | first-token-block flip |
| 397B MoE | NVFP4 | all | 0/3 | first-token-block flip |
Two clean conclusions:
- Dense models are byte-reproducible at every precision I tested. fp16, fp8, AWQ-INT4, NVFP4: all 3/3 identical (the table lists only the fp16 and AWQ-INT4 rows). Quantization changes the trajectory (a 4-bit model is terser) but adds zero run-to-run noise.
- MoE models flip — and the quantization aggressiveness is the rate dial. bf16/fp8 MoE flips rarely (it needs to hit a genuine decision point). NVFP4 MoE flips constantly — in the very first block of generated tokens, every scenario, before any tool even runs.
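For anyone who wants to reproduce the bit-exact and divergence columns, here is a minimal sketch of how such a comparison could be scored. It assumes each run is available as a list of generated token IDs; the function names are hypothetical, and the scoring rule (size of the largest group of identical runs, with a group of one counted as zero) is my guess at a reasonable definition, not necessarily the one behind the table.

```python
from collections import Counter
import hashlib


def digest(token_ids):
    """SHA-256 over the token IDs, so long runs compare cheaply."""
    h = hashlib.sha256()
    for t in token_ids:
        # Assumes non-negative token IDs that fit in 4 bytes.
        h.update(int(t).to_bytes(4, "little"))
    return h.hexdigest()


def first_divergence(a, b):
    """Index of the first differing token, or None if the runs are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one run is a strict prefix of the other
    return None


def summarize(runs):
    """Count runs in the largest identical group and locate where the rest diverge."""
    digests = [digest(r) for r in runs]
    majority_digest, majority_count = Counter(digests).most_common(1)[0]
    reference = runs[digests.index(majority_digest)]
    divergences = [
        first_divergence(reference, r)
        for r, d in zip(runs, digests)
        if d != majority_digest
    ]
    # A "group" of one run means nothing matched anything else.
    matching = majority_count if majority_count > 1 else 0
    return {"bit_exact": f"{matching}/{len(runs)}", "first_divergence": divergences}


example = summarize([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 9, 4]])
```

A small first-divergence index corresponds to what the table calls a first-token-block flip; a large one points at a later decision point.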
The mechanism
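The fused MoE expert path (top-k gate → grouped GEMM → weighted expert-combine reduction) is the one path a dense model never runs. Its reductions are floating-point sums, which are not associative, so the result can differ in the last bits whenever the accumulation order differs, and it is sensitive to argmax ties: when two candidates are nearly tied, a last-bit difference picks a different winner. That only produces run-to-run differences if the accumulation order itself varies between runs, which is what the fixed-reduction llama.cpp result points to. Whether a given run flips depends on whether it hits a near-tie in that reduction, and precision sets how often that happens: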
- dense — no such path, so rate 0 at any precision.
- bf16 / fp8 MoE — low rate. You need a real near-tie: an actual decision point in the trajectory. Most generation stays byte-identical; the occasional tie tips.
- nvfp4 MoE — high rate. Quant noise tips a tie in the first token block, every time, regardless of trajectory.
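The effect is easy to see in isolation. The toy below is not the MoE kernel, just plain Python floats: it sums the same three numbers in two orders, gets results one ulp apart, and shows that a top-1 selection against a rival score flips as a consequence. The values, the `rival` score, and the function names are all made up for illustration.

```python
def reduce_in_order(terms, order):
    """Sum terms in the given index order, like a reduction with that schedule."""
    total = 0.0
    for i in order:
        total += terms[i]
    return total


def top1(scores):
    """Index of the largest score; exact ties go to the lowest index."""
    return max(range(len(scores)), key=scores.__getitem__)


terms = [0.1, 0.2, 0.3]
left_to_right = reduce_in_order(terms, [0, 1, 2])  # (0.1 + 0.2) + 0.3
right_to_left = reduce_in_order(terms, [2, 1, 0])  # (0.3 + 0.2) + 0.1

# A rival candidate sitting right at the near-tie.
rival = 0.6

pick_a = top1([rival, left_to_right])  # reduced score leads by one ulp
pick_b = top1([rival, right_to_left])  # exact tie, so the rival wins

print(left_to_right, right_to_left, pick_a, pick_b)
```

Same inputs, same arithmetic, different winner. Coarser quantization plausibly widens the band of scores that count as "near" a tie, which would match the higher flip rate seen at NVFP4, but that part is an inference from the table rather than something this toy demonstrates.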
A useful control: I took a 35%-REAP prune of the 397B and quantized only the experts to FP8, leaving everything else fp16. Six runs, byte-identical. That’s suggestive — but honest caveat: those trajectories were short and clean and may simply never have hit a near-tie, rather than being immune. N=3 per corner (6 for the REAP model); the rates here are qualitative, not measured.
One trap worth flagging: harness-injected nondeterminism
Early on, one scenario showed a dramatic 47% token-count swing that looked like violent MoE nondeterminism. It wasn’t. My agentic harness ran ls -la in the model’s workspace, and the long-format listing carried the live mtime of a file the model had just written. That wall-clock timestamp changes run to run, lands in the model’s context, and the agentic recovery loop amplifies it. One leaked timestamp, blown up into a cascade.
After stripping volatile fields, the swing collapsed from 47% to 2.4%, but the underlying model still flipped at one genuine decision point. The lesson: any agentic eval that feeds ls -l, timestamps, PIDs, or other volatile host state into the context will manufacture “nondeterminism” the loop then amplifies. Strip those before you judge the model.
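A scrubber for this can be quite small. The sketch below is an assumption about what such a filter might look like, not the one used in my harness: it replaces `ls -l` style mtimes, ISO-8601 timestamps, and `pid=1234` style fields with fixed placeholders before tool output reaches the model. The patterns and the `scrub_volatile` name are illustrative and will need extending for other tools and locales.

```python
import re

_MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"

# (pattern, placeholder) pairs, applied in order.
_RULES = [
    # ls -l mtime: "Jan  5 12:34" (recent files) or "Jan  5 2024" (older files)
    (re.compile(rf"\b(?:{_MONTHS})\s+\d{{1,2}}\s+(?:\d{{2}}:\d{{2}}|\d{{4}})\b"),
     "<MTIME>"),
    # ISO-8601 timestamps with optional fraction and timezone
    (re.compile(r"\b\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}"
                r"(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?"),
     "<TIMESTAMP>"),
    # "pid=1234", "PID: 1234", "pid 1234"
    (re.compile(r"\bpid[=:\s]\s*\d+", re.IGNORECASE), "pid=<PID>"),
]


def scrub_volatile(text):
    """Replace run-dependent host state with stable placeholders."""
    for pattern, placeholder in _RULES:
        text = pattern.sub(placeholder, text)
    return text


run_a = "-rw-r--r-- 1 me me 512 Jan  5 12:34 notes.md"
run_b = "-rw-r--r-- 1 me me 512 Jan  5 12:37 notes.md"
print(scrub_volatile(run_a) == scrub_volatile(run_b))
```

File sizes are deliberately left alone: a size change reflects something the model actually did, whereas the mtime only reflects when it did it.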
Practical takeaways
- Want byte-reproducibility? Use a dense model (any precision) — or accept that any MoE on these kernels can flip at a near-tie.
- bf16/fp8 is not a determinism fix for an MoE. It lowers the flip rate vs NVFP4; it does not zero it.
- NVFP4 MoE is the worst case — it flips in the first token block.
- The fully deterministic option remains fixed-reduction kernels, which is what the 6-bit llama.cpp build gave me. The kernel’s reduction order is the real lever, not the bit-width.
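To make the "reduction order is the lever" point concrete, here is a toy comparison, again in plain Python rather than any real kernel. Expert contributions arrive as hypothetical `(expert_id, value)` pairs; one combiner sums them in whatever order they arrive, the other sorts by expert ID first. The values are deliberately extreme so the effect is visible without quantization.

```python
from itertools import permutations


def combine_arrival_order(contribs):
    """Sum contributions in whatever order they happen to arrive."""
    total = 0.0
    for _expert_id, value in contribs:
        total += value
    return total


def combine_fixed_order(contribs):
    """Sum contributions in ascending expert-ID order, regardless of arrival."""
    total = 0.0
    for _expert_id, value in sorted(contribs, key=lambda c: c[0]):
        total += value
    return total


# (expert_id, contribution); magnitudes chosen so rounding is obvious.
contribs = [(0, 1e16), (1, 1.0), (2, -1e16)]

# Try every possible arrival order.
arrival_results = {combine_arrival_order(p) for p in permutations(contribs)}
fixed_results = {combine_fixed_order(p) for p in permutations(contribs)}

print(arrival_results)  # more than one distinct value
print(fixed_results)    # exactly one value
```

Note that the fixed-order result here is 0.0 while the mathematically exact sum is 1.0: fixing the order buys reproducibility, not accuracy. That is consistent with dense quants being terser yet perfectly repeatable.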
The open question I’m chasing next: on 4× Blackwell, does a 35%-REAP + FP8-experts 397B beat the 4-bit quants on reliability and quality, not just determinism? That’s the REAP work.
Setup: faith, 4× RTX 6000 Pro (Blackwell), vLLM, TP=4, temp 0, think-off. Models: Qwen3.6-27B dense; Qwen3.5-122B-A10B MoE; my Qwen3.5-264B-A17B REAP-35% FP8-experts-only; nvidia Qwen3.5-397B-A17B-NVFP4.
So this morning I was online looking for ways to speed up decoding on the Qwen 397B model, and I found https://github.com/local-inference-lab/rtx6kpro/blob/master/models/qwen35-397b.md (a specially tuned Docker container setup that gets 180 tok/s out of an NVFP4 quant). So guess what, I had to try it. The second 4-bit quant they measured ran at 150 tok/s: slower, but the quant itself claimed to be more accurate, so I picked that one. I built my own benchmark, basically a chartist answering 30 unique questions about a price chart in an agentic multi-turn workflow, with 3 scenarios in total on different timeframes. I ran it and saw decode speeds (MTP enabled) in the range of 200 to 300 tok/s, averaging somewhere around 220 tok/s. I also tested different kinds of input to see how the performance changes: throw a made-up language at the model, or just use a language like Estonian lol, and it floors at 140 tok/s. Single-stream performance was around 130 tok/s without MTP. So, awesome, right? I was thrilled until I noticed that every run of my test produced a different token spend. Qwen scored 99 to 100% on the bench itself, basically nailing it every time, but the token spend was all over the place.
Long story short, I stripped SGLang parameters one by one: still nondeterministic. Then I tried official SGLang, same; vLLM, same. The 6-bit llama.cpp build produced deterministic results, identical every time. So I was about to write it off as vLLM and SGLang having some sort of nondeterminism in them by design; the tools even have a specific flag to force determinism. I took a break and figured I had not tried a different model, so I fired up an NVFP4 quant instead of AWQ. Same nondeterminism. So it must be something vLLM does differently internally, right? So I fired up the dense Qwen 27B model, starting with fp16, then fp8, then AWQ and NVFP4: fully deterministic, identical results. So it is a model issue, specifically something to do with MoE. I tested the 122B at fp16, fp8, and 4-bit quants: all nondeterministic, fp16 less so, and it got worse with the 4-bit quants.
Next I wanted to see how the 397B behaves at fp16. vastai? Well, I did not really get to it lol. I found a 35% REAP prune on Hugging Face. It had an FP8 version too, but that did not work, so I quantized the fp16 REAP myself: only the experts got the FP8 treatment, since the experts are where the space savings really live, and I left everything else fp16. I ran it, it scored as expected, and it was deterministic, with identical results every time. So this led me to a question: is it worth chasing what made the 4-bit 397B quants nondeterministic? Do I need to? Given all that, what I want to know now is whether 35% REAP + FP8 would beat the 4-bit quants on reliability or quality on 4x 96GB Blackwells. …