Qwen3.5-397B-A17B · cumulative tier @ tokens · the ratio-agnostic saliency state every prune decision is sliced from
The observer watches every routed token and accumulates per-expert statistics. Nothing here is tied to a compression ratio — the ratio is applied later, at prune time, by cutting the lowest-saliency experts.
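A minimal sketch of what such an observer might accumulate. The class name, field names, and the score form (mean of gate weight times expert output norm over the tokens routed to that expert) are assumptions for illustration, not the report's actual schema.

```python
from collections import defaultdict

class ExpertObserver:
    """Hypothetical per-expert accumulator; independent of any prune ratio."""

    def __init__(self):
        self.count = defaultdict(int)                # tokens routed to each expert
        self.weighted_norm_sum = defaultdict(float)  # sum of gate * ||output||
        self.peak = defaultdict(float)               # largest single contribution

    def observe(self, expert_id, gate_weight, output_norm):
        contribution = gate_weight * output_norm
        self.count[expert_id] += 1
        self.weighted_norm_sum[expert_id] += contribution
        self.peak[expert_id] = max(self.peak[expert_id], contribution)

    def saliency(self, expert_id):
        # Assumed score: average contribution per routed token.
        n = self.count.get(expert_id, 0)
        return self.weighted_norm_sum[expert_id] / n if n else 0.0

obs = ExpertObserver()
obs.observe(0, 0.5, 2.0)
obs.observe(0, 0.25, 4.0)
obs.observe(1, 0.1, 1.0)
print(obs.saliency(0), obs.count[0], obs.peak[0])
```

Because the score is averaged per routed token, an expert that is rarely routed but loud when it fires can still score high.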
Slice any ratio from this one checkpoint. Saliency deleted stays well below traffic deleted: that gap is REAP preferentially keeping high-contribution experts. The saliency cost steepens past a ratio of ~0.5, a cheap predictor of the quality cliff.
If you pruned at random, both lines would track the diagonal. Saliency selection bends the pink line down.
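The two curves can be sketched as follows, assuming one saliency value and one traffic share per expert. The function name `prune_curve` and the toy numbers are hypothetical; whether the real cut is taken per layer or globally is not stated here.

```python
def prune_curve(saliency, traffic, ratios):
    """For each ratio, drop the lowest-saliency experts and return
    (ratio, fraction of saliency deleted, fraction of traffic deleted)."""
    order = sorted(range(len(saliency)), key=lambda i: saliency[i])
    total_s = sum(saliency)
    total_t = sum(traffic)
    rows = []
    for r in ratios:
        k = int(round(r * len(saliency)))  # number of experts to drop
        dropped = order[:k]
        s_del = sum(saliency[i] for i in dropped) / total_s
        t_del = sum(traffic[i] for i in dropped) / total_t
        rows.append((r, s_del, t_del))
    return rows

# Toy data: six experts, made-up values.
saliency = [0.01, 0.02, 0.05, 0.10, 0.30, 0.52]
traffic = [0.10, 0.12, 0.15, 0.18, 0.20, 0.25]
curve = prune_curve(saliency, traffic, [0.0, 0.5, 1.0])
for row in curve:
    print(row)
```

In this toy data, dropping half the experts deletes far less saliency than traffic, which is the gap the chart shows.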
Correlation of the REAP score with each candidate driver, averaged across layers. Saliency is almost entirely output magnitude, not how often an expert is used. REAP fires the quiet experts, not the rarely routed ones.
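A sketch of the layer-averaged correlation, assuming Pearson correlation computed within each layer and then averaged. The report may use a different estimator (rank correlation, for example), and the driver names and values here are illustrative only.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def mean_correlation(layers, driver):
    """Average over layers of corr(score, driver) within each layer."""
    rs = [pearson(layer["score"], layer[driver]) for layer in layers]
    return sum(rs) / len(rs)

# Toy layers with four experts each; values are made up.
layers = [
    {"score": [1.0, 2.0, 3.0, 4.0],
     "output_norm": [2.0, 4.0, 6.0, 8.0],
     "frequency": [3.0, 1.0, 4.0, 2.0]},
    {"score": [1.0, 3.0, 2.0, 5.0],
     "output_norm": [1.0, 3.0, 2.0, 5.0],
     "frequency": [2.0, 2.5, 1.0, 2.0]},
]
r_norm = mean_correlation(layers, "output_norm")
r_freq = mean_correlation(layers, "frequency")
print(r_norm, r_freq)
```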
Where the prunable slack lives. The saliency floor (cutoff value at 34%) is near-zero in early layers and climbs steeply late — early layers are redundant, late layers specialized. The tail-risk line shows the % of dropped experts that spike occasionally (peak in global top-25%) — the experts an average metric can't protect.
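Both per-layer quantities can be sketched under stated assumptions: the floor is taken as the saliency of the weakest surviving expert, and an expert counts as spiky when its peak contribution ranks in the top 25% of all peaks across layers. The helper names and the toy numbers are hypothetical, and the example uses a 50% cut so the counts come out whole.

```python
def saliency_floor(layer_saliency, ratio):
    """Saliency of the weakest expert that survives the cut."""
    ranked = sorted(layer_saliency)
    k = int(ratio * len(ranked))  # number of experts dropped
    return ranked[k] if k < len(ranked) else float("inf")

def tail_risk(layer_saliency, layer_peak, ratio, global_peaks, top_frac=0.25):
    """Share of dropped experts whose peak is in the global top `top_frac`."""
    order = sorted(range(len(layer_saliency)), key=lambda i: layer_saliency[i])
    dropped = order[:int(ratio * len(order))]
    if not dropped:
        return 0.0
    ranked_peaks = sorted(global_peaks, reverse=True)
    n_top = max(1, int(top_frac * len(ranked_peaks)))
    threshold = ranked_peaks[n_top - 1]
    spiky = sum(1 for i in dropped if layer_peak[i] >= threshold)
    return spiky / len(dropped)

# Toy layer: expert 1 has low average saliency but a very high peak.
layer_saliency = [0.1, 0.2, 0.9, 1.5]
layer_peak = [0.3, 7.0, 2.0, 3.0]
other_layer_peak = [1.0, 1.2, 0.8, 6.0]
global_peaks = layer_peak + other_layer_peak

floor = saliency_floor(layer_saliency, 0.5)
risk = tail_risk(layer_saliency, layer_peak, 0.5, global_peaks)
print(floor, risk)
```

Expert 1 is dropped on its average score even though its peak is among the largest anywhere, which is the case the tail-risk line is meant to flag.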
0 = every expert used equally. Higher = a few hot experts dominate.
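The metric behind this panel is not named here. One measure with the stated behavior is the Gini coefficient of per-expert token counts, sketched below as an assumption rather than as the report's actual formula.

```python
def gini(counts):
    """Gini coefficient: 0 when all counts are equal, approaching 1
    when a single expert receives all the traffic."""
    xs = sorted(counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n

balanced = gini([5, 5, 5, 5])
skewed = gini([0, 0, 0, 4])
print(balanced, skewed)
```

For n experts the maximum is (n - 1) / n, so layers with different expert counts are only approximately comparable on this scale.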
% of a dropped expert's co-firing traffic that lands on a surviving expert. High = the work gets reabsorbed.
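A sketch of the reabsorption share, assuming a co-firing table that counts, for each dropped expert, how many tokens it shared with each other expert. The table layout and the function name are hypothetical.

```python
def reabsorption(cofire, dropped, survivors):
    """For each dropped expert, the share of its co-firing traffic
    whose partner is a surviving expert."""
    survivors = set(survivors)
    out = {}
    for d in dropped:
        row = cofire[d]
        total = sum(c for j, c in row.items() if j != d)
        kept = sum(c for j, c in row.items() if j in survivors and j != d)
        out[d] = kept / total if total else 0.0
    return out

# cofire[d][j] = tokens on which experts d and j were both selected (toy counts).
cofire = {
    0: {1: 30, 2: 60, 3: 10},
    3: {0: 10, 1: 5, 2: 5},
}
result = reabsorption(cofire, dropped=[0, 3], survivors=[1, 2])
print(result)
```

Expert 0 mostly co-fires with survivors, so most of its work has somewhere to go. Expert 3 co-fires half the time with another dropped expert, so less of its work is reabsorbed.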
The shape of the expert pool, early vs mid vs late. Dashed line = the 34% cut. Early layers (L4) pile up near zero with a long thin tail; late layers (L53) sit far from zero — even their weakest third is genuinely contributing.
Generated from . Curves are cheap proxies extracted from the observer state — they predict where to spend expensive benchmark evals, they don't replace them.