REAP Observer — what the calibration captured

Qwen3.5-397B-A17B · cumulative tier @ tokens · the ratio-agnostic saliency state every prune decision is sliced from

1 · The variables it records

The observer watches every routed token and accumulates per-expert statistics. Nothing here is tied to a compression ratio — the ratio is applied later, at prune time, by cutting the lowest-saliency experts.
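A minimal sketch of that accumulate-then-slice flow, under stated assumptions: saliency is taken to be the mean of gate weight × expert output norm over the tokens routed to an expert, and the names `ExpertStats`, `Observer`, `record` and `prune_set` are hypothetical, not the observer's actual API.

```python
from dataclasses import dataclass


@dataclass
class ExpertStats:
    count: int = 0                   # tokens routed to this expert
    gate_sum: float = 0.0            # sum of router gate weights
    weighted_norm_sum: float = 0.0   # sum of gate * ||expert output||
    peak: float = 0.0                # largest single-token gate * norm

    @property
    def saliency(self) -> float:
        # Assumed definition: mean gate-weighted output norm per routed token.
        return self.weighted_norm_sum / self.count if self.count else 0.0


class Observer:
    """Ratio-agnostic accumulator for one MoE layer (illustrative only)."""

    def __init__(self, num_experts: int):
        self.stats = [ExpertStats() for _ in range(num_experts)]

    def record(self, expert_id: int, gate: float, out_norm: float) -> None:
        s = self.stats[expert_id]
        contrib = gate * out_norm
        s.count += 1
        s.gate_sum += gate
        s.weighted_norm_sum += contrib
        s.peak = max(s.peak, contrib)

    def prune_set(self, ratio: float) -> set:
        # The ratio only enters here: drop the lowest-saliency experts.
        k = int(len(self.stats) * ratio)
        order = sorted(range(len(self.stats)),
                       key=lambda e: self.stats[e].saliency)
        return set(order[:k])


obs = Observer(num_experts=4)
obs.record(0, gate=0.5, out_norm=2.0)
obs.record(0, gate=0.5, out_norm=4.0)
obs.record(1, gate=0.1, out_norm=1.0)
obs.record(2, gate=0.9, out_norm=10.0)
dropped_half = obs.prune_set(0.5)
dropped_quarter = obs.prune_set(0.25)
```

The same accumulated state answers both `prune_set` calls; nothing is recomputed when the ratio changes.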

2 · The cut tradeoff curve

Any ratio can be sliced from this one checkpoint. Saliency deleted stays well below traffic deleted; that gap is REAP preferentially keeping high-contribution experts. The saliency cost steepens past a ratio of ~0.5, a cheap predictor of the quality cliff.

Legend: routing traffic deleted · saliency mass deleted

If you pruned at random, both lines would track the diagonal. Saliency selection bends the pink line down.
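One way such a curve could be computed, as a sketch: sort experts by saliency, then accumulate the share of traffic and the share of saliency removed at each cut. Treating "saliency mass" as the plain sum of per-expert saliency is an assumption, and `cut_curve` is a hypothetical name.

```python
def cut_curve(saliency, traffic):
    """Return (ratio, traffic_deleted, saliency_deleted) for every cut point.

    saliency[e] and traffic[e] are per-expert values for one pool of experts.
    Experts are removed lowest-saliency first.
    """
    order = sorted(range(len(saliency)), key=lambda e: saliency[e])
    total_s = sum(saliency)
    total_t = sum(traffic)
    curve = []
    cum_s = 0.0
    cum_t = 0.0
    for k, e in enumerate(order, start=1):
        cum_s += saliency[e]
        cum_t += traffic[e]
        curve.append((k / len(order), cum_t / total_t, cum_s / total_s))
    return curve


# Toy pool: equal traffic, unequal saliency.
curve = cut_curve(saliency=[1.0, 2.0, 3.0, 4.0], traffic=[10, 10, 10, 10])
ratio, traffic_deleted, saliency_deleted = curve[1]  # cut at 50%
```

In the toy pool, cutting half the experts removes half the traffic but a smaller share of the saliency, which is the gap the chart describes.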

3 · What actually drives saliency

Correlation of the REAP score with each candidate driver, averaged across layers. Saliency is almost entirely output magnitude, not how often an expert is used. REAP fires the quiet experts, not the rarely-routed ones.
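A sketch of the per-layer correlation and its average, assuming Pearson correlation (the page does not say which coefficient is used). The driver names `out_norm` and `routing_freq` and the helper `driver_correlations` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def driver_correlations(layers):
    """layers: list of dicts, each with 'saliency' plus one list per driver."""
    drivers = [name for name in layers[0] if name != "saliency"]
    return {
        d: sum(pearson(layer["saliency"], layer[d]) for layer in layers)
        / len(layers)
        for d in drivers
    }


# Toy data: saliency follows output norm and runs against routing frequency.
layers = [
    {"saliency": [1.0, 2.0, 3.0], "out_norm": [2.0, 4.0, 6.0],
     "routing_freq": [3.0, 2.0, 1.0]},
    {"saliency": [0.5, 1.5, 1.0], "out_norm": [1.0, 3.0, 2.0],
     "routing_freq": [9.0, 7.0, 8.0]},
]
corr = driver_correlations(layers)
```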

4 · Layer geography

Where the prunable slack lives. The saliency floor (cutoff value at 34%) is near-zero in early layers and climbs steeply late — early layers are redundant, late layers specialized. The tail-risk line shows the % of dropped experts that spike occasionally (peak in global top-25%) — the experts an average metric can't protect.

Legend: saliency floor @34% (left axis) · tail-risk % (right axis)
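The two lines could be derived per layer roughly as below. This is a sketch under assumptions: the floor is taken as the smallest saliency that survives the cut, "peak" is each expert's largest single-token contribution, and the global top-25% threshold is computed over the peaks of all experts in all layers. All function names are hypothetical.

```python
def saliency_floor(saliency, ratio=0.34):
    """Smallest saliency that survives when the lowest `ratio` is dropped."""
    ranked = sorted(saliency)
    k = int(len(ranked) * ratio)
    return ranked[min(k, len(ranked) - 1)]


def global_peak_threshold(all_peaks, top_fraction=0.25):
    """Peak value an expert must reach to sit in the global top fraction."""
    ranked = sorted(all_peaks, reverse=True)
    k = max(1, int(len(ranked) * top_fraction))
    return ranked[k - 1]


def tail_risk(saliency, peak, peak_threshold, ratio=0.34):
    """Percent of dropped experts whose peak reaches the global threshold."""
    n = len(saliency)
    order = sorted(range(n), key=lambda e: saliency[e])
    dropped = order[: int(n * ratio)]
    if not dropped:
        return 0.0
    spiky = sum(1 for e in dropped if peak[e] >= peak_threshold)
    return 100.0 * spiky / len(dropped)


# Toy layer of 8 experts; expert 0 has low mean saliency but a high peak.
saliency = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
peak = [9.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 10.0]
threshold = global_peak_threshold(peak)  # single-layer "global" for the toy
floor = saliency_floor(saliency)
risk = tail_risk(saliency, peak, threshold)
```

Expert 0 is the case the caption warns about: its mean puts it in the dropped set even though its peak ranks among the highest.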

Routing balance (CV) per layer

0 = every expert used equally. Higher = a few hot experts dominate.
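For reference, the coefficient of variation is the standard deviation of per-expert token counts divided by their mean. A small sketch follows; using the population standard deviation is an assumption, and `routing_cv` is a hypothetical name.

```python
import math


def routing_cv(token_counts):
    """Coefficient of variation of per-expert token counts for one layer."""
    n = len(token_counts)
    mean = sum(token_counts) / n
    if mean == 0:
        return 0.0  # no traffic observed; treat as balanced
    var = sum((c - mean) ** 2 for c in token_counts) / n
    return math.sqrt(var) / mean


balanced = routing_cv([10, 10, 10, 10])   # every expert used equally
skewed = routing_cv([40, 0, 0, 0])        # one hot expert takes everything
```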

Redundancy coverage per layer

% of a dropped expert's co-firing traffic that lands on a surviving expert. High = the work gets reabsorbed.
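A sketch of how this percentage might be computed from a co-firing matrix, where `cofire[i][j]` counts tokens routed to both expert `i` and expert `j`. The matrix layout, the pooling across dropped experts, and the name `redundancy_coverage` are all assumptions.

```python
def redundancy_coverage(cofire, dropped):
    """Percent of dropped experts' co-firing traffic shared with survivors."""
    covered = 0
    total = 0
    for d in dropped:
        for j, tokens in enumerate(cofire[d]):
            if j == d:
                continue  # ignore the diagonal
            total += tokens
            if j not in dropped:
                covered += tokens
    return 100.0 * covered / total if total else 0.0


# Toy layer with 3 experts and symmetric co-firing counts.
cofire = [
    [0, 30, 10],
    [30, 0, 30],
    [10, 30, 0],
]
one_dropped = redundancy_coverage(cofire, dropped={0})
two_dropped = redundancy_coverage(cofire, dropped={0, 1})
```

Dropping a second expert lowers coverage because part of the first expert's co-firing traffic landed on it.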

5 · Saliency distributions

The shape of the expert pool, early vs mid vs late. Dashed line = the 34% cut. Early layers (L4) pile up near zero with a long thin tail; late layers (L53) sit far from zero — even their weakest third is genuinely contributing.

Generated from . Curves are cheap proxies extracted from the observer state — they predict where to spend expensive benchmark evals, they don't replace them.