▍ Developing · thinking out loud
Trying to understand REAP in Qwen3.5-397B-A17B
Developing story — raw notes, updated as I go.
Jun 3, post nr 3. “I am trying to understand REAP in the context of Qwen3.5-397B-A17B.”
Over the last 24 hours I built a benchmark and a REAP pipeline on my own hardware. 50M tokens are processed so far, the next checkpoint is at 100M, and the run will take 25 hours and stop at the 402M-token mark, with 16k context flowing through the weights. I'm dissecting the 50M checkpoint and trying to build an understanding of how it all clicks together. I have measured the 807GB FP16 Qwen model (cost me 100 dollars, 4xB300 on Vast.ai) and assembled a benchmark that I thought made sense: nothing too fancy, but enough to tell me how far any variant derived from FP16 falls …
EDIT (30 min later): all of this, I imagine, assumes a uniform cut across each of the 60 layers; a 34% cut would probably leave more on the table at both ends? What if we had uniform cuts of 5%, 10%, 15%, etc.: what would each bring to the table? Small cuts are obviously less lossy, but that comes at the cost of space. There is probably a sweet spot somewhere. The original 0xSero 35% cut, on a 384GB VRAM system, had a 1.6 million token KV cache. So on my system, I imagine the sweet spot is a 500k vLLM KV cache, which means I could REAP less. I could actually have multiple prunes: eventually a REAP variant (on my 384GB the end model would be REAP FP16 with only the experts quantized to FP8) plus a 50k KV cache, a 100k KV cache, or a 256k (full model max) KV cache. A 30% cut looks like where the lossiness starts to accelerate? So that would be maybe a 1.5 million token KV cache. How to get the most out of the tradeoffs? I could go as far as custom vLLM patches. I did that a year ago with a less refined process on 2x 96GB cards: experts were variable per layer, not uniform, and the variability allowed more space savings (Qwen3.5-397B-A17B-REAP-28-NVFP4).
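To make the cut-versus-KV-cache question concrete, here is a rough back-of-envelope sketch, not a measurement. It assumes a uniform cut removes the same fraction of expert weights in every layer, that only the experts shrink when pruned or quantized, and that whatever VRAM is left after the weights goes to KV cache. The 384GB and 807GB figures are from the notes above; `expert_share` and `kv_kib_per_token` are placeholder guesses I have not measured, and the names `Budget`, `weights_gb` and `kv_tokens` are made up for this sketch. It also ignores activation memory and vLLM's own overhead, so real numbers will come out lower.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    vram_gb: float           # total VRAM (384 in the notes)
    model_gb: float          # FP16 checkpoint size (807 in the notes)
    expert_share: float      # PLACEHOLDER: fraction of weight bytes held by experts
    expert_scale: float      # size multiplier on kept experts (0.5 = FP16 -> FP8)
    kv_kib_per_token: float  # PLACEHOLDER: KV cache cost per token, in KiB


def weights_gb(b: Budget, cut: float) -> float:
    """Weight footprint in GB after a uniform expert cut of fraction `cut`."""
    if not 0.0 <= cut < 1.0:
        raise ValueError("cut must be in [0, 1)")
    # Only the expert weights are pruned and rescaled; the rest stays FP16.
    experts = b.model_gb * b.expert_share * (1.0 - cut) * b.expert_scale
    dense = b.model_gb * (1.0 - b.expert_share)
    return experts + dense


def kv_tokens(b: Budget, cut: float) -> int:
    """Tokens of KV cache that fit in the VRAM left over after the weights."""
    free_gb = b.vram_gb - weights_gb(b, cut)
    if free_gb <= 0:
        return 0  # the weights alone do not fit
    return int(free_gb * 1e9 / (b.kv_kib_per_token * 1024))


if __name__ == "__main__":
    # Placeholder values for expert_share and kv_kib_per_token, not measured.
    b = Budget(vram_gb=384, model_gb=807, expert_share=0.95,
               expert_scale=0.5, kv_kib_per_token=32.0)
    for pct in range(5, 40, 5):
        cut = pct / 100
        print(f"cut {pct:2d}%  weights {weights_gb(b, cut):6.1f} GB  "
              f"kv tokens {kv_tokens(b, cut):,}")
```

Plugging in a measured per-token KV cost and the real expert share would turn this into a quick way to see which cut lands near a 500k or 1.5 million token budget. A per-layer variable cut would need `cut` replaced by a list with one fraction per layer.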