I want to write up my work as I go, not all at once. In that spirit, before I start working on pruning the FlexOlmo models to try using them with variable-sized experts, I want to get a baseline for model pruning. I'm going to prune Olmo 3 7B; I think I'll just prune it down to half its size. I'll be implementing these two Nvidia papers, which follow basically the same process:
I’ll run a few experiments:
Note that pruning only width seems to partly defeat the purpose of making smaller models, in my opinion. Nvidia has released a paper on exactly this, Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models: “While previous work on SLM design has primarily focused on reducing the number of parameters to achieve parameter-optimal SLMs, parameter efficiency does not necessarily translate into proportional real-device speed-ups…we first study latency-optimal depth-width ratios, with the key finding that although deep-thin models generally achieve better accuracy under the same parameter budget, they may not lie on the accuracy-latency trade-off frontier.”
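To make “pruning width” concrete before getting to latency: the basic move is to score channels (or attention heads) on a small calibration set, keep the top-k, and slice the weight matrices down to match. Below is a minimal toy sketch for a single MLP block; the mean-absolute-activation importance score, the module layout, and the sizes are my own illustrative assumptions, not necessarily the exact procedure from the papers above.

```python
# Toy sketch of activation-based width pruning for one MLP block.
# The importance score and sizes here are illustrative assumptions.
import torch
import torch.nn as nn

hidden, intermediate, keep = 1024, 4096, 2048   # prune the FFN width in half

mlp = nn.Sequential(
    nn.Linear(hidden, intermediate),   # up projection
    nn.GELU(),
    nn.Linear(intermediate, hidden),   # down projection
)

# Importance of each intermediate channel = mean |activation| over a calibration batch.
calib = torch.randn(64, hidden)                   # stand-in for real calibration activations
with torch.no_grad():
    acts = mlp[1](mlp[0](calib))                  # (64, intermediate)
importance = acts.abs().mean(dim=0)               # (intermediate,)
keep_idx = importance.topk(keep).indices.sort().values

# Build a narrower MLP by slicing rows of the up-proj and columns of the down-proj.
pruned = nn.Sequential(nn.Linear(hidden, keep), nn.GELU(), nn.Linear(keep, hidden))
with torch.no_grad():
    pruned[0].weight.copy_(mlp[0].weight[keep_idx])
    pruned[0].bias.copy_(mlp[0].bias[keep_idx])
    pruned[2].weight.copy_(mlp[2].weight[:, keep_idx])
    pruned[2].bias.copy_(mlp[2].bias)

print(sum(p.numel() for p in mlp.parameters()), "->",
      sum(p.numel() for p in pruned.parameters()))
```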
Here’s a quick latency benchmark comparing aspect ratios while trying to hold the parameter count approximately the same (gist of the code).
| Config | Layers | Hidden size | Params | Latency |
|---|---|---|---|---|
| Very Deep | 48 | 1536 | 1.52B | 1921ms |
| Deep | 36 | 1792 | 1.57B | 1874ms |
| Balanced | 24 | 2176 | 1.59B | 1836ms |
| Wide | 16 | 2560 | 1.52B | 1695ms |
| Very Wide | 12 | 2944 | 1.55B | 1606ms |
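The gist isn’t inlined here, but the shape of the benchmark is roughly this: build decoder stacks at the depth/width settings from the table, count parameters, and time a forward pass. Everything in the sketch below is an assumption for illustration, not the actual gist: plain `TransformerEncoderLayer` blocks with a causal mask stand in for the real architecture, and the vocab size, sequence length, CPU timing, and rep count are made up, so absolute latencies won’t match the table.

```python
# Minimal latency sketch (not the actual gist): vanilla TransformerEncoderLayer
# stacks with a causal mask stand in for the real model. Vocab size (50k),
# sequence length (1024), CPU timing, and 5 reps are all assumptions.
import time
import torch
import torch.nn as nn

VOCAB, SEQ_LEN, REPS = 50_000, 1024, 5

class ToyDecoder(nn.Module):
    def __init__(self, layers: int, hidden: int):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, hidden)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(
                d_model=hidden,
                nhead=hidden // 128,          # keep head dim fixed at 128
                dim_feedforward=4 * hidden,
                batch_first=True,
            )
            for _ in range(layers)
        )
        self.lm_head = nn.Linear(hidden, VOCAB, bias=False)

    def forward(self, ids):
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        x = self.embed(ids)
        for block in self.blocks:
            x = block(x, src_mask=mask)       # causal mask -> decoder-style block
        return self.lm_head(x)

def benchmark(name: str, layers: int, hidden: int):
    model = ToyDecoder(layers, hidden).eval()
    params = sum(p.numel() for p in model.parameters())
    ids = torch.randint(0, VOCAB, (1, SEQ_LEN))
    with torch.no_grad():
        model(ids)                            # warm-up pass
        start = time.perf_counter()
        for _ in range(REPS):
            model(ids)
        latency_ms = (time.perf_counter() - start) / REPS * 1000
    print(f"{name:10s} layers={layers:2d} hidden={hidden:4d} "
          f"params={params / 1e9:.2f}B latency={latency_ms:.0f}ms")

for name, layers, hidden in [
    ("Very Deep", 48, 1536),
    ("Deep",      36, 1792),
    ("Balanced",  24, 2176),
    ("Wide",      16, 2560),
    ("Very Wide", 12, 2944),
]:
    benchmark(name, layers, hidden)
```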