Pruning Olmo 3 7B

I want to write up my work as I’m working on it, and not all at once. In that spirit, before I start pruning the FlexOlmo models to try using them with variable-sized experts, I want to get a baseline for pruning models. I’m going to prune Olmo 3 7B, and I think I’ll just prune it down to 1/2 of its size. I’ll be implementing these two Nvidia papers (they follow essentially the same process):
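As a preview of that shared recipe, here’s a minimal sketch of the activation-based importance scoring used to decide what to prune along the width axis. The module paths (`model.layers`, `layer.mlp.act_fn`) are hypothetical stand-ins, not Olmo’s actual attribute names:

```python
import torch

@torch.no_grad()
def mlp_neuron_importance(model, calibration_batches):
    """Score each intermediate MLP neuron by its mean absolute activation
    over a small calibration set; the lowest-scoring neurons get pruned.
    NOTE: module paths (model.layers, layer.mlp.act_fn) are hypothetical."""
    scores = [None] * len(model.layers)
    hooks = []
    for i, layer in enumerate(model.layers):
        def hook(module, inputs, output, i=i):
            # |activation| averaged over batch and sequence dimensions
            s = output.abs().mean(dim=(0, 1)).detach().cpu()
            scores[i] = s if scores[i] is None else scores[i] + s
        hooks.append(layer.mlp.act_fn.register_forward_hook(hook))
    for batch in calibration_batches:
        model(batch)
    for h in hooks:
        h.remove()
    return scores  # one importance vector per layer; keep the top-k neurons
```

Depth pruning works analogously, scoring whole layers instead of individual neurons.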

Experiments

I’ll run a few experiments:

Note that pruning only width seems to slightly defeat the purpose of making smaller models, in my opinion. Nvidia has released a paper on exactly this: Nemotron-Flash: Towards Latency-Optimal Hybrid Small Language Models: “While previous work on SLM design has primarily focused on reducing the number of parameters to achieve parameter-optimal SLMs, parameter efficiency does not necessarily translate into proportional real-device speed-ups…we first study latency-optimal depth-width ratios, with the key finding that although deep-thin models generally achieve better accuracy under the same parameter budget, they may not lie on the accuracy-latency trade-off frontier.”

Here’s a quick latency benchmark comparing aspect ratios, trying to hold the parameter count approximately constant (gist of the code).

| Config    | Layers | Hidden size | Params | Latency (ms) |
|-----------|--------|-------------|--------|--------------|
| Very Deep | 48     | 1536        | 1.52B  | 1921         |
| Deep      | 36     | 1792        | 1.57B  | 1874         |
| Balanced  | 24     | 2176        | 1.59B  | 1836         |
| Wide      | 16     | 2560        | 1.52B  | 1695         |
| Very Wide | 12     | 2944        | 1.55B  | 1606         |
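The real code is in the gist above; the shape of it is just a timed generate loop over synthetic configs. Here’s a sketch along those lines, where the head counts and MLP intermediate sizes are my guesses, so the parameter counts will only roughly match the table:

```python
import time
import torch
from transformers import LlamaConfig, LlamaForCausalLM

# (layers, hidden) pairs from the table; intermediate size and head
# count are guesses, so parameter counts won't match the table exactly
CONFIGS = {
    "Very Deep": (48, 1536),
    "Deep": (36, 1792),
    "Balanced": (24, 2176),
    "Wide": (16, 2560),
    "Very Wide": (12, 2944),
}

device = "cuda" if torch.cuda.is_available() else "cpu"

for name, (layers, hidden) in CONFIGS.items():
    cfg = LlamaConfig(
        num_hidden_layers=layers,
        hidden_size=hidden,
        intermediate_size=int(2.75 * hidden),
        num_attention_heads=hidden // 128,  # keeps head_dim fixed at 128
        vocab_size=32_000,
    )
    model = LlamaForCausalLM(cfg).eval().to(device)
    n_params = sum(p.numel() for p in model.parameters())
    ids = torch.randint(0, cfg.vocab_size, (1, 128), device=device)
    with torch.no_grad():
        model.generate(ids, max_new_tokens=64)  # warmup
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        model.generate(ids, max_new_tokens=64)
        if device == "cuda":
            torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - start) * 1000
    print(f"{name}: {n_params / 1e9:.2f}B params, {latency_ms:.0f}ms")
```

The trend matches the Nemotron-Flash finding: at a fixed parameter budget, fewer, wider layers mean fewer sequential launches per token, so the wide configs come out faster despite similar parameter counts.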