Does the Teacher Matter?
Work in progress.
The plan is to pretrain a fully distilled NanoGPT-aloid using off-policy distillation. We start with OLMo 3 7B’s base, instruct, and thinking variants to see whether the post-training variant makes a difference, then test quantized teachers, and finally move on to larger, smarter models (OLMo 3 32B, Qwen, GLM 4.7 Flash, and GPT-OSS 120B) with whatever setup wins. We’ll have to switch tokenizers along the way, but that seems like an acceptable trade-off.
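Concretely, the per-token objective for off-policy distillation can be sketched as a forward KL from the teacher’s next-token distribution to the student’s, computed on a fixed corpus rather than on student samples. A minimal illustration, assuming teacher and student share a vocabulary (the function names here are ours, not from any library):

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def distill_loss(teacher_logits, student_logits, temperature=1.0):
    """Forward KL(teacher || student) for one token position."""
    p = softmax([x / temperature for x in teacher_logits])
    q = softmax([x / temperature for x in student_logits])
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)

# Identical distributions give zero loss; disagreement gives a positive loss.
print(distill_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # → 0.0
print(distill_loss([0.0, 5.0], [5.0, 0.0]))
```

Once the tokenizer changes (phase 3 onward), the two distributions live over different vocabularies, so this direct per-token KL no longer applies as written and some alignment or sequence-level scheme is needed instead.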
Questions
- Does the teacher’s post-training variant matter (base vs instruct vs think)?
- Does quantizing the teacher matter?
- Do the teacher’s model family and size matter?
- Does tokenizer mismatch matter?
Model Setup, Baseline
The model we’re using here, as alluded to, is a NanoGPT-aloid based on the OLMo 3 architecture, just shrunk way down. I thought it’d be fun to use GQA too.
| Parameter | Value |
|---|---|
| hidden_size | 768 |
| num_hidden_layers | 12 |
| num_attention_heads | 6 (Q heads) |
| num_key_value_heads | 2 (KV heads, 3:1 GQA ratio) |
| head_dim | 128 (768/6) |
| intermediate_size | 2048 (SwiGLU MLP) |
| max_position_embeddings | 2048 |
| vocab_size | 100,278 (OLMo tokenizer) |
| tie_word_embeddings | False |
That gets us to ~75M parameters in the transformer core and ~154M embedding parameters (hahaha), for a total of ~229M. To get a nice baseline, even though the model is dominated by the embedding parameters, we train on 5B tokens, a little over Chinchilla-optimal for the total parameter count (ignoring the fact that so many parameters come from the embeddings). That brings us down to a val BPB of 0.96437.
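As a sanity check on those numbers, here’s a back-of-envelope count of the weight matrices implied by the config table (ignoring RMSNorm gains and any biases), plus the ~20-tokens-per-parameter Chinchilla rule of thumb:

```python
# Parameter count for the NanoGPT-aloid config above (weights only).
hidden = 768
layers = 12
q_heads, kv_heads, head_dim = 6, 2, 128
intermediate = 2048
vocab = 100_278

attn = hidden * (q_heads * head_dim)         # Q projection
attn += 2 * hidden * (kv_heads * head_dim)   # K and V projections (GQA)
attn += (q_heads * head_dim) * hidden        # output projection
mlp = 3 * hidden * intermediate              # SwiGLU: gate, up, down
core = layers * (attn + mlp)                 # ~75.5M
embed = 2 * vocab * hidden                   # untied input + output embeddings, ~154M

chinchilla_tokens = 20 * (core + embed)      # ~4.6B, so 5B is "a little over"
print(core, embed, core + embed, chinchilla_tokens)
```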
Experiment Plan
| Phase | Teacher | Architecture | Quant | Tokenizer | Testing |
|---|---|---|---|---|---|
| 1 | OLMo 3 7B Base | Dense 7B | bf16 | OLMo | Post-training variant |
| 1 | OLMo 3 7B Instruct | Dense 7B | bf16 | OLMo | Post-training variant |
| 1 | OLMo 3 7B Think | Dense 7B | bf16 | OLMo | Post-training variant |
| 2 | OLMo 3 7B (phase 1 winner) | Dense 7B | 4-bit | OLMo | Quantization |
| 3 | OLMo 3 32B | Dense 32B | best | OLMo | Bigger same-family |
| 3 | Qwen 3 32B | Dense 32B | best | Qwen | Dense, different family |
| 3 | Qwen 3 30B-A3B | MoE 30B-A3B | best | Qwen | Dense vs MoE (same family) |
| 3 | GLM 4.7 Flash | MoE 31B-A3B | best | GLM | Strongest MoE |
| 3 | GPT-OSS 120B | MoE (sparse) | best | GPT-OSS | Massive scale |
| 4 | Qwen base vs instruct | — | best | Qwen | Spot-check |
TODO
- Calculate teacher FLOPs per token for each model once we settle on a token budget (Chinchilla-optimal for a 125M NanoGPT is ~2.5B tokens, but distillation should need far fewer)
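For that FLOPs TODO, the usual matmul-only approximation is ~2 × active parameters per token for a forward pass (ignoring attention-score FLOPs). A rough sketch; the MoE active-parameter counts are assumptions taken from the experiment table, and the GPT-OSS 120B figure is our guess:

```python
# Rough teacher forward-pass cost: ~2 * active params FLOPs per token.
# Active-parameter counts for the MoE teachers are assumptions.
teachers = {
    "OLMo 3 7B": 7e9,
    "OLMo 3 32B": 32e9,
    "Qwen 3 32B": 32e9,
    "Qwen 3 30B-A3B": 3e9,    # MoE: ~3B active
    "GLM 4.7 Flash": 3e9,     # MoE: ~3B active (per the A3B label above)
    "GPT-OSS 120B": 5.1e9,    # MoE: ~5.1B active — assumption
}
flops_per_token = {name: 2 * n_active for name, n_active in teachers.items()}
for name, f in flops_per_token.items():
    print(f"{name}: ~{f:.2e} FLOPs/token")
```

Multiplying by the eventual token budget then gives the total teacher compute per run, which is what should drive the phase 3 teacher choice if quality is comparable.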
Results
The baseline model is defined as follows: