Andrej Karpathy’s NanoGPT is a hackable library for training language models. In his inimitable style, Karpathy shows anyone who wants to learn exactly how pretraining for LLMs is done. Here, I’d like to add support for Mixture of Experts (MoE) style models. Over the next (generic period of time), I’ll be working on learning more about MoE models, and extending NanoGPT with MoE support feels like a good place to start. I’m also interested in upcycling something like SmolLM, though I’m not 100% sure how much compute that would take. Additionally, I’m fascinated by what’s really going on inside these kinds of models. Are they actually learning some sort of expertise? For example, in a given MoE model, is there some notion of a “math expert”?
Let’s back up. What is an expert in the context of LLMs? Each layer of a standard transformer model consists of two main parts: the attention block, and the feedforward or MLP (multilayer perceptron) block. The feedforward block is what we modify in MoE models. In a regular (most commonly known as dense) transformer model, this feedforward block is a fully connected (linear) layer that expands the representation to 4x the hidden size of the model. Then an activation function is applied (most often SwiGLU or GeGLU these days; GPT-2/NanoGPT uses GELU). Finally, another fully connected layer brings us back down to the hidden size of the model.
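For concreteness, here’s roughly what that dense block looks like in PyTorch. This is a minimal sketch in the spirit of NanoGPT’s own MLP module, with dropout and bias handling left out:

```python
import torch.nn as nn

class MLP(nn.Module):
    """Dense feedforward block: expand to 4x the hidden size, apply GELU, project back down."""
    def __init__(self, n_embd: int):
        super().__init__()
        self.c_fc = nn.Linear(n_embd, 4 * n_embd)    # up-projection to 4x hidden size
        self.gelu = nn.GELU()                        # GPT-2/NanoGPT-style activation
        self.c_proj = nn.Linear(4 * n_embd, n_embd)  # down-projection back to hidden size

    def forward(self, x):
        return self.c_proj(self.gelu(self.c_fc(x)))
```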
MoEs follow this same process, but instead of having one large linear layer, they use a set of smaller ones, known as experts. We also add in a fully connected layer before the MLP block, known as the router. As the name suggests, the router is responsible for deciding which experts should be active for a given token.
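Here’s a minimal sketch of that routing step (the names and the top-k choice are mine, not from any particular codebase): the router is just a linear layer from the hidden size to the number of experts, and each token keeps its top-k highest-scoring experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Router(nn.Module):
    """Scores every token against every expert and picks the top-k experts per token."""
    def __init__(self, n_embd: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(n_embd, n_experts, bias=False)  # hidden size -> one logit per expert

    def forward(self, x):
        # x: (num_tokens, n_embd)
        logits = self.gate(x)                                       # (num_tokens, n_experts)
        probs = F.softmax(logits, dim=-1)
        weights, expert_ids = torch.topk(probs, self.top_k, dim=-1)  # both (num_tokens, top_k)
        weights = weights / weights.sum(dim=-1, keepdim=True)        # renormalize over the chosen experts
        return weights, expert_ids
```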
The vanilla Hugging Face transformers version of MoEs loops over the experts. They’re fixing this, but for now that approach isn’t really workable if we want the training efficiency gains that MoEs are well known for. OLMoE’s paper reports that their setup uses ~3x fewer FLOPs than the dense comparison, which translated to ~2x faster training.
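In spirit, the loop version looks something like this (a simplified sketch, not Hugging Face’s actual code): for each expert, gather the tokens routed to it, run that expert’s MLP, and add the gate-weighted result back. With many experts, that’s many small, strictly sequential matmuls per layer.

```python
import torch

def moe_forward_loop(x, router, experts):
    """Naive MoE forward: x is (num_tokens, n_embd), router as sketched above,
    experts is a list of per-expert MLP modules."""
    weights, expert_ids = router(x)                      # each (num_tokens, top_k)
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        # Which (token, slot) pairs picked expert e?
        token_idx, slot = torch.where(expert_ids == e)
        if token_idx.numel() == 0:
            continue
        expert_out = expert(x[token_idx])                # one small matmul per expert, run one after another
        out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert_out
    return out
```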
So, we turn to Megablocks. Megablocks, using some clever tricks, grants a huge speedup over the for-loop version. Since we’re dealing with matrices, it’s totally possible to parallelize the computation of the experts by essentially stacking them into one big tensor. This comes with two big drawbacks:
Here’s my first pass at the Triton kernels to do the forward pass. This first one corresponds to the matrix multiplication that takes the batch of tokens and multiplies it by the expert matrix. There is a little bit of tensor manipulation that we do in PyTorch to put everything in the correct place and get the right shapes; see the full implementation if you’re interested.
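To give a flavor of the kind of kernel involved, here is an illustrative grouped-matmul sketch rather than the exact kernel: the pointer names, block sizes, and the assumption that tokens are pre-sorted by expert (and padded so that every BLOCK_M-row block belongs to a single expert) are all mine.

```python
import triton
import triton.language as tl

@triton.jit
def expert_up_proj_kernel(
    x_ptr, w_ptr, out_ptr, block_expert_ptr,
    M, N, K,
    stride_xm, stride_xk,
    stride_we, stride_wk, stride_wn,
    stride_om, stride_on,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    # Launched with a (cdiv(M, BLOCK_M), cdiv(N, BLOCK_N)) grid; each program
    # computes one BLOCK_M x BLOCK_N tile of the output. x is the (M, K) batch of
    # routed tokens, w is the (n_experts, K, N) stack of expert weights, and
    # block_expert_ptr holds one expert id per BLOCK_M-row block of tokens.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    expert = tl.load(block_expert_ptr + pid_m)

    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    x_ptrs = x_ptr + offs_m[:, None] * stride_xm + offs_k[None, :] * stride_xk
    w_ptrs = w_ptr + expert * stride_we + offs_k[:, None] * stride_wk + offs_n[None, :] * stride_wn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        x = tl.load(x_ptrs, mask=(offs_m[:, None] < M) & (offs_k[None, :] + k < K), other=0.0)
        w = tl.load(w_ptrs, mask=(offs_k[:, None] + k < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(x, w)
        x_ptrs += BLOCK_K * stride_xk
        w_ptrs += BLOCK_K * stride_wk

    out_ptrs = out_ptr + offs_m[:, None] * stride_om + offs_n[None, :] * stride_on
    tl.store(out_ptrs, acc, mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))
```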
The second takes that sparse matrix and sends it back to the original hidden size of the model:
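Structurally, it’s the mirror image of the first matmul, just with the stacked down-projection weights. As a plain-PyTorch reference for what the two matmuls plus the surrounding gather/scatter compute end to end, here’s a deliberately simple sketch with made-up names, handy mostly for checking kernel outputs against:

```python
import torch
import torch.nn.functional as F

def moe_forward_reference(x, router, w_up, w_down):
    """x: (num_tokens, H); w_up: (n_experts, H, 4H); w_down: (n_experts, 4H, H)."""
    weights, expert_ids = router(x)                      # each (num_tokens, top_k)
    num_tokens, top_k = expert_ids.shape

    # Flatten the (token, slot) pairs and sort them by expert so every expert's
    # tokens are contiguous -- the layout the kernels operate on.
    flat_experts = expert_ids.reshape(-1)                # (num_tokens * top_k,)
    order = torch.argsort(flat_experts)
    token_of_pair = torch.arange(num_tokens, device=x.device).repeat_interleave(top_k)[order]
    sorted_experts = flat_experts[order]
    x_sorted = x[token_of_pair]                          # (num_tokens * top_k, H)

    # First matmul: each routed token times its expert's up-projection, then the activation.
    # (Gathering a full weight matrix per token is memory-hungry; it's only a reference.)
    h = F.gelu(torch.einsum("th,thf->tf", x_sorted, w_up[sorted_experts]))
    # Second matmul: back down to the model's hidden size.
    y = torch.einsum("tf,tfh->th", h, w_down[sorted_experts])

    # Scatter the gate-weighted expert outputs back to the original token order.
    out = torch.zeros_like(x)
    out.index_add_(0, token_of_pair, y * weights.reshape(-1)[order].unsqueeze(-1))
    return out
```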