Efficient Mixtures of Experts with Block-Sparse Matrices: MegaBlocks
A new compute paradigm makes sparse Mixtures of Experts more scalable than ever before
The sparse Mixtures of Experts (MoE) algorithm has been a game-changer in Machine Learning, allowing us to scale up modeling capacity with almost constant computational complexity, resulting in a new generation of LLMs with unprecedented performance, including the Switch Transformer, GPT-4, Mixtral-8x7b, and more. Really, we’re just starting to see the impact of this powerful innovation.
However, Machine Learning runs on GPUs, and GPUs are optimized for dense, not sparse, operations. One of the most challenging research questions around sparse MoE is therefore how to run it efficiently on machines that were never intended to run sparse computations in the first place.
This week, we’ll meet MegaBlocks (Gale et al., 2022), perhaps the most efficient and scalable sparse MoE implementation in the industry today. The key idea is to re-formulate sparse MoE as a single block-sparse matrix multiplication instead of multiple dense matrix multiplications, by leveraging a data format specifically designed to store and manipulate sparse matrices. The resulting algorithm handles the dynamic computational demands of sparse MoE without the load balancing, padding, or token dropping that were standard in previous work.
But first, a little bit of historical context.
The curse of the capacity factor
In sparse MoE, the router determines which tokens in a batch of data to route to which expert. This can introduce expert imbalance: some experts may receive a large number of tokens, and some very few. This is a problem because, in order to run the forward pass over all of the experts most efficiently, we want to run all of them at the same time in a single, fused operation, and this requires the inputs to all experts to have the same shape.
The Switch Transformer solved this problem by introducing the capacity factor f, a new hyperparameter that controls how many tokens are allowed per expert. Formally:
capacity = f × T / E
where T is the number of tokens and E is the total number of experts. For example, with 10K tokens, 100 experts, and f=1, we’d force all active experts to process 100 tokens per batch, no more and no less.
If an expert ends up with more tokens than its capacity allows, we need to drop the extra tokens. In contrast, if the expert ends up with too few tokens, we pad the input to ensure consistent input sizes. Thus, no matter how we tune f, we take a hit somewhere: if f is too large, the hit is in efficiency due to all that extra padding. If f is too small, the hit is in modeling accuracy due to dropped tokens.
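To make this concrete, here’s a tiny Python sketch (with made-up per-expert token counts, not numbers from the paper) of how the capacity from the formula above translates into dropped and padded tokens:

```python
# Toy illustration of the capacity factor (top-1 routing assumed):
# 10K tokens, 100 experts, f = 1 gives a capacity of 100 tokens per expert.
T, E, f = 10_000, 100, 1.0
capacity = int(f * T / E)  # 100

# Hypothetical token counts for three of the 100 experts, from an imbalanced router:
tokens_per_expert = [180, 100, 20]
for n in tokens_per_expert:
    dropped = max(0, n - capacity)   # tokens above capacity are dropped
    padded = max(0, capacity - n)    # tokens below capacity are padded
    print(f"received {n} tokens -> drop {dropped}, pad {padded}")

# received 180 tokens -> drop 80, pad 0
# received 100 tokens -> drop 0, pad 0
# received 20 tokens -> drop 0, pad 80
```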
This is the curse of the capacity factor: no matter how we set it, we lose something. Ideally, we’d neither pad nor drop tokens.
Enter the block-sparse compute paradigm.
Sparse MoEs as block-sparse matrix multiplication
In the Switch Transformer, we formulate the forward pass of the MoE layer as E matrix multiplications with identically shaped inputs and outputs, which can be illustrated as follows:
The “trick” in MegaBlocks is to re-formulate this computation as a single block-sparse matrix multiplication, which can be illustrated as follows:
The left side of this figure shows the token representations (blue) and the expert hidden layers (orange), where the dashed lines mark the boundaries between experts. Because each token is processed by just one expert, this multiplication results in the sparse matrix shown on the right, where almost all cells are zero (gray) and only a few are non-zero (pink).
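To see that nothing is lost in this reformulation, here is a small sanity check in plain PyTorch (a toy illustration, not the MegaBlocks kernels): the per-expert matmuls and the single block-sparse product contain exactly the same non-zero blocks.

```python
import torch

# Toy check: per-expert dense matmuls vs. one block-sparse product.
T, h, f, E = 4, 4, 8, 2                    # tokens, model dim, expert hidden dim, experts
tokens = torch.randn(T, h)                 # token representations, sorted by expert
assignment = torch.tensor([0, 0, 0, 1])    # imbalanced routing: 3 tokens vs. 1 token
experts = torch.randn(E, h, f)             # hidden layers of the E experts

# View 1: one dense matmul per expert (the Switch Transformer formulation).
per_expert = [tokens[assignment == e] @ experts[e] for e in range(E)]

# View 2: one multiplication against the concatenated experts; the result is a
# T x (E*f) matrix that is zero outside each token's assigned expert block.
full = tokens @ experts.permute(1, 0, 2).reshape(h, E * f)
block_sparse = torch.zeros_like(full)
for t, e in enumerate(assignment.tolist()):
    block_sparse[t, e * f:(e + 1) * f] = full[t, e * f:(e + 1) * f]

print(torch.allclose(block_sparse[:3, :f], per_expert[0]))  # True
print(torch.allclose(block_sparse[3:, f:], per_expert[1]))  # True
```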
In this block-sparse matrix, a “cell” is a fixed-size sub-matrix that is computed by a group of threads on the GPU (a thread block). The size of this cell is one of the hyperparameters in MegaBlocks and needs to be carefully tuned: too small, and we don’t fully take advantage of GPU acceleration. Too large, and we limit the expert granularity, that is, the smallest number of tokens that can be routed to a single expert, which can limit model performance. Empirically, the authors find the best performance with a block size of 128x128.
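As a toy look at the granularity side of that trade-off (with made-up token counts, not numbers from the paper): each expert’s group of tokens has to be rounded up to a multiple of the block size, so larger blocks waste more rows when routing is uneven.

```python
# Made-up per-expert token counts; each is rounded up to a multiple of the block size.
tokens_per_expert = [7, 130, 260]
for block in (32, 128, 256):
    padded_rows = sum(-(-n // block) * block - n for n in tokens_per_expert)
    print(f"block size {block:3d}: {padded_rows} wasted (padded) rows")

# block size  32: 83 wasted (padded) rows
# block size 128: 371 wasted (padded) rows
# block size 256: 627 wasted (padded) rows
```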
The BCSR data format
In order to make these block-sparse computations work, MegaBlocks introduces a dedicated data format, the blocked compressed sparse row (BCSR) format, illustrated below:
In this format, we store the non-empty blocks of the sparse matrix along with their indices (which block-column each block sits in) and offsets (where each block-row’s run of blocks starts and ends). This is extremely efficient: instead of running a calculation over the entire matrix, we just loop over the non-empty blocks, with the indices and offsets telling us where in the matrix we currently are.
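Here is a minimal sketch of the idea in PyTorch, using an illustrative layout rather than MegaBlocks’ exact encoding (SciPy’s bsr_matrix follows the same principle):

```python
import torch

# A 4x6 matrix stored with 2x2 blocks: a 2x3 grid of blocks, only two of which are non-zero.
dense = torch.zeros(4, 6)
dense[0:2, 0:2] = 1.0     # non-zero block at block-row 0, block-column 0
dense[2:4, 4:6] = 2.0     # non-zero block at block-row 1, block-column 2

# BCSR keeps only the non-zero blocks plus two index arrays:
data = torch.stack([dense[0:2, 0:2], dense[2:4, 4:6]])  # the non-zero blocks themselves
col_indices = [0, 2]      # block-column of each stored block
row_offsets = [0, 1, 2]   # block-row r owns data[row_offsets[r]:row_offsets[r+1]]

# A computation now only loops over the non-empty blocks:
for r in range(len(row_offsets) - 1):
    for i in range(row_offsets[r], row_offsets[r + 1]):
        print(f"block at block-row {r}, block-column {col_indices[i]}:\n{data[i]}")
```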
By the way, native PyTorch doesn’t provide kernels for this block-sparse format, so the authors implemented their own block-sparse kernels.
Dropless MoE
MegaBlocks refers to the overall computational framework that leverages the block-sparse compute paradigm; the particular sparse MoE implementation it enables is called “dropless MoE”, or dMoE, because no tokens need to be dropped for it to work, unlike in other solutions such as the Switch Transformer.
With two-layer experts, a hidden layer and an output layer, the dropless MoE implementation then boils down to just 7 lines of code:
At a high level, this is how it works:
Line 7: runs the forward pass over the router, returning the token weights that determine which token is being routed to which expert.
Line 12: creates the block-sparse matrix topology using the block size. This topology is used in the block-sparse matrix multiplication.
Lines 15, 26: the permute (gather) and un-permute (scatter) steps. At a high level, the first sorts all of the tokens by expert, and the second undoes that sorting. This step is critical because it creates the nice, large blocks in the block-sparse matrix. (“Padded” because we have to pad each expert’s inputs to at least match the block size.)
Lines 22, 23: here we run the block-sparse matrix multiplications, called SDD and DSD (a toy version of the whole forward pass is sketched at the end of this section):
SDD takes as inputs two dense matrices (the tokens and the experts’ hidden layers) and returns a block-sparse matrix, which corresponds to the experts’ hidden-layer outputs, and
DSD takes as inputs a block-sparse matrix (the first layer’s output) and a dense matrix (the second layer’s weights) and returns a dense matrix, which corresponds to the second layer’s output and, in this case, the MoE output.
This example is for two-layer experts, and we can add more layers simply by adding more SDD-DSD pairs.
Lastly, you may be wondering why we multiply the outputs by the weights (Line 27). With hard routing this is not strictly necessary, but the weights still carry additional information that may be useful for the model: if the expert’s weight was 0.99, say, we know that the router was much more certain in selecting this particular expert than if it were just 0.5. It’s a way to propagate router uncertainty into the final model prediction.
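To tie the walkthrough together, here is a runnable toy version of the forward pass just described. It is not the authors’ code: the block-sparse SDD and DSD kernels are emulated with plain per-expert matmuls and the router is a simple top-1 softmax, but the gather, pad, compute, scatter, and re-weight structure is the same.

```python
import torch

def toy_dmoe_forward(x, router_w, w1, w2, block_size=4):
    """Toy, dense emulation of a two-layer dMoE forward pass (top-1 routing).

    The real MegaBlocks SDD/DSD kernels over a BCSR topology are replaced by
    plain per-expert matmuls so the example stays runnable end to end."""
    T, h = x.shape
    E = w1.shape[0]

    # 1) Route: top-1 expert per token plus its routing weight.
    probs = torch.softmax(x @ router_w, dim=-1)            # (T, E)
    weights, expert_ids = probs.max(dim=-1)                # (T,), (T,)

    # 2) Permute (gather): sort tokens by assigned expert.
    order = torch.argsort(expert_ids)
    x_sorted, ids_sorted = x[order], expert_ids[order]

    y_sorted = torch.zeros(T, w2.shape[-1])
    for e in range(E):
        idx = (ids_sorted == e).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:
            continue
        # 3) Pad this expert's tokens up to a multiple of the block size
        #    (this is the "padded" part of the padded gather/scatter).
        pad = (-idx.numel()) % block_size
        xe = torch.cat([x_sorted[idx], torch.zeros(pad, h)], dim=0)
        # 4) "SDD" then "DSD", emulated densely: hidden layer, then output layer.
        he = torch.relu(xe @ w1[e])
        ye = he @ w2[e]
        y_sorted[idx] = ye[: idx.numel()]

    # 5) Un-permute (scatter) back to the original token order,
    #    then scale each token's output by its routing weight.
    y = torch.empty_like(y_sorted)
    y[order] = y_sorted
    return weights.unsqueeze(-1) * y

# Usage: 10 tokens, model dim 8, 4 experts with hidden dim 16.
x = torch.randn(10, 8)
out = toy_dmoe_forward(x, torch.randn(8, 4), torch.randn(4, 8, 16), torch.randn(4, 16, 8))
print(out.shape)  # torch.Size([10, 8])
```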
Experimental results
The authors compared the performance of MegaBlocks against the dense Megatron language model (Shoeybi et al., 2019), as well as against Tutel MoE (Hwang et al., 2022), another sparse MoE implementation that is more similar to the Switch Transformer, albeit with a self-tuning capacity factor. All models were trained on The Pile on a cluster of NVIDIA A100 SXM4 GPUs with 80GB memory.
Here’s what they found:
compared to Megatron, MegaBlocks dMoE achieves a 2.4x speed-up, that is, it reaches the same accuracy in less than half the time, and
compared to Tutel MoE, MegaBlocks dMoE achieves a speed-up of up to 1.4x,
where the speed-up depends on the model size: the bigger the model size, the higher the speed-up. MegaBlocks, in other words, benefits much more from scale than the other models.
Summary
Let’s recap:
The key idea behind MegaBlocks is to re-formulate the sparse MoE as a single, block-sparse matrix multiplication instead of one matrix multiplication per expert.
In order for this to work, we need an efficient way to store and manipulate block-sparse matrices on GPUs: that’s the BCSR (block compressed sparse row) format.
In contrast to competing algorithms, MegaBlocks dropless MoE allows us to scale up Transformer-based LLMs without the need for a capacity factor or load-balancing losses. The underlying compute paradigm is built to handle imbalanced expert utilization.
The authors find a 2.4x speed-up compared to a dense Transformer model (Megatron-LM) and a speed-up of up to 1.4x compared to Tutel MoE.
MegaBlocks is a fascinating innovation and a good example of thinking outside the box: while previous works tried to find ways to force experts to receive inputs of the same shape so that they could be run efficiently, the authors of MegaBlocks instead re-phrased the problem itself, essentially turning it into a question of the right sparse data format and compute kernels.
And with success: MegaBlocks is one of the most efficient sparse MoE implementations in the industry today.