DeepSeek has, without a doubt, shaken up the AI world. DeepSeek-V3 — a large MoE model with 671B total parameters and 37B activated parameters trained on 14.8T tokens — is the strongest open-source LLM in the industry today, achieving performance on par with closed-source competitors like GPT-4o and Claude-3.5 Sonnet, all while staying relatively economical, requiring less than 3M H800 GPU hours for its full training.
The success of the model can be attributed to a combination of innovations in model architecture, training infrastructure, and the training protocol. This is a lot of content to cover, too much for a single blog post, so here we’ll focus just on the architectural innovations, in particular multi-head latent attention, DeepSeekMoE (DeepSeek’s Mixture of Experts implementation), and multi-token prediction. Buckle up.
Multi-head latent attention
As a reminder, attention is simply a formula for assigning relevance scores to all tokens in a sequence given a new “query” token, as follows:
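(Reproduced here in its standard scaled dot-product form, with $d_k$ the key dimension:)

$$
\mathrm{Attention}(q, K, V) = \mathrm{softmax}\!\left(\frac{q K^{\top}}{\sqrt{d_k}}\right) V
$$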
where q is the query vector of the new token, and K and V stack the key and value vectors representing the content of the sequence, produced by the key and value projection matrices, respectively.
When a language model generates text using multi-head attention, it predicts one token at a time. Naively, the computational complexity of each step would be O(N^2), because we’d need to re-compute the keys, values, and attention scores for all tokens in the sequence for every new query. In practice, though, we can reduce this to O(N) per step by simply storing the keys and values in GPU memory and reusing them, a trick known as “KV caching”.
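A minimal sketch of the idea, in plain NumPy for a single attention head (the shapes and names are illustrative, not DeepSeek’s code):

```python
import numpy as np

d = 64                                   # head dimension (hypothetical)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

K_cache, V_cache = [], []                # grows by one entry per generated token

def attend(h_new):
    """One decoding step with KV caching: only the new token's key and
    value are computed; all previous ones are reused from the cache."""
    K_cache.append(W_k @ h_new)          # O(1) work for the new key
    V_cache.append(W_v @ h_new)          # O(1) work for the new value
    q = W_q @ h_new
    K, V = np.stack(K_cache), np.stack(V_cache)
    scores = K @ q / np.sqrt(d)          # O(N) dot products against the cache
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over the N cached positions
    return weights @ V                   # attention output for the new token

# Autoregressive decoding: feed hidden states one token at a time.
for _ in range(5):
    out = attend(rng.standard_normal(d))
```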
The key idea behind DeepSeek’s multi-head latent attention (MLA) is to apply low-rank compression to the keys and values prior to caching, hence reducing the memory footprint, which in turn allows the model to scale to longer sequences more efficiently. Technically, the compression is done by introducing additional matrices:
a down-projection matrix W^{DKV} for keys and values,
an up-projection matrix W^{UK} for keys,
another up-projection matrix W^{UV} for values.
Here is how these matrices are used to compute the key and value embeddings during inference:
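(A sketch using the matrix names above, with $h_t$ the attention-layer input for token $t$; MLA’s small decoupled rotary-position key path is omitted for brevity:)

$$
c_t^{KV} = W^{DKV} h_t, \qquad k_t = W^{UK} c_t^{KV}, \qquad v_t = W^{UV} c_t^{KV}
$$

Only the low-dimensional latent $c_t^{KV}$ has to be kept in the cache; the keys and values can be re-expanded from it on the fly, and the up-projections can even be absorbed into the query and output projections.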
How much memory does this save? Assuming 128 attention heads with a dimension of 128 each, and a latent (compressed) dimension of 512, the compression factor is 128x128/512 = 32. In other words, with the same amount of memory, MLA can fit the KV cache for 32x longer sequences than standard (uncompressed) multi-head attention.
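The arithmetic, spelled out (counting only one of the key/value streams, as the paragraph above does):

```python
n_heads, head_dim, latent_dim = 128, 128, 512      # numbers from the paragraph above

per_token_keys   = n_heads * head_dim              # 16,384 cached values per token (keys alone)
per_token_latent = latent_dim                      # 512 cached values per token with MLA

print(per_token_keys // per_token_latent)          # 32 -> the compression factor quoted above
```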
DeepSeekMoE
DeepSeek V3 is a Mixture of Experts (MoE) model, which means that instead of a single FFN layer following the attention block, it has a large number of FFN “experts”, each specializing in different kinds of input tokens. Thanks to hard routing, i.e. only activating the top k experts for each input token, this allows us to scale up model capacity while keeping the compute per token roughly constant, by parallelizing the experts across the training cluster.
This in and of itself is nothing new, and we’ve covered MoE-style LLMs at length in our discussions of the Switch Transformer, MegaBlocks, BASE, and Mixtral of Experts. However, DeepSeek introduces three notable novelties that make their MoE model stand out from the rest of the industry today:
1 - Hybrid routing strategy. DeepSeek uses a hybrid of soft routing and hard routing. As a reminder, in soft routing we compute the weighted sum over all expert outputs, whereas in hard routing we limit the sum to the top k experts with the highest routing scores. In the hybrid version, we have a combination of shared experts and a pool of routed experts, of which only the top k are activated for each input token. Then, the output of the MoE layer is a weighted sum over the shared and routed experts, where the shared experts’ weights are 1 and the routed experts’ weights are the router scores, as follows:
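(In symbols, writing $u_t$ for the token’s hidden state, $N_s$ and $N_r$ for the numbers of shared and routed experts, and $g_{i,t}$ for routed expert $i$’s router score, set to zero whenever the expert is not in the top $k$ for token $t$:)

$$
h_t' = u_t + \sum_{i=1}^{N_s} \mathrm{FFN}_i^{(s)}(u_t) + \sum_{i=1}^{N_r} g_{i,t}\,\mathrm{FFN}_i^{(r)}(u_t)
$$

(DeepSeek-V3 additionally normalizes the selected routed experts’ scores so they sum to one, a detail glossed over here.)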
Unlike standard MoE implementations, DeepSeek uses a per-expert Sigmoid instead of a Softmax to compute the router scores. This decouples the router scores from each other, which is important for the next trick, dynamic load balancing.
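A toy illustration of the decoupling (hypothetical router logits, not DeepSeek’s actual numbers):

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5])                 # router logits for three experts (made up)

softmax = np.exp(logits) / np.exp(logits).sum()
sigmoid = 1 / (1 + np.exp(-logits))

# Nudge only expert 0's logit:
nudged = logits + np.array([-0.5, 0.0, 0.0])

print(np.exp(nudged) / np.exp(nudged).sum())       # all three softmax scores shift: they are coupled
print(1 / (1 + np.exp(-nudged)))                   # only expert 0's sigmoid score shifts: independent
```

With a softmax, any change to one expert’s logit redistributes probability across all experts; with per-expert sigmoids, each score can be adjusted in isolation, which is what the bias-based balancing below relies on.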
2 - Dynamic load balancing. Load balancing — making sure all experts and hence all GPUs inside the training cluster receive the same number of tokens during training — has been one of the most difficult challenges in sparse MoEs. So far, the status quo has been to introduce either load balancing losses (e.g. Switch Transformer) or customized compilers (e.g. MegaBlocks). DeepSeek’s MoE demonstrated for the first time a third solution, namely dynamic load balancing.
The trick is to add a bias term b to each expert’s router score prior to taking the top-k. If an expert is “overloaded” (i.e. receiving more than its fair share of the tokens, averaged across experts), we reduce that expert’s bias by 𝛾, making it less likely to be selected by the router. Conversely, if the expert is underloaded, we increase its bias by 𝛾, making it more likely to be selected. As training progresses, expert loads are thus steadily pushed toward balance, with the speed of convergence controlled by the size of the update 𝛾.
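A minimal sketch of the mechanism in NumPy (the expert count, k, and 𝛾 are made-up values; as in DeepSeek-V3, the bias only affects which experts get selected, not their gating weights):

```python
import numpy as np

n_experts, k, gamma = 8, 2, 0.001        # hypothetical sizes; gamma is the bias update speed
bias = np.zeros(n_experts)               # one bias term b_i per routed expert

def route(scores, bias):
    """Select the top-k experts for one token. The bias is added only for
    selection; the gating weights still use the raw sigmoid scores."""
    top = np.argsort(scores + bias)[-k:]
    return top, scores[top]

def update_bias(bias, token_counts, total_tokens):
    """After each batch, nudge overloaded experts down and underloaded experts
    up by gamma, steering future routing toward balanced loads."""
    fair_share = total_tokens * k / n_experts          # each token is routed to k experts
    return bias - gamma * np.sign(token_counts - fair_share)

# Per training step: route every token, count how many each expert received,
# then call update_bias(bias, counts, n_tokens) before the next step.
```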
3 - Sequence-wise balancing. Unlike other MoE models, DeepSeekMoE adds a novel auxiliary loss term that encourages expert balance not just across the entire batch but also, at a finer granularity, within each individual token sequence in the batch. For example, given a sequence of 100 tokens and a pool of 4 routed experts with k=1, ideally we want each expert to be activated for 25 of the 100 tokens.
Technically, the loss simply sums, over the routed experts, the product of each expert’s routing score averaged over the sequence (Pi) and the normalized fraction of the sequence’s tokens for which the expert was selected (fi):
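(Roughly, in symbols, with $N_r$ the number of routed experts and $\alpha$ a small weighting constant:)

$$
\mathcal{L}_{\text{bal}} = \alpha \sum_{i=1}^{N_r} f_i \, P_i
$$

where $P_i$ is expert $i$’s routing score averaged over the sequence and $f_i$ is the fraction of the sequence’s tokens routed to expert $i$, scaled so that a perfectly balanced layer gives $f_i = 1$.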
This loss is then accumulated over all sequences seen during model training.
Multi-token prediction
In standard self-supervised training we simply predict the (single) next token in the sequence. In contrast, DeepSeek predicts the next N tokens in the sequence. Technically, these additional predictions are computed by adding extra Transformer modules to the model, each with its own independent parameters, except for the embedding layer and the output head, which are shared with the main model.
The authors call these additional modules MTP (multi-token prediction) modules, where MTP-1 predicts the next-next token, MTP-2 the next-next-next token, and so on. Importantly, the input to each MTP module always includes the output of the previous MTP module (or of the main model, for MTP-1), so that we preserve the causal chain of the generated sequence.
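Below is a hedged sketch of how one such module might be wired up (PyTorch; the class name, the LayerNorm stand-in for DeepSeek’s RMSNorm, and the single generic encoder layer per module are simplifications for illustration, not DeepSeek’s actual implementation):

```python
import torch
import torch.nn as nn

class MTPModule(nn.Module):
    """Sketch of one extra prediction module. It fuses the previous module's
    hidden states with the (shared) embeddings of the tokens one step further
    ahead, preserving the causal chain; the embedding layer and output head
    live in the main model and are therefore not defined here."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm_h = nn.LayerNorm(d_model)
        self.norm_e = nn.LayerNorm(d_model)
        self.merge = nn.Linear(2 * d_model, d_model)      # fuse the two streams
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, h_prev: torch.Tensor, shifted_emb: torch.Tensor,
                causal_mask: torch.Tensor | None = None) -> torch.Tensor:
        # h_prev:      hidden states from the main model or the previous MTP module
        # shifted_emb: shared token embeddings, shifted one position further ahead
        x = self.merge(torch.cat([self.norm_h(h_prev), self.norm_e(shifted_emb)], dim=-1))
        return self.block(x, src_mask=causal_mask)        # fed to the shared head and to the next module
```

Chaining several of these modules, each feeding its output to the next, yields the structure described above; the shared output head turns each module’s hidden states into its extra-token predictions.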
During training, we then take the average over all of the MTP modules’ losses and add it to the main loss with a weighting factor, which helps the performance of the main model. During inference, we can in principle drop the additional MTP modules, as they’re no longer needed — after all, we only need to predict one token at a time. In practice, DeepSeek-V3 keeps the first MTP module for speculative decoding, which increases decoding speed by 1.8x.
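(In symbols, with $D$ MTP modules, $\mathcal{L}_k^{\text{MTP}}$ the cross-entropy loss of the $k$-th module, and $\lambda$ the weighting factor mentioned above, the training objective becomes roughly:)

$$
\mathcal{L} = \mathcal{L}_{\text{main}} + \frac{\lambda}{D} \sum_{k=1}^{D} \mathcal{L}_k^{\text{MTP}}
$$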
Coda
DeepSeek-V3 represents a significant leap in open-source AI, demonstrating that cutting-edge innovation isn’t confined to closed ecosystems. By integrating multi-head latent attention, hybrid Mixture of Experts routing with auxiliary-loss-free load balancing, and multi-token prediction, DeepSeek has pushed the boundaries of what is possible in large-scale language modeling, all while staying relatively economical compared to competitors.
These innovations not only improve model performance but also set new standards for how future LLMs can be trained and optimized. DeepSeek-V3 is not just a milestone — it’s a sign of what’s to come.
In the words of the authors,
DeepSeek consistently adheres to the route of open-source models with longtermism, aiming to steadily approach the ultimate goal of AGI (Artificial General Intelligence).