DeepSeek has, without a doubt, shaken up the AI world. DeepSeek-V3 — a large MoE model with 671B total parameters and 37B activated parameters trained on 14.8T tokens — is the strongest open-source LLM in the industry today, achieving performance on par with closed-source competitors like GPT-4o and Claude-3.5 Sonnet, all while staying relatively economical, requiring less than 3M H800 GPU hours for its full training.
The success of the model can be attributed to a combination of innovations in model architecture, training infrastructure, and the training protocol. This is a lot of content to cover, too much for a single blog post, so here we’ll focus just on the architectural innovations, in particular multi-head latent attention, DeepSeekMoE (DeepSeek’s Mixture of Experts implementation), and multi-token prediction. Buckle up.
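Before diving in, here is a minimal sketch of why an MoE model's "activated" parameter count is so much smaller than its total parameter count: each token is routed to only a handful of experts, so only that handful of weight matrices participates in the forward pass. The dimensions, the router, and the single-matrix "experts" below are toy assumptions for illustration, not DeepSeek-V3's actual implementation.

```python
# Toy sketch (not DeepSeek's code): per-token top-k expert routing.
# All sizes and names here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

d_model, d_ff = 64, 256          # tiny dimensions for the sketch
num_experts, top_k = 16, 2       # each token is routed to only top_k experts

# One weight matrix per expert "FFN" (a real expert has more layers).
experts = [rng.standard_normal((d_model, d_ff)) * 0.02 for _ in range(num_experts)]
router = rng.standard_normal((d_model, num_experts)) * 0.02

def moe_layer(x):
    """Route a single token vector x to its top_k experts and mix their outputs."""
    scores = x @ router                      # affinity of the token to each expert
    top = np.argsort(scores)[-top_k:]        # indices of the top_k highest-scoring experts
    weights = np.exp(scores[top])
    weights /= weights.sum()                 # normalized gating weights
    # Only top_k of the num_experts weight matrices are used for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(d_model)
out = moe_layer(token)

total_params = num_experts * d_model * d_ff
active_params = top_k * d_model * d_ff
print(f"total expert params: {total_params}, activated per token: {active_params}")
```

Scaled up, this is the same effect behind the 671B-total / 37B-activated split: the model's capacity grows with the number of experts, but the compute per token grows only with the few experts each token actually visits.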