Machine Learning Frontiers

Understanding DeepSeek-V3

Multi-head latent attention, DeepSeekMoE, and multi-token prediction

Samuel Flender
Feb 10, 2025
[Header image: a cybernetic DeepSeek whale racing through the cosmos, trailing neon light.]

DeepSeek has, without a doubt, shaken up the AI world. DeepSeek-V3 is a large MoE model with 671B total parameters (37B activated per token), trained on 14.8T tokens. It is arguably the strongest open-source LLM available today, achieving performance on par with closed-source competitors like GPT-4o and Claude-3.5 Sonnet while remaining relatively economical, requiring fewer than 3M H800 GPU hours for its full training run.

The model's success can be attributed to a combination of innovations in model architecture, training infrastructure, and the training protocol. That is too much ground to cover in a single blog post, so here we'll focus on the architectural innovations: multi-head latent attention, DeepSeekMoE (DeepSeek's Mixture-of-Experts implementation), and multi-token prediction. Buckle up.
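As a quick taste of the first of these ideas before we dig in, here is a minimal, simplified sketch of the low-rank key/value compression trick at the heart of multi-head latent attention: keys and values are projected down into a small shared latent, and only that latent needs to be cached. The dimensions, layer names, and class name below are illustrative assumptions, not DeepSeek-V3's actual configuration, and details such as RoPE handling and the decoupled query path are omitted.

```python
# Illustrative sketch of low-rank KV compression (not DeepSeek's exact MLA).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleLatentAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_latent: int = 64):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)         # queries projected as usual
        self.w_down_kv = nn.Linear(d_model, d_latent)  # compress K/V into a small latent
        self.w_up_k = nn.Linear(d_latent, d_model)     # reconstruct keys from the latent
        self.w_up_v = nn.Linear(d_latent, d_model)     # reconstruct values from the latent
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.w_q(x)
        latent_kv = self.w_down_kv(x)  # only this (b, t, d_latent) tensor would be cached
        k = self.w_up_k(latent_kv)
        v = self.w_up_v(latent_kv)

        def split_heads(z: torch.Tensor) -> torch.Tensor:
            return z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        out = F.scaled_dot_product_attention(split_heads(q), split_heads(k), split_heads(v))
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.w_o(out)

x = torch.randn(2, 16, 512)              # (batch, sequence length, model dim)
print(SimpleLatentAttention()(x).shape)  # torch.Size([2, 16, 512])
```

The point of the sketch: at inference time only the small latent tensor has to live in the KV cache, which is what shrinks memory relative to caching full per-head keys and values.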
