DeepSeek has, without a doubt, shaken up the AI world. DeepSeek-V3 — a large MoE model with 671B total parameters and 37B activated parameters trained on 14.8T tokens — is the strongest open-source LLM in the industry today, achieving performance on par with closed-source competitors like GPT-4o and Claude-3.5 Sonnet, all while staying relatively economical, requiring less than 3M H800 GPU hours for its full training.
The success of the model can be attributed to a combination of innovations in model architecture, training infrastructure, and the training protocol. This is a lot of content to cover, too much for a single blog post, so here we’ll focus just on the architectural innovations, in particular multi-head latent attention, DeepSeekMoE (DeepSeek’s Mixture of Experts implementation), and multi-token prediction. Buckle up.
Multi-head latent attention
As a reminder, attention is simply a formula for assigning relevance scores to all tokens in a sequence given a new “query” token, as follows:
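(Reproduced here in its standard scaled dot-product form, with $d_k$ the key dimension:)

$$
\mathrm{Attention}(q, K, V) = \mathrm{softmax}\!\left(\frac{q K^{\top}}{\sqrt{d_k}}\right) V
$$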
where q is the query vector of the new token, and K and V stack the key and value vectors representing the content of the sequence, produced by the key and value projection matrices, respectively.
When a language model generates text using multi-head attention, it predicts one token at a time. Naively, the computational complexity of each step would be O(N^2), because we’d need to re-compute the keys, values, and attention scores for all tokens in the sequence for every new query. In practice, though, we can reduce this to O(N) per step by simply storing the keys and values in GPU memory and reusing them, a trick known as “KV caching”.
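A minimal sketch of the idea, in plain NumPy for a single attention head (the shapes and names are illustrative, not DeepSeek’s code):

```python
import numpy as np

d = 64                                   # head dimension (hypothetical)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

K_cache, V_cache = [], []                # grows by one entry per generated token

def attend(h_new):
    """One decoding step with KV caching: only the new token's key and
    value are computed; all previous ones are reused from the cache."""
    K_cache.append(W_k @ h_new)          # O(1) work for the new key
    V_cache.append(W_v @ h_new)          # O(1) work for the new value
    q = W_q @ h_new
    K, V = np.stack(K_cache), np.stack(V_cache)
    scores = K @ q / np.sqrt(d)          # O(N) dot products against the cache
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over the N cached positions
    return weights @ V                   # attention output for the new token

# Autoregressive decoding: feed hidden states one token at a time.
for _ in range(5):
    out = attend(rng.standard_normal(d))
```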
The key idea behind DeepSeek’s multi-head latent attention (MLA) is to apply low-rank compression to the keys and values prior to caching, hence reducing the memory footprint, which in turn allows the model to scale to longer sequences more efficiently. Technically, the compression is done by introducing additional matrices:
a down-projection matrix W^{DKV} for keys and values,
an up-projection matrix W^{UK} for keys,
another up-projection matrix W^{UV} for values.
Here is how these matrices are used to compute the key and value embeddings during inference:
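(A sketch using the matrix names above, with $h_t$ the attention-layer input for token $t$; MLA’s small decoupled rotary-position key path is omitted for brevity:)

$$
c_t^{KV} = W^{DKV} h_t, \qquad k_t = W^{UK} c_t^{KV}, \qquad v_t = W^{UV} c_t^{KV}
$$

Only the low-dimensional latent $c_t^{KV}$ has to be kept in the cache; the keys and values can be re-expanded from it on the fly, and the up-projections can even be absorbed into the query and output projections.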
How much memory does this save? Assuming 128 attention heads with a dimension of 128 each, and a latent (compressed) dimension of 512, the compression factor is 128x128/512 = 32. In other words, with the same amount of memory, MLA can fit the KV cache for 32x longer sequences than standard (uncompressed) multi-head attention.
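The arithmetic, spelled out (counting only one of the key/value streams, as the paragraph above does):

```python
n_heads, head_dim, latent_dim = 128, 128, 512      # numbers from the paragraph above

per_token_keys   = n_heads * head_dim              # 16,384 cached values per token (keys alone)
per_token_latent = latent_dim                      # 512 cached values per token with MLA

print(per_token_keys // per_token_latent)          # 32 -> the compression factor quoted above
```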
DeepSeekMoE
DeepSeek V3 is a Mixture of Experts (MoE) model, which means that instead of a single FFN layer following the attention block, it has a large number of FFN “experts”, each specializing in different kinds of input tokens. Thanks to hard routing, i.e. only activating the top k experts for each input token, this allows us to scale up model capacity while keeping the compute per token roughly constant, by parallelizing the experts across the training cluster.
This in and of itself is nothing new, and we’ve covered MoE-style LLMs at length in our discussions of the Switch Transformer, MegaBlocks, BASE, and Mixtral of Experts. However, DeepSeek introduces three notable novelties that make their MoE model stand out from the rest of the industry today:
1 - Hybrid routing strategy. DeepSeek uses a hybrid of soft routing and hard routing. As a reminder, in soft routing we compute the weighted sum over all expert outputs, whereas in hard routing we limit the sum to the top k experts with the highest routing scores. In the hybrid version, we have a combination of shared experts and a pool of routed experts, of which only the top k are activated for each input token. Then, the output of the MoE layer is a weighted sum over the shared and routed experts, where the shared experts’ weights are 1 and the routed experts’ weights are the router scores, as follows:
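(In symbols, writing $u_t$ for the token’s hidden state, $N_s$ and $N_r$ for the numbers of shared and routed experts, and $g_{i,t}$ for routed expert $i$’s router score, set to zero whenever the expert is not in the top $k$ for token $t$:)

$$
h_t' = u_t + \sum_{i=1}^{N_s} \mathrm{FFN}_i^{(s)}(u_t) + \sum_{i=1}^{N_r} g_{i,t}\,\mathrm{FFN}_i^{(r)}(u_t)
$$

(DeepSeek-V3 additionally normalizes the selected routed experts’ scores so they sum to one, a detail glossed over here.)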
Unlike standard MoE implementations, DeepSeek uses a per-expert Sigmoid instead of a Softmax to compute the router scores. This decouples the router scores from each other, which is important for the next trick, dynamic load balancing.
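A toy illustration of the decoupling (hypothetical router logits, not DeepSeek’s actual numbers):

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.5])                 # router logits for three experts (made up)

softmax = np.exp(logits) / np.exp(logits).sum()
sigmoid = 1 / (1 + np.exp(-logits))

# Nudge only expert 0's logit:
nudged = logits + np.array([-0.5, 0.0, 0.0])

print(np.exp(nudged) / np.exp(nudged).sum())       # all three softmax scores shift: they are coupled
print(1 / (1 + np.exp(-nudged)))                   # only expert 0's sigmoid score shifts: independent
```

With a softmax, any change to one expert’s logit redistributes probability across all experts; with per-expert sigmoids, each score can be adjusted in isolation, which is what the bias-based balancing below relies on.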
2 - Dynamic load balancing. Load balancing — making sure all experts and hence all GPUs inside the training cluster receive the same number of tokens during training — has been one of the most difficult challenges in sparse MoEs. So far, the status quo has been to introduce either load balancing losses (e.g. Switch Transformer) or customized compilers (e.g. MegaBlocks). DeepSeek’s MoE demonstrated for the first time a third solution, namely dynamic load balancing.
The trick is to add a bias term b to each expert’s router score prior to taking the top-k. If an expert is “overloaded” (i.e. receiving more than its fair share of the tokens, averaged across experts), we reduce that expert’s bias by 𝛾, making it less likely to be selected by the router. Conversely, if the expert is underloaded, we increase its bias by 𝛾, making it more likely to be selected. As training progresses, expert loads are thus steadily pushed toward balance, with the speed of convergence controlled by the size of the update 𝛾.
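A minimal sketch of the mechanism in NumPy (the expert count, k, and 𝛾 are made-up values; as in DeepSeek-V3, the bias only affects which experts get selected, not their gating weights):

```python
import numpy as np

n_experts, k, gamma = 8, 2, 0.001        # hypothetical sizes; gamma is the bias update speed
bias = np.zeros(n_experts)               # one bias term b_i per routed expert

def route(scores, bias):
    """Select the top-k experts for one token. The bias is added only for
    selection; the gating weights still use the raw sigmoid scores."""
    top = np.argsort(scores + bias)[-k:]
    return top, scores[top]

def update_bias(bias, token_counts, total_tokens):
    """After each batch, nudge overloaded experts down and underloaded experts
    up by gamma, steering future routing toward balanced loads."""
    fair_share = total_tokens * k / n_experts          # each token is routed to k experts
    return bias - gamma * np.sign(token_counts - fair_share)

# Per training step: route every token, count how many each expert received,
# then call update_bias(bias, counts, n_tokens) before the next step.
```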
3 - Sequence-wise balancing. Unlike other MoE models, DeepSeekMoE adds a novel auxiliary loss term that encourages expert balance not just across the entire batch but also, at a finer granularity, within each individual token sequence in the batch. For example, given a sequence of 100 tokens and a pool of 4 routed experts with k=1, ideally we want each expert to be activated for 25 of the 100 tokens.
Technically, the loss simply sums, over the routed experts, the product of each expert’s routing score averaged over the sequence (Pi) and the normalized fraction of the sequence’s tokens for which the expert was selected (fi):
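(Roughly, in symbols, with $N_r$ the number of routed experts and $\alpha$ a small weighting constant:)

$$
\mathcal{L}_{\text{bal}} = \alpha \sum_{i=1}^{N_r} f_i \, P_i
$$

where $P_i$ is expert $i$’s routing score averaged over the sequence and $f_i$ is the fraction of the sequence’s tokens routed to expert $i$, scaled so that a perfectly balanced layer gives $f_i = 1$.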
This loss is then accumulated over all sequences seen during model training.
Multi-token prediction
In standard self-supervised training we simply predict the (single) next token in the sequence. In contrast, DeepSeek predicts the next N tokens in the sequence. Technically, these additional predictions are computed by adding extra Transformer modules to the model, each with its own independent parameters, except for the embedding layer and the output head, which are shared with the main model.
The authors call these additional modules MTP (multi-token prediction) modules, where MTP-1 predicts the next-next token, MTP-2 the next-next-next token, and so on. Importantly, the input to each MTP module always includes the output of the previous MTP module (or of the main model, for MTP-1), so that we preserve the causal chain of the generated sequence.
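Below is a hedged sketch of how one such module might be wired up (PyTorch; the class name, the LayerNorm stand-in for DeepSeek’s RMSNorm, and the single generic encoder layer per module are simplifications for illustration, not DeepSeek’s actual implementation):

```python
import torch
import torch.nn as nn

class MTPModule(nn.Module):
    """Sketch of one extra prediction module. It fuses the previous module's
    hidden states with the (shared) embeddings of the tokens one step further
    ahead, preserving the causal chain; the embedding layer and output head
    live in the main model and are therefore not defined here."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm_h = nn.LayerNorm(d_model)
        self.norm_e = nn.LayerNorm(d_model)
        self.merge = nn.Linear(2 * d_model, d_model)      # fuse the two streams
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, h_prev: torch.Tensor, shifted_emb: torch.Tensor,
                causal_mask: torch.Tensor | None = None) -> torch.Tensor:
        # h_prev:      hidden states from the main model or the previous MTP module
        # shifted_emb: shared token embeddings, shifted one position further ahead
        x = self.merge(torch.cat([self.norm_h(h_prev), self.norm_e(shifted_emb)], dim=-1))
        return self.block(x, src_mask=causal_mask)        # fed to the shared head and to the next module
```

Chaining several of these modules, each feeding its output to the next, yields the structure described above; the shared output head turns each module’s hidden states into its extra-token predictions.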
During training, we then take the average over all of the MTP modules’ losses and add it to the main loss with a weighting factor, which helps the performance of the main model. During inference, we can in principle drop the additional MTP modules, as they’re no longer needed — after all, we only need to predict one token at a time. In practice, DeepSeek-V3 keeps the first MTP module for speculative decoding, which increases decoding speed by 1.8x.
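(In symbols, with $D$ MTP modules, $\mathcal{L}_k^{\text{MTP}}$ the cross-entropy loss of the $k$-th module, and $\lambda$ the weighting factor mentioned above, the training objective becomes roughly:)

$$
\mathcal{L} = \mathcal{L}_{\text{main}} + \frac{\lambda}{D} \sum_{k=1}^{D} \mathcal{L}_k^{\text{MTP}}
$$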
Coda
DeepSeek-V3 represents a significant leap in open-source AI, demonstrating that cutting-edge innovation isn’t confined to closed ecosystems. By integrating multi-head latent attention, hybrid Mixture of Experts routing with auxiliary-loss-free load balancing, and multi-token prediction, DeepSeek has pushed the boundaries of what is possible in large-scale language modeling, all while staying relatively economical compared to competitors.
These innovations not only improve model performance but also set new standards for how future LLMs can be trained and optimized. DeepSeek-V3 is not just a milestone — it’s a sign of what’s to come.
In the words of the authors,
DeepSeek consistently adheres to the route of open-source models with longtermism, aiming to steadily approach the ultimate goal of AGI (Artificial General Intelligence).