
Mixture of Experts

An architecture that routes each input to a few specialized sub-networks, enabling massive parameter counts at manageable compute cost.

Key Facts
LLaMA 4-400B active params: ~40B per token
Efficiency gain: ~10× vs. dense equivalent
Expert count: 64 experts (LLaMA 4)
Router mechanism: top-2 of N expert selection
Pioneer paper: "Outrageously Large Neural Networks" (Shazeer et al., 2017)
Inference hardware: 2× A100 for a 400B MoE
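The ~10× efficiency figure above follows directly from the ratio of total to active parameters, since per-token compute scales with the parameters actually used. A quick illustrative check (the 400B/40B figures are taken from the facts above):

```python
# Per-token compute scales with *active* parameters, not total parameters.
total_params = 400e9   # full MoE parameter count
active_params = 40e9   # parameters actually used per token
speedup = total_params / active_params
print(speedup)  # 10.0 — roughly 10x less compute than a dense 400B model
```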

Mixture-of-Experts (MoE) is a neural network architecture in which the model comprises many "expert" sub-networks plus a learned "router" that decides which experts to activate for each input token. Because only a small fraction of experts is active for any given token, an MoE model is far cheaper to run than a dense model with the same total parameter count. LLaMA 4-400B uses MoE, activating roughly 40B of its 400B parameters per token. GPT-4 is widely believed to be an MoE model internally. MoE is now considered essential for building frontier-scale models economically.
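The routing mechanism described above can be sketched in a few lines. This is a minimal illustration, not any production implementation: the "experts" here are toy random linear maps, the dimensions are made up for the example, and the gating uses renormalized softmax over the top-2 logits, matching the "top-2 of N" selection in the key facts.

```python
import math
import random

random.seed(0)

DIM, NUM_EXPERTS, TOP_K = 4, 8, 2  # toy sizes for illustration only

# Toy "experts": each is a small random linear map (a DIM x DIM matrix).
experts = [[[random.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]
           for _ in range(NUM_EXPERTS)]
# Router: one weight vector per expert, producing a gating logit per token.
router = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(m, v):
    return [dot(row, v) for row in m]

def moe_forward(x):
    """Route token x to its top-2 experts and mix their outputs."""
    logits = [dot(w, x) for w in router]
    # Pick the TOP_K experts with the largest gating logits.
    top = sorted(range(NUM_EXPERTS), key=lambda i: logits[i], reverse=True)[:TOP_K]
    # Softmax over only the selected logits (renormalized top-k gating).
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    out = [0.0] * DIM
    for i in top:
        gate = exps[i] / z
        y = matvec(experts[i], x)  # only TOP_K of NUM_EXPERTS experts run
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out, top

output, chosen = moe_forward([1.0, -0.5, 0.3, 0.2])
print(len(chosen))  # 2 experts active out of 8
```

The key point the sketch makes concrete: the forward pass computes `matvec` for only 2 of the 8 experts, so compute per token is proportional to the active experts, while total capacity grows with the full expert count.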