Vishal Pandey | Applied ML Research

Mixture of Experts (MoE) [Theory and Implementation]

In this post, we will learn about MoEs and Sparse MoEs; the two share the same core idea but differ in how they are used. You can use this for last-minute interview revision or to build the mathematical intuition from scratch.

This is not a new concept invented by DeepSeek. MoE goes back to the adaptive mixtures of local experts work co-authored by Prof. Geoffrey Hinton in the early 1990s, and was later used tactically by DeepSeek inside the FFN layers.

Paper Link: DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models

My Implementation from scratch: MoE_Implementation

Mixture of Experts Architecture

Figure: MoEs in large neural networks (mixture_of_experts_architecture)

Figure: MoE in DeepSeek (moe_in_deepseek)

Figure: MoE in the Switch Transformer (mow_in_switch_transformer)


The Mixture of Experts (MoE) is a neural network architecture that uses a set of expert networks to solve a task, while a gating network determines which experts to use for each input. Instead of running all experts on every input, the MoE model activates only a sparse subset of experts (usually a small number) to make predictions, improving efficiency and scalability.

It is divided into three parts: the expert networks, the gating network, and the combination of their outputs.

Important Formulas


1. The Core Intuition of MoE

At the simplest level:

Standard neural nets: every parameter is active for every input.

MoE: only a small subset of parameters (experts) are activated depending on the input.

Think of it as a committee of specialists. A question comes in → the "gating mechanism" picks which specialists should answer → the final output is the weighted combination of their answers.

This gives:

Huge parameter capacity while paying only a fraction of the compute per token.

Experts that specialize on different kinds of inputs.

2. Basic Math of MoE

Let’s formalize.

Suppose:

Input token representation: $x \in \mathbb{R}^d$

We have $E$ experts, each a function $f_i(x)$. For simplicity, think of each expert as an MLP layer.

The MoE layer output is:

$$y = \sum_{i=1}^{E} g_i(x) \cdot f_i(x)$$

where $g_i(x)$ is the gate score assigned to expert $i$.

Gating function:

$$g(x) = \mathrm{Softmax}(W_g x)$$

Here $W_g$ is a learnable matrix.

In sparse MoE (most popular in practice, e.g., Switch Transformer, GLaM):

We don’t use all experts.

Instead, we select the Top-K experts (say, 1 or 2) with the highest gate scores.

So effectively:

$$y = \sum_{i \in \mathrm{TopK}(g(x))} g_i(x) \cdot f_i(x)$$
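
To make this concrete, here is a minimal PyTorch sketch of a sparse Top-K MoE layer (names such as `SparseMoE`, `d_model`, `n_experts` are illustrative, not taken from any specific paper). It computes the gate logits, keeps only the Top-K experts per token, renormalizes their scores, and sums the weighted expert outputs; renormalizing over the selected experts is one common variant of the formula above.

```python
# Minimal sparse Top-K MoE layer -- a sketch, not a production implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each expert f_i is a small 2-layer MLP.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        # Gating network: logits W_g x; softmax is applied after the Top-K selection below.
        self.gate = nn.Linear(d_model, n_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model)
        logits = self.gate(x)                                # (n_tokens, n_experts)
        topk_logits, topk_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_logits, dim=-1)             # renormalized gate scores, (n_tokens, top_k)

        y = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx = topk_idx[:, slot]                          # chosen expert id per token
            w = weights[:, slot].unsqueeze(-1)               # its gate weight per token
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():                               # run the expert only on its tokens
                    y[mask] += w[mask] * expert(x[mask])
        return y

# Usage: 8 experts, each token routed to its Top-2 experts.
moe = SparseMoE(d_model=16, d_hidden=32, n_experts=8, top_k=2)
out = moe(torch.randn(4, 16))   # -> shape (4, 16)
```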

3. Intuition of Mathematics

The gating mechanism is like a router that decides which expert to call.

Training encourages experts to specialize in handling different types of tokens (e.g., numbers vs. code vs. natural language).

By using Top-K, we make the computation cheap: we don’t run all experts, just a few.

Efficiency gain:

Suppose E=64 experts, each with 50M parameters → 3.2B parameters total.

If we use Top-2 experts per input, the effective compute per token is ~100M parameters (instead of 3.2B).

So you get the capacity of a giant model with the compute of a smaller one.
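
A quick back-of-the-envelope check of these numbers in plain Python (the 64 experts and 50M-parameters-per-expert figures are just the illustrative values above):

```python
# Capacity vs. per-token compute for a sparse MoE (illustrative numbers only).
E, params_per_expert, top_k = 64, 50_000_000, 2
total_params = E * params_per_expert         # parameters the model *stores*
active_params = top_k * params_per_expert    # parameters each token actually *uses*
print(f"total: {total_params / 1e9:.1f}B, active per token: {active_params / 1e6:.0f}M")
# -> total: 3.2B, active per token: 100M
```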


4. Architectures of MoE

(A) Shazeer et al. (2017) – Sparsely Gated MoE: introduced noisy Top-K gating between stacked LSTM layers, so that only a few experts run per token while the model as a whole scales to billions of parameters.

(B) Switch Transformer (Google, 2021): simplifies routing to Top-1, sending each token to exactly one expert to reduce compute and communication.

Equation:

$$y = f_{i^*}(x) \quad \text{where} \quad i^* = \arg\max_i \; g_i(x)$$
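
A minimal PyTorch sketch of this Top-1 routing (function and variable names are illustrative; the real Switch Transformer additionally uses a capacity factor and an auxiliary load-balancing loss, and it scales the chosen expert's output by its gate probability so the router still receives a gradient, which is included here):

```python
# Switch-style Top-1 routing -- a sketch under assumed shapes.
import torch
import torch.nn.functional as F

def switch_layer(x, gate_weight, experts):
    """x: (n_tokens, d_model); gate_weight: (n_experts, d_model); experts: list of modules."""
    probs = F.softmax(x @ gate_weight.t(), dim=-1)   # g(x): one distribution per token
    gate_val, expert_idx = probs.max(dim=-1)         # i* = argmax_i g_i(x)
    y = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        mask = expert_idx == e                       # tokens routed to expert e
        if mask.any():
            y[mask] = gate_val[mask].unsqueeze(-1) * expert(x[mask])
    return y
```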

5. Training Challenges & Solutions

Load Imbalance:

Some experts get all tokens, others are idle.

Solution: Add auxiliary loss to encourage balanced usage.

Example (a code sketch follows this list):

$$L_{\text{balance}} = \alpha \cdot \mathrm{Var}(\text{expert usage})$$

Routing Instability:

Gate may change suddenly → unstable learning.

Solution: Use noise, temperature scaling, or smooth routing.

Communication Overhead:

Sending tokens to multiple experts across GPUs is expensive.

Solution: Use all-to-all optimized kernels (e.g., DeepSpeed-MoE).
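
The sketch referenced under Load Imbalance above: a minimal PyTorch version of the variance-style balance loss (assumed shapes; production systems such as the Switch Transformer or DeepSpeed-MoE typically use a slightly different fraction-of-tokens times mean-probability formulation).

```python
# Variance-style auxiliary load-balancing loss -- a sketch.
import torch

def load_balance_loss(router_probs: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """router_probs: (n_tokens, n_experts) softmax outputs of the gate."""
    expert_usage = router_probs.mean(dim=0)   # average routing mass each expert receives
    return alpha * expert_usage.var()         # penalize uneven usage across experts
```

Adding this term to the main loss nudges the router toward spreading tokens across experts instead of collapsing onto a favourite few.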


6. Backpropagation Derivation (Single Token, Single MoE Layer)

Setup & Notation

For a single token $x \in \mathbb{R}^d$:

Router logits: $\ell = W_r x + b_r$, gated with temperature $\tau$.

Selected expert set: $S = \mathrm{TopK}(\ell)$, with combine weights $\alpha_i$ (a softmax over the selected logits, derived in Step 3).

Each expert is a 2-layer MLP: $a_i = W_{1,i}\, x$, $h_i = W_{2,i}\, \sigma(a_i)$.

Layer output: $y = \sum_{i \in S} \alpha_i h_i$; the upstream gradient is $g = \partial L / \partial y$.

Step 1: Gradients w.r.t. Expert Outputs and Combine Weights

Since $y = \sum_{i \in S} \alpha_i h_i$:

$$\frac{\partial L}{\partial h_i} = \alpha_i\, g, \qquad \frac{\partial L}{\partial \alpha_i} = g^\top h_i, \qquad i \in S$$

Step 2: Gradients Inside Each Expert

Let $u_i = \dfrac{\partial L}{\partial h_i} = \alpha_i\, g$.

Then, with $a_i = W_{1,i}\, x$ and $h_i = W_{2,i}\, \sigma(a_i)$:

$$\frac{\partial L}{\partial W_{2,i}} = u_i\, \sigma(a_i)^\top, \qquad \frac{\partial L}{\partial a_i} = (W_{2,i}^\top u_i) \odot \sigma'(a_i), \qquad \frac{\partial L}{\partial W_{1,i}} = \frac{\partial L}{\partial a_i}\, x^\top$$

Expert contribution to input gradient:

$$\frac{\partial L}{\partial x}\bigg|_{\text{expert } i} = W_{1,i}^\top \Big( \big(W_{2,i}^\top (\alpha_i\, g)\big) \odot \sigma'(a_i) \Big)$$

Only experts $i \in S$ receive/use these gradients.

Step 3: Gradients w.r.t. Masked Softmax Weights α

We treat α as a softmax over masked logits:

$$z_i = \begin{cases} \ell_i / \tau & i \in S \\ -\infty & i \notin S \end{cases}, \qquad \alpha_i = \frac{e^{z_i}}{\sum_{j \in S} e^{z_j}} \quad \text{for } i \in S$$

Softmax Jacobian (restricted to S):

$$\frac{\partial \alpha_i}{\partial z_j} = \alpha_i\, (\delta_{ij} - \alpha_j), \qquad i, j \in S$$

Chain rule to logits:

$$\frac{\partial \alpha_i}{\partial \ell_j} = \frac{1}{\tau}\, \alpha_i\, (\delta_{ij} - \alpha_j), \qquad j \in S$$

Then:

$$\begin{aligned}
\frac{\partial L}{\partial \ell_j} &= \sum_{i \in S} \frac{\partial L}{\partial \alpha_i}\, \frac{\partial \alpha_i}{\partial \ell_j} \\
&= \frac{1}{\tau} \sum_{i \in S} (g^\top h_i)\, \alpha_i (\delta_{ij} - \alpha_j) \\
&= \frac{1}{\tau} \Big( (g^\top h_j)\, \alpha_j - \alpha_j \sum_{i \in S} \alpha_i (g^\top h_i) \Big) \\
&= \frac{1}{\tau}\, \alpha_j\, (g^\top h_j - g^\top y) \\
&= \frac{1}{\tau}\, \alpha_j\, g^\top (h_j - y), \qquad j \in S
\end{aligned}$$

And $\dfrac{\partial L}{\partial \ell_j} = 0$ for $j \notin S$.
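
A quick numerical check of this closed form against PyTorch autograd on toy shapes (a sketch; the expert outputs $h_j$ are held fixed because this step only differentiates through the gate):

```python
# Verify dL/dl_j = (1/tau) * alpha_j * g^T (h_j - y) for j in S, and 0 otherwise.
import torch

torch.manual_seed(0)
E, d, k, tau = 6, 4, 2, 0.8
logits = torch.randn(E, requires_grad=True)
h = torch.randn(E, d)                      # expert outputs, treated as constants here
g = torch.randn(d)                         # upstream gradient dL/dy (take L = g . y)

S = torch.topk(logits, k).indices          # Top-K selection (discrete, no gradient)
alpha = torch.softmax(logits[S] / tau, dim=0)
y = (alpha.unsqueeze(-1) * h[S]).sum(0)
(g @ y).backward()

manual = torch.zeros(E)                    # stays zero off S
manual[S] = (alpha.detach() / tau) * (h[S] @ g - g @ y.detach())
print(torch.allclose(manual, logits.grad, atol=1e-6))   # expected: True
```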

Why Use δij Instead of Just 1?

Softmax outputs are coupled, not independent.

Each output depends on all the logits, so the derivative is a full Jacobian.

Using $\delta_{ij}$ captures both the diagonal ($i = j$) and off-diagonal ($i \neq j$) cases in one clean formula:

$$\frac{\partial \alpha_i}{\partial z_j} = \alpha_i\, (\delta_{ij} - \alpha_j)$$
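
A tiny check of this Jacobian with `torch.autograd.functional.jacobian` on a random logit vector (toy size, a sketch):

```python
# The softmax Jacobian equals diag(alpha) - alpha alpha^T, i.e. alpha_i * (delta_ij - alpha_j).
import torch

z = torch.randn(4)
alpha = torch.softmax(z, dim=0)
jac = torch.autograd.functional.jacobian(lambda t: torch.softmax(t, dim=0), z)
manual = torch.diag(alpha) - torch.outer(alpha, alpha)
print(torch.allclose(jac, manual, atol=1e-6))   # expected: True
```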

Step 4: Router Gradients

Given $\ell = W_r x + b_r$:

$$\frac{\partial L}{\partial W_r} = \frac{\partial L}{\partial \ell}\, x^\top, \qquad \frac{\partial L}{\partial b_r} = \frac{\partial L}{\partial \ell}, \qquad \frac{\partial L}{\partial x}\bigg|_{\text{router}} = W_r^\top\, \frac{\partial L}{\partial \ell}$$

Step 5: Total Input Gradient

Sum of expert and router contributions:

$$\frac{\partial L}{\partial x} = \sum_{i \in S} W_{1,i}^\top \Big( \big(W_{2,i}^\top (\alpha_i\, g)\big) \odot \sigma'(a_i) \Big) + W_r^\top\, \frac{\partial L}{\partial \ell}$$
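
Finally, a single-token check of this total input gradient against PyTorch autograd, assuming two-layer experts with $\sigma = \tanh$ and the toy shapes below (all names are illustrative):

```python
# Verify dL/dx = expert path (Step 2) + router path (Steps 3-4), for one token.
import torch

torch.manual_seed(0)
d, d_h, E, k, tau = 6, 9, 5, 2, 0.5

x = torch.randn(d, requires_grad=True)
W1 = torch.randn(E, d_h, d)    # first layer of each expert:  a_i = W1[i] @ x
W2 = torch.randn(E, d, d_h)    # second layer of each expert: h_i = W2[i] @ tanh(a_i)
Wr, br = torch.randn(E, d), torch.randn(E)
g = torch.randn(d)             # upstream gradient dL/dy (take L = g . y)

# Forward: route, gate with temperature tau over the selected logits, combine.
logits = Wr @ x + br
S = torch.topk(logits, k).indices                 # Top-K selection (discrete)
alpha = torch.softmax(logits[S] / tau, dim=0)
a = torch.stack([W1[i] @ x for i in S])           # (k, d_h) pre-activations
h = torch.stack([W2[i] @ torch.tanh(ai) for i, ai in zip(S, a)])  # (k, d)
y = (alpha.unsqueeze(-1) * h).sum(0)
(g @ y).backward()

# Manual gradient from the closed forms above.
with torch.no_grad():
    dL_dl = torch.zeros(E)
    dL_dl[S] = (alpha / tau) * (h @ g - g @ y)    # (1/tau) * alpha_j * g^T (h_j - y)
    dx = Wr.T @ dL_dl                             # router path: W_r^T dL/dl
    for slot, i in enumerate(S):
        u = alpha[slot] * g                       # dL/dh_i = alpha_i g
        dx += W1[i].T @ ((W2[i].T @ u) * (1 - torch.tanh(a[slot]) ** 2))

print(torch.allclose(dx, x.grad, atol=1e-4))      # expected: True
```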

7. What Interviewers Love to Probe in MoE Architectures

Who gets the gradient? (Only the selected experts in $S$ and the router; unselected experts receive nothing.)

Router–Input Coupling (the input gradient has both an expert path and a router path, as in Step 5.)

Top-1 vs Top-K Trade-Off (Top-1 is cheaper per token and simpler to route; Top-K gives smoother gradients and more effective capacity per token.)