What is MoE? The sparse expert trick behind DeepSeek and Mixtral

DeepSeek V3 has 671 billion parameters. GPT-5.5 is estimated at around 100 billion. Yet DeepSeek's API price is a fraction of GPT-5.5's, and on many benchmarks the quality gap is surprisingly narrow. The explanation isn't magic or subsidy: it's architecture. DeepSeek is a Mixture of Experts model. Mixtral is another. Once you understand what that means, the pricing makes immediate sense, and so do the limitations.

What MoE actually is

Every standard transformer layer contains a feedforward network (FFN) that processes each token after the attention step. In a dense model, that FFN is one large set of weights. The same weights fire for every token, whether you're translating Python to JavaScript or writing a haiku.

MoE replaces that single FFN with multiple smaller FFNs, each called an expert. A learned routing function (a small neural network attached to each layer) reads each incoming token and selects a handful of experts to handle it. The other experts receive nothing.

In Mixtral 8x7B, each layer has 8 experts and the router picks 2 of them per token. In DeepSeek V3, each layer has 256 routed experts and the router picks 8. Both approaches share the same basic principle: you compute far less than you store.
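To make the routing step concrete, here is a minimal sketch in plain Python. It assumes a softmax over the selected scores to weight the chosen experts, which is one common choice rather than the exact gating of any particular model. The toy experts, the router scores, and the names top_k_route and moe_forward are all hypothetical.

```python
import math

def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Rank experts by router score and keep the top k.
    ranked = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = ranked[:k]
    # Softmax over the selected scores only, so the k weights sum to 1.
    peak = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(token, experts, gate_logits, k):
    """Weighted sum of the k selected experts' outputs; unselected experts never run."""
    output = [0.0] * len(token)
    for idx, weight in top_k_route(gate_logits, k):
        expert_out = experts[idx](token)
        output = [o + weight * x for o, x in zip(output, expert_out)]
    return output

# Toy layer: 8 "experts" that each scale the token by a different factor (1..8).
experts = [lambda t, s=s: [s * x for x in t] for s in range(1, 9)]
# Hypothetical router scores for one token; a real router computes these from the token.
gate_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 1.0]

routes = top_k_route(gate_logits, k=2)
result = moe_forward([1.0, 2.0], experts, gate_logits, k=2)
print("selected experts:", [i for i, _ in routes])
print("output:", result)
```

Only two of the eight expert functions are ever called for this token, which is the whole point: the other six sit in memory and cost no arithmetic.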

Why MoE models cost less to run

Inference cost is dominated by two factors: how many floating point operations (FLOPs) the model performs per token, and how much memory bandwidth it consumes reading weights.

For dense models, both scale with parameter count. A 100B dense model performs roughly 200B FLOPs per token (about two per parameter) and reads ~100GB of weights per forward pass at FP8, or ~200GB at FP16.

For MoE models, FLOPs scale with active parameters, not total parameters. DeepSeek V3's 671B total parameters activate only 37B per token. That's roughly the FLOP count of a 37B dense model. The 671B figure reflects total capacity (the knowledge distributed across all experts), but each token only taxes 37B of it.
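The arithmetic is short enough to check by hand. The sketch below assumes the usual rule of thumb of about 2 FLOPs per active parameter per token, and it ignores attention over the context window, so treat the outputs as rough estimates rather than measurements.

```python
B = 10**9  # one billion

def flops_per_token(active_params):
    # Rule of thumb (an assumption here): ~2 FLOPs per active parameter per token.
    return 2 * active_params

dense_flops = flops_per_token(100 * B)  # 100B dense model: every parameter is active
moe_flops = flops_per_token(37 * B)     # DeepSeek V3: 37B active out of 671B total
moe_resident_bytes = 671 * B * 1        # all 671B weights stay in memory (FP8 = 1 byte each)

print(f"Dense 100B : {dense_flops // B} GFLOPs per token")
print(f"DeepSeek V3: {moe_flops // B} GFLOPs per token")
print(f"Compute ratio (MoE / dense): {moe_flops / dense_flops:.2f}")
print(f"DeepSeek V3 weights held in memory at FP8: {moe_resident_bytes // B} GB")
```

The compute per token tracks the 37B figure, while the memory footprint tracks the 671B figure. Both numbers matter, just for different line items on the bill.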

The result is that you get the expressive capacity of a very large model at the inference cost of a much smaller one. That's why DeepSeek can charge substantially less per token than a similarly capable dense model, and why the math holds even when the parameter counts are so far apart.

DeepSeek's fine-grained approach

DeepSeek's MoE design differs from Mixtral's in two important ways: scale and granularity.

Scale is straightforward: 256 routed experts versus Mixtral's 8. With more experts, each one sees a narrower slice of the input distribution during training and can specialize more tightly.

Granularity refers to expert size. Rather than a few large experts, DeepSeek uses many small ones. With 256 routed experts and 8 active, the ratio of active to total experts is 3.1%, far sparser than Mixtral's 25% (2 of 8). The practical effect is that the routing decision becomes more fine-grained, and the risk of any single expert becoming a bottleneck decreases.

DeepSeek V3 also adds one shared expert per layer that always fires, in addition to the 8 routed ones. This shared expert handles tokens that need generalist processing regardless of routing, which stabilizes quality on common patterns and gives the routed experts room to specialize on rarer ones.
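The sparsity figures above are easy to reproduce. The sketch below also counts how many distinct expert combinations a router can choose per token, which is one rough way to put a number on "fine-grained"; that framing is an illustration added here, not a metric either vendor reports.

```python
import math

def active_fraction(active, total):
    """Share of a layer's experts that run for a single token."""
    return active / total

mixtral = active_fraction(2, 8)                         # 2 of 8 experts
deepseek_routed = active_fraction(8, 256)               # 8 of 256 routed experts
deepseek_with_shared = active_fraction(8 + 1, 256 + 1)  # counting the always-on shared expert

# Number of distinct expert subsets the router can pick for one token.
mixtral_combos = math.comb(8, 2)
deepseek_combos = math.comb(256, 8)

print(f"Mixtral 8x7B active experts:      {mixtral:.1%}")
print(f"DeepSeek V3 routed only:          {deepseek_routed:.1%}")
print(f"DeepSeek V3 incl. shared expert:  {deepseek_with_shared:.1%}")
print(f"Possible expert subsets per token: {mixtral_combos} vs {deepseek_combos}")
```

Counting the shared expert moves the DeepSeek ratio from about 3.1% to about 3.5%, so both figures are correct depending on which experts you include.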

The combination explains how DeepSeek V3 achieves competitive benchmark scores against models with far higher per-token compute budgets. To see exactly where that quality advantage shows up in practice, our head-to-head benchmark against Llama 4 Maverick puts both MoE models through 15 real coding and reasoning tasks with full judge reasoning. For context on where DeepSeek's free API access sits relative to other options, our comparison of free LLMs in 2026 covers rate limits and privacy trade-offs for each tier.

Mixtral's simpler trade-off

Mixtral 8x7B, released by Mistral AI under an Apache 2.0 license in late 2023, uses a more conservative structure: 8 experts per layer, 2 active per token. Total parameter count is about 46.7 billion; active parameters per token are about 12.9 billion.

The design choice reflects a different constraint. Mixtral was built to fit and run efficiently on two 80GB A100s, making it practical for teams with existing hardware and self-hosting requirements. DeepSeek's architecture requires multi-node serving infrastructure; Mixtral does not.

Mixtral's approach also involves less routing complexity. With 8 experts and a top-2 router, the gating function is straightforward and the load balancing problem is tractable. Keeping training stable with 256 experts requires specific techniques like auxiliary balancing losses and expert grouping. DeepSeek made these work, but they add engineering overhead that Mistral's smaller expert count simply doesn't need.
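For a sense of what an auxiliary balancing loss looks like, here is a sketch of the formulation popularized by the Switch Transformer, simplified to top-1 routing. It is an illustration of the general idea, not the training recipe of DeepSeek or Mistral; the function name and the toy inputs are hypothetical.

```python
def load_balance_loss(router_probs, assignments, num_experts):
    """Switch-Transformer-style auxiliary loss: N * sum_i(f_i * P_i).

    router_probs: one list of router probabilities (over experts) per token.
    assignments:  the expert index each token was actually sent to (top-1).
    """
    num_tokens = len(assignments)
    loss = 0.0
    for i in range(num_experts):
        # f_i: fraction of tokens dispatched to expert i.
        f_i = sum(1 for a in assignments if a == i) / num_tokens
        # P_i: mean router probability assigned to expert i.
        p_i = sum(p[i] for p in router_probs) / num_tokens
        loss += f_i * p_i
    return num_experts * loss

# Four tokens spread evenly over four experts: the loss sits at its minimum of 1.0.
balanced = load_balance_loss(
    [[0.25, 0.25, 0.25, 0.25] for _ in range(4)], [0, 1, 2, 3], num_experts=4
)
# Four tokens all sent to expert 0: the loss rises, penalizing the collapse.
collapsed = load_balance_loss(
    [[0.97, 0.01, 0.01, 0.01] for _ in range(4)], [0, 0, 0, 0], num_experts=4
)
print(f"balanced:  {balanced:.2f}")
print(f"collapsed: {collapsed:.2f}")
```

Added to the main training loss with a small coefficient, a term like this nudges the router away from sending everything to a few favorite experts. The more experts there are, the more ways routing can collapse, which is why the problem gets harder at 256 than at 8.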

When MoE gets expensive

MoE's cost advantage depends on a specific condition: your compute is the bottleneck, not your memory.

When you serve many users simultaneously, each request occupies GPU compute for a brief window and GPU memory the entire time. At sufficient batch size, the active-parameters advantage is real: you're doing 37B worth of arithmetic while accessing 671B worth of weights. The arithmetic savings dominate.

At very small batch sizes (single-digit requests in flight), the equation shifts. You still load all 671B parameters into VRAM, and you still pay the memory bandwidth cost of reading the selected experts' weights at every layer. But with so few tokens in flight, those fixed costs are spread over very little work, so the arithmetic savings count for less. For providers running large shared fleets, this is rarely the bottleneck; batch sizes are naturally high. For teams self-hosting a MoE model to serve a low-traffic internal tool, the per-request cost can exceed that of a well-chosen dense model.

VRAM is the other catch. Running Mixtral 8x7B at FP16 requires about 93GB for the weights alone (46.7B parameters at 2 bytes each), so two 80GB A100s at minimum. DeepSeek V3 requires multiple high-memory nodes. You load the full parameter count, even though only a fraction fires per token. If you're comparing "MoE model at $X per call on a hosted API" versus "dense model self-hosted on cheaper hardware," the hosted API price already reflects efficient batch utilization. Self-hosting a MoE model on underutilized hardware typically costs more per call, not less.
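A back-of-the-envelope sizing check makes the VRAM point concrete. The sketch below counts weights only and assumes that about 90% of each GPU's memory is usable for them; that headroom figure is a hypothetical placeholder, and real deployments also need room for the KV cache and activations.

```python
import math

def weight_gb(params_billion, bits_per_param):
    """Memory for the weights alone, in GB (1B parameters at 8 bits = 1 GB)."""
    return params_billion * bits_per_param / 8

def gpus_needed(weights_gb, gpu_gb, usable_fraction=0.9):
    # usable_fraction is an assumed headroom factor, not a measured value.
    return math.ceil(weights_gb / (gpu_gb * usable_fraction))

mixtral_fp16 = weight_gb(46.7, 16)
mixtral_int8 = weight_gb(46.7, 8)
mixtral_int4 = weight_gb(46.7, 4)
deepseek_fp8 = weight_gb(671, 8)

print(f"Mixtral 8x7B FP16:  {mixtral_fp16:.1f} GB -> {gpus_needed(mixtral_fp16, 80)} x 80GB GPUs")
print(f"Mixtral 8x7B 8-bit: {mixtral_int8:.1f} GB")
print(f"Mixtral 8x7B 4-bit: {mixtral_int4:.1f} GB")
print(f"DeepSeek V3 FP8:    {deepseek_fp8:.1f} GB -> {gpus_needed(deepseek_fp8, 80)} x 80GB GPUs")
```

Under these assumptions, DeepSeek V3's weights alone need more 80GB GPUs than a typical 8-GPU server holds, which is where the multi-node requirement comes from.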

This is one of the hidden variables in LLM cost comparisons. Our article "The three LLM costs nobody talks about" covers the others: thinking tokens, JSON retries, and prompt bloat, all of which quietly inflate bills beyond the per-token rate.

For developers building on top of these models through an API, MoE is an implementation detail. You pay the hosted rate, the provider absorbs the infrastructure trade-offs, and you get quality at a price that reflects efficient large-scale serving. If you want to run your own benchmarks across MoE and dense models to see how they compare on your specific workload, the LLMTest proxy routes to both from a single endpoint.

FAQ

What does "Mixture of Experts" mean in plain terms?

Instead of one large neural network processing every token, an MoE model has many smaller networks (the experts) and a router that decides which ones handle each token. Only a small subset do any work for each token, which reduces compute without reducing the total capacity of the model.

How many experts are active at once in models like DeepSeek and Mixtral?

Mixtral 8x7B activates 2 of its 8 experts per token (25% of total experts). DeepSeek V3 activates 8 of its 256 routed experts plus 1 shared expert per token (roughly 3.5% of total experts). Fewer active experts means fewer FLOPs per token, which is the source of the inference cost advantage.

Why is DeepSeek so much cheaper than GPT-5.5 if it has more parameters?

Parameter count and inference cost are different things for MoE models. DeepSeek V3 has 671B total parameters but activates only 37B per token. Assuming GPT-5.5 is a dense model at the estimated ~100B parameters, all of its parameters contribute to every token. You're comparing the active compute of 37B (DeepSeek) versus the active compute of ~100B (estimated GPT-5.5). That gap in active compute accounts for much of the gap in inference cost.

Does MoE affect quality on tasks outside each expert's specialty?

Potentially, yes. If a token's characteristics don't closely match any expert's specialization, the routing function may assign it to experts that aren't well-suited to the task. Well-trained MoE models mitigate this with load balancing during training to ensure all experts develop competency on common patterns, and with shared experts that always fire as a baseline. In practice, frontier MoE models like DeepSeek V3 show no obvious quality cliffs on diverse tasks, but niche or unusual inputs are where routing uncertainty is highest.

Can I run a MoE model on a single GPU?

It depends on the model and the GPU. Mixtral 8x7B at 4-bit quantization (about 23GB of weights) fits on a single 48GB GPU. At 8-bit, the weights alone take about 47GB, which leaves almost no headroom on two 24GB consumer GPUs once the KV cache and activations are counted. DeepSeek V3 requires multi-GPU, multi-node infrastructure at any quantization that preserves quality. Smaller MoE models like Mixtral 8x7B are the practical choice for single-machine self-hosting.

Are all cheap LLMs MoE models?

No. Some cheap models are simply smaller dense models with fewer total parameters. GPT-4o-mini and Claude Haiku 4.5 are dense models at lower parameter counts. The difference matters for scaling: a dense model's quality scales predictably with parameter count; a MoE model's quality scales with total capacity (all experts) while cost scales with active parameters. Understanding which type you're using helps predict where each model will hit quality ceilings relative to its price. Our model selection guide covers how to use cost, latency, and quality dimensions together to pick the right architecture for a given workload.
