Chapter 4 • 9 min read • Last reviewed: June 2026

Scaling Efficiency: MoE & Quantization

As AI models get bigger, running them gets wildly expensive. A dense 175-billion-parameter model takes about 350GB for its weights alone at 16-bit precision, so it needs multiple enterprise GPUs just to generate one token at a time. To make these models usable in products and on smaller hardware, engineers lean on two major efficiency plays: Mixture of Experts (MoE) and Quantization.

Mixture of Experts (MoE)

In a standard "Dense" model, every parameter activates for every word. That is wasteful; the model does not need its entire math brain for a comma or a pronoun.

An MoE architecture turns a dense model into a "Sparse" model by replacing some layers (usually the feed-forward blocks) with several parallel sub-networks called Experts. Instead of sending every token through every path, a small learned Gating Network (Router) decides which experts should handle each token.

Sparse routing in action

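Imagine a model with 8 separate experts. The specialist labels below are a teaching simplification; learned experts rarely split this cleanly by topic. When a token comes in: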

If the token is Python code, the router sends it to Expert 3 (Code specialist) and Expert 5 (Logic specialist).
If the token is French, the router sends it to Expert 1 (Translation specialist).

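In many MoE models, the router selects only the Top-2 experts for each token. If the model has 8 experts of roughly 7B parameters each, or 56B total parameters by a naive count, it might activate only about 12B per token. (In practice the total comes in somewhat below the naive 8 x 7B figure, because attention layers are shared across experts rather than duplicated.) You get huge total capacity with per-token compute closer to that of a much smaller model.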
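
The Top-2 selection described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not how any particular model implements it: the router scores in logits are made up, the "experts" are toy functions, and the names top_k_route and moe_layer are hypothetical. Expert indices here are zero-based.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    # Pick the k experts with the largest router scores.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Renormalize over the selected experts only, so their weights sum to 1.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; the other experts never run."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Eight toy "experts": expert i simply multiplies its input by i + 1.
experts = [lambda x, s=i + 1: s * x for i in range(8)]

# Made-up router scores for one token. A real router computes these
# from the token's hidden state with a small learned layer.
logits = [0.1, -1.2, 0.3, 2.0, 0.0, 1.5, -0.5, 0.2]

routes = top_k_route(logits)              # experts 3 and 5 win
output = moe_layer(1.0, logits, experts)  # only 2 of the 8 experts are evaluated
```

The point of the sketch is the cost model: all eight experts exist, but each token pays for only two of them.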

Where MoE gets messy

MoE is not free. It brings real engineering headaches:

Memory Overhead: Even if only 12B parameters are active for a given token, the whole 56B model still has to sit in GPU memory, because the next token may be routed to different experts. MoE can need much more VRAM than a dense model with similar active compute.
Routing Collapse: Early in training, the router can overuse one expert. That expert gets better, so the router sends it even more traffic while the others stay undertrained. Engineers counter this with load-balancing techniques, such as an auxiliary loss that penalizes uneven routing, so every expert learns.
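
One common load-balancing technique is an auxiliary loss in the style of the Switch Transformers paper listed in Sources: for N experts, it multiplies the fraction of tokens sent to each expert by the average router probability for that expert, and sums the products. The sketch below is a simplified, single-batch version for intuition only; the function name, the toy batches, and the alpha value are illustrative assumptions.

```python
def load_balance_loss(router_probs, assignments, num_experts, alpha=0.01):
    """Auxiliary loss alpha * N * sum_i(f_i * P_i).

    router_probs: one list of probabilities per token (each sums to 1).
    assignments:  index of the expert each token was actually sent to.
    alpha:        weighting coefficient (an assumed value for illustration).
    """
    num_tokens = len(assignments)
    # f_i: fraction of tokens dispatched to expert i.
    f = [assignments.count(i) / num_tokens for i in range(num_experts)]
    # P_i: average router probability given to expert i across the batch.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced batch: four tokens spread evenly across four experts.
balanced_probs = [[0.25, 0.25, 0.25, 0.25]] * 4
balanced = load_balance_loss(balanced_probs, [0, 1, 2, 3], num_experts=4)

# Collapsed batch: the router confidently sends every token to expert 0.
collapsed_probs = [[0.97, 0.01, 0.01, 0.01]] * 4
collapsed = load_balance_loss(collapsed_probs, [0, 0, 0, 0], num_experts=4)
```

The collapsed batch scores nearly four times higher than the balanced one, so adding this term to the training loss nudges the router back toward spreading traffic.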

Quantization

Neural networks store learned weights as decimals called floating-point numbers. Modern models are usually trained and released with 16-bit weights (FP16 or BF16).

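At 16-bit precision, each parameter needs 2 bytes of GPU memory. A 70B model needs at least 140GB of VRAM just to load its weights, before any memory for activations or the KV cache. That is way beyond consumer GPUs, which typically offer a few tens of gigabytes at most.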

Quantization compresses weights by lowering numerical precision, mapping them to smaller formats like 8-bit integers (INT8), 4-bit integers (INT4), or custom formats like FP4.
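
To make the mapping concrete, here is a minimal sketch of one simple scheme, symmetric "absmax" quantization, where a single scale factor maps floats onto a signed integer range. It is an assumption chosen for clarity, not the method used by any specific format discussed here; production quantizers work per group or per channel and use more careful rounding. The weight values and function names are made up.

```python
def quantize_absmax(weights, bits=8):
    """Symmetric quantization: floats -> signed integers sharing one scale."""
    qmax = 2 ** (bits - 1) - 1              # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp into the representable range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.05, 0.88, -0.31]  # made-up example weights

q8, s8 = quantize_absmax(weights, bits=8)
q4, s4 = quantize_absmax(weights, bits=4)

# Worst-case round-trip error at each precision.
err8 = max(abs(w - r) for w, r in zip(weights, dequantize(q8, s8)))
err4 = max(abs(w - r) for w, r in zip(weights, dequantize(q4, s4)))
```

With 8 bits the round trip is nearly exact for these values; with 4 bits there are only 15 usable levels, so the error grows but stays bounded by half a step (scale / 2).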

Quantization intuition

Quantization is like lowering the color depth of a photo. Convert 24-bit true color to an 8-bit palette and the file shrinks hard. It looks a bit less smooth, but the shapes and meaning are still obvious.

Similarly, quantizing a model from 16-bit to 4-bit cuts weight size by about 75%. A 70B model that needed 140GB of VRAM can fit in around 35GB, plus a small overhead for the scaling factors stored alongside the weights. Because neural networks have lots of redundancy, the quality hit can be surprisingly small.
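
The arithmetic behind these numbers is easy to check. The helper below is a rough estimate of weight storage only (no activations, no KV cache), using 1 GB = 1e9 bytes. The optional group_size and scale_bits parameters are assumptions that model one 16-bit scale per group of weights, which is a common layout but not a universal one.

```python
def weight_memory_gb(num_params, bits_per_weight, group_size=None, scale_bits=16):
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    If group_size is given, add one scale of scale_bits for every
    group_size weights (an assumed metadata layout for illustration).
    """
    total_bits = num_params * bits_per_weight
    if group_size:
        total_bits += (num_params / group_size) * scale_bits
    return total_bits / 8 / 1e9

params = 70e9  # a 70B-parameter model

fp16 = weight_memory_gb(params, 16)                         # 140.0
int8 = weight_memory_gb(params, 8)                          # 70.0
int4 = weight_memory_gb(params, 4)                          # 35.0
int4_grouped = weight_memory_gb(params, 4, group_size=128)  # about 36.1

saving = 1 - int4 / fp16                                    # 0.75
```

Under these assumptions the per-group scales add only about 1GB on top of the ideal 35GB, which is why the 75% figure holds up well in practice.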

Modern quantization formats

Several common formats and methods are used to ship and run compressed models:

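GGUF (successor to GGML): The file format used by llama.cpp. It was designed with CPU inference in mind, so large models can run from system RAM on consumer laptops. It also supports GPU offload, for example on Apple Silicon MacBooks, where the CPU and GPU share the same unified memory.
GPTQ / AWQ: GPU-focused post-training quantization methods, typically at 4-bit, whose checkpoints keep compressed models generating quickly on standard desktop graphics cards.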

Serving Efficiency in Real Systems

MoE and quantization are only part of the deployment story. Production inference stacks also rely on KV-cache reuse, batching, speculative decoding, model distillation, and careful routing between small and large models. A customer-support bot might use a small fast model for classification, a retrieval model for evidence, and a larger reasoning model only when the case is complex.
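
The customer-support example can be sketched as a simple cascade. Everything here is a hypothetical stand-in: the keyword check plays the role of the small classifier, retrieve returns placeholder documents, and the two "models" just return labeled strings. A real system would call actual models at each step.

```python
def classify_complexity(message):
    """Stand-in for a small, fast classifier model (hypothetical keyword check)."""
    hard_markers = ("refund", "legal", "outage", "escalate")
    is_hard = any(word in message.lower() for word in hard_markers)
    return "complex" if is_hard else "simple"

def retrieve(message):
    """Stand-in for a retrieval step; returns placeholder document names."""
    return ["policy.md", "faq.md"]

def small_model(message):
    return f"[small model] quick answer to: {message}"

def large_model(message, evidence):
    return f"[large model] reasoned answer to: {message} (using {len(evidence)} documents)"

def handle(message):
    # Cheap path first: only complex cases pay for retrieval and the large model.
    if classify_complexity(message) == "simple":
        return small_model(message)
    return large_model(message, retrieve(message))

simple_reply = handle("What are your opening hours?")
complex_reply = handle("I want a refund for last month's outage")
```

The design choice is that the expensive model sits behind a cheap gate, so average cost tracks the share of hard requests rather than total traffic.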

The practical question is not "what is the biggest model we can run?" It is "what is the cheapest system that meets the quality, latency, privacy, and reliability target?" Efficient AI products mix model sizes, precision levels, retrieval, caching, and fallback paths instead of sending every request to the same expensive endpoint.

Sources

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity — Fedus et al.
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — Frantar et al.
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — Lin et al.
GGUF format documentation — ggml