Chapter 4 • 9 min read

Scaling Efficiency: MoE & Quantization

As AI models grow larger, running them becomes very expensive. A dense 175-billion-parameter model needs about 350 GB just to hold its weights at 16-bit precision, so it requires multiple high-end enterprise GPUs running concurrently to generate even a single token. To make these models practical for commercial use and deployable on smaller hardware, engineers rely on two major efficiency techniques: **Mixture of Experts (MoE)** and **Quantization**.

Mixture of Experts (MoE)

In a standard **"Dense"** model, every parameter (the weights on the network's connections) is used for every token processed. This is inefficient; a model doesn't need to invoke its entire knowledge base to process a simple punctuation mark or pronoun.

An **MoE** architecture turns a dense model into a **"Sparse"** model by splitting parts of it into parallel sub-networks called **Experts** (typically inside the Feed-Forward Network layers). Instead of passing a token through all pathways, a learned **Gating Network (Router)** scores the experts and decides which of them should handle each token.

Sparse Routing in Action

Imagine a model with 8 distinct "Experts." The specialist labels below are illustrative; in trained models, experts rarely map onto such clean, human-readable topics. When a token is processed:

  • If the token is part of a line of Python code, the Router sends it to Expert 3 (Code specialist) and Expert 5 (Logic specialist).
  • If the token is part of a word in French, the Router sends it to Expert 1 (Translation specialist).

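Typically, the Router selects only the Top-2 Experts for each token. If a model is built from 8 experts of nominally 7B parameters each (8x 7B, or 56B in total), it activates only roughly 12B parameters per token. That is slightly less than 2 x 7B = 14B because the attention layers are shared by all experts rather than duplicated, and for the same reason the true total tends to land somewhat below the nominal 56B. The result is the knowledge capacity of a model in the 56B class, with per-token compute close to that of a much smaller 12B dense model.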
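The Top-2 selection described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular model implements it: the function names (`top_k_route`, `moe_layer`), the toy "experts" that merely scale their input, and the router scores are all assumptions made up for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and renormalize their weights.

    router_logits: one raw score per expert for a single token.
    Returns (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(router_logits)
    # Indices of the k experts with the highest probability.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_layer(token, experts, router_logits, k=2):
    """Run only the selected experts and blend their outputs."""
    routes = top_k_route(router_logits, k)
    return sum(weight * experts[i](token) for i, weight in routes)

# Toy setup (hypothetical): 8 "experts", each just scales its input.
experts = [lambda x, scale=s: x * scale for s in range(1, 9)]
# Made-up router scores for one token; indices 3 and 5 (0-based) score highest.
logits = [0.1, 0.2, 0.0, 2.0, 0.3, 1.5, 0.1, 0.2]

routes = top_k_route(logits, k=2)
output = moe_layer(3.0, experts, logits, k=2)
print(routes, output)
```

Only two of the eight expert functions are ever called for this token, which is where the compute saving comes from. In a real model the router scores are produced by a small learned layer rather than written by hand.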

The Challenges of MoE

MoE is not a free lunch. It introduces several hard engineering hurdles:

  1. Memory Overhead: Although only about 12B parameters are active for any given token, the entire 56B-parameter model must still be loaded into the GPU's memory (VRAM), because any expert may be selected for the next token. This means an MoE model requires significantly more memory than a dense model of equivalent speed.
  2. Routing Collapse: During early training, the router might favor one expert. That expert then receives more training signal and improves, which causes the router to send it even more traffic while the other experts stay undertrained. To counter this, engineers typically add a load-balancing term to the training objective (an auxiliary loss) so that tokens are spread across experts and all of them are trained.
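To make the load-balancing idea concrete, here is a small sketch of one common style of auxiliary penalty: multiply the fraction of tokens each expert receives by the average router probability for that expert, then sum. The formulation, the top-1 assignment, and the function name `load_balancing_loss` are assumptions chosen for illustration; real training code differs in detail and typically scales this term by a small coefficient.

```python
def load_balancing_loss(router_probs, num_experts):
    """Penalty that rises when tokens pile onto the same expert.

    router_probs: list of per-token probability lists (each sums to 1).
    Uses top-1 assignment for simplicity; this is an illustrative choice.
    """
    num_tokens = len(router_probs)
    # f[i]: fraction of tokens whose top choice is expert i.
    counts = [0] * num_experts
    for probs in router_probs:
        counts[max(range(num_experts), key=lambda i: probs[i])] += 1
    f = [c / num_tokens for c in counts]
    # p[i]: average router probability assigned to expert i.
    p = [sum(probs[i] for probs in router_probs) / num_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced batch: four tokens, each preferring a different expert.
balanced = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Collapsed batch: every token prefers expert 0.
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4

balanced_loss = load_balancing_loss(balanced, 4)
collapsed_loss = load_balancing_loss(collapsed, 4)
print(balanced_loss, collapsed_loss)
```

In this toy example the collapsed batch scores noticeably higher than the balanced one, so adding the term to the training loss pushes the router away from sending everything to a single expert.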

Quantization

Neural networks represent their learned weights as high-precision decimals called floating-point numbers. During training, these are typically represented in 16-bit precision (**FP16** or **BF16**).

Storing weights in 16-bit precision means every single parameter requires 2 bytes of GPU memory. A 70-billion-parameter model therefore requires at least 140 gigabytes of VRAM just to load its weights, which far exceeds the memory of any single consumer GPU.

**Quantization** is the process of compressing these weights by reducing their numerical precision—mapping them to smaller formats like 8-bit integers (**INT8**), 4-bit integers (**INT4**), or even custom formats like **FP4**.
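As a rough illustration of what "mapping to a smaller format" means, the sketch below applies symmetric (absmax) quantization to a handful of made-up weights. It is a simplified assumption, not a production scheme: real quantizers usually work on tensors with per-channel or per-group scales and often use calibration data. The helper names `quantize_absmax` and `dequantize` are hypothetical.

```python
def quantize_absmax(weights, bits=8):
    """Symmetric (absmax) quantization of a list of floats to signed ints."""
    qmax = 2 ** (bits - 1) - 1          # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    # One shared scale maps the largest magnitude onto qmax.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the integer codes."""
    return [q * scale for q in quantized]

# Made-up weights for illustration only.
weights = [0.42, -1.27, 0.03, 0.88, -0.56]

q8, s8 = quantize_absmax(weights, bits=8)
q4, s4 = quantize_absmax(weights, bits=4)

# Worst-case reconstruction error at each precision.
err8 = max(abs(w - r) for w, r in zip(weights, dequantize(q8, s8)))
err4 = max(abs(w - r) for w, r in zip(weights, dequantize(q4, s4)))
print(q8, q4, err8, err4)
```

The integer codes plus a single scale factor are all that needs to be stored. Going from 8 bits to 4 bits leaves far fewer representable levels, so the reconstruction error grows; that is the accuracy cost being traded for memory.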

The Intuition Behind Quantization

Think of quantization like reducing the color depth of a digital photo. If you convert a photo from 24-bit true color to an 8-bit color palette, the raw pixel data shrinks by about two-thirds (roughly 67%). The image looks slightly less smooth, but the shapes, objects, and overall context are still perfectly recognizable.

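Similarly, when we quantize a model from 16-bit to 4-bit, we decrease the size of its weights by 75%. A 70B model that once required 140GB of VRAM can now fit into roughly 35GB of VRAM (slightly more in practice, since the scaling factors needed to decode the weights must be stored too). Because neural networks carry a great deal of redundancy, this compression usually costs only a small amount of output quality, although the loss grows as the bit width shrinks and varies by model and quantization method.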
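The arithmetic behind these figures is simple enough to check directly. The helper below is a back-of-the-envelope sketch (the name `weight_memory_gb` is made up for this example) and counts the weights only; it ignores the KV cache, activations, and any per-group scaling factors a quantized format stores alongside the weights.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

params_70b = 70e9  # the 70-billion-parameter model from the text

fp16_gb = weight_memory_gb(params_70b, 16)
int8_gb = weight_memory_gb(params_70b, 8)
int4_gb = weight_memory_gb(params_70b, 4)
savings = 1 - int4_gb / fp16_gb  # fraction saved going from 16-bit to 4-bit

for name, gb in [("FP16", fp16_gb), ("INT8", int8_gb), ("INT4", int4_gb)]:
    print(f"{name}: {gb:.0f} GB")
print(f"16-bit -> 4-bit saves {savings:.0%}")
```

Halving the bit width halves the weight memory, which is why 16-bit to 4-bit gives the 75% reduction quoted above.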

Modern Quantization Formats

Several standard file formats are used to run these compressed models: