Chapter 1 • 8 min read • Last reviewed: June 2026

The Transformer Core

Before 2017, natural language processing (NLP) models read text like humans do: one word at a time, from left to right. These models, known as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, kept a running "mental state" that updated with each new word. While intuitive, this approach had a catastrophic bottleneck: it could not be parallelized. Because you needed the state of word n-1 to compute the state of word n, GPUs could not process entire blocks of text simultaneously.

Then came the landmark paper, "Attention Is All You Need", which introduced the Transformer architecture. The Transformer discarded recurrence entirely, opting to process all words in a sequence at the exact same time. To do this, it relied on a mathematical shortcut called Self-Attention.

The Core Breakthrough: Self-Attention

Self-attention allows a model to look at a specific word and determine which other words in the sentence are most relevant to it, regardless of how far apart they are. Consider this example:

"The bank robber ran to the river bank because he saw the police."

How does the model know that the first "bank" refers to a financial institution, while the second "bank" refers to the edge of a river? In an RNN, by the time the model reached the end of the sentence, the context of "bank robber" might have faded. In a Transformer, the first "bank" is compared to all other words in the sentence concurrently, finding a strong mathematical connection to "robber." The second "bank" finds a connection to "river." The model dynamically contextualizes each word based on its surroundings.

Key Concept: The Query-Key-Value (QKV) Analogy

To compute attention, the Transformer assigns three vectors to every single word:

Query (Q): What the word is looking for (e.g., "I am a pronoun, where is my noun?").
Key (K): What the word represents or offers (e.g., "I am a noun, I describe a person").
Value (V): The actual content of the word (the semantic meaning).

The model multiplies the Query vector of a word by the Key vectors of all other words. The higher the score, the more attention that word gets. The final representation is a weighted sum of the Value vectors based on these attention scores.

Multi-Head Attention

Instead of doing this attention calculation once, the Transformer does it multiple times in parallel. Each calculation is called an Attention Head. This is known as Multi-Head Attention.

By using multiple heads, the model can look at different aspects of the text at the same time. For example:

Head 1 might focus on grammatical relationships (finding the verb for each noun).
Head 2 might focus on coreference resolution (matching "he" or "it" to the correct entity).
Head 3 might focus on physical proximity (local descriptors like adjectives).

Combined, these heads build a highly dimensional and accurate understanding of language.

Positional Encoding

Since a Transformer processes all words simultaneously, it has no natural understanding of order. To a pure attention mechanism, "The dog bit the man" and "The man bit the dog" look identical because the words are the same.

To fix this, the Transformer uses Positional Encodings—a set of mathematical values added to each word's embedding that act as a coordinate. These coordinates tell the model exactly where each word sits in the sentence, allowing it to preserve the structural grammar of the text.

Encoder vs. Decoder Architectures

The original Transformer consisted of two halves: an Encoder (which reads and understands text) and a Decoder (which writes new text). Depending on the task, modern advancements have split these into three variants:

Encoder-Only Models (e.g., BERT): Excellent for understanding, classifying, and extracting information from text. They look in both directions (left and right) simultaneously.
Decoder-Only Models (e.g., GPT, LLaMA): Excellent for generating text. They are autoregressive, meaning they generate one word at a time, looking only at past words (left-to-right masking) to predict the next word.
Encoder-Decoder Models (e.g., T5, BART): Often used for translation or summarization, where an input sequence is processed entirely, and a brand new output sequence is generated.

Many frontier text-first Large Language Models (LLMs) use decoder-only architectures, optimized for generating text by predicting the next token with massive efficiency.

Why This Still Matters in Products

Attention is the reason a model can connect a bug report to a stack trace, a legal question to a clause thirty pages earlier, or a chart caption to the data it describes. It is also the reason long prompts get expensive: every token competes for attention with many other tokens.

When a system fails, the root cause is often not "the model is dumb." It may be that the relevant tokens were missing, buried under noisy context, truncated by the context window, or split across modalities the model cannot read together. Good AI products treat context as a scarce design surface: include what matters, remove what does not, and make the important evidence easy for attention to find.

Sources

Attention Is All You Need — Vaswani et al.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — Devlin et al.

Chapter 2 • 9 min read • Last reviewed: June 2026

LLM Training & Alignment

Creating a modern AI assistant like ChatGPT or Gemini is not a single-step process. It requires taking raw, chaotic web data and refining it through multiple training stages. The journey from raw math to a helpful assistant is divided into three major milestones: Pre-training, Supervised Fine-Tuning (SFT), and Alignment.

Phase 1: Pre-training (Creating the "Base Model")

The foundation of any LLM is the pre-trained base model. During this stage, the model is fed petabytes of raw text from books, articles, code repositories, and web pages. The training objective is simple: predict the next word (token) in a sentence.

For example, given the text:

"The cat sat on the..."

The model calculates probability distributions over its entire vocabulary and predicts "mat" (or "sofa", "bed", etc.). By repeating this trillions of times across vast supercomputer clusters, the model builds a rich internal map of language, grammar, reasoning patterns, and encyclopedic facts. However, a base model is not an assistant; it is a text completer. If you ask a base model "Write a recipe for chocolate cake," it might reply with a second question: "And write a recipe for apple pie," because it is mimicking lists of recipes found on the internet.

Phase 2: Supervised Fine-Tuning (Creating the "Instruct Model")

To turn a text completer into an interactive assistant, engineers perform Supervised Fine-Tuning (SFT). In this phase, the base model is trained on a curated dataset of high-quality conversational prompts and responses, written by human experts.

A typical training sample looks like:

Prompt: Explain photosynthesis in one sentence.
Response: Photosynthesis is the process by which plants use sunlight, water, and carbon dioxide to create oxygen and energy in the form of sugar.

By training on tens of thousands of these conversational examples, the model learns the "instruct" behavior: it recognizes when it is being asked a question and understands that it must respond with helpful answers, adopting a conversational and polite tone.

Phase 3: Alignment (RLHF and DPO)

Even after SFT, a model can still produce toxic, biased, incorrect, or unhelpful output. SFT only teaches the model to imitate the training dialogues. To ensure the model is helpful, honest, and harmless, engineers "align" it with human preferences using two primary techniques:

1. Reinforcement Learning from Human Feedback (RLHF)

RLHF works by using a grading system. The process involves three steps:

Generate Options: The model generates multiple candidate answers to a prompt.
Train a Reward Model: Human evaluators rate these candidate answers from best to worst. A separate neural network—the Reward Model—is trained to predict what score a human would give to any given response.
Reinforce: Using an RL algorithm (typically PPO), the LLM's parameters are updated to maximize the score predicted by the Reward Model. Responses that humans like are rewarded, and disliked responses are penalized.

2. Direct Preference Optimization (DPO)

While RLHF is highly effective, it is notoriously unstable, expensive, and complex to train because it requires maintaining multiple models simultaneously (the LLM, the Reward Model, and reference models).

In 2023, researchers introduced Direct Preference Optimization (DPO). DPO bypasses the reward model entirely. It mathematically proves that you can optimize the LLM policy directly using a dataset of paired choices: a prompt, a preferred (chosen) response, and a disliked (rejected) response. DPO adjusts the weights so that the probability of the chosen response increases relative to the rejected response, creating a much simpler, faster, and more stable alignment loop.

Key Concept: Kaplan vs. Chinchilla Scaling Laws

How do we make models smarter? For a long time, the industry followed Kaplan's scaling laws (2020), which suggested that parameter size was the single most important factor—urging engineers to build larger models, even if they couldn't afford to train them on more data.

In 2022, DeepMind published the Chinchilla scaling laws. They proved that for optimal performance, parameter count and training data (tokens) should scale in equal proportion. Most models were actually under-trained on too little data. This shifted the industry toward training smaller, highly efficient models (like LLaMA or Mistral) for much longer on high-quality tokens, making them far cheaper to run on standard hardware.

Choosing the Right Adaptation Method

Most product teams do not train foundation models from scratch. They choose among smaller adaptation levers:

Prompting: Best for behavior that can be described in instructions and examples.
RAG: Best when the answer depends on changing, private, or auditable knowledge.
Fine-tuning: Best when the model needs a consistent style, format, domain vocabulary, or task habit that is hard to teach in every prompt.
Preference tuning: Best when several answers are plausible but the product has a clear preference for one kind of response.
Guardrails and evals: Necessary when mistakes are expensive, no matter which training method is used.

A useful rule: do not fine-tune just to add facts. Facts change and should usually live in retrieval, tools, or databases. Fine-tune when you want the model to behave differently even when the same facts are already present.

Sources

Scaling Laws for Neural Language Models — Kaplan et al.
Training Compute-Optimal Large Language Models — Hoffmann et al.
Training Language Models to Follow Instructions with Human Feedback — Ouyang et al.
Direct Preference Optimization — Rafailov et al.

Chapter 3 • 9 min read • Last reviewed: June 2026

RAG & Context Windows

An AI model has two kinds of memory. The first is Parametric Memory—information baked directly into the model's weights during training. The second is Working Memory—the space available in the immediate input prompt, known as the Context Window. To build reliable systems that do not hallucinate, engineers use these two memories in tandem through RAG and ultra-long context architectures.

The Limitation of Parametric Memory

Relying purely on what a model has memorized has three massive drawbacks:

Knowledge Cutoff: The model only knows what existed before its training run finished.
Hallucinations: When asked about obscure facts, models often confidently guess, creating plausible-sounding falsehoods.
No Access to Private Data: Models cannot read your local PDFs, company emails, or secure databases.

Retrieval-Augmented Generation (RAG)

RAG solves this by turning the model into an open-book test taker. Instead of answering from memory, the system searches an external database for the answer, pastes the relevant documents directly into the context window, and asks the model to read them to generate the answer.

How the RAG Pipeline Works

Chunking: Large documents (like a 100-page manual) are broken down into small, digestible paragraphs (chunks).
Embeddings: An Embedding Model converts each text chunk into a string of numbers (a vector) representing its semantic meaning.
Vector Database: These vectors are stored in a specialized database (like Pinecone, Chroma, or pgvector).
Retrieval (Semantic Search): When a user asks a question, the system converts their question into a vector and finds the text chunks in the database that are mathematically closest to the question's meaning.
Augmentation & Generation: The system fetches those text chunks, inserts them into a prompt alongside the user's question, and sends it to the LLM: "Here is the context: [chunks]. Answer this question based on that context: [query]."

The Evolution of Context Windows

If RAG is so powerful, why not just feed the entire database directly into the model? Historically, this was impossible because of the way attention works.

The memory and computation cost of standard Self-Attention scales quadratically ($O(N^2)$) with the length of the input. If you double the length of your input, it takes four times more compute and memory to process. Early models were capped at a context window of just 2,048 tokens (roughly 1,500 words).

Recent architectural and serving breakthroughs have broken this barrier. Frontier systems now commonly offer hundreds of thousands to millions of tokens of working memory, making it possible to analyze long PDFs, codebases, transcripts, and multi-file projects in a single request. The main pillars of this scaling are:

1. FlashAttention

Introduced by Tri Dao, FlashAttention is a software-level optimization. Rather than changing the math of attention, it changes how the GPU handles memory. Standard attention writes massive intermediate tables back and forth between slow GPU High Bandwidth Memory (HBM) and fast on-chip SRAM. FlashAttention computes attention in small blocks, keeping data in the fast SRAM cache as much as possible. This reduces memory traffic by up to 20x, allowing context windows to scale dramatically without running out of GPU memory.

2. Rotary Position Embeddings (RoPE)

Older absolute positional systems could not handle context lengths longer than what they were trained on. RoPE represents positions by rotating the word vectors in a multi-dimensional mathematical space. Because rotation is relative, the model can understand the distance between words even if the total text length is far longer than the training parameters, allowing context window sizes to be scaled up post-training with minimal fine-tuning.

The "Needle in a Haystack" Test

Just because a model can accept a million tokens doesn't mean it is actually reading them. To evaluate long-context retrieval, researchers use the Needle in a Haystack (NIAH) test.

A random, unrelated fact (the "needle") is hidden somewhere inside a massive text dump of documents (the "haystack"). The model is then asked a question that can only be answered using that specific fact. Modern models must achieve near 100% accuracy, finding the needle regardless of whether it is hidden at the beginning, middle, or end of the document stack.

However, long context is not a free replacement for retrieval. Million-token prompts can still be slower, more expensive, and harder to audit than a well-built RAG pipeline. In production systems, engineers often combine both: use retrieval to select the most relevant evidence, then use a long-context model when the task requires cross-document synthesis, codebase-wide reasoning, or comparison across many artifacts.

RAG vs. Long Context: How to Choose

Use RAG when the corpus is large, frequently changing, permissioned, or needs precise citations. Retrieval keeps prompts smaller, makes source selection auditable, and lets the application enforce access control before the model sees anything.

Use long context when the task requires comparing many pieces at once: reviewing a pull request across files, reconciling a contract with its exhibits, summarizing a full transcript, or finding contradictions across a small document set.

The most reliable pattern is often hybrid: retrieve the best candidates first, rerank them, then give a long-context model enough surrounding material to synthesize rather than quote isolated snippets. The eval should check both steps: did retrieval find the right evidence, and did generation stay faithful to it?

Sources

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — Lewis et al.
FlashAttention — Dao et al.
RoFormer: Enhanced Transformer with Rotary Position Embedding — Su et al.
Gemini 1.5 and long-context model capabilities — Google

Chapter 4 • 9 min read • Last reviewed: June 2026

Scaling Efficiency: MoE & Quantization

As AI models grow larger, running them becomes incredibly expensive. A dense 175-billion-parameter model requires multiple high-end enterprise GPUs running concurrently just to output a single word. To make these models practical for commercial use and deployable on smaller hardware, engineers rely on two massive efficiency breakthroughs: Mixture of Experts (MoE) and Quantization.

Mixture of Experts (MoE)

In a standard "Dense" model, every single parameter (the neural connections) is activated for every single word processed. This is highly inefficient; a model doesn't need to invoke its entire mathematical knowledge base to process a simple punctuation mark or pronoun.

An MoE architecture turns a dense model into a "Sparse" model by breaking it up into specialized compartments called Experts (typically inside the Feed-Forward Network layers). Instead of passing a word through all pathways, a dynamic Gating Network (Router) decides which experts should handle which word.

Sparse Routing in Action

Imagine a model with 8 distinct "Experts." When a token is processed:

If the token is a line of Python code, the Router sends it to Expert 3 (Code specialist) and Expert 5 (Logic specialist).
If the token is a word in French, the Router sends it to Expert 1 (Translation specialist).

Typically, the Router selects only the Top-2 Experts for each token. If a model has a total of 8x 7B experts (56B total parameters), it only activates roughly 12B parameters per token. This gives the model the vast knowledge capacity of a 56B model, but with the fast generation speed and compute cost of a much smaller 12B model.

The Challenges of MoE

MoE is not a free lunch. It introduces several hard engineering hurdles:

RAM Overhead: Although only 12B parameters are active at any millisecond, the entire 56B parameter model must still be loaded into the GPU's memory (VRAM). This means MoE requires significantly more memory than dense models of equivalent speed.
Routing Collapse: During early training, the router might favor one expert, making it smarter, which causes the router to send even more traffic to it. Engineers must write custom algorithms to force load-balancing so all experts are trained evenly.

Quantization

Neural networks represent their learned weights as high-precision decimals called floating-point numbers. During training, these are typically represented in 16-bit precision (FP16 or BF16).

Storing weights in 16-bit precision means every single parameter requires 2 bytes of GPU memory. A 70-billion-parameter model requires at least 140 gigabytes of VRAM just to load, which exceeds the capacity of almost all consumer GPUs.

Quantization is the process of compressing these weights by reducing their numerical precision—mapping them to smaller formats like 8-bit integers (INT8), 4-bit integers (INT4), or even custom formats like FP4.

The Intuition Behind Quantization

Think of quantization like reducing the color depth of a digital photo. If you convert a photo from 24-bit true color to an 8-bit color palette, the file size shrinks by 66%. The image looks slightly less smooth, but the shapes, objects, and overall context are still perfectly recognizable.

Similarly, when we quantize a model from 16-bit to 4-bit, we decrease its size by 75%. A 70B model that once required 140GB of VRAM can now fit into roughly 35GB of VRAM. Remarkably, due to the high mathematical redundancy in neural networks, this massive compression results in only a tiny degradation in reasoning capability.

Modern Quantization Formats

Several standard file formats are used to run these compressed models:

GGUF (formerly GGML): Optimized specifically for CPU execution, allowing users to run large models on consumer laptops (like Apple Silicon Macbooks) by leveraging system RAM instead of expensive GPU VRAM.
GPTQ / AWQ: Formats optimized for GPU-accelerated quantized inference, ensuring that compressed models generate text at blisteringly fast speeds on standard desktop graphic cards.

Serving Efficiency in Real Systems

MoE and quantization are only part of the deployment story. Production inference stacks also rely on KV-cache reuse, batching, speculative decoding, model distillation, and careful routing between small and large models. A customer-support bot might use a small fast model for classification, a retrieval model for evidence, and a larger reasoning model only when the case is complex.

The practical question is not "what is the biggest model we can run?" It is "what is the cheapest system that meets the quality, latency, privacy, and reliability target?" Efficient AI products usually mix model sizes, precision levels, retrieval, caching, and fallback paths instead of sending every request to the same expensive endpoint.

Sources

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity — Fedus et al.
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — Frantar et al.
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — Lin et al.
GGUF format documentation — ggml

Chapter 5 • 8 min read • Last reviewed: June 2026

Diffusion & Generative Media

Generative AI for images and videos has undergone a massive transformation. Early image generators, called GANs (Generative Adversarial Networks), were notoriously difficult to train, often failing to produce coherent pictures. Today, almost all modern image and video generators (Stable Diffusion, Midjourney, Sora, Flux) rely on a mathematical concept called Diffusion.

The Diffusion Paradigm

Instead of trying to draw an image from scratch, a diffusion model is trained to do one thing: remove static noise. The process is split into two phases: the forward process and the reverse process.

1. The Forward Process (Destroying Information)

We take a clean photograph (say, of a golden retriever) and add a tiny layer of random mathematical noise. We repeat this step-by-step, perhaps 1,000 times, until the original dog is completely obliterated, leaving nothing but a block of pure gray static. This process requires no neural network; it is pure math.

2. The Reverse Process (Creating Information)

This is where the neural network lives. We show the model a noisy image and ask it: "Can you predict exactly how much noise was added in this step?"

By training the model on millions of pairs of clean and noisy images, it learns to recognize subtle structures within noise. When we want to generate a new image, we feed the model a block of pure, random noise and a text prompt (e.g., "A golden retriever playing in the grass"). The model subtracts a sliver of estimated noise. We repeat this subtraction loop 20 to 50 times. Bit by bit, structures appear, and a completely unique, high-resolution image emerges.

Key Concept: Latent Diffusion

Early diffusion models operated in "Pixel Space." Generating a 1024x1024 pixel image meant calculating noise values for over a million pixels at every step. This made early models incredibly slow and memory-intensive.

The breakthrough was Latent Diffusion (popularized by Stable Diffusion). It uses a Variational Autoencoder (VAE) to compress the image into a highly dense representation called "latent space" (shrinking a 512x512 image down to a 64x64 grid). The diffusion model does all its heavy lifting in this low-resolution space, and the VAE decodes the final latents back into pixels at the very end. This saved 90%+ of the compute, making image generation run on consumer laptops.

Classifier-Free Guidance (CFG)

How does the model make sure the image it generates actually matches your prompt, instead of wandering off on its own? This is controlled by Classifier-Free Guidance (CFG).

During training, the model is occasionally trained without text prompts (unconditioned). During generation, the model predicts two things: what the noise removal should look like with the prompt, and what it should look like without it. The CFG scale decides how much weight to give to the difference.

Low CFG (1 to 3): Gives the model creative freedom. The image will be artistic but might ignore parts of your prompt.
Medium CFG (7 to 9): The sweet spot for high-quality, prompt-adhering images.
High CFG (15+): Forces strict prompt adherence, though it can make the image look oversaturated and digitally artificial.

The Shift to Diffusion Transformers (DiT)

Traditional diffusion models used a convolutional network backbone called a U-Net to predict noise. However, U-Nets struggled to scale efficiently with massive datasets and compute budgets.

In 2023, researchers introduced the Diffusion Transformer (DiT). DiT replaces the U-Net with a standard Transformer backbone. By dividing the latent image into patches (similar to how an LLM divides text into tokens), DiT models can scale predictably: adding more parameters and compute directly correlates with better image and video fidelity. This architecture underpins the latest state-of-the-art models like OpenAI's Sora, Stable Diffusion 3, and Flux.

What Matters in Media Products

Real generative-media tools are rarely one prompt and one output. They combine text prompts with reference images, masks, control signals, style constraints, safety filters, and editing loops. The user may generate a rough image, inpaint one region, extend the canvas, upscale the result, then use a separate model to caption or moderate it.

The same idea extends to video and design workflows: the valuable product feature is often control, not raw generation. Teams need predictable character identity, readable text, brand-safe style, provenance metadata, and review tools for rights, likeness, and safety. Diffusion explains the engine; product constraints decide whether the output is usable.

Sources

Denoising Diffusion Probabilistic Models — Ho et al.
High-Resolution Image Synthesis with Latent Diffusion Models — Rombach et al.
Classifier-Free Diffusion Guidance — Ho and Salimans
Scalable Diffusion Models with Transformers — Peebles and Xie

Chapter 6 • 9 min read • Last reviewed: June 2026

Agentic AI & Reasoning

For the first few years of the LLM boom, AI models were treated as passive chatbots: you write a prompt, and the model instantly outputs a response. Today, the frontier has shifted toward Agentic AI. Instead of answering statically, agentic systems act as autonomous software entities that can plan, use external tools, inspect their own output, and run in loops to solve multi-step problems.

Tool Use & Function Calling

LLMs are notoriously bad at precise math (like multiplying two 8-digit numbers) and cannot fetch live data or interact with the physical world because they are just word-prediction engines.

Tool Use (or Function Calling) overcomes this limitation by letting the host application expose specific capabilities. The model is provided with a list of available tools, described in plain text. For example:

Available Tool: calculate_weather(location, date)
- Returns the temperature forecast for a location.

If the user asks: "Should I wear a coat in Chicago tomorrow?", the LLM recognizes it cannot answer from memory. Instead of guessing, it outputs a structured instruction:

{
  "call": "calculate_weather",
  "arguments": { "location": "Chicago", "date": "tomorrow" }
}

The host application intercepts this JSON, runs the actual weather API, receives the result (e.g., "Chicago: 41°F, Rain"), and appends it to the model's chat history. The LLM reads the result and finishes its response: "Yes, you should wear a coat. Chicago will be 41°F and raining tomorrow."

Reasoning Loops: ReAct and Reflection

To solve complex tasks, agents use structured loops rather than generating answers in a single pass.

1. The ReAct (Reason + Act) Loop

ReAct forces the model to document its thinking before taking actions. The loop proceeds as follows:

Thought: The model explains its plan (e.g., "I need to find the population of France, then multiply it by 0.12").
Action: The model calls a search engine or calculator tool.
Observation: The model reads the tool's output and updates its plan, looping back to Thought until the task is complete.

2. Reflection and Self-Correction

If a model writes a block of code, it may contain a bug. A reflection agent doesn't send the code to the user immediately. Instead, it runs the code in an isolated environment, catches any error logs, feeds those errors back to itself, and rewrites the code to fix the bug. This cyclic feedback loop dramatically boosts task success rates.

What Makes an Agent Safe Enough to Ship

An agent is more than a prompt with tools. The product around it needs boundaries:

Scoped tools: Each tool should do one clear thing with the least permission needed.
Typed arguments: The host application validates tool inputs before execution.
Approval gates: Irreversible actions such as sending emails, charging cards, deleting data, or changing permissions should require confirmation.
State and memory rules: The system should decide what is saved, what expires, and what the model may read later.
Trace logs: Operators need to see prompts, tool calls, observations, errors, and final answers when debugging failures.

The model can decide which tool to request, but the application must decide whether that request is allowed. Instructions are not access control.

System 1 vs. System 2 Thinking in AI

Cognitive psychologist Daniel Kahneman famously split human thinking into two modes:

System 1 (Fast): Fast, intuitive, automatic actions (e.g., answering "2+2=?", reading a familiar road sign).
System 2 (Slow): Slow, deliberate, logical reasoning (e.g., solving "17 × 24", filling out a tax form).

Standard LLMs operate mostly like System 1. They output the next token immediately without much opportunity to plan, test, or revise. If they start a sentence poorly, they cannot truly rewind the generation path.

Modern System 2 Reasoning Models spend extra inference-time compute before producing a final answer. Current reasoning systems point toward the same design shift: models are being trained and served to plan, use tools, check intermediate work, and keep going across longer workflows. Some expose controllable reasoning effort; others keep the reasoning internal while returning a concise answer.

Sources

ReAct: Synergizing Reasoning and Acting in Language Models — Yao et al.
Toolformer: Language Models Can Teach Themselves to Use Tools — Schick et al.
Reflexion: Language Agents with Verbal Reinforcement Learning — Shinn et al.
Learning to Reason with LLMs — OpenAI
DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning — Nature

Chapter 7 • 9 min read • Last reviewed: June 2026

Future Frontiers & Physical AI

We are entering a new era of artificial intelligence. The frontier is no longer just about making models bigger. Researchers are expanding models into the physical world, training them to process multiple senses natively, and shifting from one-shot answers toward agents that can take action over time.

Native Multimodality

Early multimodal systems were "Stitched" together. For example, to let an AI "see" an image, engineers would run an image-captioning model to generate a text description, and then feed that text to the LLM. This was incredibly lossy; a text caption cannot capture the precise spatial layout of a room, the emotional expression on a face, or the specific pitch of a sound.

Modern state-of-the-art models are increasingly Natively Multimodal. They are built with a unified architecture or tightly integrated model system. Text, pixels, audio waveforms, video frames, and tool outputs are converted into a shared mathematical language (embeddings) and routed through models that can reason across them.

This allows the model to reason across modalities simultaneously. A native multimodal model can watch a video, listen to the speaker's sarcasm, read the slides in the background, and output a unified analysis in real time, catching nuances that stitched systems miss entirely.

The "Data Wall" and Synthetic Data

For a decade, AI progress was fueled by feeding models more data. However, the industry is hitting a Data Wall: LLMs have already consumed almost all high-quality, publicly available human-written text on the internet.

To continue training, researchers are turning to Synthetic Data—data generated by AI models to train other AI models.

The Promise and Danger of Synthetic Data

If models train on unverified synthetic data, they risk Model Collapse—a phenomenon where errors, biases, and weird linguistic quirks compound over generations, causing the model to become increasingly stupid and disconnected from reality.

To prevent this, engineers use Verified Synthesis: using external environments to validate the AI's data. For example:

An AI generates code, which is then run in a compiler to verify it works. Only working code is used for training.
An AI solves a math problem. The solution is validated using formal math verifiers.
An AI reasons about physical properties. The scenario is run through a physics engine to make sure it follows real-world laws.

Robotics and Physical Grounding

For AI to truly understand the world, it must interact with it. By combining multimodal LLMs with robotic control systems, researchers have developed Vision-Language-Action (VLA) models such as Google's RT-2 and Gemini Robotics.

A VLA model doesn't just output text; it outputs physical actions for a robot's joints and grippers. When you tell a VLA-enabled robot arm: "Pick up the yellow banana and put it in the basket," the model processes the camera feed (pixels), matches the words to the objects, calculates the spatial path, and controls the robot's motors directly. The LLM acts as the robot's planning layer, giving it common-sense reasoning and adaptability to new environments without custom programming for every object.

The Next Paradigm: Test-Time Compute

Pre-training scaling laws (adding more parameters and GPUs during training) are no longer the only axis of progress. The newer vector is Test-Time Compute (scaling at inference time).

Instead of forcing a model to answer within a fraction of a second, test-time compute lets the model spend extra compute planning, checking, searching, or coordinating tools. This is why frontier model releases increasingly emphasize agentic coding, computer use, document work, and scientific workflows rather than only chat benchmark scores. The practical question is becoming: how much thought should the system buy for this task?

What Is Relevant Now

The frontier is becoming less about one chatbot box and more about systems that coordinate perception, memory, tools, and verification. The most relevant product questions are practical:

Can the model read the actual modality the user cares about, or is information lost in conversion?
Can synthetic data be checked by compilers, tests, simulators, formal verifiers, humans, or trusted datasets?
Can a robot or agent fail safely when perception is uncertain?
Is extra test-time compute buying real accuracy, or just slower answers?
Can the team observe and evaluate the full workflow rather than only the final response?

These questions make the chapter more concrete: multimodality, robotics, and reasoning are not separate trends. They are ways of giving AI systems better inputs, better actions, and better checks.

Sources

RT-2: New model translates vision and language into action — Google DeepMind
Gemini Robotics: Bringing AI into the Physical World — Google DeepMind
AI models collapse when trained on recursively generated data — Nature
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters — Snell et al.
Learning to Reason with LLMs — OpenAI

Chapter 8 • 10 min read • Last reviewed: June 2026

Evaluation, Safety & Production AI

Building an AI demo is mostly about capability: can the model answer, retrieve, reason, or act? Shipping an AI product is about repeatability: can the system keep doing the right thing when prompts change, documents drift, users attack it, models upgrade, and costs spike?

Production AI teams treat the model as one component in a larger system. The surrounding product needs evals, guardrails, observability, rollback plans, and human review for cases where model confidence is not enough.

Eval-Driven Development

An eval is a repeatable test for behavior that matters. Instead of asking "does the answer look good?", an eval asks a narrower question: did the assistant cite the right policy section, refuse the unsafe request, preserve the JSON schema, choose the right tool, or solve the task within the latency budget?

Useful eval suites mix several test types:

Golden examples: known prompts with expected answers, labels, or rubrics.
Regression cases: failures from production that must not come back after a prompt, retrieval, or model change.
Adversarial cases: inputs designed to trigger jailbreaks, prompt injection, data leakage, or unsafe tool calls.
Performance cases: examples that measure cost, latency, refusal rate, and answer length, not just correctness.

The important habit is to run evals before changing models, prompts, retrieval settings, or tool permissions. In AI systems, a "minor" prompt edit can behave like a code change across thousands of hidden branches.

The Production Eval Loop

The loop is simple: collect examples, run evals, block risky releases, monitor live behavior, review failures, and add those failures back to the suite.

Groundedness and Source Verification

For RAG systems, the most common failure is not a totally random answer. It is an answer that sounds plausible but is only partly supported by the retrieved evidence. A groundedness check compares each important claim against the source passages the system provided.

Good groundedness evaluation asks:

Does every factual claim have supporting evidence in the retrieved context?
Did the answer cite the specific source that supports the claim?
Did the model ignore conflicting evidence or overstate uncertainty?
Should the system answer, ask a clarifying question, retrieve again, or refuse?

This is why citations are not just decoration. A citation should be a checkable pointer to the evidence that justifies the answer. If the pointer is wrong, the system is teaching users to trust the wrong thing.

Prompt Injection and Tool Safety

Prompt injection happens when untrusted text tries to override the system's instructions. In a RAG app, the attack might live inside a PDF. In an agent, it might appear on a web page the agent browses. The dangerous pattern is the same: the model reads attacker-controlled text and treats it like an instruction from the product owner.

Tool use makes this risk sharper. A model that can only write text can mislead a user; a model with tools can email customers, change records, run code, or expose private data. Production systems reduce that risk with least-privilege tool scopes, allowlists, confirmation steps, output validation, and audit logs.

A strong rule of thumb: model instructions are not access control. The host application must enforce permissions outside the model.

Observability for AI Apps

Traditional logs often show an HTTP request, a status code, and a response time. AI observability needs more: prompt versions, retrieved chunks, model names, tool calls, token usage, evaluator scores, refusals, user feedback, and traces across the full agent loop.

Without traces, teams cannot answer basic production questions: Did retrieval fail? Did the model ignore good evidence? Did a tool return bad data? Did a prompt change increase cost? Did a new model improve benchmark scores but hurt real support tickets?

Human Review and Launch Gates

Human review is not a failure of automation; it is a control surface. High-impact workflows often need human approval for irreversible actions, sensitive domains, edge cases, and low-confidence answers. The product should make review efficient by showing the prompt, evidence, model answer, tool actions, and eval signals in one place.

Before launch, teams usually define gates: minimum eval scores, maximum hallucination rate, maximum latency, acceptable cost per task, security test coverage, and rollback criteria. After launch, sampled production traces become new tests so the system gets harder to break over time.

A Practical Launch Checklist

Before putting an AI feature in front of users, a team should be able to answer these questions:

What are the top user tasks, and which eval cases represent them?
What failures are unacceptable, and how are they detected before release?
Which sources, tools, and permissions can the model access?
What does the system do when retrieval is empty, sources conflict, tools fail, or confidence is low?
Who reviews risky outputs, and what information do they see?
How quickly can the team roll back a prompt, model, retrieval index, or tool permission change?

This checklist matters because AI quality is distributed across the full stack. The model, prompt, retrieval index, tools, UI, logs, evals, and review process all decide whether the product is trustworthy.

Sources

Working with evals — OpenAI API docs
OpenAI Evals — OpenAI
AI Risk Management Framework — NIST
AI RMF Generative AI Profile — NIST
OWASP Top 10 for LLM Applications — OWASP Foundation
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Zheng et al.
RAGTruth: A Hallucination Corpus for Retrieval-Augmented Language Models — Niu et al.

AI 101 Guide

The Transformer Core

The Core Breakthrough: Self-Attention

Key Concept: The Query-Key-Value (QKV) Analogy

Multi-Head Attention

Positional Encoding

Encoder vs. Decoder Architectures

Why This Still Matters in Products

Sources

LLM Training & Alignment

Phase 1: Pre-training (Creating the "Base Model")

Phase 2: Supervised Fine-Tuning (Creating the "Instruct Model")

Phase 3: Alignment (RLHF and DPO)

1. Reinforcement Learning from Human Feedback (RLHF)

2. Direct Preference Optimization (DPO)

Key Concept: Kaplan vs. Chinchilla Scaling Laws

Choosing the Right Adaptation Method

Sources

RAG & Context Windows

The Limitation of Parametric Memory

Retrieval-Augmented Generation (RAG)

How the RAG Pipeline Works

The Evolution of Context Windows

1. FlashAttention

2. Rotary Position Embeddings (RoPE)

The "Needle in a Haystack" Test

RAG vs. Long Context: How to Choose

Sources

Scaling Efficiency: MoE & Quantization

Mixture of Experts (MoE)

Sparse Routing in Action

The Challenges of MoE

Quantization

The Intuition Behind Quantization

Modern Quantization Formats

Serving Efficiency in Real Systems

Sources

Diffusion & Generative Media

The Diffusion Paradigm

1. The Forward Process (Destroying Information)

2. The Reverse Process (Creating Information)

Key Concept: Latent Diffusion

Classifier-Free Guidance (CFG)

The Shift to Diffusion Transformers (DiT)

What Matters in Media Products

Sources

Agentic AI & Reasoning

Tool Use & Function Calling

Reasoning Loops: ReAct and Reflection

1. The ReAct (Reason + Act) Loop

2. Reflection and Self-Correction

What Makes an Agent Safe Enough to Ship

System 1 vs. System 2 Thinking in AI

Sources

Future Frontiers & Physical AI

Native Multimodality

The "Data Wall" and Synthetic Data

The Promise and Danger of Synthetic Data

Robotics and Physical Grounding

The Next Paradigm: Test-Time Compute

What Is Relevant Now

Sources

Evaluation, Safety & Production AI

Eval-Driven Development

The Production Eval Loop

Groundedness and Source Verification

Prompt Injection and Tool Safety

Observability for AI Apps

Human Review and Launch Gates

A Practical Launch Checklist

Sources