Printable edition
AI 101 Guide
A complete single-page edition assembled from the canonical chapters. Use browser print, save as PDF, or read it as one continuous Kindle-friendly document.
The Transformer Core
Before 2017, natural language processing (NLP) models read text like humans do: one word at a time, from left to right. These models, known as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, kept a running "mental state" that updated with each new word. While intuitive, this approach had a catastrophic bottleneck: it could not be parallelized. Because you needed the state of word n-1 to compute the state of word n, GPUs could not process entire blocks of text simultaneously.
Then came the landmark paper, "Attention Is All You Need", which introduced the Transformer architecture. The Transformer discarded recurrence entirely, opting to process all words in a sequence at the exact same time. To do this, it relied on a mathematical shortcut called Self-Attention.
The Core Breakthrough: Self-Attention
Self-attention allows a model to look at a specific word and determine which other words in the sentence are most relevant to it, regardless of how far apart they are. Consider this example:
"The bank robber ran to the river bank because he saw the police."
How does the model know that the first "bank" refers to a financial institution, while the second "bank" refers to the edge of a river? In an RNN, by the time the model reached the end of the sentence, the context of "bank robber" might have faded. In a Transformer, the first "bank" is compared to all other words in the sentence concurrently, finding a strong mathematical connection to "robber." The second "bank" finds a connection to "river." The model dynamically contextualizes each word based on its surroundings.
Key Concept: The Query-Key-Value (QKV) Analogy
To compute attention, the Transformer assigns three vectors to every single word:
- Query (Q): What the word is looking for (e.g., "I am a pronoun, where is my noun?").
- Key (K): What the word represents or offers (e.g., "I am a noun, I describe a person").
- Value (V): The actual content of the word (the semantic meaning).
The model multiplies the Query vector of a word by the Key vectors of all other words. The higher the score, the more attention that word gets. The final representation is a weighted sum of the Value vectors based on these attention scores.
Multi-Head Attention
Instead of doing this attention calculation once, the Transformer does it multiple times in parallel. Each calculation is called an Attention Head. This is known as Multi-Head Attention.
By using multiple heads, the model can look at different aspects of the text at the same time. For example:
- Head 1 might focus on grammatical relationships (finding the verb for each noun).
- Head 2 might focus on coreference resolution (matching "he" or "it" to the correct entity).
- Head 3 might focus on physical proximity (local descriptors like adjectives).
Combined, these heads build a highly dimensional and accurate understanding of language.
Positional Encoding
Since a Transformer processes all words simultaneously, it has no natural understanding of order. To a pure attention mechanism, "The dog bit the man" and "The man bit the dog" look identical because the words are the same.
To fix this, the Transformer uses Positional Encodings—a set of mathematical values added to each word's embedding that act as a coordinate. These coordinates tell the model exactly where each word sits in the sentence, allowing it to preserve the structural grammar of the text.
Encoder vs. Decoder Architectures
The original Transformer consisted of two halves: an Encoder (which reads and understands text) and a Decoder (which writes new text). Depending on the task, modern advancements have split these into three variants:
- Encoder-Only Models (e.g., BERT): Excellent for understanding, classifying, and extracting information from text. They look in both directions (left and right) simultaneously.
- Decoder-Only Models (e.g., GPT, LLaMA): Excellent for generating text. They are autoregressive, meaning they generate one word at a time, looking only at past words (left-to-right masking) to predict the next word.
- Encoder-Decoder Models (e.g., T5, BART): Often used for translation or summarization, where an input sequence is processed entirely, and a brand new output sequence is generated.
Many frontier text-first Large Language Models (LLMs) use decoder-only architectures, optimized for generating text by predicting the next token with massive efficiency.
Sources
- Attention Is All You Need — Vaswani et al.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — Devlin et al.
LLM Training & Alignment
Creating a modern AI assistant like ChatGPT or Gemini is not a single-step process. It requires taking raw, chaotic web data and refining it through multiple training stages. The journey from raw math to a helpful assistant is divided into three major milestones: Pre-training, Supervised Fine-Tuning (SFT), and Alignment.
Phase 1: Pre-training (Creating the "Base Model")
The foundation of any LLM is the pre-trained base model. During this stage, the model is fed petabytes of raw text from books, articles, code repositories, and web pages. The training objective is simple: predict the next word (token) in a sentence.
For example, given the text:
"The cat sat on the..."
The model calculates probability distributions over its entire vocabulary and predicts "mat" (or "sofa", "bed", etc.). By repeating this trillions of times across vast supercomputer clusters, the model builds a rich internal map of language, grammar, reasoning patterns, and encyclopedic facts. However, a base model is not an assistant; it is a text completer. If you ask a base model "Write a recipe for chocolate cake," it might reply with a second question: "And write a recipe for apple pie," because it is mimicking lists of recipes found on the internet.
Phase 2: Supervised Fine-Tuning (Creating the "Instruct Model")
To turn a text completer into an interactive assistant, engineers perform Supervised Fine-Tuning (SFT). In this phase, the base model is trained on a curated dataset of high-quality conversational prompts and responses, written by human experts.
A typical training sample looks like:
Prompt: Explain photosynthesis in one sentence.
Response: Photosynthesis is the process by which plants use sunlight, water, and carbon dioxide to create oxygen and energy in the form of sugar.
By training on tens of thousands of these conversational examples, the model learns the "instruct" behavior: it recognizes when it is being asked a question and understands that it must respond with helpful answers, adopting a conversational and polite tone.
Phase 3: Alignment (RLHF and DPO)
Even after SFT, a model can still produce toxic, biased, incorrect, or unhelpful output. SFT only teaches the model to imitate the training dialogues. To ensure the model is helpful, honest, and harmless, engineers "align" it with human preferences using two primary techniques:
1. Reinforcement Learning from Human Feedback (RLHF)
RLHF works by using a grading system. The process involves three steps:
- Generate Options: The model generates multiple candidate answers to a prompt.
- Train a Reward Model: Human evaluators rate these candidate answers from best to worst. A separate neural network—the Reward Model—is trained to predict what score a human would give to any given response.
- Reinforce: Using an RL algorithm (typically PPO), the LLM's parameters are updated to maximize the score predicted by the Reward Model. Responses that humans like are rewarded, and disliked responses are penalized.
2. Direct Preference Optimization (DPO)
While RLHF is highly effective, it is notoriously unstable, expensive, and complex to train because it requires maintaining multiple models simultaneously (the LLM, the Reward Model, and reference models).
In 2023, researchers introduced Direct Preference Optimization (DPO). DPO bypasses the reward model entirely. It mathematically proves that you can optimize the LLM policy directly using a dataset of paired choices: a prompt, a preferred (chosen) response, and a disliked (rejected) response. DPO adjusts the weights so that the probability of the chosen response increases relative to the rejected response, creating a much simpler, faster, and more stable alignment loop.
Key Concept: Kaplan vs. Chinchilla Scaling Laws
How do we make models smarter? For a long time, the industry followed Kaplan's scaling laws (2020), which suggested that parameter size was the single most important factor—urging engineers to build larger models, even if they couldn't afford to train them on more data.
In 2022, DeepMind published the Chinchilla scaling laws. They proved that for optimal performance, parameter count and training data (tokens) should scale in equal proportion. Most models were actually under-trained on too little data. This shifted the industry toward training smaller, highly efficient models (like LLaMA or Mistral) for much longer on high-quality tokens, making them far cheaper to run on standard hardware.
Sources
- Scaling Laws for Neural Language Models — Kaplan et al.
- Training Compute-Optimal Large Language Models — Hoffmann et al.
- Training Language Models to Follow Instructions with Human Feedback — Ouyang et al.
- Direct Preference Optimization — Rafailov et al.
RAG & Context Windows
An AI model has two kinds of memory. The first is Parametric Memory—information baked directly into the model's weights during training. The second is Working Memory—the space available in the immediate input prompt, known as the Context Window. To build reliable systems that do not hallucinate, engineers use these two memories in tandem through RAG and ultra-long context architectures.
The Limitation of Parametric Memory
Relying purely on what a model has memorized has three massive drawbacks:
- Knowledge Cutoff: The model only knows what existed before its training run finished.
- Hallucinations: When asked about obscure facts, models often confidently guess, creating plausible-sounding falsehoods.
- No Access to Private Data: Models cannot read your local PDFs, company emails, or secure databases.
Retrieval-Augmented Generation (RAG)
RAG solves this by turning the model into an open-book test taker. Instead of answering from memory, the system searches an external database for the answer, pastes the relevant documents directly into the context window, and asks the model to read them to generate the answer.
How the RAG Pipeline Works
- Chunking: Large documents (like a 100-page manual) are broken down into small, digestible paragraphs (chunks).
- Embeddings: An Embedding Model converts each text chunk into a string of numbers (a vector) representing its semantic meaning.
- Vector Database: These vectors are stored in a specialized database (like Pinecone, Chroma, or pgvector).
- Retrieval (Semantic Search): When a user asks a question, the system converts their question into a vector and finds the text chunks in the database that are mathematically closest to the question's meaning.
- Augmentation & Generation: The system fetches those text chunks, inserts them into a prompt alongside the user's question, and sends it to the LLM: "Here is the context: [chunks]. Answer this question based on that context: [query]."
The Evolution of Context Windows
If RAG is so powerful, why not just feed the entire database directly into the model? Historically, this was impossible because of the way attention works.
The memory and computation cost of standard Self-Attention scales quadratically ($O(N^2)$) with the length of the input. If you double the length of your input, it takes four times more compute and memory to process. Early models were capped at a context window of just 2,048 tokens (roughly 1,500 words).
Recent architectural and serving breakthroughs have broken this barrier. By 2026, frontier systems commonly offer hundreds of thousands to millions of tokens of working memory; OpenAI lists a 1 million token API context window for GPT-5.5, while Google's Gemini line has pushed long-context reasoning into mainstream multimodal products. The main pillars of this scaling are:
1. FlashAttention
Introduced by Tri Dao, FlashAttention is a software-level optimization. Rather than changing the math of attention, it changes how the GPU handles memory. Standard attention writes massive intermediate tables back and forth between slow GPU High Bandwidth Memory (HBM) and fast on-chip SRAM. FlashAttention computes attention in small blocks, keeping data in the fast SRAM cache as much as possible. This reduces memory traffic by up to 20x, allowing context windows to scale dramatically without running out of GPU memory.
2. Rotary Position Embeddings (RoPE)
Older absolute positional systems could not handle context lengths longer than what they were trained on. RoPE represents positions by rotating the word vectors in a multi-dimensional mathematical space. Because rotation is relative, the model can understand the distance between words even if the total text length is far longer than the training parameters, allowing context window sizes to be scaled up post-training with minimal fine-tuning.
The "Needle in a Haystack" Test
Just because a model can accept a million tokens doesn't mean it is actually reading them. To evaluate long-context retrieval, researchers use the Needle in a Haystack (NIAH) test.
A random, unrelated fact (the "needle") is hidden somewhere inside a massive text dump of documents (the "haystack"). The model is then asked a question that can only be answered using that specific fact. Modern models must achieve near 100% accuracy, finding the needle regardless of whether it is hidden at the beginning, middle, or end of the document stack.
However, long context is not a free replacement for retrieval. Million-token prompts can still be slower, more expensive, and harder to audit than a well-built RAG pipeline. In production systems, engineers often combine both: use retrieval to select the most relevant evidence, then use a long-context model when the task requires cross-document synthesis, codebase-wide reasoning, or comparison across many artifacts.
Sources
- Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — Lewis et al.
- FlashAttention — Dao et al.
- RoFormer: Enhanced Transformer with Rotary Position Embedding — Su et al.
- Gemini 1.5 and long-context model capabilities — Google
- Introducing GPT-5.5 — OpenAI
- Gemini 3.5: frontier intelligence with action — Google
Scaling Efficiency: MoE & Quantization
As AI models grow larger, running them becomes incredibly expensive. A dense 175-billion-parameter model requires multiple high-end enterprise GPUs running concurrently just to output a single word. To make these models practical for commercial use and deployable on smaller hardware, engineers rely on two massive efficiency breakthroughs: Mixture of Experts (MoE) and Quantization.
Mixture of Experts (MoE)
In a standard "Dense" model, every single parameter (the neural connections) is activated for every single word processed. This is highly inefficient; a model doesn't need to invoke its entire mathematical knowledge base to process a simple punctuation mark or pronoun.
An MoE architecture turns a dense model into a "Sparse" model by breaking it up into specialized compartments called Experts (typically inside the Feed-Forward Network layers). Instead of passing a word through all pathways, a dynamic Gating Network (Router) decides which experts should handle which word.
Sparse Routing in Action
Imagine a model with 8 distinct "Experts." When a token is processed:
- If the token is a line of Python code, the Router sends it to Expert 3 (Code specialist) and Expert 5 (Logic specialist).
- If the token is a word in French, the Router sends it to Expert 1 (Translation specialist).
Typically, the Router selects only the Top-2 Experts for each token. If a model has a total of 8x 7B experts (56B total parameters), it only activates roughly 12B parameters per token. This gives the model the vast knowledge capacity of a 56B model, but with the fast generation speed and compute cost of a much smaller 12B model.
The Challenges of MoE
MoE is not a free lunch. It introduces several hard engineering hurdles:
- RAM Overhead: Although only 12B parameters are active at any millisecond, the entire 56B parameter model must still be loaded into the GPU's memory (VRAM). This means MoE requires significantly more memory than dense models of equivalent speed.
- Routing Collapse: During early training, the router might favor one expert, making it smarter, which causes the router to send even more traffic to it. Engineers must write custom algorithms to force load-balancing so all experts are trained evenly.
Quantization
Neural networks represent their learned weights as high-precision decimals called floating-point numbers. During training, these are typically represented in 16-bit precision (FP16 or BF16).
Storing weights in 16-bit precision means every single parameter requires 2 bytes of GPU memory. A 70-billion-parameter model requires at least 140 gigabytes of VRAM just to load, which exceeds the capacity of almost all consumer GPUs.
Quantization is the process of compressing these weights by reducing their numerical precision—mapping them to smaller formats like 8-bit integers (INT8), 4-bit integers (INT4), or even custom formats like FP4.
The Intuition Behind Quantization
Think of quantization like reducing the color depth of a digital photo. If you convert a photo from 24-bit true color to an 8-bit color palette, the file size shrinks by 66%. The image looks slightly less smooth, but the shapes, objects, and overall context are still perfectly recognizable.
Similarly, when we quantize a model from 16-bit to 4-bit, we decrease its size by 75%. A 70B model that once required 140GB of VRAM can now fit into roughly 35GB of VRAM. Remarkably, due to the high mathematical redundancy in neural networks, this massive compression results in only a tiny degradation in reasoning capability.
Modern Quantization Formats
Several standard file formats are used to run these compressed models:
- GGUF (formerly GGML): Optimized specifically for CPU execution, allowing users to run large models on consumer laptops (like Apple Silicon Macbooks) by leveraging system RAM instead of expensive GPU VRAM.
- GPTQ / AWQ: Formats optimized for GPU-accelerated quantized inference, ensuring that compressed models generate text at blisteringly fast speeds on standard desktop graphic cards.
The 2025-2026 open-weight wave made this efficiency story concrete. OpenAI's gpt-oss models, for example, use Transformer MoE architectures where the 117B-parameter model activates only 5.1B parameters per token and the 21B-parameter model activates 3.6B. That design lets model builders expose large total capacity while keeping the per-token compute closer to a much smaller dense model.
Sources
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity — Fedus et al.
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — Frantar et al.
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — Lin et al.
- GGUF format documentation — ggml
- Introducing gpt-oss — OpenAI
Diffusion & Generative Media
Generative AI for images and videos has undergone a massive transformation. Early image generators, called GANs (Generative Adversarial Networks), were notoriously difficult to train, often failing to produce coherent pictures. Today, almost all modern image and video generators (Stable Diffusion, Midjourney, Sora, Flux) rely on a mathematical concept called Diffusion.
The Diffusion Paradigm
Instead of trying to draw an image from scratch, a diffusion model is trained to do one thing: remove static noise. The process is split into two phases: the forward process and the reverse process.
1. The Forward Process (Destroying Information)
We take a clean photograph (say, of a golden retriever) and add a tiny layer of random mathematical noise. We repeat this step-by-step, perhaps 1,000 times, until the original dog is completely obliterated, leaving nothing but a block of pure gray static. This process requires no neural network; it is pure math.
2. The Reverse Process (Creating Information)
This is where the neural network lives. We show the model a noisy image and ask it: "Can you predict exactly how much noise was added in this step?"
By training the model on millions of pairs of clean and noisy images, it learns to recognize subtle structures within noise. When we want to generate a new image, we feed the model a block of pure, random noise and a text prompt (e.g., "A golden retriever playing in the grass"). The model subtracts a sliver of estimated noise. We repeat this subtraction loop 20 to 50 times. Bit by bit, structures appear, and a completely unique, high-resolution image emerges.
Key Concept: Latent Diffusion
Early diffusion models operated in "Pixel Space." Generating a 1024x1024 pixel image meant calculating noise values for over a million pixels at every step. This made early models incredibly slow and memory-intensive.
The breakthrough was Latent Diffusion (popularized by Stable Diffusion). It uses a Variational Autoencoder (VAE) to compress the image into a highly dense representation called "latent space" (shrinking a 512x512 image down to a 64x64 grid). The diffusion model does all its heavy lifting in this low-resolution space, and the VAE decodes the final latents back into pixels at the very end. This saved 90%+ of the compute, making image generation run on consumer laptops.
Classifier-Free Guidance (CFG)
How does the model make sure the image it generates actually matches your prompt, instead of wandering off on its own? This is controlled by Classifier-Free Guidance (CFG).
During training, the model is occasionally trained without text prompts (unconditioned). During generation, the model predicts two things: what the noise removal should look like with the prompt, and what it should look like without it. The CFG scale decides how much weight to give to the difference.
- Low CFG (1 to 3): Gives the model creative freedom. The image will be artistic but might ignore parts of your prompt.
- Medium CFG (7 to 9): The sweet spot for high-quality, prompt-adhering images.
- High CFG (15+): Forces strict prompt adherence, though it can make the image look oversaturated and digitally artificial.
The Shift to Diffusion Transformers (DiT)
Traditional diffusion models used a convolutional network backbone called a U-Net to predict noise. However, U-Nets struggled to scale efficiently with massive datasets and compute budgets.
In 2023, researchers introduced the Diffusion Transformer (DiT). DiT replaces the U-Net with a standard Transformer backbone. By dividing the latent image into patches (similar to how an LLM divides text into tokens), DiT models can scale predictably: adding more parameters and compute directly correlates with better image and video fidelity. This architecture underpins the latest state-of-the-art models like OpenAI's Sora, Stable Diffusion 3, and Flux.
Sources
- Denoising Diffusion Probabilistic Models — Ho et al.
- High-Resolution Image Synthesis with Latent Diffusion Models — Rombach et al.
- Classifier-Free Diffusion Guidance — Ho and Salimans
- Scalable Diffusion Models with Transformers — Peebles and Xie
Agentic AI & Reasoning
For the first few years of the LLM boom, AI models were treated as passive chatbots: you write a prompt, and the model instantly outputs a response. Today, the frontier has shifted toward Agentic AI. Instead of answering statically, agentic systems act as autonomous software entities that can plan, use external tools, inspect their own output, and run in loops to solve multi-step problems.
Tool Use & Function Calling
LLMs are notoriously bad at precise math (like multiplying two 8-digit numbers) and cannot fetch live data or interact with the physical world because they are just word-prediction engines.
Tool Use (or Function Calling) overcomes this limitation by giving models hands. The model is provided with a list of available tools, described in plain text. For example:
Available Tool: calculate_weather(location, date)
- Returns the temperature forecast for a location.
If the user asks: "Should I wear a coat in Chicago tomorrow?", the LLM recognizes it cannot answer from memory. Instead of guessing, it outputs a structured instruction:
{
"call": "calculate_weather",
"arguments": { "location": "Chicago", "date": "tomorrow" }
}
The host application intercepts this JSON, runs the actual weather API, receives the result (e.g., "Chicago: 41°F, Rain"), and appends it to the model's chat history. The LLM reads the result and finishes its response: "Yes, you should wear a coat. Chicago will be 41°F and raining tomorrow."
Reasoning Loops: ReAct and Reflection
To solve complex tasks, agents use structured loops rather than generating answers in a single pass.
1. The ReAct (Reason + Act) Loop
ReAct forces the model to document its thinking before taking actions. The loop proceeds as follows:
- Thought: The model explains its plan (e.g., "I need to find the population of France, then multiply it by 0.12").
- Action: The model calls a search engine or calculator tool.
- Observation: The model reads the tool's output and updates its plan, looping back to Thought until the task is complete.
2. Reflection and Self-Correction
If a model writes a block of code, it may contain a bug. A reflection agent doesn't send the code to the user immediately. Instead, it runs the code in an isolated environment, catches any error logs, feeds those errors back to itself, and rewrites the code to fix the bug. This cyclic feedback loop dramatically boosts task success rates.
System 1 vs. System 2 Thinking in AI
Cognitive psychologist Daniel Kahneman famously split human thinking into two modes:
- System 1 (Fast): Fast, intuitive, automatic actions (e.g., answering "2+2=?", reading a familiar road sign).
- System 2 (Slow): Slow, deliberate, logical reasoning (e.g., solving "17 × 24", filling out a tax form).
Standard LLMs operate mostly like System 1. They output the next token immediately without much opportunity to plan, test, or revise. If they start a sentence poorly, they cannot truly rewind the generation path.
Modern System 2 Reasoning Models spend extra inference-time compute before producing a final answer. OpenAI's GPT-5.5, Google's Gemini 3.5 Flash, DeepSeek-R1, and related reasoning systems all point toward the same design shift: models are being trained and served to plan, use tools, check intermediate work, and keep going across longer workflows. Some expose a visible plan or controllable "thinking" effort; others keep the reasoning internal while returning a concise answer.
Sources
- ReAct: Synergizing Reasoning and Acting in Language Models — Yao et al.
- Toolformer: Language Models Can Teach Themselves to Use Tools — Schick et al.
- Reflexion: Language Agents with Verbal Reinforcement Learning — Shinn et al.
- Learning to Reason with LLMs — OpenAI
- Introducing GPT-5.5 — OpenAI
- Gemini 3.5: frontier intelligence with action — Google
- DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning — Nature
Future Frontiers & Physical AI
We are entering a new era of artificial intelligence. The frontier is no longer just about making models bigger. Researchers are expanding models into the physical world, training them to process multiple senses natively, and shifting from one-shot answers toward agents that can take action over time.
Native Multimodality
Early multimodal systems were "Stitched" together. For example, to let an AI "see" an image, engineers would run an image-captioning model to generate a text description, and then feed that text to the LLM. This was incredibly lossy; a text caption cannot capture the precise spatial layout of a room, the emotional expression on a face, or the specific pitch of a sound.
Modern state-of-the-art models (like Gemini and GPT-5.5) are Natively Multimodal. They are built with a unified architecture or tightly integrated model system. Text, pixels, audio waveforms, video frames, and tool outputs are converted into a shared mathematical language (embeddings) and routed through models that can reason across them.
This allows the model to reason across modalities simultaneously. A native multimodal model can watch a video, listen to the speaker's sarcasm, read the slides in the background, and output a unified analysis in real time, catching nuances that stitched systems miss entirely.
The "Data Wall" and Synthetic Data
For a decade, AI progress was fueled by feeding models more data. However, the industry is hitting a Data Wall: LLMs have already consumed almost all high-quality, publicly available human-written text on the internet.
To continue training, researchers are turning to Synthetic Data—data generated by AI models to train other AI models.
The Promise and Danger of Synthetic Data
If models train on unverified synthetic data, they risk Model Collapse—a phenomenon where errors, biases, and weird linguistic quirks compound over generations, causing the model to become increasingly stupid and disconnected from reality.
To prevent this, engineers use Verified Synthesis: using external environments to validate the AI's data. For example:
- An AI generates code, which is then run in a compiler to verify it works. Only working code is used for training.
- An AI solves a math problem. The solution is validated using formal math verifiers.
- An AI reasons about physical properties. The scenario is run through a physics engine to make sure it follows real-world laws.
Robotics and Physical Grounding
For AI to truly understand the world, it must interact with it. By combining multimodal LLMs with robotic control systems, researchers have developed Vision-Language-Action (VLA) models such as Google's RT-2 and Gemini Robotics.
A VLA model doesn't just output text; it outputs physical actions for a robot's joints and grippers. When you tell a VLA-enabled robot arm: "Pick up the yellow banana and put it in the basket," the model processes the camera feed (pixels), matches the words to the objects, calculates the spatial path, and controls the robot's motors directly. The LLM acts as the robot's planning layer, giving it common-sense reasoning and adaptability to new environments without custom programming for every object.
The Next Paradigm: Test-Time Compute
Pre-training scaling laws (adding more parameters and GPUs during training) are no longer the only axis of progress. The newer vector is Test-Time Compute (scaling at inference time).
Instead of forcing a model to answer within a fraction of a second, test-time compute lets the model spend extra compute planning, checking, searching, or coordinating tools. This is why frontier model releases increasingly emphasize agentic coding, computer use, document work, and scientific workflows rather than only chat benchmark scores. The practical question is becoming: how much thought should the system buy for this task?
Sources
- RT-2: New model translates vision and language into action — Google DeepMind
- Gemini Robotics: Bringing AI into the Physical World — Google DeepMind
- AI models collapse when trained on recursively generated data — Nature
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters — Snell et al.
- Learning to Reason with LLMs — OpenAI
- Introducing GPT-5.5 — OpenAI
Evaluation, Safety & Production AI
Building an AI demo is mostly about capability: can the model answer, retrieve, reason, or act? Shipping an AI product is about repeatability: can the system keep doing the right thing when prompts change, documents drift, users attack it, models upgrade, and costs spike?
Production AI teams treat the model as one component in a larger system. The surrounding product needs evals, guardrails, observability, rollback plans, and human review for cases where model confidence is not enough.
Eval-Driven Development
An eval is a repeatable test for behavior that matters. Instead of asking "does the answer look good?", an eval asks a narrower question: did the assistant cite the right policy section, refuse the unsafe request, preserve the JSON schema, choose the right tool, or solve the task within the latency budget?
Useful eval suites mix several test types:
- Golden examples: known prompts with expected answers, labels, or rubrics.
- Regression cases: failures from production that must not come back after a prompt, retrieval, or model change.
- Adversarial cases: inputs designed to trigger jailbreaks, prompt injection, data leakage, or unsafe tool calls.
- Performance cases: examples that measure cost, latency, refusal rate, and answer length, not just correctness.
The important habit is to run evals before changing models, prompts, retrieval settings, or tool permissions. In AI systems, a "minor" prompt edit can behave like a code change across thousands of hidden branches.
The Production Eval Loop
The loop is simple: collect examples, run evals, block risky releases, monitor live behavior, review failures, and add those failures back to the suite.
Groundedness and Source Verification
For RAG systems, the most common failure is not a totally random answer. It is an answer that sounds plausible but is only partly supported by the retrieved evidence. A groundedness check compares each important claim against the source passages the system provided.
Good groundedness evaluation asks:
- Does every factual claim have supporting evidence in the retrieved context?
- Did the answer cite the specific source that supports the claim?
- Did the model ignore conflicting evidence or overstate uncertainty?
- Should the system answer, ask a clarifying question, retrieve again, or refuse?
This is why citations are not just decoration. A citation should be a checkable pointer to the evidence that justifies the answer. If the pointer is wrong, the system is teaching users to trust the wrong thing.
Prompt Injection and Tool Safety
Prompt injection happens when untrusted text tries to override the system's instructions. In a RAG app, the attack might live inside a PDF. In an agent, it might appear on a web page the agent browses. The dangerous pattern is the same: the model reads attacker-controlled text and treats it like an instruction from the product owner.
Tool use makes this risk sharper. A model that can only write text can mislead a user; a model with tools can email customers, change records, run code, or expose private data. Production systems reduce that risk with least-privilege tool scopes, allowlists, confirmation steps, output validation, and audit logs.
A strong rule of thumb: model instructions are not access control. The host application must enforce permissions outside the model.
Observability for AI Apps
Traditional logs often show an HTTP request, a status code, and a response time. AI observability needs more: prompt versions, retrieved chunks, model names, tool calls, token usage, evaluator scores, refusals, user feedback, and traces across the full agent loop.
Without traces, teams cannot answer basic production questions: Did retrieval fail? Did the model ignore good evidence? Did a tool return bad data? Did a prompt change increase cost? Did a new model improve benchmark scores but hurt real support tickets?
Human Review and Launch Gates
Human review is not a failure of automation; it is a control surface. High-impact workflows often need human approval for irreversible actions, sensitive domains, edge cases, and low-confidence answers. The product should make review efficient by showing the prompt, evidence, model answer, tool actions, and eval signals in one place.
Before launch, teams usually define gates: minimum eval scores, maximum hallucination rate, maximum latency, acceptable cost per task, security test coverage, and rollback criteria. After launch, sampled production traces become new tests so the system gets harder to break over time.
Sources
- Working with evals — OpenAI API docs
- OpenAI Evals — OpenAI
- AI Risk Management Framework — NIST
- AI RMF Generative AI Profile — NIST
- OWASP Top 10 for LLM Applications — OWASP Foundation
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Zheng et al.
- RAGTruth: A Hallucination Corpus for Retrieval-Augmented Language Models — Niu et al.