Chapter 8 • 10 min read • Last reviewed: June 2026

Evaluation, Safety & Production AI

Building an AI demo is mostly about capability: can the model answer, retrieve, reason, or act? Shipping an AI product is about repeatability: can the system keep doing the right thing when prompts change, documents drift, users attack it, models upgrade, and costs spike?

Production AI teams treat the model as one component in a larger system. The surrounding product needs evals, guardrails, observability, rollback plans, and human review for cases where model confidence is not enough.

Eval-Driven Development

An eval is a repeatable test for behavior that matters. Instead of asking "does the answer look good?", an eval asks a narrower question: did the assistant cite the right policy section, refuse the unsafe request, preserve the JSON schema, choose the right tool, or solve the task within the latency budget?

Useful eval suites mix several test types:

Golden examples: known prompts with expected answers, labels, or rubrics.
Regression cases: failures from production that must not come back after a prompt, retrieval, or model change.
Adversarial cases: inputs designed to trigger jailbreaks, prompt injection, data leakage, or unsafe tool calls.
Performance cases: examples that measure cost, latency, refusal rate, and answer length, not just correctness.

The big-deal habit is to run evals before changing models, prompts, retrieval settings, or tool permissions. In AI systems, a "minor" prompt edit can behave like a code change across thousands of hidden branches.

The Production Eval Loop

The loop is simple: collect examples, run evals, block risky releases, monitor live behavior, review failures, and add those failures back to the suite.

Groundedness and Source Verification

For RAG systems, the most common failure is not a totally random answer. It is an answer that sounds plausible but is only partly supported by the retrieved evidence. A groundedness check compares each big-deal claim against the source passages the system provided.

Good groundedness evaluation asks:

Does every factual claim have supporting evidence in the retrieved context?
Did the answer cite the specific source that supports the claim?
Did the model ignore conflicting evidence or overstate uncertainty?
Should the system answer, ask a clarifying question, retrieve again, or refuse?

This is why citations are not just decoration. A citation should be a checkable pointer to the evidence that justifies the answer. If the pointer is wrong, the system is teaching users to trust the wrong thing.

Prompt Injection and Tool Safety

Prompt injection happens when untrusted text tries to override the system's instructions. In a RAG app, the attack might live inside a PDF. In an agent, it might appear on a web page the agent browses. The dangerous pattern is the same: the model reads attacker-controlled text and treats it like an instruction from the product owner.

Tool use makes this risk sharper. A model that can only write text can mislead a user; a model with tools can email customers, change records, run code, or expose private data. Production systems reduce that risk with least-privilege tool scopes, allowlists, confirmation steps, output validation, and audit logs.

A strong rule of thumb: model instructions are not access control. The host application must enforce permissions outside the model.

Observability for AI Apps

Traditional logs often show an HTTP request, a status code, and a response time. AI observability needs more: prompt versions, retrieved chunks, model names, tool calls, token usage, evaluator scores, refusals, user feedback, and traces across the full agent loop.

Without traces, teams cannot answer basic production questions: Did retrieval fail? Did the model ignore good evidence? Did a tool return bad data? Did a prompt change increase cost? Did a new model improve benchmark scores but hurt real support tickets?

Human Review and Launch Gates

Human review is not a failure of automation; it is a control surface. High-impact workflows often need human approval for irreversible actions, sensitive domains, edge cases, and low-confidence answers. The product should make review efficient by showing the prompt, evidence, model answer, tool actions, and eval signals in one place.

Before launch, teams usually define gates: minimum eval scores, maximum hallucination rate, maximum latency, acceptable cost per task, security test coverage, and rollback criteria. After launch, sampled production traces become new tests so the system gets harder to break over time.

A Practical Launch Checklist

Before putting an AI feature in front of users, a team should be able to answer these questions:

What are the top user tasks, and which eval cases represent them?
What failures are unacceptable, and how are they detected before release?
Which sources, tools, and permissions can the model access?
What does the system do when retrieval is empty, sources conflict, tools fail, or confidence is low?
Who reviews risky outputs, and what information do they see?
How quickly can the team roll back a prompt, model, retrieval index, or tool permission change?

This checklist matters because AI quality is distributed across the full stack. The model, prompt, retrieval index, tools, UI, logs, evals, and review process all decide whether the product is trustworthy.

Sources

Working with evals — OpenAI API docs
OpenAI Evals — OpenAI
AI Risk Management Framework — NIST
AI RMF Generative AI Profile — NIST
OWASP Top 10 for LLM Applications — OWASP Foundation
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Zheng et al.
RAGTruth: A Hallucination Corpus for Retrieval-Augmented Language Models — Niu et al.