Chapter 3 • 9 min read • Last reviewed: June 2026

RAG & Context Windows

An AI model has two kinds of memory. The first is Parametric Memory: information baked directly into the model's weights during training. The second is Working Memory: the space available in the immediate input prompt, known as the Context Window. To build reliable systems that hallucinate less, engineers use these two memories in tandem through RAG and ultra-long context architectures.

The Limitations of Parametric Memory

Relying purely on what a model has memorized has three major drawbacks:

Knowledge Cutoff: The model only knows what existed before its training run finished.
Hallucinations: When asked about obscure facts, models often confidently guess, creating plausible-sounding falsehoods.
No Access to Private Data: Models cannot read your local PDFs, company emails, or secure databases.

Retrieval-Augmented Generation (RAG)

RAG addresses these drawbacks by turning the model into an open-book test taker. Instead of answering from memory alone, the system searches an external data store for relevant documents, inserts them into the context window, and asks the model to answer from what it has just read.

How the RAG Pipeline Works

Chunking: Large documents (like a 100-page manual) are broken down into small, digestible paragraphs (chunks).
Embeddings: An Embedding Model converts each text chunk into a list of numbers (a vector) representing its semantic meaning.
Vector Database: These vectors are stored in a specialized database (like Pinecone, Chroma, or pgvector).
Retrieval (Semantic Search): When a user asks a question, the system converts the question into a vector with the same embedding model and finds the chunks whose vectors are closest to it, typically measured by cosine similarity or a related distance.
Augmentation & Generation: The system fetches those text chunks, inserts them into a prompt alongside the user's question, and sends it to the LLM: "Here is the context: [chunks]. Answer this question based on that context: [query]."
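
The five steps above can be sketched end to end in a few dozen lines. This is a minimal illustration under loud assumptions, not a production design: `embed` is a toy bag-of-words counter standing in for a real embedding model (so it matches words, not meaning), `VectorStore` is a hypothetical in-memory stand-in for a vector database, and the final LLM call is left out.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """Hypothetical in-memory stand-in for a vector database."""

    def __init__(self) -> None:
        self.items: list[tuple[Counter, str]] = []

    def add(self, chunks: list[str]) -> None:
        for c in chunks:
            self.items.append((embed(c), c))

    def search(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    """Augmentation step: place retrieved chunks next to the user's question."""
    context = "\n\n".join(chunks)
    return f"Here is the context:\n{context}\n\nAnswer this question based on that context: {query}"


store = VectorStore()
store.add([
    "To reset the router, hold the reset button for ten seconds.",
    "The warranty covers hardware defects for two years.",
    "Firmware updates are installed from the admin panel.",
])
question = "How long is the warranty?"
top = store.search(question, k=1)
prompt = build_prompt(question, top)  # this string would be sent to the LLM
```

Swapping `embed` for a real embedding model and `VectorStore` for one of the databases named above changes the quality of retrieval, but not the overall shape of the pipeline.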

The Evolution of Context Windows

If RAG is so powerful, why not just feed the entire database directly into the model? Historically, this was impossible because of the way attention works.

The memory and computation cost of standard Self-Attention scales quadratically ($O(N^2)$) with the length of the input, because every token attends to every other token. If you double the length of your input, the attention step takes roughly four times the compute and memory. Early large language models such as GPT-3 were capped at a context window of just 2,048 tokens (roughly 1,500 words).
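
To make the quadratic growth concrete, the sketch below counts the bytes needed to hold a single $N \times N$ attention score matrix. The assumptions are illustrative: one attention head, one layer, and 2 bytes per value (16-bit floats). The helper name `attention_scores_bytes` is made up for this example.

```python
def attention_scores_bytes(n_tokens: int, bytes_per_value: int = 2) -> int:
    """Bytes for one N x N attention score matrix (one head, one layer)."""
    return n_tokens * n_tokens * bytes_per_value


for n in (2_048, 4_096, 131_072):
    mib = attention_scores_bytes(n) / 2**20
    print(f"{n:>7} tokens: {mib:,.0f} MiB per head per layer")
```

Under these assumptions, 2,048 tokens need 8 MiB, 4,096 tokens need 32 MiB, and 131,072 tokens need 32 GiB for a single head in a single layer, which is why materializing the full matrix stops being practical at long lengths.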

Recent architectural and serving breakthroughs have broken this barrier. Frontier systems now commonly offer hundreds of thousands to millions of tokens of working memory, making it possible to analyze long PDFs, codebases, transcripts, and multi-file projects in a single request. Two of the main pillars of this scaling are:

1. FlashAttention

Introduced by Tri Dao and colleagues, FlashAttention is a software-level optimization. Rather than changing the math of attention, it changes how the GPU handles memory. Standard attention writes the full $N \times N$ score matrix out to comparatively slow GPU High Bandwidth Memory (HBM) and reads it back, even though the arithmetic itself happens in fast on-chip SRAM. FlashAttention computes attention in small blocks, keeping data in SRAM as much as possible and never materializing the full matrix in HBM. The compute is still quadratic, but attention memory grows linearly rather than quadratically with sequence length (the paper reports up to 20x memory savings over standard exact attention), allowing context windows to scale dramatically without running out of GPU memory.
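
The block-by-block idea rests on a running ("online") softmax: scores are processed one block at a time while a running maximum and running sum are kept up to date, so the full row of scores is never stored. The sketch below shows only that numerical trick for a single query vector in plain Python. It is not the GPU kernel, and the function names and `block_size` parameter are chosen for illustration.

```python
import math


def naive_attention(q, keys, values):
    """Single-query attention that materializes every score before the softmax."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def blockwise_attention(q, keys, values, block_size=2):
    """Same result, computed one key/value block at a time with a running softmax."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalized weighted sum of values seen so far
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum before adding this block.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vd for a, vd in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0], [3.0, -1.0]]
full = naive_attention(q, keys, values)
tiled = blockwise_attention(q, keys, values, block_size=2)
```

Both functions return the same output up to floating-point rounding, which is the sense in which the math of attention is unchanged while the memory access pattern is.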

2. Rotary Position Embeddings (RoPE)

Older absolute positional encodings generalized poorly to context lengths longer than those seen in training. RoPE encodes position by rotating each query and key vector by an angle proportional to the token's position. Because the attention score between two rotated vectors depends only on the difference between their positions, the model reasons about the relative distance between tokens rather than their absolute index. This does not make longer contexts work automatically, but it allows the context window to be extended after pretraining by rescaling positions or rotation frequencies, usually with a comparatively small amount of fine-tuning.
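
The relative-position property can be checked numerically. The sketch below rotates consecutive pairs of dimensions by position-dependent angles, following the frequency schedule from the RoFormer paper with the commonly used base of 10000, then compares attention scores for two query/key pairs that share the same offset. The function names and test vectors are illustrative.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by an angle proportional to position."""
    dim = len(vec)  # assumed to be even
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same offset (3 positions apart) at very different absolute positions.
near = dot(rope(q, 5), rope(k, 2))
far = dot(rope(q, 1005), rope(k, 1002))
```

`near` and `far` agree up to floating-point rounding, because shifting both positions by the same amount rotates the query and key together and leaves their dot product unchanged.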

The "Needle in a Haystack" Test

Just because a model can accept a million tokens does not mean it uses all of them reliably. To evaluate long-context retrieval, researchers use the Needle in a Haystack (NIAH) test.

A random, unrelated fact (the "needle") is hidden somewhere inside a massive text dump of documents (the "haystack"). The model is then asked a question that can only be answered using that specific fact. The test is repeated at different context lengths and needle positions, and a strong long-context model should score close to 100% whether the needle is hidden at the beginning, middle, or end of the document stack.
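
A NIAH harness is simple to sketch. In the example below, `ask_model` is a hypothetical callable that takes a prompt and returns the model's answer as a string, and `edges_only_model` is a deliberately flawed stand-in that only "reads" the first and last 300 characters of its prompt, to show what a positional failure looks like in the results. The filler text, needle, and substring scoring are all illustrative assumptions.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler text."""
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def run_niah(ask_model, filler_sentences, needle, question, expected, depths):
    """Return {depth: passed} using a simple substring match on the answer."""
    results = {}
    for depth in depths:
        haystack = build_haystack(filler_sentences, needle, depth)
        prompt = f"{haystack}\n\nQuestion: {question}"
        answer = ask_model(prompt)
        results[depth] = expected.lower() in answer.lower()
    return results


def edges_only_model(prompt):
    """Stand-in for a model call: only sees the start and end of the prompt."""
    visible = prompt[:300] + prompt[-300:]
    return "7421" if "7421" in visible else "I don't know."


filler = ["The sky was grey that morning."] * 50
results = run_niah(
    edges_only_model,
    filler,
    needle="The secret passcode is 7421.",
    question="What is the secret passcode?",
    expected="7421",
    depths=[0.0, 0.5, 1.0],
)
```

With this stand-in, the needle is found at depths 0.0 and 1.0 but missed at 0.5. A real evaluation would replace `edges_only_model` with an actual model call and sweep many more depths and context lengths.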

However, long context is not a free replacement for retrieval. Million-token prompts can still be slower, more expensive, and harder to audit than a well-built RAG pipeline. In production systems, engineers often combine both: use retrieval to select the most relevant evidence, then use a long-context model when the task requires cross-document synthesis, codebase-wide reasoning, or comparison across many artifacts.

RAG vs. Long Context: How to Choose

Use RAG when the corpus is large, frequently changing, permissioned, or needs precise citations. Retrieval keeps prompts smaller, makes source selection auditable, and lets the application enforce access control before the model sees anything.

Use long context when the task requires comparing many pieces at once: reviewing a pull request across files, reconciling a contract with its exhibits, summarizing a full transcript, or finding contradictions across a small document set.

The most reliable pattern is often hybrid: retrieve the best candidates first, rerank them, then give a long-context model enough surrounding material to synthesize rather than quote isolated snippets. The eval should check both steps: did retrieval find the right evidence, and did generation stay faithful to it?
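
The two-step eval can be sketched as two small checks. Both are assumptions made for illustration: `recall_at_k` presumes a labeled set of relevant chunk ids for each question, and `unsupported_sentences` is a crude word-overlap proxy for faithfulness that cannot detect paraphrase or subtle errors. Human review or model-based grading is commonly used for the second step instead.

```python
import re


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunk ids that appear in the top-k retrieved ids."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def _tokens(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_sentences(answer, evidence, min_overlap=0.5):
    """Flag answer sentences whose words mostly do not appear in the evidence."""
    evidence_tokens = _tokens(evidence)
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = _tokens(sentence)
        if not words:
            continue
        overlap = len(words & evidence_tokens) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged


# Step 1: did retrieval find the right evidence?
retrieval_score = recall_at_k(["c3", "c1", "c7"], ["c1", "c2"], k=2)

# Step 2: did generation stay close to that evidence?
evidence = "The warranty covers hardware defects for two years."
answer = "The warranty covers hardware defects for two years. It also includes free shipping worldwide."
flagged = unsupported_sentences(answer, evidence)
```

Here retrieval finds one of the two relevant chunks in its top two results, and the second answer sentence is flagged because none of its words appear in the evidence. Tracking the two numbers separately shows whether a bad answer came from retrieval or from generation.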

Sources

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — Lewis et al.
FlashAttention — Dao et al.
RoFormer: Enhanced Transformer with Rotary Position Embedding — Su et al.
Gemini 1.5 and long-context model capabilities — Google