RAG & Context Windows
An AI model has two kinds of memory. The first is **Parametric Memory**: information baked directly into the model's weights during training. The second is **Working Memory**: the space available in the immediate input prompt, known as the **Context Window**. To build reliable systems that hallucinate less, engineers combine these two memories through RAG and long-context architectures.
The Limitation of Parametric Memory
Relying purely on what a model has memorized has three massive drawbacks:
- Knowledge Cutoff: The model only knows what existed before its training run finished.
- Hallucinations: When asked about obscure facts, models often confidently guess, creating plausible-sounding falsehoods.
- No Access to Private Data: Models cannot read your local PDFs, company emails, or secure databases.
Retrieval-Augmented Generation (RAG)
**RAG** addresses these drawbacks by turning the model into an open-book test taker. Instead of answering from memory alone, the system searches an external database for relevant documents, pastes them directly into the context window, and asks the model to generate its answer from them.
How the RAG Pipeline Works
- Chunking: Large documents (like a 100-page manual) are broken down into small, digestible paragraphs (chunks).
- Embeddings: An embedding model converts each text chunk into a fixed-length list of numbers (a vector) representing its semantic meaning.
- Vector Database: These vectors are stored in a specialized database (like Pinecone, Chroma, or pgvector).
- Retrieval (Semantic Search): When a user asks a question, the system converts the question into a vector and finds the text chunks whose vectors are closest to it, typically measured with cosine similarity.
- Augmentation & Generation: The system fetches those text chunks, inserts them into a prompt alongside the user's question, and sends it to the LLM: "Here is the context: [chunks]. Answer this question based on that context: [query]."
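The steps above can be sketched end to end in a few lines of Python. This is a minimal illustration, not a production pipeline: the `embed` function is a toy bag-of-words counter standing in for a real embedding model, the "vector database" is a plain list, and the names `chunk_text`, `build_index`, `retrieve`, and `build_prompt` are hypothetical helpers invented for this example. The final LLM call is left out; the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40):
    """Chunking: split text into pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding (assumption): word counts instead of a learned model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_index(documents, max_words=40):
    """Stand-in for a vector database: a list of (chunk, vector) pairs."""
    index = []
    for doc in documents:
        for chunk in chunk_text(doc, max_words):
            index.append((chunk, embed(chunk)))
    return index


def retrieve(index, query, k=2):
    """Retrieval: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(chunks, query):
    """Augmentation: place the retrieved chunks next to the question."""
    context = "\n".join(chunks)
    return f"Here is the context: {context}\nAnswer this question based on that context: {query}"


documents = [
    "The warranty on the X200 printer lasts 24 months from the purchase date.",
    "To reset the X200 printer, hold the power button for ten seconds.",
    "Our office is closed on public holidays.",
]
question = "How long is the printer warranty?"
index = build_index(documents)
top_chunks = retrieve(index, question, k=1)
prompt = build_prompt(top_chunks, question)
print(prompt)
```

In a real system, `embed` would call an embedding model and `build_index` would write to a store such as the ones named above; the control flow stays the same.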
The Evolution of Context Windows
If retrieval adds this much machinery, why not just feed the entire database directly into the model? Historically, this was impossible because of the way attention works.
The memory and computation cost of standard Self-Attention scales **quadratically** ($O(N^2)$) with the input length $N$. If you double the length of your input, the attention step takes roughly four times more compute and memory. Early large language models such as GPT-3 were capped at a context window of just 2,048 tokens (roughly 1,500 words).
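To make the quadratic growth concrete, the short sketch below computes how much memory a single $N \times N$ attention score matrix would occupy if it were stored in full. The 2 bytes per value (16-bit floats) and the choice of sequence lengths are assumptions for illustration, and the figure is per attention head per layer, so real totals would be larger.

```python
def attention_scores_bytes(num_tokens, bytes_per_value=2):
    """Bytes needed to store one N x N score matrix (one head, one layer)."""
    return num_tokens * num_tokens * bytes_per_value


for n in (2_048, 4_096, 131_072, 1_000_000):
    gib = attention_scores_bytes(n) / 2**30
    print(f"{n:>9,} tokens -> {gib:,.3f} GiB per head per layer")
```

Going from 2,048 to 4,096 tokens quadruples the size, and at 131,072 tokens a single matrix already needs 32 GiB, which is why simply storing the scores stops being an option at long lengths.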
Recent advances in GPU kernels and model architecture have pushed this limit much further, enabling models like Gemini to accept 1 to 2 million tokens of context. Two of the main pillars of this scaling are:
1. FlashAttention
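Introduced by Tri Dao and collaborators in 2022, FlashAttention is a software-level optimization. Rather than changing the math of attention, it changes how the GPU handles memory. Standard attention writes the full $N \times N$ score matrix out to slow GPU High Bandwidth Memory (HBM) and reads it back, even though the arithmetic itself happens in fast on-chip SRAM. FlashAttention computes attention in small blocks and maintains a running softmax, so the full matrix is never stored and data stays in SRAM as much as possible. The output is still exact attention, but memory use grows linearly rather than quadratically with sequence length (the original paper reports up to roughly 20x memory savings over standard implementations at long sequence lengths). The amount of computation remains quadratic; what changes is how much slow-memory traffic and storage it requires.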
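The block-by-block idea can be illustrated in plain Python for a single query vector. This is only a sketch of the tiling and running-softmax arithmetic, under the assumption that it is acceptable to ignore everything GPU-specific: there is no SRAM or HBM here, and `naive_attention` and `tiled_attention` are hypothetical names for this example, not functions from the FlashAttention library. The point is that the tiled version never holds more than `block_size` scores at once yet produces the same result.

```python
import math


def naive_attention(q, keys, values):
    """Reference version: compute every score first, then apply softmax."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def tiled_attention(q, keys, values, block_size=2):
    """Visit keys/values one block at a time, keeping only running statistics."""
    dim = len(values[0])
    running_max = float("-inf")  # largest score seen so far
    running_sum = 0.0            # softmax denominator relative to running_max
    acc = [0.0] * dim            # unnormalized weighted sum of values
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results so they share the new reference maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(dim):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

full = naive_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, block_size=2)
print(full)
print(tiled)
```

The two printed vectors agree up to floating-point rounding, which is the sense in which the blockwise computation leaves the math of attention unchanged.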
2. Rotary Position Embeddings (RoPE)
Learned absolute position embeddings cannot represent positions beyond the length they were trained on. **RoPE** encodes position by rotating each query and key vector by an angle proportional to its position in the sequence. Because the dot product of two rotated vectors depends only on the difference between their angles, attention scores reflect the relative distance between tokens rather than their absolute positions. This property is what lets context windows be extended after training, usually by rescaling the rotation frequencies (for example with position interpolation) and applying a small amount of fine-tuning.
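The relative-distance property is easy to check numerically. The sketch below applies a rotation to consecutive pairs of dimensions, using the commonly cited frequency base of 10000; the function name `rope`, the tiny 4-dimensional vectors, and the chosen positions are assumptions for illustration. It compares the attention score for two tokens three positions apart, once near the start of a sequence and once a thousand positions later.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by an angle proportional to position."""
    dim = len(vec)  # assumed to be even
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

score_near = dot(rope(q, 5), rope(k, 2))        # positions 5 and 2, offset 3
score_far = dot(rope(q, 1005), rope(k, 1002))   # positions 1005 and 1002, offset 3
print(score_near, score_far)
```

Both scores match up to floating-point rounding because only the offset of 3 matters, not where in the sequence the pair sits.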
The "Needle in a Haystack" Test
Just because a model can accept a million tokens doesn't mean it is actually reading them. To evaluate long-context retrieval, researchers use the **Needle in a Haystack (NIAH)** test.
A random, unrelated fact (the "needle") is hidden somewhere inside a massive text dump of documents (the "haystack"). The model is then asked a question that can only be answered using that specific fact. The test is repeated with the needle placed at different depths, and a strong long-context model is expected to score close to 100%, finding the needle regardless of whether it is hidden at the beginning, middle, or end of the document stack.
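A bare-bones version of such a test harness might look like the sketch below. Everything model-related here is a placeholder: `fake_model` is a deliberately weak, hypothetical stand-in that only looks at the first and last 300 characters of its prompt, so that the harness has something to fail on. In practice `ask_model` would wrap a call to a real LLM, and the filler would be real documents rather than generated sentences.

```python
import re


def build_haystack(filler_sentences, needle, depth):
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end)."""
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def run_niah(ask_model, filler_sentences, needle, question, expected, depths):
    """Return {depth: bool} recording whether each answer contains the expected text."""
    results = {}
    for depth in depths:
        haystack = build_haystack(filler_sentences, needle, depth)
        prompt = f"{haystack}\n\nQuestion: {question}"
        answer = ask_model(prompt)
        results[depth] = expected.lower() in answer.lower()
    return results


def fake_model(prompt):
    """Hypothetical stub, not a real LLM: it only 'reads' the edges of the prompt."""
    visible = prompt[:300] + prompt[-300:]
    match = re.search(r"secret passcode is (\d+)", visible)
    return match.group(1) if match else "I don't know."


filler = [f"Filler sentence number {i} talks about nothing in particular." for i in range(200)]
results = run_niah(
    ask_model=fake_model,
    filler_sentences=filler,
    needle="The secret passcode is 7421.",
    question="What is the secret passcode?",
    expected="7421",
    depths=[0.0, 0.5, 1.0],
)
print(results)
```

With this stub the needle is found at depths 0.0 and 1.0 but missed at 0.5, which is exactly the kind of position-dependent failure the test is designed to expose.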