Context windows keep growing, but infinite memory isn't here yet. This guide explains LLM context limits, how RAG fills the gap, and - critically - when to use each approach in 2026.
Every LLM processes text inside a context window - the total number of tokens it can hold at once, combining your prompt, any documents you attach, and its own response. Claude Sonnet 4.6 supports roughly 200,000 tokens; Gemini 3.1 Pro pushes to 1 million; some research models exceed that. A token is roughly 0.75 words, so 200K tokens hold about 150,000 words - roughly two average-length novels.
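The arithmetic above can be sketched in a few lines. This is a rough estimate only: the 0.75 words-per-token figure is the rule of thumb from the text, real tokenizers vary by model and language, and the function names and the 4,000-token output reserve are illustrative assumptions.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb from the text; actual tokenizers differ


def estimate_tokens(word_count: int) -> int:
    """Approximate the token count for a given number of English words."""
    return round(word_count / WORDS_PER_TOKEN)


def fits_in_window(word_count: int, window_tokens: int,
                   reserved_for_output: int = 4_000) -> bool:
    """Check whether the text fits while leaving room for the response.

    reserved_for_output is an assumed allowance, not a model-specific value.
    """
    return estimate_tokens(word_count) + reserved_for_output <= window_tokens


print(estimate_tokens(150_000))          # 200000
print(fits_in_window(150_000, 200_000))  # False: no room left for the response
```

Note that a document that exactly fills the window still does not fit in practice, because the response shares the same budget.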
That sounds like a lot. But enterprise knowledge bases contain millions of documents. Customer support teams have years of ticket history. Legal teams have entire case libraries. No context window - regardless of how large - will hold it all at once. And even when the data would technically fit, stuffing a context window completely tends to degrade quality: the model "loses the thread" of what's most relevant when everything is crammed in.
This is the problem context management is trying to solve.
Retrieval-Augmented Generation (RAG) was the dominant answer to the context problem before large context windows became widespread - and it remains highly relevant in 2026. The core idea is simple: instead of putting your entire knowledge base into the prompt, you search it at query time and inject only the chunks that are relevant to this specific question.
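The retrieve-then-inject flow can be sketched as follows. This is a minimal illustration, not a production design: it scores chunks by simple keyword overlap as a stand-in for real retrieval, and `KNOWLEDGE_BASE`, `retrieve`, and `build_prompt` are hypothetical names chosen for the example.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return up to top_k chunks sharing at least one word with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(chunk)), chunk) for chunk in chunks]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [chunk for score, chunk in ranked[:top_k] if score > 0]


def build_prompt(query: str, chunks: list[str], top_k: int = 2) -> str:
    """Inject only the retrieved chunks into the prompt, not the whole store."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, top_k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical three-chunk knowledge base for illustration.
KNOWLEDGE_BASE = [
    "Refunds are processed within 5 business days.",
    "Our office is closed on public holidays.",
    "Refund requests require the original order number.",
]

print(build_prompt("How long do refunds take?", KNOWLEDGE_BASE))
```

Only the first chunk reaches the prompt here. The third chunk is missed because "refund" and "refunds" do not match literally, which is one reason real systems tend to use embeddings rather than exact keywords.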
The retrieval step is typically powered by a vector database - a system that stores documents as high-dimensional embeddings and finds the nearest matches to your query vector. Common options include Pinecone, Weaviate, Chroma, and pgvector (if you're already on Postgres).
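Nearest-match search over embeddings can be illustrated with cosine similarity in plain Python. The three-dimensional vectors and document IDs below are made up for the example; real embeddings come from an embedding model and have hundreds or thousands of dimensions, and vector databases typically use approximate indexes rather than this brute-force scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_vec: list[float], store: dict[str, list[float]],
            top_k: int = 1) -> list[str]:
    """Return the IDs of the top_k stored vectors most similar to the query."""
    ranked = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:top_k]]


# Toy store: hypothetical document IDs mapped to made-up embeddings.
STORE = {
    "billing-faq": [0.9, 0.1, 0.0],
    "api-reference": [0.1, 0.9, 0.2],
    "holiday-policy": [0.0, 0.2, 0.9],
}

print(nearest([0.8, 0.2, 0.1], STORE, top_k=2))  # ['billing-faq', 'api-reference']
```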
Neither RAG nor a long context window is universally better. This is the framework we use when advising on architecture decisions:
| Dimension | RAG | Long-context window |
|---|---|---|
| Cost per query | Low - only relevant chunks in prompt | High - entire document set on every call |
| Latency | Adds retrieval step (50-200ms typical) | Single inference call, no retrieval |
| Accuracy (focused) | High - fewer distractions in prompt | Can degrade with very full contexts |
| Knowledge freshness | Update the store, not the model | Requires re-prompting with new docs |
| Setup complexity | Vector DB, embeddings pipeline, chunking | Just put the docs in the prompt |
| Auditability | Can cite specific retrieved chunks | Source tracing harder with full context |
| Multi-document reasoning | Depends on retrieval quality | Model sees everything, can cross-reference |
| Knowledge base size | Scales to millions of documents | Hard limit at context window size |
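
The cost row of the table can be made concrete with some back-of-the-envelope arithmetic. Every number here is an assumption for illustration: the price per million input tokens is hypothetical, as are the chunk count, chunk size, and question length. Substitute your provider's actual pricing.

```python
def input_cost(tokens: int, price_per_million: float) -> float:
    """Input-token cost of one call at a given price per million tokens."""
    return tokens / 1_000_000 * price_per_million


PRICE_PER_MILLION = 3.00   # hypothetical USD price, not a quoted rate
QUESTION_TOKENS = 200      # assumed length of the user question

# RAG: five retrieved chunks of about 500 tokens each, plus the question.
rag_tokens = 5 * 500 + QUESTION_TOKENS

# Long context: a 150,000-token document set sent on every call.
full_tokens = 150_000 + QUESTION_TOKENS

rag_cost = input_cost(rag_tokens, PRICE_PER_MILLION)
full_cost = input_cost(full_tokens, PRICE_PER_MILLION)

print(f"RAG:          ${rag_cost:.4f} per query")
print(f"Long context: ${full_cost:.4f} per query")
print(f"Ratio:        {full_cost / rag_cost:.1f}x")
```

Under these assumptions the long-context call costs tens of times more per query. The sketch ignores output tokens, embedding and vector-store costs, and any prompt caching a provider may offer, all of which shift the comparison.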
The most sophisticated production systems in 2026 use both. RAG handles the retrieval layer - pulling fresh, relevant chunks from large knowledge stores. The long context window handles deep reasoning - giving the model enough room to cross-reference those chunks, maintain conversation history, and produce nuanced answers.
This is sometimes called "context engineering": deliberately constructing the context window rather than either filling it entirely or leaving retrieval as the only mechanism.
Enterprise finding: a consistent pattern in 2026 enterprise deployments is that RAG and large context windows are complementary, not competing. Teams use RAG to select which documents enter the window, then give the model a generous window to reason over them. The retrieval layer handles scale; the context window handles depth.
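One way to picture context engineering is as a budgeting problem: reserve room for the response and the conversation history, then fill what remains with retrieved chunks in rank order. The sketch below assumes the chunks arrive already ranked by a retrieval layer and approximates token counts with the 0.75 words-per-token rule of thumb; `pack_context` and its parameters are hypothetical names.

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: about 0.75 words per token."""
    return round(len(text.split()) / 0.75)


def pack_context(ranked_chunks: list[str], history: list[str],
                 window_tokens: int, reserved_for_output: int) -> list[str]:
    """Select chunks, best first, that fit in the remaining token budget."""
    budget = window_tokens - reserved_for_output
    budget -= sum(approx_tokens(turn) for turn in history)

    selected = []
    for chunk in ranked_chunks:
        cost = approx_tokens(chunk)
        if cost > budget:
            continue  # too large for what is left; a later, smaller chunk may fit
        selected.append(chunk)
        budget -= cost
    return selected


chunks = ["a b c", "d e f g h i", "j k l"]  # stand-ins for ranked chunks
history = ["x y z"]                          # stand-in for earlier turns

print(pack_context(chunks, history, window_tokens=20, reserved_for_output=4))
# ['a b c', 'd e f g h i']
print(pack_context(chunks, history, window_tokens=16, reserved_for_output=4))
# ['a b c', 'j k l']
```

Skipping an oversized chunk rather than stopping is a design choice: it favours filling the window over strictly respecting rank order. A real system would use the model's own tokenizer for counts.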
Where your data lives determines what retrieval stack makes sense:
Code and technical documentation: GitHub repositories, Confluence wikis, internal API docs. GitHub MCP is a practical option for giving LLMs direct access to code context without manual retrieval plumbing - it handles the connection between your Claude agent and your codebase.
Unstructured documents: PDFs, Word docs, Notion pages. These need a chunking and embedding pipeline before they can be searched. LlamaIndex has become the go-to library for this - it handles document ingestion, chunking strategy, embedding, and querying with good defaults.
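To show what the chunking step involves, here is a minimal word-based chunker with overlap, written in plain Python. It is not LlamaIndex's implementation or API; the function name and the default sizes are assumptions, and production pipelines usually split on sentence or section boundaries and measure size in tokens.

```python
def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into chunks of chunk_size words, sharing overlap words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reached the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
# ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

The overlap means a sentence that straddles a boundary still appears intact in at least one chunk, at the cost of storing some text twice.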
The Model Context Protocol (MCP) is one of the cleanest ways to implement retrieval for Claude-based agents in 2026. Instead of building custom retrieval plumbing, MCP servers give your agent structured access to specific data sources - GitHub, Notion, databases - with a standardised interface. The agent queries the MCP server; the server handles retrieval. See our MCP guide for a full breakdown of available servers.
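Under the hood, MCP messages are JSON-RPC 2.0, and an agent invokes a server-side tool with a `tools/call` request. The sketch below only builds such a message. The tool name `search_docs` and its `query` argument are hypothetical, since each server defines its own tools, and in practice an MCP client library would handle transport and session setup for you.

```python
import json


def make_tool_call(request_id: int, tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request asking an MCP server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool and argument names; a real server advertises its own.
request = make_tool_call(1, "search_docs", {"query": "refund policy"})
print(json.dumps(request, indent=2))
```

The agent never sees how the server finds its results; it only sends the request and receives structured content back, which is what keeps the retrieval plumbing out of the agent code.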
A context window is the maximum amount of text an LLM can process in a single request, counting your input and its output together. Claude Sonnet 4.6 supports around 200K tokens; some models reach 1M+. Larger windows let you feed in more documents, but they cost more per call and can weaken the model's focus on the parts that matter.
Yes, RAG is still worth using alongside large context windows. Large windows reduce the cases where RAG is mandatory, but they don't eliminate its advantages. RAG retrieves only relevant chunks, keeping costs and latency low. It also lets you update the knowledge store without reprocessing everything. Enterprise teams typically use a hybrid: long context for reasoning depth, RAG for keeping retrieval fresh and cheap.
RAG injects external knowledge at inference time - the model doesn't change. Fine-tuning bakes knowledge into the model's weights during training. RAG is better when your knowledge changes frequently or you need citations and auditability. Fine-tuning is better when you need to instil a specific style, tone, or domain behaviour that prompting alone can't achieve.
For most teams choosing their first vector database: Pinecone (managed, no infra) or Weaviate (open-source, self-hosted or cloud). If you're already on Postgres, pgvector lets you skip a separate service entirely. Chroma is popular for local prototyping. Pick based on where your data lives and how much infrastructure you want to manage.