What is Retrieval-Augmented Generation (RAG)? (2026)
RAG is the most practical way to make AI models work with your data. Instead of fine-tuning a model on your documents, you retrieve relevant information at query time and feed it to the model as context. Here's how it works and when to use it.
The Problem RAG Solves
LLMs like GPT-4 and Claude have a knowledge cutoff. They don't know about:
- Your company's internal documents
- Your product documentation
- Your customer data
- Anything after their training date
You could fine-tune a model on your data, but that's expensive, slow, and the model might still hallucinate.
RAG solves this by giving the model your data at query time. The model generates answers based on actual retrieved documents, not just its training data.
How RAG Works (Simple Version)
1. User asks a question: "What's our refund policy?"
2. System retrieves relevant documents: Searches your knowledge base for refund-related docs
3. System sends question + documents to LLM: "Based on these documents, answer this question"
4. LLM generates an answer: Grounded in your actual documents, with citations
User Question → Retrieve Documents → LLM + Documents → Answer
How RAG Works (Technical)
Step 1: Indexing (One-Time Setup)
Take your documents and convert them to vector embeddings:
Documents → Chunk into pieces → Generate embeddings → Store in vector database
Chunking: Split documents into overlapping pieces (e.g., 500 tokens each with 50 token overlap). Too large = irrelevant context. Too small = missing context.
Embeddings: Convert text chunks to numbers (vectors) using an embedding model (OpenAI text-embedding-3-small, Cohere embed-v3). Similar text produces similar vectors.
Vector Database: Store embeddings in a vector database (Pinecone, Qdrant, Weaviate, pgvector, Chroma).
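The chunking step above can be sketched in a few lines. This toy version splits on whitespace and counts words as a rough stand-in for model tokens (a real pipeline would count actual tokens, e.g. with a tokenizer); the 500/50 numbers mirror the example above.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks. Words stand in for tokens here;
    a production pipeline would count real model tokens with a tokenizer."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # final chunk reached; avoid a tiny trailing duplicate
    return chunks

chunks = chunk_text("word " * 1200, chunk_size=500, overlap=50)
print(len(chunks))  # 3 chunks: words 0-499, 450-949, 900-1199
```

Each chunk then goes through the embedding model and into the vector database, usually with the original text and any metadata stored alongside the vector.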
Step 2: Retrieval (Every Query)
When a user asks a question:
Query → Generate embedding → Search vector DB → Return top-K similar chunks
The vector database finds the most semantically similar document chunks to the question. This is called semantic search — it matches meaning, not just keywords.
Step 3: Generation (Every Query)
Send the question + retrieved chunks to the LLM:
System: Answer based on the following context. If the answer isn't in the context, say so.
Context:
[Retrieved chunk 1]
[Retrieved chunk 2]
[Retrieved chunk 3]
User: What's our refund policy?
The LLM generates an answer grounded in the provided context.
Implementation Example
Here's a minimal RAG system using OpenAI + Pinecone:
```python
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
pc = Pinecone()
index = pc.Index("my-docs")

def ask(question):
    # 1. Embed the question
    embedding = client.embeddings.create(
        input=question,
        model="text-embedding-3-small"
    ).data[0].embedding

    # 2. Search vector database
    results = index.query(vector=embedding, top_k=5, include_metadata=True)

    # 3. Build context from results
    context = "\n\n".join(match.metadata["text"] for match in results.matches)

    # 4. Generate answer
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"Answer based on this context:\n\n{context}"},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content
```
RAG vs Fine-Tuning
| | RAG | Fine-Tuning |
|---|---|---|
| Cost | Low (embedding + retrieval) | High (training run) |
| Setup time | Hours | Days-weeks |
| Data freshness | Real-time (update docs anytime) | Stale (re-train to update) |
| Hallucination | Lower (grounded in documents) | Higher (model may confabulate) |
| Best for | Q&A, search, support | Style/tone, specialized tasks |
Use RAG when: You need factual answers from your data. Use fine-tuning when: You need the model to behave differently (style, format, domain expertise).
Vector Databases for RAG
| Database | Type | Free Tier | Best For |
|---|---|---|---|
| Pinecone | Managed | ✅ (100K vectors) | Easiest setup |
| Qdrant | Self-hosted / Cloud | ✅ | Performance |
| Weaviate | Self-hosted / Cloud | ✅ | Hybrid search |
| Chroma | Embedded (Python) | Open source | Prototyping |
| pgvector | Postgres extension | Part of Postgres | Already using Postgres |
Recommendation: Start with pgvector if you already have Postgres (Neon, Supabase). Use Pinecone if you want managed infrastructure.
Chunking Strategies
How you split documents matters more than you think:
Fixed-Size Chunks
Split every 500 tokens. Simple but may break mid-sentence.
Recursive Character Splitting
Split by paragraph → sentence → character. Respects natural boundaries.
Semantic Chunking
Use embeddings to find natural topic boundaries. Most accurate but slowest.
Document-Aware Chunking
Split by headers (Markdown H2/H3), code blocks, or other structure. Best for technical docs.
Recommendation: Start with recursive character splitting (LangChain's default). Move to semantic chunking if retrieval quality is poor.
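Recursive splitting can be sketched without any library. This minimal version is in the spirit of LangChain's RecursiveCharacterTextSplitter, but simplified: it tries coarse separators first, falls back to finer ones only for oversized pieces, and (unlike the real splitter) does not merge small neighbors back together.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces under max_chars, preferring natural boundaries:
    paragraphs first, then lines, then sentences, then words."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: hard-split at the character limit
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if len(part) <= max_chars:
            if part.strip():
                pieces.append(part)
        else:
            pieces.extend(recursive_split(part, max_chars, rest))
    return pieces

doc = "Intro paragraph.\n\nA much longer second paragraph. " + "More detail. " * 30
chunks = recursive_split(doc)
```

The short intro paragraph stays whole, while the long paragraph is broken at sentence boundaries rather than mid-sentence.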
Advanced RAG Techniques
Hybrid Search
Combine vector search (semantic) with keyword search (BM25). Catches both meaning and exact terms.
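One common way to merge the two result lists is Reciprocal Rank Fusion (RRF), which needs only the rank positions, not comparable scores. The document IDs below are hypothetical; in practice the two rankings would come from your BM25 index and your vector database.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Combine several ranked result lists with Reciprocal Rank Fusion.
    Each ranking is a list of doc ids, best first. k=60 is the constant
    from the original RRF paper; it damps the influence of top ranks."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results: BM25 keyword ranking vs. vector (semantic) ranking
keyword = ["doc_refunds", "doc_shipping", "doc_faq"]
semantic = ["doc_returns", "doc_refunds", "doc_billing"]
fused = reciprocal_rank_fusion([keyword, semantic])
print(fused[0])  # doc_refunds: ranked well by both retrievers
```

A document that appears in both rankings accumulates score from each, so agreement between the retrievers floats it to the top.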
Re-Ranking
Retrieve 20 results, then use a re-ranker model (Cohere Rerank, cross-encoder) to pick the best 5. Significantly improves accuracy.
Query Transformation
Rewrite the user's question before searching. "What's the cancellation policy?" → "refund policy cancellation terms conditions". Improves retrieval.
Multi-Query RAG
Generate multiple versions of the question, search each, combine results. Catches different phrasings.
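The merge step is straightforward to sketch. Here the LLM rewriter and the vector search are injected as plain functions (hypothetical stand-ins) so the de-duplication logic runs on its own; in a real system `rewrite_fn` would call an LLM and `search_fn` would query the vector database.

```python
def multi_query_retrieve(question, rewrite_fn, search_fn, top_k=5):
    """Search with several rephrasings of the question and merge results,
    de-duplicating across queries while preserving first-seen order."""
    seen, merged = set(), []
    for query in [question] + rewrite_fn(question):
        for doc_id in search_fn(query):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged[:top_k]

# Hypothetical stand-ins for the LLM rewriter and the vector search
rewrites = lambda q: ["cancellation terms", "money-back guarantee"]
fake_search = {
    "What's the refund policy?": ["doc_refunds", "doc_faq"],
    "cancellation terms": ["doc_cancel", "doc_refunds"],
    "money-back guarantee": ["doc_guarantee"],
}.get
docs = multi_query_retrieve("What's the refund policy?", rewrites, fake_search)
```

Notice that `doc_refunds` appears for two different phrasings but is returned only once.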
Agentic RAG
Use an AI agent to decide what to search, evaluate results, and search again if needed. Most powerful but most complex.
Common Mistakes
1. Chunks Too Large
2,000 token chunks include too much irrelevant information. The LLM gets confused. Stick to 200-500 tokens.
2. No Overlap
Chunks without overlap lose context at boundaries. Use 10-20% overlap.
3. Ignoring Metadata
Filter by metadata (date, category, author) before vector search. "What changed last month?" should only search recent documents.
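The idea can be shown with an in-memory sketch: filter on metadata first, then score only the survivors. The `updated`/`category` schema here is hypothetical; managed vector databases expose the same pattern as a `filter` argument on the query.

```python
from datetime import date

def filtered_search(chunks, query_embedding, since=None, category=None, top_k=3):
    """Apply metadata filters BEFORE vector scoring, so a 'what changed
    last month?' question never surfaces stale documents."""
    candidates = [
        c for c in chunks
        if (since is None or c["updated"] >= since)
        and (category is None or c["category"] == category)
    ]
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    candidates.sort(key=lambda c: dot(c["embedding"], query_embedding), reverse=True)
    return candidates[:top_k]

chunks = [
    {"id": 1, "embedding": [1.0, 0.0], "updated": date(2026, 1, 20), "category": "policy"},
    {"id": 2, "embedding": [0.9, 0.1], "updated": date(2024, 3, 1), "category": "policy"},
]
recent = filtered_search(chunks, [1.0, 0.0], since=date(2026, 1, 1))
print([c["id"] for c in recent])  # [1] — the 2024 chunk never gets scored
```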
4. No Evaluation
Build evaluation datasets. Measure retrieval accuracy (are the right chunks being found?) and answer quality (is the LLM answering correctly?).
5. Stuffing Too Much Context
Sending 20 chunks to the LLM dilutes relevance. 3-5 high-quality chunks beat 20 mediocre ones.
When to Use RAG
✅ Good Use Cases
- Customer support chatbot (answer from knowledge base)
- Internal document search (company wiki, policies)
- Product documentation Q&A
- Legal document analysis
- Research paper synthesis
❌ Poor Use Cases
- Creative writing (doesn't need retrieval)
- Simple classification tasks (fine-tuning is better)
- Real-time data (use API calls instead)
- Very small datasets (<10 documents — just put them all in context)
FAQ
How much does RAG cost?
Embedding: ~$0.02 per million tokens. Storage: $0-50/month for most vector DBs. LLM calls: $0.01-0.10 per query. Total: very cheap for most use cases.
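The arithmetic is easy to sanity-check. The corpus size and query volume below are hypothetical; the unit prices are the ones quoted above ($0.02 per 1M embedding tokens, $0.05 per query as the midpoint of the $0.01-0.10 range).

```python
# Rough cost estimate for a hypothetical deployment
corpus_tokens = 10_000_000     # 10M tokens of documents, embedded once
queries_per_month = 5_000
avg_query_tokens = 50

embedding_cost = corpus_tokens / 1_000_000 * 0.02                        # one-time
query_embed_cost = queries_per_month * avg_query_tokens / 1_000_000 * 0.02
llm_cost = queries_per_month * 0.05                                      # per month

print(f"one-time embedding: ${embedding_cost:.2f}")   # $0.20
print(f"monthly LLM calls:  ${llm_cost:.2f}")          # $250.00
```

Embedding the corpus is nearly free; the LLM calls dominate, which is why trimming context (3-5 chunks, not 20) also trims the bill.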
Can RAG eliminate hallucinations?
It reduces them significantly but doesn't eliminate them. The LLM can still misinterpret context or fill gaps. Always include "based on the provided documents" instructions.
What embedding model should I use?
text-embedding-3-small (OpenAI) for most cases. text-embedding-3-large for maximum accuracy. Cohere embed-v3 for multilingual.
How do I keep RAG data fresh?
Re-embed documents when they change. Set up a pipeline: document update → re-chunk → re-embed → update vector DB.
Can I use RAG without a vector database?
Yes. For small datasets, compute embeddings on the fly and use cosine similarity. But vector databases are much faster at scale.
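The no-database version fits in a few lines. The 2-D vectors below are toy values for illustration (real embeddings have 1,536+ dimensions), but the scoring logic is the same.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, doc_vecs, k=3):
    """Brute-force retrieval: score every document, keep the k best.
    Fine for hundreds of docs; a vector DB does this with an index at scale."""
    scored = sorted(doc_vecs.items(),
                    key=lambda kv: cosine_similarity(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Toy 2-D embeddings standing in for real ones
docs = {"refunds": [0.9, 0.1], "shipping": [0.1, 0.9], "billing": [0.7, 0.3]}
print(top_k([1.0, 0.0], docs, k=2))  # ['refunds', 'billing']
```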
Bottom Line
RAG is the standard way to build AI apps that work with your data in 2026. It's cheaper than fine-tuning, more accurate than prompting alone, and works with any LLM. Start simple: chunk your docs, embed with OpenAI, store in pgvector, retrieve top 5, and generate.
The 80/20 of RAG: get chunking and retrieval right, and the generation takes care of itself.