What is Retrieval-Augmented Generation (RAG)? (2026)
RAG is the most practical way to make AI models work with your data. Instead of fine-tuning a model on your documents, you retrieve relevant information at query time and feed it to the model as context. Here's how it works and when to use it.
The Problem RAG Solves
LLMs like GPT-4 and Claude have a knowledge cutoff. They don't know about:
- Your company's internal documents
- Your product documentation
- Your customer data
- Anything after their training date
You could fine-tune a model on your data, but that's expensive, slow, and the model might still hallucinate.
RAG solves this by giving the model your data at query time. The model generates answers based on actual retrieved documents, not just its training data.
How RAG Works (Simple Version)
1. User asks a question: "What's our refund policy?"
2. System retrieves relevant documents: Searches your knowledge base for refund-related docs
3. System sends question + documents to LLM: "Based on these documents, answer this question"
4. LLM generates an answer: Grounded in your actual documents, with citations
User Question → Retrieve Documents → LLM + Documents → Answer
How RAG Works (Technical)
Step 1: Indexing (One-Time Setup)
Take your documents and convert them to vector embeddings:
Documents → Chunk into pieces → Generate embeddings → Store in vector database
Chunking: Split documents into overlapping pieces (e.g., 500 tokens each with 50 token overlap). Too large = irrelevant context. Too small = missing context.
Embeddings: Convert text chunks to numbers (vectors) using an embedding model (OpenAI text-embedding-3-small, Cohere embed-v3). Similar text produces similar vectors.
Vector Database: Store embeddings in a vector database (Pinecone, Qdrant, Weaviate, pgvector, Chroma).
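The chunking step above can be sketched in a few lines. This toy version splits on whitespace and counts words as a rough stand-in for model tokens (a real pipeline would count actual tokens, e.g. with a tokenizer); the 500/50 numbers mirror the example above.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks. Words stand in for tokens here;
    a production pipeline would count real model tokens with a tokenizer."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # final chunk reached; avoid a tiny trailing duplicate
    return chunks

chunks = chunk_text("word " * 1200, chunk_size=500, overlap=50)
print(len(chunks))  # 3 chunks: words 0-499, 450-949, 900-1199
```

Each chunk then goes through the embedding model and into the vector database, usually with the original text and any metadata stored alongside the vector.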
Step 2: Retrieval (Every Query)
When a user asks a question:
Query → Generate embedding → Search vector DB → Return top-K similar chunks
The vector database finds the most semantically similar document chunks to the question. This is called semantic search — it matches meaning, not just keywords.
Step 3: Generation (Every Query)
Send the question + retrieved chunks to the LLM:
System: Answer based on the following context. If the answer isn't in the context, say so.
Context:
[Retrieved chunk 1]
[Retrieved chunk 2]
[Retrieved chunk 3]
User: What's our refund policy?
The LLM generates an answer grounded in the provided context.
Implementation Example
Here's a minimal RAG system using OpenAI + Pinecone:
```python
from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()
pc = Pinecone()
index = pc.Index("my-docs")

def ask(question):
    # 1. Embed the question
    embedding = client.embeddings.create(
        input=question,
        model="text-embedding-3-small"
    ).data[0].embedding

    # 2. Search vector database
    results = index.query(vector=embedding, top_k=5, include_metadata=True)

    # 3. Build context from results
    context = "\n\n".join(match.metadata["text"] for match in results.matches)

    # 4. Generate answer
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"Answer based on this context:\n\n{context}"},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content
```
RAG vs Fine-Tuning
| | RAG | Fine-Tuning |
|---|---|---|
| Cost | Low (embedding + retrieval) | High (training run) |
| Setup time | Hours | Days-weeks |
| Data freshness | Real-time (update docs anytime) | Stale (re-train to update) |
| Hallucination | Lower (grounded in documents) | Higher (model may confabulate) |
| Best for | Q&A, search, support | Style/tone, specialized tasks |
Use RAG when: You need factual answers from your data. Use fine-tuning when: You need the model to behave differently (style, format, domain expertise).
Vector Databases for RAG
| Database | Type | Free Tier | Best For |
|---|---|---|---|
| Pinecone | Managed | ✅ (100K vectors) | Easiest setup |
| Qdrant | Self-hosted / Cloud | ✅ | Performance |
| Weaviate | Self-hosted / Cloud | ✅ | Hybrid search |
| Chroma | Embedded (Python) | Open source | Prototyping |
| pgvector | Postgres extension | Part of Postgres | Already using Postgres |
Recommendation: Start with pgvector if you already have Postgres (Neon, Supabase). Use Pinecone if you want managed infrastructure.
Chunking Strategies
How you split documents matters more than you think:
Fixed-Size Chunks
Split every 500 tokens. Simple but may break mid-sentence.
Recursive Character Splitting
Split by paragraph → sentence → character. Respects natural boundaries.
Semantic Chunking
Use embeddings to find natural topic boundaries. Most accurate but slowest.
Document-Aware Chunking
Split by headers (Markdown H2/H3), code blocks, or other structure. Best for technical docs.
Recommendation: Start with recursive character splitting (LangChain's default). Move to semantic chunking if retrieval quality is poor.
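Recursive splitting can be sketched without any library. This minimal version is in the spirit of LangChain's RecursiveCharacterTextSplitter, but simplified: it tries coarse separators first, falls back to finer ones only for oversized pieces, and (unlike the real splitter) does not merge small neighbors back together.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces under max_chars, preferring natural boundaries:
    paragraphs first, then lines, then sentences, then words."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: hard-split at the character limit
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if len(part) <= max_chars:
            if part.strip():
                pieces.append(part)
        else:
            pieces.extend(recursive_split(part, max_chars, rest))
    return pieces

doc = "Intro paragraph.\n\nA much longer second paragraph. " + "More detail. " * 30
chunks = recursive_split(doc)
```

The short intro paragraph stays whole, while the long paragraph is broken at sentence boundaries rather than mid-sentence.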
Advanced RAG Techniques
Hybrid Search
Combine vector search (semantic) with keyword search (BM25). Catches both meaning and exact terms.
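One common way to merge the two result lists is Reciprocal Rank Fusion (RRF), which needs only the rank positions, not comparable scores. The document IDs below are hypothetical; in practice the two rankings would come from your BM25 index and your vector database.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Combine several ranked result lists with Reciprocal Rank Fusion.
    Each ranking is a list of doc ids, best first. k=60 is the constant
    from the original RRF paper; it damps the influence of top ranks."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results: BM25 keyword ranking vs. vector (semantic) ranking
keyword = ["doc_refunds", "doc_shipping", "doc_faq"]
semantic = ["doc_returns", "doc_refunds", "doc_billing"]
fused = reciprocal_rank_fusion([keyword, semantic])
print(fused[0])  # doc_refunds: ranked well by both retrievers
```

A document that appears in both rankings accumulates score from each, so agreement between the retrievers floats it to the top.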
Re-Ranking
Retrieve 20 results, then use a re-ranker model (Cohere Rerank, cross-encoder) to pick the best 5. Significantly improves accuracy.
Query Transformation
Rewrite the user's question before searching. "What's the cancellation policy?" → "refund policy cancellation terms conditions". Improves retrieval.
Multi-Query RAG
Generate multiple versions of the question, search each, combine results. Catches different phrasings.
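The merge step is straightforward to sketch. Here the LLM rewriter and the vector search are injected as plain functions (hypothetical stand-ins) so the de-duplication logic runs on its own; in a real system `rewrite_fn` would call an LLM and `search_fn` would query the vector database.

```python
def multi_query_retrieve(question, rewrite_fn, search_fn, top_k=5):
    """Search with several rephrasings of the question and merge results,
    de-duplicating across queries while preserving first-seen order."""
    seen, merged = set(), []
    for query in [question] + rewrite_fn(question):
        for doc_id in search_fn(query):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged[:top_k]

# Hypothetical stand-ins for the LLM rewriter and the vector search
rewrites = lambda q: ["cancellation terms", "money-back guarantee"]
fake_search = {
    "What's the refund policy?": ["doc_refunds", "doc_faq"],
    "cancellation terms": ["doc_cancel", "doc_refunds"],
    "money-back guarantee": ["doc_guarantee"],
}.get
docs = multi_query_retrieve("What's the refund policy?", rewrites, fake_search)
```

Notice that `doc_refunds` appears for two different phrasings but is returned only once.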
Agentic RAG
Use an AI agent to decide what to search, evaluate results, and search again if needed. Most powerful but most complex.
Common Mistakes
1. Chunks Too Large
2,000 token chunks include too much irrelevant information. The LLM gets confused. Stick to 200-500 tokens.
2. No Overlap
Chunks without overlap lose context at boundaries. Use 10-20% overlap.
3. Ignoring Metadata
Filter by metadata (date, category, author) before vector search. "What changed last month?" should only search recent documents.
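The idea can be shown with an in-memory sketch: filter on metadata first, then score only the survivors. The `updated`/`category` schema here is hypothetical; managed vector databases expose the same pattern as a `filter` argument on the query.

```python
from datetime import date

def filtered_search(chunks, query_embedding, since=None, category=None, top_k=3):
    """Apply metadata filters BEFORE vector scoring, so a 'what changed
    last month?' question never surfaces stale documents."""
    candidates = [
        c for c in chunks
        if (since is None or c["updated"] >= since)
        and (category is None or c["category"] == category)
    ]
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    candidates.sort(key=lambda c: dot(c["embedding"], query_embedding), reverse=True)
    return candidates[:top_k]

chunks = [
    {"id": 1, "embedding": [1.0, 0.0], "updated": date(2026, 1, 20), "category": "policy"},
    {"id": 2, "embedding": [0.9, 0.1], "updated": date(2024, 3, 1), "category": "policy"},
]
recent = filtered_search(chunks, [1.0, 0.0], since=date(2026, 1, 1))
print([c["id"] for c in recent])  # [1] — the 2024 chunk never gets scored
```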
4. No Evaluation
Build evaluation datasets. Measure retrieval accuracy (are the right chunks being found?) and answer quality (is the LLM answering correctly?).
5. Stuffing Too Much Context
Sending 20 chunks to the LLM dilutes relevance. 3-5 high-quality chunks beat 20 mediocre ones.
When to Use RAG
✅ Good Use Cases
- Customer support chatbot (answer from knowledge base)
- Internal document search (company wiki, policies)
- Product documentation Q&A
- Legal document analysis
- Research paper synthesis
❌ Poor Use Cases
- Creative writing (doesn't need retrieval)
- Simple classification tasks (fine-tuning is better)
- Real-time data (use API calls instead)
- Very small datasets (<10 documents — just put them all in context)
FAQ
How much does RAG cost?
Embedding: ~$0.02 per million tokens. Storage: $0-50/month for most vector DBs. LLM calls: $0.01-0.10 per query. Total: very cheap for most use cases.
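The arithmetic is easy to sanity-check. The corpus size and query volume below are hypothetical; the unit prices are the ones quoted above ($0.02 per 1M embedding tokens, $0.05 per query as the midpoint of the $0.01-0.10 range).

```python
# Rough cost estimate for a hypothetical deployment
corpus_tokens = 10_000_000     # 10M tokens of documents, embedded once
queries_per_month = 5_000
avg_query_tokens = 50

embedding_cost = corpus_tokens / 1_000_000 * 0.02                        # one-time
query_embed_cost = queries_per_month * avg_query_tokens / 1_000_000 * 0.02
llm_cost = queries_per_month * 0.05                                      # per month

print(f"one-time embedding: ${embedding_cost:.2f}")   # $0.20
print(f"monthly LLM calls:  ${llm_cost:.2f}")          # $250.00
```

Embedding the corpus is nearly free; the LLM calls dominate, which is why trimming context (3-5 chunks, not 20) also trims the bill.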
Can RAG eliminate hallucinations?
It reduces them significantly but doesn't eliminate them. The LLM can still misinterpret context or fill gaps. Always include "based on the provided documents" instructions.
What embedding model should I use?
text-embedding-3-small (OpenAI) for most cases. text-embedding-3-large for maximum accuracy. Cohere embed-v3 for multilingual.
How do I keep RAG data fresh?
Re-embed documents when they change. Set up a pipeline: document update → re-chunk → re-embed → update vector DB.
Can I use RAG without a vector database?
Yes. For small datasets, compute embeddings on the fly and use cosine similarity. But vector databases are much faster at scale.
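The no-database version fits in a few lines. The 2-D vectors below are toy values for illustration (real embeddings have 1,536+ dimensions), but the scoring logic is the same.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, doc_vecs, k=3):
    """Brute-force retrieval: score every document, keep the k best.
    Fine for hundreds of docs; a vector DB does this with an index at scale."""
    scored = sorted(doc_vecs.items(),
                    key=lambda kv: cosine_similarity(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# Toy 2-D embeddings standing in for real ones
docs = {"refunds": [0.9, 0.1], "shipping": [0.1, 0.9], "billing": [0.7, 0.3]}
print(top_k([1.0, 0.0], docs, k=2))  # ['refunds', 'billing']
```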
Bottom Line
RAG is the standard way to build AI apps that work with your data in 2026. It's cheaper than fine-tuning, more accurate than prompting alone, and works with any LLM. Start simple: chunk your docs, embed with OpenAI, store in pgvector, retrieve top 5, and generate.
The 80/20 of RAG: get chunking and retrieval right, and the generation takes care of itself.