
What is Retrieval Augmented Generation (RAG)? (2026)

RAG is the most practical way to make AI models work with your data. Instead of fine-tuning a model on your documents, you retrieve relevant information at query time and feed it to the model as context. Here's how it works and when to use it.

The Problem RAG Solves

LLMs like GPT-4 and Claude have a knowledge cutoff. They don't know about:

  • Your company's internal documents
  • Your product documentation
  • Your customer data
  • Anything after their training date

You could fine-tune a model on your data, but that's expensive and slow, and the model might still hallucinate.

RAG solves this by giving the model your data at query time. The model generates answers based on actual retrieved documents, not just its training data.

How RAG Works (Simple Version)

  1. User asks a question: "What's our refund policy?"
  2. System retrieves relevant documents: Searches your knowledge base for refund-related docs
  3. System sends question + documents to LLM: "Based on these documents, answer this question"
  4. LLM generates an answer: Grounded in your actual documents, with citations

User Question → Retrieve Documents → LLM + Documents → Answer

How RAG Works (Technical)

Step 1: Indexing (One-Time Setup)

Take your documents and convert them to vector embeddings:

Documents → Chunk into pieces → Generate embeddings → Store in vector database

Chunking: Split documents into overlapping pieces (e.g., 500 tokens each with 50 token overlap). Too large = irrelevant context. Too small = missing context.
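A fixed-size chunker with overlap takes only a few lines of Python. This sketch splits on whitespace words as a rough stand-in for tokens; a production pipeline would count real tokens with a tokenizer such as tiktoken:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks.

    Splits on whitespace words as a rough stand-in for tokens;
    a real pipeline would use a tokenizer (e.g. tiktoken).
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```

Because each chunk repeats the last 50 words of the previous one, a sentence that straddles a boundary still appears whole in at least one chunk.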

Embeddings: Convert text chunks to numbers (vectors) using an embedding model (OpenAI text-embedding-3-small, Cohere embed-v3). Similar text produces similar vectors.

Vector Database: Store embeddings in a vector database (Pinecone, Qdrant, Weaviate, pgvector, Chroma).

Step 2: Retrieval (Every Query)

When a user asks a question:

Query → Generate embedding → Search vector DB → Return top-K similar chunks

The vector database finds the most semantically similar document chunks to the question. This is called semantic search — it matches meaning, not just keywords.
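"Most similar" usually means highest cosine similarity between the query vector and each stored chunk vector. A minimal in-memory sketch of what the database computes (real vector databases use approximate indexes so they don't have to scan every vector):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=3):
    """Return indices of the k chunks most similar to the query."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scores.sort(key=lambda s: s[1], reverse=True)
    return [i for i, _ in scores[:k]]
```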

Step 3: Generation (Every Query)

Send the question + retrieved chunks to the LLM:

System: Answer based on the following context. If the answer isn't in the context, say so.

Context:
[Retrieved chunk 1]
[Retrieved chunk 2]
[Retrieved chunk 3]

User: What's our refund policy?

The LLM generates an answer grounded in the provided context.

Implementation Example

Here's a minimal RAG system using OpenAI + Pinecone:

from openai import OpenAI
from pinecone import Pinecone

# Assumes OPENAI_API_KEY and PINECONE_API_KEY are set in the environment,
# and that chunks were indexed with their raw text stored in metadata.
client = OpenAI()
pc = Pinecone()
index = pc.Index("my-docs")

def ask(question):
    # 1. Embed the question with the same model used at indexing time
    embedding = client.embeddings.create(
        input=question,
        model="text-embedding-3-small"
    ).data[0].embedding
    
    # 2. Search the vector database for the 5 most similar chunks
    results = index.query(vector=embedding, top_k=5, include_metadata=True)
    
    # 3. Build the context string from the retrieved chunks
    context = "\n\n".join(match.metadata["text"] for match in results.matches)
    
    # 4. Generate an answer grounded in that context
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": f"Answer based on this context. If the answer isn't in the context, say so.\n\n{context}"},
            {"role": "user", "content": question}
        ]
    )
    
    return response.choices[0].message.content

RAG vs Fine-Tuning

                 RAG                               Fine-Tuning
Cost             Low (embedding + retrieval)       High (training run)
Setup time       Hours                             Days to weeks
Data freshness   Real-time (update docs anytime)   Stale (re-train to update)
Hallucination    Lower (grounded in documents)     Higher (model may confabulate)
Best for         Q&A, search, support              Style/tone, specialized tasks

Use RAG when: You need factual answers from your data. Use fine-tuning when: You need the model to behave differently (style, format, domain expertise).

Vector Databases for RAG

Database   Type                  Free Tier           Best For
Pinecone   Managed               ✅ (100K vectors)   Easiest setup
Qdrant     Self-hosted / Cloud                       Performance
Weaviate   Self-hosted / Cloud                       Hybrid search
Chroma     Embedded (Python)     Open source         Prototyping
pgvector   Postgres extension    Part of Postgres    Already using Postgres

Recommendation: Start with pgvector if you already have Postgres (Neon, Supabase). Use Pinecone if you want managed infrastructure.

Chunking Strategies

How you split documents matters more than you think:

Fixed-Size Chunks

Split every 500 tokens. Simple but may break mid-sentence.

Recursive Character Splitting

Split by paragraph → sentence → character. Respects natural boundaries.

Semantic Chunking

Use embeddings to find natural topic boundaries. Most accurate but slowest.

Document-Aware Chunking

Split by headers (Markdown H2/H3), code blocks, or other structure. Best for technical docs.

Recommendation: Start with recursive character splitting (LangChain's default). Move to semantic chunking if retrieval quality is poor.
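To make the idea concrete, here's a simplified sketch of recursive splitting (not LangChain's actual implementation, just the core idea): try the coarsest separator first, and recurse into any piece that is still too large:

```python
def recursive_split(text, max_len=500, separators=("\n\n", "\n", ". ", " ")):
    """Split text by the coarsest separator that yields small-enough pieces."""
    if len(text) <= max_len:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            chunks, current = [], ""
            for part in parts:
                piece = part + sep
                # Flush the current chunk before it would exceed max_len
                if current and len(current) + len(piece) > max_len:
                    chunks.append(current.strip())
                    current = ""
                current += piece
            if current.strip():
                chunks.append(current.strip())
            # Recurse with finer separators into any piece still too large
            return [c for chunk in chunks
                    for c in recursive_split(chunk, max_len, separators)]
    # No separator matched: fall back to a hard character cut
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

Paragraph breaks are tried first, so chunks tend to align with natural boundaries; only oversized paragraphs get split at sentences or words.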

Advanced RAG Techniques

Hybrid Search

Combine vector search (semantic) with keyword search (BM25). Catches both meaning and exact terms.
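A common way to merge the two ranked result lists is Reciprocal Rank Fusion (RRF). Assuming you already have a list of document IDs from each search, a sketch:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one.

    Each document scores 1 / (k + rank) for every list it appears in;
    k=60 is the constant suggested in the original RRF paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that rank well in both lists rise to the top, without having to reconcile the incompatible score scales of BM25 and cosine similarity.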

Re-Ranking

Retrieve 20 results, then use a re-ranker model (Cohere Rerank, cross-encoder) to pick the best 5. Significantly improves accuracy.
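The pattern is the same whichever re-ranker you use: retrieve wide, score narrow. In this sketch, score is a placeholder for the cross-encoder or rerank API call:

```python
def rerank(query, chunks, score, keep=5):
    """Retrieve broadly, then keep only the highest-scoring chunks.

    `score(query, chunk)` stands in for a cross-encoder or rerank API,
    which reads query and chunk together and returns a relevance score.
    """
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:keep]
```

Cross-encoders are slower than vector search because they process each query-chunk pair jointly, which is exactly why you run them on 20 candidates rather than the whole corpus.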

Query Transformation

Rewrite the user's question before searching. "What's the cancellation policy?" → "refund policy cancellation terms conditions". Improves retrieval.

Multi-Query RAG

Generate multiple versions of the question, search each, combine results. Catches different phrasings.

Agentic RAG

Use an AI agent to decide what to search, evaluate results, and search again if needed. Most powerful but most complex.

Common Mistakes

1. Chunks Too Large

2,000 token chunks include too much irrelevant information. The LLM gets confused. Stick to 200-500 tokens.

2. No Overlap

Chunks without overlap lose context at boundaries. Use 10-20% overlap.

3. Ignoring Metadata

Filter by metadata (date, category, author) before vector search. "What changed last month?" should only search recent documents.
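Vector databases expose this as a filter argument on the query; the logic itself is just "narrow the candidates, then rank". An in-memory sketch (the dict layout and dot-product scoring here are illustrative assumptions):

```python
from datetime import date

def filtered_search(query_vec, chunks, min_date, k=3):
    """Restrict the candidate set by metadata before ranking by similarity.

    `chunks` is a list of dicts: {"vec": [...], "date": date, "text": str}.
    Real vector DBs apply the filter server-side before the vector scan.
    """
    def dot(a, b):
        # Dot product works as a similarity score for normalized embeddings
        return sum(x * y for x, y in zip(a, b))

    candidates = [c for c in chunks if c["date"] >= min_date]
    candidates.sort(key=lambda c: dot(query_vec, c["vec"]), reverse=True)
    return candidates[:k]
```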

4. No Evaluation

Build evaluation datasets. Measure retrieval accuracy (are the right chunks being found?) and answer quality (is the LLM answering correctly?).

5. Stuffing Too Much Context

Sending 20 chunks to the LLM dilutes relevance. 3-5 high-quality chunks beat 20 mediocre ones.

When to Use RAG

✅ Good Use Cases

  • Customer support chatbot (answer from knowledge base)
  • Internal document search (company wiki, policies)
  • Product documentation Q&A
  • Legal document analysis
  • Research paper synthesis

❌ Poor Use Cases

  • Creative writing (doesn't need retrieval)
  • Simple classification tasks (fine-tuning is better)
  • Real-time data (use API calls instead)
  • Very small datasets (<10 documents — just put them all in context)

FAQ

How much does RAG cost?

Embedding: ~$0.02 per million tokens. Storage: $0-50/month for most vector DBs. LLM calls: $0.01-0.10 per query. Total: very cheap for most use cases.

Can RAG eliminate hallucinations?

It reduces them significantly but doesn't eliminate them. The LLM can still misinterpret context or fill gaps. Always include "based on the provided documents" instructions.

What embedding model should I use?

text-embedding-3-small (OpenAI) for most cases. text-embedding-3-large for maximum accuracy. Cohere embed-v3 for multilingual.

How do I keep RAG data fresh?

Re-embed documents when they change. Set up a pipeline: document update → re-chunk → re-embed → update vector DB.
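One way to avoid re-embedding everything on every sync is to hash each chunk and only re-embed the ones whose hash changed. A sketch, assuming you store a content hash alongside each vector:

```python
import hashlib

def changed_chunks(chunks, stored_hashes):
    """Return IDs of chunks whose content changed since last indexing.

    `chunks` maps chunk ID -> current text;
    `stored_hashes` maps chunk ID -> SHA-256 of the text last embedded.
    """
    to_reembed = []
    for chunk_id, text in chunks.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if stored_hashes.get(chunk_id) != digest:
            to_reembed.append(chunk_id)
    return to_reembed
```

New chunks have no stored hash, so they are picked up automatically; unchanged chunks skip the embedding API entirely.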

Can I use RAG without a vector database?

Yes. For small datasets, compute embeddings on the fly and use cosine similarity. But vector databases are much faster at scale.

Bottom Line

RAG is the standard way to build AI apps that work with your data in 2026. It's cheaper than fine-tuning, more accurate than prompting alone, and works with any LLM. Start simple: chunk your docs, embed with OpenAI, store in pgvector, retrieve top 5, and generate.

The 80/20 of RAG: get chunking and retrieval right, and the generation takes care of itself.
