
How to Build a RAG Chatbot (2026)

RAG (Retrieval-Augmented Generation) chatbots answer questions using your data. Instead of relying solely on the LLM's training data, RAG fetches relevant documents from your knowledge base and uses them to generate accurate, sourced answers.

How RAG Works

User asks: "What's the refund policy?"

1. RETRIEVE: Search your documents for "refund policy"
   → Finds: refund-policy.md (similarity score: 0.92)

2. AUGMENT: Add the retrieved document to the LLM prompt
   → "Using this context: [refund policy document], answer: What's the refund policy?"

3. GENERATE: LLM answers using the provided context
   → "Our refund policy allows full refunds within 30 days of purchase..."

Without RAG: LLM makes up an answer (or says "I don't know"). With RAG: LLM answers accurately from your actual documents.

Architecture

┌──────────────┐     ┌──────────────┐     ┌─────────────┐
│  User Query  │────▶│  Embedding   │────▶│  Vector DB  │
│              │     │  Model       │     │  (Search)   │
└──────────────┘     └──────────────┘     └──────┬──────┘
                                                 │
                                    Top K results│
                                                 │
┌──────────────┐     ┌──────────────┐     ┌──────▼──────┐
│  Response    │◀────│     LLM      │◀────│  Prompt +   │
│  to User     │     │  (Generate)  │     │  Context    │
└──────────────┘     └──────────────┘     └─────────────┘

Step 1: Prepare Your Data

What Data Works

| Source | Format | Notes |
|---|---|---|
| Documentation | Markdown, HTML | Best results — structured, clear |
| Knowledge base | Articles, FAQs | High-value for support chatbots |
| PDFs | Text-extractable | OCR for scanned documents |
| Notion/Confluence | Export as markdown | Clean up formatting |
| Slack/Discord | Message exports | Noisy — filter for quality |
| Code | Source files | For developer-facing chatbots |

Chunking Strategy

Documents need to be split into chunks for embedding:

// Simple: Fixed-size chunks with overlap
const chunks = splitText(document, {
  chunkSize: 500,     // ~500 tokens per chunk
  chunkOverlap: 50,   // 50 token overlap between chunks
});
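`splitText` above is a stand-in rather than a specific library function. A minimal sketch of what it might do, measuring size in characters rather than tokens (roughly 4 characters ≈ 1 token):

```typescript
// Sketch of fixed-size chunking with overlap. Sizes are in characters here;
// a real splitter would count tokens and try to break at sentence boundaries.
function splitText(
  text: string,
  opts: { chunkSize: number; chunkOverlap: number },
): string[] {
  const { chunkSize, chunkOverlap } = opts;
  const step = chunkSize - chunkOverlap; // how far each window advances
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

Production splitters (LangChain's text splitters, for example) layer more heuristics on top, but the sliding-window-with-overlap core is the same.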

Chunk size guidelines:

  • Too small (< 200 tokens): Loses context. Answers are fragmented.
  • Too large (> 1000 tokens): Retrieval is less precise. Wastes context window.
  • Sweet spot (300-500 tokens): Good balance of context and precision.

Better: Semantic chunking. Split at natural boundaries (headings, paragraphs) rather than fixed character counts.
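A minimal sketch of that idea (illustrative code, not a library API): split at blank lines, start a new chunk at each markdown heading, and merge small paragraphs up to a target size.

```typescript
// Semantic chunking sketch: paragraphs are the atoms, headings force a new
// chunk, and small paragraphs are merged until a rough target size (characters
// here; swap in a tokenizer to count tokens instead).
function semanticChunks(text: string, targetSize = 1500): string[] {
  const pieces = text
    .split(/\n\s*\n/)       // split at paragraph boundaries
    .map(p => p.trim())
    .filter(Boolean);

  const chunks: string[] = [];
  let current = '';
  for (const piece of pieces) {
    const isHeading = /^#{1,6}\s/.test(piece);
    const tooBig = current.length + piece.length > targetSize;
    if (current && (isHeading || tooBig)) {
      chunks.push(current);  // close the current chunk at a natural boundary
      current = piece;
    } else {
      current = current ? current + '\n\n' + piece : piece;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk now starts at a heading or paragraph boundary, so retrieved chunks read as coherent units instead of arbitrary slices.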

Step 2: Generate Embeddings

Embeddings convert text into numerical vectors that capture meaning:

import { embed } from 'ai';
import { openai } from '@ai-sdk/openai';

const { embedding } = await embed({
  model: openai.embedding('text-embedding-3-small'),
  value: 'Our refund policy allows returns within 30 days.',
});
// embedding: [0.023, -0.041, 0.018, ...] (1536 dimensions)

Embedding models:

| Model | Dimensions | Cost | Best For |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | $0.02/1M tokens | General purpose |
| OpenAI text-embedding-3-large | 3072 | $0.13/1M tokens | Maximum accuracy |
| Cohere embed-v3 | 1024 | $0.10/1M tokens | Multilingual |
| BGE (local) | 768 | Free | Privacy, cost-sensitive |
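The similarity scores these vectors produce (like the 0.92 in the example above) are typically cosine similarity between the query vector and each chunk vector. A self-contained sketch:

```typescript
// Cosine similarity: dot product of the two vectors divided by the product
// of their magnitudes. Ranges from -1 (opposite) to 1 (identical meaning).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

In production the vector database computes this for you; the function is mainly useful for debugging retrieval locally (e.g. checking why a chunk did or didn't match).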

Step 3: Store in Vector Database

Options

| Database | Type | Best For | Price |
|---|---|---|---|
| Pinecone | Managed | Production, scale | Free-$70/mo |
| Supabase pgvector | Managed | Already using Supabase | Included |
| Neon pgvector | Managed | Already using Neon | Included |
| Chroma | Local/self-hosted | Development, privacy | Free |
| Weaviate | Managed/self-hosted | Advanced search | Free-$25/mo |
| Qdrant | Managed/self-hosted | Performance | Free-$25/mo |

Example: Supabase pgvector

-- Enable the extension
create extension vector;

-- Create a table for documents
create table documents (
  id bigserial primary key,
  content text,
  metadata jsonb,
  embedding vector(1536)
);

-- Create an index for fast search
create index on documents using ivfflat (embedding vector_cosine_ops);

// Insert document chunks
for (const chunk of chunks) {
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: chunk.content,
  });
  
  await supabase.from('documents').insert({
    content: chunk.content,
    metadata: { source: chunk.source, page: chunk.page },
    embedding,
  });
}

Step 4: Build the Retrieval Pipeline

async function retrieveContext(query: string, topK = 5) {
  // 1. Embed the user's question
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: query,
  });
  
  // 2. Search for similar documents. match_documents is a Postgres function
  // you define yourself (the Supabase pgvector guide has the standard version)
  const { data: documents } = await supabase.rpc('match_documents', {
    query_embedding: embedding,
    match_threshold: 0.7,
    match_count: topK,
  });
  
  // 3. Return formatted context (empty string when nothing matched)
  return (documents ?? []).map(doc => doc.content).join('\n\n---\n\n');
}

Step 5: Generate the Answer

import { generateText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

async function answerQuestion(userQuery: string) {
  const context = await retrieveContext(userQuery);
  
  const { text } = await generateText({
    model: anthropic('claude-sonnet-4-20250514'),
    system: `You are a helpful assistant that answers questions based on the provided context.
      
      Rules:
      - Only answer based on the provided context
      - If the context doesn't contain the answer, say "I don't have information about that"
      - Cite which document the answer comes from
      - Be concise and direct`,
    prompt: `Context:\n${context}\n\nQuestion: ${userQuery}`,
  });
  
  return text;
}

Step 6: Add Chat Interface

Using Vercel AI SDK

// app/api/chat/route.ts
import { streamText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

export async function POST(req: Request) {
  const { messages } = await req.json();
  const lastMessage = messages[messages.length - 1].content;
  
  const context = await retrieveContext(lastMessage);
  
  const result = streamText({
    model: anthropic('claude-sonnet-4-20250514'),
    system: `Answer using this context:\n${context}`,
    messages,
  });
  
  return result.toDataStreamResponse();
}

// app/page.tsx
'use client';
import { useChat } from 'ai/react';

export default function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat();
  
  return (
    <div>
      {messages.map(m => (
        <div key={m.id}>
          <strong>{m.role}:</strong> {m.content}
        </div>
      ))}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} placeholder="Ask a question..." />
        <button type="submit">Send</button>
      </form>
    </div>
  );
}

Improving RAG Quality

1. Hybrid Search

Combine vector similarity with keyword search:

Final score = 0.7 × semantic_similarity + 0.3 × keyword_match (BM25)
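Sketched in code (the 0.7/0.3 weights are a starting point to tune, and the BM25 score is assumed to be normalized to [0, 1] by your keyword search before combining):

```typescript
// Weighted hybrid score. Cosine similarity is already bounded, but raw BM25
// scores are not — scale them (e.g. divide by the max BM25 score in the
// result set) before combining, or the keyword signal will dominate.
function hybridScore(
  semanticSimilarity: number, // cosine similarity in [0, 1]
  bm25Normalized: number,     // keyword score normalized to [0, 1]
  alpha = 0.7,                // weight on the semantic signal
): number {
  return alpha * semanticSimilarity + (1 - alpha) * bm25Normalized;
}
```

Rank candidates from both searches by this combined score, then pass the top K to the prompt as before.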

2. Re-ranking

After retrieval, re-rank results with a cross-encoder model for better relevance.

3. Query Expansion

Rephrase the user's question for better retrieval: "What's the return policy?" → Also search: "refund", "money back", "exchange"
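One common way to merge the result lists from the original and rephrased queries is reciprocal rank fusion; a sketch (document IDs as plain strings for illustration):

```typescript
// Reciprocal rank fusion: each result list votes 1 / (k + rank) for its
// documents, so documents that rank well across several query variants rise
// to the top. k = 60 is the conventional damping constant.
function reciprocalRankFusion(resultLists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of resultLists) {
    list.forEach((docId, rank) => {
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([docId]) => docId);
}
```

Run retrieval once per query variant, fuse the lists, and take the top K of the fused ranking.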

4. Metadata Filtering

Filter results by category, date, or source before vector search. "What changed in the latest version?" → Filter: docs from last 30 days.
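A sketch of such a pre-filter over chunk metadata (the field names are illustrative; with pgvector you would express this as a filter on the `metadata` jsonb column instead):

```typescript
interface Chunk {
  content: string;
  metadata: { source: string; updatedAt: string }; // ISO date string
}

// Keep only chunks updated within the last `days` days, so the vector search
// runs over a smaller, more relevant candidate set.
function filterRecent(chunks: Chunk[], days: number, now = new Date()): Chunk[] {
  const cutoff = now.getTime() - days * 24 * 60 * 60 * 1000;
  return chunks.filter(c => new Date(c.metadata.updatedAt).getTime() >= cutoff);
}
```
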

5. Evaluation

Test with real questions. Track: answer accuracy, retrieval relevance, and user satisfaction.
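A minimal version of a retrieval-relevance check, assuming you keep a small test set of question → expected-document pairs; `retrieve` here stands in for whatever retrieval function you built in Step 4:

```typescript
interface EvalCase {
  question: string;
  expectedSource: string; // document that should appear in the results
}

// Hit rate: the fraction of test questions whose expected document shows up
// anywhere in the retrieved top-K sources. Track this as you tune chunking,
// thresholds, and hybrid-search weights.
function retrievalHitRate(
  cases: EvalCase[],
  retrieve: (question: string) => string[], // returns retrieved source names
): number {
  const hits = cases.filter(c => retrieve(c.question).includes(c.expectedSource));
  return hits.length / cases.length;
}
```

Answer accuracy and user satisfaction need human (or LLM-as-judge) review, but retrieval hit rate is cheap to compute on every change.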

FAQ

How much data do I need?

Even 10 documents make a useful RAG chatbot. Quality matters more than quantity. Start small, expand as you identify gaps.

How much does it cost to run?

Embedding 1,000 pages: ~$1 (one-time). Vector storage: $0-25/month. LLM per query: ~$0.01-0.05. For a chatbot handling 1,000 queries/month: ~$10-50/month total.

How do I keep the data updated?

Re-embed documents when they change. For frequently updated sources: schedule re-embedding (daily/weekly). For static docs: embed once.

Can RAG handle multiple languages?

Yes. Use multilingual embedding models (Cohere embed-v3). The retrieval works across languages — a question in English can retrieve documents in Spanish.

RAG vs fine-tuning?

RAG: best for factual Q&A from specific documents. Easy to update. Fine-tuning: best for changing the model's behavior or style. Expensive to update. For most use cases: RAG is the right choice.

Bottom Line

Building a RAG chatbot in 2026 is straightforward with modern tools. Vercel AI SDK + Supabase pgvector + Claude gives you a production-ready chatbot in an afternoon. The quality depends on: data preparation (clean, well-chunked documents), retrieval tuning (relevance thresholds, hybrid search), and prompt engineering (clear instructions to the LLM).

Build it today: Prepare 10-20 key documents → embed with OpenAI → store in Supabase pgvector → build chat UI with Vercel AI SDK. Total development time: 4-8 hours. Total cost: ~$20-50/month to run.
