
How to Build a RAG Chatbot (2026)

RAG (Retrieval-Augmented Generation) chatbots answer questions using your data. Instead of relying solely on the LLM's training data, RAG fetches relevant documents from your knowledge base and uses them to generate accurate, sourced answers.

How RAG Works

User asks: "What's the refund policy?"

1. RETRIEVE: Search your documents for "refund policy"
   → Finds: refund-policy.md (similarity score: 0.92)

2. AUGMENT: Add the retrieved document to the LLM prompt
   → "Using this context: [refund policy document], answer: What's the refund policy?"

3. GENERATE: LLM answers using the provided context
   → "Our refund policy allows full refunds within 30 days of purchase..."

Without RAG: LLM makes up an answer (or says "I don't know"). With RAG: LLM answers accurately from your actual documents.

Architecture

┌──────────────┐     ┌──────────────┐     ┌─────────────┐
│  User Query  │────▶│  Embedding   │────▶│  Vector DB  │
│              │     │  Model       │     │  (Search)   │
└──────────────┘     └──────────────┘     └──────┬──────┘
                                                 │
                                    Top K results│
                                                 │
┌──────────────┐     ┌──────────────┐     ┌──────▼──────┐
│  Response    │◀────│     LLM      │◀────│  Prompt +   │
│  to User     │     │  (Generate)  │     │  Context    │
└──────────────┘     └──────────────┘     └─────────────┘

Step 1: Prepare Your Data

What Data Works

| Source | Format | Notes |
|---|---|---|
| Documentation | Markdown, HTML | Best results — structured, clear |
| Knowledge base | Articles, FAQs | High-value for support chatbots |
| PDFs | Text-extractable | OCR for scanned documents |
| Notion/Confluence | Export as markdown | Clean up formatting |
| Slack/Discord | Message exports | Noisy — filter for quality |
| Code | Source files | For developer-facing chatbots |

Chunking Strategy

Documents need to be split into chunks for embedding:

// Simple: Fixed-size chunks with overlap
const chunks = splitText(document, {
  chunkSize: 500,     // ~500 tokens per chunk
  chunkOverlap: 50,   // 50 token overlap between chunks
});
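`splitText` above is a stand-in rather than a specific library function. A minimal sketch of what it might do, measuring size in characters rather than tokens (roughly 4 characters ≈ 1 token):

```typescript
// Sketch of fixed-size chunking with overlap. Sizes are in characters here;
// a real splitter would count tokens and try to break at sentence boundaries.
function splitText(
  text: string,
  opts: { chunkSize: number; chunkOverlap: number },
): string[] {
  const { chunkSize, chunkOverlap } = opts;
  const step = chunkSize - chunkOverlap; // how far each window advances
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

Production splitters (LangChain's text splitters, for example) layer more heuristics on top, but the sliding-window-with-overlap core is the same.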

Chunk size guidelines:

  • Too small (< 200 tokens): Loses context. Answers are fragmented.
  • Too large (> 1000 tokens): Retrieval is less precise. Wastes context window.
  • Sweet spot (300-500 tokens): Good balance of context and precision.

Better: Semantic chunking. Split at natural boundaries (headings, paragraphs) rather than fixed character counts.
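A minimal sketch of that idea (illustrative code, not a library API): split at blank lines, start a new chunk at each markdown heading, and merge small paragraphs up to a target size.

```typescript
// Semantic chunking sketch: paragraphs are the atoms, headings force a new
// chunk, and small paragraphs are merged until a rough target size (characters
// here; swap in a tokenizer to count tokens instead).
function semanticChunks(text: string, targetSize = 1500): string[] {
  const pieces = text
    .split(/\n\s*\n/)       // split at paragraph boundaries
    .map(p => p.trim())
    .filter(Boolean);

  const chunks: string[] = [];
  let current = '';
  for (const piece of pieces) {
    const isHeading = /^#{1,6}\s/.test(piece);
    const tooBig = current.length + piece.length > targetSize;
    if (current && (isHeading || tooBig)) {
      chunks.push(current);  // close the current chunk at a natural boundary
      current = piece;
    } else {
      current = current ? current + '\n\n' + piece : piece;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk now starts at a heading or paragraph boundary, so retrieved chunks read as coherent units instead of arbitrary slices.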

Step 2: Generate Embeddings

Embeddings convert text into numerical vectors that capture meaning:

import { embed } from 'ai';
import { openai } from '@ai-sdk/openai';

const { embedding } = await embed({
  model: openai.embedding('text-embedding-3-small'),
  value: 'Our refund policy allows returns within 30 days.',
});
// embedding: [0.023, -0.041, 0.018, ...] (1536 dimensions)

Embedding models:

| Model | Dimensions | Cost | Best For |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | $0.02/1M tokens | General purpose |
| OpenAI text-embedding-3-large | 3072 | $0.13/1M tokens | Maximum accuracy |
| Cohere embed-v3 | 1024 | $0.10/1M tokens | Multilingual |
| BGE (local) | 768 | Free | Privacy, cost-sensitive |
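The similarity scores these vectors produce (like the 0.92 in the example above) are typically cosine similarity between the query vector and each chunk vector. A self-contained sketch:

```typescript
// Cosine similarity: dot product of the two vectors divided by the product
// of their magnitudes. Ranges from -1 (opposite) to 1 (identical meaning).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

In production the vector database computes this for you; the function is mainly useful for debugging retrieval locally (e.g. checking why a chunk did or didn't match).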

Step 3: Store in Vector Database

Options

| Database | Type | Best For | Price |
|---|---|---|---|
| Pinecone | Managed | Production, scale | Free-$70/mo |
| Supabase pgvector | Managed | Already using Supabase | Included |
| Neon pgvector | Managed | Already using Neon | Included |
| Chroma | Local/self-hosted | Development, privacy | Free |
| Weaviate | Managed/self-hosted | Advanced search | Free-$25/mo |
| Qdrant | Managed/self-hosted | Performance | Free-$25/mo |

Example: Supabase pgvector

-- Enable the extension
create extension vector;

-- Create a table for documents
create table documents (
  id bigserial primary key,
  content text,
  metadata jsonb,
  embedding vector(1536)
);

-- Create an index for fast search
create index on documents using ivfflat (embedding vector_cosine_ops);

// Insert document chunks
for (const chunk of chunks) {
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: chunk.content,
  });
  
  await supabase.from('documents').insert({
    content: chunk.content,
    metadata: { source: chunk.source, page: chunk.page },
    embedding,
  });
}

Step 4: Build the Retrieval Pipeline

async function retrieveContext(query: string, topK = 5) {
  // 1. Embed the user's question
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: query,
  });
  
  // 2. Search for similar documents. match_documents is a Postgres function
  // you define yourself (the Supabase pgvector guide has the standard version)
  const { data: documents } = await supabase.rpc('match_documents', {
    query_embedding: embedding,
    match_threshold: 0.7,
    match_count: topK,
  });
  
  // 3. Return formatted context (empty string when nothing matched)
  return (documents ?? []).map(doc => doc.content).join('\n\n---\n\n');
}

Step 5: Generate the Answer

import { generateText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

async function answerQuestion(userQuery: string) {
  const context = await retrieveContext(userQuery);
  
  const { text } = await generateText({
    model: anthropic('claude-sonnet-4-20250514'),
    system: `You are a helpful assistant that answers questions based on the provided context.
      
      Rules:
      - Only answer based on the provided context
      - If the context doesn't contain the answer, say "I don't have information about that"
      - Cite which document the answer comes from
      - Be concise and direct`,
    prompt: `Context:\n${context}\n\nQuestion: ${userQuery}`,
  });
  
  return text;
}

Step 6: Add Chat Interface

Using Vercel AI SDK

// app/api/chat/route.ts
import { streamText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

export async function POST(req: Request) {
  const { messages } = await req.json();
  const lastMessage = messages[messages.length - 1].content;
  
  const context = await retrieveContext(lastMessage);
  
  const result = streamText({
    model: anthropic('claude-sonnet-4-20250514'),
    system: `Answer using this context:\n${context}`,
    messages,
  });
  
  return result.toDataStreamResponse();
}

// app/page.tsx
'use client';
import { useChat } from 'ai/react';

export default function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat();
  
  return (
    <div>
      {messages.map(m => (
        <div key={m.id}>
          <strong>{m.role}:</strong> {m.content}
        </div>
      ))}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} placeholder="Ask a question..." />
        <button type="submit">Send</button>
      </form>
    </div>
  );
}

Improving RAG Quality

1. Hybrid Search

Combine vector similarity with keyword search:

Final score = 0.7 × semantic_similarity + 0.3 × keyword_match (BM25)
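Sketched in code (the 0.7/0.3 weights are a starting point to tune, and the BM25 score is assumed to be normalized to [0, 1] by your keyword search before combining):

```typescript
// Weighted hybrid score. Cosine similarity is already bounded, but raw BM25
// scores are not — scale them (e.g. divide by the max BM25 score in the
// result set) before combining, or the keyword signal will dominate.
function hybridScore(
  semanticSimilarity: number, // cosine similarity in [0, 1]
  bm25Normalized: number,     // keyword score normalized to [0, 1]
  alpha = 0.7,                // weight on the semantic signal
): number {
  return alpha * semanticSimilarity + (1 - alpha) * bm25Normalized;
}
```

Rank candidates from both searches by this combined score, then pass the top K to the prompt as before.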

2. Re-ranking

After retrieval, re-rank results with a cross-encoder model for better relevance.

3. Query Expansion

Rephrase the user's question for better retrieval: "What's the return policy?" → Also search: "refund", "money back", "exchange"
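One common way to merge the result lists from the original and rephrased queries is reciprocal rank fusion; a sketch (document IDs as plain strings for illustration):

```typescript
// Reciprocal rank fusion: each result list votes 1 / (k + rank) for its
// documents, so documents that rank well across several query variants rise
// to the top. k = 60 is the conventional damping constant.
function reciprocalRankFusion(resultLists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of resultLists) {
    list.forEach((docId, rank) => {
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1]) // highest fused score first
    .map(([docId]) => docId);
}
```

Run retrieval once per query variant, fuse the lists, and take the top K of the fused ranking.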

4. Metadata Filtering

Filter results by category, date, or source before vector search. "What changed in the latest version?" → Filter: docs from last 30 days.
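A sketch of such a pre-filter over chunk metadata (the field names are illustrative; with pgvector you would express this as a filter on the `metadata` jsonb column instead):

```typescript
interface Chunk {
  content: string;
  metadata: { source: string; updatedAt: string }; // ISO date string
}

// Keep only chunks updated within the last `days` days, so the vector search
// runs over a smaller, more relevant candidate set.
function filterRecent(chunks: Chunk[], days: number, now = new Date()): Chunk[] {
  const cutoff = now.getTime() - days * 24 * 60 * 60 * 1000;
  return chunks.filter(c => new Date(c.metadata.updatedAt).getTime() >= cutoff);
}
```
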

5. Evaluation

Test with real questions. Track: answer accuracy, retrieval relevance, and user satisfaction.
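A minimal version of a retrieval-relevance check, assuming you keep a small test set of question → expected-document pairs; `retrieve` here stands in for whatever retrieval function you built in Step 4:

```typescript
interface EvalCase {
  question: string;
  expectedSource: string; // document that should appear in the results
}

// Hit rate: the fraction of test questions whose expected document shows up
// anywhere in the retrieved top-K sources. Track this as you tune chunking,
// thresholds, and hybrid-search weights.
function retrievalHitRate(
  cases: EvalCase[],
  retrieve: (question: string) => string[], // returns retrieved source names
): number {
  const hits = cases.filter(c => retrieve(c.question).includes(c.expectedSource));
  return hits.length / cases.length;
}
```

Answer accuracy and user satisfaction need human (or LLM-as-judge) review, but retrieval hit rate is cheap to compute on every change.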

FAQ

How much data do I need?

Even 10 documents make a useful RAG chatbot. Quality matters more than quantity. Start small, expand as you identify gaps.

How much does it cost to run?

Embedding 1,000 pages: ~$1 (one-time). Vector storage: $0-25/month. LLM per query: ~$0.01-0.05. For a chatbot handling 1,000 queries/month: ~$10-50/month total.

How do I keep the data updated?

Re-embed documents when they change. For frequently updated sources: schedule re-embedding (daily/weekly). For static docs: embed once.

Can RAG handle multiple languages?

Yes. Use multilingual embedding models (Cohere embed-v3). The retrieval works across languages — a question in English can retrieve documents in Spanish.

RAG vs fine-tuning?

RAG: best for factual Q&A from specific documents. Easy to update. Fine-tuning: best for changing the model's behavior or style. Expensive to update. For most use cases: RAG is the right choice.

Bottom Line

Building a RAG chatbot in 2026 is straightforward with modern tools. Vercel AI SDK + Supabase pgvector + Claude gives you a production-ready chatbot in an afternoon. The quality depends on: data preparation (clean, well-chunked documents), retrieval tuning (relevance thresholds, hybrid search), and prompt engineering (clear instructions to the LLM).

Build it today: Prepare 10-20 key documents → embed with OpenAI → store in Supabase pgvector → build chat UI with Vercel AI SDK. Total development time: 4-8 hours. Total cost: ~$20-50/month to run.
