How to Build a RAG Chatbot (2026)
RAG (Retrieval-Augmented Generation) chatbots answer questions using your own data. Instead of relying solely on the LLM's training data, RAG fetches relevant documents from your knowledge base and passes them to the model, which generates accurate, sourced answers.
How RAG Works
User asks: "What's the refund policy?"
1. RETRIEVE: Search your documents for "refund policy"
→ Finds: refund-policy.md (similarity score: 0.92)
2. AUGMENT: Add the retrieved document to the LLM prompt
→ "Using this context: [refund policy document], answer: What's the refund policy?"
3. GENERATE: LLM answers using the provided context
→ "Our refund policy allows full refunds within 30 days of purchase..."
Without RAG: LLM makes up an answer (or says "I don't know"). With RAG: LLM answers accurately from your actual documents.
Architecture
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│ User Query  │────▶│  Embedding  │────▶│  Vector DB  │
│             │     │    Model    │     │  (Search)   │
└─────────────┘     └─────────────┘     └──────┬──────┘
                                               │
                                 Top K results │
                                               │
┌─────────────┐     ┌─────────────┐     ┌──────▼──────┐
│  Response   │◀────│     LLM     │◀────│  Prompt +   │
│  to User    │     │ (Generate)  │     │   Context   │
└─────────────┘     └─────────────┘     └─────────────┘
Step 1: Prepare Your Data
What Data Works
| Source | Format | Notes |
|---|---|---|
| Documentation | Markdown, HTML | Best results — structured, clear |
| Knowledge base | Articles, FAQs | High-value for support chatbots |
| PDFs | Text-extractable | OCR for scanned documents |
| Notion/Confluence | Export as markdown | Clean up formatting |
| Slack/Discord | Message exports | Noisy — filter for quality |
| Code | Source files | For developer-facing chatbots |
Chunking Strategy
Documents need to be split into chunks for embedding:
// Simple: fixed-size chunks with overlap. splitText here is illustrative;
// LangChain's RecursiveCharacterTextSplitter is a common real equivalent.
const chunks = splitText(document, {
  chunkSize: 500,   // ~500 tokens per chunk
  chunkOverlap: 50, // 50-token overlap so context isn't lost at chunk boundaries
});
Chunk size guidelines:
- Too small (< 200 tokens): Loses context. Answers are fragmented.
- Too large (> 1000 tokens): Retrieval is less precise. Wastes context window.
- Sweet spot (300-500 tokens): Good balance of context and precision.
Better: Semantic chunking. Split at natural boundaries (headings, paragraphs) rather than fixed character counts.
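A minimal sketch of that idea (the function name and character-based sizing are illustrative; a production chunker would count tokens with a real tokenizer):

```typescript
// Minimal semantic chunker: split on markdown headings and blank lines,
// then greedily merge pieces up to a target size. Sizes are in characters
// here for simplicity; swap in a tokenizer for token-accurate limits.
function semanticChunk(doc: string, targetSize = 1500): string[] {
  // Split at natural boundaries: a newline before a heading, or a paragraph break
  const pieces = doc
    .split(/\n(?=#{1,6} )|\n\s*\n/)
    .map((p) => p.trim())
    .filter(Boolean);

  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    if (current && current.length + piece.length > targetSize) {
      chunks.push(current); // current chunk is full; start a new one
      current = piece;
    } else {
      current = current ? current + "\n\n" + piece : piece;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Because splits happen at headings and paragraph breaks, each chunk stays a coherent unit of meaning instead of cutting mid-sentence.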
Step 2: Generate Embeddings
Embeddings convert text into numerical vectors that capture meaning:
import { embed } from 'ai';
import { openai } from '@ai-sdk/openai';
const { embedding } = await embed({
  model: openai.embedding('text-embedding-3-small'),
  value: 'Our refund policy allows returns within 30 days.',
});
// embedding: [0.023, -0.041, 0.018, ...] (1536 dimensions)
Embedding models:
| Model | Dimensions | Cost | Best For |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | $0.02/1M tokens | General purpose |
| OpenAI text-embedding-3-large | 3072 | $0.13/1M tokens | Maximum accuracy |
| Cohere embed-v3 | 1024 | $0.10/1M tokens | Multilingual |
| BGE (local) | 768 | Free | Privacy, cost-sensitive |
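Under the hood, similarity between two embeddings is typically measured with cosine similarity. A self-contained sketch of the computation the vector database performs for you:

```typescript
// Cosine similarity between two equal-length vectors.
// 1.0 = same direction (very similar text), 0 = unrelated, -1 = opposite.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];   // dot product accumulates agreement per dimension
    normA += a[i] * a[i]; // squared magnitude of a
    normB += b[i] * b[i]; // squared magnitude of b
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

The "similarity score: 0.92" in the example at the top of this article is exactly this number, computed between the query embedding and each stored chunk.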
Step 3: Store in Vector Database
Options
| Database | Type | Best For | Price |
|---|---|---|---|
| Pinecone | Managed | Production, scale | Free-$70/mo |
| Supabase pgvector | Managed | Already using Supabase | Included |
| Neon pgvector | Managed | Already using Neon | Included |
| Chroma | Local/self-hosted | Development, privacy | Free |
| Weaviate | Managed/self-hosted | Advanced search | Free-$25/mo |
| Qdrant | Managed/self-hosted | Performance | Free-$25/mo |
Example: Supabase pgvector
-- Enable the extension
create extension if not exists vector;
-- Create a table for documents
create table documents (
  id bigserial primary key,
  content text,
  metadata jsonb,
  embedding vector(1536)
);
-- Create an ANN index for fast approximate search
-- (build after loading data; tune lists to roughly sqrt(row count))
create index on documents using ivfflat (embedding vector_cosine_ops)
  with (lists = 100);
// Insert document chunks
for (const chunk of chunks) {
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: chunk.content,
  });

  await supabase.from('documents').insert({
    content: chunk.content,
    metadata: { source: chunk.source, page: chunk.page },
    embedding,
  });
}
Step 4: Build the Retrieval Pipeline
async function retrieveContext(query: string, topK = 5) {
  // 1. Embed the user's question
  const { embedding } = await embed({
    model: openai.embedding('text-embedding-3-small'),
    value: query,
  });

  // 2. Search for similar documents
  const { data: documents } = await supabase.rpc('match_documents', {
    query_embedding: embedding,
    match_threshold: 0.7,
    match_count: topK,
  });

  // 3. Return the retrieved chunks joined into one context string
  // (guard against a null result if the RPC returns no rows)
  return (documents ?? []).map((doc) => doc.content).join('\n\n---\n\n');
}
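Note that the `match_documents` RPC isn't built into Supabase; you define it yourself. A minimal version, following the pattern from Supabase's pgvector guide and assuming the `documents` table created earlier:

```sql
-- Similarity search function called via supabase.rpc('match_documents', ...)
-- pgvector's <=> operator is cosine distance, so similarity = 1 - distance
create or replace function match_documents (
  query_embedding vector(1536),
  match_threshold float,
  match_count int
)
returns table (id bigint, content text, metadata jsonb, similarity float)
language sql stable
as $$
  select
    documents.id,
    documents.content,
    documents.metadata,
    1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  where 1 - (documents.embedding <=> query_embedding) > match_threshold
  order by documents.embedding <=> query_embedding
  limit match_count;
$$;
```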
Step 5: Generate the Answer
import { generateText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
async function answerQuestion(userQuery: string) {
  const context = await retrieveContext(userQuery);

  const { text } = await generateText({
    model: anthropic('claude-sonnet-4-20250514'),
    system: `You are a helpful assistant that answers questions based on the provided context.
Rules:
- Only answer based on the provided context
- If the context doesn't contain the answer, say "I don't have information about that"
- Cite which document the answer comes from
- Be concise and direct`,
    prompt: `Context:\n${context}\n\nQuestion: ${userQuery}`,
  });

  return text;
}
Step 6: Add Chat Interface
Using Vercel AI SDK
// app/api/chat/route.ts
import { streamText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';
export async function POST(req: Request) {
  const { messages } = await req.json();
  const lastMessage = messages[messages.length - 1].content;
  const context = await retrieveContext(lastMessage);

  const result = streamText({
    model: anthropic('claude-sonnet-4-20250514'),
    system: `Answer using this context:\n${context}`,
    messages,
  });

  return result.toDataStreamResponse();
}
// app/page.tsx
'use client';
import { useChat } from 'ai/react';
export default function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat();

  return (
    <div>
      {messages.map((m) => (
        <div key={m.id}>
          <strong>{m.role}:</strong> {m.content}
        </div>
      ))}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} placeholder="Ask a question..." />
        <button type="submit">Send</button>
      </form>
    </div>
  );
}
Improving RAG Quality
1. Hybrid Search
Combine vector similarity with keyword search:
Final score = 0.7 × semantic_similarity + 0.3 × keyword_match (BM25)
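A sketch of the blending step, assuming you already have a per-document semantic score and keyword score (both normalized to 0-1; the names and 0.7/0.3 weights mirror the formula above):

```typescript
// A document with its two retrieval scores, already normalized to 0-1.
interface ScoredDoc {
  id: string;
  semantic: number; // vector similarity score
  keyword: number;  // BM25-style keyword match score
}

// Re-rank documents by a weighted blend of the two scores.
// semWeight = 0.7 gives semantic similarity the larger say.
function hybridRank(docs: ScoredDoc[], semWeight = 0.7): ScoredDoc[] {
  const kwWeight = 1 - semWeight;
  const score = (d: ScoredDoc) => semWeight * d.semantic + kwWeight * d.keyword;
  return [...docs].sort((a, b) => score(b) - score(a));
}
```

Hybrid search catches the cases pure vector search misses, such as exact product names or error codes that keyword matching handles better than embeddings.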
2. Re-ranking
After retrieval, re-rank results with a cross-encoder model for better relevance.
3. Query Expansion
Rephrase the user's question for better retrieval: "What's the return policy?" → Also search: "refund", "money back", "exchange"
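A naive dictionary-based sketch (the synonym map is made up for illustration; in practice you would usually ask the LLM itself to generate the variant queries):

```typescript
// Hypothetical synonym map: terms users type mapped to related search terms.
const synonyms: Record<string, string[]> = {
  return: ["refund", "money back", "exchange"],
  price: ["cost", "pricing", "fee"],
};

// Expand a query into the original plus related variants.
// Each variant is searched separately and the results are merged.
function expandQuery(query: string): string[] {
  const queries = [query];
  for (const [term, variants] of Object.entries(synonyms)) {
    if (query.toLowerCase().includes(term)) queries.push(...variants);
  }
  return queries;
}
```

Run retrieval once per expanded query, deduplicate the retrieved chunks by id, and pass the union to the LLM.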
4. Metadata Filtering
Filter results by category, date, or source before vector search. "What changed in the latest version?" → Filter: docs from last 30 days.
5. Evaluation
Test with real questions. Track: answer accuracy, retrieval relevance, and user satisfaction.
FAQ
How much data do I need?
Even 10 documents make a useful RAG chatbot. Quality matters more than quantity. Start small, expand as you identify gaps.
How much does it cost to run?
Embedding 1,000 pages (~500K tokens): well under $1 with text-embedding-3-small. Vector storage: $0-25/month. LLM per query: ~$0.01-0.05. A chatbot handling 1,000 queries/month runs roughly $10-50/month all-in.
How do I keep the data updated?
Re-embed documents when they change. For frequently updated sources: schedule re-embedding (daily/weekly). For static docs: embed once.
Can RAG handle multiple languages?
Yes. Use multilingual embedding models (Cohere embed-v3). The retrieval works across languages — a question in English can retrieve documents in Spanish.
RAG vs fine-tuning?
RAG: best for factual Q&A from specific documents. Easy to update. Fine-tuning: best for changing the model's behavior or style. Expensive to update. For most use cases: RAG is the right choice.
Bottom Line
Building a RAG chatbot in 2026 is straightforward with modern tools. Vercel AI SDK + Supabase pgvector + Claude gives you a production-ready chatbot in an afternoon. The quality depends on: data preparation (clean, well-chunked documents), retrieval tuning (relevance thresholds, hybrid search), and prompt engineering (clear instructions to the LLM).
Build it today: Prepare 10-20 key documents → embed with OpenAI → store in Supabase pgvector → build chat UI with Vercel AI SDK. Total development time: 4-8 hours. Total cost: ~$20-50/month to run.