How to Set Up AI Monitoring and Alerting (2026)
Running AI in production without monitoring is flying blind. You need to track: latency (is it fast enough?), cost (is it burning money?), quality (is it hallucinating?), and reliability (is it failing?). Here's how to set it all up.
What to Monitor
| Metric | Why It Matters | Alert Threshold |
|---|---|---|
| Latency (p50/p95/p99) | User experience | p95 > 5s |
| Token usage | Cost control | >2x daily average |
| Error rate | Reliability | >1% of requests |
| Cost per request | Budget | >$0.10 per request |
| Daily spend | Budget | >$50/day (adjust) |
| Hallucination rate | Quality | Manual review triggers |
| User feedback | Satisfaction | Thumbs-down >10% |
| Rate limit hits | Capacity | Any occurrence |
| Model availability | Uptime | Any 5xx errors |
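Two of these thresholds (cost per request, daily spend) need a per-model price table. A minimal sketch, with illustrative prices that are assumptions on my part — always check your provider's current pricing page:

```typescript
// Illustrative per-1M-token prices in USD; real prices change, look them up.
const PRICES: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 2.5, output: 10 },
  'claude-sonnet-4-20250514': { input: 3, output: 15 },
};

// Estimated dollar cost of a single request.
function estimateCost(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICES[model];
  if (!p) throw new Error(`No price entry for model: ${model}`);
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}
```

Feed the result into your cost-per-request and daily-spend checks; under these assumed prices, a request with 1,000 input and 500 output tokens on gpt-4o comes out to $0.0075.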
Architecture
┌──────────┐      ┌─────────────┐      ┌──────────────┐
│ Your App │─────▶│ AI Gateway  │─────▶│ LLM Provider │
│          │◀─────│  (logging)  │◀─────│ (OpenAI,     │
│          │      │             │      │  Anthropic)  │
└──────────┘      └──────┬──────┘      └──────────────┘
                         │
                  Logs + Metrics
                         │
              ┌──────────▼─────────┐
              │  Monitoring Stack  │
              │ (Helicone/Langfuse │
              │      /custom)      │
              └──────────┬─────────┘
                         │
                    ┌────▼─────┐
                    │  Alerts  │
                    │ (Slack,  │
                    │PagerDuty)│
                    └──────────┘
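Whatever stack you pick, the gateway's job reduces to wrapping each provider call with timing and a log sink. A dependency-free sketch — `logFn` here is a placeholder for whatever backend you choose (Helicone, Langfuse, Prometheus, or plain stdout):

```typescript
interface CallLog {
  model: string;
  latencyMs: number;
  ok: boolean;
}

// Wraps any provider call, timing it and logging success or failure.
async function withLogging<T>(
  model: string,
  fn: () => Promise<T>,
  logFn: (log: CallLog) => void,
): Promise<T> {
  const start = Date.now();
  try {
    const result = await fn();
    logFn({ model, latencyMs: Date.now() - start, ok: true });
    return result;
  } catch (err) {
    // Failed calls are logged too; error rate is a first-class metric.
    logFn({ model, latencyMs: Date.now() - start, ok: false });
    throw err;
  }
}
```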
Option 1: Helicone (Easiest)
Helicone is a proxy that sits between your app and the LLM. Setup is a base-URL swap plus one auth header:
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: 'https://oai.helicone.ai/v1',
  defaultHeaders: {
    'Helicone-Auth': `Bearer ${process.env.HELICONE_API_KEY}`,
  },
});

// Use exactly as before — Helicone logs everything transparently
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello' }],
});
What you get:
- Every request logged (prompt, response, latency, tokens, cost)
- Dashboard with usage trends
- Cost tracking per user, feature, or model
- Rate limit monitoring
- Alerting on anomalies
Price: Free up to 100K requests/month. $20/mo for 1M+.
Helicone Custom Properties
Tag requests for granular monitoring:
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: userMessage }],
}, {
  headers: {
    'Helicone-Property-Feature': 'chat',
    'Helicone-Property-UserId': userId,
    'Helicone-Property-Environment': 'production',
  },
});
Now filter dashboards by feature, user, or environment. "How much is the chat feature costing?" → instant answer.
Option 2: Langfuse (Self-Hostable)
Langfuse provides tracing, evaluation, and monitoring. Open source.
import { Langfuse } from 'langfuse';

const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
});

// Create a trace for each user interaction
const trace = langfuse.trace({ name: 'chat', userId, metadata: { feature: 'support' } });

// Track the LLM call
const generation = trace.generation({
  name: 'llm-call',
  model: 'claude-sonnet-4-20250514',
  input: messages,
});

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024, // required by the Anthropic API
  messages,
});

generation.end({
  output: response.content,
  usage: {
    input: response.usage.input_tokens,
    output: response.usage.output_tokens,
  },
});

await langfuse.flushAsync();
What you get:
- Full request traces (multi-step chains, RAG pipelines)
- Evaluation framework (score outputs for quality)
- Cost tracking
- Latency monitoring
- Prompt management and versioning
- Self-hostable (Docker)
Price: Free (self-hosted). Cloud: free tier → $59/mo.
Option 3: Custom with Vercel AI SDK
If you're using the Vercel AI SDK, add telemetry:
import { generateText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

const start = Date.now();

const result = await generateText({
  model: anthropic('claude-sonnet-4-20250514'),
  prompt: userMessage,
  experimental_telemetry: { isEnabled: true },
});

// Log to your monitoring system
await logToMonitoring({
  model: 'claude-sonnet-4-20250514',
  inputTokens: result.usage.promptTokens,
  outputTokens: result.usage.completionTokens,
  latencyMs: Date.now() - start, // measure latency yourself; the SDK doesn't report it
  finishReason: result.finishReason,
  timestamp: new Date(),
});
Custom Metrics with Prometheus
import { Counter, Histogram, Gauge } from 'prom-client';

const aiRequestDuration = new Histogram({
  name: 'ai_request_duration_seconds',
  help: 'AI request duration',
  labelNames: ['model', 'feature'],
  buckets: [0.5, 1, 2, 5, 10, 30],
});

const aiTokensUsed = new Counter({
  name: 'ai_tokens_total',
  help: 'Total tokens used',
  labelNames: ['model', 'type'], // input/output
});

const aiErrorRate = new Counter({
  name: 'ai_errors_total',
  help: 'AI request errors',
  labelNames: ['model', 'error_type'],
});

const aiDailySpend = new Gauge({
  name: 'ai_daily_spend_dollars',
  help: 'Estimated daily AI spend',
});
Setting Up Alerts
Slack Alerts (Simple)
async function sendAlert(message: string, severity: 'warning' | 'critical') {
  await fetch(process.env.SLACK_WEBHOOK_URL!, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text: `${severity === 'critical' ? '🚨' : '⚠️'} AI Alert: ${message}`,
    }),
  });
}

// Check after each request
if (latencyMs > 5000) {
  await sendAlert(`High latency: ${latencyMs}ms on ${model}`, 'warning');
}
if (dailySpend > 50) {
  await sendAlert(`Daily spend exceeded $50: $${dailySpend.toFixed(2)}`, 'critical');
}
Alert Rules
| Condition | Severity | Action |
|---|---|---|
| p95 latency > 5s (5 min window) | Warning | Slack notification |
| Error rate > 5% (5 min window) | Critical | PagerDuty + Slack |
| Daily cost > 2x average | Warning | Slack + email |
| Daily cost > 5x average | Critical | Auto-disable non-essential features |
| Rate limit errors | Warning | Queue requests, notify team |
| Model provider outage | Critical | Switch to fallback model |
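The rules above translate directly into code. A sketch of a rule evaluator over a window of aggregated stats — the field names are my assumptions; adapt them to whatever your metrics layer actually produces:

```typescript
type Severity = 'warning' | 'critical';

interface WindowStats {
  p95LatencyMs: number;
  errorRate: number;    // 0..1 over the window
  dailyCost: number;    // dollars so far today
  avgDailyCost: number; // trailing daily average
}

interface Alert { message: string; severity: Severity; }

function evaluateAlertRules(s: WindowStats): Alert[] {
  const alerts: Alert[] = [];
  if (s.p95LatencyMs > 5000) {
    alerts.push({ message: `p95 latency ${s.p95LatencyMs}ms > 5s`, severity: 'warning' });
  }
  if (s.errorRate > 0.05) {
    alerts.push({ message: `error rate ${(s.errorRate * 100).toFixed(1)}% > 5%`, severity: 'critical' });
  }
  // Check the 5x rule first so a big spike isn't downgraded to a warning.
  if (s.dailyCost > 5 * s.avgDailyCost) {
    alerts.push({ message: 'daily cost > 5x average', severity: 'critical' });
  } else if (s.dailyCost > 2 * s.avgDailyCost) {
    alerts.push({ message: 'daily cost > 2x average', severity: 'warning' });
  }
  return alerts;
}
```

Run it on a schedule (e.g. every minute over the last 5 minutes of data) rather than per request.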
Cost Circuit Breaker
class CostCircuitBreaker {
  private dailySpend = 0;
  private day = new Date().toDateString();
  private readonly maxDailySpend: number;

  constructor(maxDaily: number) {
    this.maxDailySpend = maxDaily;
  }

  async check(estimatedCost: number): Promise<boolean> {
    this.resetIfNewDay();
    if (this.dailySpend + estimatedCost > this.maxDailySpend) {
      await sendAlert(`Cost limit reached: $${this.dailySpend.toFixed(2)}/$${this.maxDailySpend}`, 'critical');
      return false; // Block the request
    }
    return true;
  }

  record(actualCost: number) {
    this.resetIfNewDay();
    this.dailySpend += actualCost;
  }

  // Without a reset, the breaker would stay tripped forever after day one
  private resetIfNewDay() {
    const today = new Date().toDateString();
    if (today !== this.day) {
      this.day = today;
      this.dailySpend = 0;
    }
  }
}
Quality Monitoring
Automated Evaluation
// After each AI response, run a quick quality check
async function evaluateResponse(input: string, output: string) {
  const evaluation = await generateText({
    model: anthropic('claude-3-5-haiku-latest'),
    prompt: `Rate this AI response on a 1-5 scale for:
- Relevance (does it answer the question?)
- Accuracy (is the information correct?)
- Helpfulness (is it actionable?)

Input: ${input}
Output: ${output}

Return only JSON: { "relevance": N, "accuracy": N, "helpfulness": N }`,
  });

  // LLM output isn't guaranteed to be valid JSON; fail soft instead of crashing
  let scores;
  try {
    scores = JSON.parse(evaluation.text);
  } catch {
    await flagForReview(input, output, { parseError: true });
    return null;
  }

  if (scores.accuracy < 3 || scores.relevance < 3) {
    await flagForReview(input, output, scores);
  }
  return scores;
}
User Feedback Loop
// Track thumbs up/down (rating: 1 = thumbs up, 0 = thumbs down)
app.post('/api/feedback', async (req, res) => {
  const { traceId, rating, comment } = req.body;
  await db.insert('feedback', { traceId, rating, comment, timestamp: new Date() });

  // Alert if negative feedback rate is high
  const recentFeedback = await db.query(
    'SELECT AVG(rating) AS avg FROM feedback WHERE timestamp > NOW() - INTERVAL 1 HOUR'
  );
  if (recentFeedback.avg < 0.6) {
    await sendAlert('User satisfaction dropped below 60% in the last hour', 'warning');
  }

  res.status(204).end();
});
FAQ
Which monitoring tool should I start with?
Helicone for the fastest setup (one line of code). Langfuse if you want self-hosting or detailed traces. Custom Prometheus if you already have a monitoring stack.
How much does monitoring add to latency?
Helicone (proxy): adds ~10-20ms per request. Langfuse: effectively nothing, since events are batched and flushed in the background. Custom: depends on your implementation. Always log asynchronously.
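That last point deserves code: a fire-and-forget log call that can never add latency to, or crash, the request path. The `sink` parameter is a placeholder for whatever your monitoring backend exposes:

```typescript
// Fire-and-forget logging: not awaited, and sink failures are swallowed.
function logAsync(
  event: Record<string, unknown>,
  sink: (e: Record<string, unknown>) => Promise<void>,
): void {
  void sink(event).catch((err) => {
    console.error('monitoring sink failed:', err);
  });
}
```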
Should I log full prompts and responses?
In development/staging: yes, for debugging. In production: be careful with PII. Log metadata (tokens, latency, cost) always. Log content with PII redaction or only for flagged requests.
How do I monitor costs across multiple providers?
Use a unified logging layer that normalizes costs. Helicone supports multiple providers. Or build a cost lookup table: model → price per 1K tokens.
When should I use LLM-as-judge for quality?
For high-stakes applications (customer-facing, financial, medical). Run evaluation on a sample (10-20% of requests) to manage costs. Use a cheaper model (Haiku) as the judge to keep evaluation costs low.
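Sampling can be deterministic rather than random, so retries of the same request never get judged twice. A sketch using a cheap string hash on the trace ID (the 10% default mirrors the sample rate suggested above):

```typescript
// Deterministic sampling: a given traceId is always in or out of the sample.
function shouldEvaluate(traceId: string, sampleRate = 0.1): boolean {
  let hash = 0;
  for (const ch of traceId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit rolling hash
  }
  return (hash % 1000) / 1000 < sampleRate;
}
```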
Bottom Line
Start simple: Helicone proxy → instant visibility into latency, cost, and errors. Add alerts for: cost spikes, high latency, and error rate increases. Graduate to Langfuse for detailed tracing and quality evaluation as your AI usage grows.
Setup in 30 minutes: 1) Add Helicone proxy (5 min). 2) Set up Slack webhook for alerts (10 min). 3) Add cost circuit breaker (15 min). You now have visibility, alerting, and budget protection for your AI in production.