How to Set Up AI Monitoring and Alerting (2026)
Running AI in production without monitoring is flying blind. You need to track: latency (is it fast enough?), cost (is it burning money?), quality (is it hallucinating?), and reliability (is it failing?). Here's how to set it all up.
What to Monitor
| Metric | Why It Matters | Alert Threshold |
|---|---|---|
| Latency (p50/p95/p99) | User experience | p95 > 5s |
| Token usage | Cost control | >2x daily average |
| Error rate | Reliability | >1% of requests |
| Cost per request | Budget | >$0.10 per request |
| Daily spend | Budget | >$50/day (adjust) |
| Hallucination rate | Quality | Manual review triggers |
| User feedback | Satisfaction | Thumbs-down >10% |
| Rate limit hits | Capacity | Any occurrence |
| Model availability | Uptime | Any 5xx errors |
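Two of these thresholds (cost per request, daily spend) need a per-model price table. A minimal sketch, with illustrative prices that are assumptions on my part — always check your provider's current pricing page:

```typescript
// Illustrative per-1M-token prices in USD; real prices change, look them up.
const PRICES: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 2.5, output: 10 },
  'claude-sonnet-4-20250514': { input: 3, output: 15 },
};

// Estimated dollar cost of a single request.
function estimateCost(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICES[model];
  if (!p) throw new Error(`No price entry for model: ${model}`);
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}
```

Feed the result into your cost-per-request and daily-spend checks; under these assumed prices, a request with 1,000 input and 500 output tokens on gpt-4o comes out to $0.0075.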
Architecture
┌──────────┐      ┌─────────────┐      ┌──────────────┐
│ Your App │─────▶│ AI Gateway  │─────▶│ LLM Provider │
│          │◀─────│  (logging)  │◀─────│ (OpenAI,     │
│          │      │             │      │  Anthropic)  │
└──────────┘      └──────┬──────┘      └──────────────┘
                         │
                  Logs + Metrics
                         │
              ┌──────────▼─────────┐
              │  Monitoring Stack  │
              │ (Helicone/Langfuse │
              │      /custom)      │
              └──────────┬─────────┘
                         │
                    ┌────▼─────┐
                    │  Alerts  │
                    │ (Slack,  │
                    │PagerDuty)│
                    └──────────┘
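Whatever stack you pick, the gateway's job reduces to wrapping each provider call with timing and a log sink. A dependency-free sketch — `logFn` here is a placeholder for whatever backend you choose (Helicone, Langfuse, Prometheus, or plain stdout):

```typescript
interface CallLog {
  model: string;
  latencyMs: number;
  ok: boolean;
}

// Wraps any provider call, timing it and logging success or failure.
async function withLogging<T>(
  model: string,
  fn: () => Promise<T>,
  logFn: (log: CallLog) => void,
): Promise<T> {
  const start = Date.now();
  try {
    const result = await fn();
    logFn({ model, latencyMs: Date.now() - start, ok: true });
    return result;
  } catch (err) {
    // Failed calls are logged too; error rate is a first-class metric.
    logFn({ model, latencyMs: Date.now() - start, ok: false });
    throw err;
  }
}
```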
Option 1: Helicone (Easiest)
Helicone is a proxy that sits between your app and the LLM. Setup is a base-URL swap plus one auth header:
import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: 'https://oai.helicone.ai/v1',
  defaultHeaders: {
    'Helicone-Auth': `Bearer ${process.env.HELICONE_API_KEY}`,
  },
});

// Use exactly as before — Helicone logs everything transparently
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello' }],
});
What you get:
- Every request logged (prompt, response, latency, tokens, cost)
- Dashboard with usage trends
- Cost tracking per user, feature, or model
- Rate limit monitoring
- Alerting on anomalies
Price: Free up to 100K requests/month. $20/mo for 1M+.
Helicone Custom Properties
Tag requests for granular monitoring:
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: userMessage }],
}, {
  headers: {
    'Helicone-Property-Feature': 'chat',
    'Helicone-Property-UserId': userId,
    'Helicone-Property-Environment': 'production',
  },
});
Now filter dashboards by feature, user, or environment. "How much is the chat feature costing?" → instant answer.
Option 2: Langfuse (Self-Hostable)
Langfuse provides tracing, evaluation, and monitoring. Open source.
import { Langfuse } from 'langfuse';

const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
});

// Create a trace for each user interaction
const trace = langfuse.trace({ name: 'chat', userId, metadata: { feature: 'support' } });

// Track the LLM call
const generation = trace.generation({
  name: 'llm-call',
  model: 'claude-sonnet-4-20250514',
  input: messages,
});

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024, // required by the Anthropic API
  messages,
});

generation.end({
  output: response.content,
  usage: {
    input: response.usage.input_tokens,
    output: response.usage.output_tokens,
  },
});

await langfuse.flushAsync();
What you get:
- Full request traces (multi-step chains, RAG pipelines)
- Evaluation framework (score outputs for quality)
- Cost tracking
- Latency monitoring
- Prompt management and versioning
- Self-hostable (Docker)
Price: Free (self-hosted). Cloud: free tier → $59/mo.
Option 3: Custom with Vercel AI SDK
If you're using the Vercel AI SDK, add telemetry:
import { generateText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

const start = Date.now();

const result = await generateText({
  model: anthropic('claude-sonnet-4-20250514'),
  prompt: userMessage,
  experimental_telemetry: { isEnabled: true },
});

// Log to your monitoring system
await logToMonitoring({
  model: 'claude-sonnet-4-20250514',
  inputTokens: result.usage.promptTokens,
  outputTokens: result.usage.completionTokens,
  latencyMs: Date.now() - start, // measure latency yourself; the SDK doesn't report it
  finishReason: result.finishReason,
  timestamp: new Date(),
});
Custom Metrics with Prometheus
import { Counter, Histogram, Gauge } from 'prom-client';

const aiRequestDuration = new Histogram({
  name: 'ai_request_duration_seconds',
  help: 'AI request duration',
  labelNames: ['model', 'feature'],
  buckets: [0.5, 1, 2, 5, 10, 30],
});

const aiTokensUsed = new Counter({
  name: 'ai_tokens_total',
  help: 'Total tokens used',
  labelNames: ['model', 'type'], // input/output
});

const aiErrorRate = new Counter({
  name: 'ai_errors_total',
  help: 'AI request errors',
  labelNames: ['model', 'error_type'],
});

const aiDailySpend = new Gauge({
  name: 'ai_daily_spend_dollars',
  help: 'Estimated daily AI spend',
});
Setting Up Alerts
Slack Alerts (Simple)
async function sendAlert(message: string, severity: 'warning' | 'critical') {
  await fetch(process.env.SLACK_WEBHOOK_URL!, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text: `${severity === 'critical' ? '🚨' : '⚠️'} AI Alert: ${message}`,
    }),
  });
}

// Check after each request
if (latencyMs > 5000) {
  await sendAlert(`High latency: ${latencyMs}ms on ${model}`, 'warning');
}
if (dailySpend > 50) {
  await sendAlert(`Daily spend exceeded $50: $${dailySpend.toFixed(2)}`, 'critical');
}
Alert Rules
| Condition | Severity | Action |
|---|---|---|
| p95 latency > 5s (5 min window) | Warning | Slack notification |
| Error rate > 5% (5 min window) | Critical | PagerDuty + Slack |
| Daily cost > 2x average | Warning | Slack + email |
| Daily cost > 5x average | Critical | Auto-disable non-essential features |
| Rate limit errors | Warning | Queue requests, notify team |
| Model provider outage | Critical | Switch to fallback model |
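The rules above translate directly into code. A sketch of a rule evaluator over a window of aggregated stats — the field names are my assumptions; adapt them to whatever your metrics layer actually produces:

```typescript
type Severity = 'warning' | 'critical';

interface WindowStats {
  p95LatencyMs: number;
  errorRate: number;    // 0..1 over the window
  dailyCost: number;    // dollars so far today
  avgDailyCost: number; // trailing daily average
}

interface Alert { message: string; severity: Severity; }

function evaluateAlertRules(s: WindowStats): Alert[] {
  const alerts: Alert[] = [];
  if (s.p95LatencyMs > 5000) {
    alerts.push({ message: `p95 latency ${s.p95LatencyMs}ms > 5s`, severity: 'warning' });
  }
  if (s.errorRate > 0.05) {
    alerts.push({ message: `error rate ${(s.errorRate * 100).toFixed(1)}% > 5%`, severity: 'critical' });
  }
  // Check the 5x rule first so a big spike isn't downgraded to a warning.
  if (s.dailyCost > 5 * s.avgDailyCost) {
    alerts.push({ message: 'daily cost > 5x average', severity: 'critical' });
  } else if (s.dailyCost > 2 * s.avgDailyCost) {
    alerts.push({ message: 'daily cost > 2x average', severity: 'warning' });
  }
  return alerts;
}
```

Run it on a schedule (e.g. every minute over the last 5 minutes of data) rather than per request.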
Cost Circuit Breaker
class CostCircuitBreaker {
  private dailySpend = 0;
  private day = new Date().toDateString();
  private readonly maxDailySpend: number;

  constructor(maxDaily: number) {
    this.maxDailySpend = maxDaily;
  }

  async check(estimatedCost: number): Promise<boolean> {
    this.resetIfNewDay();
    if (this.dailySpend + estimatedCost > this.maxDailySpend) {
      await sendAlert(`Cost limit reached: $${this.dailySpend.toFixed(2)}/$${this.maxDailySpend}`, 'critical');
      return false; // Block the request
    }
    return true;
  }

  record(actualCost: number) {
    this.resetIfNewDay();
    this.dailySpend += actualCost;
  }

  // Without a reset, the breaker would stay tripped forever after day one
  private resetIfNewDay() {
    const today = new Date().toDateString();
    if (today !== this.day) {
      this.day = today;
      this.dailySpend = 0;
    }
  }
}
Quality Monitoring
Automated Evaluation
// After each AI response, run a quick quality check
async function evaluateResponse(input: string, output: string) {
  const evaluation = await generateText({
    model: anthropic('claude-3-5-haiku-latest'),
    prompt: `Rate this AI response on a 1-5 scale for:
- Relevance (does it answer the question?)
- Accuracy (is the information correct?)
- Helpfulness (is it actionable?)

Input: ${input}
Output: ${output}

Return only JSON: { "relevance": N, "accuracy": N, "helpfulness": N }`,
  });

  // LLM output isn't guaranteed to be valid JSON; fail soft instead of crashing
  let scores;
  try {
    scores = JSON.parse(evaluation.text);
  } catch {
    await flagForReview(input, output, { parseError: true });
    return null;
  }

  if (scores.accuracy < 3 || scores.relevance < 3) {
    await flagForReview(input, output, scores);
  }
  return scores;
}
User Feedback Loop
// Track thumbs up/down (rating: 1 = thumbs up, 0 = thumbs down)
app.post('/api/feedback', async (req, res) => {
  const { traceId, rating, comment } = req.body;
  await db.insert('feedback', { traceId, rating, comment, timestamp: new Date() });

  // Alert if negative feedback rate is high
  const recentFeedback = await db.query(
    'SELECT AVG(rating) AS avg FROM feedback WHERE timestamp > NOW() - INTERVAL 1 HOUR'
  );
  if (recentFeedback.avg < 0.6) {
    await sendAlert('User satisfaction dropped below 60% in the last hour', 'warning');
  }

  res.status(204).end();
});
FAQ
Which monitoring tool should I start with?
Helicone for the fastest setup (one line of code). Langfuse if you want self-hosting or detailed traces. Custom Prometheus if you already have a monitoring stack.
How much does monitoring add to latency?
Helicone (proxy): adds ~10-20ms per request. Langfuse: effectively nothing, since events are batched and flushed in the background. Custom: depends on your implementation. Always log asynchronously.
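That last point deserves code: a fire-and-forget log call that can never add latency to, or crash, the request path. The `sink` parameter is a placeholder for whatever your monitoring backend exposes:

```typescript
// Fire-and-forget logging: not awaited, and sink failures are swallowed.
function logAsync(
  event: Record<string, unknown>,
  sink: (e: Record<string, unknown>) => Promise<void>,
): void {
  void sink(event).catch((err) => {
    console.error('monitoring sink failed:', err);
  });
}
```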
Should I log full prompts and responses?
In development/staging: yes, for debugging. In production: be careful with PII. Log metadata (tokens, latency, cost) always. Log content with PII redaction or only for flagged requests.
How do I monitor costs across multiple providers?
Use a unified logging layer that normalizes costs. Helicone supports multiple providers. Or build a cost lookup table: model → price per 1K tokens.
When should I use LLM-as-judge for quality?
For high-stakes applications (customer-facing, financial, medical). Run evaluation on a sample (10-20% of requests) to manage costs. Use a cheaper model (Haiku) as the judge to keep evaluation costs low.
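Sampling can be deterministic rather than random, so retries of the same request never get judged twice. A sketch using a cheap string hash on the trace ID (the 10% default mirrors the sample rate suggested above):

```typescript
// Deterministic sampling: a given traceId is always in or out of the sample.
function shouldEvaluate(traceId: string, sampleRate = 0.1): boolean {
  let hash = 0;
  for (const ch of traceId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit rolling hash
  }
  return (hash % 1000) / 1000 < sampleRate;
}
```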
Bottom Line
Start simple: Helicone proxy → instant visibility into latency, cost, and errors. Add alerts for: cost spikes, high latency, and error rate increases. Graduate to Langfuse for detailed tracing and quality evaluation as your AI usage grows.
Setup in 30 minutes: 1) Add Helicone proxy (5 min). 2) Set up Slack webhook for alerts (10 min). 3) Add cost circuit breaker (15 min). You now have visibility, alerting, and budget protection for your AI in production.