
How to Set Up AI Monitoring and Alerting (2026)

Running AI in production without monitoring is flying blind. You need to track: latency (is it fast enough?), cost (is it burning money?), quality (is it hallucinating?), and reliability (is it failing?). Here's how to set it all up.

What to Monitor

Metric                 | Why It Matters  | Alert Threshold
Latency (p50/p95/p99)  | User experience | p95 > 5s
Token usage            | Cost control    | >2x daily average
Error rate             | Reliability     | >1% of requests
Cost per request       | Budget          | >$0.10 per request
Daily spend            | Budget          | >$50/day (adjust to your scale)
Hallucination rate     | Quality         | Manual review triggers
User feedback          | Satisfaction    | Thumbs-down >10%
Rate limit hits        | Capacity        | Any occurrence
Model availability     | Uptime          | Any 5xx errors
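To alert on a percentile threshold like the p95 above, you need percentiles computed from recent latency samples. A minimal in-process sketch (the window size and the nearest-rank method are illustrative choices; at scale a real histogram such as Prometheus's is better):

```typescript
// Nearest-rank percentile over a rolling window of latency samples.
class LatencyTracker {
  private samples: number[] = [];

  constructor(private maxSamples = 1000) {}

  record(latencyMs: number) {
    this.samples.push(latencyMs);
    // Drop the oldest sample once the window is full
    if (this.samples.length > this.maxSamples) this.samples.shift();
  }

  percentile(p: number): number {
    const sorted = [...this.samples].sort((a, b) => a - b);
    const idx = Math.ceil((p / 100) * sorted.length) - 1;
    return sorted[Math.max(0, idx)];
  }
}
```

After each request, `record()` the latency and check `percentile(95) > 5000` to drive the p95 alert.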

Architecture

┌──────────┐     ┌─────────────┐     ┌──────────────┐
│  Your App │────▶│  AI Gateway │────▶│  LLM Provider│
│           │◀────│  (logging)  │◀────│  (OpenAI,    │
│           │     │             │     │   Anthropic) │
└──────────┘     └──────┬──────┘     └──────────────┘
                        │
                  Logs + Metrics
                        │
              ┌─────────▼──────────┐
              │  Monitoring Stack  │
              │  (Helicone/Langfuse│
              │   /custom)         │
              └─────────┬──────────┘
                        │
                   ┌────▼────┐
                   │ Alerts  │
                   │ (Slack, │
                   │  PD)    │
                   └─────────┘

Option 1: Helicone (Easiest)

Helicone is a proxy that sits between your app and the LLM. Setup is a one-line base URL change (plus an auth header):

import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  baseURL: 'https://oai.helicone.ai/v1',
  defaultHeaders: {
    'Helicone-Auth': `Bearer ${process.env.HELICONE_API_KEY}`,
  },
});

// Use exactly as before — Helicone logs everything transparently
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: 'Hello' }],
});

What you get:

  • Every request logged (prompt, response, latency, tokens, cost)
  • Dashboard with usage trends
  • Cost tracking per user, feature, or model
  • Rate limit monitoring
  • Alerting on anomalies

Price: Free up to 100K requests/month. $20/mo for 1M+.

Helicone Custom Properties

Tag requests for granular monitoring:

const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: [{ role: 'user', content: userMessage }],
}, {
  headers: {
    'Helicone-Property-Feature': 'chat',
    'Helicone-Property-UserId': userId,
    'Helicone-Property-Environment': 'production',
  },
});

Now filter dashboards by feature, user, or environment. "How much is the chat feature costing?" → instant answer.

Option 2: Langfuse (Self-Hostable)

Langfuse provides tracing, evaluation, and monitoring. Open source.

import { Langfuse } from 'langfuse';
import Anthropic from '@anthropic-ai/sdk';

const langfuse = new Langfuse({
  publicKey: process.env.LANGFUSE_PUBLIC_KEY,
  secretKey: process.env.LANGFUSE_SECRET_KEY,
});
const anthropic = new Anthropic({ apiKey: process.env.ANTHROPIC_API_KEY });

// Create a trace for each user interaction
const trace = langfuse.trace({ name: 'chat', userId, metadata: { feature: 'support' } });

// Track the LLM call
const generation = trace.generation({
  name: 'llm-call',
  model: 'claude-sonnet-4-20250514',
  input: messages,
});

const response = await anthropic.messages.create({
  model: 'claude-sonnet-4-20250514',
  max_tokens: 1024, // required by the Anthropic Messages API
  messages,
});

generation.end({ output: response.content, usage: {
  input: response.usage.input_tokens,
  output: response.usage.output_tokens,
}});

await langfuse.flush();

What you get:

  • Full request traces (multi-step chains, RAG pipelines)
  • Evaluation framework (score outputs for quality)
  • Cost tracking
  • Latency monitoring
  • Prompt management and versioning
  • Self-hostable (Docker)

Price: Free (self-hosted). Cloud: free tier → $59/mo.

Option 3: Custom with Vercel AI SDK

If you're using the Vercel AI SDK, add telemetry:

import { generateText } from 'ai';
import { anthropic } from '@ai-sdk/anthropic';

const start = Date.now();
const result = await generateText({
  model: anthropic('claude-sonnet-4-20250514'),
  prompt: userMessage,
  experimental_telemetry: { isEnabled: true },
});

// Log to your monitoring system (logToMonitoring is your own sink).
// Measure latency yourself; the result object doesn't expose it directly.
await logToMonitoring({
  model: 'claude-sonnet-4-20250514',
  inputTokens: result.usage.promptTokens, // field names vary by AI SDK version
  outputTokens: result.usage.completionTokens,
  latencyMs: Date.now() - start,
  finishReason: result.finishReason,
  timestamp: new Date(),
});

Custom Metrics with Prometheus

import { Counter, Histogram, Gauge } from 'prom-client';

const aiRequestDuration = new Histogram({
  name: 'ai_request_duration_seconds',
  help: 'AI request duration',
  labelNames: ['model', 'feature'],
  buckets: [0.5, 1, 2, 5, 10, 30],
});

const aiTokensUsed = new Counter({
  name: 'ai_tokens_total',
  help: 'Total tokens used',
  labelNames: ['model', 'type'], // input/output
});

const aiErrorRate = new Counter({
  name: 'ai_errors_total',
  help: 'AI request errors',
  labelNames: ['model', 'error_type'],
});

const aiDailySpend = new Gauge({
  name: 'ai_daily_spend_dollars',
  help: 'Estimated daily AI spend',
});
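These metrics only help if every AI call actually records them. One way to guarantee that is a small wrapper that times the call and reports through injected callbacks; the `AiMetrics` interface here is an illustrative seam, behind which you would call the prom-client metrics above (e.g. `aiRequestDuration.labels(model, feature).observe(seconds)`):

```typescript
// Instrumentation wrapper: times an async AI call and reports duration,
// token counts, and errors through injected callbacks.
interface AiMetrics {
  observeDuration: (seconds: number) => void;
  addTokens: (type: 'input' | 'output', n: number) => void;
  countError: (errorType: string) => void;
}

async function instrumented<T extends { usage: { input: number; output: number } }>(
  metrics: AiMetrics,
  call: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  try {
    const result = await call();
    metrics.addTokens('input', result.usage.input);
    metrics.addTokens('output', result.usage.output);
    return result;
  } catch (err) {
    metrics.countError(err instanceof Error ? err.name : 'unknown');
    throw err; // still recorded in `finally` below
  } finally {
    metrics.observeDuration((Date.now() - start) / 1000);
  }
}
```

Errors and successes both record a duration, so latency histograms stay honest even when the provider is failing.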

Setting Up Alerts

Slack Alerts (Simple)

async function sendAlert(message: string, severity: 'warning' | 'critical') {
  await fetch(process.env.SLACK_WEBHOOK_URL!, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      text: `${severity === 'critical' ? '🚨' : '⚠️'} AI Alert: ${message}`,
    }),
  });
}

// Check after each request
if (latencyMs > 5000) {
  await sendAlert(`High latency: ${latencyMs}ms on ${model}`, 'warning');
}
if (dailySpend > 50) {
  await sendAlert(`Daily spend exceeded $50: $${dailySpend.toFixed(2)}`, 'critical');
}

Alert Rules

Condition                        | Severity | Action
p95 latency > 5s (5 min window)  | Warning  | Slack notification
Error rate > 5% (5 min window)   | Critical | PagerDuty + Slack
Daily cost > 2x average          | Warning  | Slack + email
Daily cost > 5x average          | Critical | Auto-disable non-essential features
Rate limit errors                | Warning  | Queue requests, notify team
Model provider outage            | Critical | Switch to fallback model
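The "switch to a fallback model" action in the last row can be a simple try-in-order chain. A minimal sketch (the `LlmCall` signature and model list are placeholders for your own client code):

```typescript
// Try models in order; on failure, alert and fall through to the next.
type LlmCall = (model: string) => Promise<string>;

async function withFallback(
  models: string[],
  call: LlmCall,
  onFallback: (failed: string, next: string, err: unknown) => void = () => {},
): Promise<string> {
  let lastError: unknown;
  for (let i = 0; i < models.length; i++) {
    try {
      return await call(models[i]);
    } catch (err) {
      lastError = err;
      if (i + 1 < models.length) onFallback(models[i], models[i + 1], err);
    }
  }
  throw lastError; // every model failed
}
```

Wire `onFallback` to `sendAlert` so a provider outage pages you even while requests keep succeeding on the backup.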

Cost Circuit Breaker

class CostCircuitBreaker {
  private dailySpend = 0;
  private readonly maxDailySpend: number;

  constructor(maxDaily: number) {
    this.maxDailySpend = maxDaily;
  }

  async check(estimatedCost: number): Promise<boolean> {
    if (this.dailySpend + estimatedCost > this.maxDailySpend) {
      await sendAlert(`Cost limit reached: $${this.dailySpend.toFixed(2)}/$${this.maxDailySpend}`, 'critical');
      return false; // Block the request
    }
    return true;
  }

  record(actualCost: number) {
    this.dailySpend += actualCost;
  }

  // Call from a midnight cron/timer so the budget rolls over each day
  reset() {
    this.dailySpend = 0;
  }
}

Quality Monitoring

Automated Evaluation

// After each AI response, run a quick quality check
async function evaluateResponse(input: string, output: string) {
  const evaluation = await generateText({
    model: anthropic('claude-haiku-3'),
    prompt: `Rate this AI response on a 1-5 scale for:
      - Relevance (does it answer the question?)
      - Accuracy (is the information correct?)
      - Helpfulness (is it actionable?)
      
      Input: ${input}
      Output: ${output}
      
      Return JSON: { relevance: N, accuracy: N, helpfulness: N }`,
  });
  
  const scores = JSON.parse(evaluation.text);
  if (scores.accuracy < 3 || scores.relevance < 3) {
    await flagForReview(input, output, scores);
  }
  return scores;
}

User Feedback Loop

// Track thumbs up/down
app.post('/api/feedback', async (req, res) => {
  const { traceId, rating, comment } = req.body;
  await db.insert('feedback', { traceId, rating, comment, timestamp: new Date() });
  
  // Alert if negative feedback rate is high
  const recentFeedback = await db.query(
    'SELECT AVG(rating) as avg FROM feedback WHERE timestamp > NOW() - INTERVAL 1 HOUR'
  );
  if (recentFeedback.avg < 0.6) {
    await sendAlert('User satisfaction dropped below 60% in the last hour', 'warning');
  }
});

FAQ

Which monitoring tool should I start with?

Helicone for the fastest setup (one line of code). Langfuse if you want self-hosting or detailed traces. Custom Prometheus if you already have a monitoring stack.

How much does monitoring add to latency?

Helicone (proxy): ~10-20ms added. Langfuse (async logging): ~0ms (fires asynchronously). Custom: depends on implementation. Always log asynchronously.

Should I log full prompts and responses?

In development/staging: yes, for debugging. In production: be careful with PII. Log metadata (tokens, latency, cost) always. Log content with PII redaction or only for flagged requests.

How do I monitor costs across multiple providers?

Use a unified logging layer that normalizes costs. Helicone supports multiple providers. Or build a cost lookup table: model → price per 1K tokens.
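A minimal version of that lookup table might look like this (the prices are illustrative snapshots, not authoritative; always check each provider's current pricing page):

```typescript
// Per-1M-token prices in USD (illustrative; verify against provider pricing).
const PRICES: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 2.5, output: 10 },
  'claude-sonnet-4-20250514': { input: 3, output: 15 },
};

function estimateCost(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICES[model];
  if (!p) return 0; // unknown model: log it and treat as zero, or throw
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}
```

Feed the result into your spend gauge and the cost circuit breaker, and you get one normalized cost number across providers.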

When should I use LLM-as-judge for quality?

For high-stakes applications (customer-facing, financial, medical). Run evaluation on a sample (10-20% of requests) to manage costs. Use a cheaper model (Haiku) as the judge to keep evaluation costs low.
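A sampling gate can be as simple as hashing the trace id into [0, 1), which keeps the in-or-out decision deterministic per request rather than random on each check. A sketch (FNV-1a is one arbitrary hash choice; any stable hash works):

```typescript
// Deterministic sampling: the same trace id is always in or out of the
// sample, no matter where or how often the check runs.
function hashToUnit(id: string): number {
  let h = 2166136261; // FNV-1a 32-bit offset basis
  for (let i = 0; i < id.length; i++) {
    h ^= id.charCodeAt(i);
    h = Math.imul(h, 16777619); // FNV prime
  }
  return (h >>> 0) / 4294967296; // map to [0, 1)
}

function shouldEvaluate(traceId: string, sampleRate = 0.15): boolean {
  return hashToUnit(traceId) < sampleRate;
}
```

Gate the `evaluateResponse` call above with `shouldEvaluate(traceId, 0.15)` to judge roughly 15% of traffic.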

Bottom Line

Start simple: Helicone proxy → instant visibility into latency, cost, and errors. Add alerts for: cost spikes, high latency, and error rate increases. Graduate to Langfuse for detailed tracing and quality evaluation as your AI usage grows.

Setup in 30 minutes: 1) Add Helicone proxy (5 min). 2) Set up Slack webhook for alerts (10 min). 3) Add cost circuit breaker (15 min). You now have visibility, alerting, and budget protection for your AI in production.
