
How to monitor an LLM-powered application

LLM applications fail in ways that traditional monitoring never catches. Here's what to watch, what to ignore, and how to build observability that actually helps.

When you deploy a traditional web application, monitoring is a solved problem. You watch error rates, response times, and uptime. A spike in 5xx errors means something broke. A latency increase means something’s slow. The signals are clear.

LLM-powered applications break this model entirely.

Your HTTP status codes can be green while your application is producing garbage. Your p99 latency can look fine while a specific model endpoint is degraded. Your users can be receiving responses — technically correct responses — that are wrong in ways no metric catches automatically.

This is a different kind of reliability problem, and it needs a different kind of monitoring.

What actually breaks in LLM applications

Before you can monitor something, you need to know what goes wrong. LLM applications have failure modes that fall into a few distinct categories.

Upstream provider failures. OpenAI, Anthropic, Cohere — all of them have outages, rate limit events, and model-specific degradations. These failures are often partial: a specific model endpoint can be down while the API as a whole appears operational. Your application might silently fall back to an older model, hit token limits, or start receiving degraded responses — all without a single 5xx to show for it.

Latency spikes. LLM inference is slow compared to traditional APIs. A normal response might take 800ms; under load or during provider degradation, that can stretch to 15 seconds. If you’re not tracking p95 and p99 latency separately from mean, you’ll miss the long tail entirely.

Context window saturation. Applications that stuff conversation history or retrieved documents into the context window will eventually hit limits. This causes silent truncation or hard errors depending on your implementation. Neither is obvious from outside the request.

Downstream failures. Most LLM applications are pipelines. The model calls a tool; the tool calls an API; the API calls a database. A failure three steps downstream looks identical to a model failure from the user’s perspective. Knowing where in the chain things broke is the difference between a quick fix and an hour of debugging.

Output quality drift. The hardest one to monitor. Model providers update their models. Your prompt that worked perfectly in December produces subtly different output in March. No metric captures this directly — you need evaluation pipelines or at minimum a way to flag and review low-confidence outputs.
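The context-window failure mode above is one you can guard against explicitly rather than discover in production. Here's a minimal sketch in Python, assuming a crude 4-characters-per-token heuristic (a real implementation would use your provider's tokenizer) and a hypothetical `MAX_CONTEXT_TOKENS` limit:

```python
MAX_CONTEXT_TOKENS = 8_000  # hypothetical model limit, check your provider's docs

def estimate_tokens(text: str) -> int:
    """Very rough token estimate: roughly 4 characters per token for English."""
    return len(text) // 4

def trim_history(messages: list[str], budget: int = MAX_CONTEXT_TOKENS) -> list[str]:
    """Drop the oldest messages until the estimated total fits the budget.

    Trimming explicitly (where you can log and count it) beats the silent
    truncation you get when the provider clips the context for you.
    """
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):  # keep the most recent messages first
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

Counting how often `trim_history` actually drops messages gives you a direct metric for how close your application runs to the context limit.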

What to actually monitor

Given those failure modes, here’s what produces signal worth acting on:

Provider API health

Treat your LLM provider endpoints like any other critical upstream dependency. Monitor:

  • The model endpoint you’re actually using, not just the API base URL
  • Response time percentiles (p50, p95, p99) — not averages
  • Error rates by error type (rate limit errors vs. timeout errors vs. model errors mean very different things)
  • Whether the endpoint is returning responses at all, from multiple geographic regions

Don’t rely on your provider’s status page. It’s almost always behind. By the time they post an incident, you’ve had degraded service for 20 minutes. Hit the endpoint yourself on a schedule.
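A scheduled probe of this kind can be sketched in a few lines of Python. The `degraded_after_ms` threshold is an illustrative assumption, and `call_model` stands in for a real completion request to the model endpoint you depend on:

```python
import time
from dataclasses import dataclass

@dataclass
class ProbeResult:
    ok: bool
    latency_ms: float
    status: str  # "healthy", "degraded", or "down"

def classify_probe(succeeded: bool, latency_ms: float,
                   degraded_after_ms: float = 5_000) -> ProbeResult:
    """Turn a single probe into a health classification.

    A slow-but-successful response counts as "degraded": for LLM endpoints
    this is the common failure mode, and a plain up/down check misses it.
    """
    if not succeeded:
        return ProbeResult(False, latency_ms, "down")
    if latency_ms > degraded_after_ms:
        return ProbeResult(True, latency_ms, "degraded")
    return ProbeResult(True, latency_ms, "healthy")

def probe(call_model) -> ProbeResult:
    """Time one real request; `call_model` is your actual API call."""
    start = time.monotonic()
    try:
        call_model()
        succeeded = True
    except Exception:
        succeeded = False
    latency_ms = (time.monotonic() - start) * 1000
    return classify_probe(succeeded, latency_ms)
```

Run this on a schedule (ideally from outside your own infrastructure) against the specific model endpoint, and record the three-way status over time rather than a binary up/down.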

Your application’s own endpoints

Separately from the model provider, monitor your own application:

  • The routes that users actually hit
  • End-to-end latency from request receipt to response delivery
  • Error rates at the application layer, not just the model layer

This separation matters. When your application latency spikes, you want to know immediately whether the spike is in the model call or in your own code.
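One way to get that separation is to time the model call and your own code independently inside each request. A sketch, where `call_model` and `postprocess` are hypothetical stand-ins for your handler's stages:

```python
import time

def timed_request(call_model, postprocess):
    """Run one request, returning (result, model_ms, own_code_ms).

    Emitting the two timings as separate metrics means a latency spike
    immediately points at either the model call or your own code.
    """
    t0 = time.monotonic()
    completion = call_model()
    t1 = time.monotonic()
    result = postprocess(completion)
    t2 = time.monotonic()
    model_ms = (t1 - t0) * 1000
    own_ms = (t2 - t1) * 1000
    return result, model_ms, own_ms
```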

Downstream dependencies

If your application uses tools, function calling, or retrieval (which most do), every external call is a potential failure point:

  • Any APIs your tools call
  • Your vector database or retrieval endpoint
  • Any webhook endpoints you deliver results to
  • Any storage endpoints used for context or memory

These should all be monitored on their own. When a tool call fails, you want to know if that tool’s upstream API was down independently — not try to infer it from your application logs.
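A simple registry of independent checks makes that attribution a one-lookup operation. A sketch, with illustrative dependency names — each check is any callable that raises on failure:

```python
def run_checks(checks: dict) -> dict:
    """Run every dependency check independently; a check passes if it
    doesn't raise. The result maps each dependency name to its health,
    so a failing tool call can be attributed directly instead of being
    inferred from application logs."""
    results = {}
    for name, check in checks.items():
        try:
            check()
            results[name] = True
        except Exception:
            results[name] = False
    return results
```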

Rate limits and quota

LLM API rate limits are enforced in ways that interact badly with application traffic patterns. A burst of user requests can exhaust your per-minute token quota even if your daily usage is fine. Monitor:

  • Rate limit error frequency over time
  • Whether rate limit errors are clustered (indicating a quota problem) or random (indicating overloaded endpoints)
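The clustered-vs-random distinction can be checked mechanically from the timestamps of rate limit errors. A sketch in Python, where the 60-second window and 50% threshold are illustrative assumptions to tune against your own traffic:

```python
def is_clustered(error_times: list[float], window_s: float = 60.0,
                 threshold: float = 0.5) -> bool:
    """True if any single sliding window holds more than `threshold` of
    all rate limit errors: a sign of quota exhaustion rather than random
    endpoint overload."""
    if len(error_times) < 2:
        return False
    times = sorted(error_times)
    n = len(times)
    start = 0
    for end in range(n):
        # shrink the window from the left until it spans <= window_s
        while times[end] - times[start] > window_s:
            start += 1
        if (end - start + 1) / n > threshold:
            return True
    return False
```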

What not to monitor

Not everything that seems useful is worth the noise. A few things that generate alerts without being actionable:

Token count per request. Interesting for cost tracking, nearly useless for reliability monitoring. High token counts don’t reliably predict failures.

Model response time in isolation. Meaningful only relative to your p99 baseline. A 3-second response from GPT-4 might be completely normal; a 3-second response from a call that’s usually 400ms is a problem.

Error rates without segmentation. An error rate metric that lumps rate limit errors, model errors, and network errors together will mislead you constantly. Segment them.
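Segmentation can be as simple as classifying each error before counting it. A sketch, using the common HTTP conventions for LLM APIs (429 for rate limits, 5xx for server-side errors) — verify the mapping against your provider's documentation:

```python
from collections import Counter

def classify_error(status_code, timed_out: bool = False) -> str:
    """Map one failed request to an error segment."""
    if timed_out:
        return "timeout"
    if status_code == 429:
        return "rate_limit"
    if status_code is not None and 500 <= status_code < 600:
        return "model_or_server"
    if status_code is not None and 400 <= status_code < 500:
        return "client"
    return "network"  # no HTTP status at all: connection-level failure

def segment(errors) -> Counter:
    """Count errors per segment instead of one lumped error rate.
    `errors` is a list of (status_code, timed_out) pairs."""
    return Counter(classify_error(code, to) for code, to in errors)
```

A rising `rate_limit` count calls for a quota change; a rising `model_or_server` count calls for a fallback or an incident. A lumped error rate can't tell you which.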

The pipeline problem

The most common mistake in LLM application monitoring is treating the model as the unit of observation. It isn’t. The pipeline is.

A typical RAG application, for example, has at least five distinct things that can fail:

User request
  → Query embedding (model call)
  → Vector search (database call)
  → Context assembly (your code)
  → Completion request (model call)
  → Response delivery (your application)

Monitoring any one of these tells you something. Monitoring all of them tells you where problems actually live. The goal is to get from “something’s wrong” to “this specific step in this specific pipeline is degraded” as fast as possible.
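Per-stage instrumentation is straightforward to sketch. The stage names below mirror the RAG pipeline above; the stage functions themselves are hypothetical stand-ins for your embedding call, vector search, and so on:

```python
import time
from typing import Callable

def run_pipeline(stages: list, payload):
    """Run named stages in order, recording per-stage latency in ms and,
    if a stage raises, which one failed. `stages` is a list of
    (name, callable) pairs; each callable takes and returns the payload.
    Returns (result, timings, failed_stage_name_or_None)."""
    timings: dict = {}
    for name, fn in stages:
        t0 = time.monotonic()
        try:
            payload = fn(payload)
        except Exception:
            timings[name] = (time.monotonic() - t0) * 1000
            return None, timings, name  # failure attributed to this stage
        timings[name] = (time.monotonic() - t0) * 1000
    return payload, timings, None
```

Emitting `timings` per stage to your metrics backend is what turns "something's wrong" into "vector search is degraded."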

Alerts worth setting

After you’ve set up monitoring, these are the thresholds worth alerting on:

  1. Provider endpoint unavailable — immediate alert, no delay
  2. Application p99 latency > 2× baseline for 5 minutes — something is degraded
  3. Error rate > 2% over any 10-minute window — needs investigation
  4. Rate limit errors > 10% of requests — quota problem, needs immediate response
  5. Any downstream tool/API unavailable — affects all requests that use that tool

The key is calibrating to your baseline. What’s slow for a search autocomplete is fast for a complex reasoning task. Know your normal numbers before you set thresholds.
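Rule 2 above, expressed as code. A sketch where the 2× multiplier comes from the list above and the sustained-window requirement keeps one slow scrape from paging anyone:

```python
def should_alert(p99_samples_ms: list, baseline_ms: float,
                 multiplier: float = 2.0) -> bool:
    """Alert only if every p99 sample in the observation window breaches
    baseline * multiplier, i.e. the degradation is sustained, not a blip.
    Pass in the samples covering your chosen window (e.g. 5 minutes)."""
    if not p99_samples_ms:
        return False
    return all(s > baseline_ms * multiplier for s in p99_samples_ms)
```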

Putting it together

A well-monitored LLM application looks like this:

  • The LLM provider endpoint is checked on a schedule from outside your infrastructure
  • Your own application endpoints are monitored end-to-end
  • Every external dependency (tools, retrieval, APIs) is monitored independently
  • Latency is tracked as percentiles, not averages
  • Errors are segmented by type
  • Alerts fire on what’s actionable, not on noise

This isn’t fundamentally different from monitoring any distributed system. What’s different is the failure modes — partial degradations, silent quality drops, quota exhaustion — that make the standard “is it up” check insufficient.

The applications that handle LLM infrastructure reliability well treat their model provider like any other critical external API: not trusted, actively monitored, and assumed to be occasionally degraded. The ones that struggle are the ones waiting for the provider’s status page to tell them something’s wrong.


Vigil monitors HTTP and MCP endpoints on a configurable schedule with real-time alerting. If you’re building LLM-powered applications and want to know when your upstream dependencies degrade before your users do, try it for free.
