Webhooks have a reliability problem that nobody talks about, because the failure mode is invisible.
When a REST API call fails, you know immediately. Your code throws an exception, you get a status code, you handle it. The feedback loop is tight and synchronous.
Webhooks are the opposite. You fire a POST request at someone else’s server and move on. Whether that request was received, processed, or quietly dropped into /dev/null — you find out later, if at all. Often you find out from a customer.
This isn’t a hypothetical risk. It’s the normal operating condition for most webhook integrations in production.
How webhook delivery actually fails
There are more ways for a webhook to fail than most developers account for when they implement them.
The endpoint is down. The most obvious failure. Your consumer’s server is restarting, their process crashed, they’re doing a deployment, their hosting provider is having an incident. You send the event. It goes nowhere.
The endpoint is slow. Your webhook fires; the consumer server receives it but takes 45 seconds to respond. Your timeout is 30 seconds. From your perspective, the request failed. From their perspective, they processed it successfully. Both sides have divergent state with no mechanism to reconcile it.
The endpoint returns a 2xx but didn’t process the event. Some servers respond 200 before actually processing the payload. The database write happens asynchronously. If the async job fails, the event is acknowledged but never handled. Your logs show successful delivery. Their system never saw it.
The payload is rejected for reasons unrelated to availability. Your event schema changed; their parser chokes on the new field. Your content-type header is wrong. Your payload exceeds their body size limit. All of these cause failures that look identical to connectivity failures from a monitoring perspective.
Signature verification fails silently. If you’re signing your payloads (you should be), and the consumer is verifying signatures (they should be), a mismatch causes a rejection. This is correct behavior — but a key rotation, a misconfiguration, or a subtle encoding difference can cause all your events to silently fail signature verification and be dropped.
Events arrive out of order. You send event A, then event B. B arrives first. If the consumer’s processing is order-dependent — and more consumer logic is order-dependent than developers assume — you’ve created a race condition that produces corrupted state.
The asymmetry that makes this hard
The fundamental problem is that webhook delivery is a fire-and-forget protocol designed for a world where “the consumer’s server is up” is a reasonable assumption.
In practice:
- Consumer servers go down during deployments
- Consumer endpoints have their own rate limits and queue depths
- Networks introduce delays that interact badly with timeouts
- Consumer codebases change and break compatibility
- None of this is visible to you from the sending side
You can implement retries. Most webhook systems do. But retries have their own problems: they exacerbate out-of-order delivery, they can cause duplicate processing if the consumer has at-least-once semantics, and they delay your own event processing pipeline.
And even with retries, if the consumer endpoint is down for longer than your retry window — say, a full hour during a major incident — those events are gone.
What actually helps
Several things make webhook delivery meaningfully more reliable. None of them are complicated; they’re just not defaults.
Sign every payload
Use HMAC-SHA256 with a shared secret to sign your payloads and include the signature in a header (X-Webhook-Signature or similar). Consumers can verify that events came from you and haven’t been tampered with.
The key management matters as much as the signing itself. You need a way to rotate secrets without dropping events during the transition. The standard approach is a short window where both the old and new signatures are accepted.
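A minimal sketch of both halves in Python, assuming hex-encoded HMAC-SHA256 digests and an illustrative list of currently-active secrets to model the rotation window (the helper names are not a standard):

```python
import hashlib
import hmac

def sign(payload: bytes, secret: bytes) -> str:
    # Hex-encoded HMAC-SHA256 digest of the raw request body.
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def verify(payload: bytes, signature: str, secrets: list[bytes]) -> bool:
    # During rotation the list holds both the old and the new secret,
    # so events signed with either are accepted.
    return any(
        hmac.compare_digest(sign(payload, s), signature)
        for s in secrets
    )
```

Note the use of `hmac.compare_digest` on the consumer side: a plain `==` comparison can leak timing information about how many leading characters of the signature matched.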
Include event metadata in every payload
Every event should carry at minimum:
{
  "event_id": "evt_01HX4K...",
  "event_type": "monitor.down",
  "created_at": "2026-03-04T14:22:31Z",
  "sequence": 1042,
  "payload": { ... }
}
event_id lets consumers deduplicate safely. created_at lets them detect out-of-order delivery. sequence (if you can provide it) gives consumers a way to detect gaps.
Without these, consumers have no good options when delivery problems occur.
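As an illustration of what the sequence field buys a consumer, here is a sketch of gap and reordering detection, assuming a per-endpoint monotonically increasing counter (the function and return values are hypothetical):

```python
def check_sequence(last_seen: int, incoming: int) -> str:
    # Classify an incoming event relative to the highest sequence
    # number processed so far for this endpoint.
    if incoming == last_seen + 1:
        return "ok"
    if incoming <= last_seen:
        return "duplicate_or_out_of_order"
    # Events last_seen+1 .. incoming-1 were never received;
    # the consumer can request a replay for that range.
    return "gap"
```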
Implement exponential backoff with jitter
A flat retry interval (retry every 30 seconds) creates thundering herd problems when a consumer comes back online after an outage. Exponential backoff with jitter spreads the load:
Attempt 1: immediate
Attempt 2: 30s + random(0–10s)
Attempt 3: 2m + random(0–30s)
Attempt 4: 10m + random(0–2m)
Attempt 5: 1h + random(0–10m)
This gives the consumer time to stabilize while avoiding a spike of simultaneous retries.
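The schedule above can be sketched in a few lines of Python; the base delays mirror the list, and the jitter caps are illustrative:

```python
import random

# (base delay, jitter cap) in seconds, one pair per attempt.
SCHEDULE = [
    (0, 0),        # attempt 1: immediate
    (30, 10),      # attempt 2: 30s + random(0-10s)
    (120, 30),     # attempt 3: 2m + random(0-30s)
    (600, 120),    # attempt 4: 10m + random(0-2m)
    (3600, 600),   # attempt 5: 1h + random(0-10m)
]

def retry_delay(attempt: int) -> float:
    # Attempts past the end of the schedule reuse the last entry.
    base, jitter = SCHEDULE[min(attempt - 1, len(SCHEDULE) - 1)]
    return base + random.uniform(0, jitter)
```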
Expose a delivery log
Consumers should be able to see the delivery history for their endpoint: what was sent, when, whether it succeeded, and what response was returned. This sounds optional. In practice it’s essential — it’s the only thing that lets consumers debug their own processing failures.
Monitor the endpoint
This is the step that’s most consistently skipped. You should be continuously checking that your consumers’ endpoints are reachable, independently of event delivery. If an endpoint has been returning 500s for the last hour, you know before the first retry attempt fails. You can alert the consumer. You can surface the problem in your dashboard.
Waiting for delivery failures to discover that an endpoint is down means your first signal is a missed event, not a degraded endpoint. By then you may already have a queue of failed events to replay.
On the consumer side
If you’re on the receiving end of webhooks, the situation is nearly symmetric: you’re equally blind to what the sender’s retry behavior looks like, and equally dependent on them to get it right.
A few things that help regardless of what the sender does:
Respond 200 fast, process asynchronously. Accept the event, write it to a queue, return 200 immediately. Do your processing in a separate worker. This decouples your processing latency from the sender’s timeout.
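A minimal sketch of the accept-then-process split, using an in-memory queue as a stand-in for something durable like Redis or SQS (`handle_webhook`, `process`, and `drain_once` are illustrative names):

```python
import json
import queue

# In-memory stand-in for a durable queue.
events: "queue.Queue[bytes]" = queue.Queue()

def handle_webhook(raw_body: bytes) -> int:
    # Accept and enqueue the raw bytes; return 200 before any parsing.
    events.put(raw_body)
    return 200

def process(event: dict) -> None:
    pass  # business logic runs here, off the request path

def drain_once() -> None:
    # A worker loops over this; a single step is shown for clarity.
    raw = events.get()
    process(json.loads(raw))
    events.task_done()
```

Note that even a malformed payload gets a 200 here: parsing failures surface in the worker, where they can be retried from the stored raw bytes instead of forcing the sender into its retry loop.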
Deduplicate on event_id. Assume you will receive duplicates. Every event handling path should be idempotent or explicitly deduplicated.
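A sketch of explicit deduplication, assuming an in-memory set for illustration; in production this would be a database table with a unique constraint or a Redis set with a TTL, and the check-and-insert must be atomic under concurrency (e.g. `INSERT ... ON CONFLICT` or `SETNX`):

```python
processed: set[str] = set()  # illustrative; use durable storage in production

def handle_once(event: dict) -> bool:
    # Returns True if the event was processed, False if it was a duplicate.
    if event["event_id"] in processed:
        return False
    processed.add(event["event_id"])
    # ... actual processing here ...
    return True
```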
Log raw payloads before processing. Before you parse and process an event, store the raw JSON somewhere. If your processing fails or your schema breaks, you can replay from the raw log. Without this, you lose the event.
Alert when you stop receiving events. If you typically receive N events per hour and you receive zero for an hour, something is wrong. This is as important as alerting on processing failures.
The invisible queue
The uncomfortable truth about webhooks in production is that at any given time, there’s likely a small queue of undelivered events somewhere in your system. Maybe it’s the three events that fired during last Thursday’s deployment. Maybe it’s a batch that hit a timeout window. Maybe it’s events that were delivered but silently rejected by a schema mismatch.
The systems that handle this well aren’t the ones that never have delivery problems. They’re the ones that detect them quickly, have enough metadata to diagnose them, and have a recovery path that doesn’t require manual intervention.
The ones that handle it poorly are the ones that find out from a customer that events stopped arriving three days ago.
Vigil monitors webhook endpoints continuously and alerts you the moment they become unreachable — before delivery failures pile up. See how it works.