Debugging Webhook Delivery Failures: Hidden Pitfalls

You've set up a webhook endpoint, tested it with curl, and it works. But in production, webhooks disappear into a black hole. The sender logs show a 200 OK, yet your database never gets the data. This is the classic webhook debugging nightmare.

Most debugging guides focus on the obvious: check the HTTP status code, verify the payload format, look at the logs. But the real failures happen in layers that your application never sees. I've spent weeks tracing webhook drops that turned out to be DNS caching, reverse proxy timeouts, and even SSL session resumption bugs.

The HTTP Handshake That Never Happened: DNS and SSL

Your webhook receiver's DNS record changed, but the sender's DNS resolver cached the old IP for 24 hours. The webhook goes to a dead server. You see nothing in your logs because the request never reached your application.

I once saw a webhook pipeline where our DNS TTL was set to 300 seconds, but the sender's infrastructure ignored it and cached the record for 24 hours. We lost webhooks for a day after a DNS change. The fix: put the endpoint behind a CDN or load balancer with a static IP, and keep the old server alive until it has actually stopped receiving traffic, not just until the TTL expires.

warning

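Always set DNS TTLs to 60 seconds or less for webhook endpoints. Never decommission the old server until the old TTL has passed and its access logs show that traffic has stopped, because some senders cache beyond the TTL. Better yet, use a load balancer with a static IP so backend changes never require a DNS change.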

Check DNS resolution from the sender's network. If you see a stale IP, you've found your problem.

dig +short webhook.example.com
# 203.0.113.10 (old IP, still cached)
# Expected: 198.51.100.20
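
The same check can be scripted so it runs on a schedule from a host in (or near) the sender's network. What follows is a minimal sketch, assuming Node.js; the hostname and expected IP are the placeholder values from the dig example, and findStaleAddresses and checkDns are hypothetical helper names rather than part of any library. It reports what the host's configured resolver returns, which is not necessarily what the sender's own application has cached in-process.

```javascript
const dns = require('node:dns/promises');

// Pure helper: which resolved records point at an address we do not expect?
function findStaleAddresses(records, expected) {
  const ok = new Set(expected);
  return records.filter((r) => !ok.has(r.address));
}

// Resolves A records together with their remaining TTL.
// `resolver` is injectable so the logic can be exercised without network access.
async function checkDns(hostname, expected, resolver = dns) {
  const records = await resolver.resolve4(hostname, { ttl: true });
  const stale = findStaleAddresses(records, expected);
  return { records, stale, healthy: stale.length === 0 };
}

// Example usage (placeholder hostname and IP):
// checkDns('webhook.example.com', ['198.51.100.20']).then(console.log);
```

A non-empty stale list long after the TTL should have expired suggests a resolver that is caching beyond the TTL.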

SSL Session Resumption

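Another silent killer: SSL/TLS session resumption failures. Some senders reuse TLS sessions, and if your server's session cache has evicted the session or the session ticket key has rotated, the resumption attempt is refused. A well-behaved stack then falls back to a full handshake, which costs extra round trips and CPU. Not every client or middlebox handles that fallback gracefully, though: in the bad cases the sender sees a connection reset and silently retries, and if the retry also fails, the webhook is lost.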

I've seen cases where the SSL session cache was too small and evicted old sessions. After a spike in traffic, every new connection required a full handshake, which increased latency. Senders timed out and dropped the webhook.

Reverse Proxy Timeouts: The Silent Drop

The sender says the webhook got a 200, yet your data is duplicated or your app logs show a broken-socket error. What happened? The request hit your reverse proxy (Nginx, HAProxy, AWS ALB), which forwarded it to your app. The app was slow, so the proxy gave up waiting and returned a 504 to the sender. The app, unaware, finished processing and wrote to the database. The sender retried, the retry completed in time and got a 200, and the app processed the same event a second time. Now you have duplicate data, and the sender's summary shows only the final 200.

It can be worse: when the proxy times out, it closes its connection to the app. The app still commits the database write, but when it tries to write the response, the socket is gone, so it logs a write error or, if the error is unhandled, crashes. From the app's side the request failed; from the database's side it succeeded; from the sender's side the first attempt was a 504 and only the retry returned 200. Unless you line up all three sets of logs by timestamp and request ID, the story makes no sense.

info

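Configure your reverse proxy's upstream read timeout (proxy_read_timeout in Nginx) to be longer than your app's maximum processing time. For Nginx: proxy_read_timeout 30s; (the default is 60s, so set it deliberately rather than inheriting whatever is in place). Also, ensure your app returns a response quickly (within 5 seconds) and processes asynchronously.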

Nginx configuration with explicit timeouts and buffer sizes for webhook endpoints.

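location /webhook {
    # "backend" must match an upstream block (or resolvable host) defined elsewhere
    proxy_pass http://backend;
    # Max gap between two successive reads from the app, not the total response time
    proxy_read_timeout 30s;
    # Max gap between two successive writes to the app
    proxy_send_timeout 30s;
    # Buffers for the app's response headers and body
    proxy_buffer_size 128k;
    proxy_buffers 4 256k;
    proxy_busy_buffers_size 256k;
}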
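
Raising the proxy timeout only buys time; the sturdier half of the advice above is to acknowledge quickly and do the slow work afterwards. Here is a minimal sketch of that pattern, assuming an Express-style req/res pair and an in-memory array standing in for a durable queue (a real deployment would want a queue that survives restarts). The names createAckFastHandler, drain, and processJob are hypothetical. It responds with 202; most senders treat any 2xx as success, but that is worth confirming against your sender's documentation.

```javascript
// Returns an Express-style handler plus a drain() function for a worker loop.
function createAckFastHandler(processJob, onError = console.error) {
  const queue = []; // stand-in for a durable queue; contents are lost on restart

  function handler(req, res) {
    // Enqueue and acknowledge immediately; no slow work on the request path.
    queue.push({ receivedAt: Date.now(), body: req.body });
    res.status(202).json({ status: 'accepted' });
  }

  // Processes queued jobs one at a time; resolves to the number that succeeded.
  async function drain() {
    let done = 0;
    while (queue.length > 0) {
      const job = queue.shift();
      try {
        await processJob(job.body);
        done += 1;
      } catch (err) {
        onError(err, job); // a failed job needs its own logging and retry policy
      }
    }
    return done;
  }

  return { handler, drain, pending: () => queue.length };
}
```

With this split, the time the sender waits no longer depends on how long processing takes, so the proxy timeout stops being the thing that decides whether a retry happens.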

Idempotency Keys: Not Just for Deduplication

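Every webhook sender should include a stable identifier for each event that can serve as an idempotency key: a dedicated header in the style of the Idempotency-Key header used by Stripe's API, or a unique event ID in the payload. But many don't. If your webhook receiver does not require and check such a key, you're setting yourself up for duplicates when retries happen.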

But idempotency keys also help in debugging. When a webhook is delivered multiple times, you can correlate the logs by the key. You can see how many times it was delivered and when. Without it, you're guessing.

The Case of the Duplicate Payment Webhooks

14:23 Sender sends payment webhook (idempotency key: abc123). Receiver returns 200 after 3 seconds.
14:24 Sender never receives that response (TCP reset) and retries with the same idempotency key.
14:24 Second request arrives while the first is still being processed. Receiver's idempotency check returns 409 Conflict; sender interprets 409 as a failure and retries again.
14:25 Third request arrives after the first has finished. Receiver finds no idempotency record (apparently it was held in memory and lost on a restart) and processes the event again. Duplicate payment created.

Lesson

Idempotency keys must be stored durably (e.g., in a database) and checked before processing. The response to a duplicate should be 200 with the original result, not 409. Also, ensure your idempotency check is atomic.
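
The lesson can be sketched in a few lines. This is an illustration under stated assumptions, not a production implementation: a Map stands in for a database table, and createIdempotentProcessor and processEvent are hypothetical names. In a real system the claim step would need to be a single atomic database operation (for example an insert guarded by a unique constraint), and the stored result would need to survive restarts.

```javascript
// In-memory stand-in for a durable table keyed by idempotency key.
function createIdempotentProcessor(processEvent, store = new Map()) {
  return async function handle(key, payload) {
    const existing = store.get(key);
    if (existing) {
      // Duplicate: wait for the original attempt and return its result with 200.
      return { status: 200, duplicate: true, result: await existing };
    }
    // Claim the key before doing any work, so concurrent retries can see it.
    // get() and set() run in the same synchronous step, so no other request
    // can slip in between them in single-threaded JavaScript.
    const pending = Promise.resolve().then(() => processEvent(payload));
    store.set(key, pending);
    try {
      return { status: 200, duplicate: false, result: await pending };
    } catch (err) {
      store.delete(key); // release the claim so a later retry can run
      throw err;
    }
  };
}
```

In the timeline above, the second request would have waited for the first and received the same 200, instead of a 409 that the sender treated as a failure.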

Structured Logging: The Bare Minimum

You can't debug webhooks if your logs are a mess. Use structured logging (JSON) with these fields: request_id, idempotency_key, source_ip, user_agent, endpoint, http_method, http_status, latency_ms, processing_time_ms, error_message.

Also log the raw payload at DEBUG level, but be careful with PII. And always log the response body you send back. I can't count how many times I've seen logs that say "processed webhook" without any details.

Structured logging in a webhook handler. Note the requestId and processingTime fields.

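// Example webhook handler with structured logging
import express from 'express';
import { randomUUID } from 'node:crypto';
import { createLogger, format, transports } from 'winston';

const logger = createLogger({
  level: 'info',
  format: format.json(),
  transports: [new transports.Console()], // without a transport nothing is written
});

const app = express();
app.use(express.json()); // parse JSON bodies into req.body

// Placeholder: replace with your own business logic.
async function processWebhook(payload) {
  return payload;
}

app.post('/webhook', async (req, res) => {
  // Reuse the caller's request ID if present so logs correlate across hops.
  const requestId = req.headers['x-request-id'] || randomUUID();
  const idempotencyKey = req.headers['idempotency-key'];
  const start = Date.now();

  logger.info('Webhook received', { requestId, idempotencyKey, sourceIp: req.ip });

  try {
    await processWebhook(req.body);
    const processingTime = Date.now() - start;
    logger.info('Webhook processed', { requestId, idempotencyKey, processingTime, status: 200 });
    res.status(200).json({ status: 'ok' });
  } catch (err) {
    const processingTime = Date.now() - start;
    logger.error('Webhook processing failed', { requestId, idempotencyKey, error: err.message, processingTime, status: 500 });
    res.status(500).json({ status: 'error' });
  }
});

app.listen(3000);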

Distributed Tracing: Connect the Dots

When a webhook passes through multiple services (sender -> CDN -> reverse proxy -> app -> database -> queue -> worker), you need a trace to see where it got lost. Instrument with OpenTelemetry and send the traces to a backend such as Datadog, Honeycomb, or Jaeger.

Start by adding trace context propagation in your webhook handler. If the sender sends a trace header (e.g., sw8 or traceparent), use it. If not, generate one and propagate it internally. Then you can see the full journey of a single webhook.

If you don't have distributed tracing, you are debugging blind. A single webhook can traverse 10 services. The logs from each service are islands. Tracing connects them.
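
To make the "use it or generate one" step concrete, here is a minimal sketch that reads a W3C traceparent header (version 00: a 32-hex trace ID, a 16-hex parent span ID, and 2-hex flags) and either continues that trace or starts a new one. The function name traceContextFromHeaders and the shape of its return value are hypothetical, and marking self-started traces as sampled is an assumption. In practice an OpenTelemetry SDK's propagator would normally do this for you; the sketch only shows what is being propagated.

```javascript
const { randomBytes } = require('node:crypto');

// Only version "00" of the header is accepted in this sketch.
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function traceContextFromHeaders(headers) {
  const incoming = String(headers['traceparent'] || '').trim();
  const match = TRACEPARENT_RE.exec(incoming);
  // All-zero trace or parent IDs are invalid.
  const valid = Boolean(match) && !/^0+$/.test(match[1]) && !/^0+$/.test(match[2]);

  const traceId = valid ? match[1] : randomBytes(16).toString('hex');
  const spanId = randomBytes(8).toString('hex'); // new span for this service
  const flags = valid ? match[3] : '01'; // assumption: sample traces we start

  return {
    traceId,
    parentSpanId: valid ? match[2] : null,
    spanId,
    continued: valid,
    // Forward this value on every downstream call (queue message, HTTP request).
    traceparent: `00-${traceId}-${spanId}-${flags}`,
  };
}
```

Logging traceId alongside the request ID in every service is what lets the separate islands of logs be joined into one journey.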

A Practical Checklist

1. Verify DNS resolution from the sender's network (use dig).
2. Check SSL/TLS handshake with openssl s_client.
3. Review reverse proxy logs for timeout errors (upstream timed out).
4. Ensure reverse proxy timeout > app processing time.
5. Add idempotency key support; store keys in a database.
6. Implement structured logging with request IDs.
7. Set up distributed tracing; propagate trace context.
8. Monitor webhook latency and error rates; alert on anomalies.

Conclusion

Debugging webhook delivery is about looking beyond your application code. The network, DNS, SSL, and reverse proxy layers all have their own failure modes. And without proper observability (structured logs + tracing), you're flying blind.

Start with the checklist above. The next time a webhook disappears, you'll know where to look.

Frequently asked questions

Why does my webhook return 200 but the data never appears?

A 200 response only means the server accepted the connection and sent a response. The server might have queued the request and then failed to process it (e.g., database error, memory limit). Always log the processing result, not just the HTTP response.

How can I detect a webhook that was dropped by a reverse proxy?

Check the reverse proxy's access logs for the request. If the request appears there but not in your application logs, it's likely a proxy-to-app issue. Look for upstream timeouts or connection resets.

Should I retry webhooks that fail with 5xx errors?

Yes, but use exponential backoff and idempotency keys. A 5xx error indicates the server is in a bad state; retrying immediately will likely fail again. Wait at least 30 seconds, then retry with the same idempotency key on every attempt, so the receiver can recognize the retries as one event.
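
A backoff schedule for the sending side can be sketched as below, keeping one idempotency key per event across all attempts so the receiver can deduplicate. The 30-second base comes from the answer above; the one-hour cap, the optional jitter, and the names retryDelayMs and retryPlan are assumptions made for illustration.

```javascript
// Delay before retry number `attempt` (1-based), in milliseconds.
// Doubles from baseMs and is capped at maxMs.
function retryDelayMs(
  attempt,
  { baseMs = 30_000, maxMs = 3_600_000, jitter = 0, random = Math.random } = {}
) {
  const exp = Math.min(maxMs, baseMs * 2 ** (attempt - 1));
  // With jitter = 0.2 the delay varies by up to 20% either side of exp,
  // which keeps many senders from retrying in lockstep.
  const spread = exp * jitter * (random() * 2 - 1);
  return Math.round(exp + spread);
}

// Builds the retry plan for one event. The idempotency key is fixed per event,
// so every attempt carries the same key.
function retryPlan(idempotencyKey, maxAttempts, options) {
  return Array.from({ length: maxAttempts }, (_, i) => ({
    attempt: i + 1,
    idempotencyKey,
    delayMs: retryDelayMs(i + 1, options),
  }));
}
```

With the defaults, the first three retries wait 30, 60, and 120 seconds.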

What is the most common mistake in webhook receiver implementation?

Not returning a response until the entire processing is complete. If processing takes longer than the sender's timeout (often 5-10 seconds), the sender will retry, causing duplicates. Instead, acknowledge receipt immediately and process asynchronously.
