
Webhook Timeouts and Retries: The Complete Reliability Guide for Production Automation


By LogicLot Team · Last updated March 2026

Deep-dive guide to webhook failure modes, timeout thresholds by provider, retry strategies (exponential backoff, linear, fixed), idempotency keys, dead letter queues, monitoring webhook health, and platform-specific guidance for Stripe, Shopify, and GitHub webhooks.

Webhooks are the backbone of real-time automation. When a customer places an order, a payment succeeds, or a support ticket is created, a webhook delivers that event to your workflow in seconds. But webhooks fail more often than most teams realise. According to Hookdeck's analysis of webhook traffic patterns, between 1% and 5% of webhook deliveries fail on the first attempt due to timeouts, server errors, or network issues. At scale, that means thousands of missed events per month if your system is not designed for failure.

Svix, which processes webhook deliveries for companies including Clerk, Brex, and Lob, reports that the median webhook processing time for well-designed endpoints is under 200 milliseconds, but the 99th percentile often exceeds 10 seconds. That tail latency is where failures concentrate. This guide covers why webhooks fail, how to handle timeouts, retry strategies that actually work in production, idempotency implementation, dead letter queues, monitoring, and platform-specific guidance for Stripe, Shopify, GitHub, Twilio, HubSpot, and SendGrid.

Why webhooks fail: the root causes

Webhook failures fall into five categories. Understanding the cause determines the correct response.

Timeout failures

The most common failure mode. The webhook provider sends an HTTP POST to your endpoint and waits for a response. If your endpoint does not respond within the provider's timeout window, the delivery is marked as failed. Timeouts happen because: your endpoint performs heavy processing before responding (database writes, external API calls, file processing), the server is under load and response times have increased, a downstream dependency (database, cache, external API) is slow, or a network issue adds latency between the provider and your endpoint.

The critical insight: the provider does not care why you timed out. From the provider's perspective, a timeout is a timeout. Whether your server crashed or you were doing useful work, the result is the same—the event is marked as failed and scheduled for retry.

Server errors (5xx responses)

Your endpoint returns a 500, 502, 503, or other 5xx status code. This signals that your server received the request but could not process it. Common causes: unhandled exceptions in your webhook handler code, database connection pool exhaustion, memory pressure causing the application to error, deployment in progress (rolling restart), or dependency failure (Redis down, external service unavailable).

Network failures

The provider cannot reach your endpoint at all. DNS resolution failure, TCP connection refused, TLS handshake failure, or a firewall blocking the request. These are typically infrastructure issues: misconfigured DNS, expired SSL certificates, or security rules that block the provider's IP range.

Client errors (4xx responses)

Your endpoint returns a 400, 401, 403, or 404. These are not transient—they indicate a configuration problem. A 401 means your signature verification is rejecting valid requests (wrong secret configured). A 404 means the webhook URL is incorrect. A 400 means your parser cannot handle the payload format. Providers generally do not retry 4xx errors because the same request will produce the same result.

Rate limiting (429 responses)

Your endpoint rejects the webhook because it is receiving too many requests. This happens when a burst of events (flash sale, bulk operation, migration) overwhelms your endpoint's capacity. Some providers retry 429 responses; others treat them as permanent failures.

Timeout thresholds by provider: the definitive reference

Every webhook provider enforces its own timeout window. Knowing these numbers is essential for designing your handler.

| Provider | Timeout | Retry policy | Max retry window | Documentation |
|----------|---------|--------------|------------------|---------------|
| Stripe | ~20 seconds | Exponential backoff | ~3 days (up to 16 attempts) | Stripe webhook best practices |
| Shopify | ~5 seconds | Exponential backoff | ~48 hours (19 attempts) | Shopify webhook docs |
| GitHub | ~10 seconds | Fixed interval | ~24 hours | GitHub webhook deliveries |
| Twilio | ~15 seconds | Configurable | Varies by product | Twilio webhook docs |
| HubSpot | ~30 seconds | Exponential backoff | Varies | HubSpot webhook API |
| SendGrid | ~10 seconds | Exponential backoff | ~24 hours | SendGrid event webhook |
| PayPal | ~30 seconds | Fixed interval | ~3 days | PayPal webhook notifications |
| Slack | ~3 seconds | Fixed interval | ~1 hour | Slack Events API |

Shopify's 5-second timeout is notably aggressive. If your handler does anything beyond signature verification and enqueueing before responding, you will experience timeouts. Slack's 3-second timeout for the Events API is even shorter—responding late causes Slack to disable your endpoint after repeated failures. Stripe's 20-second window is more forgiving but should not be taken as permission to do heavy processing synchronously.

Always verify these numbers against current documentation. Providers update timeout policies, and the numbers above are based on documentation as of the reference dates.

The acknowledge-first pattern: the single most important design decision

The acknowledge-first pattern (also called "respond-then-process" or "async processing") separates webhook receipt from webhook processing. Your endpoint receives the HTTP POST, performs minimal validation, returns 200 OK, and then processes the event asynchronously.

Why this pattern exists

The math is straightforward. If your handler needs to: verify the HMAC signature (5-50ms), parse the JSON body (1-5ms), query your database to check for duplicates (20-200ms), create a record in your CRM (200-2000ms), send an email notification (500-3000ms), and update a project management tool (300-1500ms)—the total processing time is 1-7 seconds in the best case, and much longer if any dependency is slow. Against a 5-second Shopify timeout, this fails regularly. Against a 3-second Slack timeout, it fails every time.

Implementation architecture

Step 1: Receive and validate. Your endpoint receives the POST request. Verify the HMAC signature using the provider's shared secret. This confirms the request is authentic. Reject invalid signatures with 401. Parse the JSON body. Reject malformed payloads with 400.

Step 2: Enqueue. Write the validated event to a queue. Options include:

  • **Redis** — Use LPUSH to add events to a list. A worker process reads events with BRPOP. Fast (sub-millisecond), widely available, and simple.
  • **AWS SQS** — Managed queue service. Built-in dead letter queue support. No infrastructure to manage.
  • **RabbitMQ** — Feature-rich message broker with routing, acknowledgements, and persistence.
  • Database table — Insert the event into a `webhook_events` table with status `pending`. A background job polls for pending events. Simpler to set up; higher latency than dedicated queues.

Step 3: Respond with 200. Return HTTP 200 OK immediately after enqueueing. Do not wait for processing to complete. The response body can be empty or contain a simple acknowledgement.

Step 4: Process asynchronously. A separate worker process reads events from the queue and performs the actual business logic: database writes, API calls, email sends, notifications. If processing fails, the worker can retry with its own backoff logic, independent of the webhook provider's retry schedule.

Svix recommends keeping the queue physically close to the webhook receiver to minimise enqueue latency. If your webhook endpoint is in US-East and your queue is in EU-West, the round-trip latency to enqueue eats into your timeout budget.
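The four steps can be condensed into a minimal, framework-free Python sketch. Here `queue.Queue` stands in for Redis or SQS, and the secret, signature format, and payload are illustrative assumptions — real providers each define their own header names and signing schemes:

```python
import hashlib
import hmac
import json
import queue

def handle_webhook(body: bytes, signature: str, secret: bytes, q: queue.Queue) -> int:
    """Steps 1-3 of the acknowledge-first pattern. Returns an HTTP status code."""
    # Step 1: verify the HMAC signature before trusting anything in the payload.
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, signature):
        return 401  # invalid signature: reject; providers do not retry 4xx

    try:
        event = json.loads(body)
    except json.JSONDecodeError:
        return 400  # malformed payload

    # Step 2: enqueue the validated event. No business logic happens here.
    q.put(event)

    # Step 3: acknowledge immediately. Step 4 runs in a separate worker process.
    return 200

# Usage sketch with a self-signed payload
q = queue.Queue()
secret = b"whsec_example"  # hypothetical shared secret
body = json.dumps({"id": "evt_123", "type": "order.created"}).encode()
sig = hmac.new(secret, body, hashlib.sha256).hexdigest()
status = handle_webhook(body, sig, secret, q)
```

The key property is that `handle_webhook` does no database writes and no external API calls, so its response time is dominated by one HMAC computation and one JSON parse — comfortably inside even Slack's 3-second window.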

How workflow platforms handle this

Zapier, Make, and n8n implement the acknowledge-first pattern internally. When a webhook trigger fires, the platform receives the POST, returns 200, and queues the workflow execution. Subsequent steps run asynchronously. This means you generally do not need to implement your own queue when using these platforms. However, you should be aware that: individual steps within the workflow can still time out or fail, some platforms retry the entire workflow on step failure (which can cause duplicate processing if earlier steps had side effects), and the platform's internal queue may have its own latency and throughput limits.

Retry strategies: choosing the right approach

When a webhook delivery fails, the provider retries. Your own system should also have retry logic for processing failures. Understanding the different retry strategies helps you design for reliability.

Exponential backoff

The standard retry strategy. The delay between retries increases exponentially: 1 second, 2 seconds, 4 seconds, 8 seconds, 16 seconds, 32 seconds, and so on, up to a maximum delay. This gives the failing system time to recover without overwhelming it with retry requests.

Algorithm:

```
delay = min(base_delay * 2^attempt, max_delay)
```

Example with base_delay = 1s, max_delay = 3600s (1 hour):

  • Attempt 1: 1s delay
  • Attempt 2: 2s delay
  • Attempt 3: 4s delay
  • Attempt 4: 8s delay
  • Attempt 5: 16s delay
  • Attempt 6: 32s delay
  • Attempt 7: 64s delay
  • Attempt 8: 128s delay (~2 min)
  • Attempt 9: 256s delay (~4 min)
  • Attempt 10: 512s delay (~8.5 min)

Stripe uses exponential backoff with up to 16 retry attempts spread across approximately 3 days.
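The schedule above can be reproduced with a one-line function. Note that because attempt 1 maps to the base delay, the exponent here is `attempt - 1`:

```python
def backoff_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 3600.0) -> float:
    """Exponential backoff: the delay doubles per attempt, capped at max_delay."""
    return min(base_delay * 2 ** (attempt - 1), max_delay)

# Reproduces the example schedule: 1s, 2s, 4s, ... 512s
schedule = [backoff_delay(n) for n in range(1, 11)]
```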

Exponential backoff with jitter

Pure exponential backoff has a problem: if many webhooks fail at the same time (service outage, deployment), they all retry at the same intervals, creating a "thundering herd" that can overwhelm the recovering system. Jitter adds randomness to the delay to spread retries across time.

**Full jitter algorithm (recommended by AWS architecture blog):**

```
delay = random(0, min(base_delay * 2^attempt, max_delay))
```

Decorrelated jitter:

```
delay = min(max_delay, random(base_delay, previous_delay * 3))
```

AWS's analysis of jitter strategies found that full jitter produces the fewest total retries and the shortest total completion time compared to equal jitter or no jitter. For webhook systems handling thousands of events, jitter is not optional—it is a reliability requirement.
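A full-jitter sketch is a two-line change to the plain backoff function (again treating attempt 1 as the base delay, so the exponent is `attempt - 1`):

```python
import random

def full_jitter_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 3600.0) -> float:
    """Full jitter: draw a uniform random delay between zero and the exponential cap."""
    cap = min(base_delay * 2 ** (attempt - 1), max_delay)
    return random.uniform(0, cap)
```

Because each retry lands at a random point inside the window rather than at its edge, a thousand clients that failed simultaneously will spread their retries across the whole interval instead of hammering the recovering service in lockstep.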

Linear backoff

The delay increases by a fixed amount: 5 seconds, 10 seconds, 15 seconds, 20 seconds. Linear backoff is simpler to reason about but less efficient than exponential backoff for long outages. Use it when: the failure is expected to be brief (seconds, not minutes), you want predictable retry timing, or the downstream system recovers quickly but cannot handle burst traffic.

Fixed interval retry

Retry at the same interval every time: every 60 seconds, for example. GitHub uses a fixed interval approach for webhook retries. Fixed interval is appropriate when: the failure is likely infrastructure-level (the system is either up or down), you want simple, predictable behaviour, and the retry count is low (3-5 attempts).

Choosing a strategy

| Scenario | Recommended strategy | Rationale |
|----------|----------------------|-----------|
| Payment webhooks (Stripe, PayPal) | Exponential backoff with jitter | Critical data, extended outages possible |
| E-commerce events (Shopify) | Exponential backoff with jitter | High volume, thundering herd risk |
| Developer tooling (GitHub) | Fixed interval or linear | Usually brief failures |
| High-volume event streams | Exponential backoff with full jitter | Burst traffic management |
| Internal microservices | Linear or fixed | Controlled environment, fast recovery |

Idempotency: making retries safe

Retries create duplicates. Whether the provider retries a failed delivery or your worker retries a failed processing step, the same event may be processed more than once. Idempotency ensures that processing an event multiple times produces the same result as processing it once.

Event ID deduplication

Every major webhook provider includes a unique identifier for each event. Store processed IDs and check before processing.

Stripe: The `id` field in the event object (e.g. `evt_1Nq5dF2eZvKYlo2CzQ3U9FXh`) — Stripe webhook events

GitHub: The `X-GitHub-Delivery` header contains a UUID that uniquely identifies the delivery — GitHub webhook headers

Shopify: The `X-Shopify-Webhook-Id` header — Shopify webhook headers

HubSpot: Event objects include unique identifiers — HubSpot webhook API

SendGrid: Each event includes an `sg_event_id` — SendGrid event webhook

Implementation with Redis

Redis is the most common choice for deduplication because of its speed and built-in TTL (time-to-live) support.

```
SET webhook:evt_1Nq5dF2eZvKYlo2CzQ3U9FXh 1 NX EX 604800
```

This command sets a key named after the event ID, uses NX (only set if the key does not already exist), and uses EX 604800 to apply a 7-day expiration. If the SET returns OK, this is a new event—process it. If it returns nil, this is a duplicate—skip processing and return 200.
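In production this is a single redis-py call — `r.set(key, 1, nx=True, ex=604800)` returns `True` for a new key and `None` for a duplicate. To show the semantics in a self-contained way, the sketch below uses an in-memory dict with expiry timestamps as a stand-in for Redis:

```python
import time

class DedupeStore:
    """In-memory stand-in for Redis SET ... NX EX semantics."""

    def __init__(self):
        self._keys = {}  # key -> expiry timestamp

    def set_nx_ex(self, key: str, ttl_seconds: int) -> bool:
        """True if the key was newly set (new event); False if it already
        exists and has not expired (duplicate)."""
        now = time.monotonic()
        expiry = self._keys.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate: skip processing, still return 200
        self._keys[key] = now + ttl_seconds
        return True

store = DedupeStore()
first = store.set_nx_ex("webhook:evt_123", 604800)   # new event -> True
second = store.set_nx_ex("webhook:evt_123", 604800)  # duplicate -> False
```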

Implementation with a database table

Create a table to track processed events:

  • `event_id` (primary key, VARCHAR)
  • `provider` (VARCHAR — "stripe", "shopify", etc.)
  • `event_type` (VARCHAR — "payment_intent.succeeded", "orders/create", etc.)
  • `received_at` (TIMESTAMP)
  • `processed_at` (TIMESTAMP, nullable)
  • `status` (VARCHAR — "processing", "completed", "failed")

Before processing, attempt an INSERT. If it succeeds (no conflict), process the event. If it fails (duplicate key), skip. This approach provides an audit trail and supports querying for failed or stuck events.
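The insert-or-skip check can be demonstrated with SQLite from the standard library; the same pattern applies to Postgres or MySQL, where the duplicate-key error class differs but the logic is identical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE webhook_events (
        event_id     TEXT PRIMARY KEY,
        provider     TEXT NOT NULL,
        event_type   TEXT NOT NULL,
        received_at  TEXT DEFAULT CURRENT_TIMESTAMP,
        processed_at TEXT,
        status       TEXT NOT NULL DEFAULT 'processing'
    )
""")

def claim_event(event_id: str, provider: str, event_type: str) -> bool:
    """Attempt the INSERT; a primary-key conflict means the event was already claimed."""
    try:
        conn.execute(
            "INSERT INTO webhook_events (event_id, provider, event_type) VALUES (?, ?, ?)",
            (event_id, provider, event_type),
        )
        conn.commit()
        return True   # new event: safe to process
    except sqlite3.IntegrityError:
        return False  # duplicate: skip processing

ok = claim_event("evt_123", "stripe", "payment_intent.succeeded")
dup = claim_event("evt_123", "stripe", "payment_intent.succeeded")
```

Because the uniqueness check and the claim happen in one atomic INSERT, two workers racing on the same event cannot both win — the database resolves the conflict for you.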

Idempotency keys for outgoing API calls

When your webhook handler calls external APIs to create records, use idempotency keys to prevent duplicates downstream. Stripe's idempotency key pattern is the canonical example: include an `Idempotency-Key` header with a unique value derived from the event. The server stores the response for that key and returns it on retry instead of creating a duplicate. Generate the idempotency key deterministically from the webhook event ID and the action being performed. For example: `{event_id}:{action}` ensures that retrying the same event for the same action produces the same key.
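A deterministic key can be as simple as the raw `{event_id}:{action}` string; hashing it (as sketched below) is an optional refinement that keeps keys fixed-length regardless of event ID format:

```python
import hashlib

def idempotency_key(event_id: str, action: str) -> str:
    """Deterministic key: the same event + action always yields the same key,
    so a retried webhook produces the same Idempotency-Key header."""
    return hashlib.sha256(f"{event_id}:{action}".encode()).hexdigest()

key_a = idempotency_key("evt_123", "create_invoice")
key_b = idempotency_key("evt_123", "create_invoice")  # retry: identical key
```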

Dead letter queues: handling exhausted retries

When all retry attempts are exhausted, the event needs somewhere to go. Without a dead letter queue (DLQ), the event is lost. For payment webhooks, lost events mean missed revenue, incorrect order statuses, or failed fulfilment. For CRM updates, lost events mean stale data.

What a dead letter queue stores

Each DLQ entry should contain: the original event payload (complete JSON body), all headers from the original delivery (including signature headers for re-verification), the event ID, the number of retry attempts made, the error or status code from the last attempt, timestamps (first received, last retry, moved to DLQ), and any stack trace or error message from your handler.

DLQ implementation options

AWS SQS dead letter queue: SQS natively supports DLQs. Configure a "redrive policy" on your main queue specifying the DLQ and the maximum receive count. Events that exceed the receive count are automatically moved to the DLQ. AWS SQS DLQ documentation.

Database table: A `dead_letter_events` table with the fields above. Simple, queryable, and compatible with any stack. Add a `replayed_at` column to track when events are replayed after the fix is deployed.

Redis list: RPUSH failed events to a `dlq:{provider}` list. Monitor list length as an operational metric. Less durable than a database (Redis data can be lost on restart unless persistence is configured).

Workflow platform error handling: n8n has error workflows that trigger when a main workflow fails. Configure the error workflow to log the failed execution details to a database or notification system. Make has error handling routes that can redirect failed executions. Zapier shows failed tasks in task history but does not have a native DLQ—use a custom Code step to send failures to an external logging service.

Replaying events from the DLQ

After identifying and fixing the root cause of the failure: query the DLQ for events that match the failure pattern, re-process each event through your handler, verify the result, and move the event from the DLQ to a "replayed" status. Automate this where possible. A manual replay process is error-prone and does not scale.
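A replay loop over a database-backed DLQ can be sketched with SQLite; the table columns follow the schema suggested above, and the handler passed to `replay_pending` is a stand-in for your fixed processing code:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dead_letter_events (
        event_id    TEXT PRIMARY KEY,
        payload     TEXT NOT NULL,
        attempts    INTEGER NOT NULL,
        last_error  TEXT,
        replayed_at TEXT
    )
""")

# A failed event captured after retries were exhausted
conn.execute(
    "INSERT INTO dead_letter_events (event_id, payload, attempts, last_error) "
    "VALUES (?, ?, ?, ?)",
    ("evt_123", json.dumps({"id": "evt_123", "type": "orders/create"}), 19, "HTTP 500"),
)

def replay_pending(process) -> int:
    """Re-run the handler for every un-replayed event; mark successes."""
    rows = conn.execute(
        "SELECT event_id, payload FROM dead_letter_events WHERE replayed_at IS NULL"
    ).fetchall()
    replayed = 0
    for event_id, payload in rows:
        if process(json.loads(payload)):  # the fixed handler returns True on success
            conn.execute(
                "UPDATE dead_letter_events SET replayed_at = CURRENT_TIMESTAMP "
                "WHERE event_id = ?",
                (event_id,),
            )
            replayed += 1
    conn.commit()
    return replayed

count = replay_pending(lambda event: True)  # stand-in for the repaired handler
```

Marking `replayed_at` rather than deleting the row preserves the audit trail, and re-running the loop is safe because already-replayed events are excluded by the WHERE clause.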

Monitoring webhook health: metrics that matter

Webhook reliability is not "set and forget." Continuous monitoring catches degradation before it causes data loss.

Key metrics

  • Delivery success rate: Percentage of webhook deliveries that receive a 200 response on the first attempt. Target: above 99%. Below 98% indicates a systemic issue.
  • p95 and p99 response time: How long your endpoint takes to respond at the 95th and 99th percentiles. If p99 exceeds 50% of the provider's timeout, you are at risk.
  • Retry rate: What percentage of deliveries require at least one retry. A spike indicates emerging failures.
  • DLQ depth: How many events are in the dead letter queue. Non-zero means investigation is needed. Growing depth means the problem is ongoing.
  • Deduplication hit rate: What percentage of incoming events are duplicates. A spike means retries are increasing, which means delivery failures are increasing.
  • Queue processing lag: The time between an event entering the processing queue and being processed. Growing lag means your workers cannot keep up.
  • Error classification: Break down errors by type (timeout, 5xx, 4xx, network). Each type requires a different response.

Monitoring tools

**Hookdeck:** Purpose-built webhook infrastructure with delivery monitoring, automatic retries, and debugging tools. Sits between the webhook provider and your endpoint, adding observability without changing your handler code.

**Svix:** Webhook sending and receiving infrastructure with delivery tracking, retry management, and a dashboard for investigating failures. Used by Clerk, Brex, and other B2B SaaS companies.

**Datadog:** Full observability platform. Create custom metrics for webhook delivery success, response time, and error rates. Set up alerts on thresholds. Integrates with most cloud providers and application frameworks.

**Better Stack:** Uptime monitoring for your webhook endpoints. Alerts when your endpoint goes down. Useful as a complement to provider-side monitoring.

Platform-native logs: n8n execution history, Make scenario logs, and Zapier task history provide first-line visibility into workflow-level successes and failures.

Alerting rules

Set up alerts for: delivery success rate dropping below 99% (warning) or 95% (critical), p99 response time exceeding 60% of the provider's timeout, DLQ depth increasing for more than 30 minutes, retry rate exceeding 5% of total deliveries, and queue processing lag exceeding 5 minutes.

Platform-specific guidance: Stripe webhooks

Stripe is the most common webhook provider for payment automation. Their webhook system has specific behaviours you need to understand.

Timeout: Stripe waits approximately 20 seconds for a response. After that, the delivery is marked as failed.

Retry schedule: Stripe retries failed deliveries with exponential backoff. Up to 16 retry attempts over approximately 3 days. Each attempt is logged in the Stripe Dashboard under Developers > Webhooks > the specific endpoint.

Signature verification: Stripe uses HMAC-SHA256. The `Stripe-Signature` header contains a timestamp (`t`) and one or more signatures (`v1`). Verify using the webhook signing secret from the Stripe Dashboard. Stripe signature verification docs.
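In production you would normally use the official `stripe` library's `Webhook.construct_event` helper, which implements this check for you. As a stdlib-only sketch of the documented scheme — simplified to a single `v1` signature, where real headers can carry several — verification looks like this:

```python
import hashlib
import hmac
import time

def verify_stripe_signature(payload: bytes, sig_header: str, secret: str,
                            tolerance: int = 300) -> bool:
    """Check a Stripe-Signature header ("t=...,v1=...") against the payload."""
    parts = dict(p.split("=", 1) for p in sig_header.split(","))
    timestamp = parts["t"]
    # Stripe signs "{timestamp}.{raw body}" with HMAC-SHA256 (hex digest).
    signed_payload = f"{timestamp}.{payload.decode()}".encode()
    expected = hmac.new(secret.encode(), signed_payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, parts["v1"]):
        return False
    # Reject stale timestamps to limit replay attacks.
    return abs(time.time() - int(timestamp)) <= tolerance

# Usage sketch with a self-generated signature
secret = "whsec_example"  # hypothetical signing secret
payload = b'{"id": "evt_123"}'
t = int(time.time())
v1 = hmac.new(secret.encode(), f"{t}.{payload.decode()}".encode(),
              hashlib.sha256).hexdigest()
ok = verify_stripe_signature(payload, f"t={t},v1={v1}", secret)
```

Two details matter: verify against the raw request body (re-serialising parsed JSON changes the bytes and breaks the signature), and always use a constant-time comparison such as `hmac.compare_digest`.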

Event ordering: Stripe does not guarantee that events arrive in the order they occurred. A `payment_intent.succeeded` event might arrive before the `payment_intent.created` event. Use the `created` timestamp in the event object for ordering, or design your handler to be order-independent.

Best practice for Stripe: Use the Stripe CLI to forward webhook events to your local development environment during testing. This avoids the need for ngrok and provides a more reliable development experience.

Platform-specific guidance: Shopify webhooks

Shopify has the most aggressive timeout policy among major e-commerce platforms.

Timeout: Shopify waits approximately 5 seconds. This is one of the shortest timeout windows among major webhook providers. If your handler does anything beyond minimal validation and enqueueing, you will experience timeouts.

Retry schedule: Shopify retries failed deliveries with exponential backoff. Up to 19 retry attempts over approximately 48 hours. After exhausting retries, Shopify removes the webhook subscription entirely—your endpoint stops receiving events until you re-register.

Signature verification: Shopify uses HMAC-SHA256. The `X-Shopify-Hmac-SHA256` header contains the Base64-encoded HMAC. Verify using your app's shared secret.
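Unlike Stripe's hex digest, Shopify's header carries a Base64-encoded digest of the raw body alone. A minimal stdlib sketch of the comparison (the secret and payload values below are illustrative):

```python
import base64
import hashlib
import hmac

def verify_shopify_hmac(body: bytes, header_value: str, shared_secret: str) -> bool:
    """Compare the X-Shopify-Hmac-SHA256 header (Base64) against a computed digest."""
    digest = hmac.new(shared_secret.encode(), body, hashlib.sha256).digest()
    expected = base64.b64encode(digest).decode()
    return hmac.compare_digest(expected, header_value)

# Usage sketch with a self-generated header
secret = "shpss_example"  # hypothetical app shared secret
body = b'{"id": 820982911946154508}'
header = base64.b64encode(
    hmac.new(secret.encode(), body, hashlib.sha256).digest()
).decode()
ok = verify_shopify_hmac(body, header, secret)
```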

Critical Shopify behaviour: If your endpoint returns non-2xx responses consistently, Shopify will unsubscribe the webhook. This is a destructive action—you must monitor delivery success and re-subscribe if Shopify removes your webhook. Implement automated monitoring that checks your active webhook subscriptions and re-registers if any are missing.

Platform-specific guidance: GitHub webhooks

GitHub webhook behaviour is designed for developer tooling and CI/CD integrations.

Timeout: GitHub waits approximately 10 seconds for a response.

Retry schedule: GitHub retries failed deliveries for up to 24 hours. The retry interval is roughly fixed rather than exponential. You can view delivery attempts and their results in the repository or organisation settings under Webhooks.

Signature verification: GitHub uses HMAC-SHA256. The `X-Hub-Signature-256` header contains the hexadecimal HMAC digest. GitHub signature verification docs.

Unique feature: GitHub provides a "Redeliver" button in the webhook settings dashboard. This allows you to manually redeliver any past webhook event. Useful for debugging and for recovering from failures without building your own replay mechanism.

Delivery tracking: The `X-GitHub-Delivery` header contains a UUID for each delivery. GitHub also provides an API endpoint to list recent deliveries and their status, enabling programmatic monitoring.

Testing webhook reliability

Testing is not optional. A webhook handler that has not been tested for timeout, retry, and deduplication scenarios will fail in production.

Testing timeout behaviour

1. Use webhook.site or ngrok to expose a local endpoint
2. Add a configurable delay to your handler (e.g. a sleep function)
3. Set the delay to just below the provider's timeout and verify success
4. Set the delay to just above the provider's timeout and verify that the provider marks it as failed
5. Observe the retry behaviour and verify your handler processes the retry correctly

Testing deduplication

1. Process a webhook event and verify success
2. Send the same event again (same event ID)
3. Verify that the second delivery is acknowledged (200) but not processed
4. Check your deduplication store for the correct entry

Testing dead letter queue

1. Configure your handler to always fail (return 500)
2. Send a webhook event
3. Observe retries until exhaustion
4. Verify the event appears in your dead letter queue with all required metadata
5. Fix the "failure" and replay the event from the DLQ
6. Verify successful processing

Provider test modes

Use sandbox or test environments to generate test events without affecting production data: Stripe test mode uses separate API keys and fake payment methods, Shopify partner development stores allow testing webhooks without affecting real stores, and GitHub webhook settings allow you to trigger ping events and redeliver past events.

Production checklist

  • Acknowledge first. Return 200 within 1-2 seconds. Process asynchronously.
  • Verify signatures. HMAC verification on every incoming request. Reject invalid signatures with 401.
  • Deduplicate. Store event IDs in Redis or a database. Check before processing. Set a TTL (7-30 days).
  • Use idempotency keys. For every outgoing create operation, include a deterministic idempotency key.
  • Implement a dead letter queue. Capture events that exhaust retries. Include full payload, headers, and error details.
  • Classify errors. Retry 5xx and timeouts. Do not retry 4xx. Handle 429 with Retry-After header.
  • Use exponential backoff with jitter. For your own retry logic, not just the provider's.
  • Monitor continuously. Track delivery success rate, p99 response time, DLQ depth, and deduplication hit rate.
  • Alert proactively. Set thresholds that trigger alerts before failures cascade.
  • Test failure scenarios. Timeout, deduplication, DLQ, and replay—test all of them before going to production.
  • Monitor Shopify subscriptions. Shopify removes webhook subscriptions after persistent failures. Automate re-registration.

When to use webhook infrastructure services

For teams processing more than a few hundred webhook events per day, or where webhook reliability is business-critical (payments, order fulfilment, compliance), consider dedicated webhook infrastructure. Hookdeck and Svix add a reliability layer between the webhook provider and your handler: automatic retries, delivery logging, rate limiting, transformation, and routing. These services handle the acknowledge-first pattern, deduplication, and DLQ for you, allowing your handler to focus on business logic.

The trade-off is cost and an additional dependency. For low-volume or non-critical webhooks, implementing the patterns in this guide directly is sufficient. For high-volume, business-critical webhooks—especially in payment processing and e-commerce—the investment in dedicated infrastructure pays for itself through reduced data loss and lower engineering maintenance.

Experts on LogicLot design webhook integrations that handle timeouts, retries, and failure scenarios correctly from day one. If your current webhooks are unreliable or you are building a new integration, post a Custom Project or book a Discovery Scan for a tailored assessment.

Frequently Asked Questions

How long do webhook providers wait before timing out?

Timeout windows vary significantly by provider: Stripe waits approximately 20 seconds, GitHub approximately 10 seconds, Shopify approximately 5 seconds, and Slack only 3 seconds. Always check the provider's current documentation, as these thresholds can change. Design your handler to respond within 1-2 seconds regardless of the provider.

What is the acknowledge-first pattern for webhooks?

The acknowledge-first pattern separates receipt from processing. Your endpoint verifies the HMAC signature, validates the payload structure, enqueues the event to a queue (Redis, SQS, or database), and returns 200 OK immediately. A separate worker process then handles the actual business logic asynchronously. This keeps response times under 200 milliseconds and prevents timeout failures.

How do I prevent duplicate processing when webhooks retry?

Use event ID deduplication. Every major provider includes a unique event identifier (Stripe uses the event id field, GitHub uses X-GitHub-Delivery header, Shopify uses X-Shopify-Webhook-Id header). Store processed IDs in Redis with a 7-day TTL or in a database table. Check for the ID before processing. For outgoing API calls triggered by the webhook, use idempotency keys to prevent downstream duplicates.

What happens when Shopify webhook retries are exhausted?

Shopify has one of the most aggressive webhook policies. If your endpoint consistently fails to respond with a 2xx status code, Shopify will automatically remove (unsubscribe) your webhook registration. Your endpoint stops receiving events entirely until you re-register the webhook. Monitor your active Shopify webhook subscriptions and implement automated re-registration to prevent silent data loss.