Webhook reliability for pipeline events: idempotency or it didn't happen

Webhooks deliver at least once, not exactly once. If your pipeline handler is not idempotent, retries silently corrupt state. The patterns LeadGrid uses, with code.

api, engineering · By Ralf Klein · 5 min read
Photo by panumas nikhomkhai on Pexels

Every webhook provider on the planet ships at-least-once delivery. Stripe, GitHub, Shopify, LeadGrid. That is not a defect, it is the only honest semantic over an unreliable network. Which means your handler will receive the same event twice, and sometimes ten times, and if it is not idempotent the state of your pipeline silently rots.

This is the failure mode I see most in LeadGrid integrations: a single network blip retries a stage.changed event, the customer's handler re-runs the side effect, and now the same candidate has two interview invites or the same lead gets two welcome emails. Nothing logged it as a bug. The retry was the bug.

At-least-once is not a bug, it is a contract

When a webhook sender does not get a 2xx within its timeout, it has two choices: assume the receiver got it, or assume it did not. Assuming success drops events on the floor whenever the network flakes. Every mature platform picks the other one. The Hookdeck guide on implementing webhook idempotency is blunt about it: most providers operate on at-least-once delivery, and the burden of dedup sits with the consumer.

Stripe documents the same expectation. In the Stripe webhooks reference, the recommendation is to treat the event.id as the primary key for deduplication on your end, because Stripe will retry on any non-2xx and on most timeouts.

LeadGrid follows the same model for pipeline.* events. If we time out waiting for your endpoint, we retry with the same event.id and the same payload. If your handler runs the side effect twice, that is on the handler.

Two layers of idempotency, not one

Most teams only build the first layer, and then are surprised when state still drifts.

The ingress layer dedupes the inbound event itself. You take the event.id, write it to a table with a UNIQUE constraint, and short-circuit if it is already there. This stops duplicate work on retries. The Hookdeck post on webhook scale recommends exactly this, and it is also the pattern Shopify documents in their webhooks best practices using the X-Shopify-Webhook-Id header.

The side-effect layer dedupes the outbound write that the event triggers. If the webhook handler calls a third-party API to send an email, create a Stripe charge, or post to Slack, that downstream call also needs an idempotency key. Otherwise a partial failure between "we wrote the event_id row" and "we sent the email" leaves you in a state where the next retry sees the event_id as already processed and skips, but the email never went out. Or worse, it went out twice.

Stripe's idempotent requests docs describe this well: keys are stored for at least 24 hours, and the recommendation is a V4 UUID with enough entropy, generated once with the original request and reused on every retry, because a regenerated key per attempt would defeat the whole mechanism.
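One way to keep the key stable across retries is to derive it deterministically from the event and the action it triggers, instead of generating a fresh value per attempt. A sketch, assuming a hypothetical email endpoint and `sendWelcomeEmail` helper:

```typescript
import { createHash } from "node:crypto";

// Derive a stable idempotency key from the event and the side effect it
// triggers. Retries of the same event produce the same key, so the
// downstream API can return its cached response instead of re-running.
function idempotencyKeyFor(eventId: string, action: string): string {
  return createHash("sha256").update(`${eventId}:${action}`).digest("hex");
}

// Hypothetical downstream call; the endpoint and payload shape are
// illustrative, not a real API.
async function sendWelcomeEmail(eventId: string, to: string): Promise<void> {
  await fetch("https://email.example.com/v1/send", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // Same event, same key, on every retry.
      "Idempotency-Key": idempotencyKeyFor(eventId, "welcome-email"),
    },
    body: JSON.stringify({ to, template: "welcome" }),
  });
}
```

Deriving the key from the event_id sidesteps the regeneration trap entirely: there is no per-attempt randomness to accidentally refresh.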

What the LeadGrid pattern actually looks like

For pipeline events we ship, here is the receiver shape we recommend, and the one our internal services use:

async function handleLeadGridWebhook(req: Request) {
  const event = verifySignature(req); // throws on bad signature
 
  // Layer 1: ingress dedup
  const inserted = await db.query(
    `INSERT INTO webhook_events (event_id, received_at)
     VALUES ($1, NOW())
     ON CONFLICT (event_id) DO NOTHING
     RETURNING event_id`,
    [event.id]
  );
 
  if (inserted.rowCount === 0) {
    // already processed, ack and move on
    return new Response("ok", { status: 200 });
  }
 
  // Layer 2: enqueue, do not process inline
  await queue.enqueue("pipeline-events", {
    event_id: event.id,
    type: event.type,
    payload: event.data,
  });
 
  return new Response("ok", { status: 200 });
}

Three things are deliberate. The signature verify happens before the dedup write, so a forged event with a real ID cannot poison the table. The dedup uses INSERT ... ON CONFLICT so the check and the write are atomic, no race. And the actual work is enqueued, not done inline, because synchronous webhook processing is how you turn a 200ms blip into a multi-hour outage.

The Hookdeck team makes this case directly in their webhook scale post: treat your endpoint as verify → enqueue → ack, and let a worker process from the queue with its own retry and backoff logic.

Backoff, jitter, dead-letter queue

The other half of the contract is what your worker does on failure. Three rules that survive contact with production.

Use exponential backoff with jitter, not fixed retries. A retry storm of identical-interval retries from a thousand workers will hammer a recovering service back into the ground. The same Hookdeck guide notes that 1s, 2s, 4s, 8s with random jitter is the floor of acceptable.
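A minimal sketch of that schedule with full jitter, the ceiling split out so it is testable; the base and cap values here are illustrative, not a LeadGrid requirement:

```typescript
// Exponential ceiling: base * 2^attempt, capped so late retries do not
// stretch into hours.
function backoffCeilingMs(attempt: number, baseMs = 1000, capMs = 60_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// "Full jitter": the actual delay is drawn uniformly from [0, ceiling],
// so a thousand workers retrying the same outage spread out instead of
// hammering the recovering service in lockstep.
function backoffDelayMs(attempt: number): number {
  return Math.random() * backoffCeilingMs(attempt);
}
```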

Cap the retry count and route the rest to a dead-letter queue. The Hookdeck DLQ guide is the right reference here: a poison event that fails forever should not block the rest of the queue. Move it to a DLQ after N attempts, alert, and let an operator decide.
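The attempt-counting and DLQ hand-off can be sketched broker-agnostically; `process`, `retry`, and `deadLetter` here stand in for whatever your queue client actually provides:

```typescript
type Job = { event_id: string; attempts: number; payload: unknown };

const MAX_ATTEMPTS = 5; // illustrative cap, tune per queue

// On failure, either re-enqueue with backoff or park the job in the DLQ.
// The job itself carries the attempt counter, so the logic survives
// worker restarts.
async function handleJob(
  job: Job,
  process: (job: Job) => Promise<void>,
  retry: (job: Job, delayMs: number) => Promise<void>,
  deadLetter: (job: Job, err: unknown) => Promise<void>,
): Promise<void> {
  try {
    await process(job);
  } catch (err) {
    const next = { ...job, attempts: job.attempts + 1 };
    if (next.attempts >= MAX_ATTEMPTS) {
      // Poison event: park it, alert, let an operator decide.
      await deadLetter(next, err);
    } else {
      // Exponential backoff with full jitter, capped at 60s.
      const delayMs = Math.random() * Math.min(60_000, 1000 * 2 ** next.attempts);
      await retry(next, delayMs);
    }
  }
}
```

The important property: a job that fails forever consumes exactly MAX_ATTEMPTS processing slots, then gets out of the way of the healthy events behind it.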

Make the worker itself idempotent on the side effect, not just on the dedup row. If it fails after sending the email but before marking the event_id as processed, the next retry should generate the same idempotency key and have the upstream API return the cached response. This is what the Stripe idempotency model gets right: same key, same result, regardless of how many times it is replayed.

State, not order, is the source of truth

One last trap. Webhooks are not strictly ordered. A stage.changed event arriving before a pipeline.created event is normal, not exceptional. Do not write logic that assumes order, write logic that checks state. Pull the current resource by ID, decide whether the event still applies, and act accordingly. The Hookdeck reliability guide calls this out: ordering is a mirage, state is the only thing that survives retries, races, and out-of-order delivery.
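One concrete shape for that check, assuming pipeline stages have a known forward order; the stage names here are hypothetical, not LeadGrid's actual schema:

```typescript
// Hypothetical stage ordering; in practice you would load this from the
// pipeline definition itself.
const STAGE_ORDER = ["applied", "screen", "interview", "offer", "hired"];

function stageRank(stage: string): number {
  return STAGE_ORDER.indexOf(stage);
}

// State-driven decision: apply a stage.changed event only if it moves the
// resource forward relative to its *current* state. A stale or
// out-of-order event is acked and dropped, never replayed.
function shouldApplyStageChange(currentStage: string, eventStage: string): boolean {
  return stageRank(eventStage) > stageRank(currentStage);
}
```

The handler fetches the resource by ID, runs this check, and acts on the answer; the event's arrival order never enters the decision.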

If you build all four layers (ingress dedup, side-effect idempotency, backoff with a DLQ, and state-driven logic), your pipeline integration stops drifting. Without them, every retry is a slow-motion bug.
