Webhooks Done Right: Delivery, Retries, and Verification
Webhooks look trivial until events arrive twice, out of order, or not at all. How to build both sides (signing, idempotency, retries) so nothing is lost.
On this page
- Receiving webhooks: four rules that never change
- Verify the signature before you trust a byte
- Dedupe by event id, because delivery is at-least-once
- Respond fast, process async, and return the right status
- Tolerate duplicates and out-of-order arrival
- Sending webhooks: at-least-once is a promise you have to keep
- Sign every payload and version it
- Retry with exponential backoff, then dead-letter
- A deliveries table is your support team's best friend
- Security: the SSRF you are about to introduce
- The checklist
- Further reading
A POST route that reads req.body and writes a row is not a webhook handler. It is a liability waiting for the first network blip. I have cleaned up the aftermath more than once: a customer charged twice because invoice.paid was delivered three times during a Stripe retry storm, a subscription stuck active in our database after a customer.subscription.deleted event got dropped, and an SSRF that pivoted into a cloud metadata endpoint because someone let users register http://169.254.169.254/ as a webhook target.
Webhooks are a distributed systems problem dressed up as a single HTTP request. The network gives you at-least-once delivery and no ordering guarantees, and every serious sender — Stripe, GitHub, Shopify, Slack — designs around exactly that. So must you, on both the receiving and the sending side.
Receiving webhooks: four rules that never change
A correct inbound handler does four things, in this order: verify the signature, dedupe by event id, respond 2xx fast, then process asynchronously. Skip any one and you eventually get burned.
Verify the signature before you trust a byte
Anyone can curl your webhook URL. Without verification you are acting on arbitrary input from the open internet. Every reputable sender signs the payload, typically with HMAC-SHA256 over a shared secret. The non-negotiable details: compute the HMAC over the raw request body, not the parsed-and-re-serialized JSON (re-serialization can change whitespace, key order, and number formatting, any of which breaks the signature), and compare in constant time so an attacker cannot recover the expected signature byte by byte through a timing side channel.
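To make the raw-body rule concrete, here is a small sketch that signs a body containing extra whitespace and then checks the signature against both the raw bytes and a parse-and-re-stringify round trip. The secret and payload are made up for illustration.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical secret and payload, for illustration only.
const secret = "whsec_example";
const rawBody = Buffer.from('{"id": "evt_1",  "amount": 1.0}', "utf8");

function hmacHex(body: Buffer | string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

function sameSignature(a: string, b: string): boolean {
  const x = Buffer.from(a, "hex");
  const y = Buffer.from(b, "hex");
  return x.length === y.length && timingSafeEqual(x, y);
}

// What the sender computed over the bytes it actually sent.
const sent = hmacHex(rawBody);

// What a receiver holds if a JSON parser touched the body first.
const reserialized = JSON.stringify(JSON.parse(rawBody.toString("utf8")));

const rawMatches = sameSignature(sent, hmacHex(rawBody));
const reserializedMatches = sameSignature(sent, hmacHex(reserialized));
```

Here rawMatches is true and reserializedMatches is false: the round trip collapses the whitespace and turns 1.0 into 1, so the bytes being signed are no longer the bytes that were sent.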
Here is a framework-agnostic verifier in TypeScript. The raw-body capture is the part people get wrong — by the time most frameworks hand you req.body, the bytes are gone.
```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const TOLERANCE_SECONDS = 300; // 5 min, matches Stripe's default

export function verifyWebhook(
  rawBody: Buffer,
  signatureHeader: string, // e.g. "t=1703260800,v1=5257a8..."
  secret: string,
): boolean {
  // Split each "key=value" pair on the first "=" only.
  const parts = Object.fromEntries(
    signatureHeader.split(",").map((kv) => {
      const i = kv.indexOf("=");
      return [kv.slice(0, i).trim(), kv.slice(i + 1)] as [string, string];
    }),
  );
  const timestamp = Number(parts.t);
  const provided = parts.v1;
  if (!timestamp || !provided) return false;

  // Reject stale signatures to blunt replay attacks.
  const age = Math.abs(Date.now() / 1000 - timestamp);
  if (age > TOLERANCE_SECONDS) return false;

  // Feed the raw bytes to the HMAC; decoding to a string first can alter
  // bodies that are not valid UTF-8.
  const expected = createHmac("sha256", secret)
    .update(`${timestamp}.`)
    .update(rawBody)
    .digest();
  const given = Buffer.from(provided, "hex");
  return expected.length === given.length && timingSafeEqual(expected, given);
}
```

If you are on Stripe specifically, do not hand-roll this: call stripe.webhooks.constructEvent(rawBody, sigHeader, secret) from stripe@^18, which does the timestamp tolerance and constant-time compare for you. The hand-rolled version above is for the dozens of other senders that do not ship an SDK.
In Next.js 15 App Router, capturing the raw body is one line, because the route handler gives you the Request directly:
```typescript
// app/api/webhooks/route.ts
import { verifyWebhook } from "@/lib/verify";

export async function POST(req: Request) {
  const raw = Buffer.from(await req.arrayBuffer());
  const sig = req.headers.get("webhook-signature") ?? "";
  if (!verifyWebhook(raw, sig, process.env.WEBHOOK_SECRET!)) {
    return new Response("invalid signature", { status: 401 });
  }
  // ... dedupe + enqueue (below)
}
```

In Express you must opt out of JSON parsing for this route: app.post("/webhooks", express.raw({ type: "application/json" }), handler). Otherwise express.json() consumes the stream and you end up verifying the signature against reconstructed bytes.
Dedupe by event id, because delivery is at-least-once
Every well-designed event carries a stable, unique id (evt_1abc... in Stripe, the X-GitHub-Delivery UUID in GitHub). At-least-once delivery means you will see that id more than once — during sender retries, during your own crash-and-redeliver, during a load balancer hiccup. Idempotency is what makes the second, third, and tenth delivery harmless.
The cheapest durable dedupe is a unique constraint and an ON CONFLICT DO NOTHING. Let the database be the source of truth instead of a racy SELECT then INSERT.
```sql
CREATE TABLE webhook_events (
  id           BIGSERIAL PRIMARY KEY,
  event_id     TEXT NOT NULL UNIQUE,            -- the sender's id
  event_type   TEXT NOT NULL,
  payload      JSONB NOT NULL,
  status       TEXT NOT NULL DEFAULT 'pending', -- pending|done|failed
  received_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  processed_at TIMESTAMPTZ
);

-- Insert is the dedupe gate. If the row already exists, do nothing
-- and we still return 2xx so the sender stops retrying.
INSERT INTO webhook_events (event_id, event_type, payload)
VALUES ($1, $2, $3)
ON CONFLICT (event_id) DO NOTHING;
```

If the INSERT affects zero rows, you have already seen this event. Acknowledge with 200 and stop. This single constraint is what protects you from the double-charge.
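The effect of that gate can be sketched without a database. The snippet below uses an in-memory Set as a stand-in for the unique constraint; the function names and the fulfillments counter are hypothetical, and a real handler would rely on the INSERT instead.

```typescript
// In-memory stand-in for the UNIQUE(event_id) constraint (illustration only).
const seen = new Set<string>();
let fulfillments = 0;

// Mirrors INSERT ... ON CONFLICT DO NOTHING: true means "row inserted".
function recordEvent(eventId: string): boolean {
  if (seen.has(eventId)) return false;
  seen.add(eventId);
  return true;
}

function handleInvoicePaid(eventId: string): "processed" | "duplicate" {
  if (!recordEvent(eventId)) return "duplicate";
  fulfillments += 1; // side effect that must happen once per event id
  return "processed";
}
```

Delivering the same id three times runs the side effect once; the second and third calls return "duplicate" and can be acknowledged immediately.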
Respond fast, process async, and return the right status
Senders enforce tight receive timeouts — Stripe gives you about 10 seconds before it counts the delivery as failed and schedules a retry. If you run billing logic, send email, and call three downstream APIs inside the request, you will blow the budget, the sender retries, and now you are processing the same event concurrently. Do the verification and the dedupe INSERT synchronously, then enqueue and return immediately.
```typescript
import { Queue } from "bullmq";
import { verifyWebhook } from "@/lib/verify";
// `db` is assumed to be a pg-style client exposing query(text, params).

const queue = new Queue("webhooks", { connection: { url: process.env.REDIS_URL } });

export async function POST(req: Request) {
  const raw = Buffer.from(await req.arrayBuffer());
  const sig = req.headers.get("webhook-signature") ?? "";
  if (!verifyWebhook(raw, sig, process.env.WEBHOOK_SECRET!)) {
    return new Response("invalid signature", { status: 401 });
  }

  let event: { id: string; type: string };
  try {
    event = JSON.parse(raw.toString("utf8"));
  } catch {
    return new Response("malformed body", { status: 400 }); // permanent failure
  }

  const inserted = await db.query(
    `INSERT INTO webhook_events (event_id, event_type, payload)
     VALUES ($1, $2, $3) ON CONFLICT (event_id) DO NOTHING RETURNING id`,
    [event.id, event.type, raw.toString("utf8")],
  );
  // Duplicate: already recorded, ack and move on.
  if (inserted.rowCount === 0) return new Response("ok (duplicate)", { status: 200 });

  // jobId = event.id makes the queue itself idempotent too.
  // If this throws after the INSERT succeeded, the sender's retry will be
  // acked as a duplicate, so something must re-enqueue rows left 'pending'.
  await queue.add(event.type, { eventId: event.id }, {
    jobId: event.id,
    attempts: 5,
    backoff: { type: "exponential", delay: 1000 },
  });
  return new Response("queued", { status: 202 });
}
```

The status code is a contract, and getting it wrong costs you either lost events or a retry storm:
| Situation | Return | Why |
|---|---|---|
| Verified and queued | 202 | Accepted, processing later |
| Duplicate event id | 200 | Already handled — stop retrying |
| Bad/missing signature | 401 | Do not retry; it is not a transient fault |
| Malformed body you will never parse | 400 | Permanent failure; retrying wastes both sides |
| Your DB or queue is down | 503 / 500 | Transient — you want the retry |
The trap is catching every exception and returning 200. That tells the sender "handled" and your transient outage silently eats the event. Only return 2xx once the event is durably recorded.
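One way to keep this contract out of scattered try/catch blocks is to classify outcomes explicitly and map them to status codes in one place. The sketch below assumes a hypothetical Outcome type; the names are illustrative and not taken from any SDK.

```typescript
// Hypothetical outcome type for an inbound webhook request.
type Outcome =
  | { kind: "queued" }
  | { kind: "duplicate" }
  | { kind: "bad_signature" }
  | { kind: "malformed" }
  | { kind: "infrastructure_down" };

function statusFor(outcome: Outcome): number {
  switch (outcome.kind) {
    case "queued":
      return 202; // accepted, processing later
    case "duplicate":
      return 200; // already recorded, sender should stop retrying
    case "bad_signature":
      return 401; // permanent: retrying will not fix it
    case "malformed":
      return 400; // permanent: the body will never parse
    case "infrastructure_down":
      return 503; // transient: we want the sender to retry
  }
}

// Map a thrown error to a status: a JSON SyntaxError is permanent,
// everything else is assumed transient so the sender retries.
function statusForError(err: unknown): number {
  return err instanceof SyntaxError ? 400 : 500;
}
```

The important property is the default: an error nobody anticipated maps to 500, never to 200.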
Tolerate duplicates and out-of-order arrival
There is no ordering guarantee. A subscription.updated can land before the subscription.created it logically follows. Two defenses: dedupe (above) handles duplicates; for ordering, carry a version or timestamp inside the event and ignore stale ones. When applying state, treat the event as a conditional update — only overwrite if the incoming version is newer than what you have stored. This connects directly to background jobs: the queue worker, not the HTTP handler, is where this reconciliation lives, and it is also where you re-fetch the object from the source API if the payload is too stale to trust.
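The conditional update can be sketched as a small version gate. This assumes the sender puts a monotonically increasing version number on every event for a given object; the types and the in-memory Map are hypothetical stand-ins for your own schema.

```typescript
// Hypothetical shapes: `version` is assumed to increase with every change
// the sender makes to the object identified by `objectId`.
interface SubscriptionEvent {
  objectId: string;
  version: number;
  status: string;
}

interface StoredSubscription {
  version: number;
  status: string;
}

const store = new Map<string, StoredSubscription>();

// Returns true if the event was applied, false if it was stale or a replay.
function applyIfNewer(event: SubscriptionEvent): boolean {
  const current = store.get(event.objectId);
  if (current && current.version >= event.version) return false;
  store.set(event.objectId, { version: event.version, status: event.status });
  return true;
}
```

Against a real database the same idea is usually a single UPDATE with the version comparison in its WHERE clause, so the check and the write happen atomically.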
Sending webhooks: at-least-once is a promise you have to keep
Now flip sides. You are the SaaS emitting events to customers' endpoints. The same constraints apply, but now you own delivery, retries, and the visibility surface.
Sign every payload and version it
Mirror what you expect on the receive side: HMAC-SHA256 over the raw body, signature plus timestamp in a header, secret per endpoint. Put a version field in the envelope so you can evolve the schema without breaking subscribers — never silently reshape data.
```typescript
import { createHmac } from "node:crypto";

function signedRequest(body: object, secret: string) {
  const payload = JSON.stringify({ version: "2025-12-01", ...body });
  const t = Math.floor(Date.now() / 1000);
  const signature = createHmac("sha256", secret)
    .update(`${t}.${payload}`)
    .digest("hex");
  return {
    payload,
    headers: {
      "content-type": "application/json",
      "webhook-signature": `t=${t},v1=${signature}`,
    },
  };
}
```

Retry with exponential backoff, then dead-letter
A consumer endpoint will be down sometimes. Retry on 5xx, timeouts, and connection errors, but not on most 4xx responses, which signal a permanent problem on their side (408 Request Timeout and 429 Too Many Requests are the common exceptions worth retrying). Use exponential backoff with jitter so a thousand failing deliveries do not all retry on the same tick and hammer a recovering server. After the final attempt, move the delivery to a dead-letter state the customer can inspect and replay. Stripe's own model is instructive: it retries automatically for up to three days with exponential backoff, then disables the endpoint.
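As a rough sketch of the backoff rule, the delay can be computed by a pure function. The base, cap, and jitter fraction below are illustrative defaults, not values taken from any provider, and the random source is injectable so the function can be tested.

```typescript
// Delay before retry number `attempt` (1-based), in milliseconds.
// baseMs, capMs and jitterFraction are illustrative assumptions.
function backoffDelayMs(
  attempt: number,
  baseMs = 60_000,
  capMs = 7_200_000,
  jitterFraction = 0.2,
  random: () => number = Math.random,
): number {
  // Double the delay on every attempt, but never exceed the cap.
  const exponential = Math.min(capMs, baseMs * 2 ** (attempt - 1));
  // Add up to jitterFraction of the delay so retries do not line up.
  return Math.round(exponential + random() * exponential * jitterFraction);
}
```

With the defaults, the un-jittered delays run 1 minute, 2 minutes, 4 minutes, and so on until they hit the 2 hour cap.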
```typescript
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function deliver(endpointUrl: string, body: object, secret: string) {
  const schedule = [0, 60, 300, 1800, 7200]; // seconds: ~0, 1m, 5m, 30m, 2h
  for (let attempt = 0; attempt < schedule.length; attempt++) {
    if (schedule[attempt] > 0) {
      const jitter = Math.random() * schedule[attempt] * 0.2;
      await sleep((schedule[attempt] + jitter) * 1000);
    }
    // Sign inside the loop: a timestamp from the first attempt would fall
    // outside a receiver's 5-minute tolerance by the later retries.
    const { payload, headers } = signedRequest(body, secret);
    try {
      const res = await safeFetch(endpointUrl, { method: "POST", headers, body: payload });
      if (res.ok) return { ok: true, attempt, status: res.status };
      // 408 and 429 are treated as transient here; other 4xx are permanent.
      const transient4xx = res.status === 408 || res.status === 429;
      if (res.status >= 400 && res.status < 500 && !transient4xx) {
        return { ok: false, permanent: true, status: res.status }; // do not retry
      }
    } catch {
      /* timeout or network error: fall through and retry */
    }
  }
  return { ok: false, deadLetter: true }; // exhausted, park for manual replay
}
```

A deliveries table is your support team's best friend
Record every attempt. When a customer opens a ticket saying "we never got the event," you need to answer in seconds: did we send it, what did their server return, and can they replay it. This table is also what powers a self-serve "Resend" button.
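In application code, one recorded attempt might look like the sketch below. The field names mirror the columns of the deliveries table in this section; the recordAttempt helper is hypothetical and returns the row instead of writing it, so the storage layer stays up to you.

```typescript
type DeliveryState = "pending" | "delivered" | "failed" | "dead";

interface DeliveryRecord {
  endpointId: number;
  eventId: string;
  attempt: number;
  responseStatus: number | null;
  responseMs: number;
  error: string | null;
  state: DeliveryState;
}

// Runs one send and captures what support will later need to see.
async function recordAttempt(
  endpointId: number,
  eventId: string,
  attempt: number, // 0-based
  maxAttempts: number,
  send: () => Promise<{ status: number }>,
): Promise<DeliveryRecord> {
  const started = Date.now();
  const isLast = attempt + 1 >= maxAttempts;
  const base = { endpointId, eventId, attempt };
  try {
    const res = await send();
    const delivered = res.status >= 200 && res.status < 300;
    return {
      ...base,
      responseStatus: res.status,
      responseMs: Date.now() - started,
      error: null,
      state: delivered ? "delivered" : isLast ? "dead" : "failed",
    };
  } catch (err) {
    // Timeouts and connection errors have no status, only a message.
    return {
      ...base,
      responseStatus: null,
      responseMs: Date.now() - started,
      error: err instanceof Error ? err.message : String(err),
      state: isLast ? "dead" : "failed",
    };
  }
}
```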
```sql
CREATE TABLE webhook_deliveries (
  id              BIGSERIAL PRIMARY KEY,
  endpoint_id     BIGINT NOT NULL REFERENCES webhook_endpoints(id),
  event_id        TEXT NOT NULL,
  attempt         INT NOT NULL DEFAULT 0,
  response_status INT,
  response_ms     INT,
  error           TEXT,
  state           TEXT NOT NULL DEFAULT 'pending', -- pending|delivered|failed|dead
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX ON webhook_deliveries (endpoint_id, event_id);
```

Security: the SSRF you are about to introduce
Here is the part that turns a feature into a CVE. When you let a customer register an arbitrary URL and your server makes outbound requests to it, you have built a Server-Side Request Forgery primitive (OWASP API Security Top 10, API7:2023). An attacker registers http://169.254.169.254/latest/meta-data/iam/security-credentials/ and your delivery worker happily fetches cloud credentials, or points at http://localhost:6379 to poke internal Redis.
Block it at the network layer. Resolve the hostname yourself, reject private and link-local ranges, and refuse redirects to anything you have not re-validated. Always set an explicit connect and read timeout — a slow-loris endpoint should not pin a worker.
```typescript
import { lookup } from "node:dns/promises";
import ipaddr from "ipaddr.js";

async function assertPublicHost(url: string) {
  const u = new URL(url);
  if (u.protocol !== "https:") throw new Error("https required");
  // Check every address the name resolves to, not just the first.
  const addresses = await lookup(u.hostname, { all: true });
  for (const { address } of addresses) {
    // process() unwraps IPv4-mapped IPv6 (::ffff:10.0.0.1) before classifying.
    const range = ipaddr.process(address).range();
    // Allowlist: only plain public unicast passes. This also rejects ranges
    // such as unspecified, multicast, and carrier-grade NAT.
    if (range !== "unicast") throw new Error(`blocked target range: ${range}`);
  }
}

async function safeFetch(url: string, init: RequestInit) {
  await assertPublicHost(url);
  const ctrl = new AbortController();
  const timer = setTimeout(() => ctrl.abort(), 5000); // 5s hard cap
  try {
    return await fetch(url, { ...init, redirect: "manual", signal: ctrl.signal });
  } finally {
    clearTimeout(timer);
  }
}
```

This is still racy against DNS rebinding (the name resolves public on check, private on connect). The bulletproof fix is to pin the connection to the validated IP, or route all webhook egress through a forward proxy on a locked-down subnet with no route to your internal network or the metadata endpoint. For most teams, the proxy is the better investment.
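The redirect rule from earlier in this section (refuse redirects to anything you have not re-validated) can be sketched as a small loop. The helper below is hypothetical: validate stands for whatever SSRF check you already run and should throw on a blocked target, and fetchImpl is injectable so the logic can be tested without a network. Note that it re-sends the same method and body on every hop, which is an assumption of this sketch rather than standard fetch behaviour.

```typescript
// Hypothetical helper: follows redirects by hand, validating every hop.
async function fetchWithCheckedRedirects(
  url: string,
  init: RequestInit,
  validate: (url: string) => Promise<void>,
  fetchImpl: typeof fetch = fetch,
  maxHops = 3,
): Promise<Response> {
  let current = url;
  for (let hop = 0; hop <= maxHops; hop++) {
    await validate(current); // re-check every hop, not just the first URL
    const res = await fetchImpl(current, { ...init, redirect: "manual" });
    const location = res.headers.get("location");
    // Anything that is not a 3xx with a Location header is the final answer.
    if (res.status < 300 || res.status >= 400 || !location) return res;
    current = new URL(location, current).toString(); // resolve relative redirects
  }
  throw new Error("too many redirects");
}
```

A simpler policy, and a reasonable one for webhooks, is to skip following redirects altogether and treat any 3xx as a failed delivery.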
The checklist
Print this and tape it to the PR template.
Receiving:
- Verify HMAC over the raw body, constant-time compare, reject stale timestamps.
- Dedupe on the sender's event id with a UNIQUE constraint, not a SELECT.
- Acknowledge 2xx only after the event is durably stored; enqueue the real work.
- Return 401/400 for permanent failures, 5xx for transient ones; never blanket 200.
- Reconcile out-of-order events by version in the worker, re-fetching from source when stale.
Sending:
- Sign every payload, one secret per endpoint, version the envelope.
- Exponential backoff with jitter; retry 5xx/timeouts, not 4xx; dead-letter after exhaustion.
- Persist every attempt in a deliveries table; expose replay.
Security:
- Validate the target IP against private/link-local ranges; enforce https and hard timeouts.
- Route egress through a proxy or pin the resolved IP to defeat DNS rebinding.
Webhooks are not hard because the HTTP is hard. They are hard because the network is honest about being unreliable, and most handlers pretend otherwise. Build for at-least-once and unordered from line one, and the double-charge ticket never gets filed.
Further reading
- Stripe webhooks documentation: docs.stripe.com/webhooks
- OWASP API Security Top 10 (API7:2023, SSRF): owasp.org
- RFC 2104, HMAC: Keyed-Hashing for Message Authentication
- MDN AbortController and fetch references: developer.mozilla.org