Background Jobs at Scale: Queues, Workers, and Idempotency

A customer hits "Export" on a 40,000-row report. Your API thread spends 18 seconds rendering a CSV (a larger report would hit the load balancer's 30-second timeout outright), and the user, staring at a spinner, clicks the button again. Now you've got two exports running, two emails queued, and a support ticket. The fix everyone reaches for is "move it to a background job." That part is easy. The hard part is what nobody tells you: the moment work leaves the request path, you've signed up for a distributed system with at-least-once delivery, and every assumption you held about "this code runs once" is now wrong.

I've shipped job systems on Redis, Postgres, and SQS across a dozen production SaaS apps. The queue backend is the boring part. The failure modes are where the engineering lives.

Why move work off the request path

Three reasons, in order of how often they actually matter:

Latency. The user shouldn't wait for a third-party API, a PDF render, or a fan-out to 500 webhook subscribers. Acknowledge fast, do the work later.
Resilience. If Stripe is having a slow afternoon, your checkout endpoint shouldn't be slow too. A queue decouples your availability from theirs.
Throughput shaping. A signup spike that triggers 10,000 welcome emails shouldn't melt your SMTP provider's rate limit. Workers with bounded concurrency smooth the spike into a flat line.

The thing you trade away is synchronous certainty. The request returns before the work is done, so you need a way to report status, and you need the work to actually happen even if a worker crashes mid-job.

Picking a backend

The right answer depends on what you already run in production. Adding a new stateful dependency to your infra is a real cost — backups, failover, on-call runbooks — so the default should be "use what's already there."

Backend	Delivery	Best for	Ordering	DLQ	Operational cost
BullMQ (Redis 7+)	At-least-once	Node apps already running Redis; rich features (rate limit, flows, repeatable jobs)	Per-queue FIFO option	Built-in failed set	Redis persistence + memory pressure
AWS SQS	At-least-once (standard) / exactly-once processing within a 5-minute deduplication window (FIFO)	AWS-native, near-unlimited throughput on standard queues, zero servers to run	FIFO queues only	Native redrive policy	Managed; near zero
pg-boss (Postgres 12+)	At-least-once	Teams who want one datastore; transactional enqueue for free	`singletonKey` / priority	Built-in	None beyond your DB
Durable workflows (Temporal, Inngest)	Exactly-once semantics on top of at-least-once	Multi-step sagas, long-running orchestration	Per-workflow	Automatic	Highest (new cluster or SaaS)

A rule I hold to: if your job volume is under ~50 jobs/second and you already run Postgres, start with pg-boss. You get atomic enqueue (more on that below) and you don't add a Redis instance you'll have to operate. Reach for BullMQ when you need its scheduling features or you're past Postgres's comfortable polling throughput. Reach for SQS when you're on AWS and want someone else to own the durability. Reach for a durable-workflow engine only when you have genuine multi-step orchestration — payment → provision → notify → reconcile — where you'd otherwise hand-roll a state machine.

The rule that governs everything: at-least-once means idempotent

Read the AWS SQS docs and you'll find this stated plainly: standard queues guarantee a message is delivered at least once, and occasionally a message is delivered more than once. BullMQ is the same — if a worker pulls a job, does the work, and dies before acking, the job's lock expires and another worker picks it up. The work already happened. It happens again.

So the non-negotiable: every handler must be idempotent. Running it twice with the same input must produce the same result as running it once. This is not a nice-to-have. It is the price of admission.
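
One way to make "same input" concrete is to derive the idempotency key from the business identity of the work rather than generating a random ID per enqueue, so a double-click or a redelivered message maps to the same key. The sketch below is a minimal illustration; the `idempotencyKeyFor` helper and the field names are hypothetical and not part of any queue library.

```typescript
// Hypothetical helper: builds a stable key from the fields that define
// "the same piece of work". Random IDs or timestamps would defeat dedupe.
function idempotencyKeyFor(
  jobType: string,
  identity: Record<string, string | number>
): string {
  // Sort field names so { a, b } and { b, a } produce the same key, and
  // URL-encode values so "&" or "=" inside a value cannot cause collisions.
  const canonical = Object.keys(identity)
    .sort()
    .map((k) => `${encodeURIComponent(k)}=${encodeURIComponent(String(identity[k]))}`)
    .join("&");
  return `${jobType}:${canonical}`;
}

// Both calls describe the same welcome email, so they share one key.
const keyA = idempotencyKeyFor("welcome-email", { userId: 42, template: "welcome" });
const keyB = idempotencyKeyFor("welcome-email", { template: "welcome", userId: 42 });
```

Readable keys like this are easy to debug in a dedupe table; if key length matters, hashing the canonical string is a reasonable variation.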

The cleanest way to enforce it is an idempotency key plus a dedupe table. Here's a BullMQ worker that guards against double-execution using a unique constraint in Postgres:

import { Worker, Queue } from "bullmq";
import { Pool } from "pg";
 
const connection = { host: "127.0.0.1", port: 6379 };
const pool = new Pool({ connectionString: process.env.DATABASE_URL });
 
export const emailQueue = new Queue("emails", {
  connection,
  defaultJobOptions: {
    attempts: 5,
    backoff: { type: "exponential", delay: 1000 }, // 5 attempts = 4 retries: ~1s, 2s, 4s, 8s
    removeOnComplete: 1000,
    removeOnFail: 5000,
  },
});
 
const worker = new Worker(
  "emails",
  async (job) => {
    const { userId, template, idempotencyKey } = job.data;
 
    // Atomic guard: succeeds only the first time this key is seen.
    const claim = await pool.query(
      `INSERT INTO processed_jobs (idempotency_key, queue, created_at)
       VALUES ($1, 'emails', now())
       ON CONFLICT (idempotency_key) DO NOTHING
       RETURNING idempotency_key`,
      [idempotencyKey]
    );
 
    if (claim.rowCount === 0) {
      // Already processed (or in-flight on a prior attempt). Treat as success.
      await job.log(`Skipping duplicate for key ${idempotencyKey}`);
      return { skipped: true };
    }
 
    await sendEmail(userId, template); // the actual side effect
    return { sent: true };
  },
  { connection, concurrency: 20, lockDuration: 30_000 }
);
 
worker.on("failed", (job, err) => {
  console.error(`Job ${job?.id} failed (attempt ${job?.attemptsMade}):`, err.message);
});

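Two subtleties here that bite people. First, concurrency: 20 is per-worker-process. If you run four worker pods, that's 80 concurrent jobs hitting your email provider, so size it against the provider's rate limit, not your gut. Second, lockDuration: 30_000 is BullMQ's closest equivalent to a visibility timeout. The worker renews the lock automatically while a job runs (every lockRenewTime, by default half of lockDuration), so a slow but well-behaved async job keeps its lock. The danger is a handler that blocks the Node event loop, or a process that stalls, for longer than 30 seconds: the renewal never fires, BullMQ marks the job as stalled, and another worker picks it up while the first is still running. Keep handlers non-blocking, move CPU-heavy work into a sandboxed processor, or raise lockDuration; otherwise your idempotency guard is the only thing standing between you and a double-charge.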
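
The concurrency arithmetic is worth writing down rather than eyeballing. The sketch below is a back-of-envelope calculation based on Little's law (in-flight jobs = throughput x duration); the function names and example numbers are assumptions for illustration.

```typescript
// Largest per-process `concurrency` that keeps the whole fleet under a
// downstream rate limit, assuming every slot is busy all the time.
function maxConcurrencyPerWorker(
  rateLimitPerSecond: number, // provider's limit, e.g. 100 emails/s
  avgJobSeconds: number, // average time one job holds a slot
  workerProcesses: number // pods x processes per pod
): number {
  const fleetWideInFlight = rateLimitPerSecond * avgJobSeconds;
  // Never returns less than 1, so a very tight limit still needs a rate limiter.
  return Math.max(1, Math.floor(fleetWideInFlight / workerProcesses));
}

// Peak request rate the fleet can generate at a given setting.
function peakJobsPerSecond(
  concurrency: number,
  avgJobSeconds: number,
  workerProcesses: number
): number {
  return (concurrency * workerProcesses) / avgJobSeconds;
}

// Four processes at concurrency 20 with 250 ms jobs can peak at 320 jobs/s.
const peak = peakJobsPerSecond(20, 0.25, 4);
// To stay under a 100/s provider limit, each process gets at most 6 slots.
const safe = maxConcurrencyPerWorker(100, 0.25, 4);
```

When the downstream limit is strict, BullMQ's worker-level rate limiter is the more direct tool; the arithmetic above is mainly a sanity check on the concurrency setting.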

An honest caveat on the guard

The INSERT ... ON CONFLICT claims the key before the side effect runs. If sendEmail throws, the row stays, and a retry sees the conflict and skips — meaning the email never sends. For a non-critical email that's acceptable. For anything important, claim the key, do the work, and only mark it completed in the same row after success; on retry, re-run if the row is claimed-but-not-completed. Idempotency is a spectrum, and "at-most-once" and "effectively-once" are different guarantees. Pick deliberately.
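
The claim-then-complete variant can be sketched as a small state machine. This is a minimal in-memory model, not the worker code above: the `DedupeStore` class, its method names, and the stale-claim timeout are assumptions, and in Postgres each method would need to be a single atomic statement against the processed_jobs row.

```typescript
type ClaimRow = { status: "claimed" | "completed"; claimedAtMs: number };
type Decision = "run" | "skip-completed" | "skip-in-flight";

// In-memory stand-in for the dedupe table (hypothetical, for illustration).
class DedupeStore {
  private rows = new Map<string, ClaimRow>();

  // Decide what a worker should do when it picks up a job with this key.
  claim(key: string, nowMs: number, staleAfterMs: number): Decision {
    const row = this.rows.get(key);
    if (!row) {
      this.rows.set(key, { status: "claimed", claimedAtMs: nowMs });
      return "run";
    }
    if (row.status === "completed") return "skip-completed";
    // Claimed but never completed: the earlier attempt died or is still running.
    if (nowMs - row.claimedAtMs >= staleAfterMs) {
      row.claimedAtMs = nowMs; // take over the stale claim
      return "run";
    }
    return "skip-in-flight";
  }

  // Call only after the side effect has succeeded.
  complete(key: string): void {
    const row = this.rows.get(key);
    if (row) row.status = "completed";
  }

  // Call when the side effect throws, so the next retry can run immediately.
  release(key: string): void {
    const row = this.rows.get(key);
    if (row && row.status === "claimed") this.rows.delete(key);
  }
}
```

On "skip-in-flight" the handler should fail the attempt so the queue retries later, rather than report success. Note the trade-off: a worker that crashes after the side effect but before complete() will cause a second run once the claim goes stale, so this buys "runs at least once, usually exactly once", not at-most-once.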

Retries, backoff, and jitter

Exponential backoff alone has a sharp edge: if 1,000 jobs fail at the same instant (a provider blip), they all retry at exactly 1s, then exactly 2s, then exactly 4s — synchronized thundering herds hammering a service that's already struggling. The fix is jitter: randomize each retry delay so the herd spreads out. AWS's own architecture guidance has recommended full jitter for years.

BullMQ lets you supply a custom backoff strategy:

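import { Queue, Worker } from "bullmq";

const connection = { host: "127.0.0.1", port: 6379 };

export const webhookQueue = new Queue("webhooks", {
  connection,
  defaultJobOptions: {
    attempts: 8,
    // A type name that isn't built in is resolved by the worker's backoffStrategy.
    backoff: { type: "expoJitter" },
  },
});

// The strategy itself is registered on the Worker via settings.
export const webhookWorker = new Worker(
  "webhooks",
  async (job) => deliverWebhook(job.data), // deliverWebhook: your own handler
  {
    connection,
    settings: {
      backoffStrategy: (attemptsMade: number) => {
        const base = 1000;
        const cap = 60_000;
        const exp = Math.min(cap, base * 2 ** attemptsMade);
        return Math.floor(Math.random() * exp); // full jitter: [0, exp)
      },
    },
  }
);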

Cap the backoff (here at 60s) so a job that keeps failing doesn't drift ever further out: uncapped, a 1s base doubles past an hour within a dozen or so failures. And bound the attempts. After 8 tries, treat the job as a poison message: something about it is fundamentally unprocessable (malformed payload, deleted entity, a bug in your handler). Retrying forever just burns CPU and hides the problem.
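
To see what a cap and an attempt budget add up to, it helps to compute the schedule. The sketch below mirrors the strategy above, min(cap, base * 2^attemptsMade), and assumes attemptsMade is 1 after the first failure; the function names are hypothetical.

```typescript
// Upper bound of the jitter window before each retry.
function retryWindowsMs(attempts: number, baseMs: number, capMs: number): number[] {
  const windows: number[] = [];
  // `attempts` total tries means `attempts - 1` retries.
  for (let attemptsMade = 1; attemptsMade < attempts; attemptsMade++) {
    windows.push(Math.min(capMs, baseMs * 2 ** attemptsMade));
  }
  return windows;
}

// Worst case: every retry lands at the very end of its window.
function worstCaseTotalMs(windows: number[]): number {
  return windows.reduce((sum, w) => sum + w, 0);
}

const capped = retryWindowsMs(8, 1000, 60_000);
// capped is [2000, 4000, 8000, 16000, 32000, 60000, 60000]
const cappedTotal = worstCaseTotalMs(capped); // 182000 ms
const uncappedTotal = worstCaseTotalMs(retryWindowsMs(8, 1000, Infinity)); // 254000 ms
```

With these settings a job spends at most about three minutes in backoff before it is declared failed, and with full jitter the expected wait is roughly half of that. That number is what you compare against how long a downstream outage usually lasts.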

Dead-letter queues and poison messages

When a job exhausts its retries, it must go somewhere a human can see it. With SQS this is a redrive policy on the queue: after maxReceiveCount failed receives, the message moves to a dead-letter queue automatically. You configure it once:

aws sqs set-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/jobs \
  --attributes '{
    "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:jobs-dlq\",\"maxReceiveCount\":\"5\"}",
    "VisibilityTimeout": "60"
  }'

VisibilityTimeout is SQS's version of BullMQ's lock duration: once a worker receives a message, it's hidden from other consumers for that window. If the worker doesn't delete it in time, it reappears. Set it to comfortably exceed your p99 job duration, or you'll process the same message twice on slow jobs — which loops you right back to "your handler must be idempotent."

A DLQ that nobody watches is just a memory leak with extra steps. Alert on DLQ depth > 0. Every message in there is a job that affected a real customer and silently didn't run.

The transactional outbox: enqueue atomically with your DB write

Here's the bug that takes down the most teams. You write to your database and then enqueue a job:

await db.insert(orders).values(order);       // committed
await queue.add("fulfill", { orderId });     // process crashes here

If the process dies between those two lines, you have an order with no fulfillment job. The two operations aren't atomic — your database and your queue are separate systems, and there's no transaction spanning both.

The transactional outbox pattern fixes this. Instead of enqueuing directly, you insert the job into an outbox table in the same database transaction as your business write. A separate relay polls the outbox and pushes to the real queue. Because the insert is part of the transaction, the job exists if and only if the order exists.

BEGIN;
 
INSERT INTO orders (id, customer_id, total_cents, status)
VALUES ('ord_a1b2', 'cus_x9', 4999, 'pending');
 
INSERT INTO outbox (id, topic, payload, created_at)
VALUES (
  gen_random_uuid(),
  'order.fulfill',
  '{"orderId":"ord_a1b2"}'::jsonb,
  now()
);
 
COMMIT;
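
The relay half of the pattern might look like the sketch below. It is written against two hypothetical interfaces (`OutboxStore`, `Publisher`) rather than a specific SQL client or queue SDK, and it assumes the outbox table gains a sent marker that the SQL above does not show.

```typescript
type OutboxRow = { id: string; topic: string; payload: unknown; sentAt: Date | null };

// Hypothetical interfaces: swap in your SQL client and queue producer.
interface OutboxStore {
  fetchUnsent(limit: number): Promise<OutboxRow[]>; // oldest first
  markSent(id: string): Promise<void>;
}
interface Publisher {
  publish(topic: string, payload: unknown, dedupeId: string): Promise<void>;
}

// One relay pass: publish first, mark sent second. A crash between the two
// steps re-publishes the row on the next pass, so delivery is at-least-once
// and the outbox id travels along as a dedupe key for consumers.
async function relayOnce(
  store: OutboxStore,
  publisher: Publisher,
  batchSize = 100
): Promise<number> {
  const rows = await store.fetchUnsent(batchSize);
  let sent = 0;
  for (const row of rows) {
    try {
      await publisher.publish(row.topic, row.payload, row.id);
      await store.markSent(row.id);
      sent++;
    } catch {
      // Leave the row unsent and stop, preserving order; the next pass retries.
      break;
    }
  }
  return sent;
}
```

Run relayOnce on a short interval from a single relay process, or guard the fetch with row locking if several relays run at once. Either way the relay only moves the at-least-once boundary; consumers still need the idempotency guard.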

With pg-boss, you get this nearly for free, because the queue is your Postgres database. You enqueue inside the same transaction as your write, and a crash either rolls back both or commits both:

import PgBoss from "pg-boss";
import { Pool } from "pg";

type Order = { id: string; customerId: string; totalCents: number };

const boss = new PgBoss(process.env.DATABASE_URL!);
await boss.start();
await boss.createQueue("order.fulfill");

// Enqueue in the SAME transaction as the business write.
async function placeOrder(db: Pool, order: Order) {
  const client = await db.connect();
  try {
    await client.query("BEGIN");
    await client.query(
      `INSERT INTO orders (id, customer_id, total_cents, status)
       VALUES ($1, $2, $3, 'pending')`,
      [order.id, order.customerId, order.totalCents]
    );
    // Pass our client through the `db` option so the job insert runs on this
    // connection and joins the txn. Without it, pg-boss uses its own pool and
    // the job is inserted outside the transaction.
    await boss.send(
      "order.fulfill",
      { orderId: order.id },
      {
        singletonKey: order.id,
        retryLimit: 6,
        retryBackoff: true,
        db: { executeSql: (text: string, values: any[]) => client.query(text, values) },
      }
    );
    await client.query("COMMIT");
  } catch (e) {
    await client.query("ROLLBACK");
    throw e;
  } finally {
    client.release();
  }
}

singletonKey: order.id is pg-boss's dedupe: it is meant to stop a second queued or active job with the same key from being created, so a double-submit doesn't create two fulfillment jobs. The exact semantics depend on the pg-boss major version and the queue's policy, so verify them against the version you run. retryBackoff: true gives you exponential backoff with jitter built in. This is why I default new SaaS apps to pg-boss: the outbox problem disappears because there's only one system to be transactional about.

Observability: you can't fix what you can't see

A job system is a black box until you instrument it. The four numbers I put on a dashboard before anything ships:

Queue depth (lag). Jobs waiting / oldest waiting job's age. If lag grows monotonically, your workers can't keep up — scale out or you'll fall hours behind.
Processing rate vs. enqueue rate. When enqueue outpaces processing for more than a few minutes, page someone.
Retry rate and DLQ depth. A spike in retries usually means a downstream dependency is failing. DLQ depth > 0 means customers are affected.
Per-job-type p50/p99 duration. This is how you catch a job that's slowly outgrowing its visibility timeout before it starts double-processing.
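
The first three signals can be turned into alert rules with very little code. The sketch below evaluates a window of periodic samples; the `QueueSample` shape, the alert names, and the three-sample minimum are assumptions, and the sampling itself (from your queue's metrics API) is left out.

```typescript
type QueueSample = {
  oldestWaitingAgeMs: number; // lag: age of the oldest waiting job
  enqueued: number; // jobs enqueued since the previous sample
  processed: number; // jobs completed since the previous sample
  dlqDepth: number;
};

type Alert = "lag-growing" | "falling-behind" | "dlq-not-empty";

// Evaluates a window of samples, oldest first.
function evaluateQueueHealth(samples: QueueSample[]): Alert[] {
  const alerts: Alert[] = [];
  if (samples.length === 0) return alerts;

  // Lag that rises on every consecutive sample means workers can't keep up.
  const lagGrowing =
    samples.length >= 3 &&
    samples.every(
      (s, i) => i === 0 || s.oldestWaitingAgeMs > samples[i - 1].oldestWaitingAgeMs
    );
  if (lagGrowing) alerts.push("lag-growing");

  // Enqueue outpacing processing across the whole window, not just one spike.
  const fallingBehind =
    samples.length >= 3 && samples.every((s) => s.enqueued > s.processed);
  if (fallingBehind) alerts.push("falling-behind");

  // Any dead-lettered job is customer-visible, so there is no threshold.
  if (samples[samples.length - 1].dlqDepth > 0) alerts.push("dlq-not-empty");
  return alerts;
}
```

Sampling once a minute and passing the last five samples approximates the "more than a few minutes" rule without paging on a single burst.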

Tag every span with the jobId and idempotencyKey and propagate trace context into the job payload, so a slow export traces from the originating HTTP request all the way through the worker. OpenTelemetry's conventions cover messaging spans explicitly — use them rather than inventing your own attribute names.
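
Propagating trace context can be as simple as carrying the W3C traceparent header inside the job payload. The sketch below is dependency-free to show the mechanics; the `JobEnvelope` shape and helper names are hypothetical, the `job.idempotency_key` attribute is a custom name, and the `messaging.*` keys should be checked against the current OpenTelemetry semantic conventions.

```typescript
// W3C Trace Context, version 00: version-traceId-spanId-flags, lowercase hex.
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

type TraceContext = { traceId: string; spanId: string; sampled: boolean };
type JobEnvelope<T> = { data: T; idempotencyKey: string; traceparent?: string };

function parseTraceparent(header: string | undefined): TraceContext | null {
  const m = header ? TRACEPARENT.exec(header) : null;
  if (!m) return null;
  const [, traceId, spanId, flags] = m;
  // All-zero ids are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Producer side: attach the incoming request's traceparent to the job.
function wrapJob<T>(data: T, idempotencyKey: string, traceparent?: string): JobEnvelope<T> {
  // Only forward a header that parses; a malformed one should not poison the job.
  return parseTraceparent(traceparent)
    ? { data, idempotencyKey, traceparent }
    : { data, idempotencyKey };
}

// Worker side: attributes to set on the processing span.
function workerSpanAttributes(
  queue: string,
  jobId: string,
  env: JobEnvelope<unknown>
): Record<string, string> {
  return {
    "messaging.system": "bullmq", // assumption: pick the value that fits your backend
    "messaging.destination.name": queue,
    "messaging.message.id": jobId,
    "job.idempotency_key": env.idempotencyKey, // custom attribute name
  };
}
```

In a real service the OpenTelemetry SDK's propagation API would normally inject and extract this header for you; the point of the sketch is only that the context has to travel in the payload, because the worker has no HTTP request to read it from.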

A decision checklist

Before you ship a background job, walk this list:

Is the handler idempotent? Decide your guarantee (at-most-once / effectively-once) and enforce it with a dedupe key. No exceptions.
Is enqueue atomic with the DB write? If a job must run after a write, use an outbox or a Postgres-backed queue. Don't write-then-enqueue.
Are retries backed off with jitter, capped, and bounded? Pick a sane attempts value (5–8) and cap the delay (30–60s).
Does the visibility timeout / lock exceed p99 duration? Extend the lock on long jobs, or accept double-processing.
Is there a DLQ, and is its depth alerted? A poison message must surface to a human.
Is concurrency bounded against downstream rate limits? Remember it multiplies by pod count.
Can you observe lag, throughput, retries, and duration? If not, you're flying blind.

The queue backend is the part you can swap later. Idempotency, atomic enqueue, and bounded retries are architectural decisions you can't bolt on after the double-charges start. Get those three right and the rest is tuning.