Zero-Downtime Database Migrations: A Playbook

I once watched a ALTER TABLE orders ADD COLUMN status text NOT NULL DEFAULT 'pending' take down checkout for eleven minutes on a Friday afternoon. The migration looked harmless in code review. In production, on a 40-million-row table, it grabbed an ACCESS EXCLUSIVE lock, queued behind a long-running report, and then every INSERT, UPDATE, and SELECT piled up behind it. The app servers exhausted their connection pools, health checks failed, and the load balancer started shedding traffic. Nobody touched the orders table directly. The lock did all the damage.

That outage is entirely avoidable. Postgres gives you every tool you need to change a schema while the application keeps serving traffic. You just have to know which DDL statements take heavy locks, and how to break a dangerous change into safe steps. That decomposition has a name: expand/contract, also called parallel change.

The core idea: never break the running code

The mistake people make is treating a schema change as one atomic event: the database changes, the app changes, done. But your deploy is not atomic. During a rolling deploy you run old code and new code against the same database for minutes. If the schema only works for one version, the other version throws errors.

Expand/contract fixes this by splitting every change into three phases, each independently deployable:

Expand — add the new shape (column, index, table) without removing the old. The schema now supports both old and new code.
Migrate — ship code that writes to both shapes and reads from the new one. Backfill historical rows.
Contract — once nothing reads or writes the old shape, drop it.

The golden rule sits between phases: never deploy a code change and the schema change it depends on at the same time. Ship the tolerant code first, then change the schema, then ship the code that depends on it. Each step is reversible on its own.

Which DDL actually hurts

Before the playbook, know your enemy. In PostgreSQL 17, these are the operations that take an ACCESS EXCLUSIVE lock — the strongest lock, which blocks every read and write — and how long they hold it:

Operation	Lock taken	Blocks reads?	Holds lock for
`ADD COLUMN` (no volatile default)	`ACCESS EXCLUSIVE`	Yes	Milliseconds (metadata only)
`ADD COLUMN ... DEFAULT <volatile>`	`ACCESS EXCLUSIVE`	Yes	Full table rewrite
`ALTER COLUMN ... TYPE` (most cases)	`ACCESS EXCLUSIVE`	Yes	Full table rewrite
`ADD CONSTRAINT ... NOT NULL` (direct)	`ACCESS EXCLUSIVE`	Yes	Full table scan
`CREATE INDEX` (plain)	`SHARE`	No	Full index build (blocks writes)
`CREATE INDEX CONCURRENTLY`	`SHARE UPDATE EXCLUSIVE`	No	Long, but writes proceed
`ADD CONSTRAINT ... CHECK NOT VALID`	`ACCESS EXCLUSIVE`	Yes	Milliseconds (no scan)
`VALIDATE CONSTRAINT`	`SHARE UPDATE EXCLUSIVE`	No	Full scan, writes proceed

The crucial insight from the Postgres docs on ALTER TABLE: the duration matters as much as the lock strength. A metadata-only ADD COLUMN takes ACCESS EXCLUSIVE but releases it in milliseconds — harmless if you don't have to wait to acquire it. A column rewrite takes the same lock but holds it for the entire rewrite. The difference between a non-event and an outage is whether the statement rewrites the table.

Since PostgreSQL 11, ADD COLUMN ... DEFAULT <constant> no longer rewrites the table — the default is stored in catalog metadata. But a volatile default like DEFAULT now() or DEFAULT gen_random_uuid() still forces a full rewrite, because each row needs a distinct value.

Always set a lock_timeout

This single line would have prevented my Friday outage. Before any DDL, cap how long the migration will wait to acquire its lock:

SET lock_timeout = '3s';
SET statement_timeout = '0';  -- but cap individual statements where it matters

Here is why this is non-negotiable. When ALTER TABLE can't get its lock immediately, it joins the lock queue — and in Postgres, a waiting ACCESS EXCLUSIVE request blocks every new query behind it, even cheap reads that would otherwise be compatible with the current lock holder. So a migration stuck waiting behind one slow query stops the whole table. With lock_timeout = '3s', the migration gives up after three seconds, fails cleanly, and you retry later. The application never notices.

I set lock_timeout at the start of every migration, full stop. If a statement can't grab its lock in a few seconds during business hours, I would rather it fail and alert me than silently freeze the table.

Adding a NOT NULL column, the safe way

The dangerous one-liner from my story does three things at once: adds a column, sets a volatile-feeling default, and enforces NOT NULL. Split it into steps that each take a brief lock or no exclusive lock at all.

-- Step 1 (Expand): add a nullable column. Metadata-only, milliseconds.
SET lock_timeout = '3s';
ALTER TABLE orders ADD COLUMN status text;
 
-- Step 2 (Migrate): backfill in batches (see below), and ship app code
-- that writes `status` on every new/updated row.
 
-- Step 3: add the constraint as NOT VALID — no table scan, brief lock.
SET lock_timeout = '3s';
ALTER TABLE orders
  ADD CONSTRAINT orders_status_not_null
  CHECK (status IS NOT NULL) NOT VALID;
 
-- Step 4: validate it — full scan, but only SHARE UPDATE EXCLUSIVE,
-- so reads and writes keep flowing.
ALTER TABLE orders VALIDATE CONSTRAINT orders_status_not_null;

The CHECK (... IS NOT NULL) NOT VALID followed by VALIDATE CONSTRAINT is the canonical trick. The NOT VALID clause tells Postgres to enforce the constraint on new rows but skip checking existing ones, so it takes only a momentary lock. VALIDATE CONSTRAINT then scans existing rows under a weak lock that doesn't block traffic. The result is a real check constraint enforced across the whole table, acquired without a single blocking scan.

On Postgres 12+ you can convert this into a true column NOT NULL cheaply: once a validated CHECK (status IS NOT NULL) exists, ALTER TABLE orders ALTER COLUMN status SET NOT NULL skips the full scan because the planner already knows no nulls exist. Then you can drop the redundant check constraint.

Backfilling without a 40-minute transaction

Never backfill a big table in one UPDATE. A single UPDATE orders SET status = 'pending' WHERE status IS NULL over 40 million rows holds row locks for the entire run, bloats the table with dead tuples, and can block VACUUM from cleaning up. Batch it, and commit between batches so locks release and autovacuum can keep up.

-- Run this loop from your migration tool or a one-off script.
-- Each batch is its own transaction.
DO $$
DECLARE
  rows_updated integer;
BEGIN
  LOOP
    UPDATE orders
    SET status = 'pending'
    WHERE id IN (
      SELECT id FROM orders
      WHERE status IS NULL
      ORDER BY id
      LIMIT 5000
      FOR UPDATE SKIP LOCKED
    );
    GET DIAGNOSTICS rows_updated = ROW_COUNT;
    EXIT WHEN rows_updated = 0;
    COMMIT;            -- release locks, let autovacuum breathe
    PERFORM pg_sleep(0.05);  -- throttle: be kind to replication lag
  END LOOP;
END $$;

A few things earn their keep here. FOR UPDATE SKIP LOCKED means the backfill never fights the live application for the same rows — if a user is updating order 12345 right now, the backfill skips it and moves on. The pg_sleep(0.05) throttle keeps replication lag in check; on a busy primary, a tight backfill loop can push read replicas seconds behind, and your read-after-write guarantees evaporate. I tune the batch size and sleep by watching pg_stat_replication lag in real time — 5,000 rows with a 50 ms pause is a reasonable starting point, not a law.

Building indexes without blocking writes

A plain CREATE INDEX takes a SHARE lock that blocks all writes to the table for the entire build — on a large table that's minutes of failed inserts. The fix is in the Postgres docs under "Building Indexes Concurrently":

-- Cannot run inside a transaction block.
CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status);

CREATE INDEX CONCURRENTLY does two table scans and waits out in-flight transactions, so it's slower in wall-clock time but never blocks writes. The catch: it can't run inside a transaction block, which trips up migration tools that wrap everything in BEGIN/COMMIT. In a Rails migration you set disable_ddl_transaction!; in raw SQL tools like sqlx or golang-migrate, you mark the migration as no-transaction.

The other catch nobody warns you about: if CREATE INDEX CONCURRENTLY fails — a deadlock, a timeout, a deploy that got cancelled — it leaves behind an INVALID index that still costs you on writes but is never used for reads. Always check for and clean up the wreckage:

-- Find invalid indexes left by failed concurrent builds
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;
 
-- Drop and rebuild
DROP INDEX CONCURRENTLY idx_orders_status;
CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status);

Renaming a column and changing a type

These are the two that catch people, because a RENAME or a TYPE change seems atomic but breaks every running instance of the old code. You never rename in place under load. You expand/contract.

To rename orders.amount to orders.total_cents:

-- Phase 1 (Expand): add the new column, nullable.
ALTER TABLE orders ADD COLUMN total_cents bigint;

Then ship application code that writes both columns and reads the new one (falling back to the old when new is null). Backfill total_cents from amount in batches. Once the backfill completes and you've confirmed every row is populated, deploy code that reads and writes only total_cents. Finally:

-- Phase 3 (Contract): nothing references the old column anymore.
ALTER TABLE orders DROP COLUMN amount;

A type change follows the same arc: add a new column with the target type, dual-write, backfill with the conversion, cut reads over, drop the old column. This avoids the in-place ALTER COLUMN ... TYPE rewrite, which takes ACCESS EXCLUSIVE for the whole table. The keyword for keeping both versions of the app happy during the middle phase is tolerant reads, dual writes — your code in the migrate phase must handle a row written by either side.

The deploy ordering that ties it together

The schema steps are only half the playbook. The deploy choreography is the other half. For the rename above, the correct order across multiple deploys is:

Deploy app code that tolerates the column being absent (no-op if you start fresh, but real when contracting).
Migrate: ADD COLUMN total_cents.
Deploy app code that dual-writes (amount and total_cents) and reads total_cents with fallback.
Migrate: batched backfill.
Deploy app code that reads and writes only total_cents.
Migrate: DROP COLUMN amount.

Six steps, each independently reversible, none of which requires both old and new code to agree on the schema at the same instant. It's more deploys than a single reckless ALTER, but every one of them is boring — and boring is the entire goal.

The checklist

Before any production migration touches a table large enough to matter, I run through this:

Does the statement rewrite the table? (ADD COLUMN with volatile default, ALTER TYPE.) If yes, redesign as add-new-column + backfill + drop-old.
Have I set lock_timeout (3s) at the top of the migration?
Are indexes built with CREATE INDEX CONCURRENTLY, with the transaction wrapper disabled?
Is every NOT NULL or FOREIGN KEY added as NOT VALID first, then VALIDATE separately?
Is the backfill batched and committed per batch, with a throttle for replication lag?
Does the currently running app code tolerate the schema both before and after this step?
Did I split this into expand → migrate → contract deploys, never coupling a schema change to the code that depends on it?
Do I have a one-line rollback for this exact step?

The discipline is simple to state and easy to skip when you're in a hurry: small, reversible steps, weak locks, and code that never assumes the schema is one specific version. Do that and the worst migration outage you'll ever have is a lock_timeout error in a log, retried five minutes later.