Reading EXPLAIN ANALYZE: A Practical Guide to Postgres Query Plans

A query that ran in 8ms in staging takes 4 seconds in production, and the first thing half the engineers I've worked with do is start guessing. Add an index. Rewrite the join. Bump the instance size. Sometimes one of those guesses lands, and then the same person spends the next year believing "indexes make queries faster" as if it were a law of physics.

Postgres will tell you exactly what it did and how long each step took. The output just looks like a wall of parentheses. Once you can read it, you stop guessing. This post is about reading that wall of parentheses — the node types, the numbers that matter, and how to find the one node that is actually slow. I'm assuming Postgres 16 or 17, but everything here has been stable since at least Postgres 12.

Run it the right way

EXPLAIN alone gives you the planner's estimates. It does not run the query. Useful, but the planner is frequently wrong, and the whole point is to compare what it expected against what happened.

EXPLAIN ANALYZE actually executes the query and reports real timings. Always add BUFFERS — it tells you whether data came from cache or disk, which is half the story. On Postgres 16 and 17 it's off unless you ask for it, so make it a habit; spelling it out works everywhere and documents intent.

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE customer_id = 4815 AND status = 'shipped';

Two warnings before you run this in production. First, ANALYZE executes the query, including INSERT/UPDATE/DELETE. Wrap mutations in a transaction you roll back:

BEGIN;
EXPLAIN (ANALYZE, BUFFERS) UPDATE orders SET status = 'cancelled' WHERE id = 991;
ROLLBACK;

Second, on tables wider than a few columns, FORMAT TEXT is the readable one for humans; FORMAT JSON is what you pipe into tooling. Stick with text while you're learning.

The anatomy of one node

Here's a real plan for the query above. Read it from the most indented node outward — the inner nodes run first and feed their parents.

Seq Scan on orders  (cost=0.00..18334.00 rows=12 width=84)
                    (actual time=0.512..201.882 rows=11 loops=1)
  Filter: ((customer_id = 4815) AND (status = 'shipped'::text))
  Rows Removed by Filter: 999989
  Buffers: shared hit=128 read=8206
Planning Time: 0.140 ms
Execution Time: 202.031 ms

Every number here means something specific:

cost=0.00..18334.00 — startup cost and total cost, in arbitrary planner units (roughly "cost of reading one sequential page" = 1.0). These are estimates the planner uses to choose a plan. They are not milliseconds and not directly comparable to wall-clock time. The startup cost (0.00) is the work before the first row can be emitted; the total cost (18334.00) is the work to emit the last row.
rows=12 — the planner's estimate of how many rows this node emits.
actual time=0.512..201.882 — real milliseconds: time to first row, then time to last row.
rows=11 — how many rows actually came out.
loops=1 — how many times this node executed. This is the trap. actual time and rows are reported per loop. If loops=5000, the real total time is actual time × loops. More on that in the join section.
Buffers: shared hit=128 read=8206 — 128 8KB pages came from Postgres's buffer cache (hit), 8206 had to be fetched from the OS/disk (read). That 8206 is your problem here.

The smoking gun in this plan: the planner estimated 12 rows and got 11 — that estimate is fine. But it read 8334 buffers (~65MB) to return 11 rows, because it scanned the entire table. That's a missing index, not bad statistics.

Knowing the node types

You don't need all forty node types. You need these seven, because they cover the vast majority of real plans.

Node	What it does	When it's good	When it's a smell
Seq Scan	Reads every row in the table	Small tables, or queries returning a large fraction of rows	Large table + selective filter = missing index
Index Scan	Walks an index, then fetches matching rows from the heap	Selective predicate, need columns not in the index	Returning most of the table (random I/O beats seq read only when selective)
Index Only Scan	Reads only the index, never touches the heap	All needed columns are in the index and pages are visible	Heap fetches spike if the visibility map is stale (run `VACUUM`)
Bitmap Index Scan + Bitmap Heap Scan	Builds a bitmap of matching pages, then reads them in physical order	Medium selectivity, or combining multiple indexes	`Rows Removed by Index Recheck` is high, or `lossy` blocks (raise `work_mem`)
Nested Loop	For each outer row, probe the inner side	Small outer set, indexed inner side	Outer side returns thousands of rows — quadratic blowup
Hash Join	Build a hash table from one side, probe with the other	Large unsorted joins, equality conditions	Hash spills to disk (`Batches > 1`) — raise `work_mem`
Merge Join	Sort both sides, merge in order	Both inputs already sorted (e.g., by index)	Forces expensive sorts on large inputs

The thing nobody tells you: a Seq Scan is often the correct choice. If your query returns 60% of a table, an index scan means thousands of random page fetches, and random I/O is slower than reading the table sequentially. The planner knows this and picks the Seq Scan on purpose. Don't "fix" it.

Turning a Seq Scan into an Index Scan

Back to the orders query. 8206 buffer reads to return 11 rows is the textbook case where an index does help, because the filter is selective. Let's add one.

CREATE INDEX CONCURRENTLY idx_orders_customer_status
  ON orders (customer_id, status);

Use CONCURRENTLY on any table that's taking live traffic — it builds the index without holding a lock that blocks writes. It's slower and can't run inside a transaction block, but it won't take your application down. Now the same query:

Index Scan using idx_orders_customer_status on orders
    (cost=0.42..47.36 rows=12 width=84)
    (actual time=0.031..0.067 rows=11 loops=1)
  Index Cond: ((customer_id = 4815) AND (status = 'shipped'::text))
  Buffers: shared hit=14
Execution Time: 0.094 ms

202ms down to 0.094ms. Buffers dropped from 8334 to 14, all hit — no disk reads at all. The filter moved from Filter: (checked after reading each row) to Index Cond: (used to navigate the index). That distinction matters: a condition under Filter: means Postgres read the row and then discarded it; under Index Cond: it never read non-matching rows in the first place. If you see a huge Rows Removed by Filter next to an index scan, your index isn't covering the selective part of the predicate.

Estimated vs actual: when stats lie

The single most useful diagnostic in any plan is the gap between estimated rows and actual rows. A small gap means the planner has good information. A gap of 100x or more means the planner is working from stale or missing statistics and is probably picking a bad plan because of it.

Nested Loop  (cost=0.85..23.4 rows=1 width=120)
             (actual time=0.05..1840.2 rows=48211 loops=1)
  ->  Index Scan using idx_users_plan on users
        (rows=1) (actual rows=48211 loops=1)
  ->  Index Scan using idx_orders_user on orders
        (rows=1) (actual rows=1 loops=48211)

The planner thought the outer scan would return 1 row, so a Nested Loop looked cheap. It returned 48,211, so the inner index scan ran 48,211 times (loops=48211). Each loop is fast, but 48,000 of them is 1.8 seconds. With an accurate estimate the planner would have chosen a Hash Join and scanned orders once.

The fix is almost never to rewrite the query. It's to give the planner correct numbers:

ANALYZE users;

ANALYZE resamples the table and updates pg_statistic. Autovacuum runs it automatically, but after a bulk load, a big migration, or on a table that just changed shape, the autovacuum threshold may not have triggered yet. If a 1-row estimate explodes into tens of thousands of actual rows, run ANALYZE first and re-check the plan before touching anything else.

For correlated columns the planner assumes independence and gets it badly wrong — WHERE city = 'Berlin' AND country = 'Germany' will be underestimated because the planner multiplies two selectivities as if they were unrelated. Extended statistics fix this:

CREATE STATISTICS stx_orders_geo (dependencies)
  ON city, country FROM orders;
ANALYZE orders;

This is in the PostgreSQL documentation under "Multivariate Statistics," and it's one of the most underused features in the database.

Reading a join plan

Joins are where people get lost, mostly because of the per-loop reporting. Here's a Hash Join doing the right thing:

Hash Join  (cost=412.00..9871.55 rows=48000 width=96)
           (actual time=4.1..71.3 rows=48211 loops=1)
  Hash Cond: (orders.user_id = users.id)
  Buffers: shared hit=2204 read=51
  ->  Seq Scan on orders  (actual time=0.01..18.2 rows=120000 loops=1)
  ->  Hash  (actual time=3.9..3.9 rows=4800 loops=1)
        Buckets: 8192  Batches: 1  Memory Usage: 412kB
        ->  Seq Scan on users (actual time=0.01..1.2 rows=4800 loops=1)

The Hash node built a 412kB table from 4,800 users in memory, then the join probed it once per order. Batches: 1 is the number you want — it means the hash table fit in work_mem. The two Seq Scans here are correct: the query touches most of both tables, so sequential reads beat random index lookups.

Now watch what happens when the hash doesn't fit:

->  Hash  (actual time=812.5..812.5 rows=9200000 loops=1)
      Buckets: 65536  Batches: 32  Memory Usage: 4096kB

Batches: 32 means Postgres ran out of work_mem and spilled the hash table to disk across 32 passes. Same thing shows up on sorts:

Sort  (actual time=2104.8..2380.1 rows=2000000 loops=1)
  Sort Key: orders.created_at
  Sort Method: external merge  Disk: 248560kB

Sort Method: external merge Disk: 248560kB is the spill. An in-memory sort would say quicksort Memory: .... The fix is work_mem, which is per sort/hash node, per connection — not global. Set it too high with 200 connections each running 3 sorts and you'll OOM the box. Tune it for the session running the heavy query:

SET work_mem = '256MB';
-- run the query, confirm Sort Method flips to quicksort, then RESET
RESET work_mem;

Default work_mem is 4MB. For an analytics query sorting millions of rows, raising it per-session to 256MB is reasonable; setting it globally to 256MB usually is not.

A method, not a vibe

When a query is slow, work the plan in this order:

Find the expensive node. Multiply actual time × loops for each node and find the largest total. The slowest node is rarely the outermost one — it's usually buried deep with a high loop count. Tools like explain.depesz.com and explain.dalibo.com do this multiplication for you and highlight the culprit; for production, the auto_explain module logs plans of slow queries automatically.
Check estimated vs actual rows on that node. A gap over ~100x means stale stats. Run ANALYZE on the table and re-plan before doing anything else.
Check buffers. High read (vs hit) means cold cache or genuinely too much data. High buffer count for few result rows means a missing or wrong index.
Check for disk spills. Batches > 1 on a hash, or Sort Method: external merge on a sort, means raise work_mem for that session.
Only now consider an index or rewrite. And remember a Seq Scan returning most of a table is correct — don't index it away.

Most "slow query" tickets I've closed came down to step 2: a stale pg_statistic row sending the planner down a Nested Loop that should have been a Hash Join. One ANALYZE and the problem evaporates. The discipline is resisting the urge to add an index before you've read what the database already told you it did.