Postgres Slow Query Debugging

Q: Can EXPLAIN lie about query performance?

EXPLAIN without ANALYZE gives estimated costs, not real timing. Always use EXPLAIN ANALYZE to measure actual execution time and row counts.

Q: How do I know if I need to VACUUM?

If pg_stat_user_tables shows n_dead_tup much higher than live rows, or autovacuum hasn’t run in days, run VACUUM ANALYZE. Frequent slowdowns after large deletes/updates are another sign.

Q: Does increasing work_mem always help slow queries?

No. For some sorts or aggregations, it helps, but excessive work_mem in parallel queries can exhaust memory and hurt overall performance. Tune cautiously.

Q: Is adding an index always the right solution?

No. Too many indexes slow down writes and bloats storage. Only add indexes based on actual query patterns and their observed frequency.

What this usually means

Slow queries in Postgres usually result from inadequate indexing, poor query plans due to outdated statistics, I/O starvation, or bloated tables. Sometimes, suboptimal SQL statements or misplaced functions (e.g., using LIKE '%foo%' on large text columns) are to blame. Less often, resource starvation (CPU, RAM, or disk) or lock contention causes cascading slowdowns. The trick is isolating the real cause, not just the query shape.

( 01 )Fast diagnosis

The first ten minutes — establish facts before touching code.

1Run SELECT * FROM pg_stat_activity WHERE state = 'active' AND now() - query_start > interval '2 seconds';
2Capture the exact slow statement from the log: grep 'duration:' /var/log/postgresql/postgresql-*.log | tail -20
3Use EXPLAIN (ANALYZE, BUFFERS) <query> to see the actual plan and hotspots
4Check for missing or misused indexes: \\di+ or \\d <table>
5Inspect pg_stat_user_tables for sequential scans: SELECT relname, seq_scan, idx_scan FROM pg_stat_user_tables WHERE seq_scan > 0;
6Verify autovacuum health: SELECT relname, last_vacuum, last_autovacuum FROM pg_stat_user_tables ORDER BY last_autovacuum ASC;

( 02 )Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

search/var/log/postgresql/postgresql-*.log (or your log_directory)
searchpg_stat_activity for current queries
searchpg_stat_statements (if enabled) for historical slow queries
searchEXPLAIN (ANALYZE) output for the target query
searchTable statistics: pg_class, pg_stat_user_tables
searchSchema: \\d+ <table>, to check column types and existing indexes
searchDisk and I/O metrics (e.g., iostat, vmstat, or cloud disk dashboards)

( 03 )Common root causes

Practical causes, not theory. These are the things you will actually find.

warningMissing composite or partial indexes on WHERE/JOIN columns
warningFunctions, casts, or wildcards in WHERE clauses preventing index use
warningOutdated table statistics (ANALYZE not run recently)
warningTable or index bloat causing extra I/O (often after mass deletes/updates)
warningHotspotting/lock contention on a single row or table
warningUnder-provisioned storage (cloud EBS in 'cold' state, insufficient IOPS)
warningInefficient query patterns (unbounded CTEs, unnecessary subqueries)

( 04 )Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

buildCreate the minimal covering or partial index (e.g., CREATE INDEX idx ON users(email) WHERE active = true;)
buildRewrite queries to avoid non-SARGable predicates—move functions/casts out of WHERE
buildRun ANALYZE or VACUUM ANALYZE to refresh statistics
buildRebuild bloated tables: VACUUM FULL <table>; or pg_repack for less downtime
buildTune work_mem and effective_cache_size appropriately (see current settings: SHOW work_mem;)
buildDecompose large, monolithic queries into smaller steps where possible

( 05 )How to verify

A fix you cannot prove is a guess. Close the loop.

verifiedRerun EXPLAIN (ANALYZE) and compare the new cost and actual time
verifiedCheck pg_stat_statements for decreased mean and max execution time
verifiedMonitor application-layer metrics for improved latency
verifiedConfirm index usage in the plan (look for Index Scan, not Seq Scan)
verifiedWatch disk and CPU utilization for normalization
verifiedValidate with representative production traffic loads—not just single test queries

( 06 )Mistakes to avoid

Things that make this bug worse or harder to find.

warningBlindly adding indexes without checking their actual impact or maintenance cost
warningIgnoring autovacuum settings—leaving bloated tables unfixed
warningTesting only with small datasets or unrepresentative parameter values
warningNot enabling pg_stat_statements for query insight
warningRelying on EXPLAIN output without (ANALYZE)—estimated plans often mislead
warningChanging work_mem or shared_buffers globally without understanding workload characteristics

( 07 )War story

Customer Search Endpoint Slows Down After Product Launch

Backend EngineerNode.js (Express), PostgreSQL 13 on AWS RDS, pg-pool

Timeline

10:02PagerDuty alert: /search API latency exceeds 4s, major customer impact.
10:05grep 'duration:' in RDS logs shows repeated 6s+ queries on users table.
10:07pg_stat_activity confirms 7 queries running over 5s, all hitting WHERE LOWER(email) = $1
10:10EXPLAIN (ANALYZE) shows full table scan, 900k rows, no index use.
10:12Realize the query cannot use the (email) index due to LOWER().
10:16Create functional index: CREATE INDEX users_email_lower_idx ON users (LOWER(email));
10:19API latency drops back to baseline (<400ms).

I was on-call when the /search endpoint crawled to a halt minutes after a marketing campaign hit. At first, I suspected downstream service issues, but tracing pointed to a specific query against our users table.

Digging into RDS logs, I saw every search was spending 6-7 seconds on a SELECT WHERE LOWER(email) = $1. I initially thought the email index was broken, but EXPLAIN quickly showed a sequential scan. I realized the LOWER function was preventing index use.

Adding a functional index on LOWER(email) instantly fixed the endpoint for all queries. I’d seen missing indexes before, but misapplied function usage was a subtler miss. Updating our query patterns and review process caught similar problems elsewhere.

Root cause

A function (LOWER) in the WHERE clause blocked use of the existing index, causing full table scans under high load.

The fix

Added a functional index on LOWER(email), allowing the planner to choose an index scan.

The lesson

Always check for function calls in query predicates and use functional indexes when necessary. Review explain plans, not just schema.

( 08 )Interpreting Real-World EXPLAIN Output

Don’t just skim for 'Seq Scan' vs 'Index Scan'. Look at actual time per node, rows returned, and buffers read. If a node returns far more rows than expected, statistics may be stale.

Watch for 'Rows Removed by Filter'—a sign your index isn’t covering all filter clauses, or the planner underestimated selectivity. Analyze misestimates by manually running ANALYZE or updating statistics.

( 09 )Autovacuum, Table Bloat, and Their Hidden Costs

Large, infrequent updates or deletes cause dead tuples. Without vacuum, Postgres must scan through obsolete rows, turning what should be index lookups into I/O heavy operations. Check for high reltuples and pg_stat_user_tables.n_dead_tup.

If autovacuum is lagging (autovacuum_lag metrics, or last_autovacuum timestamp staleness), manually force VACUUM ANALYZE and investigate tuning autovacuum_vacuum_scale_factor and autovacuum_max_workers.

( 10 )Common Query Anti-Patterns and Refactors

Beware of wildcards at the start of LIKE ('%foo'): these always force a seq scan. Where possible, switch to trigram indexes (pg_trgm), or redesign the search.

Avoid casting indexed columns in WHERE clauses. Instead, cast parameter values at the application layer, or use expression indexes.

( 11 )Resource Starvation: Disk, Memory, and Concurrency

Cloud providers may throttle disk IOPS; check with iostat or provider dashboards if buffer read times spike. Slow disks make every sequential scan much worse.

Check for lock contention: SELECT * FROM pg_locks WHERE granted = false;—especially during large migrations or bulk data jobs.

Frequently asked questions

Can EXPLAIN lie about query performance?

EXPLAIN without ANALYZE gives estimated costs, not real timing. Always use EXPLAIN ANALYZE to measure actual execution time and row counts.

How do I know if I need to VACUUM?

If pg_stat_user_tables shows n_dead_tup much higher than live rows, or autovacuum hasn’t run in days, run VACUUM ANALYZE. Frequent slowdowns after large deletes/updates are another sign.

Does increasing work_mem always help slow queries?

No. For some sorts or aggregations, it helps, but excessive work_mem in parallel queries can exhaust memory and hurt overall performance. Tune cautiously.

Is adding an index always the right solution?

No. Too many indexes slow down writes and bloats storage. Only add indexes based on actual query patterns and their observed frequency.

Analyzing and Debugging Slow Queries in Postgres

What this usually means

Frequently asked questions