Postgres Connection Pool Exhausted: Real Debugging Guide

Q: What is the difference between 'idle' and 'idle in transaction' connections?

'Idle' means the connection is open but not currently executing a query or inside a transaction — it's waiting for the next command from the client. 'Idle in transaction' means a `BEGIN` has been issued but no `COMMIT` or `ROLLBACK` yet. The latter holds transactional resources (locks, MVCC snapshots) and is much more dangerous because it can block other operations and will not release until explicitly ended or terminated.

Q: Should I use PgBouncer or built-in connection pooling (e.g., HikariCP)?

Use PgBouncer when you need to multiplex many client connections (e.g., hundreds of application servers) to a limited number of database connections, especially with transaction pooling. Built-in pooling (like HikariCP) is fine for a single application with moderate concurrency. In many architectures, both are used: application pool to PgBouncer (transaction mode) to PostgreSQL. This provides a safety layer and allows connection scaling without changing application code.

Q: How do I find which application or user is causing connection leaks?

Query `pg_stat_activity` with `SELECT application_name, usename, count(*) FROM pg_stat_activity GROUP BY 1,2;` to group by application and user. Also check `client_addr` to see which server is opening many connections. If using PgBouncer, `SHOW CLIENTS;` shows the client IP and database user. Correlate with deployment times to identify recent changes.

Q: What is a safe value for `max_connections`?

The default is 100, but many production setups use 200-500. The limit is memory: each connection uses about 10MB (more if sorting large result sets). So 500 connections use ~5GB of shared buffers and backend memory. Calculate: `max_connections * (work_mem + shared_buffers_per_connection)`. For a machine with 16GB RAM, 400 connections is reasonable if `work_mem` is 4MB. Monitor `SHOW max_connections` and `SHOW work_mem`.

Q: Can connection pool exhaustion cause data corruption?

No, PostgreSQL is crash-safe. If a connection is killed, the transaction rolls back and the database remains consistent. However, application state may be inconsistent if the application assumes the transaction committed. For example, if a user's payment went through but the transaction rolled back, you might have a charge without a record. Use idempotency keys and verify after recovery.

What this usually means

The PostgreSQL server has a hard limit (max_connections, typically 100–500) and the connection pool (either built-in or via PgBouncer) has an equally hard limit. When all slots are occupied, new requests queue or fail. The common cause is not a sudden traffic spike (though that can trigger it) but a leak: transactions that begin but never commit/rollback, ORM connections that aren't returned to the pool, or long-running queries holding connections open. Each open connection consumes ~10MB of RAM, so exhaustion also stresses memory. The pool fills silently until the tipping point, then everything breaks at once.

( 01 )Fast diagnosis

The first ten minutes — establish facts before touching code.

1Run `SELECT count(*) FROM pg_stat_activity;` to see total connections and `SELECT state, count(*) FROM pg_stat_activity GROUP BY state;` to see distribution (idle, active, idle in transaction).
2Check `SHOW max_connections;` and compare with current count. If count >= max_connections, you're at the ceiling.
3If using PgBouncer, run `SHOW POOLS;` on the admin console (psql -h localhost -p 6432 pgbouncer) to see per-database pool usage and wait counts.
4Look at `SELECT * FROM pg_stat_activity WHERE state = 'idle in transaction';` — these are suspects holding connections open while waiting for commit/rollback.
5Check application connection pool metrics: e.g., for Rails/Puma, log `ActiveRecord::Base.connection_pool.stat`; for Java, check HikariCP pool metrics via JMX.
6Run `SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction' AND state_change < now() - interval '5 minutes';` to clear stuck connections (use with care).

( 02 )Where to look

The specific files, logs, configs, and dashboards that usually own this bug.

searchPostgreSQL config: `/etc/postgresql/*/main/postgresql.conf` — check `max_connections` and `superuser_reserved_connections`
searchPgBouncer config: `/etc/pgbouncer/pgbouncer.ini` — check `pool_mode`, `default_pool_size`, `max_client_conn`, `reserve_pool_size`
searchApplication connection pool config: `database.yml` (Rails), `application.properties` (Spring Boot), `pool_size` in Django/ Celery
search`pg_stat_activity` view: real-time snapshot of all backend processes and their state
searchPostgreSQL logs: `pg_log/` directory — look for 'remaining connection slots are reserved' errors
searchApplication logs: grep for 'connection pool exhausted', 'timeout', 'cannot acquire connection'
searchMonitoring dashboards: Datadog, New Relic, or Grafana with metrics for 'postgresql.connections', 'pgbouncer.pools.active_connections'

( 03 )Common root causes

Practical causes, not theory. These are the things you will actually find.

warningApplication does not close database connections after use (e.g., missing `connection.close()` in raw SQL, not using `with` context managers)
warningLong-running idle-in-transaction connections: a transaction starts, runs a query, then waits for user input or application logic before committing — holding the connection
warningConnection pool size set too low relative to application thread/process count (e.g., 25 pool size for 100 Puma threads)
warningPgBouncer in session pooling mode (default) holds connections for the duration of a session, not a transaction — leads to connection starvation under high concurrency
warningORM (Hibernate, ActiveRecord) connection leak due to misconfigured eviction or validation queries
warningBackground job workers (Sidekiq, Celery) hold connections while idle, not releasing them back to the pool between jobs
warningNetwork latency or database slow queries cause connections to be held longer, reducing effective pool capacity

( 04 )Fix patterns

Concrete fix directions. Pick the one that matches your root cause.

buildSwitch PgBouncer to transaction pooling: `pool_mode = transaction` — connections are released after each transaction, allowing many more clients to share fewer DB connections.
buildIncrease `max_connections` in postgresql.conf (but watch memory: each connection uses ~10MB). Also increase `superuser_reserved_connections` (default 3) to allow admin access during exhaustion.
buildSet connection pool timeouts: e.g., `connection_lifetime` in PgBouncer to 30 minutes, `idle_in_transaction_session_timeout` in PostgreSQL to 5 minutes (Postgres 9.6+).
buildFix application code: ensure every `SELECT` or `INSERT` is inside a transaction that commits/rollbacks promptly. Use `try-finally` or context managers to close connections.
buildImplement connection pool validation: e.g., `test_on_borrow=true` in DBCP, `validationQuery=SELECT 1` — this evicts stale connections.
buildAdd a circuit breaker or queue limit in the application to prevent overwhelming the pool (e.g., Hystrix, Bulwark).
buildFor high-throughput services, consider connection pooling with a dedicated pool per service and a shared PgBouncer pool with `max_db_connections` to prevent one tenant from starving others.

( 05 )How to verify

A fix you cannot prove is a guess. Close the loop.

verifiedAfter fix, use `SELECT count(*) FROM pg_stat_activity;` to confirm connections are below `max_connections` — target 80% utilization max.
verifiedRun load test with expected peak concurrency: connections should stay stable, not grow unbounded.
verifiedCheck application logs for 'connection pool exhausted' or 'timeout' errors — they should be zero after fix.
verifiedIf using PgBouncer, run `SHOW STATS;` and verify `avg_wait_time` is low (<1ms) and `total_wait_time` is not growing.
verifiedMonitor `idle_in_transaction` count over 24h — should rarely exceed 0-2.
verifiedVerify that killing idle-in-transaction connections via `pg_terminate_backend` is no longer needed.

( 06 )Mistakes to avoid

Things that make this bug worse or harder to find.

warningJumping to increase `max_connections` without investigating the leak — this masks the problem and can cause OOM.
warningSetting PgBouncer `pool_mode = transaction` without also setting `idle_in_transaction_session_timeout` — transactions that never end will still leak.
warningAdding more application threads/processes without increasing pool size accordingly — makes pool exhaustion worse.
warningUsing connection pooling without proper validation or eviction — stale connections accumulate and reduce effective pool size.
warningAssuming a connection pool library (like HikariCP) is leak-proof — application code can still fail to return connections under exception paths.
warningNot monitoring `pg_stat_activity` proactively — only checking after an incident means you missed early warning signs.

( 07 )War story

Sidekiq Workers Held Connections Until the Pool Drowned

Senior Backend EngineerRuby on Rails 5.2, PostgreSQL 11, Sidekiq 5, PgBouncer 1.12, Heroku Standard-2x dynos

Timeline

09:15Deploy new background job that queries a large table without a transaction block
09:30PagerDuty alert: 'PostgreSQL connection pool exhausted' from Datadog
09:33Run `SELECT count(*) FROM pg_stat_activity;` — 200 connections active, max_connections=200
09:35Check state distribution: 150 connections are 'idle in transaction'
09:38Run `SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction';` — frees 150 connections
09:40Application recovers, but pool fills again within 2 minutes
09:45Examine Sidekiq worker code: job starts with `ActiveRecord::Base.transaction` but fails to commit on error (missing `ensure` block)
09:50Hotfix: add `rescue` and `ensure` to close transaction; redeploy
10:00Connections stabilize at 40-50 active, no further alerts

We pushed a new Sidekiq job that processed a batch of user records. The job opened a database transaction, iterated over records, and updated each one. The developer forgot to wrap the logic in a `begin-rescue-ensure` block, so when an exception occurred (a validation error), the transaction was left open — never committed or rolled back. That held the database connection in 'idle in transaction' state until the worker process was killed or the connection timed out (which was 30 minutes by default). With 50 Sidekiq workers each processing multiple jobs, we quickly consumed all 200 connections.

The first symptom was a Datadog alert: 'Connection pool exhausted.' Our PgBouncer had a default pool size of 200, matching `max_connections`. I ran `SELECT count(*) FROM pg_stat_activity` and saw exactly 200 connections. The breakdown showed 150 were 'idle in transaction' — a clear red flag. I immediately terminated those backends (`SELECT pg_terminate_backend(...)`) to restore service, but within minutes the pool filled again. That told me the root cause was still running.

I examined the new job's code. The transaction was opened with `ActiveRecord::Base.transaction do` but there was no `rescue` around the `update!` calls. When a record failed validation, the exception propagated out of the block without committing or rolling back. The fix was simple: add an `ensure` block to call `raise ActiveRecord::Rollback` if needed, or better, use a proper transaction block with error handling. After redeploying, connections dropped to 40-50 active, and we set `idle_in_transaction_session_timeout` to 5 minutes as a safety net.

Root cause

Missing error handling in a Sidekiq worker left database transactions open indefinitely, causing idle-in-transaction connections to accumulate and exhaust the pool.

The fix

Added a `rescue` block to complete the transaction (commit or rollback) and set `idle_in_transaction_session_timeout = 5min` in PostgreSQL config.

The lesson

Always ensure database transactions are closed in exception paths. Use `idle_in_transaction_session_timeout` as a safety net. Monitor `pg_stat_activity` state distribution proactively.

( 08 )Understanding PostgreSQL Connection States

Every connection to PostgreSQL has a state displayed in `pg_stat_activity.state`. The possible values are: `active` (currently running a query), `idle` (waiting for a new query from client), `idle in transaction` (inside a `BEGIN` but not yet committed/rolled back), `fastpath function call`, and `disabled`. The critical one for pool exhaustion is `idle in transaction`.

When a connection is `idle in transaction`, it holds a transaction slot and all locks acquired within that transaction. This prevents vacuum from cleaning up dead tuples and blocks other connections from acquiring certain locks. More importantly, it holds the physical database connection, which is a finite resource. Even if the application's connection pool returns the connection to the pool, the underlying PostgreSQL backend remains until the transaction ends. So the pool sees the connection as available, but PostgreSQL sees it as busy — leading to confusion and eventual exhaustion.

( 09 )PgBouncer Pooling Modes: Session vs Transaction

PgBouncer has three pooling modes: session, transaction, and statement. The default is session pooling, where a client connection is assigned a PostgreSQL server connection for the entire session (until the client disconnects). This is fine for low concurrency but wasteful for web applications with many short-lived connections. In session pooling, a connection that is idle (not in a transaction) still holds a server slot.

Transaction pooling assigns a server connection only for the duration of a single transaction. When the transaction ends, the server connection is returned to the pool. This allows many client connections to share a small number of server connections. However, it requires that clients not rely on session-level features (e.g., temporary tables, `SET` statements, `LISTEN/NOTIFY`). Switching to transaction pooling often resolves pool exhaustion immediately, but you must audit application code for session-level dependencies.

( 10 )Killing Connections: When and How

When pool exhaustion hits, the fastest recovery is to kill idle-in-transaction or long-running connections. Use `SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction' AND state_change < now() - interval '5 minutes';`. This sends SIGTERM to the backend, causing the transaction to roll back and freeing the connection. Be careful: terminating a connection that's in the middle of a write operation may cause partial writes or application errors.

For a more surgical approach, use `pg_cancel_backend(pid)` to cancel the current query (if any) without disconnecting the session. This is safer than terminate but may still leave the connection idle. In practice, during an incident, terminating idle-in-transaction connections is the quickest way to restore service. After the incident, prevent recurrence with timeouts: `idle_in_transaction_session_timeout` (Postgres 9.6+) and `statement_timeout`.

( 11 )Connection Pool Configuration Anti-Patterns

A common mistake is setting the application-level connection pool size equal to the database `max_connections`. For example, if you have 10 application servers each with a pool of 20 connections, that's 200 connections total, which may match `max_connections`. But if one server's pool is fully utilized, it consumes 20 connections, leaving only 180 for others. Under uneven load, one server can starve others. Instead, set the pool size per server to `max_connections / (number of servers) - buffer`.

Another anti-pattern: using connection pools without validation. Connections can become stale (e.g., after a database restart or network drop). Without validation, the pool may return a broken connection, leading to errors and hanging threads that never release the connection. Always configure `test_on_borrow` or `validation_query` (e.g., `SELECT 1`) with a reasonable timeout.

( 12 )Monitoring and Alerting for Connection Exhaustion

Don't wait for total exhaustion. Set up alerts on connection utilization percentage. For PostgreSQL, monitor `pg_stat_activity` count vs `max_connections`. Alert at 80% utilization. For PgBouncer, monitor `SHOW POOLS` columns: `cl_active`, `cl_waiting`, `sv_active`, `sv_idle`, `sv_used`. A growing `cl_waiting` count indicates clients queuing for connections — alert on that.

Also track `idle_in_transaction` count. A sudden spike often precedes exhaustion. Tools like Datadog, New Relic, or pgDash can visualize this. Set a separate alert if `idle_in_transaction` exceeds 10% of total connections for more than 5 minutes.

Frequently asked questions

What is the difference between 'idle' and 'idle in transaction' connections?

'Idle' means the connection is open but not currently executing a query or inside a transaction — it's waiting for the next command from the client. 'Idle in transaction' means a `BEGIN` has been issued but no `COMMIT` or `ROLLBACK` yet. The latter holds transactional resources (locks, MVCC snapshots) and is much more dangerous because it can block other operations and will not release until explicitly ended or terminated.

Should I use PgBouncer or built-in connection pooling (e.g., HikariCP)?

Use PgBouncer when you need to multiplex many client connections (e.g., hundreds of application servers) to a limited number of database connections, especially with transaction pooling. Built-in pooling (like HikariCP) is fine for a single application with moderate concurrency. In many architectures, both are used: application pool to PgBouncer (transaction mode) to PostgreSQL. This provides a safety layer and allows connection scaling without changing application code.

How do I find which application or user is causing connection leaks?

Query `pg_stat_activity` with `SELECT application_name, usename, count(*) FROM pg_stat_activity GROUP BY 1,2;` to group by application and user. Also check `client_addr` to see which server is opening many connections. If using PgBouncer, `SHOW CLIENTS;` shows the client IP and database user. Correlate with deployment times to identify recent changes.

What is a safe value for `max_connections`?

The default is 100, but many production setups use 200-500. The limit is memory: each connection uses about 10MB (more if sorting large result sets). So 500 connections use ~5GB of shared buffers and backend memory. Calculate: `max_connections * (work_mem + shared_buffers_per_connection)`. For a machine with 16GB RAM, 400 connections is reasonable if `work_mem` is 4MB. Monitor `SHOW max_connections` and `SHOW work_mem`.

Can connection pool exhaustion cause data corruption?

No, PostgreSQL is crash-safe. If a connection is killed, the transaction rolls back and the database remains consistent. However, application state may be inconsistent if the application assumes the transaction committed. For example, if a user's payment went through but the transaction rolled back, you might have a charge without a record. Use idempotency keys and verify after recovery.

PostgreSQL Connection Pool Exhaustion: Debugging When All Connections Are Taken

What this usually means

Frequently asked questions