
It's 2 AM. The pager goes off. API p99 is climbing through the roof, dashboards are full of red, and the application logs aren't very helpful — they show queries taking 8 seconds that used to take 80ms. Someone SSHes into the primary, runs SELECT * FROM pg_stat_activity WHERE state = 'active' and sees 40 sessions all blocked on the same row in a table you haven't thought about in six months. An hour later, the actual cause turns out to be that autovacuum hasn't successfully completed on that table in three weeks, the index bloat is at 1.4GB, the planner switched to a sequential scan on a 50M-row table, and every API request is now serialising on the same lock.
PostgreSQL is wonderful, but it has a long tail of failure modes that don't surface until production load hits them. Most of them are measurable — Postgres exposes some of the best built-in observability of any database — but only if you know which views to read, which counters to track, and which thresholds actually matter. The teams that don't have this knowledge spend their incidents in pg_stat_activity. The teams that do have it spot the problem before the pager fires.
This guide is the production-monitoring layer for PostgreSQL: which catalog views matter, how to instrument them, what to alert on, and the patterns that cause the most outages in 2026. It complements the external-uptime layer covered in Database Monitoring: MySQL, PostgreSQL & Redis Uptime — internal metrics, the ones you only see by reading pg_stat_*, are where the slow disasters actually start.
The Catalog Views You Will Live In
Postgres exposes its operational state through pg_stat_* views. Five carry 80% of the diagnostic value:
| View | What it tells you |
|---|---|
pg_stat_statements |
Aggregated query workload — totals, mean, p95 per normalised query |
pg_stat_activity |
Current sessions — what's running right now, what they're waiting on |
pg_stat_database |
Per-database aggregates — commits, rollbacks, deadlocks, conflicts, blocks read |
pg_stat_user_tables |
Per-table activity — sequential scans, index scans, dead tuples, autovacuum activity |
pg_stat_replication |
Primary-side replication state — per-replica lag, WAL flush/write/replay positions |
A monitoring agent that polls these five views every 30-60 seconds, stores deltas in a time-series store, and graphs them gives you the foundation. Everything else is incremental.
Slow Queries — The pg_stat_statements Workflow
The pg_stat_statements extension is the single most important monitoring tool Postgres ships. It records per-normalised-query aggregates: total calls, total time, mean/min/max/stddev time, rows returned, shared buffer hits and reads. Enable it everywhere; it's cheap and the operational payoff is enormous.
Enable in postgresql.conf:
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all
pg_stat_statements.max = 10000
Then CREATE EXTENSION pg_stat_statements; per database. After a restart, you can run:
SELECT
substring(query, 1, 60) AS short_query,
calls,
round(total_exec_time::numeric, 2) AS total_ms,
round(mean_exec_time::numeric, 2) AS mean_ms,
round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 1) AS pct_total
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
The top 20 by total_exec_time is your real workload picture — much more useful than the top 20 by mean_exec_time, because mean catches occasional slow queries but total_exec_time catches queries that are individually fast but called 100 million times a day. Both matter, but the latter is where the capacity bottleneck lives.
Reset periodically (SELECT pg_stat_statements_reset();) so the data window matches your monitoring window. Many teams reset hourly or daily; some reset on deploy so they can compare workload pre-vs-post.
The slow-query log as a complement
log_min_duration_statement writes queries longer than N ms to the Postgres log. Tail-driven:
log_min_duration_statement = 500ms
log_line_prefix = '%m [%p] %q%u@%d/%a '
auto_explain.log_min_duration = '500ms'
auto_explain.log_analyze = on
auto_explain.log_buffers = on
auto_explain is the cherry on top: queries that cross the threshold get an EXPLAIN ANALYZE written to the log automatically. This is how you catch the plan that the query planner regressed to overnight when statistics drifted.
Pipe these into your log monitoring stack (we like ELK / Loki / OpenSearch). Alert on any query > 5s.
EXPLAIN ANALYZE patterns
The plans worth being literate in:
- Seq Scan on a large table where you expected Index Scan — stats out of date, planner statistics need refreshing (
ANALYZE), or the index isn't there at all - Bitmap Heap Scan with Rows Removed by Filter > 50% of scanned — index is too non-selective; consider a partial or composite index
- Nested Loop with rows × rows that explodes — missing join column index
- Sort with
external merge Disk:in the output —work_memtoo low, sort is spilling to disk - Hash Aggregate that became Group Aggregate — same
work_memstory; can be 100× slower
Monitor temp_bytes in pg_stat_database and temp_files. If they're climbing, you have queries spilling to disk that shouldn't be.
pg_stat_activity — The Live View
pg_stat_activity is the "what is happening right now" view. The queries you'll write most often:
What's currently slow
SELECT pid, now() - query_start AS duration, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state != 'idle'
AND now() - query_start > interval '30 seconds'
ORDER BY duration DESC;
What's blocking what (the blocking tree)
SELECT
blocked.pid AS blocked_pid,
blocked.query AS blocked_query,
blocking.pid AS blocking_pid,
blocking.query AS blocking_query,
blocked.wait_event_type,
blocked.wait_event
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE NOT pg_try_advisory_xact_lock(blocked.pid);
The pg_blocking_pids() function (Postgres 9.6+) is the cleanest way to walk the lock tree. Wire this query into a monitoring panel — when the tree has > 3 levels and a session has been blocked for > 60 seconds, you have an incident.
Idle in transaction sessions
SELECT pid, usename, datname, state, query_start, now() - query_start AS idle_time
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND now() - query_start > interval '5 minutes';
A long-running idle in transaction session is a connection holding a transaction open while the application is doing something else — usually a code path that opens a transaction, does some application work, then forgets to commit. These prevent vacuum from cleaning up dead tuples (every dead tuple newer than the oldest open transaction has to stick around). Alert on any session > 5 minutes in idle in transaction.
Replication Lag
Two replication topologies to monitor:
Streaming replication
Per-replica state lives in pg_stat_replication on the primary:
SELECT
application_name,
client_addr,
state,
pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS pending_send_bytes,
pg_wal_lsn_diff(sent_lsn, flush_lsn) AS pending_flush_bytes,
pg_wal_lsn_diff(flush_lsn, replay_lsn) AS pending_replay_bytes,
write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
The three lag columns are time-based (since Postgres 10) and the cleanest to alert on. The byte columns show where in the pipeline the bottleneck is — high pending_send_bytes means network/primary issue; high pending_replay_bytes means replica is CPU-bound applying WAL.
On the replica side, pg_stat_wal_receiver shows what the replica thinks of its connection. A replica reporting healthy state with replay_lag climbing on the primary is a classic "the wal-replay process is single-threaded and CPU-bound" symptom.
Logical replication
pg_stat_subscription on the subscriber + pg_stat_replication on the publisher. The same lag metric applies. Watch for apply_error_count > 0 (Postgres 14+) which indicates a replication conflict that the subscriber paused on.
Alert thresholds
replay_lag > 30sfor 5 minutes — high (your read replicas are returning stale data)replay_lag > 5 minutes— critical (you've crossed most ORM's freshness assumptions)- WAL pending bytes > 16GB — critical (you're approaching
wal_keep_size/ replication slot exhaustion) - A replica missing from
pg_stat_replicationthat was there 60s ago — critical (replication broke)
Autovacuum, Bloat, and the Slow Disaster
Postgres uses MVCC: every UPDATE leaves a dead tuple that autovacuum has to clean up. When autovacuum can't keep up, bloat grows, queries slow down, and eventually the table refuses to be queried efficiently.
Monitoring autovacuum
SELECT
schemaname, relname,
n_live_tup, n_dead_tup,
round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
last_autovacuum, last_autoanalyze,
autovacuum_count, vacuum_count
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 20;
The signal: dead_pct > 20% and last_autovacuum > 1 day ago on a hot table = autovacuum is failing to keep up. Common causes:
autovacuum_vacuum_cost_limittoo low — increase from the default 200 to 1000+ on busy systems- Autovacuum workers all busy — increase
autovacuum_max_workers - A
idle in transactionsession blocking vacuum's snapshot horizon (see above) - A long-running
pg_dumpor analytical query holding xmin
Bloat estimation
The community-maintained bloat-estimation query (search: "ioguix bloat query") is the gold standard for spotting tables and indexes with significant bloat. Schedule it to run weekly and alert when any single table/index exceeds 30% bloat.
Transaction ID wraparound
The catastrophic version: pg_database.datfrozenxid walks toward autovacuum_freeze_max_age (default 200M transactions). If it hits the wrap-around limit, Postgres refuses writes to protect itself. Monitor:
SELECT datname, age(datfrozenxid) AS xid_age,
round(100.0 * age(datfrozenxid) / 2000000000, 2) AS pct_toward_wraparound
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
Alert when pct_toward_wraparound > 50%. The fix is to investigate why autovacuum's anti-wraparound runs aren't completing.
Connection Pool Saturation
Postgres handles concurrency via processes, not threads. Each connection is a forked backend with its own memory footprint (typically 5-20MB). The practical cap on a 16GB server is around 200-400 connections; beyond that you need a connection pooler.
PgBouncer monitoring
PgBouncer is the de-facto pooler. Its admin console exposes:
-- via PgBouncer's admin database
SHOW POOLS;
SHOW CLIENTS;
SHOW SERVERS;
SHOW STATS;
SHOW POOLS gives cl_active, cl_waiting, sv_active, sv_idle per pool. The headline metrics:
cl_waiting > 0for sustained periods = client wait queue formingsv_active == pool_size= pool exhausted; all server connections in useavg_wait_time_usclimbing from microseconds to milliseconds = saturation building
Postgres-side connection metrics
SELECT
state,
count(*) AS connections,
max(now() - state_change) AS oldest_in_state
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state;
Alert when:
count(*) > 80% of max_connections— about to refuse connectionscount(*) WHERE state = 'idle in transaction' > 5— leaking transactionscount(*) WHERE state = 'active' > pool_size(if poolerless) — application overshoot
For applications behind PgBouncer, the saturation usually shows up in PgBouncer's SHOW POOLS first; Postgres-side counts look fine until the pool empties.
Lock Monitoring
The pg_locks view shows every lock in the system. Most are uninteresting. The pattern worth alerting on is the blocking-tree query above plus this aggregate:
SELECT mode, count(*)
FROM pg_locks
WHERE NOT granted
GROUP BY mode
ORDER BY count(*) DESC;
Any row in this output is by definition a lock wait. A handful is normal under load. Tens of AccessExclusiveLock waits means someone is running a DDL during peak (ALTER TABLE, CREATE INDEX without CONCURRENTLY, REINDEX without CONCURRENTLY). This is the kind of thing that nukes an API for 30 seconds and looks mysterious if you're not monitoring it.
Lock timeouts at the session level
Set lock_timeout and statement_timeout in the application or per-connection role. Better to fail fast and surface the problem than to have one query hold the planet:
ALTER ROLE app_user SET statement_timeout = '30s';
ALTER ROLE app_user SET lock_timeout = '5s';
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '5min';
Buffer Cache and IO
Postgres serves most reads from shared_buffers (the in-memory buffer cache, typically 25% of RAM). The hit ratio:
SELECT datname,
round(100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0), 2) AS cache_hit_pct
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1', 'postgres')
ORDER BY blks_hit + blks_read DESC;
In well-tuned systems with adequate RAM, hit ratio is > 99%. When it drops below 95%, either:
- The working set has grown past
shared_buffers(resize) - A new query pattern is doing large sequential scans (cache pollution; investigate
pg_stat_user_tables.seq_tup_read) - A large maintenance operation is in progress (REINDEX, VACUUM FULL, pg_dump)
pg_stat_io (Postgres 16+)
pg_stat_io exposes read/write/extend rates by backend type and IO context. Alert on:
writebacksclimbing — checkpoint pressurefsyncs > 0from a backend type other than checkpointer — synchronous commit pressureevictions / reads > 0.1— buffer cache thrashing
Deadlocks, Conflicts, and Other Counters
pg_stat_database has the under-appreciated counters:
SELECT datname, deadlocks, conflicts, temp_files, temp_bytes
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1');
deadlocks— Postgres detected a deadlock and aborted one of the offending transactions. Should be very rare. A non-zero growth rate is worth investigating; usually a code-ordering bug (two transactions update tables A → B and B → A respectively).conflicts— replica-side cancellations because a long-running query on the replica conflicted with WAL replay. If non-zero, sethot_standby_feedback = onon the replica ormax_standby_streaming_delayhigher.temp_files,temp_bytes— queries spilling sorts/hashes to disk. Sustained growth =work_memtoo low.
Hosted Postgres Caveats
Most teams are on RDS, Aurora, Cloud SQL, Supabase, Neon, or Crunchy. The monitoring story shifts:
AWS RDS / Aurora PostgreSQL
CloudWatch covers OS-level metrics (CPU, IOPS, freeable memory) plus a few Postgres-specific ones (database connections, replica lag). For internal Postgres metrics, install Performance Insights — it surfaces top SQL by wait event, and you can correlate with pg_stat_statements. Aurora has aurora_replica_lag instead of streaming replication metrics.
Google Cloud SQL
Cloud SQL Insights ≈ Performance Insights. Same idea. The query insights surface pg_stat_statements data plus optional EXPLAIN plans for the worst queries.
Supabase / Neon / Crunchy
Supabase and Neon both run Postgres with pg_stat_statements and standard catalog views available; you can run all the queries in this guide. Neon's branching adds an interesting wrinkle — monitor per-branch and per-compute-endpoint separately.
Aurora's quirks
Aurora replaces the storage layer; some standard counters behave differently. pg_stat_io is not as meaningful since the buffer cache is part of the shared storage. Lean on Aurora's Performance Insights for IO-side observability.
What to Alert On
Critical (page)
- Any query > 60s in
pg_stat_activity - Replication
replay_lag > 5 minutes - Connection count > 90% of
max_connections pct_toward_wraparound > 80%- A previously-present replica missing from
pg_stat_replication - Deadlock count growth > 10/hour
- Hit ratio drop > 10pp in 15 minutes (cache wiped or working set blew)
High (notification)
- p99 query time on top-10 queries up 50% week-over-week
- Replication
replay_lag > 30ssustained - Autovacuum has not completed on a hot table for > 24h
- Dead tuple % on any tracked table > 20%
idle in transactionsessions > 5 minutestemp_bytesper hour climbing- Connection count > 70% of
max_connections
Informational
- A new query appears in the top 20 by
total_exec_time - A query's plan changed (compare hashes across days)
- Bloat estimate > 20% on a tracked table
pg_stat_ioevictions climbing
See Alert Fatigue: Notifications That Get Acted On for the broader low-noise alerting principles.
PostgreSQL Monitoring Checklist
-
pg_stat_statementsextension enabled on every database;track = all -
log_min_duration_statement = 500ms+auto_explaincapturing plans - Slow-query log shipped to log aggregation
- Polling agent collects
pg_stat_activity,pg_stat_replication,pg_stat_database,pg_stat_user_tablesevery 30-60s - Blocking-tree query wired into a dashboard panel
-
idle in transactionsession alerting (> 5 min) - Replication lag in seconds and bytes, per-replica
- Autovacuum activity per top-20 hot tables
- Transaction wraparound monitor (% toward
autovacuum_freeze_max_age) - PgBouncer pool saturation metrics if applicable
-
lock_timeout,statement_timeout,idle_in_transaction_session_timeoutset per role - Buffer cache hit ratio per database
-
pg_stat_iocollected on Postgres 16+ - Deadlock, conflict, temp_files, temp_bytes counters tracked
- Vendor tooling (Performance Insights, Cloud SQL Insights) integrated
- External uptime monitoring complements internal — see Database Monitoring foundation
How Webalert Helps With PostgreSQL Monitoring
Webalert provides the external-monitoring layer that complements your in-database telemetry:
- HTTP monitoring — Watch the API endpoints backed by Postgres; when a slow query starts surfacing as user-visible latency, you see it at the edge
- Content validation — Hit an internal
/internal/db-healthendpoint that surfacespg_stat_activitysummary, replication lag, and autovacuum status; alert when those exceed thresholds - Multi-region checks — Reads served from read replicas in multiple regions; multi-region monitoring confirms freshness from the user's vantage point
- Status page — Communicate "we're seeing elevated database latency" without dumping internal metrics
- Multi-channel alerts — Email, SMS, Slack, Discord, Microsoft Teams, webhooks
- 1-minute check intervals — Detect outages within 60 seconds
- 5-minute setup — Add endpoints, set thresholds, done
See features and pricing for details.
Summary
- Postgres exposes excellent observability through
pg_stat_*views. Five views (pg_stat_statements,pg_stat_activity,pg_stat_database,pg_stat_user_tables,pg_stat_replication) carry 80% of the diagnostic value. pg_stat_statementsis the foundation of query monitoring; sort bytotal_exec_timefor capacity bottlenecks and bymean_exec_timefor outliers. Combine withlog_min_duration_statement+auto_explainfor plan-level detail.pg_stat_activity+pg_blocking_pids()is how you walk the lock tree at 2 AM. The blocking-tree query belongs in a dashboard panel.- Replication:
pg_stat_replicationfor streaming,pg_stat_subscriptionfor logical. Watchreplay_lag(time) and the WAL byte gap. - Autovacuum is the silent killer. Track
n_dead_tup,last_autovacuum, and transaction wraparound. Hunt downidle in transactionsessions that block vacuum's snapshot horizon. - Connection pooling via PgBouncer; monitor
cl_waitingandsv_active / pool_size. - Buffer cache hit ratio,
pg_stat_io(PG16+), deadlocks, conflicts, temp_bytes — supporting counters. - Hosted Postgres (RDS, Aurora, Cloud SQL, Supabase, Neon) layers vendor tooling on top — use it, but the catalog views still work.
PostgreSQL incidents are rarely "the database is down". They're "this query is slow", "the lock tree is deep", "autovacuum stopped working last Thursday", "a replica is 8 minutes behind". The monitoring story is about catching these while they're slow, not after they've escalated into an outage. Build the foundation once — pg_stat_statements, the blocking-tree panel, replication-lag alerts, autovacuum tracking — and the next 2 AM page will arrive with the diagnosis already on the dashboard.