PostgreSQL Production Monitoring: Slow Queries & Replication

It's 2 AM. The pager goes off. API p99 is climbing through the roof, dashboards are full of red, and the application logs aren't very helpful — they show queries taking 8 seconds that used to take 80ms. Someone SSHes into the primary, runs SELECT * FROM pg_stat_activity WHERE state = 'active' and sees 40 sessions all blocked on the same row in a table you haven't thought about in six months. An hour later, the actual cause turns out to be that autovacuum hasn't successfully completed on that table in three weeks, the index bloat is at 1.4GB, the planner switched to a sequential scan on a 50M-row table, and every API request is now serialising on the same lock.

PostgreSQL is wonderful, but it has a long tail of failure modes that don't surface until production load hits them. Most of them are measurable — Postgres exposes some of the best built-in observability of any database — but only if you know which views to read, which counters to track, and which thresholds actually matter. The teams that don't have this knowledge spend their incidents in pg_stat_activity. The teams that do have it spot the problem before the pager fires.

This guide is the production-monitoring layer for PostgreSQL: which catalog views matter, how to instrument them, what to alert on, and the patterns that cause the most outages in 2026. It complements the external-uptime layer covered in Database Monitoring: MySQL, PostgreSQL & Redis Uptime — internal metrics, the ones you only see by reading pg_stat_*, are where the slow disasters actually start.

The Catalog Views You Will Live In

Postgres exposes its operational state through pg_stat_* views. Five carry 80% of the diagnostic value:

View	What it tells you
`pg_stat_statements`	Aggregated query workload — totals, mean, p95 per normalised query
`pg_stat_activity`	Current sessions — what's running right now, what they're waiting on
`pg_stat_database`	Per-database aggregates — commits, rollbacks, deadlocks, conflicts, blocks read
`pg_stat_user_tables`	Per-table activity — sequential scans, index scans, dead tuples, autovacuum activity
`pg_stat_replication`	Primary-side replication state — per-replica lag, WAL flush/write/replay positions

A monitoring agent that polls these five views every 30-60 seconds, stores deltas in a time-series store, and graphs them gives you the foundation. Everything else is incremental.

Slow Queries — The pg_stat_statements Workflow

The pg_stat_statements extension is the single most important monitoring tool Postgres ships. It records per-normalised-query aggregates: total calls, total time, mean/min/max/stddev time, rows returned, shared buffer hits and reads. Enable it everywhere; it's cheap and the operational payoff is enormous.

Enable in postgresql.conf:

shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all
pg_stat_statements.max = 10000

Then CREATE EXTENSION pg_stat_statements; per database. After a restart, you can run:

SELECT
    substring(query, 1, 60) AS short_query,
    calls,
    round(total_exec_time::numeric, 2) AS total_ms,
    round(mean_exec_time::numeric, 2) AS mean_ms,
    round((100 * total_exec_time / sum(total_exec_time) OVER ())::numeric, 1) AS pct_total
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

The top 20 by total_exec_time is your real workload picture — much more useful than the top 20 by mean_exec_time, because mean catches occasional slow queries but total_exec_time catches queries that are individually fast but called 100 million times a day. Both matter, but the latter is where the capacity bottleneck lives.

Reset periodically (SELECT pg_stat_statements_reset();) so the data window matches your monitoring window. Many teams reset hourly or daily; some reset on deploy so they can compare workload pre-vs-post.

The slow-query log as a complement

log_min_duration_statement writes queries longer than N ms to the Postgres log. Tail-driven:

log_min_duration_statement = 500ms
log_line_prefix = '%m [%p] %q%u@%d/%a '
auto_explain.log_min_duration = '500ms'
auto_explain.log_analyze = on
auto_explain.log_buffers = on

auto_explain is the cherry on top: queries that cross the threshold get an EXPLAIN ANALYZE written to the log automatically. This is how you catch the plan that the query planner regressed to overnight when statistics drifted.

Pipe these into your log monitoring stack (we like ELK / Loki / OpenSearch). Alert on any query > 5s.

EXPLAIN ANALYZE patterns

The plans worth being literate in:

Seq Scan on a large table where you expected Index Scan — stats out of date, planner statistics need refreshing (ANALYZE), or the index isn't there at all
Bitmap Heap Scan with Rows Removed by Filter > 50% of scanned — index is too non-selective; consider a partial or composite index
Nested Loop with rows × rows that explodes — missing join column index
Sort with external merge Disk: in the output — work_mem too low, sort is spilling to disk
Hash Aggregate that became Group Aggregate — same work_mem story; can be 100× slower

Monitor temp_bytes in pg_stat_database and temp_files. If they're climbing, you have queries spilling to disk that shouldn't be.

pg_stat_activity — The Live View

pg_stat_activity is the "what is happening right now" view. The queries you'll write most often:

What's currently slow

SELECT pid, now() - query_start AS duration, state, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE state != 'idle'
  AND now() - query_start > interval '30 seconds'
ORDER BY duration DESC;

What's blocking what (the blocking tree)

SELECT
    blocked.pid AS blocked_pid,
    blocked.query AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.query AS blocking_query,
    blocked.wait_event_type,
    blocked.wait_event
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE NOT pg_try_advisory_xact_lock(blocked.pid);

The pg_blocking_pids() function (Postgres 9.6+) is the cleanest way to walk the lock tree. Wire this query into a monitoring panel — when the tree has > 3 levels and a session has been blocked for > 60 seconds, you have an incident.

Idle in transaction sessions

SELECT pid, usename, datname, state, query_start, now() - query_start AS idle_time
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - query_start > interval '5 minutes';

A long-running idle in transaction session is a connection holding a transaction open while the application is doing something else — usually a code path that opens a transaction, does some application work, then forgets to commit. These prevent vacuum from cleaning up dead tuples (every dead tuple newer than the oldest open transaction has to stick around). Alert on any session > 5 minutes in idle in transaction.

Replication Lag

Two replication topologies to monitor:

Streaming replication

Per-replica state lives in pg_stat_replication on the primary:

SELECT
    application_name,
    client_addr,
    state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS pending_send_bytes,
    pg_wal_lsn_diff(sent_lsn, flush_lsn) AS pending_flush_bytes,
    pg_wal_lsn_diff(flush_lsn, replay_lsn) AS pending_replay_bytes,
    write_lag, flush_lag, replay_lag
FROM pg_stat_replication;

The three lag columns are time-based (since Postgres 10) and the cleanest to alert on. The byte columns show where in the pipeline the bottleneck is — high pending_send_bytes means network/primary issue; high pending_replay_bytes means replica is CPU-bound applying WAL.

On the replica side, pg_stat_wal_receiver shows what the replica thinks of its connection. A replica reporting healthy state with replay_lag climbing on the primary is a classic "the wal-replay process is single-threaded and CPU-bound" symptom.

Logical replication

pg_stat_subscription on the subscriber + pg_stat_replication on the publisher. The same lag metric applies. Watch for apply_error_count > 0 (Postgres 14+) which indicates a replication conflict that the subscriber paused on.

Alert thresholds

replay_lag > 30s for 5 minutes — high (your read replicas are returning stale data)
replay_lag > 5 minutes — critical (you've crossed most ORM's freshness assumptions)
WAL pending bytes > 16GB — critical (you're approaching wal_keep_size / replication slot exhaustion)
A replica missing from pg_stat_replication that was there 60s ago — critical (replication broke)

Autovacuum, Bloat, and the Slow Disaster

Postgres uses MVCC: every UPDATE leaves a dead tuple that autovacuum has to clean up. When autovacuum can't keep up, bloat grows, queries slow down, and eventually the table refuses to be queried efficiently.

Monitoring autovacuum

SELECT
    schemaname, relname,
    n_live_tup, n_dead_tup,
    round(100.0 * n_dead_tup / NULLIF(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
    last_autovacuum, last_autoanalyze,
    autovacuum_count, vacuum_count
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 20;

The signal: dead_pct > 20% and last_autovacuum > 1 day ago on a hot table = autovacuum is failing to keep up. Common causes:

autovacuum_vacuum_cost_limit too low — increase from the default 200 to 1000+ on busy systems
Autovacuum workers all busy — increase autovacuum_max_workers
A idle in transaction session blocking vacuum's snapshot horizon (see above)
A long-running pg_dump or analytical query holding xmin

Bloat estimation

The community-maintained bloat-estimation query (search: "ioguix bloat query") is the gold standard for spotting tables and indexes with significant bloat. Schedule it to run weekly and alert when any single table/index exceeds 30% bloat.

Transaction ID wraparound

The catastrophic version: pg_database.datfrozenxid walks toward autovacuum_freeze_max_age (default 200M transactions). If it hits the wrap-around limit, Postgres refuses writes to protect itself. Monitor:

SELECT datname, age(datfrozenxid) AS xid_age,
       round(100.0 * age(datfrozenxid) / 2000000000, 2) AS pct_toward_wraparound
FROM pg_database
ORDER BY age(datfrozenxid) DESC;

Alert when pct_toward_wraparound > 50%. The fix is to investigate why autovacuum's anti-wraparound runs aren't completing.

Connection Pool Saturation

Postgres handles concurrency via processes, not threads. Each connection is a forked backend with its own memory footprint (typically 5-20MB). The practical cap on a 16GB server is around 200-400 connections; beyond that you need a connection pooler.

PgBouncer monitoring

PgBouncer is the de-facto pooler. Its admin console exposes:

-- via PgBouncer's admin database
SHOW POOLS;
SHOW CLIENTS;
SHOW SERVERS;
SHOW STATS;

SHOW POOLS gives cl_active, cl_waiting, sv_active, sv_idle per pool. The headline metrics:

cl_waiting > 0 for sustained periods = client wait queue forming
sv_active == pool_size = pool exhausted; all server connections in use
avg_wait_time_us climbing from microseconds to milliseconds = saturation building

Postgres-side connection metrics

SELECT
    state,
    count(*) AS connections,
    max(now() - state_change) AS oldest_in_state
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state;

Alert when:

count(*) > 80% of max_connections — about to refuse connections
count(*) WHERE state = 'idle in transaction' > 5 — leaking transactions
count(*) WHERE state = 'active' > pool_size (if poolerless) — application overshoot

For applications behind PgBouncer, the saturation usually shows up in PgBouncer's SHOW POOLS first; Postgres-side counts look fine until the pool empties.

Lock Monitoring

The pg_locks view shows every lock in the system. Most are uninteresting. The pattern worth alerting on is the blocking-tree query above plus this aggregate:

SELECT mode, count(*)
FROM pg_locks
WHERE NOT granted
GROUP BY mode
ORDER BY count(*) DESC;

Any row in this output is by definition a lock wait. A handful is normal under load. Tens of AccessExclusiveLock waits means someone is running a DDL during peak (ALTER TABLE, CREATE INDEX without CONCURRENTLY, REINDEX without CONCURRENTLY). This is the kind of thing that nukes an API for 30 seconds and looks mysterious if you're not monitoring it.

Lock timeouts at the session level

Set lock_timeout and statement_timeout in the application or per-connection role. Better to fail fast and surface the problem than to have one query hold the planet:

ALTER ROLE app_user SET statement_timeout = '30s';
ALTER ROLE app_user SET lock_timeout = '5s';
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '5min';

Buffer Cache and IO

Postgres serves most reads from shared_buffers (the in-memory buffer cache, typically 25% of RAM). The hit ratio:

SELECT datname,
    round(100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0), 2) AS cache_hit_pct
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1', 'postgres')
ORDER BY blks_hit + blks_read DESC;

In well-tuned systems with adequate RAM, hit ratio is > 99%. When it drops below 95%, either:

The working set has grown past shared_buffers (resize)
A new query pattern is doing large sequential scans (cache pollution; investigate pg_stat_user_tables.seq_tup_read)
A large maintenance operation is in progress (REINDEX, VACUUM FULL, pg_dump)

pg_stat_io (Postgres 16+)

pg_stat_io exposes read/write/extend rates by backend type and IO context. Alert on:

writebacks climbing — checkpoint pressure
fsyncs > 0 from a backend type other than checkpointer — synchronous commit pressure
evictions / reads > 0.1 — buffer cache thrashing

Deadlocks, Conflicts, and Other Counters

pg_stat_database has the under-appreciated counters:

SELECT datname, deadlocks, conflicts, temp_files, temp_bytes
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1');

deadlocks — Postgres detected a deadlock and aborted one of the offending transactions. Should be very rare. A non-zero growth rate is worth investigating; usually a code-ordering bug (two transactions update tables A → B and B → A respectively).
conflicts — replica-side cancellations because a long-running query on the replica conflicted with WAL replay. If non-zero, set hot_standby_feedback = on on the replica or max_standby_streaming_delay higher.
temp_files, temp_bytes — queries spilling sorts/hashes to disk. Sustained growth = work_mem too low.

Hosted Postgres Caveats

Most teams are on RDS, Aurora, Cloud SQL, Supabase, Neon, or Crunchy. The monitoring story shifts:

AWS RDS / Aurora PostgreSQL

CloudWatch covers OS-level metrics (CPU, IOPS, freeable memory) plus a few Postgres-specific ones (database connections, replica lag). For internal Postgres metrics, install Performance Insights — it surfaces top SQL by wait event, and you can correlate with pg_stat_statements. Aurora has aurora_replica_lag instead of streaming replication metrics.

Google Cloud SQL

Cloud SQL Insights ≈ Performance Insights. Same idea. The query insights surface pg_stat_statements data plus optional EXPLAIN plans for the worst queries.

Supabase / Neon / Crunchy

Supabase and Neon both run Postgres with pg_stat_statements and standard catalog views available; you can run all the queries in this guide. Neon's branching adds an interesting wrinkle — monitor per-branch and per-compute-endpoint separately.

Aurora's quirks

Aurora replaces the storage layer; some standard counters behave differently. pg_stat_io is not as meaningful since the buffer cache is part of the shared storage. Lean on Aurora's Performance Insights for IO-side observability.

What to Alert On

Critical (page)

Any query > 60s in pg_stat_activity
Replication replay_lag > 5 minutes
Connection count > 90% of max_connections
pct_toward_wraparound > 80%
A previously-present replica missing from pg_stat_replication
Deadlock count growth > 10/hour
Hit ratio drop > 10pp in 15 minutes (cache wiped or working set blew)

High (notification)

p99 query time on top-10 queries up 50% week-over-week
Replication replay_lag > 30s sustained
Autovacuum has not completed on a hot table for > 24h
Dead tuple % on any tracked table > 20%
idle in transaction sessions > 5 minutes
temp_bytes per hour climbing
Connection count > 70% of max_connections

Informational

A new query appears in the top 20 by total_exec_time
A query's plan changed (compare hashes across days)
Bloat estimate > 20% on a tracked table
pg_stat_io evictions climbing

See Alert Fatigue: Notifications That Get Acted On for the broader low-noise alerting principles.

PostgreSQL Monitoring Checklist

pg_stat_statements extension enabled on every database; track = all
log_min_duration_statement = 500ms + auto_explain capturing plans
Slow-query log shipped to log aggregation
Polling agent collects pg_stat_activity, pg_stat_replication, pg_stat_database, pg_stat_user_tables every 30-60s
Blocking-tree query wired into a dashboard panel
idle in transaction session alerting (> 5 min)
Replication lag in seconds and bytes, per-replica
Autovacuum activity per top-20 hot tables
Transaction wraparound monitor (% toward autovacuum_freeze_max_age)
PgBouncer pool saturation metrics if applicable
lock_timeout, statement_timeout, idle_in_transaction_session_timeout set per role
Buffer cache hit ratio per database
pg_stat_io collected on Postgres 16+
Deadlock, conflict, temp_files, temp_bytes counters tracked
Vendor tooling (Performance Insights, Cloud SQL Insights) integrated
External uptime monitoring complements internal — see Database Monitoring foundation

How Webalert Helps With PostgreSQL Monitoring

Webalert provides the external-monitoring layer that complements your in-database telemetry:

HTTP monitoring — Watch the API endpoints backed by Postgres; when a slow query starts surfacing as user-visible latency, you see it at the edge
Content validation — Hit an internal /internal/db-health endpoint that surfaces pg_stat_activity summary, replication lag, and autovacuum status; alert when those exceed thresholds
Multi-region checks — Reads served from read replicas in multiple regions; multi-region monitoring confirms freshness from the user's vantage point
Status page — Communicate "we're seeing elevated database latency" without dumping internal metrics
Multi-channel alerts — Email, SMS, Slack, Discord, Microsoft Teams, webhooks
1-minute check intervals — Detect outages within 60 seconds
5-minute setup — Add endpoints, set thresholds, done

See features and pricing for details.

Summary

Postgres exposes excellent observability through pg_stat_* views. Five views (pg_stat_statements, pg_stat_activity, pg_stat_database, pg_stat_user_tables, pg_stat_replication) carry 80% of the diagnostic value.
pg_stat_statements is the foundation of query monitoring; sort by total_exec_time for capacity bottlenecks and by mean_exec_time for outliers. Combine with log_min_duration_statement + auto_explain for plan-level detail.
pg_stat_activity + pg_blocking_pids() is how you walk the lock tree at 2 AM. The blocking-tree query belongs in a dashboard panel.
Replication: pg_stat_replication for streaming, pg_stat_subscription for logical. Watch replay_lag (time) and the WAL byte gap.
Autovacuum is the silent killer. Track n_dead_tup, last_autovacuum, and transaction wraparound. Hunt down idle in transaction sessions that block vacuum's snapshot horizon.
Connection pooling via PgBouncer; monitor cl_waiting and sv_active / pool_size.
Buffer cache hit ratio, pg_stat_io (PG16+), deadlocks, conflicts, temp_bytes — supporting counters.
Hosted Postgres (RDS, Aurora, Cloud SQL, Supabase, Neon) layers vendor tooling on top — use it, but the catalog views still work.

PostgreSQL incidents are rarely "the database is down". They're "this query is slow", "the lock tree is deep", "autovacuum stopped working last Thursday", "a replica is 8 minutes behind". The monitoring story is about catching these while they're slow, not after they've escalated into an outage. Build the foundation once — pg_stat_statements, the blocking-tree panel, replication-lag alerts, autovacuum tracking — and the next 2 AM page will arrive with the diagnosis already on the dashboard.

Catch Postgres regressions before they cascade into an outage

Start monitoring with Webalert →

See features and pricing. No credit card required.