Database Replication Lag: Causes, Monitoring, and Fixes

You add a read replica to take load off your primary database, and most of the time it works beautifully — reads scale, the primary breathes easier. Then a user updates their profile, immediately reloads the page, and sees the old data. Or you fail over to a replica during an outage and discover it was thirty seconds behind, quietly losing the last half-minute of writes. Both symptoms have the same root cause: replication lag, the delay between a write landing on the primary and that write appearing on a replica.

This guide explains what replication lag is, what causes it, and — most importantly — how to monitor it, because lag you can't see is lag you can't act on.

What Replication Lag Is

In a typical setup, one database server is the primary (it accepts writes) and one or more replicas (also called read replicas, standbys, or secondaries) copy those writes and serve reads. The primary streams its changes — the write-ahead log in PostgreSQL, the binlog in MySQL, the oplog in MongoDB — and each replica applies them in order.

Replication lag is how far behind a replica is, usually measured in seconds (how old the replica's data is) or in bytes (how much un-applied log it has queued). Zero lag means the replica is caught up. Five seconds of lag means a read from that replica reflects the database as it looked five seconds ago.
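To make the two units concrete, here is a minimal sketch that computes lag in seconds from two timestamps and lag in bytes from two log positions. It assumes PostgreSQL-style LSN strings (two hexadecimal halves separated by a slash); the function names are illustrative, not part of any database driver.

```python
from datetime import datetime, timezone


def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL-style LSN such as '16/B374D848' to a byte offset."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of log the replica has not applied yet (never negative)."""
    return max(0, lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_lsn))


def lag_seconds(now: datetime, last_applied_commit: datetime) -> float:
    """Age of the newest change the replica has applied."""
    return max(0.0, (now - last_applied_commit).total_seconds())


now = datetime(2024, 1, 1, 12, 0, 5, tzinfo=timezone.utc)
last_applied = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(lag_seconds(now, last_applied))        # 5.0
print(lag_bytes("0/3000148", "0/3000060"))   # 232
```

One caveat with the timestamp approach: if the primary is idle, the newest applied change keeps getting older even though the replica is fully caught up, so a seconds-based reading is easier to interpret alongside the byte count.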

The catch is that most replication is asynchronous: the primary commits a write and tells the client "done" without waiting for replicas to catch up. That's great for write performance, but it means replicas are always at least a little behind — and under stress, a lot behind.

Why Replication Lag Happens

Lag builds up whenever a replica can't apply changes as fast as the primary produces them. The usual causes:

Write-heavy bursts on the primary. A bulk import, a batch job, or a traffic spike generates more changes than the replica can replay in real time.
A slow or under-powered replica. If the replica has weaker hardware, slower disks, or is busy serving heavy read queries, applying the replication stream competes for the same resources.
Single-threaded apply. Some systems apply the replication stream on one thread even though the primary wrote in parallel, so a write-heavy primary outruns a serial replica.
Long-running queries on the replica that block or delay applying new changes (a known source of lag in PostgreSQL when queries conflict with replay).
Network latency or saturation between primary and replica, especially across regions or availability zones.
Long transactions and lock contention on the primary. A large transaction often reaches the replica as one big, slow-to-apply chunk, and replaying it can hold up everything queued behind it.

The common thread: lag is a symptom of the replica falling behind on work, whether that's CPU, disk I/O, network, or query contention.
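A toy model shows how quickly a backlog forms once the primary produces changes faster than the replica can apply them. This is a simplified sketch, not a model of any particular database: the rates are made-up numbers and the replica is assumed to apply a fixed number of changes per second.

```python
def simulate_backlog(produced_per_sec, apply_capacity_per_sec):
    """Return the replica's un-applied backlog after each second."""
    backlog = 0
    history = []
    for produced in produced_per_sec:
        # The replica drains up to its capacity; the rest queues up.
        backlog = max(0, backlog + produced - apply_capacity_per_sec)
        history.append(backlog)
    return history


# Steady 800 changes/s, a 3-second burst of 2000/s, then steady again.
writes = [800, 800, 2000, 2000, 2000, 800, 800, 800, 800]
print(simulate_backlog(writes, apply_capacity_per_sec=1000))
# [0, 0, 1000, 2000, 3000, 2800, 2600, 2400, 2200]
```

In this example the replica has only 200 changes per second of spare capacity, so a three-second burst takes fifteen seconds to drain. Lag outlasts the burst that caused it.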

Why Lag Matters: Stale Reads and Risky Failover

Replication lag causes two distinct categories of pain:

1. Stale reads. If you route reads to a replica, lag means users can read data that's behind the primary. The classic bug is a broken "read your own writes" guarantee: a user saves a change, the next read goes to a lagging replica, and they see the old value, which looks as if the save failed. For dashboards or analytics a few seconds of staleness is fine; for a user editing their own data, it's a visible bug.

2. Data loss on failover. When the primary dies and you promote a replica (see our guide to database failover and high availability), any writes that hadn't replicated yet are gone. Five seconds of lag at the moment of failure can mean five seconds of committed orders, payments, or signups silently lost. This is why lag isn't just a performance metric — it's a measure of your potential data-loss window, closely tied to your recovery point objective.

High lag also undermines the whole point of replicas: if a replica is too far behind to trust, you can't safely read from it or promote it, and your expensive standby becomes dead weight.
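The data-loss window can be put into rough numbers. The sketch below assumes a constant write rate, which real traffic rarely has, so treat the result as an order-of-magnitude estimate; the function names and figures are illustrative.

```python
def writes_at_risk(lag_seconds: float, writes_per_second: float) -> float:
    """Estimate how many committed writes a failover would lose right now."""
    return lag_seconds * writes_per_second


def meets_rpo(lag_seconds: float, rpo_seconds: float) -> bool:
    """True when current lag fits inside the recovery point objective."""
    return lag_seconds <= rpo_seconds


print(writes_at_risk(5, 40))          # 200
print(meets_rpo(5, rpo_seconds=2))    # False
```

At 40 writes per second, five seconds of lag puts roughly 200 committed writes at risk, and a replica that far behind would not satisfy a two-second recovery point objective.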

How to Monitor Replication Lag

Lag is invisible until you measure it, and by the time users complain it's already a problem. What to track:

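Lag in seconds (and bytes) per replica. PostgreSQL exposes this on the primary through the pg_stat_replication view (the replay_lag column and the LSN columns) and on a replica through pg_last_xact_replay_timestamp(); MySQL through Seconds_Behind_Source (Seconds_Behind_Master in older versions) in the replica status output; MongoDB through the member optimes in rs.status() or the summary from rs.printSecondaryReplicationInfo(). The MongoDB oplog window is a related but different number: it tells you how far behind a secondary can fall before it needs a full resync. Watch each replica, not just an average.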
Alert on thresholds that match your tolerance. A few seconds may be fine; tens of seconds usually isn't. Set the threshold to what your application can actually tolerate, and treat sustained breaches as an incident.
Trend over time, not just the instant value. A slowly climbing lag line is an early warning that a replica is losing the race before it falls minutes behind.
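Replica health and connectivity. A replica whose replication has stopped can report a missing or frozen lag value that a naive check reads as healthy, right before it falls off a cliff. MySQL, for example, reports NULL for Seconds_Behind_Source when the apply thread is not running. Monitor that replication is actually running, not just the number.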
Correlate with primary write volume and replica load, so when lag spikes you can tell whether it's a write burst, a heavy query, or a struggling replica.
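As one way to collect per-replica lag, the sketch below reads pg_stat_replication on a PostgreSQL primary. It assumes PostgreSQL 10 or later, a role allowed to read the view, and any DB-API cursor (psycopg2 would be one option); the function name and the shape of the returned dictionaries are illustrative choices.

```python
# Run on the primary: one row per connected replica.
LAG_QUERY = """
SELECT application_name,
       state,
       EXTRACT(EPOCH FROM replay_lag) AS lag_seconds,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication
"""


def fetch_replica_lag(cursor):
    """Return one dict per replica; lag values may be None if not reported."""
    cursor.execute(LAG_QUERY)
    results = []
    for name, state, lag_s, lag_b in cursor.fetchall():
        results.append({
            "replica": name,
            "state": state,
            "lag_seconds": None if lag_s is None else float(lag_s),
            "lag_bytes": None if lag_b is None else int(lag_b),
        })
    return results
```

A replica that has disconnected does not appear in this view at all, so a monitor should compare the returned names against the replicas it expects and treat a missing one as a problem rather than as zero lag.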

Tie these to alerts that reach a human, but tune the thresholds so routine, harmless lag doesn't cause alert fatigue.
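The checks above can be combined into a small evaluator. This is a sketch under stated assumptions: samples are lag readings in seconds taken at a fixed interval, None stands for "no reading", and the threshold, sustained-sample count, and slope values are placeholders to tune for your own tolerance.

```python
def slope_per_sample(values):
    """Least-squares slope of the readings, in seconds of lag per sample."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def evaluate_replica(samples, threshold_s=10.0, sustained=3, min_slope=0.5):
    """Classify one replica from its recent lag samples (oldest first)."""
    # A missing latest reading means replication may not be running at all.
    if not samples or samples[-1] is None:
        return "replication_not_reporting"
    recent = samples[-sustained:]
    # Alert only on sustained breaches so a single spike does not page anyone.
    if len(recent) == sustained and all(
        s is not None and s > threshold_s for s in recent
    ):
        return "lag_sustained_above_threshold"
    values = [s for s in samples if s is not None]
    # A steadily climbing line is an early warning even below the threshold.
    if len(values) >= 2 and slope_per_sample(values) >= min_slope:
        return "lag_trending_up"
    return "ok"


print(evaluate_replica([1, 1, 2, 1, 1]))
print(evaluate_replica([1, 2, 3, 4, 5]))
print(evaluate_replica([4, 12, 15, 20]))
print(evaluate_replica([1, 1, None]))
```

The four calls classify a flat series, a steady climb, a sustained breach, and a missing reading. Requiring several consecutive breaches is one simple way to keep routine, harmless lag from paging anyone.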

How to Reduce Replication Lag

Once you can see lag, you can attack it:

Reduce write bursts. Throttle or batch bulk imports and large UPDATE/DELETE operations into smaller chunks so replicas can keep up.
Give replicas enough headroom. Match (or exceed) the primary's hardware, and don't let heavy read queries starve the apply process.
Use parallel apply where your database supports it, so a write-heavy primary doesn't outrun a single-threaded replica.
Route reads deliberately. Send latency-sensitive, read-your-own-writes traffic to the primary (or use a "read from primary after write" window) and send only lag-tolerant reads to replicas.
Consider synchronous or semi-synchronous replication for the data you can't afford to lose — the primary waits for at least one replica to confirm, trading a little write latency for a much smaller data-loss window.
Fix the slow queries and long transactions on both primary and replica that generate or block replication work.
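Two of these tactics are easy to sketch in application code: a "read from primary after write" window and chunked bulk changes. The class, function, and parameter names below are hypothetical, and the in-memory dictionary only works within a single process; a multi-server application would need shared state such as a session or cookie.

```python
import time


class ReadRouter:
    """Pin a user's reads to the primary for a short window after they write."""

    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds
        self.clock = clock
        self._last_write = {}  # user_id -> clock reading at the latest write

    def record_write(self, user_id):
        self._last_write[user_id] = self.clock()

    def target_for_read(self, user_id, lag_tolerant=False):
        # Lag-tolerant reads (dashboards, analytics) always go to a replica.
        if lag_tolerant:
            return "replica"
        last = self._last_write.get(user_id)
        if last is not None and self.clock() - last < self.pin_seconds:
            return "primary"
        return "replica"


def run_in_chunks(ids, chunk_size, apply_batch, pause_seconds=0.0, sleep=time.sleep):
    """Apply a bulk change in small batches, pausing so replicas can catch up."""
    batches = 0
    for start in range(0, len(ids), chunk_size):
        apply_batch(ids[start:start + chunk_size])
        batches += 1
        if pause_seconds:
            sleep(pause_seconds)
    return batches


# Demo with a fake clock so the example is deterministic.
fake_now = [100.0]
router = ReadRouter(pin_seconds=5.0, clock=lambda: fake_now[0])
router.record_write("user-1")
print(router.target_for_read("user-1"))   # primary
fake_now[0] += 6.0
print(router.target_for_read("user-1"))   # replica

seen = []
print(run_in_chunks(list(range(2500)), 1000, seen.append))  # 3
```

The pin window should be longer than the lag you normally observe, otherwise reads return to the replica before the write has arrived there. In run_in_chunks, apply_batch stands in for whatever executes one small UPDATE or DELETE.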

How Webalert Helps

Webalert monitors your application from the outside, which complements internal lag metrics in a way neither can manage alone:

Catching the user-visible symptoms — stale data, failed "read your own write" flows, or errors during failover — by checking real endpoints and content, not just whether the database process is alive.
Confirming recovery after failover. When you promote a replica, Webalert verifies your app is actually serving correct responses again, closing the loop on the incident.
Early warning on degradation, so the slow API responses that often accompany a struggling, lagging replica surface before users pile up complaints.
Independent uptime evidence that pairs with your internal database dashboards for a complete picture — inside-out and outside-in.

Webalert won't tune your replication, but it tells you when lag has crossed from a metric into a user-facing problem — and confirms when you've fixed it.

Summary

Replication lag is the delay between a write hitting the primary and appearing on a replica, and because most replication is asynchronous, some lag is always present. It builds up from write bursts, under-powered or busy replicas, single-threaded apply, long queries, and network latency. Two consequences make it matter: stale reads (users seeing old data) and data loss on failover (un-replicated writes vanishing when you promote a replica).

Monitor lag in seconds and bytes per replica, alert on thresholds that match your tolerance, and watch the trend so you catch a replica losing the race early. Reduce it by throttling write bursts, sizing replicas properly, using parallel apply, routing reads deliberately, and using synchronous replication for data you can't lose. Pair internal lag metrics with outside-in monitoring so you know the moment lag becomes a real user problem.


Related Articles

Database Monitoring: How to Monitor MySQL, PostgreSQL, and Redis Uptime

Database Failover and High Availability Explained

Database Connection Pool Exhaustion: Causes and Fixes
