
A single database server is a single point of failure. The disk fills, the host crashes, the availability zone goes dark, and if everything depends on that one server, your whole application goes down with it. High availability is the discipline of making sure one failure doesn't become an outage, and failover is the mechanism that makes it happen: when the primary database dies, a standby takes over. It sounds simple, but failover is where a lot of "highly available" systems discover that their HA was never actually tested, and that it fails exactly when it's needed.
This guide explains how database failover and high availability work, the traps (split-brain, quorum, lost writes), and how to verify your setup actually recovers.
What High Availability and Failover Mean
High availability (HA) means a system is designed to keep running despite the failure of individual components — typically by having redundancy so there's no single point of failure. For databases, that usually means running a primary plus one or more standbys (replicas) that can take over.
Failover is the act of promoting a standby to become the new primary when the old one fails. After failover, the promoted standby accepts writes and the application reconnects to it. The goal is to minimize two things:
- Downtime: how long the database is unavailable during the switch (measured against your recovery time objective, or RTO).
- Data loss: how many recent writes didn't make it to the standby before the primary died (measured against your recovery point objective, or RPO, and tied directly to replication lag).
HA reduces both. It does not eliminate them — a failover is a disruption, however brief, and asynchronous replication means some data loss is possible. Anyone who promises zero of both is overselling.
Automatic vs. Manual Failover
There are two ways failover happens:
- Manual failover. A human detects the failure and promotes a standby. Slower (minutes, and only as fast as your on-call response), but safe — a person confirms the primary is really dead before promoting, avoiding false alarms.
- Automatic failover. A monitoring/orchestration layer (Patroni, Orchestrator, a managed service like RDS Multi-AZ, etc.) detects the failure and promotes a standby with no human in the loop. Fast (seconds to a couple of minutes), but only as good as its failure detection — and that's where the danger lies.
The hard problem in automatic failover is deciding whether the primary is truly down. A network blip between the orchestrator and the primary looks identical to a dead primary. Promote too eagerly and you risk split-brain; promote too cautiously and you extend the outage. This tension is the heart of HA design.
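One common way to handle this tension is to require several consecutive failed health checks before declaring the primary dead. The sketch below is a minimal illustration of that idea, not how any particular orchestrator works; the `FailureDetector` class and its `threshold` parameter are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class FailureDetector:
    """Declares the primary down only after several consecutive missed probes."""

    threshold: int = 3            # hypothetical: misses required before failover
    consecutive_misses: int = 0

    def record(self, probe_ok: bool) -> bool:
        """Record one health-check result; return True when failover should start."""
        if probe_ok:
            # Any successful probe resets the count, so a brief blip is forgiven.
            self.consecutive_misses = 0
        else:
            self.consecutive_misses += 1
        return self.consecutive_misses >= self.threshold


detector = FailureDetector(threshold=3)
# A short blip (two misses, then a success) never reaches the threshold.
blip_results = [detector.record(ok) for ok in [False, False, True]]
```

The trade-off is visible in the parameter: with a probe every few seconds, a higher threshold makes a false promotion less likely but adds roughly threshold times the probe interval to every real outage.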
The Big Traps: Split-Brain, Quorum, and Lost Writes
Three failure modes turn a "highly available" setup into an incident:
Split-brain. If the old primary isn't really dead — just temporarily unreachable — and a standby gets promoted anyway, you now have two primaries both accepting writes. They diverge, and reconciling the conflicting data afterward is painful or impossible. Split-brain is the nightmare scenario of naive automatic failover.
Quorum and fencing. The defense against split-brain is quorum: failover decisions require a majority of nodes to agree the primary is gone, so a minority partition can't promote a new primary on its own. Fencing (or STONITH, "shoot the other node in the head") forcibly ensures the old primary can't keep accepting writes after it's replaced. This is why HA clusters usually run an odd number of nodes (3, 5): a network partition can't split an odd cluster into two equal halves that both lack a majority, and a fourth node tolerates no more failures than three do.
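The majority rule is simple arithmetic, and writing it out shows why odd cluster sizes are preferred. This is a minimal sketch with hypothetical function names, not the voting logic of any specific cluster manager.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """True if this partition may make failover decisions."""
    return reachable_nodes >= majority(cluster_size)


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can be lost while the rest still hold a majority."""
    return cluster_size - majority(cluster_size)


# Compare fault tolerance across cluster sizes 3, 4, and 5.
tolerance_by_size = {n: tolerated_failures(n) for n in (3, 4, 5)}
```

Under these definitions a four-node cluster tolerates one failure, the same as a three-node cluster, and an even two-versus-two partition leaves neither side with quorum.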
Lost writes. With asynchronous replication, writes committed on the primary but not yet replicated are lost when you promote a standby. The bigger your replication lag at the moment of failure, the bigger the data-loss window — which is why monitoring lag is a core part of HA, not a separate concern.
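To make the data-loss window concrete, the sketch below estimates how many bytes of write-ahead log a standby is missing. It assumes PostgreSQL-style log sequence numbers in their textual form (two hexadecimal numbers separated by a slash); other databases expose replication position differently, and the function names here are hypothetical.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the high 32 bits, the part after is the low 32 bits.
    return (int(high, 16) << 32) + int(low, 16)


def unreplicated_bytes(primary_lsn: str, standby_receive_lsn: str) -> int:
    """Bytes of log written on the primary that the standby has not yet received."""
    return max(0, parse_lsn(primary_lsn) - parse_lsn(standby_receive_lsn))


# Example: the standby is 0x60 bytes behind the primary.
gap = unreplicated_bytes("0/3000060", "0/3000000")
```

If the primary died at this instant, whatever transactions live in that gap would be the writes lost on promotion, which is why the gap is worth alerting on rather than merely graphing.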
A fourth, quieter trap: the application doesn't reconnect. The database fails over perfectly, but connection pools cling to dead connections or DNS caches point at the old host, so the app stays down even though the database recovered. Failover isn't done until clients are talking to the new primary.
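On the client side, one defensive pattern is to open a fresh connection on each retry instead of reusing a pooled one that may still point at the dead primary. The sketch below assumes a hypothetical `connect` callable that resolves the current primary each time it is called, and it catches Python's built-in `ConnectionError`; a real driver raises its own exception types, which you would substitute.

```python
import time
from typing import Callable, Optional, TypeVar

C = TypeVar("C")
R = TypeVar("R")


def call_with_reconnect(
    connect: Callable[[], C],
    operation: Callable[[C], R],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> R:
    """Run operation on a fresh connection, retrying with exponential backoff."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            conn = connect()              # never reuse a possibly dead connection
            return operation(conn)
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Back off: 0.5s, 1s, 2s, ... so the new primary has time to come up.
                sleep(base_delay * 2 ** attempt)
    raise ConnectionError(f"gave up after {attempts} attempts") from last_error
```

The `sleep` parameter exists only so the backoff can be replaced in tests. The total retry budget should be longer than your expected failover time, otherwise clients give up just before the new primary is ready.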
How to Monitor That Failover Actually Works
HA that's never tested is just a hope. What to monitor and verify:
- Replication health and lag on every standby, so you always know a promotable, reasonably-current replica exists. A standby that silently stopped replicating is a failover that will lose data or fail outright.
- Cluster and quorum state. Check that the expected number of nodes is healthy and that the cluster has quorum; a degraded cluster may not be able to fail over at all.
- Failover events themselves. Alert when a promotion happens, both so humans know and so you can review whether it was correct.
- End-to-end recovery, from the outside. The real question isn't "did the database promote a standby" — it's "is the application serving correct responses again." Only outside-in monitoring answers that.
- Practice failovers regularly (game days / chaos drills). A failover you've rehearsed is routine; one you've never tested is a coin flip during your worst hour.
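The outside-in measurement in the list above can be reduced to a small calculation over probe results. The sketch below assumes you already collect time-ordered `(timestamp_seconds, probe_ok)` samples from an external check; the sample format and the function name are hypothetical.

```python
from typing import Iterable, List, Tuple


def outage_windows(samples: Iterable[Tuple[float, bool]]) -> List[Tuple[float, float]]:
    """Return (start, end) for each completed outage in time-ordered probe samples."""
    windows: List[Tuple[float, float]] = []
    outage_start = None
    for timestamp, ok in samples:
        if not ok and outage_start is None:
            outage_start = timestamp          # first failed probe opens a window
        elif ok and outage_start is not None:
            windows.append((outage_start, timestamp))  # first success closes it
            outage_start = None
    # An outage still in progress at the end of the samples is not reported.
    return windows


# Probes every 10 seconds around a rehearsed failover.
probes = [(0, True), (10, False), (20, False), (30, True), (40, True)]
windows = outage_windows(probes)
downtime_seconds = sum(end - start for start, end in windows)
```

Comparing this user-visible downtime with the promotion time reported by the cluster shows how much of the outage was spent on client reconnection rather than on the failover itself.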
How Webalert Helps
Internal cluster tooling knows the database's view of failover; Webalert provides the independent, outside-in view that tells you whether users actually recovered:
- End-to-end recovery confirmation. When a failover fires, Webalert verifies your real endpoints are serving correct responses again — catching the case where the database recovered but the app never reconnected.
- Independent outage detection that doesn't depend on the same infrastructure that's failing, so a cluster-wide problem can't blind your monitoring at the same time.
- Downtime measurement for the failover window, giving you real data on your availability and SLA instead of assumptions.
- Fast alerting so even a "few seconds" automatic failover that goes wrong reaches a human immediately.
Webalert won't run your failover, but it's the impartial witness that confirms your HA actually delivered availability — to users, not just on paper.
Summary
High availability means designing a database so one failure doesn't cause an outage, and failover, promoting a standby when the primary dies, is the mechanism that delivers it. Failover aims to minimize downtime (RTO) and data loss (RPO), but eliminates neither: it's a disruption, and asynchronous replication means recent writes can be lost. Automatic failover is fast but risks split-brain if it mistakes a network blip for a dead primary; quorum, fencing, and odd node counts are the defenses.
The traps that turn HA into an incident are split-brain (two primaries diverging), lost writes from replication lag, and applications that never reconnect to the new primary. Monitor replication health on every standby, cluster quorum state, and failover events — but above all, verify end-to-end from the outside that users actually recovered, and rehearse failovers before you need them for real.