
A bad schema migration is one of the most reliable ways to take a production database down. A CREATE INDEX that locks the table for ten minutes. An ALTER TABLE that rewrites every row and blocks writes for an hour. A column rename deployed a moment before the code that reads the new name — so every request errors. A migration that runs fine in staging with 1,000 rows and locks the production table for half an hour with 50 million. Schema migrations are necessary, but they're also the change most likely to cause an unexpected outage, because they touch the one component everything depends on, under constraints that vary wildly between staging and production.
This guide explains what makes a migration safe or unsafe, how to run zero-downtime migrations, and how to monitor them so a bad DDL never takes you down.
Why Migrations Cause Outages
The database is the component with the most locking, the most state, and the most sensitivity to scale. Migrations break it in a few recurring ways:
- Locks that block writes. Many DDL operations take an exclusive lock on the table while they run. On a small table that's milliseconds; on a large table it can be minutes or hours — during which every
INSERT,UPDATE, andDELETEqueues up, connections saturate, and the app appears down. - Table rewrites. Some
ALTER TABLEoperations rebuild the entire table copy-by-copy. On a 100 GB table that's a long outage unless you use online DDL. - Schema/code mismatches. The migration and the code have to deploy in sync. Deploy the migration first and the old code errors on the new schema; deploy the code first and it errors on the old schema. The classic failure is renaming a column and deploying the rename before the code that uses the new name.
- Staging/prod scale mismatch. A migration that takes 200 ms on staging's 1,000-row table can take 30 minutes on production's 50 million rows. The lock duration isn't even proportional — it can be dramatically worse because of cache behavior, disk pressure, and lock contention.
- Long-running migrations blocking replicas. A migration that takes a long replication lag to ship to replicas can leave you with stale reads or — worse — a failover that loses the migration entirely.
The common thread: a migration is a change to the most shared, most locked, most stateful component, and that combination is what produces outages.
What Makes DDL Safe or Unsafe
Not all DDL is risky. The key distinction is whether the operation locks or rewrites:
- Safe (online, no long lock):
CREATE INDEX CONCURRENTLY(PostgreSQL) /CREATE INDEX ... ALGORITHM=INPLACE(MySQL 8+), adding a nullable column with a default that the database can apply instantly, mostCREATE TABLE/DROP TABLEfor unused tables, adding a check constraint that's already satisfied (NOT VALID+VALIDATEin PostgreSQL). - Unsafe (long lock or full rewrite):
ALTER TABLE ... ADD COLUMN ... NOT NULLwith a default on older MySQL, renaming a column, changing a column type, dropping a column while code still reads it,CREATE INDEXwithoutCONCURRENTLY, mostALTER TABLEoperations that change the physical row layout.
The rule of thumb: if the DDL has to rewrite every row or take an exclusive lock to do its job, it's unsafe to run directly on a large production table. Online DDL tools (gh-ost, pt-online-schema-change, pg-osc, or native online DDL) exist precisely to make unsafe operations safe by copying the table in the background and swapping at the end.
The Pattern That Makes Migrations Safe: Expand and Contract
The reliable way to deploy schema changes without downtime is expand-and-contract (also called parallel-change). Instead of changing the schema in one step, you split it into phases that are each individually safe:
- Expand: Add the new structure without removing the old. Add the new column (nullable, no default that requires a rewrite). Create the new index concurrently. Both old and new schema work.
- Migrate: Backfill the new structure in the background (copy old values to the new column, populate the new table). Deploy code that writes to both old and new. This phase can take hours or days for large tables — that's fine, because nothing is locked.
- Switch: Deploy code that reads from the new structure. Verify it works.
- Contract: Remove the old structure — drop the old column, drop the old index. By now nothing uses it.
Each step is independently safe and reversible. The total cost is more steps and more code, but the payoff is that no single step can lock the database or break the app. This is the only reliable way to do non-trivial schema changes on a live production database.
How to Run Migrations Safely
Beyond expand-and-contract, a few operational disciplines:
- Always use
CONCURRENTLYfor indexes on Postgres, andALGORITHM=INPLACE/ online DDL on MySQL 8+. Never let a defaultCREATE INDEXrun on a large table — it will lock. - Use online schema-change tools for unsafe alters on large tables —
gh-ost/pt-online-schema-changefor MySQL,pg-oscorpg_repackfor PostgreSQL. - Run migrations during low-traffic windows, not because the migration is dangerous, but because if it goes wrong the blast radius is smaller.
- Set lock timeout and statement timeout on the migration session. A
SET lock_timeout = '3s'makes a migration that can't acquire its lock fail fast instead of hanging the app. Better a failed migration you retry than a stuck one that holds locks for an hour. - Test on a production-sized dataset, not just staging. A copy of production (scrubbed) is the only way to know how long the migration will actually take.
- Never deploy code and schema in the same step. Decouple them so you can roll back the code without rolling back the schema, and vice versa.
- Have a rollback plan for every migration. Some are reversible (
DROP INDEX); some aren't (DROP COLUMN). For irreversible ones, the rollback is "deploy forward" — make sure you can do that quickly.
How to Monitor Migrations
A migration in progress is a moment of elevated risk. What to watch:
- Lock waits during the migration. A spike in lock waits means the migration is blocking the app.
lock_timeoutshould fail it before this gets bad, but monitor so you know. - Replication lag during and after. A long-running DDL ships to replicas as a single (or few) operations, but the apply work is large. Watch replication lag spike and recover.
- Query latency on the affected table. Even "online" DDL can slow concurrent queries. Watch p95/p99 latency on the table's queries during the migration.
- Migration job completion. For background backfills, monitor the backfill job's progress and completion — a stalled backfill is a migration that never finishes.
- User-visible errors and latency from the outside. The real question is whether the migration is affecting users. Only outside-in monitoring answers that.
Treat a migration as an incident-in-waiting: monitor actively while it runs, and confirm recovery when it's done.
How Webalert Helps
Migrations are an internal database operation, but their symptoms are user-visible — and outside-in monitoring catches them:
- Outside-in latency and error monitoring that catches the slow responses and 5xx errors a locking migration causes — so you find out in seconds that your
CREATE INDEXis locking the users table. - Database and dependency monitoring that catches the connection-pool exhaustion, replication lag, and lock waits that accompany a bad migration.
- Confirmation of recovery — once the migration finishes or you kill it, monitoring verifies real requests succeed on time again.
- Independent uptime evidence — if a migration goes wrong, Webalert confirms whether the user-visible impact has recovered once you've rolled back or completed the change.
Webalert won't run your migration, but it tells you the moment a DDL change has crossed from a database operation into a user-facing problem — and confirms when it's over.
Summary
Schema migrations cause outages because they change the most shared, most locked, most stateful component under constraints that vary wildly between staging and production. The recurring failures are locks that block writes, full table rewrites, schema/code mismatches, staging/prod scale mismatches, and migrations that lag to replicas. The safety distinction is whether the DDL locks or rewrites — CONCURRENTLY / online DDL / nullable columns are safe; default CREATE INDEX, column renames, type changes, and column drops while code reads them are unsafe.
The reliable pattern is expand-and-contract: add the new structure without removing the old, backfill in the background, switch the code, then contract by removing the old. Run migrations with CONCURRENTLY / online DDL tools, lock and statement timeouts, production-sized test data, decoupled code and schema deploys, and a rollback plan. Monitor lock waits, replication lag, query latency, and migration completion during the change — and pair internal database metrics with outside-in monitoring so a bad migration never silently degrades your product.