
A Black Friday sale starts at midnight. By 12:04 your checkout p99 has gone from 180ms to 12 seconds. The application logs are a wall of MySQL deadlocks. The DBA tries SHOW REPLICA STATUS and discovers the read replica is 4 minutes behind because the source's binlog rate is 80MB/s and the replica's SQL thread is single-threaded. By 12:30 you've taken the read replicas out of rotation and pushed everything to the primary, which now has 4,000 connections sitting on Threads_running = 230 and a wait queue on the InnoDB row locks for the orders table.
This is a textbook MySQL incident. Every individual symptom — deadlocks on contention rows, replica falling behind, connection saturation, lock waits — is something MySQL exposes via Performance Schema and SHOW STATUS. None of it is invisible. What separates "we have a problem at 12:04" from "we have an incident at 12:30 and dashboards are no help" is whether the right counters were being collected before the incident started.
This guide is the production-monitoring layer for MySQL: what to enable, which counters to track, how the slow query log and Performance Schema fit together, how to read InnoDB's internals, and what to alert on. It pairs with the external-uptime layer in Database Monitoring: MySQL, PostgreSQL & Redis Uptime; together they cover both "is MySQL responding" and "is MySQL healthy enough to keep responding".
Performance Schema vs information_schema
MySQL has two observability surfaces, with confusing overlap:
information_schema— schema metadata (tables, columns, indexes, sizes) plus a few cumulative stats. Read-heavy queries here are expensive on busy servers because some views walk all open tables.performance_schema(PS) — the modern instrumentation. Per-statement events, per-table I/O, lock waits, mutex contention, stage timings. Cheap to query, designed for monitoring agents.
Enable Performance Schema (default on in MySQL 8) and turn on the instrumentation you need. The headline tables:
| PS table | What it gives you |
|---|---|
events_statements_summary_by_digest |
Normalised query workload — your pg_stat_statements equivalent |
events_statements_history_long |
Recent individual statement events |
table_io_waits_summary_by_table |
I/O per table |
data_locks |
Currently held locks (MySQL 8.0+) |
data_lock_waits |
Currently waiting locks (MySQL 8.0+) |
replication_connection_status / replication_applier_status_by_worker |
Replica internals (MySQL 8.0+) |
metadata_locks |
Metadata-lock contention |
For most monitoring agents, events_statements_summary_by_digest and data_locks / data_lock_waits are the two heavy-hitters.
Slow Queries
Slow query log
Enable in my.cnf:
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 0.5
log_queries_not_using_indexes = 1
log_slow_admin_statements = 1
log_slow_replica_statements = 1
Ship the file to log aggregation and also run pt-query-digest (from Percona Toolkit) on the file on a schedule:
pt-query-digest /var/log/mysql/slow.log > digest-$(date +%F).txt
pt-query-digest produces a ranked report of normalised queries with totals, p95, max time, and a sample query for each. It is the de-facto MySQL workload-analysis tool, and the report is what you bring to a query-tuning conversation.
Performance Schema digest
For a real-time view without parsing log files:
SELECT
LEFT(DIGEST_TEXT, 60) AS query,
COUNT_STAR AS calls,
ROUND(SUM_TIMER_WAIT / 1e12, 2) AS total_seconds,
ROUND(AVG_TIMER_WAIT / 1e9, 2) AS avg_ms,
ROUND(MAX_TIMER_WAIT / 1e9, 2) AS max_ms,
SUM_ROWS_EXAMINED AS rows_examined,
SUM_ROWS_SENT AS rows_sent
FROM performance_schema.events_statements_summary_by_digest
WHERE SCHEMA_NAME IS NOT NULL
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20;
Sorted by SUM_TIMER_WAIT (total time) — same logic as Postgres' total_exec_time. The top 20 by total time is your workload picture. The ratio rows_examined / rows_sent is your "is this query using the right index" signal: ratios above ~100 are typically missing-index symptoms.
EXPLAIN FORMAT=TREE / JSON
MySQL 8.0+ supports EXPLAIN FORMAT=TREE which gives you the operator tree the optimizer chose plus row-count estimates per operator. EXPLAIN ANALYZE runs the query and adds real timings.
EXPLAIN FORMAT=TREE SELECT ... ;
EXPLAIN ANALYZE SELECT ... ;
The patterns worth recognising:
Table scan on ton a > 100k-row table — missing indexFilter: ... (cost=... rows=...)with estimated rows >> actual rows fromEXPLAIN ANALYZE— stale stats; runANALYZE TABLE tNested loopwith a multiplied cost > 1e6 — missing composite or join-column indexSortof a large set in a query without an index that could satisfy ORDER BY — sort buffer pressure
InnoDB Internals
InnoDB is the only storage engine you should be running for OLTP in 2026. Its internals are exposed through SHOW ENGINE INNODB STATUS (firehose) and a series of SHOW STATUS and performance_schema views.
Buffer pool
InnoDB caches data + indexes in the buffer pool. The hit ratio:
SELECT
ROUND(100 * (1 - SUM(VARIABLE_VALUE) /
(SELECT VARIABLE_VALUE FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests')), 4) AS hit_pct
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads';
(A more readable version: 1 - Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests.)
Target > 99% on systems with adequate RAM. Falling below 95% means either the working set exceeds innodb_buffer_pool_size (resize), or a query pattern is doing large scans that pollute the cache (investigate Handler_read_rnd_next).
Dirty pages and checkpointing
Innodb_buffer_pool_pages_dirty / Innodb_buffer_pool_pages_total should typically be < 75%. Above that, InnoDB is struggling to flush; if it hits the dirty-page hard limit, writes stall. Tune innodb_io_capacity and innodb_io_capacity_max to your storage's actual IOPS.
Redo log pressure
Innodb_log_waits — the count of times a transaction had to wait because the redo log buffer was full. Non-zero growth = innodb_log_buffer_size too small for your write rate, or innodb_redo_log_capacity (MySQL 8.0.30+) too small.
Row operations
SHOW GLOBAL STATUS LIKE 'Innodb_rows_%';
Returns Innodb_rows_read, Innodb_rows_inserted, Innodb_rows_updated, Innodb_rows_deleted. Track per-second deltas; sudden spikes in Innodb_rows_read often signal a bad query plan in production.
Replication
Source-replica setup
SHOW REPLICA STATUS\G on a replica shows the headline replication state. The fields that matter most:
Replica_IO_Running— must be YesReplica_SQL_Running— must be YesSeconds_Behind_Source— the user-visible lag metricSource_Log_File/Read_Source_Log_Pos— what's been pulledRelay_Source_Log_File/Exec_Source_Log_Pos— what's been appliedLast_IO_Errno,Last_SQL_Errno— non-zero is bad
Seconds_Behind_Source is the canonical lag metric, but be aware: it's measured by the SQL thread, and it freezes at a stale value if the IO thread is broken. For ground truth use a heartbeat table: a cron writes now() into a heartbeat table on the source every second; the replica reads it; lag = source's timestamp minus replica's now(). pt-heartbeat (Percona Toolkit) is the standard implementation.
GTID-based replication
If you're on GTID (which you should be — it's the default since 8.0):
SELECT WAIT_FOR_EXECUTED_GTID_SET('source-uuid:1-1000', 5);
This is your read-after-write guarantee primitive. Application code that needs to read its own writes can wait for the GTID. Monitor:
Retrieved_Gtid_SetvsExecuted_Gtid_Set— the gap is what's still in the relay log
Parallel applier monitoring
In MySQL 8.0+ replicas can apply in parallel by group (replica_parallel_type = LOGICAL_CLOCK, replica_parallel_workers = N). Monitor per-worker status:
SELECT WORKER_ID, LAST_APPLIED_TRANSACTION,
APPLYING_TRANSACTION, APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP
FROM performance_schema.replication_applier_status_by_worker;
A single worker stuck on a long transaction will look like "high lag" without obvious cause. Identifying that worker and the offending transaction often takes you straight to the bad query.
Alert thresholds
Replica_IO_Running != YesorReplica_SQL_Running != Yes— criticalSeconds_Behind_Source > 30for 5 minutes — highSeconds_Behind_Source > 5 minutes— critical- Heartbeat-table lag > 60s — high (catches
Seconds_Behind_Sourcelying) - GTID gap > 10,000 transactions for 5 minutes — high
Deadlocks
InnoDB detects deadlocks automatically and kills the smaller of the two transactions. The log of the most recent deadlock lives in SHOW ENGINE INNODB STATUS under "LATEST DETECTED DEADLOCK":
SHOW ENGINE INNODB STATUS\G
For automated capture, set:
innodb_print_all_deadlocks = ON
Every deadlock is then printed to the error log. Pipe the error log to log aggregation; alert when the deadlock rate exceeds your baseline.
A small steady rate of deadlocks is acceptable on busy OLTP systems. A growing rate, or a sudden burst, signals an application-side ordering bug — two code paths updating the same rows in opposite orders. The "latest deadlock" log includes the queries each transaction was running, which is usually enough to find the bug.
Lock-Wait Monitoring (MySQL 8.0+)
data_locks and data_lock_waits are the lock-wait equivalents of Postgres' pg_locks:
SELECT
r.trx_id AS waiting_trx,
r.trx_query AS waiting_query,
b.trx_id AS blocking_trx,
b.trx_query AS blocking_query,
w.LOCK_TYPE, w.LOCK_MODE,
w.OBJECT_SCHEMA, w.OBJECT_NAME
FROM performance_schema.data_lock_waits w
JOIN information_schema.innodb_trx r ON r.trx_id = w.REQUESTING_ENGINE_TRANSACTION_ID
JOIN information_schema.innodb_trx b ON b.trx_id = w.BLOCKING_ENGINE_TRANSACTION_ID;
Wire this query into a monitoring panel. Any row sustained for > 30 seconds is an incident worth paging on.
Metadata locks
DDL statements (ALTER TABLE, RENAME TABLE, etc) acquire metadata locks that conflict with ongoing transactions. The classic incident: someone runs ALTER TABLE during peak, every query against that table queues behind it, and the queue grows faster than it drains. Monitor:
SELECT OBJECT_SCHEMA, OBJECT_NAME, LOCK_TYPE, LOCK_STATUS, COUNT(*)
FROM performance_schema.metadata_locks
WHERE LOCK_STATUS = 'PENDING'
GROUP BY OBJECT_SCHEMA, OBJECT_NAME, LOCK_TYPE, LOCK_STATUS;
The fix for DDL during peak: use online DDL (ALGORITHM=INPLACE, LOCK=NONE), or pt-online-schema-change / gh-ost for schema changes that can't run online.
Connections and Threads
SHOW GLOBAL STATUS LIKE 'Threads_%';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
SHOW GLOBAL STATUS LIKE 'Connection_errors_%';
Key counters:
Threads_connected— currently connected clientsThreads_running— clients actually executing a query right nowMax_used_connections— historical peak; compare againstmax_connectionsConnection_errors_max_connections— non-zero means MySQL is refusing connections
The Threads_running metric is the most operationally useful: a sudden spike from 10 to 200 is the "MySQL is melting" signal. It often precedes Seconds_Behind_Source spikes and connection rejection.
Most teams should run a pooler (ProxySQL is the de-facto choice for MySQL). Monitor ProxySQL's stats_mysql_connection_pool and stats_mysql_query_digest similarly.
Query Cache (Mostly Dead) and Table Cache
MySQL 8 removed the query cache. If you're still on 5.7, disable it (query_cache_type = 0) — its contention overhead exceeds its benefit on modern multi-core servers.
The table-open cache still matters:
Table_open_cache_hits / (Table_open_cache_hits + Table_open_cache_misses)should be > 99%Opened_tablesgrowing fast =table_open_cachetoo small
Hosted MySQL Caveats
AWS RDS / Aurora MySQL
CloudWatch covers OS metrics + a curated set of MySQL metrics (Active connections, Aurora replica lag, BinLogDiskUsage, etc). Enable Performance Insights for top-SQL by wait event. Aurora replaces InnoDB's storage layer with its own; many InnoDB-internal counters become less meaningful, but Performance Insights surfaces what matters.
Google Cloud SQL
Cloud SQL Insights ≈ Performance Insights. Top queries by wait event, query plans, recommendations. Use it.
PlanetScale (Vitess)
Vitess monitoring is its own world — vttablet status, query consolidator hit rate, OLAP vs OLTP routing. PlanetScale's Insights surfaces top queries; the underlying Vitess metrics are exposed via Prometheus.
Aurora's parallel query
Aurora MySQL's parallel query feature shifts where slow queries live; monitor aurora_parallel_query status alongside the standard counters.
What to Alert On
Critical (page)
Replica_IO_Running != YesorReplica_SQL_Running != YesSeconds_Behind_Source > 5 minutes(and heartbeat lag confirms)Threads_connected > 90% of max_connectionsConnection_errors_max_connectionsgrowing- Lock wait > 30s on critical tables (orders, payments, sessions)
- Deadlock rate spike > 10× baseline
- Buffer pool hit ratio drop > 10pp in 15 minutes
High (notification)
- p99 query time on top-10 queries up 50% week-over-week
Seconds_Behind_Source > 30ssustainedThreads_runningspike > 5× baselineInnodb_log_waits > 0growthInnodb_buffer_pool_pages_dirty / total > 75%- Metadata locks pending during business hours
Informational
- A new query appears in the top 20 by total time
- Plan changes detected (digest hash + plan changes)
- Table size growth > 50% week-over-week
Slow_queriescounter accelerating
See Alert Fatigue: Notifications That Get Acted On for the broader low-noise alerting principles.
MySQL Monitoring Checklist
- Performance Schema enabled and tuned (most consumers ON by default in 8.0)
- Slow query log enabled with
long_query_time = 0.5(or stricter on cold paths) -
pt-query-digestrunning on a schedule against the slow log -
events_statements_summary_by_digestqueried by monitoring agent every 60s -
data_locks+data_lock_waitswired into a dashboard panel -
innodb_print_all_deadlocks = ON, error log shipped to log aggregation - Heartbeat-table replica lag (
pt-heartbeator equivalent) tracked -
SHOW REPLICA STATUSpolled per-replica, including parallel-applier worker status - Buffer pool hit ratio + dirty page % + redo log waits tracked
-
Threads_runningandThreads_connectedtracked per-second -
Innodb_rows_*deltas tracked - ProxySQL / pool monitoring if applicable
- DDL safety:
pt-online-schema-changeorgh-ostworkflow for any non-online schema change - Vendor tooling (Performance Insights, Cloud SQL Insights) integrated
- Heartbeat-table lag alerting catches
Seconds_Behind_Sourcelying - External uptime monitoring complements internal — see Database Monitoring foundation
How Webalert Helps With MySQL Monitoring
Webalert provides the external-monitoring layer that complements your in-database telemetry:
- HTTP monitoring — Watch the API endpoints backed by MySQL; when a slow query starts surfacing as user-visible latency, you see it at the edge
- Content validation — Hit an internal
/internal/db-healthendpoint that surfacesevents_statements_summary_by_digesttop-N, replica lag, andThreads_running; alert when those exceed thresholds - WordPress and WooCommerce monitoring — Most WordPress and WooCommerce sites are MySQL-backed; see WordPress Uptime Monitoring and WooCommerce Monitoring
- Multi-region checks — Read traffic served from replicas in multiple regions; multi-region monitoring confirms freshness from the user's vantage point
- Status page — Communicate "we're seeing elevated database latency" without dumping internal metrics
- Multi-channel alerts — Email, SMS, Slack, Discord, Microsoft Teams, webhooks
- 1-minute check intervals — Detect outages within 60 seconds
- 5-minute setup — Add endpoints, set thresholds, done
See features and pricing for details.
Summary
- MySQL ships excellent observability through Performance Schema.
events_statements_summary_by_digestis the workload-analysis foundation;data_locks/data_lock_waitsare the lock-wait foundation. - The slow query log plus
pt-query-digestis still the canonical workflow for query tuning. Run it on a schedule and review the digest. - InnoDB internals worth monitoring: buffer pool hit ratio, dirty-page %, redo log waits, row-operation deltas.
- Replication:
Replica_IO_Running/Replica_SQL_Runningfirst;Seconds_Behind_Sourceas the headline lag metric, with heartbeat-table lag as the ground truth. - GTID replication makes read-after-write deterministic via
WAIT_FOR_EXECUTED_GTID_SET. - Parallel-applier worker status pinpoints which worker is stuck on which transaction during replica lag.
- Deadlock log via
innodb_print_all_deadlocks = ONexposes application-side ordering bugs. Threads_runningspikes precede most "MySQL is melting" incidents — alert on it.- DDL during peak is a top cause of metadata-lock incidents — use online DDL or
pt-online-schema-change/gh-ost. - Hosted MySQL (RDS, Aurora, Cloud SQL, PlanetScale) layers vendor tooling on top — use it, but the Performance Schema surfaces still apply.
MySQL incidents are rarely "the database is down". They're "the replica is 4 minutes behind", "we're deadlocking on orders", "an ALTER TABLE blocked checkout for 30 seconds", "Threads_running is at 200 and climbing". Each of those is visible in counters before it becomes an outage. Build the monitoring foundation once — Performance Schema digest, slow query log, replica lag with heartbeat, lock-wait panel, InnoDB buffer-pool stats — and the next Black Friday is a different conversation.