MySQL Production Monitoring: Slow Queries, Replication, Locks

A Black Friday sale starts at midnight. By 12:04 your checkout p99 has gone from 180ms to 12 seconds. The application logs are a wall of MySQL deadlocks. The DBA tries SHOW REPLICA STATUS and discovers the read replica is 4 minutes behind because the source's binlog rate is 80MB/s and the replica's SQL thread is single-threaded. By 12:30 you've taken the read replicas out of rotation and pushed everything to the primary, which now has 4,000 connections sitting on Threads_running = 230 and a wait queue on the InnoDB row locks for the orders table.

This is a textbook MySQL incident. Every individual symptom — deadlocks on contention rows, replica falling behind, connection saturation, lock waits — is something MySQL exposes via Performance Schema and SHOW STATUS. None of it is invisible. What separates "we have a problem at 12:04" from "we have an incident at 12:30 and dashboards are no help" is whether the right counters were being collected before the incident started.

This guide is the production-monitoring layer for MySQL: what to enable, which counters to track, how the slow query log and Performance Schema fit together, how to read InnoDB's internals, and what to alert on. It pairs with the external-uptime layer in Database Monitoring: MySQL, PostgreSQL & Redis Uptime; together they cover both "is MySQL responding" and "is MySQL healthy enough to keep responding".

Performance Schema vs information_schema

MySQL has two observability surfaces, with confusing overlap:

information_schema — schema metadata (tables, columns, indexes, sizes) plus a few cumulative stats. Read-heavy queries here are expensive on busy servers because some views walk all open tables.
performance_schema (PS) — the modern instrumentation. Per-statement events, per-table I/O, lock waits, mutex contention, stage timings. Cheap to query, designed for monitoring agents.

Enable Performance Schema (default on in MySQL 8) and turn on the instrumentation you need. The headline tables:

PS table	What it gives you
`events_statements_summary_by_digest`	Normalised query workload — your `pg_stat_statements` equivalent
`events_statements_history_long`	Recent individual statement events
`table_io_waits_summary_by_table`	I/O per table
`data_locks`	Currently held locks (MySQL 8.0+)
`data_lock_waits`	Currently waiting locks (MySQL 8.0+)
`replication_connection_status` / `replication_applier_status_by_worker`	Replica internals (MySQL 8.0+)
`metadata_locks`	Metadata-lock contention

For most monitoring agents, events_statements_summary_by_digest and data_locks / data_lock_waits are the two heavy-hitters.

Slow Queries

Slow query log

Enable in my.cnf:

slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 0.5
log_queries_not_using_indexes = 1
log_slow_admin_statements = 1
log_slow_replica_statements = 1

Ship the file to log aggregation and also run pt-query-digest (from Percona Toolkit) on the file on a schedule:

pt-query-digest /var/log/mysql/slow.log > digest-$(date +%F).txt

pt-query-digest produces a ranked report of normalised queries with totals, p95, max time, and a sample query for each. It is the de-facto MySQL workload-analysis tool, and the report is what you bring to a query-tuning conversation.

Performance Schema digest

For a real-time view without parsing log files:

SELECT
    LEFT(DIGEST_TEXT, 60) AS query,
    COUNT_STAR AS calls,
    ROUND(SUM_TIMER_WAIT / 1e12, 2) AS total_seconds,
    ROUND(AVG_TIMER_WAIT / 1e9, 2) AS avg_ms,
    ROUND(MAX_TIMER_WAIT / 1e9, 2) AS max_ms,
    SUM_ROWS_EXAMINED AS rows_examined,
    SUM_ROWS_SENT AS rows_sent
FROM performance_schema.events_statements_summary_by_digest
WHERE SCHEMA_NAME IS NOT NULL
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20;

Sorted by SUM_TIMER_WAIT (total time) — same logic as Postgres' total_exec_time. The top 20 by total time is your workload picture. The ratio rows_examined / rows_sent is your "is this query using the right index" signal: ratios above ~100 are typically missing-index symptoms.

EXPLAIN FORMAT=TREE / JSON

MySQL 8.0+ supports EXPLAIN FORMAT=TREE which gives you the operator tree the optimizer chose plus row-count estimates per operator. EXPLAIN ANALYZE runs the query and adds real timings.

EXPLAIN FORMAT=TREE SELECT ... ;
EXPLAIN ANALYZE SELECT ... ;

The patterns worth recognising:

Table scan on t on a > 100k-row table — missing index
Filter: ... (cost=... rows=...) with estimated rows >> actual rows from EXPLAIN ANALYZE — stale stats; run ANALYZE TABLE t
Nested loop with a multiplied cost > 1e6 — missing composite or join-column index
Sort of a large set in a query without an index that could satisfy ORDER BY — sort buffer pressure

InnoDB Internals

InnoDB is the only storage engine you should be running for OLTP in 2026. Its internals are exposed through SHOW ENGINE INNODB STATUS (firehose) and a series of SHOW STATUS and performance_schema views.

Buffer pool

InnoDB caches data + indexes in the buffer pool. The hit ratio:

SELECT
    ROUND(100 * (1 - SUM(VARIABLE_VALUE) / 
        (SELECT VARIABLE_VALUE FROM performance_schema.global_status 
         WHERE VARIABLE_NAME = 'Innodb_buffer_pool_read_requests')), 4) AS hit_pct
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'Innodb_buffer_pool_reads';

(A more readable version: 1 - Innodb_buffer_pool_reads / Innodb_buffer_pool_read_requests.)

Target > 99% on systems with adequate RAM. Falling below 95% means either the working set exceeds innodb_buffer_pool_size (resize), or a query pattern is doing large scans that pollute the cache (investigate Handler_read_rnd_next).

Dirty pages and checkpointing

Innodb_buffer_pool_pages_dirty / Innodb_buffer_pool_pages_total should typically be < 75%. Above that, InnoDB is struggling to flush; if it hits the dirty-page hard limit, writes stall. Tune innodb_io_capacity and innodb_io_capacity_max to your storage's actual IOPS.

Redo log pressure

Innodb_log_waits — the count of times a transaction had to wait because the redo log buffer was full. Non-zero growth = innodb_log_buffer_size too small for your write rate, or innodb_redo_log_capacity (MySQL 8.0.30+) too small.

Row operations

SHOW GLOBAL STATUS LIKE 'Innodb_rows_%';

Returns Innodb_rows_read, Innodb_rows_inserted, Innodb_rows_updated, Innodb_rows_deleted. Track per-second deltas; sudden spikes in Innodb_rows_read often signal a bad query plan in production.

Replication

Source-replica setup

SHOW REPLICA STATUS\G on a replica shows the headline replication state. The fields that matter most:

Replica_IO_Running — must be Yes
Replica_SQL_Running — must be Yes
Seconds_Behind_Source — the user-visible lag metric
Source_Log_File / Read_Source_Log_Pos — what's been pulled
Relay_Source_Log_File / Exec_Source_Log_Pos — what's been applied
Last_IO_Errno, Last_SQL_Errno — non-zero is bad

Seconds_Behind_Source is the canonical lag metric, but be aware: it's measured by the SQL thread, and it freezes at a stale value if the IO thread is broken. For ground truth use a heartbeat table: a cron writes now() into a heartbeat table on the source every second; the replica reads it; lag = source's timestamp minus replica's now(). pt-heartbeat (Percona Toolkit) is the standard implementation.

GTID-based replication

If you're on GTID (which you should be — it's the default since 8.0):

SELECT WAIT_FOR_EXECUTED_GTID_SET('source-uuid:1-1000', 5);

This is your read-after-write guarantee primitive. Application code that needs to read its own writes can wait for the GTID. Monitor:

Retrieved_Gtid_Set vs Executed_Gtid_Set — the gap is what's still in the relay log

Parallel applier monitoring

In MySQL 8.0+ replicas can apply in parallel by group (replica_parallel_type = LOGICAL_CLOCK, replica_parallel_workers = N). Monitor per-worker status:

SELECT WORKER_ID, LAST_APPLIED_TRANSACTION,
       APPLYING_TRANSACTION, APPLYING_TRANSACTION_ORIGINAL_COMMIT_TIMESTAMP
FROM performance_schema.replication_applier_status_by_worker;

A single worker stuck on a long transaction will look like "high lag" without obvious cause. Identifying that worker and the offending transaction often takes you straight to the bad query.

Alert thresholds

Replica_IO_Running != Yes or Replica_SQL_Running != Yes — critical
Seconds_Behind_Source > 30 for 5 minutes — high
Seconds_Behind_Source > 5 minutes — critical
Heartbeat-table lag > 60s — high (catches Seconds_Behind_Source lying)
GTID gap > 10,000 transactions for 5 minutes — high

Deadlocks

InnoDB detects deadlocks automatically and kills the smaller of the two transactions. The log of the most recent deadlock lives in SHOW ENGINE INNODB STATUS under "LATEST DETECTED DEADLOCK":

SHOW ENGINE INNODB STATUS\G

For automated capture, set:

innodb_print_all_deadlocks = ON

Every deadlock is then printed to the error log. Pipe the error log to log aggregation; alert when the deadlock rate exceeds your baseline.

A small steady rate of deadlocks is acceptable on busy OLTP systems. A growing rate, or a sudden burst, signals an application-side ordering bug — two code paths updating the same rows in opposite orders. The "latest deadlock" log includes the queries each transaction was running, which is usually enough to find the bug.

Lock-Wait Monitoring (MySQL 8.0+)

data_locks and data_lock_waits are the lock-wait equivalents of Postgres' pg_locks:

SELECT
    r.trx_id AS waiting_trx,
    r.trx_query AS waiting_query,
    b.trx_id AS blocking_trx,
    b.trx_query AS blocking_query,
    w.LOCK_TYPE, w.LOCK_MODE,
    w.OBJECT_SCHEMA, w.OBJECT_NAME
FROM performance_schema.data_lock_waits w
JOIN information_schema.innodb_trx r ON r.trx_id = w.REQUESTING_ENGINE_TRANSACTION_ID
JOIN information_schema.innodb_trx b ON b.trx_id = w.BLOCKING_ENGINE_TRANSACTION_ID;

Wire this query into a monitoring panel. Any row sustained for > 30 seconds is an incident worth paging on.

Metadata locks

DDL statements (ALTER TABLE, RENAME TABLE, etc) acquire metadata locks that conflict with ongoing transactions. The classic incident: someone runs ALTER TABLE during peak, every query against that table queues behind it, and the queue grows faster than it drains. Monitor:

SELECT OBJECT_SCHEMA, OBJECT_NAME, LOCK_TYPE, LOCK_STATUS, COUNT(*)
FROM performance_schema.metadata_locks
WHERE LOCK_STATUS = 'PENDING'
GROUP BY OBJECT_SCHEMA, OBJECT_NAME, LOCK_TYPE, LOCK_STATUS;

The fix for DDL during peak: use online DDL (ALGORITHM=INPLACE, LOCK=NONE), or pt-online-schema-change / gh-ost for schema changes that can't run online.

Connections and Threads

SHOW GLOBAL STATUS LIKE 'Threads_%';
SHOW GLOBAL STATUS LIKE 'Max_used_connections';
SHOW GLOBAL STATUS LIKE 'Connection_errors_%';

Key counters:

Threads_connected — currently connected clients
Threads_running — clients actually executing a query right now
Max_used_connections — historical peak; compare against max_connections
Connection_errors_max_connections — non-zero means MySQL is refusing connections

The Threads_running metric is the most operationally useful: a sudden spike from 10 to 200 is the "MySQL is melting" signal. It often precedes Seconds_Behind_Source spikes and connection rejection.

Most teams should run a pooler (ProxySQL is the de-facto choice for MySQL). Monitor ProxySQL's stats_mysql_connection_pool and stats_mysql_query_digest similarly.

Query Cache (Mostly Dead) and Table Cache

MySQL 8 removed the query cache. If you're still on 5.7, disable it (query_cache_type = 0) — its contention overhead exceeds its benefit on modern multi-core servers.

The table-open cache still matters:

Table_open_cache_hits / (Table_open_cache_hits + Table_open_cache_misses) should be > 99%
Opened_tables growing fast = table_open_cache too small

Hosted MySQL Caveats

AWS RDS / Aurora MySQL

CloudWatch covers OS metrics + a curated set of MySQL metrics (Active connections, Aurora replica lag, BinLogDiskUsage, etc). Enable Performance Insights for top-SQL by wait event. Aurora replaces InnoDB's storage layer with its own; many InnoDB-internal counters become less meaningful, but Performance Insights surfaces what matters.

Google Cloud SQL

Cloud SQL Insights ≈ Performance Insights. Top queries by wait event, query plans, recommendations. Use it.

PlanetScale (Vitess)

Vitess monitoring is its own world — vttablet status, query consolidator hit rate, OLAP vs OLTP routing. PlanetScale's Insights surfaces top queries; the underlying Vitess metrics are exposed via Prometheus.

Aurora's parallel query

Aurora MySQL's parallel query feature shifts where slow queries live; monitor aurora_parallel_query status alongside the standard counters.

What to Alert On

Critical (page)

Replica_IO_Running != Yes or Replica_SQL_Running != Yes
Seconds_Behind_Source > 5 minutes (and heartbeat lag confirms)
Threads_connected > 90% of max_connections
Connection_errors_max_connections growing
Lock wait > 30s on critical tables (orders, payments, sessions)
Deadlock rate spike > 10× baseline
Buffer pool hit ratio drop > 10pp in 15 minutes

High (notification)

p99 query time on top-10 queries up 50% week-over-week
Seconds_Behind_Source > 30s sustained
Threads_running spike > 5× baseline
Innodb_log_waits > 0 growth
Innodb_buffer_pool_pages_dirty / total > 75%
Metadata locks pending during business hours

Informational

A new query appears in the top 20 by total time
Plan changes detected (digest hash + plan changes)
Table size growth > 50% week-over-week
Slow_queries counter accelerating

See Alert Fatigue: Notifications That Get Acted On for the broader low-noise alerting principles.

MySQL Monitoring Checklist

Performance Schema enabled and tuned (most consumers ON by default in 8.0)
Slow query log enabled with long_query_time = 0.5 (or stricter on cold paths)
pt-query-digest running on a schedule against the slow log
events_statements_summary_by_digest queried by monitoring agent every 60s
data_locks + data_lock_waits wired into a dashboard panel
innodb_print_all_deadlocks = ON, error log shipped to log aggregation
Heartbeat-table replica lag (pt-heartbeat or equivalent) tracked
SHOW REPLICA STATUS polled per-replica, including parallel-applier worker status
Buffer pool hit ratio + dirty page % + redo log waits tracked
Threads_running and Threads_connected tracked per-second
Innodb_rows_* deltas tracked
ProxySQL / pool monitoring if applicable
DDL safety: pt-online-schema-change or gh-ost workflow for any non-online schema change
Vendor tooling (Performance Insights, Cloud SQL Insights) integrated
Heartbeat-table lag alerting catches Seconds_Behind_Source lying
External uptime monitoring complements internal — see Database Monitoring foundation

How Webalert Helps With MySQL Monitoring

Webalert provides the external-monitoring layer that complements your in-database telemetry:

HTTP monitoring — Watch the API endpoints backed by MySQL; when a slow query starts surfacing as user-visible latency, you see it at the edge
Content validation — Hit an internal /internal/db-health endpoint that surfaces events_statements_summary_by_digest top-N, replica lag, and Threads_running; alert when those exceed thresholds
WordPress and WooCommerce monitoring — Most WordPress and WooCommerce sites are MySQL-backed; see WordPress Uptime Monitoring and WooCommerce Monitoring
Multi-region checks — Read traffic served from replicas in multiple regions; multi-region monitoring confirms freshness from the user's vantage point
Status page — Communicate "we're seeing elevated database latency" without dumping internal metrics
Multi-channel alerts — Email, SMS, Slack, Discord, Microsoft Teams, webhooks
1-minute check intervals — Detect outages within 60 seconds
5-minute setup — Add endpoints, set thresholds, done

See features and pricing for details.

Summary

MySQL ships excellent observability through Performance Schema. events_statements_summary_by_digest is the workload-analysis foundation; data_locks / data_lock_waits are the lock-wait foundation.
The slow query log plus pt-query-digest is still the canonical workflow for query tuning. Run it on a schedule and review the digest.
InnoDB internals worth monitoring: buffer pool hit ratio, dirty-page %, redo log waits, row-operation deltas.
Replication: Replica_IO_Running / Replica_SQL_Running first; Seconds_Behind_Source as the headline lag metric, with heartbeat-table lag as the ground truth.
GTID replication makes read-after-write deterministic via WAIT_FOR_EXECUTED_GTID_SET.
Parallel-applier worker status pinpoints which worker is stuck on which transaction during replica lag.
Deadlock log via innodb_print_all_deadlocks = ON exposes application-side ordering bugs.
Threads_running spikes precede most "MySQL is melting" incidents — alert on it.
DDL during peak is a top cause of metadata-lock incidents — use online DDL or pt-online-schema-change / gh-ost.
Hosted MySQL (RDS, Aurora, Cloud SQL, PlanetScale) layers vendor tooling on top — use it, but the Performance Schema surfaces still apply.

MySQL incidents are rarely "the database is down". They're "the replica is 4 minutes behind", "we're deadlocking on orders", "an ALTER TABLE blocked checkout for 30 seconds", "Threads_running is at 200 and climbing". Each of those is visible in counters before it becomes an outage. Build the monitoring foundation once — Performance Schema digest, slow query log, replica lag with heartbeat, lock-wait panel, InnoDB buffer-pool stats — and the next Black Friday is a different conversation.

Catch MySQL regressions before they cascade into an outage

Start monitoring with Webalert →

See features and pricing. No credit card required.