
The on-call channel pings with "API latency p99 just doubled". The application is the same; the database looks fine. Someone glances at Redis: INFO memory shows used_memory_human:11.94G / maxmemory_human:12.00G and evicted_keys is climbing by 50,000 per minute. Half the cache is being evicted as fast as it's being written, the hit ratio has collapsed from 92% to 38%, and every cache miss is now a slow query against the primary database. The Redis box is the proximate cause; an unbounded growth in one specific cache key family is the actual cause, and nobody's been watching it.
Redis is the component teams reach for when something needs to be fast, which means everyone has it in production and almost nobody monitors it well. The default reaction to a Redis problem is to look at OS-level metrics (CPU, memory, network) — but the interesting failure modes (eviction storms, latency spikes from KEYS-class commands, persistence stalls, replication backlog, big keys, hot keys, cluster MOVED rates) live inside Redis itself and are exposed through INFO, LATENCY, SLOWLOG, and a handful of related commands.
This guide is the production-monitoring layer for Redis: which INFO sections matter, how memory and eviction actually work, how to capture latency spikes with attribution, what to do about big and hot keys, and how the monitoring story shifts in Cluster, Sentinel, and managed-service deployments. It complements the external-uptime layer in Database Monitoring: MySQL, PostgreSQL & Redis Uptime.
The INFO Sections Worth Polling
INFO returns a categorised dump of Redis state. Poll a subset every 30-60 seconds; store deltas in a time-series store.
| Section | What's there |
|---|---|
| Memory | used_memory, maxmemory, fragmentation ratio, peak |
| Stats | total_connections_received, total_commands_processed, keyspace_hits, keyspace_misses, evicted_keys, expired_keys, rejected_connections |
| Clients | connected_clients, blocked_clients, tracking_clients |
| Replication | role, connected_slaves/replicas, master_repl_offset, per-replica replica_repl_offset |
| Persistence | rdb_last_save_time, rdb_last_bgsave_status, aof_rewrite_in_progress, aof_last_rewrite_time_sec |
| CPU | used_cpu_sys, used_cpu_user |
| Commandstats | per-command call count, total time (usec), average time per call; rejected_calls / failed_calls on Redis 6.2+. Percentiles (p50/p99/p99.9) come from the separate Latencystats section on Redis 7+ |
Run INFO MEMORY, INFO STATS, INFO REPLICATION, INFO CLIENTS, INFO PERSISTENCE, INFO COMMANDSTATS — never a bare INFO from a monitoring agent (avoids needlessly serialising the kitchen sink).
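A minimal Python sketch of the polling side, assuming the agent already holds the raw text reply of an INFO <section> call (fetching it is left out). The helper names parse_info and counter_deltas are illustrative and not part of any Redis client.

```python
def parse_info(raw: str) -> dict:
    """Parse the text reply of an INFO <section> call into a flat dict of strings."""
    fields = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and "# Section" headers.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def counter_deltas(prev: dict, curr: dict, counters: list) -> dict:
    """Per-interval deltas for counters that only grow between restarts."""
    deltas = {}
    for name in counters:
        if name in prev and name in curr:
            delta = int(curr[name]) - int(prev[name])
            # A negative delta means the server restarted and the counter reset,
            # so the current value is the best estimate for this interval.
            deltas[name] = delta if delta >= 0 else int(curr[name])
    return deltas
```

Storing deltas rather than raw totals keeps graphs meaningful across restarts, since counters such as evicted_keys start again from zero.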
Memory and Eviction — The Single Most Important Topic
Almost every Redis-shaped incident traces back to memory. Two things to internalise:
maxmemory is a soft limit on data only
maxmemory caps the dataset size. Redis still uses memory beyond it for the COB (client output buffer), replication backlog, AOF buffer, and forked children during RDB / AOF rewrite. A Redis with maxmemory = 12GB on a 16GB box can OOM during BGSAVE because the forked child's copy-on-write footprint pushes total RSS over physical RAM. Always leave 30-50% headroom over maxmemory.
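The headroom arithmetic can be sketched as a rough check. The cow_fraction (share of pages rewritten while the child is alive) and overhead_gb (buffers, backlog, OS) parameters are assumptions chosen for illustration, not Redis settings; measure them on your own workload.

```python
def fork_headroom(ram_gb: float, maxmemory_gb: float,
                  cow_fraction: float = 0.3, overhead_gb: float = 1.0):
    """Estimate worst-case memory during BGSAVE and whether it fits in RAM.

    cow_fraction and overhead_gb are illustrative assumptions.
    """
    worst_case_gb = maxmemory_gb * (1 + cow_fraction) + overhead_gb
    return worst_case_gb <= ram_gb, worst_case_gb


# The 12GB-on-16GB example from the text does not fit under these assumptions.
fits, worst = fork_headroom(ram_gb=16, maxmemory_gb=12)
print(f"fits={fits} worst_case={worst:.1f}GB")
```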
maxmemory-policy decides what eviction does
The policy options:
| Policy | Behavior |
|---|---|
| noeviction | Write commands return errors when at maxmemory. Reads still work. Caches usually don't want this; data stores usually do |
| allkeys-lru | Evict the approximate-least-recently-used key from anywhere. Sensible default for caches |
| allkeys-lfu | Evict approximate-least-frequently-used. Better when hot vs cold differs by frequency, not recency |
| volatile-lru / volatile-lfu | Same, but only on keys with TTL set |
| allkeys-random / volatile-random | Random eviction; rarely the right choice |
| volatile-ttl | Evict keys with the soonest expiration |
The single biggest configuration mistake we see: caches deployed with maxmemory-policy = noeviction. Then maxmemory fills up, every SET starts erroring, the application starts failing writes to the cache, and somehow nobody is sure why.
Eviction metrics
The headline counters:
- evicted_keys: total since boot. Track the per-second delta. A non-zero sustained delta on a cache is expected (the cache is full); a sustained delta on a primary data store is a misconfiguration alarm.
- keyspace_hits / keyspace_misses: hit ratio is hits / (hits + misses). A cache with a hit ratio below 80% is rarely worth running.
- expired_keys: keys removed via TTL expiration, not eviction. Distinct from eviction.
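Because these counters accumulate since boot, a lifetime hit ratio hides a recent collapse. A small sketch that computes both headline numbers over one polling interval instead; the function name and the dict shape are assumptions for illustration.

```python
def interval_rates(prev: dict, curr: dict, seconds: float) -> dict:
    """Hit ratio and eviction rate over one polling interval.

    prev and curr hold integer counters sampled `seconds` apart.
    """
    hits = curr["keyspace_hits"] - prev["keyspace_hits"]
    misses = curr["keyspace_misses"] - prev["keyspace_misses"]
    evicted = curr["evicted_keys"] - prev["evicted_keys"]
    lookups = hits + misses
    return {
        # None when there was no read traffic in the interval.
        "hit_ratio": hits / lookups if lookups > 0 else None,
        "evictions_per_sec": evicted / seconds,
    }
```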
Fragmentation ratio
mem_fragmentation_ratio = used_memory_rss / used_memory. The interesting bands:
- 1.0 – 1.5: normal
- 1.5 – 2.0: high; consider activedefrag yes (Redis 4+)
- > 2.0: the allocator is holding lots of unused memory; eventual restart needed
- < 1.0: Redis has swapped to disk; very bad, because performance is now at random-access disk speeds
Alert on mem_fragmentation_ratio < 1.0 immediately. It's not subtle: a Redis instance that has swapped is better taken out of rotation than left running.
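The bands above translate directly into a classifier. This is a sketch; the band labels are made up for illustration and the inputs are the used_memory_rss and used_memory byte counts from INFO MEMORY.

```python
def fragmentation_band(used_memory_rss: int, used_memory: int):
    """Return (ratio, band) using the thresholds listed above."""
    if used_memory <= 0:
        raise ValueError("used_memory must be positive")
    ratio = used_memory_rss / used_memory
    if ratio < 1.0:
        band = "swapping"   # page immediately
    elif ratio <= 1.5:
        band = "normal"
    elif ratio <= 2.0:
        band = "high"       # candidate for active defrag
    else:
        band = "very-high"  # plan a restart
    return ratio, band
```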
Latency Monitoring — LATENCY and SLOWLOG
Redis is single-threaded for command execution. A single slow command stalls everything. The two surfaces for catching this:
LATENCY HISTORY and LATENCY LATEST
Enable the latency monitor:
CONFIG SET latency-monitor-threshold 100
Redis now records any event that takes longer than 100ms into a ring buffer per event-type:
LATENCY LATEST
LATENCY HISTORY event-name
LATENCY DOCTOR
LATENCY DOCTOR returns a human-readable analysis with the top events, ranges, and likely causes. Run it in incident response. For monitoring, poll LATENCY LATEST every 30 seconds and ship the structured output to your time-series store.
SLOWLOG
Redis tracks the N most-recent commands that exceeded a threshold:
CONFIG SET slowlog-log-slower-than 10000 # 10 ms in microseconds
CONFIG SET slowlog-max-len 1024
SLOWLOG GET 50
This is your "what query was actually slow" surface. The classic offenders:
- KEYS pattern on a large dataset: O(N) scan, blocks the event loop. Forbid KEYS in production; use SCAN instead.
- SMEMBERS, LRANGE 0 -1, HGETALL on large collections: O(N) where N can be huge
- DEBUG SLEEP: someone left a debug command running
- Large MGET / MSET batches: usually fine but worth knowing if they're near the top of the list
- Lua scripts (EVAL / EVALSHA) that loop over many keys
Wire SLOWLOG GET 100 into a monitoring panel that runs every minute and aggregates by normalised command. Anything new appearing at the top of the list is worth investigating.
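One way to sketch the "aggregate by normalised command" step. It assumes each slowlog entry has already been reduced to a (duration_usec, args) pair, where args is the command and its arguments as strings; that shape is an assumption for illustration, not the reply format of any particular client. Normalising here means keeping only the upper-cased command name.

```python
from collections import defaultdict


def aggregate_slowlog(entries):
    """Group slowlog entries by command name, sorted by total time descending."""
    stats = defaultdict(lambda: {"count": 0, "total_usec": 0, "max_usec": 0})
    for duration_usec, args in entries:
        if not args:
            continue
        name = args[0].upper()  # drop keys and arguments, keep the command
        entry = stats[name]
        entry["count"] += 1
        entry["total_usec"] += duration_usec
        entry["max_usec"] = max(entry["max_usec"], duration_usec)
    return sorted(stats.items(), key=lambda kv: kv[1]["total_usec"], reverse=True)
```

Comparing the command names in successive runs is enough to spot a newcomer at the top of the list.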
Per-command stats (Redis 7+)
INFO COMMANDSTATS
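Returns per-command call count (calls), total time (usec), and average time per call (usec_per_call), plus rejected_calls / failed_calls on Redis 6.2+. Latency percentiles (p50, p99, p99.9) are reported separately by INFO LATENCYSTATS on Redis 7+. This is the closest Redis equivalent to pg_stat_statements / events_statements_summary_by_digest. Sort by total time to find capacity bottlenecks.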
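A sketch of the sort-by-total-time step, assuming lines of the form cmdstat_<name>:calls=...,usec=...,usec_per_call=... in the raw reply. The function name is hypothetical.

```python
def parse_commandstats(raw: str):
    """Parse cmdstat_* lines and return rows sorted by total time (usec) descending."""
    rows = []
    for line in raw.splitlines():
        line = line.strip()
        if not line.startswith("cmdstat_"):
            continue
        name, _, payload = line.partition(":")
        fields = dict(pair.split("=", 1) for pair in payload.split(",") if "=" in pair)
        rows.append({
            "command": name[len("cmdstat_"):],
            "calls": int(fields["calls"]),
            "usec": int(fields["usec"]),
            "usec_per_call": float(fields["usec_per_call"]),
        })
    return sorted(rows, key=lambda row: row["usec"], reverse=True)
```

Sorting by usec rather than usec_per_call surfaces cheap commands that dominate through sheer volume as well as rare expensive ones.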
Big Keys and Hot Keys
Two distinct failure modes:
Big keys
A single key (typically a List, Set, Hash, or Sorted Set) that has grown to hundreds of MB. Symptoms: any command touching it takes seconds, eviction policy can't shed it efficiently, replication of a write that touches it stalls.
Find them:
redis-cli --bigkeys
Or for surgical inspection:
redis-cli MEMORY USAGE key
Or run SCAN periodically across the keyspace and check MEMORY USAGE on each. The signal: any key > 10MB on a typical workload is suspicious; > 100MB needs an immediate plan.
The fix: shard the key (user:123:notifications → user:123:notifications:202602), cap the size with LTRIM / ZREMRANGEBYRANK, or move it out of Redis to a more appropriate store.
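The month-suffix sharding in the example above can be sketched as two helpers: one that picks the shard to write to and one that lists the shards a reader needs. Both function names are hypothetical, and the YYYYMM suffix simply follows the example key.

```python
from datetime import date


def monthly_shard_key(base_key: str, day: date) -> str:
    """Key for the month containing `day`, e.g. base:202602."""
    return f"{base_key}:{day:%Y%m}"


def recent_shard_keys(base_key: str, today: date, months: int) -> list:
    """Keys for the current month and the months before it, newest first."""
    keys = []
    year, month = today.year, today.month
    for _ in range(months):
        keys.append(f"{base_key}:{year:04d}{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return keys
```

Giving each monthly shard a TTL would let old data age out without a large delete against one big key.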
Hot keys
A single key receiving an outsized share of traffic. Even small hot keys can become a bottleneck because Redis is single-threaded. Symptoms: high CPU on Redis, fine memory, decent hit ratio, but latency p99 climbing.
Find them:
redis-cli --hotkeys
(Requires maxmemory-policy of allkeys-lfu or volatile-lfu to collect frequency data.)
The fix: client-side caching for that key with TTL, key-level replication, or sharding the access pattern.
Replication
INFO REPLICATION on the source shows role + connected_slaves + each replica's offset and lag:
role:master
connected_slaves:2
slave0:ip=10.0.1.42,port=6379,state=online,offset=894012348,lag=0
slave1:ip=10.0.1.43,port=6379,state=online,offset=894008923,lag=1
master_repl_offset:894012348
The headline metrics:
- state should be online per replica
- lag is in seconds based on last ack; > 5s is worth a look
- offset gap between master_repl_offset and per-replica offset is the distance in bytes
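A sketch that turns the source-side output shown above into per-replica numbers. It assumes the slaveN:ip=...,port=...,state=...,offset=...,lag=... line shape from the sample; the function name is hypothetical.

```python
import re


def replica_gaps(raw: str):
    """Per-replica state, lag in seconds, and byte gap behind the source."""
    master_offset = None
    replicas = []
    for line in raw.splitlines():
        key, sep, value = line.strip().partition(":")
        if not sep:
            continue
        if key == "master_repl_offset":
            master_offset = int(value)
        elif re.fullmatch(r"slave\d+", key):
            replicas.append(dict(p.split("=", 1) for p in value.split(",") if "=" in p))
    if master_offset is None:
        raise ValueError("master_repl_offset not found")
    return [
        {
            "addr": f"{r['ip']}:{r['port']}",
            "state": r["state"],
            "lag_seconds": int(r["lag"]),
            "offset_gap_bytes": master_offset - int(r["offset"]),
        }
        for r in replicas
    ]
```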
On the replica side:
role:slave
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0
slave_repl_offset:894012348
Watch:
- master_link_status:down means replication is broken
- master_sync_in_progress:1 for sustained periods means a full resync is ongoing and the replica is unusable for serving reads
- repl_backlog_size exhausted (replication backlog buffer too small for the disconnect duration) forces a full resync; sized via repl-backlog-size
Cluster mode replication
In Cluster mode each shard is a primary + N replicas. CLUSTER NODES and CLUSTER INFO show the topology. Monitor:
- cluster_state:ok (anything else is broken)
- cluster_slots_assigned:16384 (all slots covered)
- cluster_slots_pfail / cluster_slots_fail: non-zero = node-failure detection in progress
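The three checks above, sketched against the raw CLUSTER INFO reply. The function name and the wording of the problem strings are illustrative.

```python
def cluster_problems(raw: str) -> list:
    """Return a list of human-readable problems found in CLUSTER INFO output."""
    fields = {}
    for line in raw.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key] = value

    problems = []
    if fields.get("cluster_state") != "ok":
        problems.append(f"cluster_state is {fields.get('cluster_state', 'missing')}")
    if int(fields.get("cluster_slots_assigned", 0)) != 16384:
        problems.append("not all 16384 slots are assigned")
    for name in ("cluster_slots_pfail", "cluster_slots_fail"):
        if int(fields.get(name, 0)) > 0:
            problems.append(f"{name} is {fields[name]}")
    return problems
```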
Persistence
Redis offers two persistence modes, often combined:
RDB snapshots
BGSAVE forks a child that writes a point-in-time snapshot. Monitor:
- rdb_last_save_time: Unix timestamp of last successful save
- rdb_last_bgsave_status: must be ok
- rdb_last_bgsave_time_sec: duration of last save; a sudden climb means the dataset grew or disk slowed
The "fork latency spike": when Redis forks a child for BGSAVE on a memory-pressured host, the OS pauses the parent for the duration of the copy-on-write page-table copy. On a 12GB dataset this can be 100-200ms. Capture in LATENCY HISTORY fork.
AOF
appendonly yes writes every command to an append-only log. Monitor:
- aof_enabled: should match your config
- aof_rewrite_in_progress: 1 during a rewrite is fine; sustained at 1 means the rewrite has stalled
- aof_last_rewrite_time_sec: rewrite duration
- aof_pending_bio_fsync: pending fsyncs on the background IO thread; > 0 sustained = disk struggling
Rewrite stalls (aof_last_rewrite_time_sec growing across runs) are usually disk-bound. A stalled rewrite does not block writes, so the AOF keeps growing; write traffic eventually outpaces the rewriter and the disk fills.
Connections and Clients
INFO CLIENTS
- connected_clients: total client connections
- blocked_clients: clients waiting in a blocking call (BLPOP, WAIT, etc.); this is expected for queue workers
- maxclients: hard cap (default 10,000)
- rejected_connections from INFO STATS: incremented when maxclients is hit
Alert when connected_clients > 80% of maxclients and when rejected_connections grows.
Client output buffer
A subscriber that can't keep up with a fast publisher accumulates output in its client output buffer. The buffer has a limit; when reached, the client is disconnected (client-output-buffer-limit). Symptoms: pub/sub subscribers silently disconnecting. Monitor CLIENT LIST periodically for clients with high omem.
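A sketch of the periodic CLIENT LIST check, assuming the reply is one client per line made of space-separated key=value tokens including addr, cmd, and omem. The function name and the threshold parameter are hypothetical.

```python
def slow_consumers(client_list: str, omem_limit_bytes: int):
    """Return (addr, cmd, omem) for clients at or above the limit, largest first."""
    flagged = []
    for line in client_list.splitlines():
        # Tokens such as "name=" have an empty value; split on the first "=" only.
        fields = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
        if "omem" not in fields:
            continue
        omem = int(fields["omem"])
        if omem >= omem_limit_bytes:
            flagged.append((fields.get("addr", "?"), fields.get("cmd", "?"), omem))
    return sorted(flagged, key=lambda client: client[2], reverse=True)
```

Setting the threshold well below the configured client-output-buffer-limit gives a warning before the server disconnects the subscriber.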
Cluster Mode Specifics
In Cluster mode, monitoring adds:
- MOVED / ASK error rates: a client routed a request to the wrong node. Some MOVED traffic is normal during failover; sustained MOVED rates mean clients haven't refreshed their slot map. Most modern clients handle this; legacy clients don't.
- CLUSTER FAILOVER events: track them as discrete events with timestamps
- Per-slot key counts: CLUSTER COUNTKEYSINSLOT per slot; sustained imbalance suggests a key-distribution problem
- Cross-slot operations: disallowed by Cluster; client errors show in application logs. Worth tracking from the application side
Sentinel Mode
Sentinel coordinates failover for non-Cluster setups. Monitor:
- Sentinel count alive (quorum requirement)
- Last failover timestamp
- Master/replica known to Sentinel matches actual topology
- +sdown / +odown / +failover-triggered events from the Sentinel pub/sub channel
Managed Redis (ElastiCache, MemoryStore, Upstash, Dragonfly, Redis Cloud)
AWS ElastiCache
CloudWatch covers most counters: CPUUtilization, EngineCPUUtilization (single-threaded measure — the one that matters), DatabaseMemoryUsagePercentage, Evictions, ReplicationLag, CurrConnections. Cluster mode includes per-shard metrics. Performance Insights for Redis surfaces top commands.
EngineCPUUtilization > 80% is the canonical "Redis is the bottleneck" signal — it measures the single-threaded event loop, which CPUUtilization doesn't isolate.
Google MemoryStore
Cloud Monitoring covers similar ground. The metric names differ but the structure is identical.
Upstash
Upstash's HTTP/REST API model changes the connection-monitoring story (no persistent connections) but the keyspace metrics, eviction, and slowlog still apply. The Upstash dashboard exposes daily request count and bandwidth.
Dragonfly
Dragonfly is a multi-threaded Redis-compatible drop-in (2023+). Many of the single-threaded constraints (KEYS blocking the planet) don't apply, but the same monitoring surfaces (INFO, SLOWLOG, LATENCY) work because of API compatibility. Some Dragonfly-specific metrics (shard_count, per-shard CPU) need their own panels.
Redis Cloud / Redis Enterprise
Redis Inc.'s managed service exposes the same surfaces plus enterprise-specific (proxy CPU, shard placement) metrics.
What to Alert On
Critical (page)
- used_memory > 95% of maxmemory and policy is noeviction
- mem_fragmentation_ratio < 1.0 (swap-backed Redis)
- evicted_keys rate > 10× baseline (eviction storm)
- master_link_status:down on any replica
- cluster_state != ok
- rdb_last_bgsave_status != ok
- rejected_connections growth (connection saturation)
- EngineCPUUtilization > 90% (ElastiCache) or equivalent single-thread CPU
High (notification)
- Cache hit ratio < 80% on a workload where it was historically > 90%
- p99 latency > 10ms sustained (Redis should serve from RAM in single-digit ms)
- Big key detected via --bigkeys or MEMORY USAGE > 100MB
- aof_last_rewrite_time_sec growing across runs (AOF rewriter falling behind)
- repl_backlog_active = 0 (replicas relying on full resync after disconnect)
- New slow command appears in SLOWLOG
- blocked_clients > 10× baseline (queue backlog forming)
Informational
- mem_fragmentation_ratio > 1.5 (defrag opportunity)
- Per-shard imbalance in Cluster mode
- A new command appearing near the top of INFO COMMANDSTATS
- Sentinel failover event logged
See Alert Fatigue: Notifications That Get Acted On for the broader low-noise alerting principles.
Redis Monitoring Checklist
- Polling agent runs INFO MEMORY, INFO STATS, INFO REPLICATION, INFO CLIENTS, INFO PERSISTENCE, INFO COMMANDSTATS every 30-60s
- CONFIG SET latency-monitor-threshold 100 enabled; LATENCY HISTORY polled
- slowlog-log-slower-than 10000 configured; SLOWLOG GET shipped to logs
- Hit ratio (keyspace_hits / (keyspace_hits + keyspace_misses)) tracked
- Eviction rate (evicted_keys delta) tracked
- Fragmentation ratio tracked; alert on < 1.0
- maxmemory and maxmemory-policy reviewed for every Redis instance (no noeviction on caches by accident)
- OS-level free memory > 30-50% over maxmemory (room for fork + AOF buffers)
- Big-key job: weekly --bigkeys or periodic MEMORY USAGE scan
- Hot-key job: --hotkeys (with LFU policy enabled) on the workload
- Replication lag (offset gap + lag seconds) tracked per replica
- repl-backlog-size sized to cover expected disconnect duration
- Persistence: rdb_last_bgsave_status, AOF rewrite duration, fork-latency events
- blocked_clients and rejected_connections tracked
- Cluster: per-shard cluster_state, slot assignment, MOVED rates
- Sentinel: quorum count, failover events
- Managed-service: vendor-specific single-thread CPU metric (EngineCPUUtilization on ElastiCache, equivalents elsewhere)
- KEYS disallowed in production (clients use SCAN)
- External uptime monitoring complements internal; see Database Monitoring foundation
- If Redis backs queues (BullMQ, Sidekiq), also see Job Queue Monitoring
- If Redis backs rate limiting, also see API Rate Limit Monitoring
How Webalert Helps With Redis Monitoring
Webalert provides the external-monitoring layer that complements your in-Redis telemetry:
- HTTP monitoring: Watch the API endpoints backed by Redis (caches, sessions, rate limiters); when cache hit ratio collapses, you see it as edge latency
- Content validation: Hit an internal /internal/redis-health endpoint that surfaces INFO MEMORY, hit ratio, eviction rate, and replication state; alert when thresholds are crossed
- CDN + Redis cache layering: see CDN Monitoring: Edge Cache & Origin Uptime
- Multi-region checks: Redis caches replicated to multiple regions; multi-region monitoring confirms freshness from the user's vantage point
- Status page: Communicate "we're seeing elevated cache latency" without dumping internal metrics
- Multi-channel alerts: Email, SMS, Slack, Discord, Microsoft Teams, webhooks
- 1-minute check intervals: Detect outages within 60 seconds
- 5-minute setup: Add endpoints, set thresholds, done
See features and pricing for details.
Summary
- Redis is single-threaded for command execution; almost every incident relates to memory, latency from a slow command, or replication / persistence.
- Poll INFO selectively: MEMORY, STATS, REPLICATION, CLIENTS, PERSISTENCE, COMMANDSTATS.
- maxmemory is a soft limit on data; leave 30-50% OS headroom for fork, AOF buffers, replication backlog.
- maxmemory-policy is the most commonly misconfigured setting; noeviction on a cache is a guaranteed incident.
- Eviction rate, hit ratio, fragmentation ratio are the headline memory metrics. Fragmentation < 1.0 means Redis has swapped; page immediately.
- Enable LATENCY monitoring and SLOWLOG. They are the closest equivalents Redis has to query monitoring.
- INFO COMMANDSTATS (with INFO LATENCYSTATS percentiles on Redis 7+) is the per-command workload picture; sort by total time.
- Big keys block the event loop; find with --bigkeys and MEMORY USAGE; shard or cap them.
- Hot keys saturate the single CPU thread; find with --hotkeys (requires LFU policy).
- Replication: per-replica offset, lag, link status; size repl-backlog-size correctly to avoid full resyncs.
- Persistence: BGSAVE status, fork-latency spikes, AOF rewrite duration.
- Cluster: cluster_state, slot coverage, MOVED rates.
- Managed Redis (ElastiCache, MemoryStore, Upstash, Dragonfly, Redis Cloud) layers vendor tooling; EngineCPUUtilization is the single-thread signal that matters most.
Redis problems are almost always memory problems, slow-command problems, or persistence problems — all three are visible in counters before they become user-visible incidents. Build the monitoring foundation once — selective INFO polling, LATENCY / SLOWLOG, hit ratio, eviction rate, big-key job, replication panels — and the next cache eviction storm shows up on a graph 10 minutes before it shows up on the pager.