Redis Production Monitoring: Memory, Eviction & Latency

The on-call channel pings with "API latency p99 just doubled". The application is the same; the database looks fine. Someone glances at Redis: INFO memory shows used_memory_human:11.94G / maxmemory_human:12.00G and evicted_keys is climbing by 50,000 per minute. Half the cache is being evicted as fast as it's being written, the hit ratio has collapsed from 92% to 38%, and every cache miss is now a slow query against the primary database. The Redis box is the proximate cause; the actual cause is unbounded growth in one specific cache key family, and nobody has been watching it.

Redis is the component teams reach for when something needs to be fast, which means everyone has it in production and almost nobody monitors it well. The default reaction to a Redis problem is to look at OS-level metrics (CPU, memory, network) — but the interesting failure modes (eviction storms, latency spikes from KEYS-class commands, persistence stalls, replication backlog, big keys, hot keys, cluster MOVED rates) live inside Redis itself and are exposed through INFO, LATENCY, SLOWLOG, and a handful of related commands.

This guide is the production-monitoring layer for Redis: which INFO sections matter, how memory and eviction actually work, how to capture latency spikes with attribution, what to do about big and hot keys, and how the monitoring story shifts in Cluster, Sentinel, and managed-service deployments. It complements the external-uptime layer in Database Monitoring: MySQL, PostgreSQL & Redis Uptime.

The INFO Sections Worth Polling

INFO returns a categorised dump of Redis state. Poll a subset every 30-60 seconds; store deltas in a time-series store.

Section	What's there
`Memory`	`used_memory`, `maxmemory`, fragmentation ratio, peak
`Stats`	`total_connections_received`, `total_commands_processed`, `keyspace_hits`, `keyspace_misses`, `evicted_keys`, `expired_keys`, `rejected_connections`
`Clients`	`connected_clients`, `blocked_clients`, `tracking_clients`
`Replication`	role, connected_slaves/replicas, `master_repl_offset`, per-replica `offset` and `lag` (`slave_repl_offset` on the replica itself)
`Persistence`	`rdb_last_save_time`, `rdb_last_bgsave_status`, `aof_rewrite_in_progress`, `aof_last_rewrite_time_sec`
`CPU`	`used_cpu_sys`, `used_cpu_user`
`Commandstats`	per-command `calls`, `usec` (total time), `usec_per_call`; percentile latencies live in the separate `Latencystats` section (Redis 7+)

Run INFO MEMORY, INFO STATS, INFO REPLICATION, INFO CLIENTS, INFO PERSISTENCE, INFO COMMANDSTATS — never a bare INFO from a monitoring agent (avoids needlessly serialising the kitchen sink).
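A minimal sketch of the polling side, assuming the agent already has the raw text reply of the INFO calls above. The function names `parse_info` and `counter_deltas` and the sample values are illustrative, not part of any Redis client.

```python
def parse_info(text: str) -> dict[str, dict[str, str]]:
    """Parse raw INFO output into {section: {field: value}}."""
    sections: dict[str, dict[str, str]] = {}
    current = "default"
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Section headers look like "# Memory".
            current = line.lstrip("#").strip().lower()
            sections.setdefault(current, {})
            continue
        key, sep, value = line.partition(":")
        if sep:
            sections.setdefault(current, {})[key] = value
    return sections


def counter_deltas(prev: dict[str, str], curr: dict[str, str],
                   fields: list[str]) -> dict[str, int]:
    """Difference of ever-increasing counters between two polls.

    If a counter went backwards the server was restarted, so the
    current value is reported as the delta.
    """
    out = {}
    for f in fields:
        before, after = int(prev.get(f, 0)), int(curr.get(f, 0))
        out[f] = after - before if after >= before else after
    return out


SAMPLE = """\
# Memory
used_memory:12820000000
maxmemory:12884901888

# Stats
keyspace_hits:9200
keyspace_misses:800
evicted_keys:150
"""

info = parse_info(SAMPLE)
print(info["stats"]["evicted_keys"])
print(counter_deltas({"evicted_keys": "100"}, info["stats"], ["evicted_keys"]))
```

Storing deltas rather than raw totals keeps a restart from showing up as a huge negative spike on the graph.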

Memory and Eviction — The Single Most Important Topic

Almost every Redis-shaped incident traces back to memory. Two things to internalise:

`maxmemory` is a soft limit on data only

maxmemory caps the dataset size. Redis still uses memory beyond it for the COB (client output buffer), replication backlog, AOF buffer, and forked children during RDB / AOF rewrite. A Redis with maxmemory = 12GB on a 16GB box can OOM during BGSAVE because the forked child's copy-on-write footprint pushes total RSS over physical RAM. Always leave 30-50% headroom over maxmemory.
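The headroom arithmetic can be sketched as below. This reads "headroom over maxmemory" as spare host RAM expressed as a fraction of maxmemory, which is an assumption about the rule of thumb, and it ignores whatever else runs on the box.

```python
GIB = 1024 ** 3


def headroom_fraction(maxmemory_bytes: int, host_ram_bytes: int) -> float:
    """Host RAM left above maxmemory, as a fraction of maxmemory."""
    return (host_ram_bytes - maxmemory_bytes) / maxmemory_bytes


def largest_safe_maxmemory(host_ram_bytes: int, headroom: float = 0.5) -> int:
    """Largest maxmemory that still leaves `headroom` x maxmemory spare."""
    return int(host_ram_bytes / (1 + headroom))


# The 12GB-on-16GB example from the text sits at the low end of the range.
print(round(headroom_fraction(12 * GIB, 16 * GIB), 2))  # 0.33
print(largest_safe_maxmemory(16 * GIB, 0.5) / GIB)
```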

`maxmemory-policy` decides what eviction does

The policy options:

Policy	Behavior
`noeviction`	Write commands return errors when at maxmemory. Reads still work. Caches usually don't want this; data stores usually do
`allkeys-lru`	Evict the approximate-least-recently-used key from anywhere. Sensible default for caches
`allkeys-lfu`	Evict approximate-least-frequently-used. Better when hot vs cold differs by frequency, not recency
`volatile-lru` / `volatile-lfu`	Same, but only on keys with TTL set
`allkeys-random` / `volatile-random`	Random eviction — rarely the right choice
`volatile-ttl`	Evict keys with the soonest expiration

The single biggest configuration mistake we see: caches deployed with maxmemory-policy = noeviction. Then maxmemory fills up, every SET starts erroring, the application starts failing writes to the cache, and somehow nobody is sure why.

Eviction metrics

The headline counters:

evicted_keys — total since boot. Track per-second delta. A non-zero sustained delta on a cache is expected (cache is full); a sustained delta on a primary data store is a misconfiguration alarm.
keyspace_hits / keyspace_misses — hit ratio is hits / (hits + misses). Cache hit ratio < 80% is rarely worth the cache.
expired_keys — keys removed via TTL expiration, not eviction. Distinct from eviction.
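As a rough illustration, the three counters can be turned into windowed metrics from two consecutive polls. The helper names and sample numbers are made up for the example; the numbers are chosen to match the incident in the introduction (38% hit ratio, 50,000 evictions per minute).

```python
from typing import Optional


def hit_ratio(hits: int, misses: int) -> Optional[float]:
    """hits / (hits + misses); None when there was no lookup traffic."""
    total = hits + misses
    return hits / total if total else None


def window_stats(prev: dict, curr: dict, interval_s: float) -> dict:
    """Hit ratio and per-second eviction/expiry rates between two polls.

    Deltas are used instead of lifetime totals so that a recent collapse
    is not averaged away by months of healthy history.
    """
    fields = ("keyspace_hits", "keyspace_misses", "evicted_keys", "expired_keys")
    d = {k: int(curr[k]) - int(prev[k]) for k in fields}
    return {
        "hit_ratio": hit_ratio(d["keyspace_hits"], d["keyspace_misses"]),
        "evicted_per_s": d["evicted_keys"] / interval_s,
        "expired_per_s": d["expired_keys"] / interval_s,
    }


prev = {"keyspace_hits": 1_000_000, "keyspace_misses": 90_000,
        "evicted_keys": 10_000, "expired_keys": 5_000}
curr = {"keyspace_hits": 1_003_800, "keyspace_misses": 96_200,
        "evicted_keys": 60_000, "expired_keys": 5_600}
print(window_stats(prev, curr, 60.0))
```

Note that the lifetime ratio in this sample is still above 90%; only the windowed figure shows the collapse.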

Fragmentation ratio

mem_fragmentation_ratio = used_memory_rss / used_memory. The interesting bands:

1.0 – 1.5 — normal
1.5 – 2.0 — high; consider activedefrag yes (Redis 4+)
> 2.0 — the allocator is holding lots of unused memory; eventual restart needed
< 1.0 — Redis has swapped to disk; very bad — performance is now random-access disk speeds

Alert on mem_fragmentation_ratio < 1.0 immediately. It's not subtle: a Redis that is serving from swap is usually better taken out of rotation than left running.
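The bands can be encoded directly. This is a sketch: the band labels are invented for the example, and the exact boundary handling (whether 1.5 counts as normal or high) is a choice the text leaves open. Very small instances can show a high ratio without any real problem, so a production check would probably also look at absolute sizes.

```python
def classify_fragmentation(used_memory_rss: int, used_memory: int) -> tuple[float, str]:
    """Return (ratio, band) for the fragmentation bands described above."""
    ratio = used_memory_rss / used_memory
    if ratio < 1.0:
        band = "swap-suspected"   # RSS smaller than the logical dataset
    elif ratio <= 1.5:
        band = "normal"
    elif ratio <= 2.0:
        band = "high"             # candidate for activedefrag
    else:
        band = "excessive"        # allocator holding lots of unused memory
    return ratio, band


print(classify_fragmentation(used_memory_rss=9 * 1024**3, used_memory=12 * 1024**3))
```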

Latency Monitoring — `LATENCY` and `SLOWLOG`

Redis is single-threaded for command execution. A single slow command stalls everything. The two surfaces for catching this:

`LATENCY HISTORY` and `LATENCY LATEST`

Enable the latency monitor:

CONFIG SET latency-monitor-threshold 100

Redis now records any event that takes longer than 100ms into a ring buffer per event-type:

LATENCY LATEST
LATENCY HISTORY event-name
LATENCY DOCTOR

LATENCY DOCTOR returns a human-readable analysis with the top events, ranges, and likely causes. Run it in incident response. For monitoring, poll LATENCY LATEST every 30 seconds and ship the structured output to your time-series store.
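For shipping LATENCY LATEST to a time-series store, the reply first has to be flattened. The sketch below assumes the reply arrives as a nested list of [event, unix timestamp, latest ms, max ms], which is the shape a client such as redis-py would typically hand back from a raw command call; the sample values are invented.

```python
def parse_latency_latest(reply) -> list[dict]:
    """Turn a LATENCY LATEST reply into one dict per event type."""
    events = []
    for entry in reply:
        # Only the first four fields are used, in case a server adds more.
        name, ts, latest_ms, max_ms = entry[:4]
        if isinstance(name, bytes):
            name = name.decode()
        events.append({
            "event": name,
            "timestamp": int(ts),
            "latest_ms": int(latest_ms),
            "max_ms": int(max_ms),
        })
    return events


sample_reply = [
    [b"command", 1700000000, 251, 1001],
    [b"fork", 1700000100, 130, 180],
]
for row in parse_latency_latest(sample_reply):
    print(row)
```

Tagging each point with the event name keeps fork spikes and slow-command spikes on separate series.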

`SLOWLOG`

Redis keeps the N most recent commands whose execution time exceeded a threshold. The threshold is in microseconds, so 10000 means 10 ms:

CONFIG SET slowlog-log-slower-than 10000
CONFIG SET slowlog-max-len 1024
SLOWLOG GET 50

This is your "what query was actually slow" surface. The classic offenders:

KEYS pattern on a large dataset — O(N) scan, blocks the event loop. Forbid KEYS in production; use SCAN instead.
SMEMBERS, LRANGE 0 -1, HGETALL on large collections — O(N) where N can be huge
DEBUG SLEEP — someone left a debug command running
Large MGET / MSET batches — usually fine but worth knowing if they're in the top
Lua scripts (EVAL / EVALSHA) that loop over many keys

Wire SLOWLOG GET 100 into a monitoring panel that runs every minute and aggregates by normalised command. Anything new appearing at the top of the list is worth investigating.
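One possible shape for the aggregation step, assuming each slow log entry has already been fetched as a dict with `duration` (microseconds) and `command` keys. That dict shape is modelled on what redis-py's `slowlog_get()` returns and should be treated as an assumption; "normalised" here simply means the upper-cased command name.

```python
from collections import defaultdict


def aggregate_slowlog(entries: list[dict]) -> list[dict]:
    """Group slow log entries by command name, sorted by total time."""
    totals = defaultdict(lambda: {"count": 0, "total_us": 0, "max_us": 0})
    for e in entries:
        cmd = e["command"]
        if isinstance(cmd, bytes):
            cmd = cmd.decode(errors="replace")
        # Normalise to the command name only, dropping keys and arguments.
        name = cmd.split(" ", 1)[0].upper() if cmd else "UNKNOWN"
        row = totals[name]
        row["count"] += 1
        row["total_us"] += e["duration"]
        row["max_us"] = max(row["max_us"], e["duration"])
    return sorted(({"command": k, **v} for k, v in totals.items()),
                  key=lambda r: r["total_us"], reverse=True)


sample = [
    {"id": 1, "duration": 250_000, "command": b"KEYS session:*"},
    {"id": 2, "duration": 12_000, "command": b"HGETALL user:1"},
    {"id": 3, "duration": 15_000, "command": b"hgetall user:2"},
]
for row in aggregate_slowlog(sample):
    print(row)
```

Consecutive polls overlap, so a real job would also remember the highest entry `id` it has seen and skip anything at or below it.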

Per-command stats (Redis 7+)

INFO COMMANDSTATS

Returns per-command calls, usec (total time), and usec_per_call, plus rejected_calls and failed_calls since Redis 6.2. Percentile latencies (p50, p99, p99.9) come from INFO LATENCYSTATS in Redis 7+. This is the closest Redis equivalent to pg_stat_statements / events_statements_summary_by_digest. Sort by total time to find capacity bottlenecks.
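Sorting by total time could look like the sketch below. It assumes the section has already been parsed into a field-to-value mapping and that each line follows the `cmdstat_<name>:calls=...,usec=...,usec_per_call=...` layout; the sample figures are invented.

```python
def parse_commandstats(section: dict[str, str]) -> list[dict]:
    """Rank commands by cumulative execution time (usec), highest first."""
    rows = []
    for key, value in section.items():
        if not key.startswith("cmdstat_"):
            continue
        fields = dict(pair.split("=", 1) for pair in value.split(","))
        rows.append({
            "command": key[len("cmdstat_"):],
            "calls": int(fields["calls"]),
            "usec": int(fields["usec"]),
            "usec_per_call": float(fields["usec_per_call"]),
        })
    return sorted(rows, key=lambda r: r["usec"], reverse=True)


sample = {
    "cmdstat_get": "calls=1000,usec=2000,usec_per_call=2.00,rejected_calls=0,failed_calls=0",
    "cmdstat_hgetall": "calls=10,usec=50000,usec_per_call=5000.00",
}
for row in parse_commandstats(sample):
    print(row)
```

In the sample, HGETALL has one hundredth of the calls but dominates total time, which is exactly the case a calls-only dashboard would miss.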

Big Keys and Hot Keys

Two distinct failure modes:

Big keys

A single key (typically a List, Set, Hash, or Sorted Set) that has grown to hundreds of MB. Symptoms: any command touching it takes seconds, eviction policy can't shed it efficiently, replication of a write that touches it stalls.

Find them:

redis-cli --bigkeys

Or for surgical inspection:

redis-cli MEMORY USAGE key

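Or run SCAN periodically across the keyspace and check MEMORY USAGE on each key. Note that --bigkeys ranks collections by element count rather than bytes, so confirm suspects with MEMORY USAGE. The signal: any key > 10MB on a typical workload is suspicious; > 100MB needs an immediate plan.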
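A periodic scan job might be sketched as follows. The `client` is assumed to expose `scan_iter()` and `memory_usage()` as redis-py does; `FakeClient` is a stand-in invented here so the example runs without a server.

```python
def find_big_keys(client, threshold_bytes=10 * 1024 * 1024,
                  scan_count=1000, limit=None):
    """Walk the keyspace with SCAN; return (key, bytes) above the threshold."""
    found = []
    for i, key in enumerate(client.scan_iter(count=scan_count)):
        if limit is not None and i >= limit:
            break
        size = client.memory_usage(key)  # None if the key vanished meanwhile
        if size is not None and size >= threshold_bytes:
            found.append((key, size))
    return sorted(found, key=lambda kv: kv[1], reverse=True)


class FakeClient:
    """Hypothetical in-memory stand-in; use a real client in production."""

    def __init__(self, sizes):
        self._sizes = sizes

    def scan_iter(self, count=None):
        return iter(self._sizes)

    def memory_usage(self, key):
        return self._sizes.get(key)


MIB = 1024 * 1024
client = FakeClient({"a": 5, "b": 200 * MIB, "c": 20 * MIB})
print(find_big_keys(client))
```

MEMORY USAGE samples nested values on large collections, so the byte counts are estimates; running the job against a replica keeps the extra load off the primary.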

The fix: shard the key (user:123:notifications → user:123:notifications:202602), cap the size with LTRIM / ZREMRANGEBYRANK, or move it out of Redis to a more appropriate store.

Hot keys

A single key receiving an outsized share of traffic. Even small hot keys can become a bottleneck because Redis is single-threaded. Symptoms: high CPU on Redis, fine memory, decent hit ratio, but latency p99 climbing.

Find them:

redis-cli --hotkeys

(Requires maxmemory-policy of allkeys-lfu or volatile-lfu to collect frequency data.)

The fix: client-side caching for that key with a short TTL, replicating the key under several names so reads spread across copies, or sharding the access pattern.

Replication

INFO REPLICATION on the source shows role + connected_slaves + each replica's offset and lag:

role:master
connected_slaves:2
slave0:ip=10.0.1.42,port=6379,state=online,offset=894012348,lag=0
slave1:ip=10.0.1.43,port=6379,state=online,offset=894008923,lag=1
master_repl_offset:894012348

The headline metrics:

state should be online per replica
lag is in seconds based on last ack; > 5s is worth a look
offset gap between master_repl_offset and per-replica offset is the byte-distance
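The per-replica lines can be parsed into these three metrics. A sketch, assuming the INFO REPLICATION section is already available as a field-to-value mapping; the sample reuses the output shown above.

```python
def replica_status(replication: dict[str, str]) -> list[dict]:
    """Extract state, lag, and byte gap for each slaveN line on the source."""
    master_offset = int(replication["master_repl_offset"])
    replicas = []
    for key, value in replication.items():
        # Match slave0, slave1, ... but not slave_repl_offset and friends.
        if not (key.startswith("slave") and key[5:].isdigit()):
            continue
        f = dict(pair.split("=", 1) for pair in value.split(","))
        replicas.append({
            "addr": f"{f['ip']}:{f['port']}",
            "state": f["state"],
            "lag_s": int(f["lag"]),
            "offset_gap_bytes": master_offset - int(f["offset"]),
        })
    return replicas


sample = {
    "role": "master",
    "connected_slaves": "2",
    "slave0": "ip=10.0.1.42,port=6379,state=online,offset=894012348,lag=0",
    "slave1": "ip=10.0.1.43,port=6379,state=online,offset=894008923,lag=1",
    "master_repl_offset": "894012348",
}
for r in replica_status(sample):
    print(r)
```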

On the replica side:

role:slave
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0
slave_repl_offset:894012348

Watch:

master_link_status:down — replication broken
master_sync_in_progress:1 for sustained periods — full resync ongoing, replica unusable for serving reads
repl_backlog_size exhausted (replication backlog buffer too small for the disconnect duration) forces a full resync — sized via repl-backlog-size

Cluster mode replication

In Cluster mode each shard is a primary + N replicas. CLUSTER NODES and CLUSTER INFO show the topology. Monitor:

cluster_state:ok — anything else is broken
cluster_slots_assigned:16384 — all slots covered
cluster_slots_pfail / cluster_slots_fail — non-zero = node-failure detection in progress

Persistence

Redis offers two persistence modes, often combined:

RDB snapshots

BGSAVE forks a child that writes a point-in-time snapshot. Monitor:

rdb_last_save_time — Unix timestamp of last successful save
rdb_last_bgsave_status — must be ok
rdb_last_bgsave_time_sec — duration of last save; a sudden climb means the dataset grew or disk slowed

The "fork latency spike": when Redis forks a child for BGSAVE, the parent is paused while the kernel copies its page tables, and the pause grows with dataset size. On a 12GB dataset this can be 100-200ms. Capture it with LATENCY HISTORY fork; latest_fork_usec in INFO STATS reports the duration of the most recent fork.

AOF

appendonly yes writes every command to an append-only log. Monitor:

aof_enabled — should match your config
aof_rewrite_in_progress — 1 during rewrite, ok; sustained at 1 means rewrite stalled
aof_last_rewrite_time_sec — rewrite duration
aof_pending_bio_fsync — pending fsyncs on the background IO thread; > 0 sustained = disk struggling

Rewrite stalls (aof_last_rewrite_time_sec growing across runs) are usually disk-bound. A stalled rewrite does not block writes, but the AOF keeps growing uncompacted, so sustained write traffic eventually fills the disk.
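A single-sample persistence check might look like the sketch below, taking the INFO PERSISTENCE section as a field-to-value mapping. The one-hour `max_rdb_age_s` default is an arbitrary assumption that should track your save schedule, and "sustained" conditions need several samples, which this does not attempt.

```python
def persistence_problems(p: dict[str, str], now: float,
                         max_rdb_age_s: int = 3600) -> list[str]:
    """Return human-readable findings from one INFO PERSISTENCE sample."""
    problems = []
    if p.get("rdb_last_bgsave_status", "ok") != "ok":
        problems.append("last BGSAVE failed")
    age = int(now) - int(p.get("rdb_last_save_time", 0))
    if age > max_rdb_age_s:
        problems.append(f"last RDB save was {age}s ago")
    if p.get("aof_enabled") == "1":
        pending = int(p.get("aof_pending_bio_fsync", 0))
        if pending > 0:
            problems.append(f"{pending} AOF fsync(s) pending")
    return problems


sample = {
    "rdb_last_bgsave_status": "err",
    "rdb_last_save_time": "1000",
    "aof_enabled": "1",
    "aof_pending_bio_fsync": "2",
}
print(persistence_problems(sample, now=8200))
```

In a real agent `now` would come from `time.time()`; it is passed in here so the function stays deterministic and easy to test.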

Connections and Clients

INFO CLIENTS

connected_clients — total client connections
blocked_clients — clients waiting in a blocking call (BLPOP, BRPOP, XREAD BLOCK, WAIT, etc.); this is expected for queue and stream consumers. Pub/sub subscribers are not counted here
maxclients — hard cap (default 10,000)
rejected_connections from INFO STATS — incremented when maxclients is hit

Alert when connected_clients > 80% of maxclients and when rejected_connections grows.

Client output buffer

A subscriber that can't keep up with a fast publisher accumulates output in its client output buffer. The buffer has a limit; when reached, the client is disconnected (client-output-buffer-limit). Symptoms: pub/sub subscribers silently disconnecting. Monitor CLIENT LIST periodically for clients with high omem.
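Scanning CLIENT LIST for large output buffers can be sketched like this, assuming the raw text reply (one client per line, space-separated `field=value` pairs). The threshold and the sample lines are illustrative.

```python
def clients_over_omem(client_list_text: str, threshold_bytes: int) -> list[dict]:
    """Return clients whose output buffer memory (omem) is at or above the threshold."""
    flagged = []
    for line in client_list_text.splitlines():
        fields = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
        if not fields:
            continue
        omem = int(fields.get("omem", 0))
        if omem >= threshold_bytes:
            flagged.append({
                "id": fields.get("id"),
                "addr": fields.get("addr"),
                "name": fields.get("name", ""),
                "cmd": fields.get("cmd"),
                "omem": omem,
            })
    return sorted(flagged, key=lambda c: c["omem"], reverse=True)


sample = (
    "id=7 addr=10.0.1.9:51234 name=worker-a age=120 idle=0 flags=P "
    "sub=1 obl=0 oll=412 omem=8437760 cmd=subscribe\n"
    "id=8 addr=10.0.1.10:40022 name= age=15 idle=0 flags=N "
    "sub=0 obl=0 oll=0 omem=0 cmd=get\n"
)
print(clients_over_omem(sample, threshold_bytes=1024 * 1024))
```

Comparing omem against the configured client-output-buffer-limit for the client's class gives earlier warning than waiting for the disconnect.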

Cluster Mode Specifics

In Cluster mode, monitoring adds:

MOVED / ASK error rates — a client routed a request to the wrong node. Some MOVED traffic is normal during failover; sustained MOVED rates mean clients haven't refreshed their slot map. Most modern clients handle this; legacy clients don't.
CLUSTER FAILOVER events — track them as discrete events with timestamps
Per-slot key counts — CLUSTER COUNTKEYSINSLOT per slot; sustained imbalance suggests a key-distribution problem
Cross-slot operations — disallowed by Cluster; client errors show in application logs. Worth tracking from the application side
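Cross-slot errors follow from how Cluster maps keys to slots: CRC16 of the key modulo 16384, hashing only the text inside the first non-empty {...} hash tag when one is present. The sketch below reproduces that mapping client-side so an application can check a multi-key call before sending it; `key_slot` and `same_slot` are names made up for the example, and the CRC relies on Python's `binascii.crc_hqx` with an initial value of 0.

```python
import binascii


def key_slot(key: str) -> int:
    """Compute the Cluster hash slot for a key, honouring hash tags."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end > start + 1:  # non-empty tag between the first '{' and next '}'
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % 16384


def same_slot(*keys: str) -> bool:
    """True when a multi-key command on these keys would not be cross-slot."""
    return len({key_slot(k) for k in keys}) == 1


print(key_slot("foo"))
print(same_slot("{user1000}.following", "{user1000}.followers"))
print(same_slot("user1000.following", "user1001.following"))
```

Counting how often `same_slot` fails in application code gives the cross-slot signal without waiting for errors in the logs.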

Sentinel Mode

Sentinel coordinates failover for non-Cluster setups. Monitor:

Sentinel count alive (quorum requirement)
Last failover timestamp
Master/replica known to Sentinel matches actual topology
+sdown / +odown / +failover-triggered / +switch-master events from the Sentinel pub/sub channel (+switch-master marks a completed promotion)

Managed Redis (ElastiCache, Memorystore, Upstash, Dragonfly, Redis Cloud)

AWS ElastiCache

CloudWatch covers most counters: CPUUtilization, EngineCPUUtilization (the single-threaded measure, and the one that matters), DatabaseMemoryUsagePercentage, Evictions, ReplicationLag, CurrConnections. Cluster mode includes per-shard metrics. For top slow commands, ElastiCache can deliver the Redis slow log to CloudWatch Logs or Kinesis Data Firehose.

EngineCPUUtilization > 80% is the canonical "Redis is the bottleneck" signal — it measures the single-threaded event loop, which CPUUtilization doesn't isolate.

Google Memorystore

Cloud Monitoring covers similar ground. The metric names differ but the structure is the same.

Upstash

Upstash's HTTP/REST API model changes the connection-monitoring story (no persistent connections) but the keyspace metrics, eviction, and slowlog still apply. The Upstash dashboard exposes daily request count and bandwidth.

Dragonfly

Dragonfly is a multi-threaded Redis-compatible drop-in (2023+). Many of the single-threaded constraints (KEYS blocking the planet) don't apply, but the same monitoring surfaces (INFO, SLOWLOG, LATENCY) work because of API compatibility. Some Dragonfly-specific metrics (shard_count, per-shard CPU) need their own panels.

Redis Cloud / Redis Enterprise

Redis Inc.'s managed service exposes the same surfaces plus enterprise-specific (proxy CPU, shard placement) metrics.

What to Alert On

Critical (page)

used_memory > 95% of maxmemory and policy is noeviction
mem_fragmentation_ratio < 1.0 (swap-backed Redis)
evicted_keys rate > 10× baseline (eviction storm)
master_link_status:down on any replica
cluster_state != ok
rdb_last_bgsave_status != ok
rejected_connections growth (connection saturation)
EngineCPUUtilization > 90% (ElastiCache) or equivalent single-thread CPU

High (notification)

Cache hit ratio < 80% on a workload where it was historically > 90%
p99 latency > 10ms sustained (Redis should serve from RAM in single-digit ms)
Big key detected via --bigkeys or MEMORY USAGE > 100MB
aof_last_rewrite_time_sec growing across runs (AOF rewriter falling behind)
repl_backlog_active = 0 (replicas relying on full resync after disconnect)
New slow command appears in SLOWLOG
blocked_clients > 10× baseline (queue backlog forming)

Informational

mem_fragmentation_ratio > 1.5 (defrag opportunity)
Per-shard imbalance in Cluster mode
A new command appearing in top of INFO COMMANDSTATS
Sentinel failover event logged

See Alert Fatigue: Notifications That Get Acted On for the broader low-noise alerting principles.

Redis Monitoring Checklist

Polling agent runs INFO MEMORY, INFO STATS, INFO REPLICATION, INFO CLIENTS, INFO PERSISTENCE, INFO COMMANDSTATS every 30-60s
CONFIG SET latency-monitor-threshold 100 enabled; LATENCY LATEST polled
slowlog-log-slower-than 10000 configured; SLOWLOG GET shipped to logs
Hit ratio (keyspace_hits / (hits + misses)) tracked
Eviction rate (evicted_keys delta) tracked
Fragmentation ratio tracked; alert on < 1.0
maxmemory and maxmemory-policy reviewed for every Redis instance (no noeviction on caches by accident)
Host RAM leaves 30-50% headroom above maxmemory (room for fork + AOF buffers)
Big-key job: weekly --bigkeys or periodic MEMORY USAGE scan
Hot-key job: --hotkeys (with LFU policy enabled) on the workload
Replication lag (offset gap + lag seconds) tracked per replica
repl-backlog-size sized to cover expected disconnect duration
Persistence: rdb_last_bgsave_status, AOF rewrite duration, fork-latency events
blocked_clients and rejected_connections tracked
Cluster: per-shard cluster_state, slot assignment, MOVED rates
Sentinel: quorum count, failover events
Managed-service: vendor-specific single-thread CPU metric (EngineCPUUtilization on ElastiCache, equivalents elsewhere)
KEYS disallowed in production (clients use SCAN)
External uptime monitoring complements internal — see Database Monitoring foundation
If Redis backs queues (BullMQ, Sidekiq) — also see Job Queue Monitoring
If Redis backs rate limiting — also see API Rate Limit Monitoring

How Webalert Helps With Redis Monitoring

Webalert provides the external-monitoring layer that complements your in-Redis telemetry:

HTTP monitoring — Watch the API endpoints backed by Redis (caches, sessions, rate limiters); when cache hit ratio collapses, you see it as edge latency
Content validation — Hit an internal /internal/redis-health endpoint that surfaces INFO MEMORY, hit ratio, eviction rate, and replication state; alert when thresholds are crossed
CDN + Redis cache layering — see CDN Monitoring: Edge Cache & Origin Uptime
Multi-region checks — Redis caches replicated to multiple regions; multi-region monitoring confirms freshness from the user's vantage point
Status page — Communicate "we're seeing elevated cache latency" without dumping internal metrics
Multi-channel alerts — Email, SMS, Slack, Discord, Microsoft Teams, webhooks
1-minute check intervals — Detect outages within 60 seconds
5-minute setup — Add endpoints, set thresholds, done

See features and pricing for details.

Summary

Redis is single-threaded for command execution; almost every incident relates to memory, latency from a slow command, or replication / persistence.
Poll INFO selectively: MEMORY, STATS, REPLICATION, CLIENTS, PERSISTENCE, COMMANDSTATS.
maxmemory is a soft limit on data; leave 30-50% OS headroom for fork, AOF buffers, replication backlog.
maxmemory-policy is the most commonly misconfigured setting — noeviction on a cache is a guaranteed incident.
Eviction rate, hit ratio, fragmentation ratio are the headline memory metrics. Fragmentation < 1.0 means Redis has swapped — page immediately.
Enable LATENCY monitoring and SLOWLOG. They are the closest equivalents Redis has to query monitoring.
INFO COMMANDSTATS is the per-command workload picture; sort by total time. Percentile latencies are in INFO LATENCYSTATS (Redis 7+).
Big keys block the event loop; find with --bigkeys and MEMORY USAGE; shard or cap them.
Hot keys saturate the single CPU thread; find with --hotkeys (requires LFU policy).
Replication: per-replica offset, lag, link status; size repl-backlog-size correctly to avoid full resyncs.
Persistence: BGSAVE status, fork-latency spikes, AOF rewrite duration.
Cluster: cluster_state, slot coverage, MOVED rates.
Managed Redis (ElastiCache, Memorystore, Upstash, Dragonfly, Redis Cloud) layers vendor tooling on top; EngineCPUUtilization is the single-thread signal that matters most.

Redis problems are almost always memory problems, slow-command problems, or persistence problems — all three are visible in counters before they become user-visible incidents. Build the monitoring foundation once — selective INFO polling, LATENCY / SLOWLOG, hit ratio, eviction rate, big-key job, replication panels — and the next cache eviction storm shows up on a graph 10 minutes before it shows up on the pager.
