Redis Production Monitoring: Memory, Eviction & Latency

Webalert Team
May 19, 2026
15 min read

The on-call channel pings with "API latency p99 just doubled". The application is the same; the database looks fine. Someone glances at Redis: INFO memory shows used_memory_human:11.94G / maxmemory_human:12.00G and evicted_keys is climbing by 50,000 per minute. Half the cache is being evicted as fast as it's being written, the hit ratio has collapsed from 92% to 38%, and every cache miss is now a slow query against the primary database. The Redis box is the proximate cause; an unbounded growth in one specific cache key family is the actual cause, and nobody's been watching it.

Redis is the component teams reach for when something needs to be fast, which means everyone has it in production and almost nobody monitors it well. The default reaction to a Redis problem is to look at OS-level metrics (CPU, memory, network) — but the interesting failure modes (eviction storms, latency spikes from KEYS-class commands, persistence stalls, replication backlog, big keys, hot keys, cluster MOVED rates) live inside Redis itself and are exposed through INFO, LATENCY, SLOWLOG, and a handful of related commands.

This guide is the production-monitoring layer for Redis: which INFO sections matter, how memory and eviction actually work, how to capture latency spikes with attribution, what to do about big and hot keys, and how the monitoring story shifts in Cluster, Sentinel, and managed-service deployments. It complements the external-uptime layer in Database Monitoring: MySQL, PostgreSQL & Redis Uptime.


The INFO Sections Worth Polling

INFO returns a categorised dump of Redis state. Poll a subset every 30-60 seconds; store deltas in a time-series store.

Section | What's there
Memory | used_memory, maxmemory, fragmentation ratio, peak
Stats | total_connections_received, total_commands_processed, keyspace_hits, keyspace_misses, evicted_keys, expired_keys, rejected_connections
Clients | connected_clients, blocked_clients, tracking_clients
Replication | role, connected_slaves/replicas, master_repl_offset, per-replica replica_repl_offset
Persistence | rdb_last_save_time, rdb_last_bgsave_status, aof_rewrite_in_progress, aof_last_rewrite_time_sec
CPU | used_cpu_sys, used_cpu_user
Commandstats | per-command call count, total time, average time per call (p50/p99 percentiles are in INFO LATENCYSTATS, Redis 7+)

Run INFO MEMORY, INFO STATS, INFO REPLICATION, INFO CLIENTS, INFO PERSISTENCE, INFO COMMANDSTATS — never a bare INFO from a monitoring agent (avoids needlessly serialising the kitchen sink).
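As a rough illustration of "store deltas", the sketch below parses the text reply of an INFO section into a dictionary and turns a cumulative counter into a per-second rate between two polls. The function names (parse_info, counter_rate) are hypothetical, and treating a negative delta as a server restart is an assumption of this sketch.

```python
def parse_info(text: str) -> dict[str, str]:
    """Parse the text reply of an INFO <section> call into a flat dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "# Memory".
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def counter_rate(prev: dict[str, str], curr: dict[str, str],
                 name: str, interval_s: float) -> float:
    """Per-second rate of a cumulative counter between two polls."""
    delta = int(curr[name]) - int(prev[name])
    # Assumption: a negative delta means the server restarted and the
    # counter started again from zero.
    if delta < 0:
        delta = int(curr[name])
    return delta / interval_s
```

With two polls taken 60 seconds apart, counter_rate(prev, curr, "evicted_keys", 60) gives evictions per second, which is the value worth graphing rather than the raw counter.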


Memory and Eviction — The Single Most Important Topic

Almost every Redis-shaped incident traces back to memory. Two things to internalise:

maxmemory is a soft limit on data only

maxmemory caps the dataset size. Redis still uses memory beyond it for the COB (client output buffer), replication backlog, AOF buffer, and forked children during RDB / AOF rewrite. A Redis with maxmemory = 12GB on a 16GB box can OOM during BGSAVE because the forked child's copy-on-write footprint pushes total RSS over physical RAM. Always leave 30-50% headroom over maxmemory.

maxmemory-policy decides what eviction does

The policy options:

Policy | Behavior
noeviction | Write commands return errors when at maxmemory. Reads still work. Caches usually don't want this; data stores usually do
allkeys-lru | Evict the approximate-least-recently-used key from anywhere. Sensible default for caches
allkeys-lfu | Evict approximate-least-frequently-used. Better when hot vs cold differs by frequency, not recency
volatile-lru / volatile-lfu | Same, but only on keys with TTL set
allkeys-random / volatile-random | Random eviction; rarely the right choice
volatile-ttl | Evict keys with the soonest expiration

The single biggest configuration mistake we see: caches deployed with maxmemory-policy = noeviction. Then maxmemory fills up, every SET starts erroring, the application starts failing writes to the cache, and somehow nobody is sure why.

Eviction metrics

The headline counters:

  • evicted_keys — total since boot. Track per-second delta. A non-zero sustained delta on a cache is expected (cache is full); a sustained delta on a primary data store is a misconfiguration alarm.
  • keyspace_hits / keyspace_misses — hit ratio is keyspace_hits / (keyspace_hits + keyspace_misses). A cache whose hit ratio stays below 80% is rarely worth running.
  • expired_keys — keys removed via TTL expiration, not eviction. Distinct from eviction.
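Because keyspace_hits and keyspace_misses are cumulative since boot, a ratio computed from the raw counters can hide a recent collapse. A minimal sketch, assuming two polls of INFO STATS already parsed into dictionaries (the function name interval_hit_ratio is hypothetical):

```python
from typing import Optional


def interval_hit_ratio(prev: dict, curr: dict) -> Optional[float]:
    """Hit ratio over the interval between two INFO STATS polls."""
    hits = int(curr["keyspace_hits"]) - int(prev["keyspace_hits"])
    misses = int(curr["keyspace_misses"]) - int(prev["keyspace_misses"])
    total = hits + misses
    # No lookups in the interval: the ratio is undefined, not zero.
    if total <= 0:
        return None
    return hits / total
```

Returning None for an idle interval avoids plotting a misleading 0% when the cache simply received no reads.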

Fragmentation ratio

mem_fragmentation_ratio = used_memory_rss / used_memory. The interesting bands:

  • 1.0 – 1.5 — normal
  • 1.5 – 2.0 — high; consider activedefrag yes (Redis 4+)
  • > 2.0 — the allocator is holding lots of unused memory; eventual restart needed
  • < 1.0 — Redis has swapped to disk; very bad — performance is now random-access disk speeds

Alert on mem_fragmentation_ratio < 1.0 immediately. It's not subtle: swap-backed Redis is faster to take out of rotation than to leave running.
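The bands above can be turned into a small classifier. This is a sketch only: the band labels are made-up names, and the boundaries simply mirror the list above.

```python
def classify_fragmentation(used_memory_rss: int, used_memory: int) -> str:
    """Map used_memory_rss / used_memory onto the bands described above."""
    if used_memory <= 0:
        raise ValueError("used_memory must be positive")
    ratio = used_memory_rss / used_memory
    if ratio < 1.0:
        return "swapping"    # RSS below logical usage: pages are in swap
    if ratio <= 1.5:
        return "normal"
    if ratio <= 2.0:
        return "high"        # candidate for activedefrag
    return "very_high"       # allocator holding a lot of unused memory
```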


Latency Monitoring — LATENCY and SLOWLOG

Redis is single-threaded for command execution. A single slow command stalls everything. The two surfaces for catching this:

LATENCY HISTORY and LATENCY LATEST

Enable the latency monitor:

CONFIG SET latency-monitor-threshold 100

Redis now records any event that takes longer than 100ms into a ring buffer per event-type:

LATENCY LATEST
LATENCY HISTORY event-name
LATENCY DOCTOR

LATENCY DOCTOR returns a human-readable analysis with the top events, ranges, and likely causes. Run it in incident response. For monitoring, poll LATENCY LATEST every 30 seconds and ship the structured output to your time-series store.
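LATENCY LATEST replies with one entry per event, each carrying the event name, the Unix timestamp of the latest spike, the latest latency in milliseconds, and the all-time maximum in milliseconds. A minimal sketch of flattening that reply for a time-series store, assuming the reply arrives as nested lists (for example from a generic client call such as redis-py's execute_command("LATENCY", "LATEST")); the function name is hypothetical:

```python
def latency_latest_to_metrics(reply: list) -> dict[str, dict[str, int]]:
    """Convert a LATENCY LATEST reply into {event: {timestamp, latest_ms, max_ms}}."""
    metrics = {}
    for entry in reply:
        # Only the first four fields are used; any extra fields are ignored.
        name, timestamp, latest_ms, max_ms = entry[:4]
        if isinstance(name, bytes):
            name = name.decode()
        metrics[name] = {
            "timestamp": int(timestamp),
            "latest_ms": int(latest_ms),
            "max_ms": int(max_ms),
        }
    return metrics
```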

SLOWLOG

Redis tracks the N most-recent commands that exceeded a threshold. The threshold is in microseconds, so 10000 means 10 ms:

CONFIG SET slowlog-log-slower-than 10000
CONFIG SET slowlog-max-len 1024
SLOWLOG GET 50

This is your "what query was actually slow" surface. The classic offenders:

  • KEYS pattern on a large dataset — O(N) scan, blocks the event loop. Forbid KEYS in production; use SCAN instead.
  • SMEMBERS, LRANGE 0 -1, HGETALL on large collections — O(N) where N can be huge
  • DEBUG SLEEP — someone left a debug command running
  • Large MGET / MSET batches — usually fine but worth knowing if they're in the top
  • Lua scripts (EVAL / EVALSHA) that loop over many keys

Wire SLOWLOG GET 100 into a monitoring panel that runs every minute and aggregates by normalised command. Anything new appearing at the top of the list is worth investigating.
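One way to "aggregate by normalised command" is to group slow-log entries by command name and drop the arguments. The sketch below assumes each entry is a dictionary with a "command" field (string or bytes, arguments included) and a "duration" field in microseconds, which is roughly the shape redis-py's slowlog_get() returns; treat that shape as an assumption and adapt it to your client.

```python
from collections import defaultdict


def aggregate_slowlog(entries: list[dict]) -> list[tuple[str, dict]]:
    """Group slow-log entries by command name, sorted by total time spent."""
    totals = defaultdict(lambda: {"count": 0, "total_us": 0, "max_us": 0})
    for entry in entries:
        command = entry["command"]
        if isinstance(command, bytes):
            command = command.decode(errors="replace")
        parts = command.split()
        # Normalise: keep only the upper-cased command name, drop arguments.
        name = parts[0].upper() if parts else "UNKNOWN"
        stats = totals[name]
        stats["count"] += 1
        stats["total_us"] += int(entry["duration"])
        stats["max_us"] = max(stats["max_us"], int(entry["duration"]))
    return sorted(totals.items(), key=lambda kv: kv[1]["total_us"], reverse=True)
```

Comparing the command names from the current run against the previous run is a simple way to spot a new slow command.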

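Per-command stats

INFO COMMANDSTATS

Returns per-command call count (calls), total time in microseconds (usec), average time per call (usec_per_call), and, since Redis 6.2, rejected_calls and failed_calls. Per-command latency percentiles (p50, p99, p99.9) come from INFO LATENCYSTATS in Redis 7+. Together these are the closest Redis equivalent to pg_stat_statements / events_statements_summary_by_digest. Sort by total time to find capacity bottlenecks.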
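A minimal sketch of sorting by total time, assuming the raw text reply of INFO COMMANDSTATS with lines of the form cmdstat_get:calls=...,usec=...,usec_per_call=... (the helper names are hypothetical):

```python
def parse_commandstats(text: str) -> dict[str, dict[str, float]]:
    """Parse INFO COMMANDSTATS text into {command: {field: value}}."""
    stats = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("cmdstat_"):
            continue
        name, _, rest = line[len("cmdstat_"):].partition(":")
        pairs = (pair.split("=", 1) for pair in rest.split(",") if "=" in pair)
        stats[name] = {key: float(value) for key, value in pairs}
    return stats


def top_by_total_time(stats: dict[str, dict[str, float]], n: int = 5) -> list[str]:
    """Command names ordered by cumulative time (the usec field), largest first."""
    return sorted(stats, key=lambda cmd: stats[cmd].get("usec", 0.0), reverse=True)[:n]
```

Since usec is cumulative since the last restart or CONFIG RESETSTAT, the same delta-between-polls treatment used for other counters applies here too.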


Big Keys and Hot Keys

Two distinct failure modes:

Big keys

A single key (typically a List, Set, Hash, or Sorted Set) that has grown to hundreds of MB. Symptoms: any command touching it takes seconds, eviction policy can't shed it efficiently, replication of a write that touches it stalls.

Find them:

redis-cli --bigkeys

Or for surgical inspection:

redis-cli MEMORY USAGE key

Or run SCAN periodically across the keyspace and check MEMORY USAGE on each. The signal: any key > 10MB on a typical workload is suspicious; > 100MB needs an immediate plan.
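The periodic SCAN plus MEMORY USAGE job could look roughly like the sketch below. It takes any iterable of keys and any callable that returns a size in bytes, so it is not tied to a particular client; with redis-py, client.scan_iter(count=1000) and client.memory_usage would be natural arguments. The function name and the 10 MB default (taken from the threshold above) are choices of this sketch.

```python
from typing import Callable, Iterable, Optional


def find_big_keys(
    keys: Iterable,
    memory_usage: Callable[[object], Optional[int]],
    threshold_bytes: int = 10 * 1024 * 1024,
) -> list[tuple]:
    """Return (key, size_bytes) pairs above the threshold, largest first."""
    big = []
    for key in keys:
        size = memory_usage(key)
        # MEMORY USAGE returns nil when the key disappeared between
        # the SCAN step and the size check (expired or deleted).
        if size is not None and size > threshold_bytes:
            big.append((key, size))
    big.sort(key=lambda item: item[1], reverse=True)
    return big
```

Running this against a replica rather than the primary keeps the extra MEMORY USAGE calls off the hot path.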

The fix: shard the key (user:123:notifications → user:123:notifications:202602), cap the size with LTRIM / ZREMRANGEBYRANK, or move it out of Redis to a more appropriate store.

Hot keys

A single key receiving an outsized share of traffic. Even small hot keys can become a bottleneck because Redis is single-threaded. Symptoms: high CPU on Redis, fine memory, decent hit ratio, but latency p99 climbing.

Find them:

redis-cli --hotkeys

(Requires maxmemory-policy of allkeys-lfu or volatile-lfu to collect frequency data.)

The fix: client-side caching for that key with TTL, key-level replication, or sharding the access pattern.


Replication

INFO REPLICATION on the source shows role + connected_slaves + each replica's offset and lag:

role:master
connected_slaves:2
slave0:ip=10.0.1.42,port=6379,state=online,offset=894012348,lag=0
slave1:ip=10.0.1.43,port=6379,state=online,offset=894008923,lag=1
master_repl_offset:894012348

The headline metrics:

  • state should be online per replica
  • lag is in seconds based on last ack; > 5s is worth a look
  • offset gap between master_repl_offset and per-replica offset is the byte-distance
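The byte-distance per replica can be derived from the fields shown above. A minimal sketch, assuming INFO REPLICATION has already been parsed into a dictionary of key to value strings (the function name is hypothetical):

```python
def replica_offset_gaps(info: dict[str, str]) -> dict[str, dict]:
    """Per-replica state, lag in seconds, and byte gap behind the source."""
    master_offset = int(info["master_repl_offset"])
    replicas = {}
    for key, value in info.items():
        # Replica lines are named slave0, slave1, ... in INFO REPLICATION.
        if not (key.startswith("slave") and key[len("slave"):].isdigit()):
            continue
        fields = dict(pair.split("=", 1) for pair in value.split(","))
        replicas[key] = {
            "addr": f'{fields["ip"]}:{fields["port"]}',
            "state": fields["state"],
            "lag_s": int(fields["lag"]),
            "offset_gap": master_offset - int(fields["offset"]),
        }
    return replicas
```

For the sample output above, slave1 is 3,425 bytes behind the source.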

On the replica side:

role:slave
master_link_status:up
master_last_io_seconds_ago:1
master_sync_in_progress:0
slave_repl_offset:894012348

Watch:

  • master_link_status:down — replication broken
  • master_sync_in_progress:1 for sustained periods — full resync ongoing, replica unusable for serving reads
  • repl_backlog_size exhausted (replication backlog buffer too small for the disconnect duration) forces a full resync — sized via repl-backlog-size

Cluster mode replication

In Cluster mode each shard is a primary + N replicas. CLUSTER NODES and CLUSTER INFO show the topology. Monitor:

  • cluster_state:ok — anything else is broken
  • cluster_slots_assigned:16384 — all slots covered
  • cluster_slots_pfail / cluster_slots_fail — non-zero = node-failure detection in progress
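These three checks are simple to automate against the text reply of CLUSTER INFO. A sketch, with a hypothetical function name and message wording:

```python
def cluster_problems(cluster_info_text: str) -> list[str]:
    """Return human-readable problems found in a CLUSTER INFO reply."""
    fields = {}
    for line in cluster_info_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key] = value

    problems = []
    if fields.get("cluster_state") != "ok":
        problems.append(f"cluster_state is {fields.get('cluster_state')}")
    if int(fields.get("cluster_slots_assigned", 0)) != 16384:
        problems.append("not all 16384 slots are assigned")
    for name in ("cluster_slots_pfail", "cluster_slots_fail"):
        if int(fields.get(name, 0)) > 0:
            problems.append(f"{name} is {fields[name]}")
    return problems
```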

Persistence

Redis offers two persistence modes, often combined:

RDB snapshots

BGSAVE forks a child that writes a point-in-time snapshot. Monitor:

  • rdb_last_save_time — Unix timestamp of last successful save
  • rdb_last_bgsave_status — must be ok
  • rdb_last_bgsave_time_sec — duration of last save; a sudden climb means the dataset grew or disk slowed
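A status of ok only says the last attempt succeeded, so it helps to also check how old the last snapshot is. A minimal sketch, assuming INFO PERSISTENCE parsed into a dictionary; the one-hour max_age_s default is an arbitrary placeholder to replace with your own snapshot interval:

```python
import time
from typing import Optional


def rdb_problems(info: dict[str, str], max_age_s: int = 3600,
                 now: Optional[float] = None) -> list[str]:
    """Flag a failed BGSAVE or a snapshot older than max_age_s seconds."""
    now = time.time() if now is None else now
    problems = []
    status = info.get("rdb_last_bgsave_status")
    if status != "ok":
        problems.append(f"last BGSAVE status is {status}")
    age = now - int(info["rdb_last_save_time"])
    if age > max_age_s:
        problems.append(f"last successful save is {int(age)}s old")
    return problems
```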

The "fork latency spike": when Redis forks a child for BGSAVE on a memory-pressured host, the OS pauses the parent for the duration of the copy-on-write page-table copy. On a 12GB dataset this can be 100-200ms. Capture in LATENCY HISTORY fork.

AOF

appendonly yes writes every write command to an append-only log. Monitor:

  • aof_enabled — should match your config
  • aof_rewrite_in_progress — 1 during rewrite, ok; sustained at 1 means rewrite stalled
  • aof_last_rewrite_time_sec — rewrite duration
  • aof_pending_bio_fsync — pending fsyncs on the background IO thread; > 0 sustained = disk struggling

Rewrite stalls (aof_last_rewrite_time_sec growing across runs) are usually disk-bound. A stalled rewrite does not block writes, but it does block compaction: the AOF keeps growing, write traffic eventually outpaces the rewriter, and the disk fills.


Connections and Clients

INFO CLIENTS
  • connected_clients — total client connections
  • blocked_clients — clients waiting in blocking calls such as BLPOP or WAIT; a steady non-zero value is expected for queue workers
  • maxclients — hard cap (default 10,000)
  • rejected_connections from INFO STATS — incremented when maxclients is hit

Alert when connected_clients > 80% of maxclients and when rejected_connections grows.

Client output buffer

A subscriber that can't keep up with a fast publisher accumulates output in its client output buffer. The buffer has a limit; when reached, the client is disconnected (client-output-buffer-limit). Symptoms: pub/sub subscribers silently disconnecting. Monitor CLIENT LIST periodically for clients with high omem.
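CLIENT LIST returns one line per connection made of space-separated key=value fields, including omem (output buffer memory in bytes). A sketch of picking out the clients worth a look; the 1 MB default threshold and the function name are assumptions of this example:

```python
def clients_with_high_omem(client_list_text: str,
                           threshold_bytes: int = 1024 * 1024) -> list[dict]:
    """Return clients whose output buffer memory exceeds the threshold."""
    flagged = []
    for line in client_list_text.splitlines():
        fields = dict(pair.split("=", 1) for pair in line.split() if "=" in pair)
        if not fields:
            continue
        omem = int(fields.get("omem", 0))
        if omem > threshold_bytes:
            flagged.append({
                "id": fields.get("id"),
                "addr": fields.get("addr"),
                "name": fields.get("name", ""),
                "omem": omem,
            })
    # Largest buffers first, since those are closest to being disconnected.
    flagged.sort(key=lambda client: client["omem"], reverse=True)
    return flagged
```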


Cluster Mode Specifics

In Cluster mode, monitoring adds:

  • MOVED / ASK error rates — a client routed a request to the wrong node. Some MOVED traffic is normal during failover; sustained MOVED rates mean clients haven't refreshed their slot map. Most modern clients handle this; legacy clients don't.
  • CLUSTER FAILOVER events — track them as discrete events with timestamps
  • Per-slot key counts — CLUSTER COUNTKEYSINSLOT per slot; sustained imbalance suggests a key-distribution problem
  • Cross-slot operations — disallowed by Cluster; client errors show in application logs. Worth tracking from the application side

Sentinel Mode

Sentinel coordinates failover for non-Cluster setups. Monitor:

  • Sentinel count alive (quorum requirement)
  • Last failover timestamp
  • Master/replica known to Sentinel matches actual topology
  • +sdown / +odown / +failover-triggered events from the Sentinel pub/sub channel

Managed Redis (ElastiCache, Memorystore, Upstash, Dragonfly, Redis Cloud)

AWS ElastiCache

CloudWatch covers most counters: CPUUtilization, EngineCPUUtilization (the single-threaded measure, and the one that matters), DatabaseMemoryUsagePercentage, Evictions, ReplicationLag, CurrConnections. Cluster mode includes per-shard metrics. For slow commands, ElastiCache can deliver the Redis slow log to CloudWatch Logs or Kinesis Data Firehose.

EngineCPUUtilization > 80% is the canonical "Redis is the bottleneck" signal — it measures the single-threaded event loop, which CPUUtilization doesn't isolate.

Google Memorystore

Cloud Monitoring covers similar ground. The metric names differ but the structure is the same.

Upstash

Upstash's HTTP/REST API model changes the connection-monitoring story (no persistent connections) but the keyspace metrics, eviction, and slowlog still apply. The Upstash dashboard exposes daily request count and bandwidth.

Dragonfly

Dragonfly is a multi-threaded Redis-compatible drop-in (2023+). Many of the single-threaded constraints (KEYS blocking the planet) don't apply, but the same monitoring surfaces (INFO, SLOWLOG, LATENCY) work because of API compatibility. Some Dragonfly-specific metrics (shard_count, per-shard CPU) need their own panels.

Redis Cloud / Redis Enterprise

Redis Inc.'s managed service exposes the same surfaces plus enterprise-specific (proxy CPU, shard placement) metrics.


What to Alert On

Critical (page)

  • used_memory > 95% of maxmemory and policy is noeviction
  • mem_fragmentation_ratio < 1.0 (swap-backed Redis)
  • evicted_keys rate > 10× baseline (eviction storm)
  • master_link_status:down on any replica
  • cluster_state != ok
  • rdb_last_bgsave_status != ok
  • rejected_connections growth (connection saturation)
  • EngineCPUUtilization > 90% (ElastiCache) or equivalent single-thread CPU

High (notification)

  • Cache hit ratio < 80% on a workload where it was historically > 90%
  • p99 latency > 10ms sustained (Redis should serve from RAM in single-digit ms)
  • Big key detected via --bigkeys or MEMORY USAGE > 100MB
  • aof_last_rewrite_time_sec growing across runs (AOF rewriter falling behind)
  • repl_backlog_active = 0 (replicas relying on full resync after disconnect)
  • New slow command appears in SLOWLOG
  • blocked_clients > 10× baseline (queue backlog forming)

Informational

  • mem_fragmentation_ratio > 1.5 (defrag opportunity)
  • Per-shard imbalance in Cluster mode
  • A new command appearing in top of INFO COMMANDSTATS
  • Sentinel failover event logged

See Alert Fatigue: Notifications That Get Acted On for the broader low-noise alerting principles.


Redis Monitoring Checklist

  • Polling agent runs INFO MEMORY, INFO STATS, INFO REPLICATION, INFO CLIENTS, INFO PERSISTENCE, INFO COMMANDSTATS every 30-60s
  • CONFIG SET latency-monitor-threshold 100 enabled; LATENCY LATEST polled
  • slowlog-log-slower-than 10000 configured; SLOWLOG GET shipped to logs
  • Hit ratio (keyspace_hits / (keyspace_hits + keyspace_misses)) tracked
  • Eviction rate (evicted_keys delta) tracked
  • Fragmentation ratio tracked; alert on < 1.0
  • maxmemory and maxmemory-policy reviewed for every Redis instance (no noeviction on caches by accident)
  • OS-level free memory > 30-50% over maxmemory (room for fork + AOF buffers)
  • Big-key job: weekly --bigkeys or periodic MEMORY USAGE scan
  • Hot-key job: --hotkeys (with LFU policy enabled) on the workload
  • Replication lag (offset gap + lag seconds) tracked per replica
  • repl-backlog-size sized to cover expected disconnect duration
  • Persistence: rdb_last_bgsave_status, AOF rewrite duration, fork-latency events
  • blocked_clients and rejected_connections tracked
  • Cluster: per-shard cluster_state, slot assignment, MOVED rates
  • Sentinel: quorum count, failover events
  • Managed-service: vendor-specific single-thread CPU metric (EngineCPUUtilization on ElastiCache, equivalents elsewhere)
  • KEYS disallowed in production (clients use SCAN)
  • External uptime monitoring complements internal — see Database Monitoring foundation
  • If Redis backs queues (BullMQ, Sidekiq) — also see Job Queue Monitoring
  • If Redis backs rate limiting — also see API Rate Limit Monitoring

How Webalert Helps With Redis Monitoring

Webalert provides the external-monitoring layer that complements your in-Redis telemetry:

  • HTTP monitoring — Watch the API endpoints backed by Redis (caches, sessions, rate limiters); when cache hit ratio collapses, you see it as edge latency
  • Content validation — Hit an internal /internal/redis-health endpoint that surfaces INFO MEMORY, hit ratio, eviction rate, and replication state; alert when thresholds are crossed
  • CDN + Redis cache layering — see CDN Monitoring: Edge Cache & Origin Uptime
  • Multi-region checks — Redis caches replicated to multiple regions; multi-region monitoring confirms freshness from the user's vantage point
  • Status page — Communicate "we're seeing elevated cache latency" without dumping internal metrics
  • Multi-channel alerts — Email, SMS, Slack, Discord, Microsoft Teams, webhooks
  • 1-minute check intervals — Detect outages within 60 seconds
  • 5-minute setup — Add endpoints, set thresholds, done

See features and pricing for details.


Summary

  • Redis is single-threaded for command execution; almost every incident relates to memory, latency from a slow command, or replication / persistence.
  • Poll INFO selectively: MEMORY, STATS, REPLICATION, CLIENTS, PERSISTENCE, COMMANDSTATS.
  • maxmemory is a soft limit on data; leave 30-50% OS headroom for fork, AOF buffers, replication backlog.
  • maxmemory-policy is the most commonly misconfigured setting — noeviction on a cache is a guaranteed incident.
  • Eviction rate, hit ratio, fragmentation ratio are the headline memory metrics. Fragmentation < 1.0 means Redis has swapped — page immediately.
  • Enable LATENCY monitoring and SLOWLOG. They are the closest equivalents Redis has to query monitoring.
  • INFO COMMANDSTATS is the per-command workload picture; sort by total time. INFO LATENCYSTATS (Redis 7+) adds per-command percentiles.
  • Big keys block the event loop; find with --bigkeys and MEMORY USAGE; shard or cap them.
  • Hot keys saturate the single CPU thread; find with --hotkeys (requires LFU policy).
  • Replication: per-replica offset, lag, link status; size repl-backlog-size correctly to avoid full resyncs.
  • Persistence: BGSAVE status, fork-latency spikes, AOF rewrite duration.
  • Cluster: cluster_state, slot coverage, MOVED rates.
  • Managed Redis (ElastiCache, Memorystore, Upstash, Dragonfly, Redis Cloud) layers vendor tooling on top; on ElastiCache, EngineCPUUtilization is the single-thread signal that matters most.

Redis problems are almost always memory problems, slow-command problems, or persistence problems — all three are visible in counters before they become user-visible incidents. Build the monitoring foundation once — selective INFO polling, LATENCY / SLOWLOG, hit ratio, eviction rate, big-key job, replication panels — and the next cache eviction storm shows up on a graph 10 minutes before it shows up on the pager.

