
Your web server is responding. Your database is healthy. Your status page is green.
But the welcome emails stopped going out three hours ago. Subscription renewals are stuck in "pending." The nightly data export hasn't run since Tuesday. User avatar uploads process successfully on the frontend but the thumbnails never get generated.
All of these failures share the same root cause: background jobs that stopped running — silently, without a 500 response or an alert. The web layer looks fine because the web layer is fine. The queue is the problem.
Background job queues are the dark matter of web infrastructure. They power a huge share of what your product actually does, yet most monitoring setups ignore them entirely. This guide covers how to monitor job queues across the major implementations — Sidekiq, BullMQ, RabbitMQ, and Amazon SQS — so the next consumer crash or dead letter pile-up doesn't become a support ticket on Monday morning.
Why Queue Monitoring Is Different
A web request fails visibly: the user sees an error, your uptime monitor fires, your error tracker lights up. A failed background job is invisible:
- No user interaction — The job runs without anyone waiting for the response
- Deferred failure — The user submitted the form successfully; they just never get the email
- No HTTP response — There's no status code to check, no latency to measure
- Consumer crashes hide — A crashed worker means jobs pile up in the queue rather than failing loudly
- Retry storms — A job that keeps failing retries N times over hours before hitting the dead letter queue
- Timing dependencies — Scheduled jobs that miss their window fail completely without any signal
Queue monitoring requires a completely different set of signals than web monitoring.
What to Monitor (Every Queue System)
These metrics apply regardless of which queue implementation you use.
1) Queue Depth (Backlog)
The number of jobs waiting to be processed. A healthy queue stays near zero; a growing queue means consumers are falling behind or dead.
- Alert on sustained growth — A queue that grows for 10 minutes is a problem even if no consumer has crashed
- Alert on sudden spikes — A sharp jump usually means a publisher bug creating runaway jobs
- Per-queue baselines — A queue that normally sits at 50 jobs and spikes to 5,000 is a crisis; one that normally sits at 5,000 isn't
2) Dead Letter Queue (DLQ) Depth
The dead letter queue holds jobs that exhausted all retries. Every job in the DLQ is a failed user action that needs investigation:
- Alert on any growth — A DLQ that's accumulating jobs is the single most actionable queue signal
- Alert on spikes — A sudden flood of DLQ entries usually means a bad deploy or a dependency failure
- Alert per job class — Many job types failing at once points to an infrastructure problem; a single type failing points to a code bug
3) Consumer Count and Health
If consumers go to zero, the queue fills indefinitely. Monitor:
- Active consumer count — Alert when it drops below the minimum needed for your throughput
- Consumer idle time — Consumers stuck "processing" the same job for too long are hung, not healthy
- Consumer crash rate — Frequent restarts are a leading indicator of a bad job class crashing workers
4) Job Latency (Age of Oldest Job)
How long jobs wait before being picked up. A healthy queue processes jobs within seconds to minutes.
- Alert on increasing latency — Even without queue depth growing, increasing wait time signals degraded throughput
- Alert per priority class — A high-priority queue backing up while a low-priority one is fine means consumer misconfiguration
5) Job Processing Time (Duration)
How long individual jobs take to run.
- Alert on p95/p99 regressions — A job that normally takes 1 second suddenly taking 30 seconds suggests a slow dependency
- Alert on hung jobs — Jobs executing for longer than their expected maximum are likely deadlocked or waiting on a failed dependency
- Track per job class — Aggregate duration hides which specific job type is slow
6) Error Rate and Retry Rate
- Error rate per job class — A spike in errors for one class is a code bug; a spike across all classes points to infrastructure
- Retry exhaustion rate — The proportion of jobs hitting max retries; rising rates predict DLQ accumulation
- Poison pill detection — A poison pill is a single malformed job that crashes every consumer that picks it up; repeated consumer restarts tied to the same job are the tell
7) Heartbeat Monitoring
The most reliable queue health check is a scheduled job that pings a heartbeat URL:
- Have your scheduler enqueue a "heartbeat" job every N minutes
- That job executes and calls a heartbeat endpoint (your monitoring tool records the call)
- If the heartbeat endpoint stops receiving pings, the queue or consumers have failed
This catches the failure mode that everything else misses: a queue that isn't actually processing anything, because there are no jobs currently in flight to measure. See Cron Job Monitoring: Background Tasks for the implementation pattern.
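To make the pattern concrete, here is a minimal sketch using BullMQ (covered below). The queue name, the 5-minute interval, and the HEARTBEAT_URL environment variable are placeholders for your own schedule and monitoring endpoint.

```typescript
import { Queue, Worker } from 'bullmq';

// Assumes an ES module, where top-level await is available.
const connection = { host: 'localhost', port: 6379 };

// Scheduler side: enqueue a repeatable "heartbeat" job every 5 minutes.
const heartbeats = new Queue('heartbeats', { connection });
await heartbeats.add('heartbeat', {}, { repeat: { every: 5 * 60 * 1000 } });

// Worker side: processing the job proves the whole pipeline works,
// so the only work it does is ping the heartbeat URL.
new Worker(
  'heartbeats',
  async () => {
    // HEARTBEAT_URL is a placeholder for your monitoring tool's ping endpoint.
    await fetch(process.env.HEARTBEAT_URL!);
  },
  { connection },
);

// If the monitoring tool stops receiving pings, the scheduler, the broker,
// or the workers have failed -- exactly the silent failure modes above.
```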
Monitoring by Queue System
Sidekiq (Ruby)
Sidekiq stores queues in Redis and exposes extensive metrics through its Web UI and the Sidekiq::Stats API.
Key metrics to track:
- `Sidekiq::Stats.new.enqueued` — total jobs waiting
- `Sidekiq::Stats.new.dead` — jobs in the dead set
- `Sidekiq::Stats.new.retry_size` — jobs awaiting retry
- `Sidekiq::Queue.all.each { |q| q.size }` — per-queue depths
- `Sidekiq::Workers.new.size` — number of currently-processing jobs
- `Sidekiq.redis { |r| r.info }` — Redis memory and connection health
Common pitfalls:
- Redis memory exhaustion is the most common Sidekiq outage; monitor Redis memory separately
- `sidekiq-cron` or `sidekiq-scheduler` jobs failing silently — add heartbeat jobs for each critical schedule
- Queue priority misconfiguration: low-priority queues starving critical ones
Monitoring integration:
Expose a custom `/sidekiq_health` endpoint that returns 200 with queue depths and consumer counts in JSON. Monitor that endpoint with content validation.
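On the monitoring side, a content-validation check can be as simple as fetching that endpoint and asserting on the JSON. A minimal TypeScript sketch follows; the endpoint path and the `dead` / `consumers` field names are assumptions about what your health endpoint returns.

```typescript
// Hypothetical shape of the JSON returned by /sidekiq_health.
interface SidekiqHealth {
  enqueued: number;
  dead: number;
  consumers: number;
}

async function checkSidekiqHealth(baseUrl: string): Promise<void> {
  const res = await fetch(`${baseUrl}/sidekiq_health`);
  if (!res.ok) throw new Error(`health endpoint returned ${res.status}`);

  const health = (await res.json()) as SidekiqHealth;

  // The two conditions worth paging on: failed jobs accumulating,
  // and no workers left to drain the queues.
  if (health.dead > 0) throw new Error(`${health.dead} jobs in the dead set`);
  if (health.consumers < 1) throw new Error('no Sidekiq processes running');
}
```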
BullMQ (Node.js)
BullMQ uses Redis and provides rich job lifecycle events. The Bull Board or Arena web UIs visualize the queues.
Key metrics:
- `queue.getWaitingCount()` — pending jobs
- `queue.getFailedCount()` — failed jobs (DLQ equivalent)
- `queue.getDelayedCount()` — scheduled-for-future jobs
- `queue.getActiveCount()` — currently processing
- `queue.getCompletedCount()` — completed jobs (optional, auto-cleaned)
- Worker event: `worker.on('error', ...)` — expose worker errors to your APM
Common pitfalls:
- `autorun: false` on workers after deployment restarts causes silent consumer death
- Concurrency configured too high floods downstream dependencies
- Long-running jobs blocking the event loop if CPU-bound work runs in-process
Monitoring integration:
Add a `/queue-health` endpoint in your Express app that calls `getWaitingCount()` and `getFailedCount()` for critical queues and returns JSON. Monitor it with content validation that checks `failed` stays at zero.
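A minimal sketch of such an endpoint is below; the queue names are placeholders, and the response shape is one reasonable choice rather than a BullMQ convention.

```typescript
import express from 'express';
import { Queue } from 'bullmq';

const app = express();
const connection = { host: 'localhost', port: 6379 };

// Placeholder queue names -- list the queues you actually care about.
const criticalQueues = ['emails', 'billing'].map(
  (name) => new Queue(name, { connection }),
);

app.get('/queue-health', async (_req, res) => {
  const queues = await Promise.all(
    criticalQueues.map(async (q) => ({
      name: q.name,
      waiting: await q.getWaitingCount(),
      failed: await q.getFailedCount(),
    })),
  );

  // An external content-validation check alerts when any `failed`
  // count is non-zero or `waiting` exceeds its baseline.
  res.json({ queues });
});

app.listen(3000);
```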
RabbitMQ
RabbitMQ is a full AMQP broker with its own management API (`http://host:15672/api`).
Key metrics:
- `messages_ready` per queue — waiting to be consumed
- `messages_unacknowledged` per queue — delivered but not yet acked
- `consumers` per queue — active consumer count
- `memory` — broker memory usage; alarms trigger when the threshold is crossed
- `disk_free` — disk alarms can pause publishing
- `publish_rate` and `deliver_rate` — production/consumption throughput
Common pitfalls:
- Memory alarm triggers flow control, which pauses all publishers — a RabbitMQ memory issue looks like every service is broken
- Prefetch (`basic.qos`) misconfiguration causes individual consumers to hoard messages
- Unacknowledged message buildup indicates consumers are taking messages but not completing them
Monitoring integration:
Poll the management API (`/api/queues/{vhost}/{queue-name}`) for `messages_ready` and `messages_unacknowledged`, and alert when either exceeds its threshold. Also monitor the broker's aliveness check endpoint (`/api/aliveness-test/{vhost}`) to confirm the broker itself is healthy.
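A sketch of that poll in TypeScript; the host, credentials, vhost, queue name, and thresholds are placeholders, while the response fields (`messages_ready`, `messages_unacknowledged`, `consumers`) come from the management API's queue object.

```typescript
// Poll the RabbitMQ management API for one queue's depth and consumer count.
// Host, credentials, and thresholds are placeholders; the default vhost "/"
// must be URL-encoded as %2F.
const MGMT_URL = 'http://rabbitmq-host:15672';
const AUTH = 'Basic ' + Buffer.from('monitoring:secret').toString('base64');

interface QueueStats {
  messages_ready: number;
  messages_unacknowledged: number;
  consumers: number;
}

async function checkQueue(vhost: string, queue: string): Promise<void> {
  const url = `${MGMT_URL}/api/queues/${encodeURIComponent(vhost)}/${queue}`;
  const res = await fetch(url, { headers: { Authorization: AUTH } });
  if (!res.ok) throw new Error(`management API returned ${res.status}`);

  const stats = (await res.json()) as QueueStats;

  if (stats.consumers === 0) throw new Error(`${queue}: no consumers`);
  if (stats.messages_ready > 1000)
    throw new Error(`${queue}: ${stats.messages_ready} messages waiting`);
  if (stats.messages_unacknowledged > 100)
    throw new Error(`${queue}: ${stats.messages_unacknowledged} unacked`);
}
```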
Amazon SQS
SQS is a managed queue service with CloudWatch metrics built in.
Key metrics:
- `ApproximateNumberOfMessagesVisible` — jobs waiting in the queue
- `ApproximateNumberOfMessagesNotVisible` — messages being processed (in flight)
- `ApproximateAgeOfOldestMessage` — age of the oldest job; rises when consumers are slow or dead
- `NumberOfMessagesSent` / `NumberOfMessagesDeleted` — throughput tracking
- Dead letter queue: `ApproximateNumberOfMessagesVisible` on the DLQ
Common pitfalls:
- Visibility timeout shorter than job execution time causes the same message to be processed multiple times
- Lambda concurrency limits causing processing to stop while messages pile up
- SQS FIFO queues becoming stuck due to a single bad message blocking the message group
Monitoring integration:
Set CloudWatch alarms on `ApproximateAgeOfOldestMessage` (your most sensitive signal) and on `ApproximateNumberOfMessagesVisible` for the DLQ. Forward these alarms to your alerting channels.
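Alarms like these are usually defined in Terraform or CloudFormation, but for illustration, here is a sketch using the AWS SDK for JavaScript v3; the queue name, threshold, and SNS topic ARN are placeholders.

```typescript
import {
  CloudWatchClient,
  PutMetricAlarmCommand,
} from '@aws-sdk/client-cloudwatch';

// Assumes an ES module, where top-level await is available.
const cloudwatch = new CloudWatchClient({ region: 'us-east-1' });

// Alarm when the oldest message in the queue is more than 15 minutes old --
// the most sensitive signal that consumers are slow or dead.
await cloudwatch.send(
  new PutMetricAlarmCommand({
    AlarmName: 'jobs-queue-oldest-message-age',
    Namespace: 'AWS/SQS',
    MetricName: 'ApproximateAgeOfOldestMessage',
    Dimensions: [{ Name: 'QueueName', Value: 'jobs-queue' }],
    Statistic: 'Maximum',
    Period: 300, // evaluate over 5-minute windows
    EvaluationPeriods: 1,
    Threshold: 900, // seconds
    ComparisonOperator: 'GreaterThanThreshold',
    // Placeholder SNS topic that fans out to your alerting channels.
    AlarmActions: ['arn:aws:sns:us-east-1:123456789012:queue-alerts'],
  }),
);
```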
Common Queue Failure Modes
| Failure | User Impact | How to Detect |
|---|---|---|
| All consumers crashed | No jobs processed; queue grows | Consumer count alert + heartbeat |
| Redis down (Sidekiq/BullMQ) | Queue inaccessible; jobs cannot be enqueued or dequeued | Redis uptime check + queue health endpoint |
| DLQ accumulating | Failed user actions silently piling up | DLQ depth alert |
| Consumer stuck on poison pill | One bad message blocks the worker | Job duration alert + consumer idle time |
| Retry storm from bad deploy | Rapid DLQ accumulation, CPU spike | Error rate + retry rate per job class |
| Scheduled job missing | Nightly exports, report generation not running | Heartbeat monitoring |
| Visibility timeout exceeded (SQS) | Duplicate job processing | ApproximateNumberOfMessagesNotVisible + duplicate detection |
| RabbitMQ memory alarm | All publishing paused | Broker memory + alarm status |
| Consumer misconfiguration after deploy | Jobs queued but not processed | Consumer count + queue depth combo alert |
| Long-running job blocking worker | Throughput degraded, other jobs wait | Job duration p99 alert |
Setting Up Queue Monitoring
Quick start (15 minutes)
- Heartbeat job — Enqueue a no-op job every 5 minutes that pings a heartbeat monitor URL
- Health endpoint — Expose queue depths and consumer count as JSON at a `/queue-health` URL
- Content validation check on `/queue-health` — Alert if `dead` or `failed` is non-zero, or `consumers` drops below 1
- DLQ alert — Monitor your DLQ depth; alert the moment it starts growing
Comprehensive setup (1 hour)
Add to the quick start:
- Per-queue depth alerts with dynamic baselines (alert on growth, not just absolute size)
- Consumer count alerts per queue with minimum thresholds
- Job duration p95/p99 from your APM, with regression alerts
- Retry rate tracking — alert when retry rate for any job class exceeds X% per hour
- Redis/broker health — Separate uptime and memory checks on the underlying store
- Per-scheduled-job heartbeats — Critical cron-triggered jobs each get their own heartbeat
What to Do When Queue Monitoring Fires
Queue depth growing / consumers at zero:
- Check whether the consumer process is running (Sidekiq, BullMQ worker, Lambda, ECS task)
- Look at recent deploys — misconfigured startup, wrong environment variable
- Check the underlying broker (Redis, RabbitMQ) for memory or connection issues
- Restart consumers and watch queue drain
DLQ accumulating:
- Pull a sample of DLQ jobs and inspect their payloads and error messages
- Identify whether it's one job class or many — single class = code bug; all classes = infrastructure
- Fix the root cause before re-queuing DLQ jobs (re-queuing before fixing causes the same accumulation)
- Use a DLQ replay strategy: small batches, monitor for recurrence
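As an illustration of small-batch replay for an SQS dead letter queue, here is a hedged sketch; the queue URLs and batch size are placeholders, and you would watch the DLQ depth between batches before continuing.

```typescript
import {
  SQSClient,
  ReceiveMessageCommand,
  SendMessageCommand,
  DeleteMessageCommand,
} from '@aws-sdk/client-sqs';

const sqs = new SQSClient({ region: 'us-east-1' });

// Placeholder queue URLs.
const DLQ_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/jobs-dlq';
const MAIN_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/jobs';

// Replay one small batch from the DLQ back onto the main queue.
// Run it repeatedly, checking the DLQ depth between batches for recurrence.
async function replayBatch(size = 10): Promise<number> {
  const { Messages = [] } = await sqs.send(
    new ReceiveMessageCommand({ QueueUrl: DLQ_URL, MaxNumberOfMessages: size }),
  );

  for (const msg of Messages) {
    await sqs.send(
      new SendMessageCommand({ QueueUrl: MAIN_URL, MessageBody: msg.Body! }),
    );
    // Only delete from the DLQ after the re-enqueue succeeds.
    await sqs.send(
      new DeleteMessageCommand({
        QueueUrl: DLQ_URL,
        ReceiptHandle: msg.ReceiptHandle!,
      }),
    );
  }
  return Messages.length;
}
```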
Heartbeat missed:
- Confirm whether the scheduler is still enqueuing the heartbeat job
- Check whether the heartbeat job is being picked up (queue depth for the heartbeat queue)
- Check broker connectivity from the scheduler
- Check consumer logs for the heartbeat job class
Poison pill blocking a consumer:
- Identify the stuck job ID from consumer metrics
- Move it to the DLQ or delete it
- Restart the affected consumer
- Add validation for the payload shape that caused the poison pill
How Webalert Helps
Webalert provides the external monitoring layer for your background job infrastructure:
- Heartbeat monitoring — Pair with a job that pings a heartbeat URL; alert the moment jobs stop processing
- HTTP checks with content validation — Monitor `/queue-health` endpoints for consumer count and DLQ depth
- Cron monitoring — Confirm scheduled jobs fire on time with configurable expected intervals
- Multi-channel alerts — Email, SMS, Slack, Discord, Microsoft Teams, webhooks
- Status pages — Communicate background processing incidents to users
- 5-minute setup — Start with a heartbeat and health endpoint check today
See features and pricing for details.
Summary
- Background jobs fail silently. The web layer looks healthy while the queue is broken.
- Monitor queue depth, dead letter queue depth, consumer count, job latency, and job duration.
- The most reliable check is a heartbeat job: enqueue → execute → ping a URL. If the ping stops, the queue or consumers have failed.
- Different queue systems expose different signals: Sidekiq via Redis/API, BullMQ via queue methods, RabbitMQ via management API, SQS via CloudWatch.
- DLQ accumulation is the single most actionable signal — it means real user actions have permanently failed.
- Alert on DLQ growth, consumer count drops, and missed heartbeats as your three primary queue health metrics.
The web layer shows you symptoms. Queue monitoring shows you causes.