
Job Queue Monitoring: Sidekiq, BullMQ, and SQS

Webalert Team
May 6, 2026
12 min read

Your web server is responding. Your database is healthy. Your status page is green.

But the welcome emails stopped going out three hours ago. Subscription renewals are stuck in "pending." The nightly data export hasn't run since Tuesday. User avatar uploads process successfully on the frontend but the thumbnails never get generated.

All of these failures share the same root cause: background jobs that stopped running — silently, without a 500 response or an alert. The web layer looks fine because the web layer is fine. The queue is the problem.

Background job queues are the dark matter of web infrastructure. They power a huge share of what your product actually does, yet most monitoring setups ignore them entirely. This guide covers how to monitor job queues across the major implementations — Sidekiq, BullMQ, RabbitMQ, and Amazon SQS — so the next consumer crash or dead letter pile-up doesn't become a support ticket on Monday morning.


Why Queue Monitoring Is Different

A web request fails visibly: the user sees an error, your uptime monitor fires, your error tracker lights up. A failed background job is invisible:

  • No user interaction — The job runs without anyone waiting for the response
  • Deferred failure — The user submitted the form successfully; they just never get the email
  • No HTTP response — There's no status code to check, no latency to measure
  • Consumer crashes hide — A crashed worker means jobs pile up in the queue rather than failing loudly
  • Retry storms — A job that keeps failing retries N times over hours before hitting the dead letter queue
  • Timing dependencies — Scheduled jobs that miss their window fail completely without any signal

Queue monitoring requires a completely different set of signals than web monitoring.


What to Monitor (Every Queue System)

These metrics apply regardless of which queue implementation you use.

1) Queue Depth (Backlog)

The number of jobs waiting to be processed. A healthy queue stays near zero; a growing queue means consumers are falling behind or dead.

  • Alert on sustained growth — A queue that grows for 10 minutes is a problem even if no consumer has crashed
  • Alert on sudden spikes — A sharp jump usually means a publisher bug creating runaway jobs
  • Per-queue baselines — A queue that normally sits at 50 jobs and spikes to 5,000 is a crisis; one that normally sits at 5,000 isn't
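The three rules above can be sketched as a single check, assuming you sample queue depth on a fixed interval (say, once a minute) and keep the recent samples. The function name, window size, and spike factor are illustrative, not from any particular tool:

```ruby
# Illustrative backlog check: fires on sustained growth or a sudden spike.
# Assumes `samples` is an array of depth readings, oldest first, taken at
# a fixed interval. Thresholds here are placeholders to tune per queue.
def backlog_alert?(samples, growth_window: 10, spike_factor: 10.0)
  return false if samples.size < growth_window + 1

  recent = samples.last(growth_window + 1)
  # Sustained growth: depth increased at every sample across the window.
  sustained = recent.each_cons(2).all? { |a, b| b > a }
  # Sudden spike: latest depth is many times the window's starting depth.
  baseline = [recent.first, 1].max
  spike = recent.last.to_f / baseline >= spike_factor

  sustained || spike
end
```

Comparing against a per-queue baseline (the window's own starting depth) rather than an absolute number is what makes the same rule work for a queue that idles at 50 and one that idles at 5,000.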

2) Dead Letter Queue (DLQ) Depth

The dead letter queue holds jobs that exhausted all retries. Every job in the DLQ is a failed user action that needs investigation:

  • Alert on any growth — A DLQ that's accumulating jobs is the single most actionable queue signal
  • Alert on spikes — A sudden flood of DLQ entries usually means a bad deploy or a dependency failure
  • Alert per job class — Many job types failing at once points to infrastructure; a single type failing points to code
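The per-job-class triage rule can be sketched as a small classifier. The field name `:job_class` and the three-class cutoff are assumptions for illustration:

```ruby
# Illustrative DLQ triage: failures spread across many job classes point
# at infrastructure; failures concentrated in one class point at that
# job's code. `entries` is assumed to be an array of hashes with a
# :job_class key; the cutoff of 3 distinct classes is a placeholder.
def classify_dlq_failures(entries)
  classes = entries.map { |e| e[:job_class] }.uniq
  return :empty if classes.empty?

  classes.size >= 3 ? :likely_infrastructure : :likely_code_bug
end
```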

3) Consumer Count and Health

If consumers go to zero, the queue fills indefinitely. Monitor:

  • Active consumer count — Alert when it drops below the minimum needed for your throughput
  • Consumer idle time — Consumers stuck "processing" the same job for too long are hung, not healthy
  • Consumer crash rate — Frequent restarts are a leading indicator of a bad job class crashing workers
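The "hung, not healthy" check above is simple to express: flag any consumer that has been on the same job longer than that job's expected maximum. The hash keys here are assumptions; map them to whatever your worker introspection API exposes:

```ruby
# Illustrative hung-consumer check. `workers` is assumed to be an array of
# hashes like { job_id:, started_at: Time, max_runtime: seconds } built
# from your queue system's "currently processing" introspection.
def hung_consumers(workers, now: Time.now)
  workers.select { |w| now - w[:started_at] > w[:max_runtime] }
end
```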

4) Job Latency (Age of Oldest Job)

How long jobs wait before being picked up. A healthy queue processes jobs within seconds to minutes.

  • Alert on increasing latency — Even without queue depth growing, increasing wait time signals degraded throughput
  • Alert per priority class — A high-priority queue backing up while a low-priority one is fine means consumer misconfiguration

5) Job Processing Time (Duration)

How long individual jobs take to run.

  • Alert on p95/p99 regressions — A job that normally takes 1 second suddenly taking 30 suggests a slow dependency
  • Alert on hung jobs — Jobs executing for longer than their expected maximum are likely deadlocked or waiting on a failed dependency
  • Track per job class — Aggregate duration hides which specific job type is slow
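A p95 regression check per job class can be sketched in a few lines; the nearest-rank percentile and the 2x regression factor are assumptions to tune per job:

```ruby
# Nearest-rank p95 over a set of durations (seconds).
def p95(durations)
  sorted = durations.sort
  sorted[(sorted.size * 0.95).ceil - 1]
end

# Illustrative regression rule: alert when the current window's p95 is
# more than `factor` times the recorded baseline for that job class.
def duration_regression?(durations, baseline_p95, factor: 2.0)
  p95(durations) > baseline_p95 * factor
end
```

Run this per job class, never on the aggregate, or one slow class hides inside the mix.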

6) Error Rate and Retry Rate

  • Error rate per job class — A spike in errors for one class is a code bug; errors across all classes is infrastructure
  • Retry exhaustion rate — The proportion of jobs hitting max retries; rising rates predict DLQ accumulation
  • Poison pill detection — A single malformed job that causes every consumer to crash is a poison pill

7) Heartbeat Monitoring

The most reliable queue health check is a scheduled job that pings a heartbeat URL:

  1. Have your scheduler enqueue a "heartbeat" job every N minutes
  2. That job executes and calls a heartbeat endpoint (your monitoring tool records the call)
  3. If the heartbeat endpoint stops receiving pings, the queue or consumers have failed

This catches the failure mode that everything else misses: a queue that isn't actually processing anything, because there are no jobs currently in flight to measure. See Cron Job Monitoring: Background Tasks for the implementation pattern.
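On the monitoring side, step 3 reduces to a staleness check: if the last recorded ping is older than the schedule interval plus a grace period, something in the scheduler → queue → consumer pipeline has failed. A minimal sketch, with assumed names and a placeholder grace period:

```ruby
# Illustrative heartbeat staleness check. `last_ping_at` is the time the
# heartbeat endpoint last received a call (nil if never); `interval_sec`
# is how often the scheduler enqueues the heartbeat job. The 60-second
# grace period absorbs normal queue latency.
def heartbeat_missed?(last_ping_at, interval_sec, grace_sec: 60, now: Time.now)
  return true if last_ping_at.nil? # never pinged at all

  now - last_ping_at > interval_sec + grace_sec
end
```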


Monitoring by Queue System

Sidekiq (Ruby)

Sidekiq stores queues in Redis and exposes extensive metrics through its Web UI and the Sidekiq::Stats API.

Key metrics to track:

  • Sidekiq::Stats.new.enqueued — total jobs waiting
  • Sidekiq::Stats.new.dead — jobs in the dead set
  • Sidekiq::Stats.new.retry_size — jobs awaiting retry
  • Sidekiq::Queue.all.map { |q| [q.name, q.size] } — per-queue depths
  • Sidekiq::Workers.new.size — number of currently-processing jobs
  • Sidekiq.redis { |r| r.info } — Redis memory and connection health

Common pitfalls:

  • Redis memory exhaustion is the most common Sidekiq outage; monitor Redis memory separately
  • sidekiq-cron or sidekiq-scheduler jobs failing silently — add heartbeat jobs for each critical schedule
  • Queue priority misconfiguration: low-priority queues starving critical ones

Monitoring integration: Expose a custom /sidekiq_health endpoint that returns 200 with queue depths and consumer counts in JSON. Monitor that endpoint with content validation.
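A sketch of the payload such an endpoint might return. In a real app, `stats` would be `Sidekiq::Stats.new` and `queues` would be `Sidekiq::Queue.all`; here the builder accepts any objects responding to the same methods so the shape can be exercised without Redis, and the method name and JSON layout are assumptions:

```ruby
require "json"

# Illustrative /sidekiq_health payload builder. `stats` must respond to
# #enqueued, #dead, and #retry_size (like Sidekiq::Stats); `queues` must
# be objects responding to #name and #size (like Sidekiq::Queue); `busy`
# is the currently-processing count (e.g. Sidekiq::Workers.new.size).
def sidekiq_health_payload(stats, queues, busy)
  {
    enqueued: stats.enqueued,
    dead:     stats.dead,
    retries:  stats.retry_size,
    busy:     busy,
    queues:   queues.map { |q| [q.name, q.size] }.to_h
  }
end
```

A content-validation check on the endpoint can then assert that `dead` is 0 and every queue depth is under its threshold.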

BullMQ (Node.js)

BullMQ uses Redis and provides rich job lifecycle events. The Bull Board or Arena web UIs visualize the queues.

Key metrics:

  • queue.getWaitingCount() — pending jobs
  • queue.getFailedCount() — failed jobs (DLQ equivalent)
  • queue.getDelayedCount() — scheduled-for-future jobs
  • queue.getActiveCount() — currently processing
  • queue.getCompletedCount() (optional, auto-cleaned)
  • Worker event: worker.on('error', ...) — expose worker errors to your APM

Common pitfalls:

  • Workers created with autorun: false but never explicitly started after a deploy restart die silently
  • Concurrency configured too high floods downstream dependencies
  • Long-running jobs blocking the event loop if CPU-bound work runs in-process

Monitoring integration: Add a /queue-health endpoint in your Express app that calls getWaitingCount() and getFailedCount() for critical queues and returns JSON. Monitor it with content validation to confirm the failed count stays at zero.

RabbitMQ

RabbitMQ is a full AMQP broker with its own management API (http://host:15672/api).

Key metrics:

  • messages_ready per queue — waiting to be consumed
  • messages_unacknowledged per queue — delivered but not yet acked
  • consumers per queue — active consumer count
  • memory — broker memory usage; alarms trigger when threshold is crossed
  • disk_free — disk alarms can pause publishing
  • publish_rate and deliver_rate — production/consumption throughput

Common pitfalls:

  • Memory alarm triggers flow control, which pauses all publishers — a RabbitMQ memory issue looks like every service is broken
  • Prefetch (basic.qos) misconfiguration causes individual consumers to hoard messages
  • Unacknowledged message buildup indicates consumers are taking messages but not completing them

Monitoring integration: Poll the management API (/api/queues/{vhost}/{queue-name}) for messages_ready and messages_unacknowledged. Alert when either exceeds its threshold. Also monitor the broker's own aliveness check (/api/aliveness-test/{vhost}).

Amazon SQS

SQS is a managed queue service with CloudWatch metrics built in.

Key metrics:

  • ApproximateNumberOfMessagesVisible — jobs waiting in the queue
  • ApproximateNumberOfMessagesNotVisible — messages being processed (in-flight)
  • ApproximateAgeOfOldestMessage — age of the oldest job; rises when consumers are slow or dead
  • NumberOfMessagesSent / NumberOfMessagesDeleted — throughput tracking
  • Dead letter queue: ApproximateNumberOfMessagesVisible on the DLQ

Common pitfalls:

  • Visibility timeout shorter than job execution time causes the same message to be processed multiple times
  • Lambda concurrency limits causing processing to stop while messages pile up
  • SQS FIFO queues becoming stuck due to a single bad message blocking the message group

Monitoring integration: Set CloudWatch alarms on ApproximateAgeOfOldestMessage (your most sensitive signal) and ApproximateNumberOfMessagesVisible on the DLQ. Forward these to your alerting channels.
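CloudWatch's alarm semantics are worth internalizing: an alarm fires when the metric breaches the threshold for N consecutive evaluation periods, which filters out momentary blips. A sketch of that rule applied to oldest-message age, with placeholder threshold and period count:

```ruby
# Illustrative model of a CloudWatch-style alarm on
# ApproximateAgeOfOldestMessage: fire only when the age (in seconds)
# exceeds the threshold for `evaluation_periods` consecutive datapoints.
def age_alarm?(datapoints, threshold: 600, evaluation_periods: 3)
  return false if datapoints.size < evaluation_periods

  datapoints.last(evaluation_periods).all? { |age| age > threshold }
end
```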


Common Queue Failure Modes

Each failure mode below is listed with its user impact and how to detect it:

  • All consumers crashed — Impact: no jobs processed; queue grows. Detect: consumer count alert + heartbeat
  • Redis down (Sidekiq/BullMQ) — Impact: queue inaccessible; jobs cannot be enqueued or dequeued. Detect: Redis uptime check + queue health endpoint
  • DLQ accumulating — Impact: failed user actions silently piling up. Detect: DLQ depth alert
  • Consumer stuck on poison pill — Impact: one bad message blocks the worker. Detect: job duration alert + consumer idle time
  • Retry storm from bad deploy — Impact: rapid DLQ accumulation, CPU spike. Detect: error rate + retry rate per job class
  • Scheduled job missing — Impact: nightly exports and report generation not running. Detect: heartbeat monitoring
  • Visibility timeout exceeded (SQS) — Impact: duplicate job processing. Detect: ApproximateNumberOfMessagesNotVisible + duplicate detection
  • RabbitMQ memory alarm — Impact: all publishing paused. Detect: broker memory + alarm status
  • Consumer misconfiguration after deploy — Impact: jobs queued but not processed. Detect: consumer count + queue depth combo alert
  • Long-running job blocking worker — Impact: throughput degraded; other jobs wait. Detect: job duration p99 alert

Setting Up Queue Monitoring

Quick start (15 minutes)

  1. Heartbeat job — Enqueue a no-op job every 5 minutes that pings a heartbeat monitor URL
  2. Health endpoint — Expose queue depths and consumer count as JSON at a /queue-health URL
  3. Content validation check on /queue-health — Alert if dead or failed is non-zero, or the consumer count drops below 1
  4. DLQ alert — Monitor your DLQ depth; alert the moment it starts growing

Comprehensive setup (1 hour)

Add to the quick start:

  1. Per-queue depth alerts with dynamic baselines (alert on growth, not just absolute size)
  2. Consumer count alerts per queue with minimum thresholds
  3. Job duration p95/p99 from your APM, with regression alerts
  4. Retry rate tracking — alert when retry rate for any job class exceeds X% per hour
  5. Redis/broker health — Separate uptime and memory checks on the underlying store
  6. Per-scheduled-job heartbeats — Critical cron-triggered jobs each get their own heartbeat

What to Do When Queue Monitoring Fires

Queue depth growing / consumers at zero:

  1. Check whether the consumer process is running (Sidekiq, BullMQ worker, Lambda, ECS task)
  2. Look at recent deploys — misconfigured startup, wrong environment variable
  3. Check the underlying broker (Redis, RabbitMQ) for memory or connection issues
  4. Restart consumers and watch queue drain

DLQ accumulating:

  1. Pull a sample of DLQ jobs and inspect their payloads and error messages
  2. Identify whether it's one job class or many — single class = code bug; all classes = infrastructure
  3. Fix the root cause before re-queuing DLQ jobs (re-queuing before fixing causes the same accumulation)
  4. Use a DLQ replay strategy: small batches, monitor for recurrence
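The small-batch replay strategy in step 4 can be sketched as follows. The method name is an assumption; the block is caller-supplied and re-enqueues one job, returning true on success, so replay halts at the first recurrence rather than re-poisoning the DLQ:

```ruby
# Illustrative DLQ replay: re-enqueue jobs in small batches and halt at
# the first job whose re-enqueue fails (a sign the root cause isn't
# fixed). The block receives one job and returns true on success.
def replay_dlq(jobs, batch_size: 10)
  replayed = 0
  jobs.each_slice(batch_size) do |batch|
    batch.each do |job|
      ok = yield(job)
      return { replayed: replayed, halted: true } unless ok

      replayed += 1
    end
  end
  { replayed: replayed, halted: false }
end
```

In practice you would also pause between batches and watch the error rate before continuing.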

Heartbeat missed:

  1. Confirm whether the scheduler is still enqueuing the heartbeat job
  2. Check whether the heartbeat job is being picked up (queue depth for the heartbeat queue)
  3. Check broker connectivity from the scheduler
  4. Check consumer logs for the heartbeat job class

Poison pill blocking a consumer:

  1. Identify the stuck job ID from consumer metrics
  2. Move it to the DLQ or delete it
  3. Restart the affected consumer
  4. Add validation for the payload shape that caused the poison pill

How Webalert Helps

Webalert provides the external monitoring layer for your background job infrastructure:

  • Heartbeat monitoring — Pair with a job that pings a heartbeat URL; alert the moment jobs stop processing
  • HTTP checks with content validation — Monitor /queue-health endpoints for consumer count and DLQ depth
  • Cron monitoring — Confirm scheduled jobs fire on time with configurable expected intervals
  • Multi-channel alerts — Email, SMS, Slack, Discord, Microsoft Teams, webhooks
  • Status pages — Communicate background processing incidents to users
  • 5-minute setup — Start with a heartbeat and health endpoint check today

See features and pricing for details.


Summary

  • Background jobs fail silently. The web layer looks healthy while the queue is broken.
  • Monitor queue depth, dead letter queue depth, consumer count, job latency, and job duration.
  • The most reliable check is a heartbeat job: enqueue → execute → ping a URL. If the ping stops, the queue or consumers have failed.
  • Different queue systems expose different signals: Sidekiq via Redis/API, BullMQ via queue methods, RabbitMQ via management API, SQS via CloudWatch.
  • DLQ accumulation is the single most actionable signal — it means real user actions have permanently failed.
  • Alert on DLQ growth, consumer count drops, and missed heartbeats as your three primary queue health metrics.

The web layer shows you symptoms. Queue monitoring shows you causes.


Catch silent queue failures before users notice

Start monitoring with Webalert →


Written by

Webalert Team

The Webalert team is dedicated to helping businesses keep their websites online and their users happy with reliable monitoring solutions.
