Ruby on Rails Monitoring: Production Uptime Guide

Webalert Team
May 15, 2026
16 min read

Rails apps fail in a particular flavor: not loud crashes, but slow creeping degradation. Sidekiq's queue grows from 100 to 100,000 over a weekend. An ActiveRecord N+1 you didn't notice in staging hits production and your p95 doubles. A migration that "took 30 seconds locally" runs for 40 minutes against a real table and locks reads. Action Cable's long-lived connection count climbs until your Puma threads are exhausted and new requests queue.

None of these look like "the website is down." Your /health endpoint returns 200 the whole time. Your CDN cache hit rate is great. Your status page is green. And yet — checkout latency is unusable, background jobs are running 6 hours behind, and the dashboard chart shows users abandoning at 2× the normal rate.

Rails monitoring isn't about uptime in the binary sense; it's about catching these gradual degradations before they become user-visible incidents. This guide covers the per-tier monitoring stack — process, request, job, database, cache — with the Rails-specific signals that matter most: ActiveRecord query patterns, Sidekiq queue health, Puma worker saturation, Action Cable connections, and zero-downtime deploys.


What Makes Rails Monitoring Different

Rails has more concurrency-and-state moving parts than most frameworks. A typical production Rails app runs all of these tiers at once:

  • Web tier — Puma (most common), Unicorn (legacy), Falcon (fiber-based, newer); usually behind NGINX or a CDN
  • Background tier — Sidekiq (most common), Resque, GoodJob, SolidQueue (Rails 8 default); pulling jobs from Redis or PostgreSQL
  • Scheduled tier — Sidekiq-Cron, sidekiq-scheduler, whenever, or Rails' own recurring jobs in 8+
  • Realtime tier — Action Cable on WebSockets; subscriptions, broadcasts, presence
  • Cache tier — Redis (most common), Memcached, Solid Cache
  • Database tier — usually PostgreSQL, sometimes MySQL; talking to it via the connection pool

Failures in any one of these tiers can take down user-facing functionality without showing up in the others. Sidekiq saturation doesn't drop your healthcheck. A bad ActiveRecord query doesn't drop your Sidekiq worker count. A Puma worker exhaustion event doesn't move the Sidekiq queue chart.

The monitoring strategy that works for Rails is per-tier — each layer has its own set of signals, alerts, and runbooks. Trying to monitor the whole stack with a single "is the website up?" check misses most real outages.


Tier 1: Process and Worker Health (Puma / Unicorn / Falcon)

The web tier is where requests enter and where the most common silent failure — worker exhaustion — happens.

Puma metrics that matter

Puma exposes process state via the Puma.stats API and via pumactl stats:

  • booted_workers — workers currently alive (should match config; fewer means workers have crashed)
  • running — threads currently spawned (reported per worker in cluster mode)
  • backlog — queued requests waiting for a worker (the canary metric — if this is climbing, you're saturated)
  • pool_capacity — threads available across workers
  • busy_threads — threads currently serving requests

The key alert: backlog > 0 sustained for > 1 minute = workers are exhausted, new requests are queueing. By the time backlog is climbing, users are seeing slowness.
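
If nothing is collecting these yet, a minimal sketch, assuming Puma's control app is enabled and a StatsD client is configured (the env var and metric names here are hypothetical):

# config/puma.rb: enable the control app so stats are pollable over HTTP
activate_control_app 'tcp://127.0.0.1:9293', auth_token: ENV.fetch('PUMA_CONTROL_TOKEN')

# poller.rb: run on an interval; sums backlog across workers (cluster mode)
require 'net/http'
require 'json'

uri   = URI("http://127.0.0.1:9293/stats?token=#{ENV.fetch('PUMA_CONTROL_TOKEN')}")
stats = JSON.parse(Net::HTTP.get(uri))

# single-mode Puma reports backlog at the top level instead of worker_status
backlog = stats.fetch('worker_status', []).sum { |w| w.dig('last_status', 'backlog').to_i }
StatsD.gauge('puma.backlog', backlog)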

Worker saturation patterns

  • Slow external HTTP call in a request → thread tied up waiting → fewer threads available → backlog climbs
  • Long-running ActiveRecord query → same effect, but blamed on the database
  • Memory leak → workers grow until OOM-killed → fewer workers → backlog climbs
  • Bug in concurrent-ruby pool initialization → threads deadlocked
  • Action Cable connection leak → WebSocket workers (or threads, depending on adapter) tied up

Monitor:

  • Puma worker count vs configured
  • Per-worker memory growth (alert at 80% of container limit)
  • Backlog and pool capacity (1-minute resolution)
  • Request queue time (the time between when the request arrived at the load balancer and when Puma picked it up; should be < 50ms)

Unicorn / Falcon

  • Unicorn is per-process (no threading), so saturation looks like all workers busy on slow requests. Alert on total_busy >= worker_count for > 30s (the raindrops gem is the usual source for these counts).
  • Falcon is fiber-based and handles I/O concurrency very differently. The signal to watch is fiber pool exhaustion, which manifests as request stalls without CPU pressure.

Tier 2: Request Monitoring (Rails Controllers and Middleware)

Per-request monitoring is what every APM tool covers reasonably well. The Rails-specific gotchas to watch:

Use ActiveSupport::Notifications

Rails emits structured events for everything internally — process_action.action_controller, sql.active_record, cache_read.active_support, render_template.action_view, etc. Most APMs use these to build their dashboards. You can also subscribe directly in your own code:

ActiveSupport::Notifications.subscribe('sql.active_record') do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  next if event.payload[:name] == 'SCHEMA' # skip Rails' own schema reflection queries
  StatsD.measure('rails.sql.duration', event.duration) # duration is in milliseconds
end

This is how to instrument a metric the APM didn't auto-collect.

N+1 query detection

The classic Rails performance killer. Tools:

  • Bullet — flags N+1 in development (don't run in prod; it has overhead and false positives in hot paths)
  • prosopite — better N+1 detection that works in test and CI
  • Skylight / AppSignal / New Relic / Sentry tracing — surfaces N+1 in production traces

The production-monitoring signal for N+1: a controller action whose SQL query count climbs over time as data grows. A user's dashboard with 5 records calls 5 queries; the same dashboard six months later with 500 records calls 500. Track sql.active_record count per request endpoint over time, not just total duration.
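
A minimal way to get that per-action count, assuming Rails 6+ (where subscribe can yield an event object directly) and a StatsD client; the class and metric names are illustrative:

# config/initializers/query_counter.rb
class QueryStats < ActiveSupport::CurrentAttributes
  attribute :sql_count # reset automatically around each request
end

ActiveSupport::Notifications.subscribe('sql.active_record') do |event|
  QueryStats.sql_count = QueryStats.sql_count.to_i + 1 unless event.payload[:name] == 'SCHEMA'
end

ActiveSupport::Notifications.subscribe('process_action.action_controller') do |event|
  action = "#{event.payload[:controller]}##{event.payload[:action]}"
  # swap in your client's histogram/distribution type if it has one
  StatsD.measure("rails.sql_count.#{action}", QueryStats.sql_count.to_i)
end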

Slow request alerting

  • p95 response time per controller action
  • p99 response time (the tail is where bad UX hides)
  • Request queue time (Heroku-style X-Request-Start to Puma pickup)
  • 5xx rate per controller — see 5xx Server Error Rate Monitoring

Memory bloat per request

A request that loads 100K ActiveRecord objects leaves the worker in a bloated state even after the response finishes, because Ruby rarely returns freed heap memory to the OS; the next requests on that worker pay for it too. Track per-worker memory delta around request boundaries; alert on workers > 1.5× their post-boot RSS.


Tier 3: Background Jobs (Sidekiq)

For most Rails apps, background jobs are where the silent failures live. The user submits a form, the controller responds with 200, the actual work happens in Sidekiq — and if Sidekiq is degraded, the user sees a stale UI for hours.

Sidekiq metrics that matter

Sidekiq exposes everything via Sidekiq::Stats and Sidekiq::Queue.all:

  • Queue depth (size) — jobs waiting to be processed
  • Queue latency (latency) — age of the oldest job in the queue (the leading indicator of saturation)
  • Processed / failed counters
  • Retry set size — jobs that have failed and are awaiting retry
  • Dead set size — jobs that have permanently failed (after the retry policy)
  • Scheduled set size — jobs queued for future execution
  • Worker process count — alive workers
  • Busy workers — currently processing
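
All of these are one API call away. A minimal reporter sketch, run from a recurring job or a supervisor thread (metric names are illustrative):

require 'sidekiq/api'

stats = Sidekiq::Stats.new
StatsD.gauge('sidekiq.retry_size',     stats.retry_size)
StatsD.gauge('sidekiq.dead_size',      stats.dead_size)
StatsD.gauge('sidekiq.scheduled_size', stats.scheduled_size)
StatsD.gauge('sidekiq.busy',           stats.workers_size)

Sidekiq::Queue.all.each do |queue|
  StatsD.gauge("sidekiq.queue.#{queue.name}.size",    queue.size)
  StatsD.gauge("sidekiq.queue.#{queue.name}.latency", queue.latency)
end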

Queue-specific alerting

A single Sidekiq queue is rarely enough for production. Split by priority:

  • critical — email confirmations, password resets, payment processing
  • default — most user-facing jobs
  • low — analytics, reporting, batch processing
  • mailers — outbound email

Then alert per queue:

  • critical latency > 60s → page (user-impacting)
  • default latency > 5 min → notification
  • low latency > 30 min → notification
  • Any queue size > 10× rolling 7-day average → notification
  • Dead set growing → page (jobs are giving up entirely)
  • Retry set > 1000 → notification (something's failing and the system is futilely retrying)

Sidekiq worker process health

  • Worker process count vs expected — alert if any worker has been gone for > 5 min
  • Per-worker memory (long-running Sidekiq processes bloat over time; a scheduled periodic restart is a normal mitigation, not a failure)
  • Worker concurrency setting vs actual busy count

See Job Queue Monitoring: Sidekiq, BullMQ, and SQS for the broader queue-monitoring picture.

Cron / scheduled jobs

Rails 8 has recurring jobs in SolidQueue. Sidekiq has sidekiq-cron. Either way: monitor that the job actually ran, not just that the scheduler thinks it should have.

Pattern: each scheduled job records its last successful run timestamp; an external monitor checks that timestamp is within the expected window. See Cron Job Monitoring.
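
A minimal sketch of that pattern (the job, class, and key names are hypothetical):

class NightlyReportJob < ApplicationJob
  def perform
    ReportBuilder.run # the actual work (hypothetical)
    # record the heartbeat only after the work succeeds
    Rails.cache.write("heartbeat:#{self.class.name}", Time.current)
  end
end

# in the health endpoint the external monitor polls:
def nightly_report_fresh?
  last = Rails.cache.read('heartbeat:NightlyReportJob')
  last.present? && last > 25.hours.ago # 24h schedule plus 1h grace
end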


Tier 4: ActiveRecord and Database

Rails apps' second-most-common failure mode (after Sidekiq) is the database — usually via slow queries, connection-pool exhaustion, or migrations gone wrong.

Connection pool exhaustion

Every Rails app has a connection pool (config/database.yml's pool value). Each thread that wants to query the DB takes a connection. If you have:

  • Puma: 5 threads × 3 workers = 15 threads
  • Sidekiq: 10 concurrency = 10 threads

Then each Puma worker process needs a pool of at least 5 (one connection per thread), the Sidekiq process needs at least 10, and the database itself must accept 3 × 5 + 10 = 25 connections from this app alone. Misconfiguring this is one of the most common causes of Rails outages.
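
The standard guard is to tie the pool to the process's thread count, as the default Rails database.yml template does:

# config/database.yml
production:
  <<: *default
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>

Sidekiq boots the same config, so the worker process's RAILS_MAX_THREADS must be at least its Sidekiq concurrency.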

Symptoms:

  • ActiveRecord::ConnectionTimeoutError in logs
  • Random request slowdowns with no SQL evidence

Monitor:

  • ActiveRecord::Base.connection_pool.stat (:busy, :dead, :waiting)
  • PostgreSQL / MySQL max_connections vs current
  • Alert on waiting > 0 for > 30 seconds
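
The pool stats come straight from ActiveRecord; a minimal gauge sketch (metric names illustrative):

stat = ActiveRecord::Base.connection_pool.stat
# => { size: 15, connections: 10, busy: 8, dead: 0, idle: 2, waiting: 1, checkout_timeout: 5 }
StatsD.gauge('ar.pool.busy',    stat[:busy])
StatsD.gauge('ar.pool.waiting', stat[:waiting]) # alert if this stays > 0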

Slow query log

PostgreSQL's pg_stat_statements and log_min_duration_statement capture slow queries. Tools like pganalyze, pgHero, and AWS RDS Performance Insights surface them.

Watch for:

  • New queries appearing at the top of the slow-query list after a deploy (likely a new code path that lacks an index)
  • Existing queries getting slower over time (data growth without index strategy review)
  • Sequential scans on large tables (missing index)

Migrations

The hidden Rails outage: a migration that locks a large table.

Patterns that bite:

  • ALTER TABLE statements that rewrite the table (on MySQL, run without ALGORITHM=INPLACE; on PostgreSQL, not broken into safe, batched steps)
  • Adding an index without CONCURRENTLY (locks writes)
  • Adding a NOT NULL column with a default value to a multi-million-row table on PostgreSQL versions before 11, where it rewrites the whole table
  • Backfilling a column inside the migration instead of in a separate job

Use Strong Migrations (the strong_migrations gem) to block migrations with known dangerous patterns before they ship. Monitor the deploy itself:

  • Migration runtime — alert if a single migration takes > 5 minutes
  • Lock-wait time on the deploy DB role — alert if waiting > 30s
  • Replica lag during a deploy — alert if replicas fall > 30s behind
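
For reference, the shape Strong Migrations pushes you toward for index adds (table and column names hypothetical):

class AddIndexToOrdersOnUserId < ActiveRecord::Migration[7.1]
  # CONCURRENTLY can't run inside a transaction
  disable_ddl_transaction!

  def change
    add_index :orders, :user_id, algorithm: :concurrently
  end
end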

See Website Migration Monitoring: Zero-Downtime Checklist for the broader migration-monitoring approach.

Database-specific monitoring

See Database Monitoring: MySQL, PostgreSQL, and Redis for the per-engine monitoring approach.


Tier 5: Cache (Redis / Memcached / Solid Cache)

Rails caching is opaque until it's broken. Hit rate is the headline number; the metrics below it tell you why it moved.

  • Cache hit rate — should be > 90% for a healthy app; sudden drop = cache layer rotated or invalidation bug
  • Eviction rate — items being kicked out before they expire = cache is too small
  • Memory usage — alert at 80% of max
  • Connection count — Redis has a maxclients limit; running into it breaks every Rails process at once
  • Latency — Redis p99 should be < 5ms; > 50ms means the network or the Redis instance is degraded
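
For Redis, hit rate comes from the INFO counters; note they are instance-wide, so dedicate an instance per role if you want a clean cache number. A minimal sketch assuming the redis gem:

require 'redis'

info   = Redis.new(url: ENV['REDIS_URL']).info('stats')
hits   = info['keyspace_hits'].to_f
misses = info['keyspace_misses'].to_f
StatsD.gauge('redis.hit_rate', hits / (hits + misses)) if hits + misses > 0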

Solid Cache (Rails 8 default, backed by the DB) shifts the failure mode — cache misses become DB load. Monitor cache table size, query latency on cache lookups, and consider partitioning if the table grows huge.


Tier 6: Action Cable / WebSockets

If your app uses Action Cable, you have a long-lived-connection failure mode that doesn't exist in a pure-REST app.

Monitor:

  • Active WebSocket connection count (per-server and aggregate)
  • Subscription count per channel
  • Message broadcast latency (broadcast → received-by-client time)
  • Pubsub backend (Redis usually) connection health
  • Per-worker WebSocket count — Action Cable runs in-process with Puma by default; many WS connections eat your Puma threads
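
The per-process count is a one-liner from inside each Puma process; aggregate across hosts in your metrics backend:

# e.g. from a recurring reporter thread started in an initializer
StatsD.gauge('action_cable.connections', ActionCable.server.connections.size)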

The most common Action Cable outage: WS connections leak (clients reconnect without the old connection being properly cleaned up) until Puma's thread pool is exhausted and HTTP requests start queueing.

See WebSocket Monitoring: Realtime Connection Uptime for the broader WS-monitoring approach.


Error Tracking

The Ruby ecosystem has a rich set of error trackers, and most teams settle on one:

  • Sentry — language-agnostic, strong source-map support for assets, OpenTelemetry-compatible
  • Honeybadger — Ruby-first, simple setup, great Rails coverage
  • AppSignal — Ruby + Elixir specialty; APM + errors in one
  • Bugsnag — language-agnostic; mature deduplication

What to look for:

  • Capture full request context (params, user ID, current_user roles, feature flags)
  • Capture Sidekiq job context (worker class, args, retry count)
  • Filter PII automatically — at minimum email and password fields
  • Alert on new error types (not just spikes in known errors) — a fresh stack trace right after a deploy is the leading indicator of a regression
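
For the PII point, Rails' built-in parameter filtering is the first line of defense, and the major trackers honor it. Extend the default list rather than replacing it:

# config/initializers/filter_parameter_logging.rb
Rails.application.config.filter_parameters += [
  :passw, :email, :secret, :token, :ssn, :credit_card
]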

APM Choice

The Ruby-friendly APM landscape:

  • Skylight — Ruby-native, lightweight, strongest "trace allocation" detail; lacks full distributed-tracing across non-Ruby services
  • AppSignal — Errors + APM + metrics + uptime in one; Ruby/Elixir focus
  • New Relic Ruby agent — heaviest agent but most mature; OTel-compatible in 2026
  • Datadog APM — language-agnostic; great if you already use Datadog
  • Honeybadger Insights — newer offering bundled with their error tracker
  • OpenTelemetry + open backend — see OpenTelemetry Monitoring; the opentelemetry-ruby auto-instrumentation is solid

Pick one. Avoid running two simultaneously — the overhead compounds.


Deploy and Release Health

Rails deploys are where most production incidents start. Build observability into the deploy itself:

  • Deploy markers — drop a marker in your APM and logs at every deploy so post-incident analysis can correlate
  • Asset host health — the config.asset_host CDN domain serves your compiled assets. If it's misconfigured or the CDN edge has a stale cert, your app loads but is unstyled. External monitoring catches this.
  • Asset fingerprint validation — after deploy, fetch the manifest and verify the new asset URLs are reachable
  • Rollback button always available — your deploy is only as good as how fast you can undo it; measure rollback time too
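
The fingerprint validation can be a post-deploy one-liner via rails runner; a sketch assuming config.asset_host is set so asset_url returns a full URL:

require 'net/http'

url = ActionController::Base.helpers.asset_url('application.css')
res = Net::HTTP.get_response(URI(url))
raise "Asset host unhealthy: #{res.code} for #{url}" unless res.is_a?(Net::HTTPSuccess)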

For the broader CI/CD signal see CI/CD Pipeline Monitoring. For framework counterparts, see Laravel Monitoring, Django Monitoring, and Next.js Monitoring.


Common Rails Outages (Real Patterns)

Recurring incident shapes we keep seeing:

  1. Sidekiq queue saturation on Monday morning. Weekend traffic was light, workers were scaled down by autoscaler, Monday traffic hits and queue depth balloons before scale-up reacts. Fix: floor the worker count and pre-scale on schedule.
  2. Long-running migration in business hours. "It only took 30 seconds locally" — but the local table had 10K rows and prod has 50M. Fix: Strong Migrations + always-run-EXPLAIN-against-production-shape policy.
  3. N+1 that creeps in over time. The user with 5 records is fine; the user with 5,000 records hits the wall. Fix: alert on per-action SQL count, not just duration.
  4. CDN asset-host expired or misconfigured. App loads, every CSS/JS asset 404s. Fix: external uptime check on the asset host.
  5. Action Cable connection leak. Slow climb in Puma backlog over hours/days with no obvious cause; restart fixes it. Fix: explicit connection-cleanup logic + monitoring on Action Cable connection count.
  6. Connection pool size mismatch after a Sidekiq concurrency bump. Bumped Sidekiq from 10 to 20, forgot to bump DB pool. Random ConnectionTimeoutError under load. Fix: keep RAILS_MAX_THREADS and pool size synchronized in env config.

Rails Monitoring Checklist

  • Puma backlog / pool capacity / worker count tracked
  • Per-worker memory tracked; alert at 80% of container limit
  • Request queue time tracked (LB → Puma)
  • p95 / p99 response time per controller
  • 5xx rate per controller
  • ActiveRecord SQL count per request endpoint over time (N+1 drift)
  • DB connection pool :waiting > 0 alerting
  • Slow query log reviewed weekly with new-entry alerting
  • Sidekiq queue size + latency tracked per queue
  • Sidekiq dead set size alerting (jobs giving up)
  • Sidekiq retry set size alerting (jobs in futile retry)
  • Sidekiq scheduled job last-run-timestamp monitoring
  • Cache hit rate, eviction rate, memory, connection count, p99 latency
  • Action Cable connection count, broadcast latency, channel subscription count
  • Migrations gated by Strong Migrations
  • Deploy markers in APM + logs
  • Asset host external uptime check
  • Error tracker capturing controller + Sidekiq context
  • APM in place; agents not duplicated
  • External uptime monitor on production hostname (multi-region)
  • Internal /internal/rails-health endpoint returning per-tier status for external monitoring
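
A minimal sketch of that endpoint; the route, header, and field names are all hypothetical, so shape the JSON to whatever your external monitor validates:

# app/controllers/internal/health_controller.rb
require 'sidekiq/api'

class Internal::HealthController < ActionController::API
  def show
    supplied = request.headers['X-Health-Token'].to_s
    return head :unauthorized unless ActiveSupport::SecurityUtils.secure_compare(supplied, ENV.fetch('HEALTH_TOKEN'))

    render json: {
      db_pool_waiting: ActiveRecord::Base.connection_pool.stat[:waiting],
      sidekiq_queues: Sidekiq::Queue.all.to_h { |q| [q.name, { size: q.size, latency: q.latency.round(1) }] }
    }
  end
end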

How Webalert Helps With Rails Production Monitoring

Webalert covers the external-monitoring layer:

  • HTTP monitoring — Public hostname, login flow, asset host, custom domain
  • Multi-region checks — Catch regional CDN / DNS issues your APM can't see
  • Internal health-endpoint monitoring — Hit /internal/rails-health with auth; validate JSON shape (Sidekiq queue depths, Puma backlog, DB pool waiting count)
  • SSL certificate monitoring — Asset host, custom domain, API subdomain
  • Response time alerts — Catch p95 climbing before it becomes an incident
  • Heartbeats for scheduled jobs — Cron jobs that don't ping in are alerted on
  • Status page — Communicate Sidekiq queue lag or deploy issues to customers
  • Multi-channel alerts — Email, SMS, Slack, Discord, Teams, webhooks
  • 1-minute check intervals — Outages detected within 60 seconds
  • 5-minute setup — Add hostnames, internal endpoints, set thresholds

See features and pricing.


Summary

  • Rails apps fail by gradual degradation, not loud crashes. The monitoring strategy that works is per-tier: process, request, job, database, cache, realtime.
  • The single most-overlooked tier is Sidekiq. Queue latency per queue is the leading indicator of every user-impact job-related incident.
  • ActiveRecord drift — N+1 patterns and connection-pool sizing — is the second most common Rails outage. Track per-action SQL count over time, not just duration.
  • Use ActiveSupport::Notifications to instrument what your APM doesn't auto-collect. The hooks are right there in Rails.
  • Strong Migrations gates the dangerous-migration class of incidents at PR review time. Combine with deploy-time migration runtime monitoring.
  • Action Cable connection leaks silently exhaust Puma threads. Monitor WS connection count even if WebSockets are a small part of your app.
  • Pair internal APM (Skylight, AppSignal, New Relic, Datadog, or OTel + an open backend) with external uptime monitoring. They catch different classes of failure.
  • Asset-host outages make your app look broken without ever touching your servers. External uptime checks on the CDN domain are non-negotiable.

A well-instrumented Rails app makes the difference between "the site is slow today" turning into a four-hour incident vs a five-minute Slack thread. Build the per-tier signal once, tune the thresholds gradually, and the next outage will surface in your dashboards before it surfaces in your inbox.


Catch Sidekiq saturation, N+1 drift, and asset-host outages before users do

Start monitoring with Webalert →


Written by

Webalert Team

The Webalert team is dedicated to helping businesses keep their websites online and their users happy with reliable monitoring solutions.
