
Rails apps fail in a particular flavor: not loud crashes, but slow creeping degradation. Sidekiq's queue grows from 100 to 100,000 over a weekend. An ActiveRecord N+1 you didn't notice in staging hits production and your p95 doubles. A migration that "took 30 seconds locally" runs for 40 minutes against a real table and locks reads. Long-running Action Cable connections pile up until your Puma workers exhaust their threads and new requests queue.
None of these look like "the website is down." Your /health endpoint returns 200 the whole time. Your CDN cache hit rate is great. Your status page is green. And yet — checkout latency is unusable, background jobs are running 6 hours behind, and the dashboard chart shows users abandoning at 2× the normal rate.
Rails monitoring isn't about uptime in the binary sense; it's about catching these gradual degradations before they become user-visible incidents. This guide covers the per-tier monitoring stack — process, request, job, database, cache — with the Rails-specific signals that matter most: ActiveRecord query patterns, Sidekiq queue health, Puma worker saturation, Action Cable connections, and zero-downtime deploys.
What Makes Rails Monitoring Different
Rails has more concurrency-and-state moving parts than most frameworks. A typical production Rails app is simultaneously:
- Web tier — Puma (most common), Unicorn (legacy), Falcon (fiber-based, newer); usually behind NGINX or a CDN
- Background tier — Sidekiq (most common), Resque, GoodJob, SolidQueue (Rails 8 default); pulling jobs from Redis or PostgreSQL
- Scheduled tier — Sidekiq-Cron, sidekiq-scheduler, whenever, or Rails' own recurring jobs in 8+
- Realtime tier — Action Cable on WebSockets; subscriptions, broadcasts, presence
- Cache tier — Redis (most common), Memcached, Solid Cache
- Database tier — usually PostgreSQL, sometimes MySQL; talking to it via the connection pool
Failures in any one of these tiers can take down user-facing functionality without showing up in the others. Sidekiq saturation doesn't drop your healthcheck. A bad ActiveRecord query doesn't drop your Sidekiq worker count. A Puma worker exhaustion event doesn't move the Sidekiq queue chart.
The monitoring strategy that works for Rails is per-tier — each layer has its own set of signals, alerts, and runbooks. Trying to monitor the whole stack with a single "is the website up?" check misses 90% of real outages.
Tier 1: Process and Worker Health (Puma / Unicorn / Falcon)
The web tier is where requests enter and where the most common silent failure — worker exhaustion — happens.
Puma metrics that matter
Puma exposes process state via the Puma::Stats API and via pumactl stats:
- booted_workers — current workers alive (should match config; fewer = workers crashed)
- running — threads currently spawned in each worker's pool
- backlog — queued requests waiting for a free thread (the canary metric — if this is climbing, you're saturated)
- pool_capacity — threads available across workers
- busy_threads — threads currently serving requests
The key alert: backlog > 0 sustained for > 1 minute = workers are exhausted, new requests are queueing. By the time backlog is climbing, users are seeing slowness.
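A minimal sketch of exporting those numbers, assuming Puma 5+ (which exposes Puma.stats_hash), clustered mode with preload_app!, and a StatsD-style client as a placeholder. In clustered mode the stats live in the master process, so this runs from config/puma.rb rather than a Rails initializer; key names can vary slightly across Puma versions, and Puma also supports writing this as a proper plugin.

# config/puma.rb (excerpt) — a reporting thread in the master process, not a plugin
before_fork do
  Thread.new do
    loop do
      stats   = Puma.stats_hash                     # master-level stats in clustered mode
      workers = stats[:worker_status] || []

      StatsD.gauge('puma.booted_workers', stats[:booted_workers].to_i)
      StatsD.gauge('puma.backlog',        workers.sum { |w| w.dig(:last_status, :backlog).to_i })
      StatsD.gauge('puma.pool_capacity',  workers.sum { |w| w.dig(:last_status, :pool_capacity).to_i })

      sleep 15
    end
  end
end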
Worker saturation patterns
- Slow external HTTP call in a request → thread tied up waiting → fewer threads available → backlog climbs
- Long-running ActiveRecord query → same effect, but blamed on the database
- Memory leak → workers grow until OOM-killed → fewer workers → backlog climbs
- Bug in concurrent-ruby pool initialization → threads deadlocked
- Action Cable connection leak → WebSocket workers (or threads, depending on adapter) tied up
Monitor:
- Puma worker count vs configured
- Per-worker memory growth (alert at 80% of container limit)
- Backlog and pool capacity (1-minute resolution)
- Request queue time (the time between when the request arrived at the load balancer and when Puma picked it up; should be < 50ms)
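A sketch of measuring that queue time in-app, assuming your load balancer stamps an X-Request-Start header (formats vary by proxy: some send epoch milliseconds, some prefix the value with t=) and a StatsD-style client. The middleware class is illustrative; insert it early in the stack so the delta reflects queueing, not other middleware.

class RequestQueueTime
  def initialize(app)
    @app = app
  end

  def call(env)
    if (raw = env['HTTP_X_REQUEST_START'])
      started  = raw.delete('t=').to_f
      started /= 1000.0 if started > 1_000_000_000_000   # normalize millisecond epochs to seconds
      queue_ms = [(Time.now.to_f - started) * 1000, 0].max
      StatsD.measure('rails.request.queue_time_ms', queue_ms)
    end
    @app.call(env)
  end
end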
Unicorn / Falcon
- Unicorn is per-process (no threading), so saturation looks like all workers busy on slow requests. Alert on total_busy >= worker_count for > 30s.
- Falcon is fiber-based and handles I/O concurrency very differently. The sentinel for issues: fiber pool exhaustion, which manifests as request stalls without CPU pressure.
Tier 2: Request Monitoring (Rails Controllers and Middleware)
Per-request monitoring is what every APM tool covers reasonably well. The Rails-specific gotchas to watch:
Use ActiveSupport::Notifications
Rails emits structured events for everything internally — process_action.action_controller, sql.active_record, cache_read.active_support, render_template.action_view, etc. Most APMs use these to build their dashboards. You can also subscribe directly in your own code:
# Forward the duration of every SQL query to your metrics backend
# (StatsD here stands in for whatever client you actually use)
ActiveSupport::Notifications.subscribe('sql.active_record') do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  StatsD.measure('rails.sql.duration', event.duration)   # duration is in milliseconds
end
This is how to instrument a metric the APM didn't auto-collect.
N+1 query detection
The classic Rails performance killer. Tools:
- Bullet — flags N+1 in development (don't run in prod; it has overhead and false positives in hot paths)
- prosopite — better N+1 detection that works in test and CI
- Skylight / AppSignal / New Relic / Sentry tracing — surfaces N+1 in production traces
The production-monitoring signal for N+1: a controller action whose SQL query count climbs over time as data grows. A user's dashboard with 5 records calls 5 queries; the same dashboard six months later with 500 records calls 500. Track sql.active_record count per request endpoint over time, not just total duration.
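A minimal sketch of that per-endpoint counter, assuming a StatsD-style client. One global subscriber increments a per-thread counter, and a small middleware resets and reports it per request; the class name, initializer path, and metric names are illustrative.

# config/initializers/sql_count_per_request.rb
ActiveSupport::Notifications.subscribe('sql.active_record') do |*args|
  payload = args.last
  next if payload[:name] == 'SCHEMA' || payload[:cached]   # skip schema loads and cached queries
  Thread.current[:sql_count] = Thread.current[:sql_count].to_i + 1
end

class SqlCountPerRequest
  def initialize(app)
    @app = app
  end

  def call(env)
    Thread.current[:sql_count] = 0
    response = @app.call(env)
    params = env['action_dispatch.request.path_parameters'] || {}
    if params[:controller]
      StatsD.gauge("rails.sql_count.#{params[:controller]}.#{params[:action]}", Thread.current[:sql_count])
    end
    response
  end
end

Rails.application.config.middleware.use SqlCountPerRequest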
Slow request alerting
- p95 response time per controller action
- p99 response time (the tail is where bad UX hides)
- Request queue time (Heroku-style X-Request-Start to Puma pickup)
- 5xx rate per controller — see 5xx Server Error Rate Monitoring
Memory bloat per request
A request that loads 100K ActiveRecord objects bloats the worker's heap, and Ruby rarely hands that memory back to the OS — the worker stays bloated for every request that follows. Track per-worker memory delta around request boundaries; alert on workers > 1.5× their post-boot RSS.
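A sketch of that delta check, assuming Linux (RSS is read from /proc; swap in the get_process_mem gem for portability) and that the value captured at initializer load is close enough to post-boot RSS. Names are illustrative.

# config/initializers/worker_rss_watch.rb
def worker_rss_mb
  File.read('/proc/self/status')[/VmRSS:\s+(\d+)/, 1].to_i / 1024.0   # VmRSS is reported in kB
end

BOOT_RSS_MB = worker_rss_mb

ActiveSupport::Notifications.subscribe('process_action.action_controller') do |*_args|
  rss = worker_rss_mb
  StatsD.gauge('puma.worker_rss_mb', rss)
  Rails.logger.warn("worker RSS #{rss.round}MB > 1.5x post-boot #{BOOT_RSS_MB.round}MB") if rss > BOOT_RSS_MB * 1.5
end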
Tier 3: Background Jobs (Sidekiq)
For most Rails apps, background jobs are where the silent failures live. The user submits a form, the controller responds with 200, the actual work happens in Sidekiq — and if Sidekiq is degraded, the user sees a stale UI for hours.
Sidekiq metrics that matter
Sidekiq exposes everything via Sidekiq::Stats and Sidekiq::Queue.all:
- Queue depth (size) — jobs waiting to be processed
- Queue latency (latency) — age of the oldest job in the queue (the leading indicator of saturation)
- Processed / failed counters
- Retry set size — jobs that have failed and are awaiting retry
- Dead set size — jobs that have permanently failed (after the retry policy)
- Scheduled set size — jobs queued for future execution
- Worker process count — alive workers
- Busy workers — currently processing
Queue-specific alerting
A single Sidekiq queue is rarely enough for production. Split by priority:
- critical — email confirmations, password resets, payment processing
- default — most user-facing jobs
- low — analytics, reporting, batch processing
- mailers — outbound email
Then alert per queue:
- critical latency > 60s → page (user-impacting)
- default latency > 5 min → notification
- low latency > 30 min → notification
- Any queue size > 10× rolling 7-day average → notification
- Dead set growing → page (jobs are giving up entirely)
- Retry set > 1000 → notification (something's failing and the system is futilely retrying)
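A sketch of pulling those numbers from Sidekiq's public API, run from a scheduled job or a small exporter process. The latency thresholds mirror the list above and are assumptions to tune; the StatsD calls are placeholders.

require 'sidekiq/api'

LATENCY_ALERTS = { 'critical' => 60, 'default' => 300, 'low' => 1800 }   # seconds, per the list above

stats = Sidekiq::Stats.new
StatsD.gauge('sidekiq.retry_set', stats.retry_size)
StatsD.gauge('sidekiq.dead_set',  stats.dead_size)
StatsD.gauge('sidekiq.scheduled', stats.scheduled_size)
StatsD.gauge('sidekiq.processes', Sidekiq::ProcessSet.new.size)

Sidekiq::Queue.all.each do |queue|
  StatsD.gauge("sidekiq.queue.#{queue.name}.size",    queue.size)
  StatsD.gauge("sidekiq.queue.#{queue.name}.latency", queue.latency)

  threshold = LATENCY_ALERTS[queue.name]
  Rails.logger.warn("#{queue.name} latency #{queue.latency.round}s") if threshold && queue.latency > threshold
end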
Sidekiq worker process health
- Worker process count vs expected — alert if any worker has been gone for > 5 min
- Per-worker memory (Sidekiq workers leak; periodic restart is normal — see "RSS at restart" pattern)
- Worker concurrency setting vs actual busy count
See Job Queue Monitoring: Sidekiq, BullMQ, and SQS for the broader queue-monitoring picture.
Cron / scheduled jobs
Rails 8 has recurring jobs in SolidQueue. Sidekiq has sidekiq-cron. Either way: monitor that the job actually ran, not just that the scheduler thinks it should have.
Pattern: each scheduled job records its last successful run timestamp; an external monitor checks that timestamp is within the expected window. See Cron Job Monitoring.
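A sketch of that pattern. The job class, cache key, and heartbeat URL are illustrative, and the ping only happens after the work succeeds, so a silently failing job stops pinging.

require 'net/http'

class NightlyReportJob < ApplicationJob   # hypothetical job
  queue_as :low

  def perform
    generate_reports   # the actual work (hypothetical method)

    # Only reached on success: record the timestamp for the internal health
    # endpoint and ping the external heartbeat monitor.
    Rails.cache.write('last_run:nightly_report', Time.current.iso8601)
    Net::HTTP.get(URI('https://example.com/heartbeat/nightly-report'))   # hypothetical URL
  end
end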
Tier 4: ActiveRecord and Database
Rails apps' second-most-common failure mode (after Sidekiq) is the database — usually via slow queries, connection-pool exhaustion, or migrations gone wrong.
Connection pool exhaustion
Every Rails app has a connection pool (config/database.yml's pool value). Each thread that wants to query the DB takes a connection. If you have:
- Puma: 5 threads × 3 workers = 15 threads
- Sidekiq: 10 concurrency = 10 threads
Then each Puma worker process needs a pool of at least 5 (one connection per thread), the Sidekiq process needs a pool of at least 10, and the database itself must allow 3 × 5 + 10 = 25 connections from this app alone. Misconfiguring this is one of the top-10 Rails outages.
Symptoms:
- ActiveRecord::ConnectionTimeoutError in logs
- Random request slowdowns with no SQL evidence
Monitor:
- ActiveRecord::Base.connection_pool.stat — :busy, :dead, :waiting
- PostgreSQL / MySQL max_connections vs current connection count
- Alert on waiting > 0 for > 30 seconds
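A sketch of sampling the pool from a recurring job or the internal health endpoint. The stat hash is ActiveRecord's real API (Rails 5.1+); the StatsD calls are placeholders.

stat = ActiveRecord::Base.connection_pool.stat
# => { size:, connections:, busy:, dead:, idle:, waiting:, checkout_timeout: }

StatsD.gauge('ar.pool.busy',    stat[:busy])
StatsD.gauge('ar.pool.waiting', stat[:waiting])
StatsD.gauge('ar.pool.size',    stat[:size])

# waiting > 0 means a thread is blocked on checkout, the precursor to
# ActiveRecord::ConnectionTimeoutError
Rails.logger.warn("#{stat[:waiting]} threads waiting on a DB connection") if stat[:waiting] > 0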
Slow query log
PostgreSQL pg_stat_statements and log_min_duration_statement capture slow queries. Tools like pganalyze, pgHero, and the AWS RDS Performance Insights dashboard make this surface-able.
Watch for:
- New queries appearing at the top of the slow-query list after a deploy (likely a new code path that lacks an index)
- Existing queries getting slower over time (data growth without index strategy review)
- Sequential scans on large tables (missing index)
Migrations
The hidden Rails outage: a migration that locks a large table.
Patterns that bite:
- ALTER TABLE that rewrites a large table — on MySQL without ALGORITHM=INPLACE, or on PostgreSQL without breaking the change into batched operations
- Adding an index without CONCURRENTLY (locks writes; see the sketch after this list)
- Adding a NOT NULL column with a default value to a multi-million-row table on older Postgres
- Backfilling a column inside the migration instead of in a separate job
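For reference, the safe shape of the index case (PostgreSQL-specific; table and column names are illustrative), which Strong Migrations will insist on:

class AddIndexToOrdersOnUserId < ActiveRecord::Migration[7.1]
  disable_ddl_transaction!   # CONCURRENTLY cannot run inside a transaction

  def change
    add_index :orders, :user_id, algorithm: :concurrently
  end
end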
Use Strong Migrations (a gem) to refuse to deploy migrations with known dangerous patterns. Monitor the deploy itself:
- Migration runtime — alert if a single migration takes > 5 minutes
- Lock-wait time on the deploy DB role — alert if waiting > 30s
- Replica lag during a deploy — alert if replicas fall > 30s behind
See Website Migration Monitoring: Zero-Downtime Checklist for the broader migration-monitoring approach.
Database-specific monitoring
See Database Monitoring: MySQL, PostgreSQL, and Redis for the per-engine monitoring approach.
Tier 5: Cache (Redis / Memcached / Solid Cache)
Rails caching is opaque until it's broken. Hit rate is the first number to look at.
- Cache hit rate — should be > 90% for a healthy app; sudden drop = cache layer rotated or invalidation bug
- Eviction rate — items being kicked out before they expire = cache is too small
- Memory usage — alert at 80% of max
- Connection count — Redis has a maxclients limit; running into it breaks every Rails process at once
- Latency — Redis p99 should be < 5ms; > 50ms means the network or the Redis instance is degraded
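A sketch of pulling those numbers straight from Redis INFO, assuming the redis gem and a StatsD-style client as a placeholder.

require 'redis'

info = Redis.new(url: ENV.fetch('REDIS_URL')).info
hits, misses = info['keyspace_hits'].to_f, info['keyspace_misses'].to_f

StatsD.gauge('redis.hit_rate',          hits / [hits + misses, 1].max)
StatsD.gauge('redis.evicted_keys',      info['evicted_keys'].to_i)
StatsD.gauge('redis.used_memory_bytes', info['used_memory'].to_i)
StatsD.gauge('redis.connected_clients', info['connected_clients'].to_i)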
Solid Cache (Rails 8 default, backed by the DB) shifts the failure mode — cache misses become DB load. Monitor cache table size, query latency on cache lookups, and consider partitioning if the table grows huge.
Tier 6: Action Cable / WebSockets
If your app uses Action Cable, you have a long-lived-connection failure mode that doesn't exist in a pure-REST app.
Monitor:
- Active WebSocket connection count (per-server and aggregate)
- Subscription count per channel
- Message broadcast latency (broadcast → received-by-client time)
- Pubsub backend (Redis usually) connection health
- Per-worker WebSocket count — Action Cable runs in-process with Puma by default; many WS connections eat your Puma threads
The most common Action Cable outage: WS connections leak (clients reconnect without the old connection being properly cleaned up) until Puma's thread pool is exhausted and HTTP requests start queueing.
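A sketch of making the leak visible before the thread pool is gone. ActionCable.server.connections reflects only the current server process, so aggregate across processes in your metrics backend; the reporter thread and initializer path are illustrative.

# config/initializers/action_cable_metrics.rb
Thread.new do
  loop do
    StatsD.gauge('action_cable.connections', ActionCable.server.connections.size)
    sleep 30
  end
end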
See WebSocket Monitoring: Realtime Connection Uptime for the broader WS-monitoring approach.
Error Tracking
Rails has a rich Ruby ecosystem of error trackers. Most teams use one:
- Sentry — language-agnostic, strong source-map support for assets, OpenTelemetry-compatible
- Honeybadger — Ruby-first, simple setup, great Rails coverage
- AppSignal — Ruby + Elixir specialty; APM + errors in one
- Bugsnag — language-agnostic; mature deduplication
What to look for:
- Capture full request context (params, user ID, current_user roles, feature flags)
- Capture Sidekiq job context (worker class, args, retry count)
- Filter PII automatically — at minimum email and password fields
- Alert on new error types (not just spikes in known errors) — a fresh stack trace right after a deploy is the leading indicator of a regression
APM Choice
The Ruby-friendly APM landscape:
- Skylight — Ruby-native, lightweight, strongest "trace allocation" detail; lacks full distributed-tracing across non-Ruby services
- AppSignal — Errors + APM + metrics + uptime in one; Ruby/Elixir focus
- New Relic Ruby agent — heaviest agent but most mature; OTel-compatible in 2026
- Datadog APM — language-agnostic; great if you already use Datadog
- Honeybadger Insights — newer offering bundled with their error tracker
- OpenTelemetry + open backend — see OpenTelemetry Monitoring; the opentelemetry-ruby auto-instrumentation is solid
Pick one. Avoid running two simultaneously — the overhead compounds.
Deploy and Release Health
Rails deploys are where most production incidents start. Build observability into the deploy itself:
- Deploy markers — drop a marker in your APM and logs at every deploy so post-incident analysis can correlate
- Asset host health — the config.asset_host CDN domain serves your compiled assets. If it's misconfigured or the CDN edge has a stale cert, your app loads but is unstyled. External monitoring catches this.
- Asset fingerprint validation — after deploy, fetch the manifest and verify the new asset URLs are reachable
- Rollback button always available — your deploy is only as good as how fast you can undo it; measure rollback time too
For the broader CI/CD signal see CI/CD Pipeline Monitoring. For framework counterparts, see Laravel Monitoring, Django Monitoring, and Next.js Monitoring.
Common Rails Outages (Real Patterns)
Recurring incident shapes we keep seeing:
- Sidekiq queue saturation on Monday morning. Weekend traffic was light, workers were scaled down by autoscaler, Monday traffic hits and queue depth balloons before scale-up reacts. Fix: floor the worker count and pre-scale on schedule.
- Long-running migration in business hours. "It only took 30 seconds locally" — but the local table had 10K rows and prod has 50M. Fix: Strong Migrations + always-run-EXPLAIN-against-production-shape policy.
- N+1 that creeps in over time. The user with 5 records is fine; the user with 5,000 records hits the wall. Fix: alert on per-action SQL count, not just duration.
- CDN asset-host expired or misconfigured. App loads, every CSS/JS asset 404s. Fix: external uptime check on the asset host.
- Action Cable connection leak. Slow climb in Puma backlog over hours/days with no obvious cause; restart fixes it. Fix: explicit connection-cleanup logic + monitoring on Action Cable connection count.
- Connection pool size mismatch after a Sidekiq concurrency bump. Bumped Sidekiq from 10 to 20, forgot to bump the DB pool. Random ConnectionTimeoutError under load. Fix: keep RAILS_MAX_THREADS and pool size synchronized in env config.
Rails Monitoring Checklist
- Puma backlog / pool capacity / worker count tracked
- Per-worker memory tracked; alert at 80% of container limit
- Request queue time tracked (LB → Puma)
- p95 / p99 response time per controller
- 5xx rate per controller
- ActiveRecord SQL count per request endpoint over time (N+1 drift)
- DB connection pool :waiting > 0 alerting
- Slow query log reviewed weekly with new-entry alerting
- Sidekiq queue size + latency tracked per queue
- Sidekiq dead set size alerting (jobs giving up)
- Sidekiq retry set size alerting (jobs in futile retry)
- Sidekiq scheduled job last-run-timestamp monitoring
- Cache hit rate, eviction rate, memory, connection count, p99 latency
- Action Cable connection count, broadcast latency, channel subscription count
- Migrations gated by Strong Migrations
- Deploy markers in APM + logs
- Asset host external uptime check
- Error tracker capturing controller + Sidekiq context
- APM in place; agents not duplicated
- External uptime monitor on production hostname (multi-region)
- Internal /internal/rails-health endpoint returning per-tier status for external monitoring
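A sketch of what that endpoint might return. The JSON keys and controller path are illustrative; put it behind a token or IP allowlist and point the external monitor at it.

# app/controllers/internal/health_controller.rb (hypothetical path)
require 'sidekiq/api'

module Internal
  class HealthController < ActionController::API
    def show
      render json: {
        db_pool_waiting: ActiveRecord::Base.connection_pool.stat[:waiting],
        sidekiq_dead:    Sidekiq::DeadSet.new.size,
        sidekiq_queues:  Sidekiq::Queue.all.to_h { |q| [q.name, { size: q.size, latency: q.latency.round(1) }] },
        cache_ok:        Rails.cache.write('health:ping', Time.now.to_i).present?,
        generated_at:    Time.current.iso8601
      }
    end
  end
end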
How Webalert Helps With Rails Production Monitoring
Webalert covers the external-monitoring layer:
- HTTP monitoring — Public hostname, login flow, asset host, custom domain
- Multi-region checks — Catch regional CDN / DNS issues your APM can't see
- Internal health-endpoint monitoring — Hit /internal/rails-health with auth; validate JSON shape (Sidekiq queue depths, Puma backlog, DB pool waiting count)
- SSL certificate monitoring — Asset host, custom domain, API subdomain
- Response time alerts — Catch p95 climbing before it becomes an incident
- Heartbeats for scheduled jobs — Cron jobs that don't ping in are alerted on
- Status page — Communicate Sidekiq queue lag or deploy issues to customers
- Multi-channel alerts — Email, SMS, Slack, Discord, Teams, webhooks
- 1-minute check intervals — Outages detected within 60 seconds
- 5-minute setup — Add hostnames, internal endpoints, set thresholds
Summary
- Rails apps fail by gradual degradation, not loud crashes. The monitoring strategy that works is per-tier: process, request, job, database, cache, realtime.
- The single most-overlooked tier is Sidekiq. Queue latency per queue is the leading indicator of every user-impact job-related incident.
- ActiveRecord drift — N+1 patterns and connection-pool sizing — is the second most common Rails outage. Track per-action SQL count over time, not just duration.
- Use ActiveSupport::Notifications to instrument what your APM doesn't auto-collect. The hooks are right there in Rails.
- Strong Migrations gates the dangerous-migration class of incidents at PR review time. Combine with deploy-time migration runtime monitoring.
- Action Cable connection leaks silently exhaust Puma threads. Monitor WS connection count even if WebSockets are a small part of your app.
- Pair internal APM (Skylight, AppSignal, New Relic, Datadog, or OTel + an open backend) with external uptime monitoring. They catch different classes of failure.
- Asset-host outages make your app look broken without ever touching your servers. External uptime checks on the CDN domain are non-negotiable.
A well-instrumented Rails app makes the difference between "the site is slow today" turning into a four-hour incident vs a five-minute Slack thread. Build the per-tier signal once, tune the thresholds gradually, and the next outage will surface in your dashboards before it surfaces in your inbox.