
ActiveRecord makes the database disappear. That is its biggest selling point and its single biggest production risk. Three patterns - N+1 queries, slow individual queries, and connection-pool exhaustion - account for the majority of Rails apps' production performance incidents. Every Rails team has lived all three.
This is the practical monitoring guide. How to detect each pattern, what to log, which gems pay rent, what to alert on, and - because Rails monitoring is usually internal - how external HTTP latency and 5xx monitoring acts as the canary that tells you the database is in trouble before your APM dashboard catches up.
For broader Rails production monitoring (Sidekiq, uptime, deploys), see Rails Monitoring: Production Uptime & Sidekiq. For the database itself, see PostgreSQL Monitoring and MySQL Monitoring.
The Three ActiveRecord Failure Modes
Almost every AR-related incident is one of:
- N+1 queries - one query that should have been one becomes N+1. The page works, just slowly. Throughput silently halves.
- Slow individual queries - one query, often after a deploy or data growth, suddenly takes 800 ms instead of 30 ms. Pages timeout. The query plan changed.
- Connection-pool exhaustion -
ActiveRecord::ConnectionTimeoutError. Requests pile up waiting for a connection that no longer comes. Often triggered by one of the above two, not by raw load.
Detect each in monitoring before users do.
Pattern 1: N+1 Queries
What N+1 actually is
# Controller
@orders = Order.where(user_id: current_user.id).limit(20)
# View
<% @orders.each do |order| %>
<%= order.customer.name %>
<% end %>
One query selects 20 orders. Then the view triggers 20 more queries - one per order.customer. Total: 21 queries. Latency dominated by round-trip count, not by data size.
Fixes (the ones AR users keep forgetting)
Order.where(user_id: current_user.id).includes(:customer).limit(20)
includespickspreloadoreager_loadautomatically based on the rest of the query.preload(:customer)forces a separate query (SELECT ... FROM customers WHERE id IN (...)). Best when you do not filter oncustomer.eager_load(:customer)forces a singleLEFT OUTER JOIN. Best when you need to filter or order oncustomer.joins(:customer)does not preload - it only adds the JOIN. The instances are still lazy. Easy to confuse.
Detection: bullet
bullet is the classic. In development and test:
# config/environments/development.rb
config.after_initialize do
Bullet.enable = true
Bullet.bullet_logger = true
Bullet.console = true
Bullet.rails_logger = true
Bullet.raise = true # fail the test
Bullet.add_safelist type: :n_plus_one_query,
class_name: 'User',
association: :profile # known-fine
end
Run RSpec with Bullet.raise = true. A new N+1 fails CI. This is the single highest-ROI guardrail in a Rails codebase.
Detection: prosopite
prosopite catches N+1s that bullet misses - particularly with scope_chain, polymorphic associations, and STI. Some teams run both.
Detection: query_diet
query_diet caps queries per request:
QueryDiet::Counter.threshold = 25
Any request exceeding 25 queries logs a warning and (optionally) raises. Useful when N+1s come from helper methods that bullet cannot trace.
Detection: log-tail pattern matching
For finding existing N+1s in production logs without deploying instrumentation:
# Sample production logs, count repeated queries per request
grep -B1 'Duration:' production.log | \
awk -v RS= '/SELECT .* FROM "customers" WHERE/{c++} END{print c}'
If the same templated query fires 20 times in a 200 ms window, it is almost certainly an N+1.
What to log per request
db_query_countdb_query_duration_total_msunique_query_templates(with bind params stripped)n_plus_one_suspectedboolean (heuristic: same template count > 5)
These four numbers in your structured logs make N+1s detectable post-hoc without re-running tests.
Pattern 2: Slow Individual Queries
Set the threshold at the database
PostgreSQL:
# postgresql.conf
log_min_duration_statement = 200 # log every query > 200 ms
log_statement = none # do not log all DDL/DML
log_line_prefix = '%t [%p] [%d] [%a] '
MySQL:
# my.cnf
slow_query_log = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time = 0.2
log_queries_not_using_indexes = 0 # noisy in many apps
Aggregate the slow log per template (pg_stat_statements, pt-query-digest, or a vendor APM). Targets:
- p95 query time per template per hour.
- Total time spent per template per hour ("top query by cost", which often differs from "top query by frequency").
- New templates appearing for the first time after a deploy.
Set the threshold at the application
ActiveSupport::Notifications lets you instrument every query without monkey-patching:
ActiveSupport::Notifications.subscribe('sql.active_record') do |*args|
event = ActiveSupport::Notifications::Event.new(*args)
next if event.payload[:name] == 'SCHEMA'
if event.duration > 500
Rails.logger.warn({
slow_query_ms: event.duration.round,
sql: event.payload[:sql],
name: event.payload[:name],
controller: ActiveSupport::CurrentAttributes::CurrentRequest&.controller,
request_id: ActiveSupport::CurrentAttributes::CurrentRequest&.request_id,
}.to_json)
end
end
The advantage over DB-side logs: you get controller, request_id, and full context. Pipe to your log aggregator and alert when a single template crosses thresholds.
Plan changes after deploys
A query plan can change for two reasons:
- Stats changed (autovacuum /
ANALYZEran, new histogram). - Schema changed (new index, dropped index, type change).
After every deploy, sample 20-50 of your hottest queries and run EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) against them. Compare to the previous deploy's plan. If a query that used Index Scan last week is now Seq Scan, alert.
Hot index hygiene
- Watch for
idx_scan = 0inpg_stat_user_indexes- unused indexes that still cost write latency. - Watch for
seq_scanrising on large tables inpg_stat_user_tables. - Bloat:
pg_stat_all_tables.n_dead_tup/n_live_tup> 0.2 means vacuum is behind.
For the full DB-side monitoring set, see PostgreSQL Monitoring.
Pattern 3: Connection Pool Exhaustion
The symptom
ActiveRecord::ConnectionTimeoutError:
could not obtain a connection from the pool within 5.000s
(waited 5.000 seconds);
all pooled connections were in use
The math
Rails maintains a connection pool per process. The pool size is set in database.yml:
production:
adapter: postgresql
pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
checkout_timeout: 5
Required relationship:
pool_size >= threads_per_process
…where threads_per_process is whichever is larger between:
- Web:
RAILS_MAX_THREADS(Puma threads per worker). - Background: Sidekiq's
concurrencysetting.
Additionally, PostgreSQL max_connections >= sum of all pool sizes across all processes across web workers, Sidekiq workers, console sessions, and the rails runner you forgot was open.
Why it explodes
It is almost never raw load. It is almost always one of:
- A query slowed down (Pattern 2), so connections are held longer, so the pool drains.
- A new endpoint does
Model.connection.transaction { ... heavy work ... }and holds connections across slow external calls. - Background jobs spawned threads of their own and each thread grabbed a connection.
- A long-running migration or
ANALYZEis blocking.
Monitoring connection pool
ActiveRecord::Base.connection_pool.stat returns:
{
size: 10,
connections: 10,
busy: 9,
dead: 0,
idle: 1,
waiting: 0,
checkout_timeout: 5.0
}
Emit these as metrics every 10 s. Alert when:
busy / size > 0.9for more than 60 s.waiting > 0for more than 30 s.checkout_timeouterrors appear at all.
Sidekiq sizing pitfalls
Sidekiq uses one process with N threads. Each thread can independently use AR. So:
# sidekiq.yml
:concurrency: 20
…requires:
# database.yml
production:
pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 25 } %> # >= sidekiq concurrency + buffer
A common silent failure: web sets pool to 5, Sidekiq concurrency to 25, both deploys go fine, then the first heavy job batch triggers pool timeouts because the Sidekiq processes share the same database.yml block.
Callback chain cost
after_save callbacks that touch other models can quietly turn one update into 10. Worse, callbacks running inside a transaction extend the duration the connection is held, increasing pool pressure. Audit your callback chains; prefer after_commit for side-effects.
ActiveSupport::Notifications: The Built-In APM
Rails ships with the instrumentation you need. The under-used events:
| Event | What you get |
|---|---|
sql.active_record |
every query, duration, bind params |
process_action.action_controller |
controller, action, status, view ms, db ms, total ms |
start_processing.action_controller |
request begin, params |
perform.active_job / perform_job.sidekiq |
job class, duration |
cache_read.active_support / cache_write.active_support |
cache hit/miss, key |
instantiation.active_record |
record count instantiated per query |
instantiation.active_record is interesting: it tells you when you loaded 5,000 ActiveRecord objects for a request that returned 20 to the user. That is a memory cost too, not just a CPU cost. Alert when a controller instantiates > 1,000 records per request.
ActiveSupport::Notifications.subscribe('instantiation.active_record') do |*args|
event = ActiveSupport::Notifications::Event.new(*args)
next if event.payload[:record_count] < 1_000
Rails.logger.warn({
over_instantiation: true,
count: event.payload[:record_count],
class: event.payload[:class_name],
}.to_json)
end
Counter Caches, Bulk Operations, and Other Easy Wins
Patterns that AR users discover late but every senior Rails dev pre-empts:
Counter caches
class Comment < ApplicationRecord
belongs_to :post, counter_cache: true
end
Replaces post.comments.count (a SELECT COUNT(*)) with post.comments_count (a column read). Massive on list pages with N posts each showing a comment count.
pluck vs map
User.where(active: true).pluck(:email) # 1 query, array of strings
User.where(active: true).map(&:email) # 1 query but full instantiation
pluck skips ActiveRecord instantiation. Material for any list page.
find_each for large iterations
User.where(churned: false).find_each(batch_size: 1_000) do |u|
# ...
end
Default each loads everything into memory. find_each paginates. Memory bloat is also a perf problem - GC pauses count as user-visible latency.
insert_all / upsert_all
For bulk inserts, AR 6+ provides insert_all (skips validations and callbacks - which is the point) and upsert_all. Orders of magnitude faster than .create! in a loop.
update_all
User.where(stale: true).update_all(stale: false) runs one SQL statement; the per-instance equivalent runs N. Use deliberately - it skips callbacks and timestamps.
External Monitoring As The Canary
Here is the part Rails teams underuse. External HTTP latency and 5xx monitoring is the most reliable early signal of AR trouble.
- N+1 doubles? Average response time visibly rises. Your external latency probe sees it before your APM aggregates.
- A slow query lands after a deploy? p95 response time on the affected endpoint jumps from 200 ms to 1.5 s. External probe picks it up.
- Pool exhaustion? Requests start returning 500s with
ConnectionTimeoutError. External probe sees a 5xx spike with characteristic timing.
You do not need a fancy APM to detect every AR regression. You need:
- Per-endpoint synthetic HTTP checks from outside your VPC.
- Response-time and 5xx alerts on those checks.
- The discipline to investigate inside-out starting from the external alert, not waiting for it.
This is how teams without an APM budget still catch DB regressions. See 5xx Server Errors Explained for the diagnostic playbook once the alert fires, and TTFB Monitoring for the slow-but-not-broken case.
A Concrete Alert Set
| Severity | Trigger | Window | Source |
|---|---|---|---|
| Critical | ActiveRecord::ConnectionTimeoutError rate > 0.1% |
5 min | Application logs |
| Critical | p95 response time on /checkout > 1 s |
5 min | External HTTP monitor |
| Critical | 5xx rate > 1% on any monitored endpoint | 5 min | External HTTP monitor |
| High | db_query_count p95 per request > 50 on a route |
1 h | Structured logs |
| High | Single SQL template p95 > 500 ms | 15 min | sql.active_record |
| High | New SQL template (never-before-seen) appears after deploy | 1 h | Structured logs |
| Medium | instantiation.active_record record_count > 5,000 on any request |
1 h | instantiation.active_record |
| Medium | Pool busy / size > 0.9 sustained |
60 s | connection_pool.stat |
| Info | Counter cache drift > 1% (background job comparing count vs *_count) |
daily | Job |
The combination of internal pool/query metrics and external latency / 5xx metrics is what closes the loop. Internal alone misses connectivity issues. External alone misses cause.
For the broader deploy/CI side - because most AR regressions ride in on a deploy - see CI/CD Pipeline Monitoring and Database Monitoring (MySQL, PostgreSQL, Redis).
The Pre-Deploy Checklist
Before merging a Rails PR that touches data access:
-
bullet/prosopiteran clean in CI -
query_dietthreshold not raised -
EXPLAIN ANALYZEreviewed for any new or changed query on a large table - No
where(...)immediately followed byeachon a parent + child withoutincludes - No
update_all/delete_allwithout a covering index - Background jobs reviewed for new AR access (Sidekiq pool sizing still adequate)
- Migrations split: schema change in one deploy, data backfill in next, code switch in third
- New
pg_stat_statementsentries reviewed post-deploy
Post-Incident Worksheet
After every AR incident:
- Capture the failing SQL template and its
EXPLAIN ANALYZE. - Identify the deploy that introduced it (commit, PR, reviewer).
- Determine which monitoring signal would have caught it earliest: external latency, internal query log, pool stats, bullet in CI?
- If the answer is "bullet would have, but it was skipped" - turn on
Bullet.raise = true. - If the answer is "external HTTP probe caught it within 30 s" - good, document the SOP.
- Add a test that fails if the N+1 / slow path recurs.
- Update the alerting threshold if necessary.
How Webalert Helps
Rails APMs are inside-out. Webalert is the outside-in counterpart:
- External HTTP monitoring - the earliest signal that AR is in trouble. Catches the latency rise the moment users feel it.
- 5xx alerting -
ConnectionTimeoutErrorshows up as 500. Webalert pages immediately. - Multi-region uptime - because if your DB primary is degraded in EU but read replicas are fine in US, your monitoring should know.
- TLS / DNS / Domain checks - because the categorical reasons your healthcheck endpoint becomes unreachable are not always inside the app.
- Content validation - catch the day a deploy serves your home page as a 200-OK Rails error page. See Response Body Validation.
- Status page - communicate downtime cleanly to users while you fix the underlying query.
Example Webalert check tuned for Rails DB-perf canarying:
- URL:
https://example.com/products(a list endpoint that hits AR) - Method:
GET - Expected status:
200 - Must contain: a product name pattern you know is always present
- Response time: under 800ms (alert on > 1.2s, page on > 2.5s)
- Regions: US + EU
- Frequency: 60 s
When this latency rises, your AR layer is almost certainly the cause. By the time your APM aggregate updates, Webalert has already paged.
Summary
Three failure modes own ActiveRecord production incidents: N+1, slow queries, pool exhaustion. Detect each with the right tool - bullet / prosopite in CI, pg_stat_statements and ActiveSupport::Notifications at runtime, connection_pool.stat plus tuned Sidekiq sizing in operations. And critically, pair internal AR monitoring with external HTTP latency and 5xx monitoring, because the external probe is the one signal that fires the moment your users feel pain - before any dashboard catches up.
Get this right and you stop being surprised by Rails performance. You catch the regression in the PR, in CI, or in the first 60 seconds of the deploy.