ActiveRecord Monitoring: N+1, Slow Queries & Pool Exhaustion

ActiveRecord makes the database disappear. That is its biggest selling point and its single biggest production risk. Three patterns - N+1 queries, slow individual queries, and connection-pool exhaustion - account for the majority of Rails apps' production performance incidents. Every Rails team has lived all three.

This is the practical monitoring guide. How to detect each pattern, what to log, which gems pay rent, what to alert on, and - because Rails monitoring is usually internal - how external HTTP latency and 5xx monitoring acts as the canary that tells you the database is in trouble before your APM dashboard catches up.

For broader Rails production monitoring (Sidekiq, uptime, deploys), see Rails Monitoring: Production Uptime & Sidekiq. For the database itself, see PostgreSQL Monitoring and MySQL Monitoring.

The Three ActiveRecord Failure Modes

Almost every AR-related incident is one of:

N+1 queries - one query that should have been one becomes N+1. The page works, just slowly. Throughput silently halves.
Slow individual queries - one query, often after a deploy or data growth, suddenly takes 800 ms instead of 30 ms. Pages timeout. The query plan changed.
Connection-pool exhaustion - ActiveRecord::ConnectionTimeoutError. Requests pile up waiting for a connection that no longer comes. Often triggered by one of the above two, not by raw load.

Detect each in monitoring before users do.

Pattern 1: N+1 Queries

What N+1 actually is

# Controller
@orders = Order.where(user_id: current_user.id).limit(20)

# View
<% @orders.each do |order| %>
  <%= order.customer.name %>
<% end %>

One query selects 20 orders. Then the view triggers 20 more queries - one per order.customer. Total: 21 queries. Latency dominated by round-trip count, not by data size.

Fixes (the ones AR users keep forgetting)

Order.where(user_id: current_user.id).includes(:customer).limit(20)

includes picks preload or eager_load automatically based on the rest of the query.
preload(:customer) forces a separate query (SELECT ... FROM customers WHERE id IN (...)). Best when you do not filter on customer.
eager_load(:customer) forces a single LEFT OUTER JOIN. Best when you need to filter or order on customer.
joins(:customer) does not preload - it only adds the JOIN. The instances are still lazy. Easy to confuse.

Detection: `bullet`

bullet is the classic. In development and test:

# config/environments/development.rb
config.after_initialize do
  Bullet.enable        = true
  Bullet.bullet_logger = true
  Bullet.console       = true
  Bullet.rails_logger  = true
  Bullet.raise         = true # fail the test
  Bullet.add_safelist type: :n_plus_one_query,
                      class_name: 'User',
                      association: :profile # known-fine
end

Run RSpec with Bullet.raise = true. A new N+1 fails CI. This is the single highest-ROI guardrail in a Rails codebase.

Detection: `prosopite`

prosopite catches N+1s that bullet misses - particularly with scope_chain, polymorphic associations, and STI. Some teams run both.

Detection: `query_diet`

query_diet caps queries per request:

QueryDiet::Counter.threshold = 25

Any request exceeding 25 queries logs a warning and (optionally) raises. Useful when N+1s come from helper methods that bullet cannot trace.

Detection: log-tail pattern matching

For finding existing N+1s in production logs without deploying instrumentation:

# Sample production logs, count repeated queries per request
grep -B1 'Duration:' production.log | \
  awk -v RS= '/SELECT .* FROM "customers" WHERE/{c++} END{print c}'

If the same templated query fires 20 times in a 200 ms window, it is almost certainly an N+1.

What to log per request

db_query_count
db_query_duration_total_ms
unique_query_templates (with bind params stripped)
n_plus_one_suspected boolean (heuristic: same template count > 5)

These four numbers in your structured logs make N+1s detectable post-hoc without re-running tests.

Pattern 2: Slow Individual Queries

Set the threshold at the database

PostgreSQL:

# postgresql.conf
log_min_duration_statement = 200   # log every query > 200 ms
log_statement              = none  # do not log all DDL/DML
log_line_prefix            = '%t [%p] [%d] [%a] '

MySQL:

# my.cnf
slow_query_log         = 1
slow_query_log_file    = /var/log/mysql/slow.log
long_query_time        = 0.2
log_queries_not_using_indexes = 0   # noisy in many apps

Aggregate the slow log per template (pg_stat_statements, pt-query-digest, or a vendor APM). Targets:

p95 query time per template per hour.
Total time spent per template per hour ("top query by cost", which often differs from "top query by frequency").
New templates appearing for the first time after a deploy.

Set the threshold at the application

ActiveSupport::Notifications lets you instrument every query without monkey-patching:

ActiveSupport::Notifications.subscribe('sql.active_record') do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  next if event.payload[:name] == 'SCHEMA'

  if event.duration > 500
    Rails.logger.warn({
      slow_query_ms: event.duration.round,
      sql: event.payload[:sql],
      name: event.payload[:name],
      controller: ActiveSupport::CurrentAttributes::CurrentRequest&.controller,
      request_id: ActiveSupport::CurrentAttributes::CurrentRequest&.request_id,
    }.to_json)
  end
end

The advantage over DB-side logs: you get controller, request_id, and full context. Pipe to your log aggregator and alert when a single template crosses thresholds.

Plan changes after deploys

A query plan can change for two reasons:

Stats changed (autovacuum / ANALYZE ran, new histogram).
Schema changed (new index, dropped index, type change).

After every deploy, sample 20-50 of your hottest queries and run EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON) against them. Compare to the previous deploy's plan. If a query that used Index Scan last week is now Seq Scan, alert.

Hot index hygiene

Watch for idx_scan = 0 in pg_stat_user_indexes - unused indexes that still cost write latency.
Watch for seq_scan rising on large tables in pg_stat_user_tables.
Bloat: pg_stat_all_tables.n_dead_tup / n_live_tup > 0.2 means vacuum is behind.

For the full DB-side monitoring set, see PostgreSQL Monitoring.

Pattern 3: Connection Pool Exhaustion

The symptom

ActiveRecord::ConnectionTimeoutError:
  could not obtain a connection from the pool within 5.000s
  (waited 5.000 seconds);
  all pooled connections were in use

The math

Rails maintains a connection pool per process. The pool size is set in database.yml:

production:
  adapter: postgresql
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 5 } %>
  checkout_timeout: 5

Required relationship:

pool_size >= threads_per_process

…where threads_per_process is whichever is larger between:

Web: RAILS_MAX_THREADS (Puma threads per worker).
Background: Sidekiq's concurrency setting.

Additionally, PostgreSQL max_connections >= sum of all pool sizes across all processes across web workers, Sidekiq workers, console sessions, and the rails runner you forgot was open.

Why it explodes

It is almost never raw load. It is almost always one of:

A query slowed down (Pattern 2), so connections are held longer, so the pool drains.
A new endpoint does Model.connection.transaction { ... heavy work ... } and holds connections across slow external calls.
Background jobs spawned threads of their own and each thread grabbed a connection.
A long-running migration or ANALYZE is blocking.

Monitoring connection pool

ActiveRecord::Base.connection_pool.stat returns:

{
  size: 10,
  connections: 10,
  busy: 9,
  dead: 0,
  idle: 1,
  waiting: 0,
  checkout_timeout: 5.0
}

Emit these as metrics every 10 s. Alert when:

busy / size > 0.9 for more than 60 s.
waiting > 0 for more than 30 s.
checkout_timeout errors appear at all.

Sidekiq sizing pitfalls

Sidekiq uses one process with N threads. Each thread can independently use AR. So:

# sidekiq.yml
:concurrency: 20

…requires:

# database.yml
production:
  pool: <%= ENV.fetch("RAILS_MAX_THREADS") { 25 } %>  # >= sidekiq concurrency + buffer

A common silent failure: web sets pool to 5, Sidekiq concurrency to 25, both deploys go fine, then the first heavy job batch triggers pool timeouts because the Sidekiq processes share the same database.yml block.

Callback chain cost

after_save callbacks that touch other models can quietly turn one update into 10. Worse, callbacks running inside a transaction extend the duration the connection is held, increasing pool pressure. Audit your callback chains; prefer after_commit for side-effects.

ActiveSupport::Notifications: The Built-In APM

Rails ships with the instrumentation you need. The under-used events:

Event	What you get
`sql.active_record`	every query, duration, bind params
`process_action.action_controller`	controller, action, status, view ms, db ms, total ms
`start_processing.action_controller`	request begin, params
`perform.active_job` / `perform_job.sidekiq`	job class, duration
`cache_read.active_support` / `cache_write.active_support`	cache hit/miss, key
`instantiation.active_record`	record count instantiated per query

instantiation.active_record is interesting: it tells you when you loaded 5,000 ActiveRecord objects for a request that returned 20 to the user. That is a memory cost too, not just a CPU cost. Alert when a controller instantiates > 1,000 records per request.

ActiveSupport::Notifications.subscribe('instantiation.active_record') do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  next if event.payload[:record_count] < 1_000

  Rails.logger.warn({
    over_instantiation: true,
    count: event.payload[:record_count],
    class: event.payload[:class_name],
  }.to_json)
end

Counter Caches, Bulk Operations, and Other Easy Wins

Patterns that AR users discover late but every senior Rails dev pre-empts:

Counter caches

class Comment < ApplicationRecord
  belongs_to :post, counter_cache: true
end

Replaces post.comments.count (a SELECT COUNT(*)) with post.comments_count (a column read). Massive on list pages with N posts each showing a comment count.

`pluck` vs `map`

User.where(active: true).pluck(:email)   # 1 query, array of strings
User.where(active: true).map(&:email)     # 1 query but full instantiation

pluck skips ActiveRecord instantiation. Material for any list page.

`find_each` for large iterations

User.where(churned: false).find_each(batch_size: 1_000) do |u|
  # ...
end

Default each loads everything into memory. find_each paginates. Memory bloat is also a perf problem - GC pauses count as user-visible latency.

`insert_all` / `upsert_all`

For bulk inserts, AR 6+ provides insert_all (skips validations and callbacks - which is the point) and upsert_all. Orders of magnitude faster than .create! in a loop.

`update_all`

User.where(stale: true).update_all(stale: false) runs one SQL statement; the per-instance equivalent runs N. Use deliberately - it skips callbacks and timestamps.

External Monitoring As The Canary

Here is the part Rails teams underuse. External HTTP latency and 5xx monitoring is the most reliable early signal of AR trouble.

N+1 doubles? Average response time visibly rises. Your external latency probe sees it before your APM aggregates.
A slow query lands after a deploy? p95 response time on the affected endpoint jumps from 200 ms to 1.5 s. External probe picks it up.
Pool exhaustion? Requests start returning 500s with ConnectionTimeoutError. External probe sees a 5xx spike with characteristic timing.

You do not need a fancy APM to detect every AR regression. You need:

Per-endpoint synthetic HTTP checks from outside your VPC.
Response-time and 5xx alerts on those checks.
The discipline to investigate inside-out starting from the external alert, not waiting for it.

This is how teams without an APM budget still catch DB regressions. See 5xx Server Errors Explained for the diagnostic playbook once the alert fires, and TTFB Monitoring for the slow-but-not-broken case.

A Concrete Alert Set

Severity	Trigger	Window	Source
Critical	`ActiveRecord::ConnectionTimeoutError` rate > 0.1%	5 min	Application logs
Critical	p95 response time on `/checkout` > 1 s	5 min	External HTTP monitor
Critical	5xx rate > 1% on any monitored endpoint	5 min	External HTTP monitor
High	`db_query_count` p95 per request > 50 on a route	1 h	Structured logs
High	Single SQL template p95 > 500 ms	15 min	`sql.active_record`
High	New SQL template (never-before-seen) appears after deploy	1 h	Structured logs
Medium	`instantiation.active_record` `record_count` > 5,000 on any request	1 h	`instantiation.active_record`
Medium	Pool `busy / size > 0.9` sustained	60 s	`connection_pool.stat`
Info	Counter cache drift > 1% (background job comparing `count` vs `*_count`)	daily	Job

The combination of internal pool/query metrics and external latency / 5xx metrics is what closes the loop. Internal alone misses connectivity issues. External alone misses cause.

For the broader deploy/CI side - because most AR regressions ride in on a deploy - see CI/CD Pipeline Monitoring and Database Monitoring (MySQL, PostgreSQL, Redis).

The Pre-Deploy Checklist

Before merging a Rails PR that touches data access:

bullet/prosopite ran clean in CI
query_diet threshold not raised
EXPLAIN ANALYZE reviewed for any new or changed query on a large table
No where(...) immediately followed by each on a parent + child without includes
No update_all / delete_all without a covering index
Background jobs reviewed for new AR access (Sidekiq pool sizing still adequate)
Migrations split: schema change in one deploy, data backfill in next, code switch in third
New pg_stat_statements entries reviewed post-deploy

Post-Incident Worksheet

After every AR incident:

Capture the failing SQL template and its EXPLAIN ANALYZE.
Identify the deploy that introduced it (commit, PR, reviewer).
Determine which monitoring signal would have caught it earliest: external latency, internal query log, pool stats, bullet in CI?
If the answer is "bullet would have, but it was skipped" - turn on Bullet.raise = true.
If the answer is "external HTTP probe caught it within 30 s" - good, document the SOP.
Add a test that fails if the N+1 / slow path recurs.
Update the alerting threshold if necessary.

How Webalert Helps

Rails APMs are inside-out. Webalert is the outside-in counterpart:

External HTTP monitoring - the earliest signal that AR is in trouble. Catches the latency rise the moment users feel it.
5xx alerting - ConnectionTimeoutError shows up as 500. Webalert pages immediately.
Multi-region uptime - because if your DB primary is degraded in EU but read replicas are fine in US, your monitoring should know.
TLS / DNS / Domain checks - because the categorical reasons your healthcheck endpoint becomes unreachable are not always inside the app.
Content validation - catch the day a deploy serves your home page as a 200-OK Rails error page. See Response Body Validation.
Status page - communicate downtime cleanly to users while you fix the underlying query.

Example Webalert check tuned for Rails DB-perf canarying:

URL: https://example.com/products (a list endpoint that hits AR)
Method: GET
Expected status: 200
Must contain: a product name pattern you know is always present
Response time: under 800ms (alert on > 1.2s, page on > 2.5s)
Regions: US + EU
Frequency: 60 s

When this latency rises, your AR layer is almost certainly the cause. By the time your APM aggregate updates, Webalert has already paged.

Summary

Three failure modes own ActiveRecord production incidents: N+1, slow queries, pool exhaustion. Detect each with the right tool - bullet / prosopite in CI, pg_stat_statements and ActiveSupport::Notifications at runtime, connection_pool.stat plus tuned Sidekiq sizing in operations. And critically, pair internal AR monitoring with external HTTP latency and 5xx monitoring, because the external probe is the one signal that fires the moment your users feel pain - before any dashboard catches up.

Get this right and you stop being surprised by Rails performance. You catch the regression in the PR, in CI, or in the first 60 seconds of the deploy.

Catch ActiveRecord regressions in the first 60 seconds of a deploy

Start monitoring with Webalert ->

See features and pricing. No credit card required.