The Four Golden Signals of Monitoring Explained

Most monitoring setups fail in the same way: hundreds of metrics, dozens of dashboards, and still nobody can answer "is the service healthy right now?" The problem isn't too little data — it's no hierarchy. Google's Site Reliability Engineering team gave the industry that hierarchy in four words: latency, traffic, errors, and saturation. If you measure only four things about a user-facing system, measure these.

This guide explains the four golden signals from first principles — what each one means, why these four (and not others), how to measure each correctly, and how to turn them into alerts that fire on real problems instead of noise.

Where the Four Golden Signals Come From

The four golden signals were popularized by Google's Site Reliability Engineering book. The premise is simple: for a user-facing system, a small set of signals captures almost everything you need to detect and triage problems. They're deliberately symptom-oriented — they describe what users experience, not the internal mechanics. That's what makes them a starting point rather than another pile of host metrics.

The four signals are:

Latency — how long requests take.
Traffic — how much demand the system is under.
Errors — the rate of requests that fail.
Saturation — how "full" the system is.

Read together, they answer the questions that matter most during an incident: Is it slow (latency)? Is it failing (errors)? Is it overloaded (saturation)? And is the load unusual (traffic)?

Signal 1: Latency

Latency is how long it takes to serve a request — and the single most important rule is to separate the latency of successful requests from the latency of failed ones.

Why? A fast error is still an error, and it can poison your numbers in both directions. An outage where everything returns an instant 500 will look like a latency improvement if you average all requests together. Conversely, slow failures (timeouts) inflate latency in a way that's qualitatively different from slow successes.
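
To make the distortion concrete, here is a small Python sketch with made-up numbers. The 200 ms and 5 ms latencies and the 80% failure share are illustrative assumptions, not measurements. It compares the blended average against a success-only average during a simulated outage.

```python
from statistics import mean

def mean_latency_ms(requests):
    """Average latency over (latency_ms, status_code) pairs."""
    return mean(latency for latency, _ in requests)

# Hypothetical traffic: healthy requests take 200 ms, failures return in 5 ms.
healthy = [(200, 200)] * 100
outage = [(200, 200)] * 20 + [(5, 500)] * 80

blended_before = mean_latency_ms(healthy)   # all requests, normal operation
blended_during = mean_latency_ms(outage)    # all requests, during the outage
success_only_during = mean_latency_ms([r for r in outage if r[1] < 500])

print(blended_before, blended_during, success_only_during)
```

With these assumed numbers the blended mean falls from 200 ms to 44 ms while 80% of requests are failing, and the success-only mean stays at 200 ms. That is the case for keeping the two populations in separate series.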

How to measure it well:

Use percentiles, never averages. The average hides the tail, and the tail is what users feel. Track p50, p95, and p99 together. See latency percentiles explained for why averages lie.
Split success vs error latency into separate series.
Measure from outside the system too — server-side timing misses DNS, TLS, and network. TTFB monitoring captures the server side; outside-in checks capture the full path.

A good latency target is expressed as a percentile over a window: "99% of requests under 400ms over 30 days."
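
The measurement advice above could be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `percentile` uses the nearest-rank definition (one of several in common use), and `latency_report`, its field names, and the sample window are hypothetical. It also assumes the target is evaluated on successful requests only.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def latency_report(requests, threshold_ms=400, target=0.99):
    """Summarize (latency_ms, succeeded) pairs as separate success and error series."""
    success = [ms for ms, ok in requests if ok]
    errors = [ms for ms, ok in requests if not ok]

    def summarize(samples):
        return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)} if samples else {}

    # Assumption: the target applies to successful requests only.
    fast_enough = sum(1 for ms in success if ms < threshold_ms)
    return {
        "success": summarize(success),
        "error": summarize(errors),
        "target_met": bool(success) and fast_enough / len(success) >= target,
    }

# Illustrative window: 98 fast successes, 2 slow successes, 10 instant failures.
window = [(120, True)] * 98 + [(900, True)] * 2 + [(5, False)] * 10
report = latency_report(window)
print(report)
```

In this made-up window the median and p95 are both 120 ms, while p99 is 900 ms and the 400 ms target is missed. The slow tail only becomes visible at the high percentile.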

Signal 2: Traffic

Traffic measures demand on the system — how much it's being used. The right unit depends on the service: HTTP requests per second for a web app, transactions per second for a database, messages per second for a queue, concurrent sessions for a streaming service.

Traffic is the signal that gives the other three context. A spike in errors during a 10x traffic surge tells a very different story than the same error spike at normal load. On its own, traffic rarely pages anyone — but it's the denominator for error rate, the leading indicator of impending saturation, and the first place to look when something changes.

How to measure it well:

Pick a unit that reflects real user demand, not internal chatter (exclude health checks and retries where you can).
Watch for both surges (traffic spikes, viral events, DDoS) and drop-offs — a sudden fall in traffic often means something upstream is broken and users can't even reach you.
Chart it next to errors and saturation so correlations are obvious.
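
One way to turn these points into code is sketched below. The health-check paths, the log shape of `(timestamp, path)` pairs, and the surge and drop factors are all assumptions chosen for illustration; real values depend on the service.

```python
from collections import Counter

# Hypothetical health-check endpoints to exclude from user demand.
HEALTH_CHECK_PATHS = {"/healthz", "/ping"}

def requests_per_minute(log, excluded_paths=HEALTH_CHECK_PATHS):
    """Count requests per minute bucket from (unix_ts_seconds, path) pairs."""
    counts = Counter()
    for ts, path in log:
        if path in excluded_paths:
            continue  # internal chatter, not user demand
        counts[ts - ts % 60] += 1
    return dict(counts)

def traffic_state(current, baseline, surge_factor=3.0, drop_factor=0.5):
    """Compare current demand with a baseline and flag surges and drop-offs."""
    if baseline <= 0:
        return "unknown"
    ratio = current / baseline
    if ratio >= surge_factor:
        return "surge"
    if ratio <= drop_factor:
        return "drop"
    return "normal"

log = [(0, "/checkout"), (10, "/healthz"), (59, "/cart"), (60, "/checkout")]
print(requests_per_minute(log))
print(traffic_state(current=20, baseline=100))
```

The baseline here is a single number for simplicity. In practice it would likely come from the same hour on previous days, since most traffic is seasonal.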

Signal 3: Errors

Errors are the rate of requests that fail. This is usually the signal most directly tied to user pain, and it's less straightforward than it looks because "failure" has layers:

Explicit failures — 5xx responses, exceptions, RPC errors. See 5xx error rate monitoring.
Implicit failures — a 200 OK that returns the wrong content, a malformed body, or a partial result. These are the dangerous ones because naive monitoring counts them as success. See response body validation.
Policy failures — responses that are technically fine but violate a contract, like exceeding your latency SLO.

How to measure it well:

Track error rate as a ratio (errors ÷ total requests), not just an absolute count: 100 errors out of 1,000 requests is a crisis, while 100 out of 10 million is background noise.
Validate content, not just status codes, so "false green" failures get caught.
Distinguish client errors (4xx) from server errors (5xx) — see the HTTP status code guide — since they point to different root causes.
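
A rough sketch of these three rules together is shown below. The category names, the `required_text` check, and the sample responses are hypothetical. A real content check would validate whatever the endpoint is contractually supposed to return.

```python
from collections import Counter

def classify(status, body, required_text):
    """Label one response as 'explicit', 'implicit', 'client', or 'ok'."""
    if status >= 500:
        return "explicit"   # server-side failure visible in the status code
    if 400 <= status < 500:
        return "client"     # tracked separately: different root causes
    if 200 <= status < 300 and required_text not in body:
        return "implicit"   # "false green": success code, wrong content
    return "ok"

def error_rates(responses, required_text):
    """Return failure ratios over (status, body) pairs."""
    total = len(responses)
    if total == 0:
        return {"error_rate": 0.0, "client_error_rate": 0.0}
    counts = Counter(classify(s, b, required_text) for s, b in responses)
    return {
        "error_rate": (counts["explicit"] + counts["implicit"]) / total,
        "client_error_rate": counts["client"] / total,
    }

responses = [
    (200, "Order confirmed"),
    (200, ""),               # empty body behind a 200 OK
    (500, "Internal error"),
    (404, "Not found"),
]
rates = error_rates(responses, required_text="Order confirmed")
print(rates)
```

Status codes alone would report one failure in four here. Counting the empty 200 response as well brings the error rate to two in four.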

Signal 4: Saturation

Saturation measures how "full" your system is — how close a constrained resource is to the limit that will make it fall over. CPU, memory, disk I/O, connection pools, queue depth, thread pools: every system has a bottleneck resource, and saturation tracks headroom on it.

Saturation is the most forward-looking signal. Latency, traffic, and errors tell you what's happening now; saturation tells you what's about to happen. A queue depth climbing steadily toward its limit, or a connection pool at 95% utilization, is an outage with a countdown timer.

How to measure it well:

Find the actual bottleneck. It's rarely "CPU at 100%." More often it's a connection pool, a disk filling up, a thread pool, or a downstream rate limit.
Watch leading indicators: utilization percentage, queue length, and the rate of change. A resource at 70% and rising fast is more urgent than one steady at 85%.
Many systems degrade before they're 100% full — latency starts climbing as a resource approaches saturation, which is why this signal and latency move together near the edge.

A useful heuristic: set saturation alerts at the level where latency starts to degrade, not at 100% — by then it's already too late.
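
The claim that a fast-rising resource at 70% outranks a steady one at 85% can be checked with a simple linear projection. The sketch below assumes growth stays constant, which real systems rarely honor, and the resource names and growth rates are invented for the example.

```python
import math

def minutes_to_limit(utilization_pct, growth_pct_per_min, limit_pct=100.0):
    """Linear estimate of minutes until a resource reaches its limit."""
    if utilization_pct >= limit_pct:
        return 0.0
    if growth_pct_per_min <= 0:
        return math.inf  # flat or shrinking: no projected exhaustion
    return (limit_pct - utilization_pct) / growth_pct_per_min

def most_urgent(resources):
    """Pick the resource projected to hit its limit first."""
    return min(resources, key=lambda name: minutes_to_limit(*resources[name]))

# Hypothetical resources as (utilization %, growth in % per minute).
pools = {
    "db_connections": (70.0, 2.0),  # lower utilization, climbing fast
    "disk": (85.0, 0.0),            # higher utilization, steady
}
print(most_urgent(pools), minutes_to_limit(*pools["db_connections"]))
```

To follow the heuristic above, `limit_pct` could be set to the utilization at which latency begins to degrade rather than to 100.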

How the Four Signals Work Together

The power is in reading them as a set. The combination localizes the problem before you open a single trace:

Pattern	Likely diagnosis
Errors up, traffic normal, saturation low	Bad deploy, dependency failure, or bug
Latency up, saturation high, traffic high	Capacity problem — you're overloaded
Latency up, saturation low, traffic normal	Slow dependency, lock contention, or GC pauses
Traffic down sharply, errors up elsewhere	Something upstream is broken; users can't reach you
Saturation climbing, everything else fine	Impending outage — act before it tips

This is why the four golden signals are a triage framework, not just a dashboard. They turn "something's wrong" into "it's a capacity problem on the database connection pool" in seconds.
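
The pattern table can be encoded as a first-pass triage helper. The sketch below is a direct transcription of the five rows, with hypothetical input names and coarse categories. It is a starting hint for the on-call engineer, not a diagnosis engine.

```python
def triage(latency_up, errors_up, traffic, saturation):
    """Map golden-signal states to the likely diagnosis from the table.

    traffic: "down", "normal", or "high"
    saturation: "low", "high", or "climbing"
    """
    if traffic == "down":
        return "Something upstream is broken; users can't reach you"
    if errors_up and traffic == "normal" and saturation == "low":
        return "Bad deploy, dependency failure, or bug"
    if latency_up and traffic == "high" and saturation == "high":
        return "Capacity problem: you're overloaded"
    if latency_up and traffic == "normal" and saturation == "low":
        return "Slow dependency, lock contention, or GC pauses"
    if saturation == "climbing" and not latency_up and not errors_up:
        return "Impending outage: act before it tips"
    return "No matching pattern; investigate further"

print(triage(latency_up=True, errors_up=False, traffic="high", saturation="high"))
```

Combinations outside the table fall through to the last line on purpose, so the helper never guesses beyond the five documented patterns.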

Alerting on the Golden Signals

Having the signals isn't enough — bad thresholds on good signals still produce alert fatigue. A few principles:

Alert on symptoms, not causes. Page on "checkout latency p99 > 1s" (a user symptom), not on "CPU > 80%" (a cause that may or may not matter). The golden signals are symptom-oriented by design.
Alert on error-budget burn rate, not instantaneous spikes, so transient blips don't wake anyone. See SLOs and error budgets.
Page on latency and errors; ticket on saturation trends. Saturation usually gives you lead time, so it rarely warrants a 3am page unless it's about to tip.
Give every page a runbook so the on-call engineer knows what to do.
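
Burn rate is commonly defined as the observed error rate divided by the error budget, where the budget is 1 minus the SLO target. The sketch below pages only when both a long and a short window burn faster than a threshold, so a blip that has already recovered does not page. The 99.9% target and the 14.4 threshold follow a pattern described in Google's SRE Workbook for a 30-day window. Treat both, and the function names, as assumptions to tune.

```python
def burn_rate(error_rate, slo_target):
    """How many times faster than sustainable the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget

def should_page(long_window_error_rate, short_window_error_rate,
                slo_target=0.999, threshold=14.4):
    """Page only if both windows exceed the burn-rate threshold."""
    return (burn_rate(long_window_error_rate, slo_target) >= threshold
            and burn_rate(short_window_error_rate, slo_target) >= threshold)

# Sustained problem: 2% errors over the long window, 3% over the short one.
print(should_page(0.02, 0.03))
# Recovered blip: the long window is still elevated, the short window is clean.
print(should_page(0.02, 0.0005))
```

A burn rate of 1 means the budget would run out exactly at the end of the SLO window. The long window filters out noise, and the short window confirms the problem is still happening.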

Golden Signals vs RED vs USE

The golden signals aren't the only framework, and the others are complementary rather than competing:

RED method (Rate, Errors, Duration) is essentially the golden signals minus saturation: rate corresponds to traffic and duration to latency. It is optimized for request-driven services and microservices.
USE method (Utilization, Saturation, Errors) is resource-oriented — it's the saturation signal, expanded, for infrastructure.

A common pattern: use RED/golden signals for your services (the user-facing view) and USE for your resources (the infrastructure view). Together they cover both "are users happy?" and "is the box about to fall over?" The dedicated RED vs USE guide covers when to reach for each.

How Webalert Helps

Webalert measures the user-facing golden signals from outside your infrastructure — the way your customers actually experience them:

Latency — multi-region response-time checks with per-geography percentiles, not a single averaged number.
Errors — status-code and content validation, so "false green" failures are caught.
Traffic-aware context — correlate outages with demand and catch traffic drop-offs that signal an upstream break.
Symptom-based alerting with tunable thresholds and status pages to communicate during incidents.

Outside-in monitoring complements internal APM: your metrics show saturation inside the system; Webalert confirms whether users can actually reach it and how fast.

Summary

The four golden signals — latency, traffic, errors, and saturation — are the minimum viable monitoring for any user-facing system, and the fastest way to triage an incident. Measure latency as percentiles split by success and failure; measure traffic in real user-demand units; measure errors as a validated ratio; and measure saturation on the true bottleneck with leading indicators.

Read them together and they localize almost any problem in seconds. Alert on the symptoms (latency, errors) via burn rate, trend on saturation, and pair every page with a runbook. Start with these four, and only add more once they're not enough.
