
Set an alert to fire when CPU goes above 80% and you'll learn two painful lessons fast. First, it fires every afternoon during your normal traffic peak: a false alarm. Second, it stays silent at 3 a.m. when CPU sits at a "fine"-looking 50% that is actually double the usual load for that hour: a missed problem. Static thresholds treat a metric as if "normal" were a single fixed number. Real systems don't work that way: normal changes by hour, by day, by season, by deploy.
Anomaly detection is the answer to that mismatch. Instead of asking "did this cross a fixed line?", it asks "is this unusual for right now?" This guide explains what anomaly detection is, how it differs from static thresholds, the main techniques behind it, where it genuinely helps, and the pitfalls that trip teams up.
The Problem With Static Thresholds
A static threshold fires when a metric crosses a fixed value: error rate over 1%, latency over 500 ms, queue depth over 1,000. Static thresholds are simple, predictable, and the right tool for plenty of jobs. But they break down whenever "normal" isn't constant:
- Seasonality. Traffic, CPU, and request rates swing predictably by time of day and day of week. A threshold tuned for the daytime peak misses problems at night; one tuned for night screams all day.
- Growth. A threshold that fit last quarter's traffic is wrong after you've doubled.
- The "in-between" failures. The nastiest problems sit below the alert line but far above normal — a metric at half its limit that's still wildly abnormal for the moment.
- Threshold sprawl. Hundreds of metrics each need a hand-tuned number, and every one is a guess that drifts out of date.
The result is the worst of both worlds: too many false alarms (driving alert fatigue) and missed incidents.
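Both failure modes are easy to reproduce. The following is a minimal sketch, not a real alerting rule: the 80% limit comes from the opening example, while the function name and the sample readings are made up for illustration.

```python
# Hypothetical fixed limit, taken from the CPU example in the introduction.
STATIC_LIMIT = 80.0


def static_alert(cpu_percent, limit=STATIC_LIMIT):
    """Fire when the reading crosses a fixed line, regardless of time of day."""
    return cpu_percent > limit


# Made-up samples: (hour of day, CPU %, is there really a problem?)
samples = [
    (15, 85.0, False),  # ordinary afternoon peak
    (3, 50.0, True),    # double the usual 3 a.m. load
]

for hour, cpu, real_problem in samples:
    fired = static_alert(cpu)
    if fired and not real_problem:
        outcome = "false alarm"
    elif real_problem and not fired:
        outcome = "missed problem"
    else:
        outcome = "correct"
    print(f"{hour:02d}:00 cpu={cpu:.0f}% fired={fired} -> {outcome}")
```

The threshold has no notion of the hour, so it cannot tell an expected peak from an unexpected one.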
What Anomaly Detection Is
Anomaly detection learns what "normal" looks like for a metric and alerts when current behavior deviates from that learned pattern. Rather than comparing against a number you picked, it compares against the metric's own history — its typical range for this hour, this day, this trend.
Concretely, an anomaly detector builds an expected band around a metric (a forecast plus a tolerance) and flags points that fall outside it. A CPU reading of 50% might be perfectly normal at noon and a glaring anomaly at midnight, and the detector treats those two cases differently because it has learned the daily shape of the curve.
The promise is twofold: catch problems static thresholds miss (the abnormal-but-below-the-line cases), and cut false alarms from predictable swings the model already expects.
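To make the noon-versus-midnight case concrete, here is a minimal sketch of an expected band learned per hour of day. It assumes the simplest possible model (mean plus or minus k standard deviations for each hour); the function names, the value of k, and the sample history are illustrative assumptions, not a prescribed method.

```python
from collections import defaultdict
from statistics import mean, stdev


def build_hourly_bands(history, k=3.0):
    """history: iterable of (hour_of_day, value). Returns {hour: (low, high)}."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)

    bands = {}
    for hour, values in by_hour.items():
        if len(values) < 2:
            continue  # not enough history to estimate a spread
        center = mean(values)
        spread = stdev(values)
        bands[hour] = (center - k * spread, center + k * spread)
    return bands


def is_anomalous(bands, hour, value):
    """True when value falls outside the learned band for that hour.

    Hours without a learned band are reported as not anomalous.
    """
    if hour not in bands:
        return False
    low, high = bands[hour]
    return value < low or value > high


# Made-up history: noon hovers near 50% CPU, midnight near 20%.
history = [(12, v) for v in (48, 50, 52, 49, 51)]
history += [(0, v) for v in (19, 20, 21, 20, 22)]
bands = build_hourly_bands(history)

print(is_anomalous(bands, 12, 50))  # 50% at noon
print(is_anomalous(bands, 0, 50))   # 50% at midnight
```

The same 50% reading is inside the noon band and far outside the midnight band, which is exactly the distinction a single fixed threshold cannot express.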
How Anomaly Detection Works
Techniques range from simple statistics to machine learning. You rarely need the fancy end to get value.
- Statistical baselines. Compute a rolling mean and standard deviation; flag points more than N standard deviations away. Cheap, transparent, and effective for metrics that are roughly stable. Variants like median + MAD resist outliers better.
- Seasonal decomposition / forecasting. Models that explicitly learn daily and weekly cycles (Holt-Winters exponential smoothing and other time-series forecasters) predict the expected value for this moment and flag deviations. This is what handles "normal at noon, anomalous at midnight."
- Moving-window comparison. Compare the current window to the same window last week or to a trailing baseline. Simple and surprisingly robust against seasonality.
- Machine-learning models. Clustering, isolation forests, and neural forecasters can capture complex multi-metric patterns — at the cost of more data, tuning, and opacity.
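The statistical-baseline item above can be sketched in a few lines. This is an illustration under stated assumptions: the window is a plain list of recent values, N is taken as 3, and the function names are made up. The 1.4826 factor is the usual constant that scales MAD to match the standard deviation for normally distributed data.

```python
from statistics import mean, median, stdev


def zscore(window, value):
    """Distance of value from the window mean, in standard deviations."""
    sd = stdev(window)
    if sd == 0:
        return 0.0 if value == mean(window) else float("inf")
    return abs(value - mean(window)) / sd


def robust_zscore(window, value):
    """Same idea using median and MAD, which a few outliers barely move."""
    med = median(window)
    mad = median(abs(x - med) for x in window)
    if mad == 0:
        return 0.0 if value == med else float("inf")
    return abs(value - med) / (1.4826 * mad)


# Made-up window: steady around 100, with one old outlier (500) still in it.
window = [100, 102, 98, 101, 99, 100, 500, 101, 99, 100]
N = 3  # assumed cutoff: flag points more than N deviations away

print(zscore(window, 130) > N)         # mean/std version
print(robust_zscore(window, 130) > N)  # median/MAD version
```

The single outlier inflates the standard deviation so much that the mean-based score no longer flags 130, while the median and MAD version still does. That is the practical meaning of "resists outliers better."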
Most production anomaly detection is "smart baselining": a forecast of the expected range with a tolerance band, not a black-box neural net. Start simple; reach for ML only when simple methods demonstrably fall short.
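One simple form of smart baselining uses the same hour in previous weeks as the forecast and a percentage tolerance as the band. The sketch below assumes hourly data stored oldest first; the function names, the three-week lookback, and the 25% tolerance are illustrative choices, not recommendations.

```python
from statistics import median

HOURS_PER_WEEK = 24 * 7


def seasonal_naive_band(series, index, weeks=3, tolerance=0.25):
    """Expected band for series[index], from the same hour in prior weeks.

    Returns (low, high), or None when fewer than `weeks` earlier
    points exist.
    """
    past = [
        series[index - w * HOURS_PER_WEEK]
        for w in range(1, weeks + 1)
        if index - w * HOURS_PER_WEEK >= 0
    ]
    if len(past) < weeks:
        return None
    forecast = median(past)  # median shrugs off one odd week
    return forecast * (1 - tolerance), forecast * (1 + tolerance)


def outside_band(series, index, **kwargs):
    band = seasonal_naive_band(series, index, **kwargs)
    if band is None:
        return False  # no verdict without enough history
    low, high = band
    return not (low <= series[index] <= high)


# Made-up traffic: 180 req/s during 09:00-18:00, 100 req/s otherwise, 4 weeks.
series = [180 if 9 <= h % 24 < 18 else 100 for h in range(4 * HOURS_PER_WEEK)]
noon = 3 * HOURS_PER_WEEK + 12  # noon on the first day of week 4

print(outside_band(series, noon))  # normal noon traffic
series[noon] = 100                 # night-level traffic at noon
print(outside_band(series, noon))
```

A value of 100 is unremarkable at night but falls outside the band at noon, with no model more complicated than "look at last week."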
Where Anomaly Detection Shines
Anomaly detection earns its keep on metrics where normal genuinely moves:
- Traffic and throughput. Request rates with strong daily/weekly cycles — a static threshold can't fit them, but a seasonal model can flag a real drop or spike.
- The golden signals. Latency, traffic, errors, and saturation all benefit when their baselines shift with load.
- Slow degradations. A gradual creep that never trips a fixed line but is clearly abnormal versus history — exactly the kind of silent problem that erodes service quietly.
- Sudden drops. Orders, signups, or payments falling to zero at a time they're normally busy — a "too quiet" anomaly a high-watermark threshold would never catch.
- Fleets of metrics. When you have thousands of series and can't hand-tune each, learned baselines scale where manual thresholds don't.
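The "too quiet" case in the list above needs a lower bound that only applies when the period is normally busy. A minimal sketch, assuming an expected count is already available from some baseline (for example, the median of the same slot in prior weeks); the function name and both parameters are hypothetical.

```python
def too_quiet(observed, expected, min_expected=20.0, drop_ratio=0.2):
    """Flag a drop only when the time slot is normally busy.

    expected:     typical count for this slot, from any baseline.
    min_expected: below this, the slot is too quiet normally to judge.
    drop_ratio:   fire when observed is under this fraction of expected.
    """
    if expected < min_expected:
        return False
    return observed < expected * drop_ratio


print(too_quiet(observed=0, expected=120))    # zero orders at a busy hour
print(too_quiet(observed=0, expected=3))      # zero orders at a dead hour
print(too_quiet(observed=100, expected=120))  # ordinary variation
```

The `min_expected` guard keeps the check from firing every night, when zero orders is simply what normal looks like.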
The Pitfalls
Anomaly detection is not magic, and treating it that way causes its own outages of trust:
- It detects unusual, not bad. A traffic spike from a successful launch is an anomaly — and entirely good. Anomalies need context before they become alerts; otherwise you've just rebuilt alert fatigue with cleverer math.
- Cold starts and change points. A new metric has no history to learn from. A legitimate step change (a launch, a migration) looks anomalous until the model adapts — and a model that adapts too fast will quietly accept a slow regression as the "new normal."
- Opacity. "The model says it's weird" is a hard alert to act on at 3 a.m. Detectors that can't explain why something is anomalous slow down response.
- Tuning sensitivity is still tuning. You've traded picking thresholds for picking tolerance bands and confidence levels. The work doesn't vanish; it changes shape.
- It doesn't replace static thresholds. Some limits are absolute — disk at 100%, SLO error budget exhausted, certificate expired. Those want a hard line, not a learned band.
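Several of these pitfalls can be handled in the decision logic rather than in the model. The sketch below is one possible arrangement, with hypothetical names and values throughout: hard limits page directly, a cold-start guard suppresses verdicts until enough history exists, and an anomaly on its own only asks for investigation.

```python
def decide(value, band, hard_limit, history_points, min_history=336):
    """Return "page", "investigate", or "ok".

    band:           (low, high) learned range, or None if not yet learned.
    hard_limit:     absolute ceiling that always deserves a page.
    history_points: number of samples the band was learned from.
    min_history:    assumed minimum (336 = two weeks of hourly data).
    """
    if value >= hard_limit:
        return "page"         # absolute limit: no learning needed
    if band is None or history_points < min_history:
        return "ok"           # cold start: do not trust the band yet
    low, high = band
    if value < low or value > high:
        return "investigate"  # unusual, not necessarily bad
    return "ok"


print(decide(100, None, hard_limit=95, history_points=0))
print(decide(50, (10, 30), hard_limit=95, history_points=1000))
print(decide(50, (10, 30), hard_limit=95, history_points=10))
```

Keeping the hard limit as a separate branch means the absolute cases never depend on whether the learned band is any good.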
How Webalert Helps
Webalert focuses on detecting the deviations that matter for availability and response time, without burying you in noise:
- Baseline-aware response-time tracking that surfaces when your service is slow relative to its own norm, not just when it crosses an arbitrary millisecond line, using your own latency history as the reference.
- Sustained-deviation alerting that waits for a real, persistent change rather than firing on a single odd sample, so anomalies become actionable alerts instead of noise.
- Multi-region context so a deviation in one location is distinguished from a global problem before it pages anyone.
- Sensible defaults over endless tuning — meaningful detection out of the box, so you're not hand-maintaining a threshold for every check.
The goal isn't anomaly detection for its own sake; it's catching the abnormal-but-below-the-line problems early while keeping alerts trustworthy.
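The general idea of waiting for a persistent change can be illustrated with a small counter. This is a generic sketch and not Webalert's implementation, which is not described here; the class name and the choice of three consecutive samples are assumptions.

```python
from collections import deque


class SustainedDeviation:
    """Fire only after `required` consecutive out-of-band samples."""

    def __init__(self, required=3):
        self.required = required
        self.recent = deque(maxlen=required)

    def update(self, is_outside_band):
        """Record one sample's verdict; return True when the streak is full."""
        self.recent.append(bool(is_outside_band))
        return len(self.recent) == self.required and all(self.recent)


detector = SustainedDeviation(required=3)
verdicts = [True, False, True, True, True, False]
results = [detector.update(v) for v in verdicts]
print(results)  # [False, False, False, False, True, False]
```

A single odd sample never fires, and the alert clears as soon as one in-band sample breaks the streak.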
Summary
Static thresholds fail whenever "normal" isn't a fixed number — they false-alarm on predictable peaks and stay silent on problems that sit below the line but far above normal. Anomaly detection fixes this by learning a metric's expected pattern and flagging deviations from it, comparing against history rather than a hand-picked value. Techniques span simple statistical baselines, seasonal forecasting, and machine learning — and most of the value comes from the simpler end.
It shines on metrics where normal genuinely shifts: traffic, the golden signals, slow degradations, and sudden drops. But it detects unusual, not bad, struggles with cold starts and change points, and never fully replaces the hard limits that deserve a fixed line. Use anomaly detection to widen what you can catch, keep static thresholds for absolutes, and add context before any anomaly becomes a page.