MTTR, MTBF, MTTF Explained: Reliability Metrics Every Team Should Track

Your site went down last Tuesday. It was back in 23 minutes. Good or bad?

Without a baseline, you can't answer that question. And without tracking the right metrics over time, you can't tell if your reliability is improving or getting worse.

MTTR, MTBF, and MTTF are the three metrics that answer the questions every engineering team needs to ask: How fast do we recover? How often do things break? How long do they last before failing?

This guide explains each metric in plain language, shows you how to calculate them, and covers how they connect to real-world monitoring and incident response.

The Three Metrics at a Glance

Metric	Stands for	What it measures	Goal
MTTR	Mean Time to Recovery	How long it takes to restore service after a failure	Lower is better
MTBF	Mean Time Between Failures	How often failures occur (for repairable systems)	Higher is better
MTTF	Mean Time to Failure	How long a non-repairable component lasts before failing	Higher is better

Think of it this way:

MTTR = how fast you fix things
MTBF = how often things break
MTTF = how long things last

MTTR — Mean Time to Recovery

MTTR is the average time it takes to restore a system or service after a failure. It is often the most actionable reliability metric because it reflects your team's ability to detect, respond to, and resolve incidents.

How to calculate MTTR

MTTR = Total downtime / Number of incidents

Example: Over the past quarter, your service had 4 incidents with the following downtimes: 45 minutes, 12 minutes, 30 minutes, and 25 minutes.

MTTR = (45 + 12 + 30 + 25) / 4 = 28 minutes

Your average recovery time is 28 minutes.
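
As a rough illustration, the same calculation can be sketched in Python. The function name `mean_time_to_recovery` is made up for this example; the downtimes are the four values from the quarter above.

```python
def mean_time_to_recovery(downtimes_minutes):
    """Return total downtime divided by the number of incidents."""
    if not downtimes_minutes:
        raise ValueError("need at least one incident to compute MTTR")
    return sum(downtimes_minutes) / len(downtimes_minutes)


# Downtime per incident, in minutes, taken from the example above.
quarter_downtimes = [45, 12, 30, 25]

mttr = mean_time_to_recovery(quarter_downtimes)
print(f"MTTR: {mttr:.0f} minutes")  # MTTR: 28 minutes
```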

What MTTR includes

MTTR covers the full recovery timeline:

Detection time — How long until you know something is wrong
Response time — How long until someone starts working on it
Diagnosis time — How long to identify the root cause
Resolution time — How long to implement the fix
Verification time — How long to confirm the fix works

Each phase is an opportunity to improve. If detection takes 15 of your 28 minutes, better monitoring has the biggest impact. If diagnosis takes the longest, better runbooks and observability tools will help more.
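
One way to find the biggest lever is to record how long each phase took and look at its share of the total. The sketch below assumes you log per-phase minutes for an incident. Only the 15-minute detection figure comes from the text; the other phase values are invented so the total is 28 minutes.

```python
# Minutes spent in each phase of a single 28-minute incident.
phase_minutes = {
    "detection": 15,    # from the text: 15 of 28 minutes
    "response": 3,      # the remaining values are illustrative assumptions
    "diagnosis": 5,
    "resolution": 4,
    "verification": 1,
}

total = sum(phase_minutes.values())
# The phase with the largest duration is where improvement pays off most.
slowest = max(phase_minutes, key=phase_minutes.get)

for phase, minutes in phase_minutes.items():
    print(f"{phase:<12} {minutes:>3} min  {minutes / total:.0%}")
print(f"Biggest lever: {slowest}")
```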

What good MTTR looks like

MTTR	Assessment
< 5 minutes	Excellent — automated detection and fast response
5–15 minutes	Good — strong monitoring and practiced incident response
15–60 minutes	Average — room for improvement in detection or diagnosis
1–4 hours	Below average — likely gaps in monitoring or on-call process
> 4 hours	Critical — significant process or tooling issues

How to reduce MTTR

Faster detection:

Use monitoring with 1-minute check intervals instead of 5 or 10
Monitor from multiple regions to catch localized failures
Set up alerts on response time degradation, not just total failure

Faster response:

Implement on-call rotations so there's always someone responsible
Use escalation policies — if the first responder doesn't acknowledge in 5 minutes, alert the next person
Send alerts to push channels (SMS, phone call), not just email or Slack
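
The escalation rule above can be sketched as a small function. This is a simplified model, not how any particular paging tool works: the role names and the function `current_escalation_target` are hypothetical, and the 5-minute timeout is the one from the example.

```python
def current_escalation_target(minutes_unacknowledged, chain, ack_timeout_minutes=5):
    """Pick who should be alerted after an alert has gone unacknowledged."""
    if not chain:
        raise ValueError("escalation chain must not be empty")
    # Move one step down the chain for every full timeout that has elapsed.
    level = int(minutes_unacknowledged // ack_timeout_minutes)
    # Stay on the last person once the chain is exhausted.
    return chain[min(level, len(chain) - 1)]


on_call_chain = ["primary", "secondary", "engineering-manager"]  # hypothetical roles

for waited in (0, 4, 5, 12, 60):
    target = current_escalation_target(waited, on_call_chain)
    print(f"{waited:>2} min unacknowledged -> alert {target}")
```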

Faster diagnosis:

Build health check endpoints that report which specific subsystem is broken
Use structured logging that makes it easy to search for the root cause
Maintain runbooks for common failure modes
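
A health check that names the broken subsystem might look like the sketch below. It only builds the report; wiring it into a web framework is left out. The check functions are simulated stand-ins (the database check always fails here), and every name is an assumption made for illustration.

```python
import json


def build_health_report(checks):
    """Run each named check and report the status of every subsystem."""
    subsystems = {}
    for name, check in checks.items():
        try:
            check()
            subsystems[name] = "ok"
        except Exception as exc:  # a failing check must not crash the endpoint
            subsystems[name] = f"failing: {exc}"
    healthy = all(status == "ok" for status in subsystems.values())
    return {"status": "ok" if healthy else "degraded", "subsystems": subsystems}


def check_database():
    # Simulated failure; a real check would run a trivial query.
    raise ConnectionError("connection refused")


def check_cache():
    # Simulated success; a real check would ping the cache.
    return None


report = build_health_report({"database": check_database, "cache": check_cache})
print(json.dumps(report, indent=2))
```

The responder reading this output sees immediately that the database is the failing subsystem, which shortens the diagnosis phase.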

Faster resolution:

Automate rollbacks for failed deployments
Practice incident response so the team knows the drill
Pre-define communication templates so status updates don't slow you down

MTBF — Mean Time Between Failures

MTBF measures the average time between the end of one failure and the start of the next. It applies to repairable systems — things you fix and put back into service.

How to calculate MTBF

MTBF = Total uptime / Number of failures

Example: Over 30 days (43,200 minutes), your service had 3 incidents. Total downtime was 90 minutes, so total uptime was 43,110 minutes.

MTBF = 43,110 / 3 = 14,370 minutes ≈ 9.98 days

On average, you can expect about 10 days between failures.
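
The same arithmetic as a short Python sketch, using the uptime-based formula above. The function name is illustrative.

```python
MINUTES_PER_DAY = 24 * 60


def mean_time_between_failures(period_minutes, total_downtime_minutes, failures):
    """Return total uptime divided by the number of failures."""
    if failures <= 0:
        raise ValueError("need at least one failure to compute MTBF")
    uptime = period_minutes - total_downtime_minutes
    return uptime / failures


# 30-day window, 90 minutes of downtime, 3 incidents (from the example above).
mtbf_minutes = mean_time_between_failures(30 * MINUTES_PER_DAY, 90, 3)
print(f"MTBF: {mtbf_minutes:.0f} minutes ({mtbf_minutes / MINUTES_PER_DAY:.2f} days)")
# MTBF: 14370 minutes (9.98 days)
```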

MTBF vs. uptime percentage

MTBF and uptime percentage tell different stories:

Scenario	Uptime	MTBF	What it means
1 outage of 43 min in 30 days	99.9%	30 days	Rare but significant failure
6 outages of 7 min in 30 days	99.9%	5 days	Frequent small failures

Both scenarios have 99.9% uptime, but the reliability experience is very different. MTBF reveals the frequency that uptime percentage hides.
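
To see the two scenarios side by side, the sketch below computes uptime percentage and uptime-based MTBF from a list of outage durations. The helper name `summarize` is an assumption for this example.

```python
MINUTES_PER_DAY = 24 * 60


def summarize(outage_minutes, period_days=30):
    """Return (uptime percentage, MTBF in days) for a list of outage durations."""
    period = period_days * MINUTES_PER_DAY
    uptime = period - sum(outage_minutes)
    uptime_pct = 100 * uptime / period
    mtbf_days = uptime / len(outage_minutes) / MINUTES_PER_DAY
    return uptime_pct, mtbf_days


scenarios = {
    "1 outage of 43 min": [43],
    "6 outages of 7 min": [7] * 6,
}

for label, outages in scenarios.items():
    uptime_pct, mtbf_days = summarize(outages)
    print(f"{label}: uptime {uptime_pct:.1f}%, MTBF {mtbf_days:.0f} days")
```

Both lines report 99.9% uptime, while MTBF comes out at roughly 30 days versus 5 days.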

How to improve MTBF

Fix root causes, not symptoms. If the same failure recurs, your post-incident review isn't leading to real fixes.
Monitor proactively. Catch degradation (rising response times, intermittent errors) before it becomes a full outage.
Reduce deployment risk. Use canary deploys, feature flags, and staged rollouts to catch bugs before they affect all users.
Automate recovery. Auto-restart crashed services, auto-scale during traffic spikes, auto-failover to healthy replicas.

MTTF — Mean Time to Failure

MTTF measures how long a non-repairable component operates before failing. It's most relevant for hardware (hard drives, power supplies, network switches) but also applies to software components that are replaced rather than repaired.

How to calculate MTTF

MTTF = Total operating time / Number of units that failed

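Example: You deployed 10 servers and observed them for 2 years (24 months). 2 experienced hardware failure: the first at 18 months, the second at 22 months. The other 8 were still running at the end of the period.

MTTF = (18 + 22 + 8 × 24) / 2 = 232 / 2 = 116 months

Total operating time includes the servers that have not failed yet. Averaging only the two failed units, (18 + 22) / 2 = 20 months, would understate how long a typical server lasts.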
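
A minimal Python sketch of the server example, computed two ways: averaging only the lifetimes of the units that failed, and dividing the fleet's total operating time (including the 8 servers still running after 24 months) by the number of failures. The second approach matches the "total operating time" formula above. The function name is illustrative.

```python
def mttf_from_fleet(failure_ages, survivors, observed_months):
    """Total operating time across all units divided by observed failures."""
    if not failure_ages:
        raise ValueError("need at least one failure to estimate MTTF")
    # Failed units ran until they died; survivors ran for the whole window.
    total_operating_time = sum(failure_ages) + survivors * observed_months
    return total_operating_time / len(failure_ages)


failure_ages = [18, 22]  # months at which the two failed servers died

failed_only_average = sum(failure_ages) / len(failure_ages)
fleet_estimate = mttf_from_fleet(failure_ages, survivors=8, observed_months=24)

print(f"Average life of failed units: {failed_only_average:.0f} months")  # 20 months
print(f"Fleet-wide MTTF estimate:     {fleet_estimate:.0f} months")       # 116 months
```

The gap between the two numbers shows how much the result depends on whether units that have not failed yet are counted.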

When MTTF matters for web services

For most web applications, MTTR and MTBF are more relevant than MTTF. But MTTF applies when you're thinking about:

SSL certificates — They have a fixed lifetime (90 days or 1 year). MTTF is literally the certificate validity period.
Hardware-dependent services — Bare-metal servers, dedicated database hosts.
Third-party services — You can't "repair" a vendor outage; you either wait or switch providers.
Software version lifecycle — How long before a dependency reaches end-of-life and must be replaced.

How the Metrics Relate to Each Other

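For repairable systems, the three metrics connect. This relationship uses the cycle-based definition of MTBF, measured from the start of one failure to the start of the next, so it comes out slightly larger than the uptime-only calculation shown earlier: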

MTBF = MTTF + MTTR

Or visually:

|←—— MTTF ——→|←— MTTR —→|←—— MTTF ——→|←— MTTR —→|
    Working      Down         Working       Down
|←————————— MTBF ——————————→|

MTTF is the working period before a failure
MTTR is the repair/recovery period
MTBF is the full cycle: working + recovery

In practice, for web services where MTTR is much smaller than MTTF, MTBF and MTTF are nearly equal. A service that fails once every 10 days and takes 30 minutes to recover has an MTBF of 10 days and an MTTF of about 9.98 days, which is functionally the same.
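
The relationship can be checked numerically. The sketch below uses the 10-day cycle and 30-minute recovery from the example, and also derives steady-state availability as MTTF / (MTTF + MTTR), a standard identity that is not spelled out in the text above.

```python
MINUTES_PER_DAY = 24 * 60

mtbf_cycle = 10 * MINUTES_PER_DAY  # full cycle: working + recovery, in minutes
mttr = 30                          # recovery time, in minutes

# Rearranging MTBF = MTTF + MTTR gives the working period.
mttf = mtbf_cycle - mttr

# Fraction of each cycle the service is up.
availability = mttf / (mttf + mttr)

print(f"MTTF: {mttf / MINUTES_PER_DAY:.2f} days")  # MTTF: 9.98 days
print(f"Availability: {availability:.3%}")         # Availability: 99.792%
```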

Tracking These Metrics in Practice

Where the data comes from

You need two things to calculate these metrics:

Incident timestamps — When each incident started and ended
Total observation period — The time window you're measuring

Monitoring tools provide both. Every time a monitor detects a failure and later detects recovery, that's one incident with a start time, end time, and duration.
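
Given those two inputs, MTTR and uptime-based MTBF fall out of a few lines of Python. The incident timestamps and the observation window below are invented for illustration; they are chosen to match the earlier MTBF example (3 incidents, 90 minutes of downtime, 30 days).

```python
from datetime import datetime, timedelta

# Hypothetical (start, end) pairs as a monitoring tool might export them.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 30)),
    (datetime(2024, 1, 12, 22, 15), datetime(2024, 1, 12, 22, 40)),
    (datetime(2024, 1, 25, 14, 0), datetime(2024, 1, 25, 14, 35)),
]
window_start = datetime(2024, 1, 1)
window_end = datetime(2024, 1, 31)  # 30-day observation period

# Sum the incident durations, starting from a zero timedelta.
downtime = sum((end - start for start, end in incidents), timedelta())
observed = window_end - window_start

mttr = downtime / len(incidents)
mtbf = (observed - downtime) / len(incidents)

print(f"MTTR: {mttr}")  # MTTR: 0:30:00
print(f"MTBF: {mtbf}")  # MTBF: 9 days, 23:30:00
```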

How often to measure

Weekly: Track MTTR for each incident to spot trends early.
Monthly: Calculate MTBF to see if failure frequency is improving.
Quarterly: Review all three metrics as part of a reliability review. Compare to previous quarters.

Segmentation matters

Don't calculate a single MTTR across all incidents. Break it down:

By severity — P1 incidents should have lower MTTR than P3s (faster response for critical issues).
By service — Your API might have different reliability than your marketing site.
By root cause — Database issues vs. deployment issues vs. third-party failures.
By time of day — Incidents during business hours often have faster MTTR than overnight ones.

This segmentation reveals where your process works well and where it needs attention.
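
Segmenting is mostly a group-by. The sketch below assumes each incident is a record with `severity`, `service`, and `downtime_min` fields; the field names and the sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical incident records.
incidents = [
    {"severity": "P1", "service": "api", "downtime_min": 10},
    {"severity": "P1", "service": "api", "downtime_min": 20},
    {"severity": "P3", "service": "marketing-site", "downtime_min": 60},
    {"severity": "P3", "service": "api", "downtime_min": 40},
]


def mttr_by(records, key):
    """Group incidents by the given field and average downtime per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["downtime_min"])
    return {group: sum(times) / len(times) for group, times in groups.items()}


print(mttr_by(incidents, "severity"))  # {'P1': 15.0, 'P3': 50.0}
print(mttr_by(incidents, "service"))
```

The overall MTTR for this sample is 32.5 minutes, which hides the fact that P1 incidents recover in 15 minutes on average and P3 incidents in 50.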

Common Mistakes When Using Reliability Metrics

Optimizing MTTR without fixing MTBF

If you're recovering faster but failing just as often, you're firefighting — not improving. The goal is to increase MTBF (fewer failures) while keeping MTTR low (fast recovery when failures happen).

Ignoring detection time in MTTR

Some teams measure MTTR from "engineer starts working" to "service restored." This ignores detection and response time, which are often the biggest portion. True MTTR starts when the incident begins, not when you notice it.

Comparing across different teams or services

A 15-minute MTTR for a simple marketing site and a 15-minute MTTR for a complex distributed payment system are not the same achievement. Compare metrics within the same service over time, not across services.

Using averages without understanding distribution

An average MTTR of 20 minutes might mean "every incident takes about 20 minutes" or "most take 5 minutes but one took 3 hours." Track percentiles (p50, p90, p99) alongside averages to understand the real distribution.
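
A quick sketch of why the average misleads. The data is invented: eleven 5-minute incidents and one 3-hour incident. The `percentile` helper uses the simple nearest-rank method; other percentile definitions interpolate and can give different values on small samples.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical recovery times in minutes: mostly fast, one 3-hour outlier.
recovery_minutes = [5] * 11 + [180]

mean = statistics.mean(recovery_minutes)
p50 = percentile(recovery_minutes, 50)
p90 = percentile(recovery_minutes, 90)
p99 = percentile(recovery_minutes, 99)

print(f"mean: {mean:.1f} min")  # mean: 19.6 min
print(f"p50: {p50} min, p90: {p90} min, p99: {p99} min")
# p50: 5 min, p90: 5 min, p99: 180 min
```

The mean suggests a typical incident takes about 20 minutes, while the percentiles show that nearly all incidents take 5 minutes and a single outlier drives the average.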

Setting targets without a baseline

Don't set an MTTR target of "under 15 minutes" if you've never measured your current MTTR. Measure first, then set incremental improvement targets.

How Monitoring Improves Every Metric

Monitoring is the foundation of all three metrics:

MTTR improvement

Monitoring directly reduces the detection phase of MTTR — often the longest phase. A 1-minute check interval means you know about a failure within 1-2 minutes instead of waiting for user complaints (which can take 30+ minutes).

Without monitoring	With monitoring
User reports issue (20 min)	Alert fires (1-2 min)
Team investigates (15 min)	Engineer responds (5 min)
Root cause found (20 min)	Health endpoint shows DB down (1 min)
Fix deployed (15 min)	Fix deployed (15 min)
Total: 70 min	Total: 22–23 min

MTBF improvement

Proactive monitoring catches degradation before it becomes an outage. Rising response times, intermittent errors, and SSL certificates approaching expiry are all early warnings. Fixing these prevents failures, which increases MTBF.

MTTF improvement

Monitoring reveals patterns. If servers consistently fail at 18 months, you know to replace them proactively at 15 months. If SSL certificates expire because auto-renewal failed, monitoring the renewal process prevents the next failure.

How Webalert Helps Track Reliability

Webalert provides the monitoring data that powers these metrics:

Incident detection with timestamps — Every outage is recorded with exact start and recovery times, giving you the data to calculate MTTR and MTBF.
1-minute check intervals — Minimize detection time, the biggest lever for reducing MTTR.
Multi-region checks — Catch failures faster by detecting them from the closest monitoring location.
Response time tracking — Spot degradation trends that predict future failures, improving MTBF.
SSL and DNS monitoring — Prevent entire categories of outages, directly increasing MTTF.
On-call scheduling and escalation — Reduce response time by ensuring someone is always available.
Uptime history and reporting — Historical data to calculate and trend all three metrics over time.

See features and pricing for the full details.

Summary

Three metrics, three questions:

MTTR — How fast do we recover? (Lower is better.)
MTBF — How often do things break? (Higher is better.)
MTTF — How long do things last before failing? (Higher is better.)

To improve them:

Reduce MTTR by monitoring with fast check intervals, setting up on-call rotations, and building health endpoints that pinpoint the broken subsystem.
Increase MTBF by fixing root causes (not just symptoms), monitoring proactively for degradation, and reducing deployment risk.
Increase MTTF by monitoring certificate expiry, tracking hardware age, and replacing components before they fail.

The teams that track these metrics improve them. The teams that don't are just hoping things get better.