Auth0, Okta, Clerk: Identity Provider Monitoring

Your application is up. Every health check is green. Your status page says 100% available. And no one can log in.

This is the IdP outage. Your product feels completely broken to users, even though your code is fine. The login page loads, the user types their password, the redirect to your identity provider happens — and then either the page hangs, returns a confusing error, or completes the round-trip with an unusable session.

For most B2B SaaS, identity is the single most consequential third-party dependency. Stripe down means new sign-ups can't pay. SendGrid down means transactional email is delayed. Auth0 / Okta / Clerk down means nobody can log in at all. The blast radius is closer to 100% than any other dependency in your stack.

And yet most teams monitor the identity provider exactly the way they monitor any other API: a periodic ping to a /health endpoint, maybe a status-page subscription. That misses the failure modes that actually matter — silent token correctness regressions, regional auth latency spikes, JWKS rotation problems, SSO connection drift — every one of which produces a "the product is broken" experience without a single 500 response in your logs.

This guide is the IdP-specific monitoring layer for Auth0, Okta, Clerk, AWS Cognito, Firebase Auth, Microsoft Entra ID, and the rest: what to actually watch, how to wire end-to-end synthetic login canaries, and how to alert without drowning in noise.

Why IdP Outages Are Uniquely Painful

Compared to almost any other dependency:

Total blast radius. A login outage affects 100% of authenticated traffic — every customer, every plan, every tier.
Indirect blame. Users see your login page hang or error. They blame your product, not Auth0 / Okta / Clerk. Trust damage is real.
No "graceful degradation." You can serve a degraded ecommerce page from cache. You cannot "kind of" log someone in.
Provider status pages lag reality. A regional Auth0 incident often shows green on status.auth0.com for 5–15 minutes after users start hitting it.
The failure mode is often correctness, not availability. Tokens are issued — but with stale claims, or wrong tenant ID, or missing roles. The API call returns 200; nothing catches it until users start filing tickets.

The job of IdP monitoring is to detect all four classes: regional latency degradation, partial-availability failures, status-page drift, and silent correctness regressions.

"Up" vs "Correct": The Distinction That Matters

The most-overlooked aspect of IdP monitoring is the difference between the IdP responded and the IdP responded correctly.

Layer	What it means	Failure mode	Detection
Reachable	Endpoint accepts TCP, responds	DNS / TLS / outage	HTTP uptime check
Available	API returns 2xx	Backend degraded	Status-code check
Performant	Latency is in baseline range	Regional slowdown	Response-time check
Correct	Token contains the right claims, signed by the right key	Wrong tenant, stale roles, missing scopes, JWKS drift	Synthetic login + token-validation
End-to-end	A full user flow succeeds	Anywhere across the chain	Synthetic browser canary

The first three are what most teams monitor. The last two are where the silent breakage hides. Without synthetic login canaries you cannot tell the difference between "Auth0 is fine" and "Auth0 issued a token but the audience claim is wrong because someone changed an Auth0 Rule yesterday."

What to Monitor for Any IdP

1) Login Success Rate (Real Users)

The single most important metric. Track per-method:

Password login success rate
Magic-link login success rate
Social login success rate (Google, Apple, GitHub, Microsoft)
SSO / SAML success rate
Passkey / WebAuthn login success rate
Token refresh success rate

Alert on:

Success rate < 98% for 5 minutes (notification)
Success rate < 90% for 2 minutes (page)
Drop > 2pp vs rolling 24h average for any login method (notification)

Why per-method matters: a Google-SSO outage at Google's side will drop the social-login number to zero while password login looks fine. You need to know which path is broken to respond correctly.
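A minimal sketch of how the three alert rules above could be evaluated per login method. It assumes you already collect one success-rate sample per minute for each method; `MethodWindow` and `classify` are hypothetical names, not part of any provider SDK.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MethodWindow:
    """Per-minute success rates (0.0-1.0) for one login method, oldest first."""
    method: str
    per_minute_rates: List[float]
    rolling_24h_rate: float


def _sustained_below(rates: List[float], threshold: float, minutes: int) -> bool:
    """True if each of the last `minutes` samples is below `threshold`."""
    if len(rates) < minutes:
        return False
    return all(r < threshold for r in rates[-minutes:])


def classify(window: MethodWindow) -> Optional[str]:
    """Return "page", "notify", or None for one login method."""
    rates = window.per_minute_rates
    # Below 90% for 2 minutes: page.
    if _sustained_below(rates, 0.90, 2):
        return "page"
    # Below 98% for 5 minutes: notification.
    if _sustained_below(rates, 0.98, 5):
        return "notify"
    # Latest sample more than 2 percentage points under the 24h average.
    if rates and (window.rolling_24h_rate - rates[-1]) > 0.02:
        return "notify"
    return None
```

Running `classify` once per method keeps a social-login collapse from being averaged away by healthy password traffic.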

2) OIDC Discovery Endpoint

Every modern IdP exposes /.well-known/openid-configuration. It returns the issuer, JWKS URI, authorization endpoint, token endpoint, supported scopes, etc.

Monitor:

HTTP availability (1-minute interval)
Response-time p95
Content validation — issuer and jwks_uri must match expected values
Any change to the returned JSON (could be legit or could be a misconfiguration)

A subtle failure mode: the discovery endpoint loads but its jwks_uri now points somewhere unreachable. Token verification breaks everywhere downstream.
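The content-validation check can be sketched with the standard library alone. This assumes you pin the expected issuer and jwks_uri in your monitor's configuration; `fetch_discovery` and `discovery_problems` are illustrative names.

```python
import json
import urllib.request
from typing import Dict, List


def fetch_discovery(issuer: str, timeout: float = 5.0) -> Dict:
    """Fetch the OIDC discovery document for an issuer URL."""
    url = issuer.rstrip("/") + "/.well-known/openid-configuration"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def discovery_problems(doc: Dict, expected: Dict[str, str]) -> List[str]:
    """Compare pinned fields against the fetched document."""
    problems = []
    for field, want in expected.items():
        got = doc.get(field)
        if got != want:
            problems.append(f"{field}: expected {want!r}, got {got!r}")
    return problems
```

An empty list means the pinned fields still match. Anything else is worth a notification, even if the change later turns out to be legitimate.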

3) JWKS Endpoint and Key Rotation

JWKS (jwks_uri) is the set of public keys your APIs use to verify JWTs. When the IdP rotates keys and your services keep verifying against a stale cached key set, every request carrying a token signed with the new key fails verification.

Monitor:

HTTP availability of jwks_uri
Number of keys returned (should be ≥ 1; usually 1–3 during a rotation window)
New kid (key ID) appearing — this is the leading indicator of a rotation. If your services cache JWKS for hours, you need them to refresh before the IdP cuts over to the new key.
Key removal: if a kid that was in use yesterday is gone today, tokens signed with that key that are still in circulation now fail verification
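Detecting new and removed key IDs comes down to a set difference between the previous poll and the current one. A small sketch, assuming the monitor stores the kid set from its last successful poll (`kids` and `diff_kids` are hypothetical helper names):

```python
from typing import Dict, Iterable, Set


def kids(jwks: Dict) -> Set[str]:
    """Extract the set of key IDs from a parsed JWKS document."""
    return {key["kid"] for key in jwks.get("keys", []) if "kid" in key}


def diff_kids(previous: Iterable[str], jwks: Dict) -> Dict[str, Set[str]]:
    """Compare the kid set from the last poll with the current JWKS."""
    prev = set(previous)
    current = kids(jwks)
    return {
        "added": current - prev,    # leading indicator of a rotation
        "removed": prev - current,  # tokens signed with these keys will now fail
        "current": current,
    }
```

Persist `current` after each poll so the next run has something to compare against.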

4) Token Endpoint Latency and Errors

The OAuth /oauth/token endpoint (or equivalent) is the most-hit IdP endpoint in steady-state traffic. Watch:

p50, p95, p99 latency
Error rate by error code (invalid_grant, invalid_client, rate_limit_exceeded)
Token issuance success rate

Alert on p95 > 2× rolling 7-day baseline for 10 minutes — most token-endpoint slowdowns are regional and clear up, but a sustained 2× is usually the start of a real incident.
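As a rough illustration of the latency rule, the sketch below computes a nearest-rank p95 and checks whether it has stayed above 2× the baseline for ten consecutive one-minute windows. The function names are hypothetical, and how you derive the 7-day baseline is left to your metrics store.

```python
from typing import List, Sequence


def p95(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def sustained_slowdown(per_minute_p95: List[float], baseline_p95: float,
                       factor: float = 2.0, minutes: int = 10) -> bool:
    """True if every one of the last `minutes` p95 values exceeds factor x baseline."""
    recent = per_minute_p95[-minutes:]
    return len(recent) == minutes and all(v > factor * baseline_p95 for v in recent)
```

Requiring every window in the period to breach the threshold is what keeps short regional blips from alerting.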

5) Session Refresh and Cookie Rotation

For session-based flows (Clerk, Cognito, NextAuth):

Session refresh success rate
Session cookie rotation cadence (alert if cookies stop rotating — could be a cookie domain bug)
Cross-site cookie behavior (SameSite, Secure, partitioned cookies) — third-party cookie restrictions break SSO flows in subtle ways
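One way to catch cookie-attribute drift is to have the canary inspect the Set-Cookie value it gets back. The sketch below uses Python's standard `http.cookies` parser and checks only Secure and SameSite. The expected SameSite value is an assumption you should set per flow, and partitioned cookies are not covered here.

```python
from http.cookies import SimpleCookie
from typing import List


def cookie_problems(set_cookie_value: str, name: str,
                    expected_same_site: str = "lax") -> List[str]:
    """Report missing or unexpected attributes on one session cookie."""
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    if name not in jar:
        return [f"cookie {name!r} not set"]
    morsel = jar[name]
    problems = []
    if not morsel["secure"]:
        problems.append("missing Secure")
    same_site = morsel["samesite"].lower()
    if same_site != expected_same_site.lower():
        problems.append(
            f"SameSite is {same_site or 'unset'}, expected {expected_same_site}"
        )
    return problems
```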

6) MFA Challenge Success Rate

If your IdP supports MFA, monitor each challenge type:

TOTP success rate
SMS challenge delivery + success rate
Push notification (Auth0 Guardian, Okta Verify, Microsoft Authenticator) success rate
Passkey / WebAuthn success rate
Email-magic-link success rate

A drop in SMS or push success is almost always caused by the third party delivering the challenge (Twilio for SMS, FCM/APNs for push), not by the IdP itself. See Third-Party Dependency Monitoring.

7) Tenant / Region Health (Multi-Tenant SaaS)

Auth0 tenants, Okta orgs, and Clerk applications are isolated by region. A failure in us-west-2 doesn't affect eu-central-1 — but your monitoring needs to know that.

Run synthetic checks from each region your tenants are in
Tag every alert with tenant and region
For B2B SaaS with per-customer SAML connections, monitor each connection separately — one customer's IdP can break without affecting others

See Multi-Tenant SaaS Monitoring: Per-Customer Uptime for the broader pattern.

8) Provider Status Page (Lagging Indicator Only)

Subscribe to the IdP's status-page RSS, but treat it as the lagging indicator. Your synthetic canary will see real incidents 5–15 minutes before the status page does. If your monitoring is only the provider's status page, you'll find out about the outage from your customers first.

Per-Provider Notes

Auth0

Tenant regions: US, EU, AU, JP, CA. Each region is isolated; status-page incidents are usually scoped to one region.
Custom domains: auth.yourapp.com adds a layer to monitor — DNS, custom-domain SSL cert, Auth0 routing
Rules / Actions / Hooks: code that runs during the auth flow. A slow Action adds latency to every login. Monitor with the Actions Latency metric in the Auth0 dashboard or via synthetic canary
Anomaly detection rules: legitimate users sometimes get blocked. Monitor the "blocked" rate alongside the success rate
Failure mode to watch: rule changes deployed yesterday silently broke the email_verified claim — synthetic canary that validates claims catches this

Okta

Universal Directory + Authentication policies: split between identity store and access policies. A policy change can lock out a subset of users while everyone else logs in fine
Workflows: similar to Auth0 Actions — monitor execution time and failure rate
Multiple issuers: Okta apps can use the Okta Org Authorization Server or a Custom Authorization Server. Make sure your monitor validates the right issuer
API rate limits: Okta enforces hard per-org rate limits. A traffic spike or a misbehaving integration eats the budget and breaks logins for everyone. Monitor X-Rate-Limit-Remaining headers — see API Rate Limit Monitoring
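For the rate-limit item, a hedged sketch of turning Okta's response headers into a headroom fraction. It relies on X-Rate-Limit-Limit alongside the X-Rate-Limit-Remaining header mentioned above. The 20% warning threshold and the function names are assumptions to tune.

```python
from typing import Mapping, Optional


def rate_limit_headroom(headers: Mapping[str, str]) -> Optional[float]:
    """Fraction of the rate-limit budget still available, or None if unknown."""
    lowered = {k.lower(): v for k, v in headers.items()}
    try:
        limit = int(lowered["x-rate-limit-limit"])
        remaining = int(lowered["x-rate-limit-remaining"])
    except (KeyError, ValueError):
        return None
    if limit <= 0:
        return None
    return remaining / limit


def should_warn(headers: Mapping[str, str], min_headroom: float = 0.2) -> bool:
    """True when less than `min_headroom` of the budget is left."""
    headroom = rate_limit_headroom(headers)
    return headroom is not None and headroom < min_headroom
```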

Clerk

__session cookie: short-lived session token rotated frequently. Failures here look like users mysteriously logging out
Webhooks: Clerk drives user-lifecycle events (user.created, session.created) via webhooks. If webhooks fail, your downstream user record falls out of sync — see Webhook Monitoring
Frontend SDK: the React/Next.js SDK has its own failure modes (JS bundle blocked, CSP header missing). Synthetic browser checks catch what API monitoring misses

AWS Cognito

User Pools vs Identity Pools: different APIs, different failure modes. Most apps care about User Pools.
Hosted UI: extra layer of HTML/CSS/JS Cognito serves. Branding customizations occasionally break with a Cognito release
Lambda triggers: pre-signup, post-confirmation, pre-token-generation Lambdas all run inline with auth. A slow trigger slows every login. Monitor Lambda metrics for these specifically
Regional: a User Pool lives in exactly one AWS region, so monitor each region's pool separately to catch regional outages

Firebase Auth

Google Cloud Identity Platform is the underlying service. Status updates land at status.cloud.google.com, not Firebase's own page
Quota errors: free-tier and paid quotas exist; running into the daily quota silently breaks logins for the rest of the day
Identity provider config drift: SAML / OIDC providers configured in the Firebase console can be edited by anyone with access — monitor the config or alert on changes

Microsoft Entra ID (formerly Azure AD)

Sign-in logs: Entra exposes a queryable sign-in log. Treat it as a near-real-time stream of sign-in failures with reason codes
Conditional Access policies: similar to Okta policies — can unintentionally lock out a subset of users
B2C tenants: separate from B2B (workforce) tenants; consumer-facing flows have their own latency profile

Supabase Auth

See Supabase & Firebase Backend Monitoring for the full Supabase angle

SAML vs OIDC Monitoring

Most enterprise SSO is SAML. Most consumer / modern SaaS auth is OIDC (OpenID Connect). They fail differently.

OIDC monitoring

Discovery endpoint, JWKS, token endpoint, userinfo endpoint, authorization endpoint
Validate iss, aud, exp, and iat on every token, plus nonce on ID tokens from flows that send one
JWT signature verification using current JWKS keys
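The claim checks can be sketched as a pure function over an already-decoded payload. Signature verification is deliberately out of scope here and should be done with a JWT library against the current JWKS. The 60-second leeway and the name `claim_problems` are assumptions.

```python
import time
from typing import Dict, List, Optional


def claim_problems(claims: Dict, issuer: str, audience: str,
                   expected_nonce: Optional[str] = None,
                   now: Optional[float] = None, leeway: int = 60) -> List[str]:
    """Check iss, aud, exp, iat (and nonce, if expected) on decoded claims."""
    now = time.time() if now is None else now
    problems = []
    if claims.get("iss") != issuer:
        problems.append("iss mismatch")
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]  # aud may be str or list
    if audience not in audiences:
        problems.append("aud mismatch")
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp + leeway < now:
        problems.append("exp missing or expired")
    iat = claims.get("iat")
    if not isinstance(iat, (int, float)) or iat - leeway > now:
        problems.append("iat missing or in the future")
    if expected_nonce is not None and claims.get("nonce") != expected_nonce:
        problems.append("nonce mismatch")
    return problems
```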

SAML monitoring

IdP metadata XML — fetch it on a schedule and alert on changes; alert on expiration of embedded certs
ACS (Assertion Consumer Service) endpoint at your side — must accept signed assertions
Clock skew tolerance — SAML assertions have NotBefore / NotOnOrAfter windows that fail with skew > 5 min
Per-customer SP-initiated and IdP-initiated flow tests
Signature validation against the customer's certificate (which they rotate without telling you)

For B2B SaaS, the per-customer SAML test is non-negotiable. One enterprise customer's IdP can break without affecting any other customer.
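The clock-skew item can be made concrete with a small time-window check. This sketch assumes whole-second UTC timestamps of the form 2024-01-01T00:00:00Z and a 5-minute tolerance. Real assertions may carry fractional seconds, which this parser does not handle, and both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_saml_instant(value: str) -> datetime:
    """Parse a whole-second UTC timestamp such as 2024-01-01T00:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def assertion_window_ok(not_before: datetime, not_on_or_after: datetime,
                        now: datetime,
                        skew: timedelta = timedelta(minutes=5)) -> bool:
    """True if `now` falls inside the validity window, widened by `skew`."""
    return (not_before - skew) <= now < (not_on_or_after + skew)
```

Running this in the per-customer canary against the NotBefore and NotOnOrAfter values of a fresh assertion surfaces clock drift on either side before it starts rejecting real logins.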

The Synthetic Login Canary

The single highest-value monitor: script a full credential-grant flow and run it on a schedule.

Minimum viable version (API-only)

POST /oauth/token
  grant_type=password
  client_id=…
  username=monitoring-canary@yourapp.com
  password=…
  scope=openid email
→ Validate response:
  - HTTP 200
  - access_token present
  - id_token present
  - JWT signature verifies against current JWKS
  - claims include expected sub, aud, email, custom claims
  - exp is in the future

Run every 1–5 minutes from at least one region. Alert on any failure or on response-time > baseline.
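A minimal Python sketch of the API-only canary, using only the standard library. It checks the response shape and the decoded claims but does not verify the signature. For that step, use a JWT library against the current JWKS. The token URL, form fields, and expected claim values are placeholders you would supply, and all function names are hypothetical.

```python
import base64
import json
import time
import urllib.error
import urllib.parse
import urllib.request
from typing import Dict, List, Optional, Tuple


def request_token(token_url: str, form: Dict[str, str],
                  timeout: float = 10.0) -> Tuple[int, Dict]:
    """POST a form-encoded grant request and return (status, parsed body)."""
    data = urllib.parse.urlencode(form).encode()
    req = urllib.request.Request(token_url, data=data, method="POST")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        return err.code, {}


def decode_jwt_payload(token: str) -> Dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    segments = token.split(".")
    if len(segments) != 3:
        raise ValueError("not a three-segment JWT")
    payload = segments[1] + "=" * (-len(segments[1]) % 4)  # restore padding
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if not isinstance(claims, dict):
        raise ValueError("payload is not a JSON object")
    return claims


def canary_failures(status: int, body: Dict, expected_claims: Dict[str, str],
                    now: Optional[float] = None) -> List[str]:
    """Return a list of failed checks; an empty list means the run passed."""
    now = time.time() if now is None else now
    failures = []
    if status != 200:
        failures.append(f"HTTP {status}")
    for field in ("access_token", "id_token"):
        if not body.get(field):
            failures.append(f"{field} missing")
    id_token = body.get("id_token")
    if not id_token:
        return failures
    try:
        claims = decode_jwt_payload(id_token)
    except ValueError as exc:
        failures.append(f"id_token unreadable: {exc}")
        return failures
    for name, want in expected_claims.items():
        if claims.get(name) != want:
            failures.append(f"claim {name}: expected {want!r}, got {claims.get(name)!r}")
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        failures.append("exp missing or not in the future")
    return failures
```

Feed the result of `request_token` into `canary_failures` on each scheduled run and alert on any non-empty list.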

Better version (real browser)

A scripted browser session that:

Loads your login page
Submits credentials
Waits for the redirect dance to complete
Hits an authenticated endpoint with the session
Validates the response

This catches everything: DNS, TLS, CDN, custom domain, frontend JS, IdP, callback handler, and your own session middleware. It's also the most expensive to run and the noisiest, but for a critical login flow it's worth it.

Multi-region

Run the canary from at least 3 regions. A US-only canary will miss EU and APAC incidents. See Multi-Region Monitoring.

What Not to Monitor (the Common Mistake)

Most teams accidentally monitor IdP health via "401 rate from our backend API." This is a terrible IdP signal:

A 401 spike is usually a client bug, an expired token wave, a deploy that bumped audience values — not an IdP problem
A real IdP outage often manifests as users unable to authenticate in the first place, which never reaches your API at all
401s from misbehaving clients (curl scripts, integration partners) drown the signal

Better signals:

Synthetic login canary outcome
Login success rate (measured at the start of the flow, before any token is issued)
IdP-side metrics where available

See Monitor Authenticated APIs With Bearer Tokens and Custom Headers for the patterns to monitor your authenticated APIs without falling into this trap.

Alerting Thresholds

Critical (page)

Synthetic login canary fails 2 consecutive runs
Login success rate < 90% for 2 minutes
JWKS endpoint returns no keys
Discovery endpoint returns non-200 for > 3 minutes

High (notification)

Login success rate < 98% for 5 minutes
Token-endpoint p95 latency > 2× baseline for 10 minutes
New kid appears in JWKS (informational — verify rotation is intended)
Single tenant / region success rate drops while others are healthy
MFA challenge success rate < 95% for 10 minutes

Informational

OIDC discovery JSON content change
SAML metadata content change
Provider status page update
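The "2 consecutive runs" rule for the canary is easy to get subtly wrong, for example by paging again on every later failure. A small sketch of a gate that fires once when the failure streak reaches the threshold (`ConsecutiveFailureGate` is a hypothetical name):

```python
class ConsecutiveFailureGate:
    """Fire once when a canary has failed `threshold` runs in a row."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.streak = 0

    def record(self, ok: bool) -> bool:
        """Record one canary run; return True when the page should fire."""
        self.streak = 0 if ok else self.streak + 1
        # Fires only on the run that reaches the threshold, not on later ones.
        return self.streak == self.threshold
```

Keep one gate per canary and region so that a single flaky region cannot reset or trigger another region's streak.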

See Alert Fatigue: Notifications That Get Acted On for the broader noise-reduction principles. For the rest of the auth-flow monitoring picture see Login & Authentication Flow Monitoring. For the broader API-uptime piece see API Uptime Monitoring: Health Checks.

IdP Monitoring Checklist

OIDC discovery endpoint monitored with content validation
JWKS endpoint monitored for availability and kid changes
Token-endpoint latency p50/p95/p99 tracked
Per-method login success rate tracked (password, magic link, social, SSO, passkey, refresh, MFA)
Synthetic login canary every 1–5 minutes with full JWT validation
Synthetic browser flow at least every 15 minutes for critical login paths
Multi-region canary execution (≥ 3 regions)
Per-tenant / per-region alerting tags
Per-customer SAML SP-initiated and IdP-initiated tests (B2B)
SAML certificate expiration alerts (90 / 30 / 7 day)
Custom-domain SSL cert expiration alerts
IdP rate-limit headers tracked
Webhook delivery success rate (Clerk, Auth0, others)
Provider status page subscription (lagging-indicator only)
Internal /internal/auth-health endpoint surfacing last-canary-result for external monitoring
Runbook covering "auth is down" vs "auth issued wrong claims"
No reliance on 401-rate-from-our-API as the primary IdP signal

How Webalert Helps Monitor Identity Providers

Webalert handles the external-monitoring layer for IdP stacks:

HTTP monitoring with content validation — Hit /.well-known/openid-configuration and the JWKS endpoint; alert when the issuer or jwks_uri content drifts
Authenticated endpoint monitoring — Validate your internal /internal/auth-health endpoint that surfaces last-canary outcome and JWT-verification result
Multi-region checks — Run from every region your users live in; catch regional Auth0/Okta/Clerk degradation before users do
Response time monitoring — Catch token-endpoint slowdowns before they cascade
SSL certificate monitoring — Custom-domain cert expiry, SAML signing-cert expiry
Status page integration — Communicate auth degradation to your customers
Multi-channel alerts — Email, SMS, Slack, Discord, Microsoft Teams, webhooks
1-minute check intervals — Detect IdP regressions within 60 seconds
5-minute setup — Add endpoints, point monitors at them, set thresholds

See features and pricing.

Summary

Identity providers are usually your single highest-blast-radius dependency. An IdP outage feels like a complete product outage.
Provider status pages lag reality by 5–15 minutes. Synthetic canaries are the only way to know in real time.
The most-missed failure mode is correctness: tokens issued but with wrong claims. Plain HTTP uptime never catches this.
Monitor across all five layers: reachable, available, performant, correct, end-to-end.
The single highest-value monitor is a scripted login canary that validates JWT claims and runs every 1–5 minutes from multiple regions.
Each provider — Auth0, Okta, Clerk, Cognito, Firebase, Entra — has its own failure modes (tenant regions, action latency, JWKS rotation, Lambda triggers, conditional-access policies). Tune monitors for the specific provider.
For B2B SaaS, per-customer SAML connection monitoring is non-negotiable.
Don't lean on "401 rate from our API" as the IdP signal — it's the noisiest possible proxy.

The good news: IdP outages are detectable well before users feel them, if you build the right canaries. The bad news: nobody builds them by default — most teams only put them in place after the first painful incident. Build them now and skip that incident.

Catch IdP outages and silent token regressions before your users log out

Start monitoring with Webalert →

See features and pricing. No credit card required.
