3am Runbook
Authentication Incident Runbook
Treat login like a production dependency. Stabilize first, diagnose second.
Download the PDF runbook
Enter your email to unlock the download. I’ll also send updates on auth reliability patterns.
1) Confirm impact
- Is login failing for everyone or a subset (region, app version, device)?
- What’s the user-facing symptom (can’t log in, stuck in redirect loop, logged out)?
- Check the primary SLI: login success rate + time to successful login.
2) Triage by bucket
- IdP: throttling, outage, timeouts, discovery/JWKS unreachable.
- Us: recent deploy/config, session store issues, bad cache, DNS/TLS problems.
- Clients: retry storm, token refresh stampede, misbehaving versions.
3) Stabilize without causing outage #2
- Enable circuit breaker to stop hammering the IdP during failures.
- Backoff with jitter for retries. No synchronized retry waves.
- Prefer keeping existing sessions alive temporarily over forcing reauth.
- Rollback the deploy if it correlates with token storms or redirect loops.
4) Recover safely
- Ramp traffic back gradually. Watch login SLIs, not infra vanity metrics.
- Remove mitigations intentionally. Don’t “forget” degraded modes enabled.
- Write down timeline + trigger + what amplified the blast radius.