3am Runbook

Authentication Incident Runbook

Treat login like a production dependency. Stabilize first, diagnose second.

Download the PDF runbook

Enter your email to unlock the download. I’ll also send updates on auth reliability patterns.

1) Confirm impact

  • Is login failing for everyone or a subset (region, app version, device)?
  • What’s the user-facing symptom (can’t log in, stuck in redirect loop, logged out)?
  • Check the primary SLI: login success rate + time to successful login.

2) Triage by bucket

  • IdP: throttling, outage, timeouts, discovery/JWKS unreachable.
  • Us: recent deploy/config, session store issues, bad cache, DNS/TLS problems.
  • Clients: retry storm, token refresh stampede, misbehaving versions.

3) Stabilize without causing outage #2

  • Enable circuit breaker to stop hammering the IdP during failures.
  • Backoff with jitter for retries. No synchronized retry waves.
  • Prefer keeping existing sessions alive temporarily over forcing reauth.
  • Rollback the deploy if it correlates with token storms or redirect loops.

4) Recover safely

  • Ramp traffic back gradually. Watch login SLIs, not infra vanity metrics.
  • Remove mitigations intentionally. Don’t “forget” degraded modes enabled.
  • Write down timeline + trigger + what amplified the blast radius.

Interested in updates on new npm releases?

Sign up with your email and get fresh updates as soon as they drop.