When IAM Changes Kill Production and CloudTrail Tells You Exactly Who Did It

February 15, 2025

A single IAM change can take down production in a way that looks like “the app is broken” but is actually “someone removed a permission at the worst possible time.” The scary part is not the outage. The scary part is the meeting after, where everyone asks the same question: who did it? This is where CloudTrail turns from “security checkbox” into your most valuable debugging tool. CloudTrail records AWS activity across the console, CLI, and SDKs, and it is designed to answer “who did what, when, and from where.” This post is a real-world playbook for solving the IAM-change outage fast, and then hardening your account so the same class of failure becomes rare and boring.

The failure mode you see at 2am

Your symptoms usually look like one of these:

- Your backend starts returning 500s after a deploy that “should have been safe.”
- ECS tasks or Lambda invocations start failing with AccessDenied.
- A background job silently stops writing to S3, DynamoDB, SQS, KMS, you name it.
- Someone “just tweaked permissions” and now half your stack is on fire.

The real root cause is usually one of these IAM events:

- a policy detached from a role
- an inline policy overwritten
- a managed policy version changed
- a role trust policy updated so the service can no longer assume it

That last one is extra spicy because it breaks identity at the source.

The 10-minute forensic loop that makes you look terrifyingly competent

You are trying to answer four questions:

1. What changed?
2. Who changed it?
3. Where did they change it from?
4. What exactly was affected?

CloudTrail gives you those answers when you know what to look for.

Step 1: Find the failing principal and the exact permission error

Start from the error message in your app logs. You want:

- the AWS service being called
- the API action being denied
- the role ARN or assumed-role ARN

This is your search key into CloudTrail.

Step 2: Use CloudTrail Event history for the first pass

CloudTrail Event history is enabled by default and gives you a searchable record of the past 90 days of management events in a Region. It is immutable and fast for incident response.

In the CloudTrail console’s Event history, filter around the time the outage started and search for IAM changes. The usual suspects are:

- DetachRolePolicy
- AttachRolePolicy
- PutRolePolicy
- DeleteRolePolicy
- CreatePolicyVersion
- SetDefaultPolicyVersion
- UpdateAssumeRolePolicy

If you already know the role name, filter by resource name too.

Step 3: Open the event record and read it like a crime scene report

CloudTrail event records have a predictable structure, and the fields you care about are consistent across services. The high-signal fields are:

- eventTime
- eventName
- userIdentity
- sourceIPAddress
- userAgent
- requestParameters
- responseElements
- errorCode and errorMessage, if the change failed

The single most important section is userIdentity. It tells you what kind of identity performed the action, what credentials were used, and whether the call came from an assumed role. This is where you’ll spot patterns like:

- a human using the console
- a CI role assumed via STS
- a break-glass role used outside normal hours
- a third-party integration doing something it should never do

Now you have your answer for “who did it,” plus enough context to be fair about it.

Step 4: Confirm the blast radius in a second query

Once you find the first IAM change event, widen the time window by 10 minutes and search for adjacent changes. IAM outages are often two edits, not one. For example: someone detaches a policy, then attempts to fix it by attaching a different one, then updates the trust policy, then accidentally makes it worse. CloudTrail will show that sequence, but it will not show events in a guaranteed order inside log files, so always lean on timestamps instead of expecting a neat stack trace.
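If you would rather script this loop than click through the console, the sketch below walks Steps 2 through 4 with boto3. It is a minimal example under stated assumptions, not a hardened tool: the Region, outage timestamp, and search window are placeholders, and it assumes credentials that are allowed to call cloudtrail:LookupEvents. It searches the widened window for the usual suspect event names, pulls the who/where fields out of each record, and sorts by timestamp.

```python
# Minimal sketch only: the window, Region, and event list are placeholders.
import json
from datetime import datetime, timedelta, timezone

import boto3

# IAM is a global service; its management events are typically recorded in us-east-1.
cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")

SUSPECT_EVENTS = [
    "DetachRolePolicy", "AttachRolePolicy", "PutRolePolicy", "DeleteRolePolicy",
    "CreatePolicyVersion", "SetDefaultPolicyVersion", "UpdateAssumeRolePolicy",
]

outage_start = datetime(2025, 2, 14, 2, 0, tzinfo=timezone.utc)  # placeholder
window_start = outage_start - timedelta(minutes=30)
window_end = outage_start + timedelta(minutes=10)

findings = []
paginator = cloudtrail.get_paginator("lookup_events")
for event_name in SUSPECT_EVENTS:
    # LookupEvents accepts a single lookup attribute per call, so loop over event names.
    pages = paginator.paginate(
        LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": event_name}],
        StartTime=window_start,
        EndTime=window_end,
    )
    for page in pages:
        for event in page["Events"]:
            record = json.loads(event["CloudTrailEvent"])  # full record is a JSON string
            identity = record.get("userIdentity", {})
            findings.append({
                "time": record["eventTime"],
                "event": record["eventName"],
                "who": identity.get("arn") or identity.get("type"),
                "ip": record.get("sourceIPAddress"),
                "agent": record.get("userAgent"),
                "params": record.get("requestParameters"),
            })

# Sort by timestamp rather than trusting delivery order.
for finding in sorted(findings, key=lambda f: f["time"]):
    print(finding)
```

Running this during an incident gives you the same “who, what, when, from where” view as the console, but in a form you can paste straight into the incident channel.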
When Event history is not enough and what to do instead

Event history is per-Region and capped at 90 days. That is perfect for most incidents, but not for audits, long-running mysteries, and multi-account org setups.

Trails for retention and real monitoring

CloudTrail trails can deliver events to an S3 bucket and optionally to CloudWatch Logs and EventBridge. This is how you get long-term retention and real-time detection. AWS notes that CloudTrail typically delivers logs to S3 within about 5 minutes on average, which is good enough for most alerting pipelines.

CloudTrail Lake for fast SQL search at scale

CloudTrail Lake lets you run SQL-based queries on event data stores. It is powerful for investigations across accounts and Regions, but it incurs charges, so use it intentionally.

The minimum viable detective control that prevents repeats

Once IAM has burned you, you stop treating it as “just permissions” and start treating it like production configuration. The simplest hardening is:

1. Route CloudTrail events into EventBridge.
2. Match on IAM change API calls.
3. Alert immediately to your incident channel.

EventBridge can receive AWS service events delivered via CloudTrail, including “AWS API Call via CloudTrail” events. You do not need a huge SIEM to start. You just need to know when someone touches the keys to the kingdom. (A sketch of this rule appears after the diagrams below.)

The minimum viable prevention that reduces your blast radius

Detective controls tell you what happened. Preventive controls make it harder for the incident to happen at all.

Permission boundaries for roles that create roles

Permission boundaries set the maximum permissions an IAM identity can ever get, even if someone attaches an overly broad policy later. This is a big deal for teams that want developers to move fast without letting them mint new admin. (There is a sketch of this after the diagrams as well.)

SCPs for org-wide guardrails

Service control policies in AWS Organizations restrict what accounts can do. They do not grant permissions; they only limit what is possible. Founder translation: even if someone fat-fingers a policy in a single account, your org-level seatbelt can stop the worst actions from being executable.

Access Analyzer to kill “we just gave it admin” culture

IAM Access Analyzer can help review unused or risky access, and it can generate least-privilege policies based on CloudTrail activity. That is a practical way to replace broad permissions with what the system actually uses.

Two diagrams that make this post feel premium

Diagram 1: The IAM outage timeline

A simple horizontal timeline with four blocks:

1. Deployment finishes
2. AccessDenied spikes
3. IAM change event in CloudTrail
4. Recovery by restoring the policy or trust relationship

Under the CloudTrail block, list the exact event name you found, the identity type, and the source IP.

Diagram 2: The hardened control loop

A loop diagram: IAM change attempt → CloudTrail record → EventBridge rule → alert → human review → remediation → policy hardening.

This diagram sells “operator mindset” instantly. You can build it in Figma in 10 minutes.
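If you want the EventBridge piece of that loop as more than a diagram, here is a minimal boto3 sketch. It is an assumption-laden example rather than a drop-in setup: the rule name and SNS topic ARN are placeholders, it relies on the “AWS API Call via CloudTrail” events described earlier actually reaching EventBridge in your account, and it lives in us-east-1 because that is where IAM’s global events are typically recorded.

```python
# Minimal sketch: the rule name and topic ARN are placeholders, not real resources.
import json

import boto3

events = boto3.client("events", region_name="us-east-1")

iam_change_pattern = {
    "source": ["aws.iam"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["iam.amazonaws.com"],
        "eventName": [
            "DetachRolePolicy", "AttachRolePolicy", "PutRolePolicy",
            "DeleteRolePolicy", "CreatePolicyVersion",
            "SetDefaultPolicyVersion", "UpdateAssumeRolePolicy",
        ],
    },
}

events.put_rule(
    Name="alert-on-iam-changes",
    EventPattern=json.dumps(iam_change_pattern),
    State="ENABLED",
    Description="Tell the incident channel when someone touches IAM",
)

# The SNS topic's access policy must allow events.amazonaws.com to publish to it.
events.put_targets(
    Rule="alert-on-iam-changes",
    Targets=[{
        "Id": "incident-alerts",
        "Arn": "arn:aws:sns:us-east-1:123456789012:iam-change-alerts",
    }],
)
```

Whatever already pages you, an email subscription or a small Lambda that posts to your incident channel, can then subscribe to that topic.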
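On the prevention side, here is what the permission-boundary guardrail can look like in code, again as a hedged sketch: it assumes you already maintain a boundary policy, and the policy ARN and role names below are hypothetical.

```python
# Minimal sketch: the boundary policy ARN and role names are hypothetical.
import json

import boto3

iam = boto3.client("iam")

BOUNDARY_ARN = "arn:aws:iam::123456789012:policy/developer-boundary"

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# New roles get the boundary from day one.
iam.create_role(
    RoleName="orders-service-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
    PermissionsBoundary=BOUNDARY_ARN,
)

# Existing roles can be capped retroactively.
iam.put_role_permissions_boundary(
    RoleName="legacy-build-role",
    PermissionsBoundary=BOUNDARY_ARN,
)
```

The role’s effective permissions become the intersection of the boundary and whatever identity policies are attached, so a later over-broad attach no longer quietly equals admin.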
The takeaways that matter in real teams

- CloudTrail is your truth layer for “who changed what,” and Event history gives you 90 days of fast answers by default.
- Trails are how you graduate from debugging to monitoring, because they deliver to S3 and can feed CloudWatch and EventBridge.
- Prevention is not one thing. It is boundaries, org guardrails, and continuous least-privilege cleanup.

The best part is that all of this makes you faster, not slower. The whole point is to make outages boring and audits trivial.
