The First 24 Hours: An Incident Response Runbook You'll Actually Use

Most incident response plans are written to pass an audit, not to survive a Tuesday at 2 a.m. They're forty pages of definitions, a RACI chart nobody has memorized, and a phone tree with at least one number that's been disconnected since the last reorg. When the real thing hits, nobody opens the binder. People improvise, and the quality of the improvisation is the quality of your program.

I want to argue that the first 24 hours of a breach are won or lost on a handful of decisions that have almost nothing to do with the technical depth of your forensics. They're about who decides, what you preserve, and how you talk. Get those right and your engineers get room to work. Get them wrong and you'll spend the worst day of the quarter managing executives instead of managing the incident.

Assign roles before you understand the problem

The instinct in hour one is to swarm. Everyone who got paged piles into the same call and starts reading the same logs out loud. It feels like progress. It's actually the most expensive thing you can do, because you've just put your most capable responders into a meeting instead of on a keyboard.

The single most useful move in the opening minutes is to name an Incident Commander who does not touch a terminal. Their entire job is to hold the shape of the response: who owns containment, who owns forensics, who owns comms, and what the current working theory is. The IC is not the smartest person in the room about the affected system — they're frequently better when they aren't, because they won't get sucked into the rabbit hole. Underneath them you want a small number of named owners: a technical lead actually doing the work, a scribe keeping the timeline, and a comms lead who is the only person allowed to talk to anyone outside the response. That's it. Four roles. Everyone else is a resource the technical lead pulls in on request, not a standing participant.

The reason this matters is decision rights. In a breach you will hit choices that cannot wait for consensus: do we pull this host offline, do we rotate every credential in the blast radius, do we take a customer-facing service down to stop the bleeding. Those are tradeoffs between availability and containment, and somebody has to own the call. Pre-decide that the IC makes operational containment decisions and informs leadership rather than asking permission. If your engineers have to wait for a VP to wake up and bless isolating a compromised box, you've designed latency into the worst possible moment.

Preserve evidence before you clean up

Here's the mechanic almost everyone gets backwards under pressure: the instinct to fix is the enemy of the ability to understand. The compromised instance is also the crime scene. The moment someone reboots it "to see if that clears it," or terminates it and lets the auto-scaling group replace it, you've potentially destroyed the only copy of what the attacker actually did.

So before containment turns into cleanup, capture. In a cloud environment this is genuinely easier than the on-prem world ever was, and there's no excuse for skipping it. Snapshot the volumes. Take a memory capture if the tooling supports it. Pull the relevant logs (load balancer, application, the control-plane audit trail) into a separate, locked-down account or bucket that responders can read but the compromised environment can't reach. Make sure the logging that proves what happened isn't sitting in the same blast radius as the thing that got popped, because attackers delete logs and so, accidentally, do panicked responders.
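To make the capture step concrete, here is a minimal Python sketch that copies already-exported log files into a fresh evidence directory and writes a SHA-256 manifest with a UTC capture time. It assumes the logs are local files; in a real cloud response the destination would be the separate, locked-down account or bucket described above, and volume snapshots and memory captures would come from your provider's own tooling, which this does not attempt. The function names and the manifest file name are hypothetical, and hashing each file is a common chain-of-custody practice added here for illustration, not one of the steps above.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List

MANIFEST_NAME = "manifest.json"  # hypothetical file name for the hash manifest


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large logs never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def capture_evidence(sources: List[Path], evidence_dir: Path) -> Dict:
    """Copy each source into a fresh evidence_dir and write a SHA-256 manifest.

    Files are copied, never moved, so the originals stay where they were.
    """
    # exist_ok=False: every capture gets its own directory, so nothing is overwritten.
    evidence_dir.mkdir(parents=True, exist_ok=False)
    entries = []
    for src in sources:
        dest = evidence_dir / src.name
        if dest.exists() or src.name == MANIFEST_NAME:
            raise FileExistsError(f"refusing to overwrite evidence: {dest}")
        shutil.copy2(src, dest)  # copy2 also preserves file modification times
        dest_hash = sha256_of(dest)
        if dest_hash != sha256_of(src):
            raise RuntimeError(f"copy of {src} does not match the original")
        entries.append({"file": src.name, "sha256": dest_hash})
    manifest = {
        "captured_at_utc": datetime.now(timezone.utc).isoformat(),
        "files": entries,
    }
    (evidence_dir / MANIFEST_NAME).write_text(json.dumps(manifest, indent=2))
    return manifest
```

Copying rather than moving is the point: the originals stay in place for whoever looks next, and the manifest lets someone verify weeks later that the evidence copy has not changed since capture.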

The discipline I'd push hardest: isolate, don't destroy. You can almost always achieve containment without obliterating evidence. Detach the network. Revoke the role. Quarantine the instance out of the load balancer. The attacker is now inert and you still have the artifact. Have your scribe note timestamps for every action in UTC as you go, because three weeks later when counsel or a regulator asks "when did you know, and when did you act," the answer needs to be a timeline, not a memory.
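The two disciplines in that paragraph, isolate before you destroy and timestamp every action in UTC, can be sketched as a small append-only log that the scribe writes to. This is an illustrative Python sketch, not a real tool: the action names in DESTRUCTIVE and the class and method names are hypothetical, and it assumes actions are reported to the log rather than executed by it.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Set

# Hypothetical action names. Isolating actions such as detach_network,
# revoke_role, or remove_from_load_balancer are always allowed; these are not.
DESTRUCTIVE = {"reboot", "terminate", "reimage"}


@dataclass
class Timeline:
    """Append-only log of response actions, every entry stamped in UTC."""
    entries: List[str] = field(default_factory=list)
    captured: Set[str] = field(default_factory=set)  # targets with evidence preserved

    def record(self, actor: str, action: str, target: str,
               at: Optional[datetime] = None) -> str:
        if action in DESTRUCTIVE and target not in self.captured:
            # Isolate, don't destroy: refuse until this target's evidence is logged.
            raise PermissionError(f"{action} on {target} blocked: capture evidence first")
        if at is None:
            at = datetime.now(timezone.utc)
        if at.tzinfo is None:
            raise ValueError("timestamps must be timezone-aware")
        # Normalize to UTC so responders in different time zones share one timeline.
        stamp = at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
        line = f"{stamp} {actor} {action} {target}"
        self.entries.append(line)
        return line

    def mark_evidence_captured(self, actor: str, target: str,
                               at: Optional[datetime] = None) -> str:
        """Log the capture itself, then unlock destructive actions for that target."""
        line = self.record(actor, "capture_evidence", target, at)
        self.captured.add(target)
        return line
```

Blocking destructive actions per target until a capture is logged turns "isolate, don't destroy" into the default rather than a plea, and every allowed action lands in the same UTC timeline counsel will ask for later.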

Run a comms cadence, not a comms scramble

This is the part security teams are worst at and it's the part that determines whether leadership trusts you for the next two years. Executives and regulators don't panic because something bad happened. They panic because they don't know what's happening and they suspect you don't either. The cure is rhythm.

Pick an interval — every 60 minutes early, stretching out as things stabilize — and send a short, structured update at that interval whether or not anything changed. No change is itself an update. A good update is four lines: what we know, what we don't know yet, what we're doing right now, and when the next update lands. That's it. Resist the pressure to speculate on root cause or scope before you have it; the fastest way to lose credibility is to declare "no customer data affected" in hour two and walk it back in hour twelve. Say what's confirmed, label everything else as preliminary, and let the cadence carry the trust.
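The four-line update is easy to template, which also makes it easy to send on schedule when nothing has changed. A minimal Python sketch follows; the function name and line labels are assumptions, and anything not yet confirmed belongs on the second line rather than the first.

```python
from datetime import datetime, timedelta, timezone


def format_update(known: str, not_yet_known: str, doing_now: str,
                  interval_minutes: int, now: datetime) -> str:
    """Render the four-line status update and commit to the next one."""
    if interval_minutes <= 0:
        raise ValueError("interval must be positive")
    if now.tzinfo is None:
        raise ValueError("now must be timezone-aware")
    next_at = now.astimezone(timezone.utc) + timedelta(minutes=interval_minutes)
    return "\n".join([
        f"Known: {known}",                   # confirmed facts only
        f"Not yet known: {not_yet_known}",   # open questions, preliminary findings
        f"Doing now: {doing_now}",
        f"Next update: {next_at:%Y-%m-%d %H:%M} UTC",
    ])
```

When nothing has moved, repeat the previous first line or write "No change since last update."; the update still goes out, and it still names the time of the next one. Stretching the cadence is just a larger `interval_minutes` once things stabilize.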

Crucially, the comms lead owns this channel and shields the responders from it. When an SVP wants a status, they get it from the cadence or from the IC — they do not DM the engineer who's elbow-deep in packet captures. Half the value of a comms function in an incident is giving anxious leadership somewhere to put their anxiety that isn't your responders' inbox.

The regulatory clock is its own discipline, and in financial services it's unforgiving — notification windows are measured in hours and days, not weeks, and they start ticking on determination, not on resolution. Working with 1,500-plus financial institutions taught me that the obligation isn't "tell everyone everything immediately"; it's "be able to prove you ran a defensible process." Have legal and compliance in the response from hour one, let them own the privilege and the notification decisions, and feed them the same clean timeline your scribe is keeping. The cadence is what makes that timeline trustworthy when it matters.
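One way to keep the regulatory clock honest is to compute every deadline from the determination timestamp and nothing else. In this Python sketch the function names, regime names, and window lengths are placeholders, not real regulatory values; legal and compliance supply the actual windows that apply to you.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple


def notification_deadlines(determined_at: datetime,
                           windows_hours: Dict[str, float]) -> List[Tuple[str, datetime]]:
    """Compute each notification deadline from the determination time alone.

    Resolution time is deliberately not a parameter: the clock does not wait for it.
    """
    if determined_at.tzinfo is None:
        raise ValueError("determination time must be timezone-aware")
    start = determined_at.astimezone(timezone.utc)
    deadlines = [(name, start + timedelta(hours=hours))
                 for name, hours in windows_hours.items()]
    return sorted(deadlines, key=lambda item: item[1])  # most urgent first


def hours_remaining(deadline: datetime, now: datetime) -> float:
    """Hours left before a deadline (negative once it has passed)."""
    return (deadline - now).total_seconds() / 3600
```

The design choice is the missing input: because resolution never enters the calculation, each deadline is fixed the moment determination is logged, whatever happens to the incident afterward.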

The takeaway

A runbook you'll actually use fits on a page and answers four questions before the incident, not during it: who's the IC, who can make a containment call without asking, where does evidence go, and what's the comms interval. Everything else is detail you can look up while the timer runs.
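The four questions can also live as a small structured record that a drill checks, so an unanswered one surfaces before the incident rather than during it. This is a hypothetical Python sketch; the class and field names are illustrative and map one-to-one onto the four questions.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class OnePageRunbook:
    incident_commander: str       # who's the IC
    containment_authority: str    # who can make a containment call without asking
    evidence_location: str        # where evidence goes, outside the blast radius
    comms_interval_minutes: int   # what the comms interval is

    def gaps(self) -> List[str]:
        """Return the names of any questions left unanswered."""
        missing = []
        for f in fields(self):
            value = getattr(self, f.name)
            if value is None:
                missing.append(f.name)
            elif isinstance(value, str) and not value.strip():
                missing.append(f.name)
            elif isinstance(value, int) and value <= 0:
                missing.append(f.name)
        return missing
```

An empty `gaps()` list only means the page is filled in, not that the answers are good; the drill is what tests those.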

So here's the challenge. Don't wait for the real thing to find out whether your plan is a binder or a reflex. Run the drill. Page the team off-hours, hand someone the IC role who's never held it, and watch what breaks in the first thirty minutes. The gaps you find in a tabletop are free. The ones you find at 2 a.m. are not.
