Ransomware Recovery Is a Backups-You've-Tested Problem

Ask any engineering org whether they have backups and the answer is always yes. The snapshots are running, the retention policy is documented, the dashboard is green. Then a destructive event happens — ransomware, a rogue credential, a deletion that cascades — and the same org discovers, in the worst possible moment, that having backups and being able to recover are two entirely different things.

I've come to believe ransomware recovery isn't really a backup problem. It's a backups-you've-tested problem. The distinction sounds pedantic until you're standing in an incident bridge at 2 a.m. asking how old the last clean restore point is, whether the attacker had the access to corrupt it too, and how long a full rebuild actually takes when nobody's done it end to end. The backup is the artifact. The restore is the capability. You only own the one you've exercised.

The real mechanic: assume the attacker is already inside your backups

Most backup strategies are designed for the wrong adversary. They're built to survive hardware failure and human error — events that are random and don't fight back. Ransomware is not random and it fights back. A competent operator lives in your environment for days or weeks before they pull the trigger, and one of the first things they go after is your ability to recover. They find the backup service account. They delete snapshots. They quietly extend the dwell time until your clean restore points have rolled off retention. By the time the encryption starts, the recovery plan you were counting on is already gone.
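
The race between dwell time and retention can be made concrete with a few lines of Python. This is a minimal sketch, not a tool: the function name, the daily schedule, and the 14-day retention window are all hypothetical, and it assumes you can put a date on when the intrusion began, which in practice is the hard part.

```python
from datetime import datetime, timedelta


def clean_restore_points(
    restore_points: list[datetime],
    intrusion_started_at: datetime,
    now: datetime,
    retention: timedelta,
) -> list[datetime]:
    """Return restore points that are still retained AND predate the intrusion."""
    oldest_retained = now - retention
    return sorted(
        p for p in restore_points
        if oldest_retained <= p < intrusion_started_at
    )


# Hypothetical numbers: one backup per day for 30 days, 14-day retention.
now = datetime(2024, 1, 31)
daily = [now - timedelta(days=d) for d in range(1, 31)]
retention = timedelta(days=14)

short_dwell = clean_restore_points(daily, now - timedelta(days=7), now, retention)
long_dwell = clean_restore_points(daily, now - timedelta(days=21), now, retention)

print(len(short_dwell))  # 7 clean points survive a 7-day dwell
print(len(long_dwell))   # 0 survive a 21-day dwell
```

Once dwell time exceeds retention, the list is empty no matter how healthy the backup dashboard looks.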

So the design question isn't "do we have copies." It's "can an attacker with our most privileged credentials destroy our ability to recover." If the answer is yes, you don't have a backup strategy, you have a false sense of one. Everything that matters in cyber-resilience flows from closing that gap.

On AWS the public building blocks for this are well understood, and the point is to compose them deliberately rather than trust defaults. Immutability comes first. S3 Object Lock in compliance mode and AWS Backup Vault Lock in compliance mode let you write recovery data that literally cannot be deleted or altered before its retention expires: not by an admin, not by the root user, not by an attacker holding your keys. That's the property that defeats the "delete the backups" playbook. The mode matters. Governance mode on either service can be bypassed or removed by a sufficiently privileged identity, and a compliance-mode vault lock only becomes irreversible once its cooling-off period has passed. If a privileged credential can shorten retention or unlock the vault, it isn't immutable; it's just inconvenient to delete.
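
As a rough illustration, the two locks map onto two API calls. The sketch below only builds the request parameters, so it runs without credentials; the bucket name, vault name, and day counts are hypothetical. It assumes the bucket already has versioning and Object Lock enabled, and that you would pass the results to boto3's `put_object_lock_configuration` and `put_backup_vault_lock_configuration`.

```python
from typing import Any


def object_lock_request(bucket: str, retention_days: int) -> dict[str, Any]:
    """Build kwargs for S3 put_object_lock_configuration in compliance mode."""
    if retention_days < 1:
        raise ValueError("retention_days must be at least 1")
    return {
        "Bucket": bucket,
        "ObjectLockConfiguration": {
            "ObjectLockEnabled": "Enabled",
            "Rule": {
                # COMPLIANCE, not GOVERNANCE: governance mode can be bypassed
                # by an identity holding the right permission.
                "DefaultRetention": {"Mode": "COMPLIANCE", "Days": retention_days}
            },
        },
    }


def vault_lock_request(
    vault: str, min_days: int, max_days: int, cooling_off_days: int
) -> dict[str, Any]:
    """Build kwargs for AWS Backup put_backup_vault_lock_configuration."""
    if min_days > max_days:
        raise ValueError("min_days cannot exceed max_days")
    if cooling_off_days < 3:
        # Omitting ChangeableForDays would create a governance-mode lock,
        # and the service requires at least 3 days when it is set.
        raise ValueError("cooling_off_days must be at least 3")
    return {
        "BackupVaultName": vault,
        "MinRetentionDays": min_days,
        "MaxRetentionDays": max_days,
        "ChangeableForDays": cooling_off_days,
    }


# Hypothetical names and retention values, for illustration only.
s3_kwargs = object_lock_request("example-recovery-bucket", retention_days=35)
vault_kwargs = vault_lock_request("example-recovery-vault", 35, 365, 3)

# With credentials for the recovery account, these would be applied as:
#   boto3.client("s3").put_object_lock_configuration(**s3_kwargs)
#   boto3.client("backup").put_backup_vault_lock_configuration(**vault_kwargs)
```

Keeping the parameters in code rather than in a console click path means the lock settings can be reviewed and diffed like anything else.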

Isolation comes second. Backups that live in the same account and same blast radius as production are backups that share production's compromise. The pattern is a separate, locked-down recovery account in its own organizational boundary, cross-account copies pushed into it rather than pulled, and a control plane the production identities can't reach. Tape the analogy to your monitor: it's an air gap implemented with IAM and account boundaries instead of a physical disconnect. The recovery environment should be boring, sparse, and almost nobody should have standing access to it.
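
One cheap way to keep that boundary honest is to audit who can reach the recovery account at all. The sketch below assumes you have already exported the list of principal ARNs with standing access (how you collect it is out of scope here); the account IDs, role names, and function names are hypothetical.

```python
def account_of(arn: str) -> str:
    """Extract the account ID field from an ARN such as arn:aws:iam::<id>:role/x."""
    parts = arn.split(":", 5)
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return parts[4]


def cross_boundary_principals(
    principal_arns: list[str], recovery_account_id: str
) -> list[str]:
    """Return principals that live outside the recovery account."""
    return [
        arn for arn in principal_arns
        if account_of(arn) != recovery_account_id
    ]


# Hypothetical account IDs: 999999999999 is the recovery account,
# 111111111111 is production.
standing_access = [
    "arn:aws:iam::999999999999:role/recovery-break-glass",
    "arn:aws:iam::111111111111:role/prod-backup-operator",
]

offenders = cross_boundary_principals(standing_access, "999999999999")
print(offenders)  # ['arn:aws:iam::111111111111:role/prod-backup-operator']
```

Anything in that list is a path from production's blast radius into the place that is supposed to survive it.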

Recovery into a clean room comes third. When you restore, you do not restore into the environment that just got owned. You stand up an isolated VPC, no peering, no shared services, restricted egress, and you bring data back there to validate it before it touches anything that matters. Ransomware loves a hasty restore straight into the production blast radius — you helpfully reintroduce the malware along with the data and hand the attacker a second turn. The clean room is where you confirm what's actually clean.
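
Those clean-room properties are checkable before any data is restored. Here is a minimal sketch of such a preflight check, assuming you have already gathered the VPC's peering connections and egress destinations into a simple record; the `CleanRoomNetwork` type, the IDs, and the CIDR ranges are hypothetical.

```python
from dataclasses import dataclass, field
from ipaddress import ip_network


@dataclass
class CleanRoomNetwork:
    """Hypothetical summary of the recovery VPC's network posture."""
    peering_connection_ids: list[str] = field(default_factory=list)
    egress_cidrs: list[str] = field(default_factory=list)


def clean_room_violations(
    net: CleanRoomNetwork, allowed_egress: list[str]
) -> list[str]:
    """List every reason this network does not qualify as a clean room."""
    violations = [
        f"peering connection present: {pcx}"
        for pcx in net.peering_connection_ids
    ]
    allowed = [ip_network(cidr) for cidr in allowed_egress]
    for cidr in net.egress_cidrs:
        dest = ip_network(cidr)
        # Egress is acceptable only if it falls inside an allowlisted range.
        permitted = any(
            dest.version == a.version and dest.subnet_of(a) for a in allowed
        )
        if not permitted:
            violations.append(f"egress not on allowlist: {cidr}")
    return violations


# Hypothetical posture: one stray peering link and an open default route.
net = CleanRoomNetwork(
    peering_connection_ids=["pcx-0abc"],
    egress_cidrs=["0.0.0.0/0", "10.20.0.0/24"],
)
problems = clean_room_violations(net, allowed_egress=["10.20.0.0/16"])
for p in problems:
    print(p)
```

An empty list is the precondition for starting the restore; anything else means the room is not clean yet.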

The drill is the product

Here's the part everyone skips. All of that architecture is theory until you've run the detection-to-restore drill on a calendar, before a real event forces it. Not a tabletop where people talk through what they'd do. An actual game day where you take a real workload, pretend its production data is gone, and rebuild it from immutable backups into the clean room while a clock runs.

That exercise surfaces the things no architecture diagram will tell you. The restore that takes eleven hours because nobody sized the throughput. The IAM role that doesn't exist in the recovery account because it was created by hand in prod and never codified. The database that comes back but won't start because a dependency lives in a service you forgot to include. The DNS cutover nobody owns. The runbook that assumes a person who left the company. You want to find every one of these on a Tuesday with coffee, not during an incident with your name on the bridge.

The drill is also where two numbers stop being aspirational and become real: your recovery time objective and your recovery point objective. RTO and RPO are not values you declare in a policy document. They're measurements you take with a stopwatch and then close the gap on. If you've never timed a full restore, your RTO is fiction, and fiction is exactly what gets exposed when the people you serve — in our case, the 1,500-plus financial institutions that depend on us — are waiting to know when their data is coming back.
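
The stopwatch part is simple enough to write down. This sketch assumes the drill records three timestamps and compares the measured values against declared targets; the `DrillTiming` type, the timestamps, and the 4-hour and 24-hour targets are hypothetical, with the eleven-hour restore borrowed from the example above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillTiming:
    """Three timestamps captured during a restore drill."""
    last_clean_restore_point: datetime
    data_loss_declared: datetime
    workload_validated: datetime

    @property
    def measured_rpo(self) -> timedelta:
        # Data you would lose: time between the last clean copy and the event.
        return self.data_loss_declared - self.last_clean_restore_point

    @property
    def measured_rto(self) -> timedelta:
        # Downtime: time from the event to a validated, working restore.
        return self.workload_validated - self.data_loss_declared


def gap(measured: timedelta, target: timedelta) -> timedelta:
    """How far the measurement overshoots the target (zero if it meets it)."""
    return max(measured - target, timedelta(0))


# Hypothetical drill: restore point at 03:00, event at 09:00, restored at 20:00.
drill = DrillTiming(
    last_clean_restore_point=datetime(2024, 3, 5, 3, 0),
    data_loss_declared=datetime(2024, 3, 5, 9, 0),
    workload_validated=datetime(2024, 3, 5, 20, 0),
)

rto_gap = gap(drill.measured_rto, target=timedelta(hours=4))
rpo_gap = gap(drill.measured_rpo, target=timedelta(hours=24))

print(drill.measured_rto, rto_gap)  # 11:00:00 7:00:00
print(drill.measured_rpo, rpo_gap)  # 6:00:00 0:00:00
```

Tracking the gap from drill to drill is what turns the policy number into something you can actually defend.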

The takeaway

Cyber-resilience done right isn't a bigger backup budget or a fancier tool. It's a handful of public AWS primitives — immutability, account isolation, a clean-room recovery path — wired together under one assumption: the attacker already has your keys and is coming for your ability to recover. Then it's the discipline to rehearse the restore until it's muscle memory.

So here's the challenge. Don't ask your team whether you have backups; you already know that answer and it's worthless. Ask when you last restored a production-scale workload from an immutable copy into an isolated environment, end to end, with a clock running. If you can't name the date, that drill is the most important thing on your roadmap — and the next destructive event will schedule it for you on far worse terms.
