
From Cluster Autoscaler to Karpenter Across a Fleet: What Actually Breaks

Karpenter is the right call for most EKS shops. But the migration breaks things that have nothing to do with autoscaling — and a lean platform team should know exactly what those are before flipping the switch.

By Michael York | May 13, 2026 | 5 min read | 975 words

Karpenter is one of those rare infrastructure upgrades that is genuinely worth the hype. It provisions nodes faster, bin-packs better, and lets you stop maintaining a sprawl of per-AZ node groups. If you run EKS at any real scale, you will end up there. I'm not here to talk you out of it.

I'm here to talk about the week after you turn it on. Because the thing that breaks during a Cluster Autoscaler to Karpenter migration is almost never the autoscaling. The autoscaling works on day one — that's the seductive part. What breaks is everything downstream that quietly depended on how the old system behaved. And on a fleet of clusters, those assumptions are baked into a hundred places no one wrote down.

The real mechanic: you're not swapping a scaler, you're changing node lifecycle

Cluster Autoscaler operates on Auto Scaling Groups. It thinks in terms of node groups you defined, instance types you pre-selected, and a scaling cadence that is, frankly, slow and predictable. Predictable is boring, and boring is what your other systems were tuned against.

Karpenter throws that model out. It provisions individual instances directly against your NodePool constraints, consolidates aggressively, and treats nodes as genuinely disposable. That last word is the whole story. Karpenter will churn nodes — to consolidate underutilized capacity, to move you onto cheaper instance types, to respect node expiry. This is the feature. It is also the thing that exposes every workload in your fleet that was secretly assuming nodes live a long time.

So the question to ask before you migrate isn't "will it scale?" It's "what in my environment quietly assumes a node will still be here in an hour?" On a lean team, you find those assumptions in production unless you go looking first.

What actually breaks

PodDisruptionBudgets you never set. When Karpenter consolidates, it drains nodes. If a workload has no PDB, or a sloppy one, consolidation will happily take down more replicas at once than you'd ever tolerate. Cluster Autoscaler's slower rhythm hid this for years. Karpenter does not. Before you migrate, audit PDBs across the fleet — and treat "no PDB" as a finding, not a default.
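
To make "audit PDBs across the fleet" concrete, here is a minimal sketch of the check in Python. It assumes you have already exported workloads and PDBs into simple records (for example from `kubectl get ... -o json`). The `Workload` and `PDB` types and their fields are hypothetical simplifications: selectors are reduced to `matchLabels`, and `max_unavailable` is an absolute count rather than a percentage.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    namespace: str
    name: str
    replicas: int
    labels: dict  # pod template labels


@dataclass
class PDB:
    namespace: str
    match_labels: dict    # simplified: matchLabels only, no matchExpressions
    max_unavailable: int  # simplified: absolute count, no percentages


def covers(pdb: PDB, wl: Workload) -> bool:
    """True if the PDB is in the workload's namespace and its selector matches."""
    if pdb.namespace != wl.namespace:
        return False
    return all(wl.labels.get(k) == v for k, v in pdb.match_labels.items())


def audit(workloads, pdbs):
    """Return (namespace, name, finding) tuples for missing or sloppy PDBs."""
    findings = []
    for wl in workloads:
        matching = [p for p in pdbs if covers(p, wl)]
        if not matching:
            findings.append((wl.namespace, wl.name, "no PDB"))
            continue
        for p in matching:
            if p.max_unavailable >= wl.replicas:
                # The budget permits every replica to be evicted at once.
                findings.append((wl.namespace, wl.name, "PDB allows all replicas down"))
            elif p.max_unavailable == 0:
                # The budget never permits a voluntary eviction, so drains stall.
                findings.append((wl.namespace, wl.name, "PDB blocks all drains"))
    return findings
```

The third finding matters as much as the first two: a budget that never allows an eviction is also sloppy, because it turns every drain into a stuck node.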

Stateful and long-running workloads. Anything that holds local state, runs a long batch job, or hates being rescheduled needs an explicit do-not-disrupt annotation or it will get moved at the worst possible time. The fix is trivial; knowing which of your hundreds of workloads need it is the work.
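
A rough sketch of the "which workloads need it" half, assuming you keep a hand-maintained set of sensitive workload names and have the workload manifests as parsed JSON. The annotation key below is the one recent Karpenter releases document for pods; older releases used a different key, so check it against the version you run. The helper names are hypothetical.

```python
import json

# Pod annotation recognized by recent Karpenter releases (verify for your version).
DO_NOT_DISRUPT = "karpenter.sh/do-not-disrupt"


def is_protected(pod_template: dict) -> bool:
    """True if the pod template already carries the do-not-disrupt annotation."""
    annotations = (pod_template.get("metadata") or {}).get("annotations") or {}
    return annotations.get(DO_NOT_DISRUPT) == "true"


def unprotected(workloads: list, sensitive_names: set) -> list:
    """Sensitive workloads whose pod templates lack the annotation, sorted by name."""
    return sorted(
        w["metadata"]["name"]
        for w in workloads
        if w["metadata"]["name"] in sensitive_names
        and not is_protected(w["spec"]["template"])
    )


def protection_patch() -> str:
    """JSON merge patch that adds the annotation to a workload's pod template."""
    patch = {
        "spec": {"template": {"metadata": {"annotations": {DO_NOT_DISRUPT: "true"}}}}
    }
    return json.dumps(patch)
```

The output of `protection_patch()` can be passed to `kubectl patch` with `--type merge`. Changing the pod template triggers a rollout, so apply it deliberately.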

Graceful shutdown that was never really tested. Faster, more frequent node turnover is a continuous test of whether your apps actually handle SIGTERM, finish in-flight requests, and deregister from load balancers cleanly. Plenty of services that looked healthy under Cluster Autoscaler were just never asked to prove it. Karpenter asks constantly.
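
For reference, this is roughly the behavior being tested, sketched in Python with only the standard library. It assumes a simple loop-style worker. The `serve` and `install` names are hypothetical, and a real service would also fail its readiness check at this point so the load balancer stops sending traffic.

```python
import signal
import threading

# Set once SIGTERM arrives; the serving loop checks it between items.
shutting_down = threading.Event()


def handle_sigterm(signum, frame):
    """Signal handler: record the request and let the loop drain."""
    shutting_down.set()


def install():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def serve(work_items, process):
    """Process items until finished or until SIGTERM is seen.

    The flag is only checked between items, so an item that has
    started is always completed before the loop exits.
    """
    completed = []
    for item in work_items:
        if shutting_down.is_set():
            break  # stop accepting new work
        completed.append(process(item))
    return completed
```

The whole drain has to fit inside the pod's termination grace period. After that, the kubelet sends SIGKILL and whatever was still in flight is lost.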

DaemonSets and the bin-packing math. Karpenter accounts for DaemonSet overhead when it picks instance sizes, which is great — until a heavy logging or security agent makes your "efficient" small nodes wasteful, or your consolidation savings evaporate because every node carries the same fixed tax. Model your DaemonSet footprint before you trust the cost projection.
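
The fixed tax is easy to put numbers on. The sketch below is back-of-the-envelope arithmetic, not Karpenter's scheduling logic: it assumes perfect packing, ignores kubelet and system reservations, and the figures in the usage note are made up for illustration.

```python
import math


def daemonset_tax(node_vcpu: float, node_mem_gib: float,
                  ds_vcpu: float, ds_mem_gib: float) -> float:
    """Fraction of a node consumed by DaemonSets, on whichever resource binds first."""
    return max(ds_vcpu / node_vcpu, ds_mem_gib / node_mem_gib)


def fleet_overhead_vcpu(workload_vcpu: float, node_vcpu: float, ds_vcpu: float) -> float:
    """Total vCPU spent on DaemonSets to host a given amount of workload vCPU."""
    usable_per_node = node_vcpu - ds_vcpu
    nodes = math.ceil(workload_vcpu / usable_per_node)
    return nodes * ds_vcpu
```

With a hypothetical agent footprint of 0.5 vCPU and 1 GiB per node, `daemonset_tax(2, 8, 0.5, 1)` is 0.25 while `daemonset_tax(16, 64, 0.5, 1)` is about 0.03. Hosting 100 vCPU of workload costs 33.5 vCPU of overhead on 2-vCPU nodes (67 nodes) against 3.5 vCPU on 16-vCPU nodes (7 nodes).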

The cost surprise that runs the other way. Everyone expects Karpenter to cut spend, and it usually does. But broad instance-type flexibility plus Spot can put workloads onto hardware your team has never operationally validated: different network throughput, different CPU-to-memory ratios, occasionally different behavior under load. And Spot interruptions become a daily event rather than a quarterly curiosity. Karpenter handles them well, but only if your apps tolerate interruption in the first place. Constrain the NodePool to instance families you've actually run before you open the aperture all the way.
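
As an illustration of what "constrain the NodePool" means, the sketch below mirrors the shape of NodePool requirements as a Python list and checks candidate instances against it. The two label keys are the ones Karpenter's AWS provider documents. The family list is an arbitrary example, and `permitted` is a simplified model I wrote for illustration, not Karpenter's own evaluation.

```python
# Mirrors the shape of NodePool requirements; the family list is an example only.
REQUIREMENTS = [
    {"key": "karpenter.k8s.aws/instance-family", "operator": "In",
     "values": ["m5", "m6i", "c6i"]},
    {"key": "karpenter.sh/capacity-type", "operator": "In",
     "values": ["on-demand", "spot"]},
]


def node_labels(instance_type: str, capacity_type: str) -> dict:
    """Derive the labels a node of this type would carry (family is the part before the dot)."""
    return {
        "karpenter.k8s.aws/instance-family": instance_type.split(".")[0],
        "karpenter.sh/capacity-type": capacity_type,
    }


def permitted(instance_type: str, capacity_type: str, requirements=REQUIREMENTS) -> bool:
    """True if the instance satisfies every In/NotIn requirement."""
    labels = node_labels(instance_type, capacity_type)
    for req in requirements:
        value = labels.get(req["key"])
        if req["operator"] == "In" and value not in req["values"]:
            return False
        if req["operator"] == "NotIn" and value in req["values"]:
            return False
    return True
```

Starting with a short `In` list and widening it later keeps every new family an explicit decision instead of something the provisioner chose for you.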

The security and governance edges nobody budgets for

This is where my background makes me twitchy, and where lean teams get burned. Karpenter provisions instances directly, which means its IAM role and the NodePool configuration are now load-bearing security controls. The node IAM role, the security groups and subnets selected by your NodeClass, the AMI family in use — all of that moves from "set once in Terraform" to "evaluated continuously by a controller." If your guardrails assumed nodes only came from a handful of blessed ASGs, those assumptions are now wrong.

Two things matter here. First, your AMI strategy. Karpenter can pull the latest EKS-optimized AMI automatically, which is convenient and also means an upstream image change can roll across your fleet faster than your validation does. Pin deliberately and control the upgrade. Second, drift and compliance tooling that watched ASGs needs to watch NodePools and NodeClasses instead, or your posture reporting goes quietly blind right when node churn is at its highest.
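
One cheap guardrail for the AMI point is a check that flags any NodeClass still tracking the latest image. The sketch below assumes the `amiSelectorTerms` alias form used by recent Karpenter releases (a family name, then `@`, then either `latest` or a pinned version). Treat that format as an assumption to verify against your version. The function name and sample values in the usage note are hypothetical.

```python
def unpinned_node_classes(node_classes: list) -> list:
    """Names of NodeClasses with at least one AMI selector alias ending in '@latest'."""
    flagged = []
    for nc in node_classes:
        terms = (nc.get("spec") or {}).get("amiSelectorTerms") or []
        for term in terms:
            alias = term.get("alias", "")
            if alias.endswith("@latest"):
                flagged.append(nc["metadata"]["name"])
                break  # one unpinned term is enough to flag the class
    return flagged
```

Fed a list of parsed NodeClass manifests, it returns only those with an alias such as `al2023@latest`. Classes that select by a versioned alias or an explicit AMI ID pass. Running it in CI against the rendered manifests turns "pin deliberately" from a convention into a gate.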

What a lean platform team should plan for

The honest version of this migration is not a config swap; it's a behavioral change you roll out cluster by cluster. Run Karpenter alongside Cluster Autoscaler on one non-critical cluster first. Turn consolidation on in its most conservative mode and watch what churns before you let it get aggressive. Audit PDBs and disruption annotations as a prerequisite, not a follow-up. Constrain instance types to what you know, then widen. And treat the first month of elevated node turnover as a free, continuous chaos test, because that's exactly what it is.

The teams that struggle are the ones that treated Karpenter as a drop-in efficiency win and discovered, in production, that their reliability was propped up by slow nodes. The teams that win are the ones who understood that disposable nodes are the point, and made their workloads worthy of the assumption before flipping the switch.

So here's the challenge: before your next cluster migrates, go find every workload that would be unhappy if its node disappeared in the next ten minutes. If you can't produce that list, you're not ready to turn on consolidation — you're ready to learn the list the hard way.

Tags: Platform Engineering, Kubernetes, Cloud Cost, Reliability