Running Spot GPU Workloads on EKS the Right Way

GPU workloads are where AWS bills go to get absolutely unhinged. The moment you introduce training jobs, batch inference, or anything remotely ML-related, costs spike fast. That’s why teams inevitably look at Spot GPU instances on Amazon Elastic Kubernetes Service and think, “this could save us a fortune.”
They’re right. And they’re also usually about to break their cluster.
Spot GPUs can absolutely slash costs, often by 70 percent or more, but only if your workloads are designed for interruption. If they aren’t, you’re not “optimizing,” you’re gambling. This article walks through how teams that actually know what they’re doing run Spot GPU workloads on EKS without chaos.
Why Spot GPUs Feel Like a Trap at First
On paper, Spot GPUs look perfect. Same hardware, massive discounts, no long-term commitment. In reality, they come with a hard truth: AWS can take them back whenever capacity tightens. You usually get a two-minute warning. Sometimes less. Sometimes your node just disappears and Kubernetes shrugs.
This is where most teams screw up. They treat Spot like cheaper on-demand instead of what it really is, opportunistic capacity. If your workloads assume stability, Spot will punish you for that assumption.
The Single Most Important Architectural Decision
If there’s one thing you remember from this article, make it this: never mix Spot GPU and non-Spot GPU workloads casually.
Spot GPU nodes should be isolated, clearly labeled, and treated as disposable. Your EKS cluster should make it painfully obvious which workloads are allowed to die and restart and which ones are not. When teams skip this separation, interruptions turn into outages, failed pipelines, and late-night Slack meltdowns.
The right approach is boring but effective. Dedicated node groups. Explicit scheduling rules. No guessing. When a GPU job lands on Spot, it’s because you chose that tradeoff on purpose.
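Here’s a rough eksctl-style sketch of that separation, done at node group level. The cluster name, labels, and sizes are placeholders, not recommendations:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ml-cluster          # placeholder cluster name
  region: us-east-1
managedNodeGroups:
- name: gpu-ondemand        # stable capacity for workloads that must not be interrupted
  instanceTypes: ["g5.xlarge"]
  minSize: 0
  maxSize: 4
  labels:
    workload-tier: gpu-ondemand
- name: gpu-spot            # disposable capacity, clearly labeled
  spot: true
  instanceTypes: ["g5.xlarge"]
  minSize: 0
  maxSize: 10
  labels:
    workload-tier: gpu-spot
The exact layout matters less than the principle: Spot capacity is a separate, named thing you can target, drain, and reason about on its own.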
Designing GPU Workloads That Survive Reality
Spot interruptions aren’t rare edge cases. They are the normal operating condition. That means your GPU workloads need to expect failure and recover cleanly.
Training jobs should checkpoint aggressively. Batch workloads should be idempotent. If a pod restarts halfway through, it should resume, not start over from scratch or corrupt output. This isn’t just about Spot either. It’s good engineering, but Spot makes the consequences impossible to ignore.
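As a minimal sketch, assuming a hypothetical training image, checkpoint path, and resume flag, the Kubernetes side of that looks something like a Job that retries on failure and mounts durable storage for checkpoints:
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model              # hypothetical job name
spec:
  backoffLimit: 6                # let the job survive several Spot interruptions
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest        # placeholder image
        args: ["--checkpoint-dir=/ckpt", "--resume"]       # hypothetical flags
        volumeMounts:
        - name: checkpoints
          mountPath: /ckpt
      volumes:
      - name: checkpoints
        persistentVolumeClaim:
          claimName: training-checkpoints                  # durable storage that outlives any node
The container has to do its part too: write checkpoints on a timer and look for an existing one at startup. Kubernetes only gives you the retry; resuming is your code’s job.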
Teams that succeed with Spot GPUs tend to treat nodes as temporary and data as sacred. Compute can vanish. Progress should not.
Capacity Strategy Matters More Than Price
One of the fastest ways to make Spot unusable is locking yourself into a single GPU instance type. Capacity fluctuates constantly, especially for popular SKUs. If you only allow one size or family, you are begging for interruptions.
Clusters that behave well in production allow multiple compatible GPU instance types and let AWS pick whatever capacity is actually available. This doesn’t just reduce interruptions, it stabilizes autoscaling behavior and makes Spot feel far less random.
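In an eksctl-style config like the earlier sketch, that just means listing several compatible types on the Spot node group instead of one. These particular types are examples; the right mix depends on your GPU memory, driver, and framework constraints:
- name: gpu-spot
  spot: true
  instanceTypes:            # let AWS pull from whichever capacity pools are available
  - g4dn.xlarge
  - g4dn.2xlarge
  - g5.xlarge
  - g5.2xlarge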
The goal isn’t the absolute cheapest price. It’s consistent availability at a discount.
Enforcing the Separation with Taints and Tolerations
If there’s one place teams screw this up, it’s node selection. People say “we separate Spot and on-demand,” then don’t actually enforce it, and Kubernetes happily schedules critical GPU pods onto disposable nodes. Chaos follows.
The fix is simple and explicit. You taint your Spot GPU nodes so nothing lands on them unless it opts in.
Your Spot GPU node group should be created with a taint like:
spot=true:NoSchedule
That single line tells Kubernetes, “hands off unless a pod explicitly agrees to die.”
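If you provision with an eksctl-style config like the earlier sketch, the same taint goes onto the Spot node group at creation time, so every node comes up already protected:
- name: gpu-spot
  spot: true
  taints:
  - key: spot
    value: "true"
    effect: NoSchedule      # nothing lands here without an explicit toleration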
Then, in the GPU workload that can tolerate interruption, you opt in on purpose. A simplified pod spec looks like this:
tolerations:
- key: "spot"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"
Now that pod is allowed to run on Spot nodes, and everything else is blocked by default. No accidents, no surprises.
To make sure the pod actually lands on GPU hardware and not some random CPU node, you also add a node selector (or node affinity) targeting your GPU labels, for example:
nodeSelector:
  nvidia.com/gpu.present: "true"
At this point, scheduling is no longer “best effort.” It’s intentional. GPU workloads that are safe to interrupt run on Spot; workloads that aren’t can never even land on those nodes.
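Put together, a Spot-tolerant GPU pod spec looks roughly like this; the workload name and image are placeholders, and the GPU request matters just as much as the toleration:
apiVersion: v1
kind: Pod
metadata:
  name: batch-inference          # hypothetical workload
spec:
  tolerations:
  - key: "spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  nodeSelector:
    nvidia.com/gpu.present: "true"
  containers:
  - name: worker
    image: registry.example.com/inference:latest    # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1        # claim the GPU explicitly so the scheduler places the pod on a node that has one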
This is the difference between Spot being a controlled cost-saving tool and Spot being a production landmine. You’re not trusting Kubernetes to guess, you’re telling it exactly what’s allowed.
Autoscaling Without Thrashing Your Cluster
GPU autoscaling on EKS can be painful if you’re sloppy. Overly aggressive scale-down settings cause nodes to disappear while workloads are still stabilizing. Under-requested GPU resources lead to scheduling failures that look like bugs but are really configuration mistakes.
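What “not sloppy” looks like depends on which autoscaler you run. Assuming the standard Kubernetes Cluster Autoscaler, the scale-down knobs live in its Deployment args; the values below are illustrative starting points, not tuned recommendations:
containers:
- name: cluster-autoscaler
  image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0   # match the version to your cluster
  command:
  - ./cluster-autoscaler
  - --cloud-provider=aws
  - --scale-down-delay-after-add=10m      # don't tear down nodes right after adding them
  - --scale-down-unneeded-time=10m        # a node must sit idle this long before removal
  - --balance-similar-node-groups=true    # spread across the GPU instance types you allowed
The other half is making sure every GPU pod actually requests nvidia.com/gpu, as in the pod spec above; the scheduler can’t reason about a resource the pod never asks for.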
When done correctly, autoscaling becomes boring. Nodes come and go. Jobs start, pause, resume, and finish. The cluster doesn’t flap. Engineers stop babysitting it. That’s the bar.
The Part Everyone Ignores Until Finance Asks Questions
Here’s the uncomfortable part. Most teams run Spot GPUs and have no idea how much they’re actually saving.
EKS costs are messy. Nodes churn. Pricing models mix. GPU usage spikes and dips. Without clean visibility, Spot can feel cheaper while still hiding waste through underutilization, poor scheduling, or fallback to on-demand nodes.
This is where Spend Shrink earns its keep. Instead of guessing, you can see exactly how much GPU spend is on Spot versus on-demand and whether your architecture is actually delivering savings or just complexity.
When Spot GPUs Are the Wrong Call
Spot GPUs are powerful, but they’re not universal. If a workload is latency-sensitive, user-facing, or politically untouchable, don’t force it onto Spot just to save a few bucks. Some workloads deserve stability. That’s fine.
The smartest teams use a mix. Spot where interruption is acceptable. On-demand where it’s not. The win comes from intentional placement, not blind optimization.
Final Take
Running Spot GPU workloads on EKS isn’t about being clever. It’s about being honest with how your workloads behave under pressure. If you design for interruption, isolate risk properly, and actually measure outcomes, Spot GPUs can take a massive chunk out of your AWS bill.
If you don’t, they’ll take chunks out of your uptime instead.