Cristian Magherusan · ex-AWS engineer

Fix the Bug, Plug the Cost Leak

A client was spending a lot on IO1 EBS volumes - the high-performance, high-cost storage tier. And reasonably so. Their workload had genuinely needed those IOPS at some point, so someone had provisioned the right storage for the job. The infrastructure decision was correct at the time.

I took a look anyway, and extended my EBS tooling to collect more detailed volume metrics. What I found was interesting: the high IOPS that had justified the IO1 volumes in the first place was being driven by a performance bug in the application code.

The infrastructure wasn't the problem. The code was. The storage was faithfully, and expensively, serving I/O requests that should never have existed.

This is the kind of thing that doesn't show up in a cost optimization dashboard. The bill says "IO1 volumes, high IOPS, here's the cost." Everything looks intentional. The provisioning matches the usage. The usage matches the workload. The workload matches... a bug.

The team fixed the code. The IOPS plummeted. And once the actual I/O requirements dropped to what they should have been all along, I migrated the volumes from IO1 to GP3.

Savings: $156,000 a year.

The client had also provisioned these volumes "generously" - more IOPS than even the buggy code required. After the fix, the actual usage was a fraction of what was provisioned. GP3 conversion brought the cost below $4,000 a year, down from roughly $150,000. Over 95% savings.
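To make the arithmetic behind a gap like that concrete, here is a minimal sketch of a per-volume cost comparison. The rates are assumptions based on published us-east-1 list prices and may be out of date, the flat io1 IOPS rate ignores the cheaper tiers that apply at very high IOPS, and the example volume is hypothetical rather than one of the client's.

```python
# Assumed monthly list prices (USD); check current AWS pricing for your region.
IO1_PER_GIB = 0.125      # io1 storage, per GiB-month
IO1_PER_IOPS = 0.065     # io1, per provisioned IOPS-month (flat rate assumed)
GP3_PER_GIB = 0.08       # gp3 storage, per GiB-month
GP3_PER_IOPS = 0.005     # gp3, per IOPS-month above the included baseline
GP3_PER_MIBPS = 0.04     # gp3, per MiB/s-month above the included baseline
GP3_FREE_IOPS = 3000     # IOPS included with every gp3 volume
GP3_FREE_MIBPS = 125     # MiB/s throughput included with every gp3 volume


def io1_monthly(size_gib, provisioned_iops):
    """Monthly cost of an io1 volume: storage plus every provisioned IOPS."""
    return size_gib * IO1_PER_GIB + provisioned_iops * IO1_PER_IOPS


def gp3_monthly(size_gib, iops=GP3_FREE_IOPS, mibps=GP3_FREE_MIBPS):
    """Monthly cost of a gp3 volume: storage plus only what exceeds the baseline."""
    extra_iops = max(0, iops - GP3_FREE_IOPS)
    extra_mibps = max(0, mibps - GP3_FREE_MIBPS)
    return (size_gib * GP3_PER_GIB
            + extra_iops * GP3_PER_IOPS
            + extra_mibps * GP3_PER_MIBPS)


if __name__ == "__main__":
    # Hypothetical volume: 500 GiB with 10,000 provisioned IOPS on io1.
    before = io1_monthly(500, 10_000)
    after = gp3_monthly(500)  # baseline gp3 is enough once the bug is fixed
    print(f"io1: ${before:.2f}/month, gp3: ${after:.2f}/month")
    print(f"savings: {1 - after / before:.1%}")
```

With these assumed rates, the provisioned IOPS account for most of the io1 bill, which is why removing the need for them matters far more than the per-GiB price difference.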

GP3 is interesting because every volume includes a baseline of 3,000 IOPS and 125 MiB/s of throughput in its per-GB price, which covers most workloads without provisioning anything extra. But the bigger story here isn't about GP3 vs IO1. It's about the relationship between infrastructure cost and application behavior.

Cloud cost optimization is usually treated as an infrastructure problem. And it often is. But sometimes the infrastructure is just reflecting a problem that lives somewhere else entirely. The storage was doing exactly what it was told. It was just being told the wrong things.

My approach - look at the metrics, trace the I/O back to its source, understand why the workload exists before trying to make it cheaper - found a cost leak that no amount of instance resizing or Reserved Instance purchasing would have touched.
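The first step, looking at the metrics, can be sketched roughly as follows. CloudWatch reports the VolumeReadOps and VolumeWriteOps metrics for EBS volumes as operation counts per period, so dividing their Sum by the period length gives an approximate average IOPS. This is not the tooling described above, just a minimal illustration; the function names and sample numbers are made up, and fetching the datapoints from CloudWatch is left out.

```python
def average_iops(read_ops_sum, write_ops_sum, period_seconds):
    """Approximate average IOPS over one period from the Sum of read and write ops."""
    return (read_ops_sum + write_ops_sum) / period_seconds


def utilization_report(datapoints, period_seconds, provisioned_iops):
    """Summarize observed IOPS against what the volume is provisioned for.

    datapoints: list of (read_ops_sum, write_ops_sum) tuples, one per period.
    """
    iops = [average_iops(r, w, period_seconds) for r, w in datapoints]
    peak = max(iops)
    return {
        "peak_iops": peak,
        "mean_iops": sum(iops) / len(iops),
        "peak_utilization": peak / provisioned_iops,
    }


if __name__ == "__main__":
    # Hypothetical 5-minute datapoints for a volume provisioned at 10,000 IOPS.
    sample = [(60_000, 30_000), (120_000, 60_000)]
    print(utilization_report(sample, period_seconds=300, provisioned_iops=10_000))
```

Averages over a period hide short bursts, so shorter periods give a more honest peak. And a high number here only says the volume is busy; deciding whether that I/O should exist at all still means tracing it back to the code.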

The migration to GP3 still needed a careful, gradual rollout to verify performance. But with the code fixed, the IO1 volumes were wildly overprovisioned for the actual workload. The expensive storage tier was solving a problem that no longer existed.
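One way to structure a gradual rollout like this is to plan the conversions in small batches, starting with the least busy volumes. The sketch below is an assumption about how that could look, not the procedure actually used: the 1.5x headroom factor, the ordering, and the volume IDs are all hypothetical. Each planned request uses the parameter names of the EC2 ModifyVolume API (VolumeId, VolumeType, Iops, Throughput), so it could be passed to boto3's modify_volume after review.

```python
import math

GP3_BASELINE_IOPS = 3000    # included with every gp3 volume
GP3_BASELINE_MIBPS = 125    # included with every gp3 volume


def gp3_settings(peak_iops, peak_mibps, headroom=1.5):
    """Pick gp3 IOPS/throughput: observed peak plus headroom, never below baseline."""
    iops = max(GP3_BASELINE_IOPS, math.ceil(peak_iops * headroom))
    mibps = max(GP3_BASELINE_MIBPS, math.ceil(peak_mibps * headroom))
    return iops, mibps


def plan_batches(volumes, batch_size):
    """Order volumes from quietest to busiest and split them into batches.

    volumes: list of dicts with "VolumeId", "peak_iops" and "peak_mibps" keys.
    Returns a list of batches, each a list of ModifyVolume-style request dicts.
    """
    requests = []
    for vol in sorted(volumes, key=lambda v: v["peak_iops"]):
        iops, mibps = gp3_settings(vol["peak_iops"], vol["peak_mibps"])
        requests.append({
            "VolumeId": vol["VolumeId"],
            "VolumeType": "gp3",
            "Iops": iops,
            "Throughput": mibps,
        })
    return [requests[i:i + batch_size] for i in range(0, len(requests), batch_size)]


if __name__ == "__main__":
    # Hypothetical post-fix measurements for three volumes.
    observed = [
        {"VolumeId": "vol-a", "peak_iops": 5000, "peak_mibps": 50},
        {"VolumeId": "vol-b", "peak_iops": 400, "peak_mibps": 10},
        {"VolumeId": "vol-c", "peak_iops": 2500, "peak_mibps": 200},
    ]
    for number, batch in enumerate(plan_batches(observed, batch_size=2), start=1):
        print(number, batch)
```

Converting one batch, watching latency and queue depth for a while, and only then moving to the next keeps any surprise confined to a few low-risk volumes.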

Not every cost leak is a cloud problem. Some of them are code problems that show up on your infrastructure bill.