The Cost Leak I Couldn't Fix

June 2024

One of my clients runs a bursty workload on Fargate. The containers sit idle most of the time, then spike to 100% CPU in seconds - so fast that Fargate can't scale quickly enough. By the time additional capacity comes online, the burst is already over. They're paying for containers that are either idle or too late.

I built a tool that estimates what a workload would cost on Lambda using actual ALB CloudWatch metrics - requests, latency, bandwidth over the last 30 days. I ran it against this client's setup. The numbers were startling.
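
To give a feel for the arithmetic behind an estimate like this, here is a minimal sketch. It is not the tool itself. The function name, the 512 MB memory size, and the prices are assumptions for illustration: the defaults are meant to resemble on-demand x86 list prices in a US region and should be checked against current AWS pricing. The request count and average duration would come from ALB CloudWatch metrics such as RequestCount and TargetResponseTime. Free tier, data transfer, and the cost of whatever sits in front of Lambda are left out.

```python
# Hypothetical estimator: Lambda cost = request charge + duration charge.
# All prices are assumptions for illustration; verify against current AWS pricing.

ASSUMED_PRICE_PER_MILLION_REQUESTS = 0.20   # USD, assumed list price
ASSUMED_PRICE_PER_GB_SECOND = 0.0000166667  # USD, assumed list price


def estimate_lambda_monthly_cost(
    requests: int,
    avg_duration_s: float,
    memory_gb: float = 0.5,
    price_per_million_requests: float = ASSUMED_PRICE_PER_MILLION_REQUESTS,
    price_per_gb_second: float = ASSUMED_PRICE_PER_GB_SECOND,
) -> float:
    """Estimate a monthly Lambda bill in USD from 30 days of traffic.

    requests       -- total requests in the period (e.g. sum of ALB RequestCount)
    avg_duration_s -- average handling time in seconds (e.g. ALB TargetResponseTime)
    memory_gb      -- configured function memory in GB (an assumption, not a metric)
    """
    request_cost = requests / 1_000_000 * price_per_million_requests
    # Duration is billed in GB-seconds: time spent multiplied by configured memory.
    compute_cost = requests * avg_duration_s * memory_gb * price_per_gb_second
    return request_cost + compute_cost


if __name__ == "__main__":
    # Made-up traffic numbers, only to show the call.
    cost = estimate_lambda_monthly_cost(requests=10_000_000, avg_duration_s=0.2)
    print(f"Estimated Lambda cost: ${cost:,.2f} per month")
```

The point of the sketch is that nothing in the formula depends on wall-clock time. An idle hour contributes zero, which is why idle-heavy workloads come out so cheap.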

Production: roughly $4,000 a month on Fargate. Estimated on Lambda: about $400 a month. A 10x difference. And that estimate was conservative.

Staging was worse. $1,500 a month on Fargate. $5 a month estimated on Lambda. That's 300x.

I ran the same tool against several other former and current clients. The estimated savings ranged from 6.4x at the low end to 583x at the high end, with 150x in between.

So - clear decision. Migrate to Lambda, save a fortune, everybody wins, right?

No. Not this time.

The application is written in PHP, and like many PHP applications it expects locally writable, persistent storage. Lambda doesn't provide that natively. A migration is doable, but it's risky and a lot of work. This isn't the kind of safe configuration change I can ship myself, like a controlled failover, a storage tier migration, or a rightsizing PR that's quick to review and safe to deploy. A Lambda migration of a PHP app requires significant internal team effort, extensive testing, and an appetite for risk that has to come from the team that owns the code.

I showed them the way. I did the analysis, built the cost estimation, and laid out the path. But the implementation is on their team.

In cases like this, I charge only for the analysis work, not for the savings. If I can't ship it myself, I don't pretend I can.

One conclusion here is to never use PHP. But saying that is just asking for a flame war.

The deeper conclusion is that some architecture decisions are extremely expensive to undo. The choice of runtime, the choice of compute layer, the coupling between your application and its infrastructure - these decisions compound. They're easy and cheap at the start. They get harder and more expensive every month.

If you're starting something new right now - a new service, a new project, a new architecture - this is the cheapest moment you'll ever have to make the right call.

And no, the right call isn't always Lambda. Per CPU-second, EC2 provides more compute at lower cost. But web servers rarely keep the CPU constantly at 100%. Lambda bills per request and for the time spent handling it, with nothing charged while idle, so it wins on bursty, idle-heavy workloads. Most web workloads are exactly that.
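
One way to make "if your workload fits" concrete is a break-even calculation: how many requests a month would it take before Lambda costs as much as an always-on bill? The sketch below is a rough illustration under stated assumptions, not a pricing reference. The function name, the 512 MB memory size, and the prices are hypothetical inputs and should be checked against current AWS pricing.

```python
# Hypothetical break-even: the monthly request volume at which a Lambda bill
# equals a fixed always-on bill. Prices are assumptions; verify before relying on them.

ASSUMED_PRICE_PER_MILLION_REQUESTS = 0.20   # USD, assumed list price
ASSUMED_PRICE_PER_GB_SECOND = 0.0000166667  # USD, assumed list price
SECONDS_PER_30_DAYS = 30 * 24 * 3600


def break_even_requests(
    fixed_monthly_cost: float,
    avg_duration_s: float,
    memory_gb: float = 0.5,
) -> float:
    """Requests per month at which Lambda cost equals fixed_monthly_cost (USD)."""
    cost_per_request = (
        ASSUMED_PRICE_PER_MILLION_REQUESTS / 1_000_000
        + avg_duration_s * memory_gb * ASSUMED_PRICE_PER_GB_SECOND
    )
    return fixed_monthly_cost / cost_per_request


if __name__ == "__main__":
    # Made-up numbers: a $100/month always-on service, 200 ms per request.
    monthly = break_even_requests(fixed_monthly_cost=100.0, avg_duration_s=0.2)
    per_second = monthly / SECONDS_PER_30_DAYS
    print(f"Break-even: {monthly:,.0f} requests/month (~{per_second:.1f} req/s sustained)")
```

Below the break-even volume, paying per request is cheaper. Above it, always-on compute starts to win. A workload that idles most of the day tends to sit far below that line.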

I first used Lambda for REST APIs in 2015. I'd always assumed Lambda was more expensive than EC2. The math says otherwise - if your workload fits.

The cost leak is real. The fix exists. But it's a design problem, not a configuration problem. And the most honest thing to say about a design problem is: here's what it costs you, here's what the fix looks like, and here's why it's hard.