Lean Infrastructure: A Cloud Cost Optimization Method

Lean infrastructure is four decisions: cut the cloud spend you don’t actually need, own the part of your stack that you’d be foolish to rent, observe the whole thing cheaply, and don’t hire a platform team for a load you don’t have. None of those is about being cheap. They’re about not paying a tax that buys you nothing.

I have written a year of posts that each demonstrate one of those decisions. This is the post that says what they add up to. If you read only this one, you have the thesis. The others are just the receipts.

The default is to overspend, and the numbers say so

The cloud’s promise was that you’d pay for exactly what you use. In practice most companies pay for a good deal more. Harness’s FinOps in Focus 2025 report (26 February 2025) estimates that 21% of enterprise cloud spend, about $44.5 billion in 2025, is wasted on resources nobody is using. The same report found that 55% of developers say purchasing commitments are “based on guesswork.” Flexera’s 2025 State of the Cloud report found that 84% of organisations name managing cloud spend as their biggest cloud challenge.

Read those together and the picture is plain. The waste is not an accident that careful teams avoid. It is the default state of a cloud bill that nobody is actively cutting. A fifth of the money is gone before anyone makes a single bad decision — it leaks out through idle instances, over-sized commitments bought on a guess, and storage that should have aged into a cheaper tier months ago.

This is not a new observation. In 2021 Andreessen Horowitz, not exactly cloud sceptics, published The Cost of Cloud, a Trillion Dollar Paradox, arguing that at scale the cloud quietly becomes more expensive than the infrastructure it replaced, and that the savings story breaks down precisely when a company gets big enough to matter. Four years on, the bills have only grown: Gartner forecast $723.4 billion in worldwide public cloud end-user spending for 2025. The default did not fix itself.

So the first thing lean infrastructure means is simple: assume you are overspending, because the data says you almost certainly are, and go and look. Not as a one-time audit. As a habit.

Cut the spend you don’t need — most of it isn’t an architecture decision

The fastest savings are not clever. They are boring. Idle resources turned off. Right-sized instances. Commitments bought on real usage instead of a guess. Storage moved to the tier that matches how often you actually read it. I wrote the whole playbook in the AWS cost optimization guide — the quick wins there get most teams a 20–40% cut without touching a line of application code.
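To make "commitments bought on real usage instead of a guess" concrete, here is a minimal sketch. It assumes you can export hourly usage in some unit (instance-hours, say) and that you know the on-demand and committed rates. The function names, the sample figures and the simplified pricing model are all illustrative assumptions, not any provider's actual billing rules.

```python
from typing import Sequence


def commitment_cost(hourly_usage: Sequence[float], commit: float,
                    on_demand_rate: float, committed_rate: float) -> float:
    """Total cost for the period if `commit` units are reserved every hour.

    Committed units are paid for whether they are used or not; usage above
    the commitment is billed at the on-demand rate. (Simplified model.)
    """
    total = 0.0
    for used in hourly_usage:
        overflow = max(0.0, used - commit)
        total += commit * committed_rate + overflow * on_demand_rate
    return total


def best_commitment(hourly_usage: Sequence[float],
                    on_demand_rate: float, committed_rate: float) -> float:
    """Return the commitment level with the lowest total cost.

    The cost is piecewise linear in the commitment, so it is enough to
    try zero and each usage level that was actually observed.
    """
    candidates = sorted(set(hourly_usage) | {0.0})
    return min(
        candidates,
        key=lambda c: commitment_cost(hourly_usage, c,
                                      on_demand_rate, committed_rate),
    )


# Hypothetical usage: a steady base of 4 units with short peaks at 10.
usage = [4, 4, 4, 4, 10, 10]
print(best_commitment(usage, on_demand_rate=1.0, committed_rate=0.6))
```

With these made-up numbers the cheapest choice is to commit to the steady base and pay on-demand for the peaks. Committing to the peak costs the same as committing to nothing, which is roughly what a guess tends to buy.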

I will say the uncomfortable part out loud: most cloud waste is not a hard engineering problem. It is an attention problem. The instances are idle because nobody owns the question “is this still needed?” The commitment was a guess because the person buying it had no visibility into real usage. The bill grows faster than the business because reducing it is everyone’s job, which means it is no one’s.

Lean infrastructure makes it someone’s job. Once a month, look at the bill the way you’d look at a personal expense you suspected was wrong. You will find the leak.
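The monthly look can be as small as a diff of two bills. The sketch below assumes you have per-line-item totals for last month and this month as plain dictionaries. The line-item names, the 10% growth threshold and the 100-unit floor are hypothetical and should be tuned to your own bill.

```python
def flag_growth(previous: dict[str, float], current: dict[str, float],
                max_growth: float = 0.10,
                min_cost: float = 100.0) -> list[tuple[str, float, float]]:
    """Return (line item, previous cost, current cost) for items worth a look.

    An item is flagged when it costs at least `min_cost` this month and is
    either new or grew by more than `max_growth` (a fraction, 0.10 = 10%).
    """
    flagged = []
    for item, cost in current.items():
        if cost < min_cost:
            continue  # too small to be worth anyone's attention
        before = previous.get(item, 0.0)
        if before == 0.0 or (cost - before) / before > max_growth:
            flagged.append((item, before, cost))
    # Largest absolute increase first.
    return sorted(flagged, key=lambda row: row[2] - row[1], reverse=True)


# Hypothetical bills, in whatever currency the invoice uses.
last_month = {"compute": 1000.0, "storage": 400.0, "logs": 50.0}
this_month = {"compute": 1050.0, "storage": 600.0, "logs": 80.0, "nat": 300.0}

for item, before, now in flag_growth(last_month, this_month):
    print(f"{item}: {before:.0f} -> {now:.0f}")
```

The output is a short list of questions, not answers: each flagged line still needs a person who owns "is this still needed?"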

Own the stack where it matters — and only there

Cutting waste is the floor. The ceiling is deciding what you should not be renting at all.

There is a specific, repeatable pattern: a managed service charges you a multiple over its underlying cost, and that multiple is justified, right up until your volume gets large enough that the multiple becomes the whole problem. Observability is the textbook case. I lived it. At a previous company, Datadog was costing roughly $40K a month at 1 TB a day, and we were on a path to 6 TB a day. The bill was not going to scale; it was going to detonate. So we built our own platform on the open-source LGTM stack (Loki, Grafana, Tempo, Mimir) and ran it roughly 8× cheaper at production scale. The avoided cost was on the order of $120K a month.

That is not an argument to self-host everything. It is the opposite. It is an argument to know your break-even. Managed services are a genuinely good deal at low volume — you rent expertise you don’t have and capacity you don’t yet need. The mistake is staying on the rented version long after you’ve crossed the line where owning it is cheaper and you have the competence to run it. Below the line, rent. Above the line, and only if you can operate it, own.
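"Know your break-even" reduces to a small piece of arithmetic. The sketch below assumes the managed bill scales linearly with volume and that the self-hosted bill is a fixed floor plus a smaller per-volume cost. Both are simplifications, and every number in the example is hypothetical rather than taken from any real invoice.

```python
from typing import Optional


def managed_monthly_cost(tb_per_day: float, rate_per_tb_day: float) -> float:
    """Managed-service bill, assuming it scales linearly with volume."""
    return tb_per_day * rate_per_tb_day


def self_hosted_monthly_cost(tb_per_day: float, fixed: float,
                             variable_per_tb_day: float) -> float:
    """Self-hosted bill: a fixed floor plus a per-volume component."""
    return fixed + tb_per_day * variable_per_tb_day


def break_even_tb_per_day(rate_per_tb_day: float, fixed: float,
                          variable_per_tb_day: float) -> Optional[float]:
    """Volume above which owning is cheaper, or None if it never is."""
    margin = rate_per_tb_day - variable_per_tb_day
    if margin <= 0:
        return None  # the managed service is cheaper at every volume
    return fixed / margin


# Hypothetical figures: 1000 per TB/day managed, 6000 fixed plus
# 250 per TB/day self-hosted.
print(break_even_tb_per_day(1000.0, fixed=6000.0, variable_per_tb_day=250.0))
```

The `fixed` term is where the competence question lives: it has to include the cost of the people who can actually run the thing. If you would need to hire them, the floor rises and the break-even moves with it.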

I took the same logic to its end for my own work. I rent six dedicated servers in a German data centre and run a multi-tenant Kubernetes platform on top — most of what my life runs on now lives there. For more cores, more memory, and more SSD than most production clusters I worked on a decade ago, I pay a flat monthly figure that a comparable managed setup would multiply several times over. But I was honest in that post and I’ll be honest here: that only works because I can run it. If you can’t, the maths flips, and the rented version is the right answer. Owning the stack is a competence decision before it is a cost decision.

Observe it cheaply, because you can’t cut what you can’t see

You cannot trim a bill you cannot read. The reason 55% of developers describe commitments as guesswork is that the teams making them have no clean view of what they actually use. Observability is therefore not a luxury you add once things are working; it is the instrument panel that tells you where the waste is in the first place.

The trap is that observability is itself one of the fastest-growing lines on the bill, which is exactly the Datadog story above. So the lean version is recursive: instrument everything, but instrument it on a stack whose cost scales with your hardware, not with a per-gigabyte meter that someone else sets and re-prices whenever they like. Open-source telemetry on hardware you control means visibility stops being a thing you ration. You keep all the data, because keeping it costs you the marginal price of a disk, not a vendor’s margin on every log line.
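The difference between a per-gigabyte meter and "the marginal price of a disk" can be sketched as two cost functions. This is a deliberately rough model: it ignores replication, compute and the time needed to operate the stack. The prices, disk size and 30-day month are hypothetical placeholders.

```python
import math


def metered_monthly_cost(gb_per_day: float, price_per_gb: float,
                         days_in_month: int = 30) -> float:
    """Per-gigabyte pricing: every ingested gigabyte is billed."""
    return gb_per_day * days_in_month * price_per_gb


def owned_monthly_cost(gb_per_day: float, retention_days: int,
                       disk_capacity_gb: float,
                       disk_monthly_price: float) -> float:
    """Hardware pricing: pay for the disks needed to hold retained data."""
    retained_gb = gb_per_day * retention_days
    disks = max(1, math.ceil(retained_gb / disk_capacity_gb))
    return disks * disk_monthly_price


# Hypothetical: 100 GB/day, metered at 0.10 per GB, versus 4000 GB disks
# at 20 per month each.
print(metered_monthly_cost(100, price_per_gb=0.10))
print(owned_monthly_cost(100, retention_days=30,
                         disk_capacity_gb=4000, disk_monthly_price=20.0))
print(owned_monthly_cost(100, retention_days=40,
                         disk_capacity_gb=4000, disk_monthly_price=20.0))
```

The metered cost rises with every gigabyte, while the owned cost moves in steps. In this toy example, extending retention from 30 to 40 days costs nothing extra because the data still fits on the same disk, which is the sense in which visibility stops being rationed.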

Cheap observability is what makes the other three decisions possible. It is the difference between cutting cost on a guess and cutting it on a number.

Don’t hire a platform team you don’t need

Here is where the industry quietly inflates the bill in a way no FinOps dashboard catches: headcount.

The standard story is that owning infrastructure requires a platform team, an SRE rotation, a small standing army to keep the lights on — so you should stay fully managed to avoid all that. Sometimes that’s true. Often it’s a story the rented option tells to keep you renting. The honest question is not “can I afford a platform team?” It is “what is my actual operational load?”

I run a multi-tenant Kubernetes platform, a self-hosted observability stack, and the services on top of them — alone, from a modest house in Hyderabad, on a setup that would not impress anyone who saw it. I am not claiming this scales to every organisation. I am claiming the load is far lower than the staffing story implies, once the architecture is chosen to be light. That last clause is the whole game. The platform team you “need” is often the platform team a heavy architecture demands — and the heavy architecture was a choice, not a law.

This is why I rewrote four services from Python into Go over three weekends. It was not a performance vanity project. Lighter services mean a smaller cluster, fewer moving parts, less to watch at 2 a.m., and a lower standing operational cost — in money and in attention. Choosing the light architecture up front is how you avoid hiring your way out of complexity you created yourself.

Lean infrastructure, on the people side, means: build the thing so that running it doesn’t require an org chart. Then you don’t need the org chart.

The four decisions, together

Put them in order, because the order matters:

1. Cut the waste. Assume you’re overspending, because the data says you are, and audit monthly. Most of it is attention, not architecture. This is the floor, and it’s free.
2. Own where it matters. Find your break-even on each rented service. Below it, rent. Above it, and only if you can run it, own. Competence first, cost second.
3. Observe it cheaply. Instrument everything on a stack whose cost scales with hardware, not with a vendor’s per-gigabyte meter. You can’t cut what you can’t see.
4. Don’t over-staff. Choose a light architecture so that running it doesn’t demand a platform team you don’t have the load for.

Notice none of these is a technology. They’re a posture. The posture is: pay for what genuinely buys you something, and refuse to pay for what doesn’t — whether the bill comes from a cloud provider, a managed-service margin, or a headcount you talked yourself into.

There is a version of this work that is just being cheap, and I want to be clear it isn’t that. Being cheap is cutting the thing that matters to save a small number. Lean infrastructure is the opposite — it is spending deliberately on what matters so you have the room to do the work that actually moves your business. The point of cutting the $120K observability line was never the $120K. It was no longer being held hostage by a bill that grew faster than the company could.

That’s the whole philosophy. Cut what’s wasted, own what’s worth owning, see everything, staff for the real load. Do those four, in that order, and the infrastructure stops being a tax and starts being an asset you actually control.

If you’re staring at a cloud bill that’s growing faster than your business and you suspect a good chunk of it is the tax I’ve described — that’s the work I do. I’ve cut it for companies and I’ve cut it for myself, and the four decisions above are the whole method. If that’s the conversation you need to have, reach me here.

Sources

Harness, FinOps in Focus 2025 (26 Feb 2025): 21% of enterprise cloud spend wasted, ~$44.5B in 2025; 55% of developers say commitments are based on guesswork; 52% cite FinOps/dev disconnect as a cause of waste. https://www.prnewswire.com/news-releases/44-5-billion-in-infrastructure-cloud-waste-projected-for-2025-due-to-finops-and-developer-disconnect-finds-finops-in-focus-report-from-harness-302385580.html
Flexera, State of the Cloud 2025: 84% of organisations name managing cloud spend as their biggest cloud challenge. https://www.flexera.com/about-us/press-center/new-flexera-report-finds-84-percent-of-organizations-struggle-to-manage-cloud-spend
Andreessen Horowitz, The Cost of Cloud, a Trillion Dollar Paradox (2021): at scale, cloud cost can exceed the infrastructure it replaced. https://a16z.com/the-cost-of-cloud-a-trillion-dollar-paradox/
Gartner, public cloud forecast: ~$723.4B worldwide public cloud end-user spending, 2025 (figure as cited in CIO’s cloud-repatriation coverage, not taken from Gartner’s primary release). https://www.cio.com/article/4061031/why-cloud-repatriation-is-back-on-the-cio-agenda.html
37signals (Basecamp) cloud exit: ~$2M/year savings after leaving AWS — supporting datapoint for own-past-break-even. https://www.hivelocity.net/blog/cloud-repatriation-why-workloads-are-moving-off-aws/