The Datadog Alternative That Cut a Projected $120K Bill by 85%

Own your observability. That is the whole post.

The longer version: at 6 TB of telemetry a day, a self-hosted Grafana LGTM stack ran our entire observability platform for about $15-18K a month. Datadog, at the same volume, would have cost roughly $120K a month. That is a cut of about 85%, and the gap widens as you grow, because the two costs scale on different curves. The alternative here is not a worse tool; it does the same job, owned instead of rented. Datadog is not a bad product. The problem is that you are renting something you could own, on a meter that bills you more every time your product succeeds.

I have written the full gaming-company war story separately — the Tuesday finance slide, the parallel run, the six-week cutover (how we avoided $120K/month). This post is the teardown: how a bill gets to that size in the first place, what actually changed it, and why the lesson sent me all the way to running my own cloud.

How an observability bill quietly gets to six figures

Nobody decides to spend $120K a month watching their own logs. It accretes. Here is the mechanism, because the mechanism is the lesson.

The pricing model scales against you, not with you. SaaS observability is priced per host, per GB ingested, per million events — units that grow every time your product succeeds. At 1 TB/day we were paying Datadog $40K/month and it felt manageable. The trouble was never the bill on the day. It was the trajectory: per-GB pricing on a platform that generates more data every quarter is a tax on growth, and the rate gets worse at scale. Model it at 2x, 5x, 10x your current volume before you sign, not after.
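A minimal sketch of that modelling exercise, in Python. The function name and the `unit_price_factor` knob are hypothetical; the only input taken from this post is the $40K/month bill at 1 TB/day. It assumes cost scales linearly with volume, optionally at a discounted rate, which is a simplification of any real price sheet.

```python
def project_monthly_cost(current_monthly_usd, multiples=(2, 5, 10), unit_price_factor=1.0):
    """Project a volume-priced bill at several multiples of today's volume.

    unit_price_factor is a hypothetical knob: 1.0 keeps today's effective
    per-GB rate, 0.5 assumes a tier priced at half that rate.
    """
    return {m: current_monthly_usd * m * unit_price_factor for m in multiples}


if __name__ == "__main__":
    # Compare "same rate" against an assumed 50% volume discount.
    for factor in (1.0, 0.5):
        projection = project_monthly_cost(40_000, unit_price_factor=factor)
        for multiple, cost in projection.items():
            print(f"rate x{factor}: {multiple}x volume -> ${cost:,.0f}/month")
```

For reference, the $120K projection at 6 TB/day used later in this post works out to about half the per-TB rate implied by the 1 TB/day bill, which is what `unit_price_factor=0.5` with `multiples=(6,)` reproduces. Even with that discount assumed, the total still tracks volume.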

Nobody owns the meter. We had three systems running at once. Datadog was the official tool. CloudWatch was still alive underneath it — teams had built dashboards there years ago and nobody turned it off, nobody knew what it cost. And during real incidents, engineers were still SSHing into boxes and grepping log files, because at our volume the paid search was too slow to trust. Three observability systems, none of them complete, all of them billing. For a real-time platform with 50-plus Kubernetes nodes and 200 services, nobody had ever added up what the whole picture cost.

The bill grows faster than anyone looks at it. Spend was climbing 15% month over month. Debug logs with infinite retention. Custom metrics nobody queried. Dashboards nobody opened. The bill grew and nobody questioned it, because nobody connected “what we’re logging” with “what we’re paying.” The vendor market depends on exactly this blindness.
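To make 15% month over month concrete, here is a small sketch of the compounding arithmetic. The function names are illustrative, and it assumes the growth rate holds steady, which a real bill will not do exactly.

```python
import math


def compound_growth(monthly_rate, months):
    """Multiplier on the bill after `months` of steady compounding growth."""
    return (1 + monthly_rate) ** months


def doubling_time_months(monthly_rate):
    """Months until the bill doubles at a steady monthly growth rate."""
    return math.log(2) / math.log(1 + monthly_rate)


if __name__ == "__main__":
    print(f"12-month multiplier: {compound_growth(0.15, 12):.2f}x")
    print(f"doubling time: {doubling_time_months(0.15):.1f} months")
```

At a steady 15% a month the bill doubles in about five months and grows a little over fivefold in a year.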

What we actually changed

Two moves, in order.

First, we made the spend legible. Before touching observability I put a tagging strategy on every resource — environment, team, service, cost-centre, owner, no exceptions. Without tags, Cost Explorer is a wall of numbers; with tags it becomes a conversation. The first week alone surfaced seventeen staging instances running below 5% CPU that nobody remembered. You cannot cut what you cannot see, and most teams are trying to cut blind.
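A tag policy like this is easy to check mechanically. The sketch below is illustrative only: it assumes you have already exported each resource's tags into a plain dictionary (from whatever inventory tool you use), and the function names and sample resource IDs are hypothetical. The five tag keys are the ones listed above.

```python
REQUIRED_TAGS = ("environment", "team", "service", "cost-centre", "owner")


def missing_tags(resource_tags):
    """Return the required tag keys that are absent or blank on one resource."""
    return [key for key in REQUIRED_TAGS
            if not str(resource_tags.get(key, "")).strip()]


def audit(resources):
    """Map resource id -> missing tag keys, for non-compliant resources only."""
    report = {}
    for resource_id, tags in resources.items():
        gaps = missing_tags(tags)
        if gaps:
            report[resource_id] = gaps
    return report


if __name__ == "__main__":
    # Hypothetical inventory export: resource id -> tags.
    inventory = {
        "i-aaa": {"environment": "staging", "team": "platform", "service": "api",
                  "cost-centre": "eng", "owner": "alice"},
        "i-bbb": {"environment": "staging", "team": ""},
    }
    for resource_id, gaps in audit(inventory).items():
        print(resource_id, "is missing:", ", ".join(gaps))
```

Run on a schedule, a report like this turns "no exceptions" from a policy statement into something that gets enforced.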

Then we replaced the platform. We moved to Grafana’s open-source LGTM stack (Loki for logs, Grafana for dashboards, Tempo for traces, Mimir for metrics), running on our own Kubernetes cluster and backed by S3 underneath. (If you run a smaller shop, the minimum viable version of this is four single binaries on one bucket. I wrote that build up in “LGTM for small teams”; the scale build that moved this 6 TB/day is in “Observability at 6 TB/day”.) We ran it in parallel with the existing tooling for a month before decommissioning anything, because you do not rip out monitoring during a tournament with a hundred thousand players online and hope. Every alert, every dashboard, every on-call query was tested against both.
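One way to keep a parallel run honest is to diff what each system actually fired over the same window. This is a sketch under stated assumptions, not the procedure used in this migration: it assumes you can export the names of alerts that fired in each system as plain lists, and the function name and sample alert names are hypothetical.

```python
def parity_report(old_alerts, new_alerts):
    """Compare alert names fired by the old and new stacks over one window."""
    old, new = set(old_alerts), set(new_alerts)
    return {
        "matched": sorted(old & new),
        "missing_in_new": sorted(old - new),  # coverage gaps to fix before cutover
        "extra_in_new": sorted(new - old),    # new noise, or things the old tool missed
    }


if __name__ == "__main__":
    report = parity_report(
        old_alerts=["HighErrorRate", "DiskFull", "QueueBacklog"],
        new_alerts=["HighErrorRate", "DiskFull", "PodCrashLoop"],
    )
    for bucket, names in report.items():
        print(f"{bucket}: {names}")
```

Anything in `missing_in_new` is a reason to delay decommissioning; anything in `extra_in_new` deserves a look before it becomes pager noise.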

The self-hosted stack did not just match the SaaS tool. It was better for our case. Log search was faster. LogQL was more powerful than anything we had. There was no per-seat cost, so fifteen departments onboarded inside the first month instead of rationing licences. And we owned the data.

The before and after

	Datadog (projected at 6 TB/day)	Self-hosted LGTM (actual at 6 TB/day)
Monthly cost	~$120,000	~$15,000-18,000
Difference	n/a	~7-8× cheaper / roughly 85-88% less
Scales with	data volume (linear, per-GB)	compute + storage
Vendor lock-in	yes	no
Data ownership	theirs	ours

These are approximate figures and I mark them as approximate. The Datadog number at 6 TB/day is a projection from their published per-volume pricing off our real 1 TB/day bill of $40K, not an invoice we paid at that scale — we never let it get there. The $15-18K is what the self-hosted platform actually cost at production volume. The honest claim is the one in the table: same telemetry, same scale, roughly an order of magnitude less money, and the data stays yours.
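The arithmetic behind the comparison is simple enough to check yourself. A small sketch using only the figures quoted in this post; the function name is illustrative.

```python
def savings(saas_monthly_usd, self_hosted_monthly_usd):
    """Return (percent reduction, cost ratio) of self-hosted versus SaaS."""
    reduction_pct = (saas_monthly_usd - self_hosted_monthly_usd) / saas_monthly_usd * 100
    ratio = saas_monthly_usd / self_hosted_monthly_usd
    return reduction_pct, ratio


if __name__ == "__main__":
    # Both ends of the self-hosted range against the $120K projection.
    for self_hosted in (18_000, 15_000):
        pct, ratio = savings(120_000, self_hosted)
        print(f"${self_hosted:,}/month: {pct:.1f}% less, {ratio:.1f}x cheaper")
```

The 85% figure corresponds to the top of the self-hosted range ($18K), and the 8× figure to the bottom ($15K); the real number sits somewhere between the two.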

Why this sent me to building my own cloud

Here is where most cost-cutting stories stop. Mine didn’t.

The observability win taught me one principle the hard way: at scale, owning your infrastructure beats renting it — on cost, on control, and on the 3 AM question of who can actually fix the thing. A SaaS vendor owns the meter and the lock-in. When you own the stack, your costs are compute and storage, both of which scale far more slowly than per-GB pricing, and there is no one between you and the fix.

I did not want to only recommend that to clients. So I run it myself. I built StackZ — my own Kubernetes-on-bare-metal platform on Hetzner dedicated machines: Talos Linux for an immutable, SSH-less OS, KubeVirt so it runs full VMs alongside pods, and LINSTOR for replicated local NVMe storage, with the usual networking and GitOps layers on top. I assembled it from the upstream open-source pieces and I operate it end to end — including its own LGTM stack watching itself, with an external dead-man’s-switch on a separate box because in-cluster alerting cannot tell you the cluster is down.
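The dead-man's-switch idea fits in a few lines. This is a minimal sketch of the pattern, not the StackZ implementation: it assumes the cluster sends a periodic heartbeat to a process on the separate box, and that process raises the alarm when the heartbeats stop. The class name, method names, and the 300-second threshold are all hypothetical.

```python
import time


class DeadMansSwitch:
    """Runs outside the cluster; alerts when heartbeats stop arriving."""

    def __init__(self, max_silence_s=300.0):
        self.max_silence_s = max_silence_s
        self.last_heartbeat = None  # no heartbeat seen yet

    def record_heartbeat(self, now=None):
        """Call whenever the cluster checks in."""
        self.last_heartbeat = time.time() if now is None else now

    def should_alert(self, now=None):
        """True when the cluster has been silent longer than allowed."""
        now = time.time() if now is None else now
        if self.last_heartbeat is None:
            return True  # never heard from the cluster: treat it as down
        return (now - self.last_heartbeat) > self.max_silence_s


if __name__ == "__main__":
    switch = DeadMansSwitch(max_silence_s=300)
    switch.record_heartbeat(now=1_000)
    print(switch.should_alert(now=1_200))  # within the window
    print(switch.should_alert(now=1_400))  # silent for too long
```

The logic is inverted on purpose: silence is the alarm condition, so a dead cluster, a dead network path, and a dead alerting pipeline all look the same from the outside and all page someone.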

I am not going to pretend I wrote every layer from scratch — these are mature open-source building blocks and I stand on them deliberately. What I claim is narrower and, I think, more useful to you: I assembled this and I run it, on hardware I control down to the kernel, and it has bitten me in every way a real platform bites you — a NIC that hangs the node under load, a storage stall that traced back to vSwitch MTU, a secrets store that will not auto-unseal after a restart. I know what owning the stack costs, because I pay it.

The lesson, stated plainly

Your observability is probably costing you more than you think, on a curve that gets worse as you grow, on a meter you do not own. The fix is not clever. It is the boring discipline of making the spend legible and then asking, for each piece, whether you should be renting it at all.

So the question I would put to you is the one I put to myself: what is your observability costing you today, and what will it cost at 5x your current volume? Not the sticker price — the projected curve. You might not like the answer. You will be glad you looked before the bill arrived.