
How We Avoided $120K/Month in Observability Costs

aws · cost-optimization · observability · grafana · lgtm · case-study
Observability cost comparison — Datadog $120,000/month vs self-hosted LGTM $15K-$18K/month, 8× cheaper at 6 TB/day

It was a Tuesday morning leadership meeting. I was looking at the finance slide — the one with the infrastructure cost trendline pointing sharply upward.

Our CTO looked at me. As Engineering Manager, the infrastructure decisions rolled up to my team. And I couldn’t explain the number.

I knew the bill was growing — everyone did. We were a gaming company scaling fast. A hundred thousand concurrent players on a good day, two hundred thousand on a great one. Infrastructure had to keep up. The technology decisions were mine. But I’d been so focused on keeping the platform alive that I’d never stopped to ask whether we were spending wisely.

That question haunted me for the rest of the day. By evening, I’d pulled up AWS Cost Explorer for the first time in months. What I found changed how I think about infrastructure forever.

The number that stopped the room

Our monthly AWS spend was growing 15% month over month. But when I broke it down, the story wasn’t what anyone expected.

The biggest cost wasn’t compute. It wasn’t the database. It wasn’t even the hundred thousand concurrent players hammering our game servers.

It was observability.

The observability situation was a mess. We had three systems running simultaneously — and none of them gave us the full picture.

Datadog was the “official” monitoring tool. Polished dashboards, easy setup, solid alerting. We were paying $40,000 a month for it at 1 TB of data per day.

CloudWatch was still running alongside it. Some teams had built dashboards there years ago and never migrated. Others used CloudWatch Logs because that’s where their Lambda functions wrote by default. Nobody had turned it off. Nobody knew what it cost.

SSH and grep. I wish I were joking. During incidents, engineers were still SSHing into servers and grepping log files. Datadog was supposed to eliminate this, but the search was slow at our volume, results were inconsistent, and old habits die hard. When something was on fire, people reverted to what they trusted: a terminal.

Three observability systems. None of them complete. All of them costing money. For a real-time gaming platform with fifty-plus Kubernetes nodes and two hundred services, this wasn’t sustainable.

Our data volume was growing fast. Player counts were doubling. Every new feature meant more services, more logs, more metrics, more traces. We were headed toward 6 TB/day within the year.

I did the math on a napkin during that leadership meeting. Datadog was $40K at 1 TB/day, and their pricing grows with data volume. Even with generous volume discounts, 6 TB/day penciled out to $120,000 a month or more. Add CloudWatch costs on top of that — costs nobody was even tracking properly. Plus the hidden cost of engineers wasting time SSHing into servers during incidents because the tools we were paying for weren’t fast enough.

More than we’d spend on the game servers themselves.

Datadog projected $120,000/month vs self-hosted LGTM at $15K-$18K/month — 8× cheaper at 6 TB/day

I presented the numbers to the CTO. He gave me the green light to explore alternatives. I had an answer — just not one the finance team would expect.

“We need to build our own.”

That’s not a statement you make lightly. Building your own observability platform means owning every outage, every gap in coverage, every 3 AM page that a SaaS vendor would have handled for you. But the math was undeniable. And I’d rather own the problem at $15-18K/month than rent someone else’s solution at $120K.

Building the replacement

I became obsessed with the observability problem. Not with cutting costs for the sake of cutting — but with the absurdity of paying $120K/month to read our own logs.

Before touching observability though, I needed to understand the full picture. The first thing I did was implement a tagging strategy across every resource in our AWS accounts.

```yaml
# Every resource got these tags — no exceptions
Tags:
  Environment: production | staging | development
  Team: platform | backend | data | infrastructure
  Service: game-server | matchmaking | analytics | api
  CostCenter: engineering | operations
  Owner: [email protected]
```

Without tags, Cost Explorer is a wall of numbers. With tags, it becomes a conversation. I could finally answer questions like “how much does matchmaking cost?” and “why is the data team’s spend growing faster than their traffic?”
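
Tagging only works if it sticks. One way to keep the scheme from drifting (a hypothetical sketch, not necessarily what we ran) is an AWS Config managed rule that flags any resource missing the required tags. This assumes AWS Config is already recording in the account:

```yaml
# CloudFormation sketch: flag resources that are missing required tags.
# The covered resource types and tag keys are illustrative.
Resources:
  RequiredCostTagsRule:
    Type: AWS::Config::ConfigRule
    Properties:
      ConfigRuleName: required-cost-tags
      Source:
        Owner: AWS
        SourceIdentifier: REQUIRED_TAGS  # AWS-managed rule
      InputParameters:
        tag1Key: Environment
        tag2Key: Team
        tag3Key: Service
        tag4Key: CostCenter
        tag5Key: Owner
      Scope:
        ComplianceResourceTypes:
          - AWS::EC2::Instance
          - AWS::RDS::DBInstance
          - AWS::S3::Bucket
```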

The first week alone revealed something embarrassing: we had seventeen EC2 instances running in staging with average CPU utilization below 5%. Nobody remembered what they were for. Nobody had touched them in months.

The bet

With tagging in place, I turned to the real target: that $120K/month observability bill.

This was the scariest proposal I’d ever made. You don’t just rip out your monitoring. If something breaks during a tournament with a hundred thousand players online, and you can’t see what’s happening, someone’s getting fired. Probably me.

I spent three weeks evaluating alternatives. The answer turned out to be Grafana’s open-source LGTM stack — Loki for logs, Grafana for dashboards, Tempo for traces, Mimir for metrics. All of it running on our own Kubernetes cluster.

LGTM stack architecture — K8s cluster feeds Grafana Alloy, which fans out to Mimir, Loki, and Tempo, backed by S3 long-term storage

We ran the LGTM stack in parallel with Datadog for a month. Every alert, every dashboard, every on-call query — tested against both. The LGTM stack was faster for log search. The query language (LogQL) was more powerful than anything in Datadog or CloudWatch. The dashboards were more flexible. Engineers stopped SSHing into servers because Grafana actually gave them what they needed in seconds. And we owned the data.
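
To make “faster for log search” concrete, here’s the shape of a typical on-call query. This is a hypothetical LogQL sketch with illustrative label names, not one of our actual dashboards:

```logql
# Per-second rate of error lines for one service, over 1-minute windows.
# The kind of question that used to mean SSHing into a dozen boxes to grep.
sum(rate({service="matchmaking", environment="production"} |= "error" [1m]))
```

One query, every instance of the service, answers in seconds.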

The cost comparison told the whole story:

| | Datadog (projected at 6 TB/day) | Self-hosted LGTM (actual at 6 TB/day) |
| --- | --- | --- |
| Monthly cost | ~$120,000 | ~$15K-$18K |
| Difference | — | ~8× cheaper |
| Scales with | Data volume (linear, per-GB) | Compute + storage |
| Vendor lock-in | Yes | No |
| Data ownership | Theirs | Ours |

At 1 TB/day, we were already paying Datadog $40K/month. We built and migrated to the LGTM stack on our own Kubernetes cluster, and by the time we hit 6 TB/day at production scale, the self-hosted bill landed at $15K to $18K/month — about 8× cheaper than the $120K+ Datadog would have cost us for that same 6× growth in data.

The self-hosted platform was actually better. Unified interface. Custom dashboards. No vendor lock-in. No per-host pricing that punishes you for scaling. Fifteen departments started using it within the first month — because there was no per-seat cost to worry about.

What else we found along the way

The observability deep dive opened our eyes to costs everywhere. Here’s one example that still surprises people: our multi-AZ architecture was one of our most expensive mistakes.

Our EKS cluster was spread across three availability zones. Every best-practice blog post says to do this. High availability. Fault tolerance. Resilience.

But nobody mentions the bill.

Every time a game server in us-east-1a talked to its Redis cache in us-east-1b, AWS charged us $0.01 per gigabyte. Each direction, so $0.02 for every gigabyte that crossed the boundary. Our services were chatty — matchmaking talked to the player database, game servers talked to the session cache, analytics talked to everything. For a sense of scale: 100 TB of east-west traffic in a month is 100,000 GB × $0.02 = $2,000, and chatty microservices rack up multiples of that. At our traffic levels, cross-AZ data transfer was costing thousands every month.

The fix was counterintuitive: we moved to a cell-based architecture. Instead of spreading each service across all three AZs, we created self-contained cells. Each cell ran in a single AZ. Game servers, their caches, their databases — all co-located. HA came from having multiple cells, not from spreading one service thin.

Before: services spread across AZs with red cross-AZ arrows. After: three self-contained cells, one per AZ, with zero cross-AZ traffic

```yaml
# Karpenter provisioner for a single-zone cell
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: cell-us-east-1a
spec:
  template:
    spec:
      # karpenter.sh/v1 requires a nodeClassRef; "default" assumes an
      # existing EC2NodeClass of that name
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        # Pin every node in this pool to one AZ — the cell boundary
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
  disruption:
    # Let Karpenter pack workloads tightly within the cell
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
```
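
The NodePool pins nodes to a zone, but workloads have to follow them there. The other half is pinning pods to the same cell. A minimal sketch with illustrative names, assuming one Deployment per cell:

```yaml
# One Deployment per cell; the node selector keeps every replica,
# and all of its east-west chatter, inside us-east-1a
apiVersion: apps/v1
kind: Deployment
metadata:
  name: game-server-cell-a
spec:
  replicas: 10
  selector:
    matchLabels:
      app: game-server
      cell: us-east-1a
  template:
    metadata:
      labels:
        app: game-server
        cell: us-east-1a
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: us-east-1a
      containers:
        - name: game-server
          image: registry.example.com/game-server:latest  # placeholder image
```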

Pod-to-pod latency dropped from 1-2 ms (cross-AZ) to under 0.5 ms (same AZ). Node utilization went from 20% to 65% because Karpenter could consolidate efficiently within each cell. Cross-AZ transfer costs nearly disappeared.

And here’s the part that surprised the whole team: reliability improved. When a cell had an issue, it was isolated. Other cells kept running. The blast radius of any single failure shrank dramatically.

The quiet wins

The observability project cracked open the door to everything else. Once we had proper tagging and visibility, obvious savings appeared everywhere — S3 lifecycle policies cut storage costs by 96%, free gateway VPC endpoints for S3 eliminated NAT Gateway data-processing charges, and Reserved Instances on databases we’d been running On-Demand for two years saved another $8K/month.

None of these are clever. They’re obvious — once you look. I’ll write about each of them separately.
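
Still, the simplest of the three is easy to show. An S3 lifecycle policy tiers aging data into cheaper storage classes and eventually expires it; here’s a sketch with an assumed bucket name and retention tiers, not our exact policy:

```yaml
# CloudFormation sketch: tier aging logs to cheaper storage, then expire them.
# Bucket name and day thresholds are illustrative.
Resources:
  LogsBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: example-observability-logs
      LifecycleConfiguration:
        Rules:
          - Id: tier-then-expire
            Status: Enabled
            Transitions:
              - StorageClass: STANDARD_IA  # infrequent access after 30 days
                TransitionInDays: 30
              - StorageClass: GLACIER      # archive after 90 days
                TransitionInDays: 90
            ExpirationInDays: 365          # delete after a year
```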

The timeline

The observability migration took about six weeks from evaluation to full cutover. The parallel-run period was the longest part — we needed absolute confidence before decommissioning anything.

| Week | What We Did |
| --- | --- |
| 1-2 | Evaluated LGTM stack, built proof of concept on staging |
| 3 | Deployed to production alongside Datadog and CloudWatch (parallel run) |
| 4 | Migrated dashboards, alerts, and on-call queries |
| 5 | Onboarded all 15 departments to Grafana |
| 6 | Decommissioned CloudWatch, cancelled the Datadog contract |

Result: We replaced Datadog ($40K/month), decommissioned CloudWatch, and eliminated the need to grep server logs — all with a single self-hosted platform that cost roughly $1,500/month at the 1 TB/day cutover and $15K-$18K/month once we grew to 6 TB/day. And we avoided what would have been $120K+/month in Datadog alone at that scale.

Three fragmented systems became one. The LGTM stack scaled with us. Datadog’s pricing would have scaled against us.

Once the observability win was in, the momentum made everything else easier. Leadership stopped questioning infrastructure costs and started asking what we could optimize next. Over the following months, we found significant savings across compute, storage, and networking — but those are stories for another post.

What I’d do differently

If I could go back, I’d change two things.

I’d build the LGTM stack earlier. We ran the old systems and LGTM in parallel for a month out of caution. In hindsight, two weeks would have been enough. The self-hosted stack was clearly better within days. That extra two weeks of parallel running meant an extra two weeks of Datadog bills: roughly $20K we didn’t need to spend.

I’d question the pricing model earlier, not just the price. When Datadog was at $40K/month, it felt manageable. The problem wasn’t the current bill — it was the trajectory. I should have modeled the cost at 2×, 5×, 10× our data volume before we ever signed the contract. Per-GB pricing on a platform that generates more data every quarter is a trap. The question isn’t “can we afford this today?” — it’s “can we afford this at scale?”

The real lesson

That Tuesday morning finance slide was the best thing that happened to our infrastructure. Not because of the $120K/month we avoided — though that’s hard to argue with. But because it exposed a blind spot every engineering team has.

We’d been running CloudWatch on autopilot for two years. Debug logs with infinite retention. Custom metrics nobody queried. Dashboards nobody opened. The bill grew 15% month over month and nobody questioned it, because nobody connected “what we’re logging” with “what we’re paying.”

The observability vendor market depends on this blindness. They price per host, per GB ingested, per million events — units that scale linearly with your infrastructure. The more you grow, the more you pay. It’s a tax on success. And the tax rate gets worse, not better, at scale.

Self-hosted observability inverts this. Your costs are the compute to run the stack and the S3 storage underneath it — both scale far more slowly than per-GB SaaS pricing. We grew from 1 TB/day to 6 TB/day and our observability bill landed at $15-18K/month at production scale. On Datadog, that same growth would have taken us from $40K to $120K+. About 8× cheaper at the larger scale — and the gap gets wider, not narrower, as you grow.

Every company I’ve talked to since has the same trajectory. They’re at 1 TB now, growing fast, and their vendor bill is growing faster. By the time the CFO notices, they’re locked in.

So here’s my question: what’s your observability costing you today, and what will it cost at 5× your current volume? Not the sticker price — the projected cost. The curve, not the point.

You might not like what you find. But you’ll be glad you looked before the bill arrived.