Cloud Cost Optimization: Where the Money Actually Goes

The first time I opened Cost Explorer to debug a bill that had grown 40% in a quarter with no traffic increase, the biggest line item wasn't compute. It wasn't RDS. It was a single NAT Gateway processing every container's outbound traffic to S3, charging us $0.045 per GB for data that never needed to leave the VPC. Eleven thousand dollars a month to route packets to a service sitting in the same region.

That is the dirty secret of cloud cost optimization: the money rarely leaks where you're looking. Everyone fixates on instance types because they're visible in every dashboard. The real waste hides in data movement, forgotten resources, and the gap between what you provisioned and what you use. After 17 years of shipping infrastructure, I've learned the bill is a diagnostic tool, not a verdict. Here's how I read it.

Start With Attribution or Stop Now

You cannot optimize what you cannot attribute. If your bill is one undifferentiated number, every optimization is a guess. Before anything else, enforce tagging. Encouraging it is not enough: use an AWS Organizations Service Control Policy that denies create requests missing the required tags (for services that support tagging at creation), and a tag policy to keep keys and values consistent.

The non-negotiable tags I require on every resource: team, service, environment, and cost-center. Here's a Terraform default that bakes them into every resource a module creates so nobody has to remember:

# providers.tf: default_tags apply to every taggable resource this provider creates
provider "aws" {
  region = "eu-west-1"
 
  default_tags {
    tags = {
      team        = "payments"
      service     = "ledger-api"
      environment = "production"
      cost-center = "cc-4471"
      managed-by  = "terraform"
    }
  }
}

Then activate those as cost allocation tags in the Billing console (they don't show up in Cost Explorer reports until you do — a step everyone forgets). Within 24 hours you can group spend by service and finally answer "what does the ledger API actually cost us?" — including its share of the NAT, the load balancer, and inter-AZ chatter.
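To make "group spend by service" concrete, here is a minimal sketch that sums line items per tag value and keeps untagged spend in its own bucket. The record shape (`cost`, `tags`) and the sample figures are hypothetical; a real Cost and Usage Report or Cost Explorer export will look different.

```python
from collections import defaultdict

UNTAGGED = "(untagged)"  # bucket name is an arbitrary choice for this sketch


def spend_by_tag(line_items, tag_key):
    """Sum cost per value of tag_key; items missing the tag land in UNTAGGED."""
    totals = defaultdict(float)
    for item in line_items:
        value = item.get("tags", {}).get(tag_key, UNTAGGED)
        totals[value] += item["cost"]
    return dict(totals)


def tagged_share(line_items, tag_key):
    """Fraction of total spend that carries tag_key (0.0 when there is no spend)."""
    totals = spend_by_tag(line_items, tag_key)
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return (total - totals.get(UNTAGGED, 0.0)) / total


# Hypothetical line items, for illustration only.
items = [
    {"cost": 1200.0, "tags": {"service": "ledger-api", "team": "payments"}},
    {"cost": 300.0, "tags": {"service": "ledger-api", "team": "payments"}},
    {"cost": 500.0, "tags": {}},
]
print(spend_by_tag(items, "service"))
print(tagged_share(items, "service"))
```

The untagged bucket is the number to watch: while it stays large, per-service figures understate what each service really costs.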

Once attribution works, the FinOps loop from the FinOps Foundation becomes real: inform (everyone sees their spend), optimize (act on the data), operate (make it continuous). Skip "inform" and the other two are theater.

Where the Money Actually Goes

In my experience the waste clusters into a predictable Pareto distribution. Here's roughly how it breaks down on a typical over-provisioned account, and how hard each is to fix:

Leak                       Typical share of waste   Effort to fix   First move
Over-provisioned compute   30–40%                   Low             Right-size from utilization data
Idle / zombie resources    15–25%                   Low             Scripted sweep + delete
Data egress & cross-AZ     10–20%                   Medium          VPC endpoints, AZ-aware routing
NAT Gateway processing     5–15%                    Medium          Gateway/Interface endpoints
Over-replicated storage    5–10%                    Low             Lifecycle policies, tier down
Logging & observability    5–15%                    Medium          Sampling, retention cuts
Non-prod left running      5–10%                    Low             Scheduled shutdown

Notice the pattern: most of the waste is low effort. You don't need a quarter-long migration. You need a Tuesday afternoon and the discipline to actually delete things.

Idle and zombie resources

Unattached EBS volumes, orphaned snapshots, Elastic IPs not associated with anything, idle load balancers, NAT Gateways in dead subnets. Each is small; together they're a steady drip. The fix is a scheduled sweep. Here's a script I run weekly that surfaces the usual suspects — it's read-only, so it reports and you decide:

#!/usr/bin/env bash
set -euo pipefail
REGION="${AWS_REGION:-eu-west-1}"
 
echo "== Unattached EBS volumes =="
aws ec2 describe-volumes --region "$REGION" \
  --filters Name=status,Values=available \
  --query 'Volumes[].{ID:VolumeId,GiB:Size,Created:CreateTime}' \
  --output table
 
echo "== Unassociated Elastic IPs (each ~\$3.60/mo idle) =="
aws ec2 describe-addresses --region "$REGION" \
  --query 'Addresses[?AssociationId==`null`].PublicIp' \
  --output text
 
echo "== Snapshots older than 90 days =="
cutoff=$(date -u -d '90 days ago' +%Y-%m-%d 2>/dev/null || date -u -v-90d +%Y-%m-%d)
aws ec2 describe-snapshots --region "$REGION" --owner-ids self \
  --query "Snapshots[?StartTime<='${cutoff}'].{ID:SnapshotId,Started:StartTime}" \
  --output table
 
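echo "== Target groups with zero healthy targets (idle load balancer candidates) =="
for tg in $(aws elbv2 describe-target-groups --region "$REGION" \
    --query 'TargetGroups[].TargetGroupArn' --output text); do
  healthy=$(aws elbv2 describe-target-health --region "$REGION" \
    --target-group-arn "$tg" \
    --query 'length(TargetHealthDescriptions[?TargetHealth.State==`healthy`])' \
    --output text)
  # Use if/fi: a bare `[ ... ] && echo` as the loop's last command leaves the
  # script with a non-zero exit status whenever the final group is healthy,
  # which fails the CI job even though nothing went wrong.
  if [ "$healthy" = "0" ]; then
    echo "  EMPTY: $tg"
  fi
done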

Run it in CI on a cron, post the output to Slack, and make someone own the triage. The first run on a neglected account is always a horror show — I once found 340 GB of unattached volumes from instances terminated a year prior.

Egress and cross-AZ transfer

Cross-AZ traffic costs $0.01/GB each way in most AWS regions. A chatty microservice mesh spanning three AZs can quietly spend thousands moving data between replicas of the same service. Egress to the internet is worse — $0.05–$0.09/GB — and it's the one your CDN bill should be absorbing but often isn't.
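A back-of-the-envelope estimator helps decide whether cross-AZ chatter is worth chasing. This sketch assumes the $0.01/GB rate quoted above is charged on both the sending and the receiving side, so each GB moved costs about $0.02; the 50 TB figure is hypothetical, and rates vary by region.

```python
CROSS_AZ_USD_PER_GB_EACH_WAY = 0.01  # rate quoted above; check your region's pricing


def cross_az_monthly_cost(gb_moved, rate_each_way=CROSS_AZ_USD_PER_GB_EACH_WAY):
    """Cost of moving gb_moved between AZs, counting both sides of the transfer."""
    return gb_moved * rate_each_way * 2


# Hypothetical mesh moving 50 TB a month between AZs.
print(round(cross_az_monthly_cost(50_000), 2))
```

Feed it the cross-AZ byte counts from VPC Flow Logs or your mesh telemetry rather than a guess; the point is to see whether the total is tens of dollars or thousands.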

Two fixes pay for themselves fast. First, VPC Gateway Endpoints for S3 and DynamoDB. They're free and route traffic off the NAT Gateway entirely:

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.eu-west-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}

That single resource is what killed our $11k NAT problem. S3 traffic stopped traversing the gateway, and processing charges dropped by 80%. Second, cache aggressively at the edge so repeated reads never hit your origin: with CloudFront or Fastly sustaining a 90% cache-hit ratio, roughly 90% of requests never reach the origin, and origin egress falls accordingly.
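The arithmetic behind those numbers is worth sketching. The block below treats the whole $11k as per-GB processing at the $0.045 rate quoted earlier and reads the 80% drop as 80% of bytes moving to the endpoint; both are simplifying assumptions (the real bill also includes hourly gateway charges). The cache function assumes the hit ratio is measured in bytes, not requests.

```python
NAT_PROCESSING_USD_PER_GB = 0.045  # per-GB rate quoted at the top of this article


def nat_processing_cost(gb_processed, rate=NAT_PROCESSING_USD_PER_GB):
    """Monthly NAT data-processing charge for gb_processed gigabytes."""
    return gb_processed * rate


def implied_gb(monthly_charge_usd, rate=NAT_PROCESSING_USD_PER_GB):
    """Traffic volume that a given processing charge implies."""
    return monthly_charge_usd / rate


def origin_egress_gb(total_gb_served, cache_hit_ratio):
    """Bytes the origin still serves, assuming a byte-weighted hit ratio."""
    return total_gb_served * (1 - cache_hit_ratio)


before_gb = implied_gb(11_000)                     # roughly 244,000 GB a month
after_usd = nat_processing_cost(before_gb * 0.20)  # 20% of bytes still cross the NAT
print(round(before_gb), round(after_usd))
```

Working backwards from the charge to the implied volume is a useful sanity check: if the volume looks implausible for your workload, something other than S3 traffic is going through the gateway.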

Over-replicated and over-tiered storage

S3 Standard at scale is rarely the right tier for data nobody reads. Lifecycle policies move objects to cheaper tiers automatically. The mistake I see is engineers manually picking tiers — use S3 Intelligent-Tiering and let AWS move objects based on access patterns, or set explicit rules for predictable data like logs:

{
  "Rules": [
    {
      "ID": "logs-tier-down-then-expire",
      "Filter": { "Prefix": "logs/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER_IR" }
      ],
      "Expiration": { "Days": 365 },
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    }
  ]
}

That last AbortIncompleteMultipartUpload rule matters more than it looks: failed multipart uploads leave invisible orphaned parts you pay for indefinitely and can't see in the console object list. I've seen accounts with terabytes of phantom storage from a flaky upload job.
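To check a rule before applying it, I find it useful to ask "where will a 45-day-old object sit?". This sketch evaluates the transitions and expiration from the rule above against an object's age. It assumes objects start in STANDARD and ignores that S3 applies lifecycle actions asynchronously, so real transitions can lag the configured day.

```python
# Mirrors the Transitions and Expiration of the logs/ rule shown above.
RULE = {
    "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 90, "StorageClass": "GLACIER_IR"},
    ],
    "Expiration": {"Days": 365},
}


def storage_class_for_age(age_days, rule=RULE, initial="STANDARD"):
    """Return the storage class at age_days, or None once the object has expired."""
    if age_days >= rule["Expiration"]["Days"]:
        return None
    current = initial
    # Walk transitions in ascending order; the last one reached wins.
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


for age in (10, 45, 200, 400):
    print(age, storage_class_for_age(age))
```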

Right-sizing and autoscaling

Compute is usually the biggest line item and, in the breakdown above, the biggest share of waste, so right-sizing has the biggest absolute payoff. Pull utilization from CloudWatch (memory metrics need the CloudWatch agent) or from AWS Compute Optimizer, which does the analysis for you, and resize anything sitting under 40% CPU and memory at peak. Then make it dynamic. For Kubernetes, the Horizontal Pod Autoscaler keyed on real signals beats a fixed replica count every time:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ledger-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ledger-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300

The stabilizationWindowSeconds prevents thrashing — scaling down too eagerly on a brief dip costs you more in cold starts than you save. Pair the HPA with Karpenter or Cluster Autoscaler so nodes follow pods, and use Spot instances for stateless, interruption-tolerant workloads (60–90% cheaper than On-Demand). For your steady-state baseline, Compute Savings Plans lock in 1- or 3-year discounts of up to ~66%. The rule I follow: Savings Plans for the floor you'll always run, Spot for the spiky middle, On-Demand only for the unpredictable top.
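It helps to know what the HPA computes with that 65% target. The Kubernetes documentation gives the core rule as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default) of 1.0. The sketch below applies that rule with the min/max from the manifest above; it deliberately leaves out the stabilization window, pod readiness, and missing-metric handling, so treat it as a reading aid, not a simulator.

```python
import math


def desired_replicas(current_replicas, current_util, target_util=65,
                     min_replicas=3, max_replicas=20, tolerance=0.1):
    """Core HPA scaling rule, clamped to the min/max replica bounds."""
    ratio = current_util / target_util
    # Within tolerance of the target: the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, 90))    # above target: scales out
print(desired_replicas(5, 68))    # within tolerance: unchanged
print(desired_replicas(10, 200))  # capped at max_replicas
```

Because the rule is proportional, a lower target utilization buys headroom at the price of more replicas at steady state, which is the cost lever you are actually tuning.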

Non-prod running overnight

Dev and staging don't need to run nights and weekends. With an 8am to 8pm weekday schedule, that's 108 idle hours out of 168 a week, roughly 64% of instance-hours spent on environments nobody touches. A scheduled scale-to-zero on a tag is the highest dollar-per-line-of-code optimization you'll ever write:

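# Stop all running instances tagged environment=dev (run this outside business hours)
ids=$(aws ec2 describe-instances --region eu-west-1 \
  --filters "Name=tag:environment,Values=dev" \
            "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId' --output text)
# Skip the call when nothing matches: stop-instances fails on an empty ID list
if [ -n "$ids" ]; then
  # $ids is intentionally unquoted so each instance ID becomes its own argument
  aws ec2 stop-instances --region eu-west-1 --instance-ids $ids
fi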

Wire it to an EventBridge schedule that stops instances at 8pm and starts them at 8am on weekdays, and your dev compute bill drops by roughly two-thirds (stopped instances still pay for their EBS volumes). The only objection is "what if someone's working late"; handle it with a self-service "keep alive" tag that exempts an instance for 24 hours. Make the default cheap and the exception easy.
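The savings depend entirely on how wide the running window is, so it is worth computing for your own schedule. A minimal sketch, assuming instances run only on weekdays and that cost scales with instance-hours (storage on stopped instances is not included):

```python
HOURS_PER_WEEK = 24 * 7  # 168


def idle_hours(hours_on_per_day, days_on_per_week=5):
    """Hours per week the environment is stopped."""
    return HOURS_PER_WEEK - hours_on_per_day * days_on_per_week


def idle_fraction(hours_on_per_day, days_on_per_week=5):
    """Share of the week's instance-hours that the schedule avoids."""
    return idle_hours(hours_on_per_day, days_on_per_week) / HOURS_PER_WEEK


# 8am to 8pm, Monday to Friday: 108 idle hours, about 64%
print(idle_hours(12), round(idle_fraction(12) * 100))
# A strict 8-hour day, Monday to Friday: 128 idle hours, about 76%
print(idle_hours(8), round(idle_fraction(8) * 100))
```

A wider window costs a few percentage points of savings but removes most of the "I was working late" complaints, which is usually the better trade.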

Guardrails: Budgets and Alerts

Optimization without guardrails regresses. New services ship, someone forgets a lifecycle rule, and you're back where you started. AWS Budgets with alerts is the cheapest insurance in the cloud. Set one per team, tied to the tags you enforced, with a forecasted-spend trigger so you hear about overruns before month-end. The request below is shaped for aws budgets create-budget --cli-input-json; the account ID is a placeholder:

{
  "AccountId": "111122223333",
  "Budget": {
    "BudgetName": "payments-monthly",
    "BudgetLimit": { "Amount": "25000", "Unit": "USD" },
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST",
    "CostFilters": { "TagKeyValue": ["user:team$payments"] }
  },
  "NotificationsWithSubscribers": [
    {
      "Notification": {
        "NotificationType": "FORECASTED",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 90,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        { "SubscriptionType": "SNS", "Address": "arn:aws:sns:eu-west-1:111122223333:cost-alerts" }
      ]
    }
  ]
}

Route the SNS topic to Slack, not email — email cost alerts go to the same graveyard as everything else.
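Why a forecasted trigger rather than an actual-spend one? It fires while there is still month left to react. AWS Budgets uses its own forecasting model, which this sketch does not reproduce; it swaps in a naive straight-line run rate purely to show the mechanism. The $25,000 limit and 90% threshold mirror the budget above, and the dates and spend figures are hypothetical.

```python
import calendar
from datetime import date


def linear_forecast(month_to_date_spend, today):
    """Project month-end spend by extending the average daily run rate."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return month_to_date_spend / today.day * days_in_month


def should_alert(month_to_date_spend, today, budget_limit=25_000, threshold_pct=90):
    """True when the projected month-end spend exceeds the threshold."""
    return linear_forecast(month_to_date_spend, today) > budget_limit * threshold_pct / 100


# Hypothetical: $8,000 spent by June 10th projects to $24,000 for the month.
print(should_alert(8_000, date(2024, 6, 10)))
print(should_alert(7_000, date(2024, 6, 10)))
```

In the first case actual spend is only about a third of the budget, yet the projection already crosses 90%, which is exactly the early warning an actual-spend alert would miss.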

The Honest Tradeoff

Here's what most FinOps content won't tell you: engineering time is more expensive than most of the waste you'll find. A senior engineer costs roughly $100/hour fully loaded. If a week-long optimization (40 hours, about $4,000) saves $200/month, the payback is 20 months, and you'd probably have been better off shipping features.

So I triage by monthly savings versus hours of effort. Anything above $1,000/month saved for under a day of work, I do immediately. Below $200/month and more than a day, I leave it unless it's a tagging or guardrail change that prevents future waste. The cheapest wins are almost always the scripted, repeatable ones (delete-the-zombies, schedule-the-dev-boxes, add-the-endpoint), not heroic re-architectures.
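The triage rule is simple enough to write down. This sketch encodes the thresholds from the paragraph above and the $100/hour rate; treating "a day" as 8 hours and labelling everything in between a "judgment call" are my assumptions for the example.

```python
HOURLY_RATE_USD = 100  # fully loaded rate quoted above
HOURS_PER_DAY = 8      # assumption: one working day


def payback_months(effort_hours, monthly_savings, hourly_rate=HOURLY_RATE_USD):
    """Months until the engineering cost is recovered by the savings."""
    return effort_hours * hourly_rate / monthly_savings


def triage(effort_hours, monthly_savings, prevents_future_waste=False):
    """Apply the do-now / skip thresholds; anything else needs a human decision."""
    if monthly_savings > 1000 and effort_hours < HOURS_PER_DAY:
        return "do now"
    if (monthly_savings < 200 and effort_hours > HOURS_PER_DAY
            and not prevents_future_waste):
        return "skip"
    return "judgment call"


# A 40-hour week at $100/hour is $4,000 of engineering time.
print(payback_months(40, 200))
print(triage(4, 1500))
print(triage(40, 150))
print(triage(40, 150, prevents_future_waste=True))
```

Putting a payback figure next to each candidate in the backlog makes the prioritization conversation much shorter.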

The other honest tradeoff is reliability. Spot instances save 70% until a workload that can't tolerate interruption gets evicted mid-transaction. Aggressive autoscaling saves money until a cold start adds latency your SLO can't absorb. Cost optimization that degrades the product isn't optimization, it's just a different bill paid by your users.

A Checklist You Can Run This Quarter

1. Enforce tags (team, service, environment, cost-center) via Terraform default tags and an SCP. Activate them as cost allocation tags.
2. Run the zombie sweep weekly in CI. Delete unattached volumes, orphan snapshots, idle EIPs.
3. Add S3/DynamoDB Gateway Endpoints to kill NAT processing charges — free, immediate.
4. Set lifecycle policies on every log and backup bucket, including the abort-multipart rule.
5. Right-size anything under 40% peak utilization, then add an HPA + Karpenter.
6. Buy Savings Plans for your steady-state floor; move stateless work to Spot.
7. Schedule non-prod to stop overnight and weekends.
8. Set per-team Budgets with forecasted alerts routed to Slack.

Do these in order and the first four pay for the rest within a billing cycle. The bill is a feedback loop — the teams that win at cost aren't the ones who optimize hardest once, they're the ones who made waste visible and kept it that way.