Cloud Cost Optimization: Where the Money Actually Goes
Cloud bills balloon in predictable places. A pragmatic FinOps guide to finding the waste (idle compute, egress, NAT, zombie resources) without crippling the team.
The first time I opened Cost Explorer to debug a bill that had grown 40% in a quarter with no traffic increase, the biggest line item wasn't compute. It wasn't RDS. It was a single NAT Gateway processing every container's outbound traffic to S3, charging us $0.045 per GB for data that never needed to leave the VPC. Eleven thousand dollars a month to route packets to a service sitting in the same region.
That is the dirty secret of cloud cost optimization: the money rarely leaks where you're looking. Everyone fixates on instance types because they're visible in every dashboard. The real waste hides in data movement, forgotten resources, and the gap between what you provisioned and what you use. After 17 years of shipping infrastructure, I've learned the bill is a diagnostic tool, not a verdict. Here's how I read it.
Start With Attribution or Stop Now
You cannot optimize what you cannot attribute. If your bill is one undifferentiated number, every optimization is a guess. Before anything else, enforce tagging. Not "encourage": enforce, with an AWS Organizations Service Control Policy that denies creating untagged resources, backed by a tag policy that standardizes keys and values (a tag policy on its own does not require a tag to be present).
The non-negotiable tags I require on every resource: team, service, environment, and cost-center. Here's a Terraform default that bakes them into every resource a module creates so nobody has to remember:
# providers.tf: applies these tags to every taggable resource this provider manages
provider "aws" {
  region = "eu-west-1"
  default_tags {
    tags = {
      team        = "payments"
      service     = "ledger-api"
      environment = "production"
      cost-center = "cc-4471"
      managed-by  = "terraform"
    }
  }
}
Then activate those as cost allocation tags in the Billing console (they don't show up in Cost Explorer reports until you do, a step everyone forgets). Within 24 hours you can group spend by service and finally answer "what does the ledger API actually cost us?", including its share of the NAT, the load balancer, and inter-AZ chatter.
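Default tags only cover what Terraform creates, so it can help to audit everything else against the same list. The sketch below is a minimal check, assuming you already have each resource's tags as a plain dictionary; the function name `missing_tags` is hypothetical and not part of any AWS tooling.

```python
# Required tag keys come from the list above; the rest is illustrative.
REQUIRED_TAGS = ("team", "service", "environment", "cost-center")


def missing_tags(resource_tags: dict) -> list:
    """Return the required tag keys that are absent or blank on a resource."""
    return [
        key for key in REQUIRED_TAGS
        if not str(resource_tags.get(key, "")).strip()
    ]


# Hypothetical resources: one fully tagged, one partially tagged.
fully_tagged = {
    "team": "payments",
    "service": "ledger-api",
    "environment": "production",
    "cost-center": "cc-4471",
}
partially_tagged = {"team": "payments", "environment": ""}

print(missing_tags(fully_tagged))
print(missing_tags(partially_tagged))
```

A blank value is treated the same as a missing key, since an empty `team` tag attributes spend to nobody.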
Once attribution works, the FinOps loop from the FinOps Foundation becomes real: inform (everyone sees their spend), optimize (act on the data), operate (make it continuous). Skip "inform" and the other two are theater.
Where the Money Actually Goes
In my experience the waste clusters into a predictable Pareto distribution. Here's roughly how it breaks down on a typical over-provisioned account, and how hard each is to fix:
| Leak | Typical share of waste | Effort to fix | First move |
|---|---|---|---|
| Over-provisioned compute | 30–40% | Low | Right-size from utilization data |
| Idle / zombie resources | 15–25% | Low | Scripted sweep + delete |
| Data egress & cross-AZ | 10–20% | Medium | VPC endpoints, AZ-aware routing |
| NAT Gateway processing | 5–15% | Medium | Gateway/Interface endpoints |
| Over-replicated storage | 5–10% | Low | Lifecycle policies, tier down |
| Logging & observability | 5–15% | Medium | Sampling, retention cuts |
| Non-prod left running | 5–10% | Low | Scheduled shutdown |
Notice the pattern: most of the waste is low effort. You don't need a quarter-long migration. You need a Tuesday afternoon and the discipline to actually delete things.
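To make the "mostly low effort" claim concrete, here is a rough sketch that sums the midpoint of each range in the table by effort level. The ranges overlap and do not add up to 100%, so the output is an indication, not a measurement.

```python
# (leak, low %, high %, effort) copied from the table above.
LEAKS = [
    ("Over-provisioned compute", 30, 40, "Low"),
    ("Idle / zombie resources", 15, 25, "Low"),
    ("Data egress & cross-AZ", 10, 20, "Medium"),
    ("NAT Gateway processing", 5, 15, "Medium"),
    ("Over-replicated storage", 5, 10, "Low"),
    ("Logging & observability", 5, 15, "Medium"),
    ("Non-prod left running", 5, 10, "Low"),
]


def share_by_effort(leaks):
    """Sum range midpoints per effort level, normalized to fractions of the total."""
    totals = {}
    for _name, low, high, effort in leaks:
        totals[effort] = totals.get(effort, 0.0) + (low + high) / 2
    grand_total = sum(totals.values())
    return {effort: value / grand_total for effort, value in totals.items()}


for effort, fraction in share_by_effort(LEAKS).items():
    print(f"{effort}: {fraction:.0%} of midpoint waste")
```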
Idle and zombie resources
Unattached EBS volumes, orphaned snapshots, Elastic IPs not associated with anything, idle load balancers, NAT Gateways in dead subnets. Each is small; together they're a steady drip. The fix is a scheduled sweep. Here's a script I run weekly that surfaces the usual suspects — it's read-only, so it reports and you decide:
#!/usr/bin/env bash
set -euo pipefail
REGION="${AWS_REGION:-eu-west-1}"

echo "== Unattached EBS volumes =="
aws ec2 describe-volumes --region "$REGION" \
  --filters Name=status,Values=available \
  --query 'Volumes[].{ID:VolumeId,GiB:Size,Created:CreateTime}' \
  --output table

echo "== Unassociated Elastic IPs (each ~\$3.60/mo idle) =="
aws ec2 describe-addresses --region "$REGION" \
  --query 'Addresses[?AssociationId==`null`].PublicIp' \
  --output text

echo "== Snapshots older than 90 days =="
# GNU date first, BSD/macOS date as the fallback
cutoff=$(date -u -d '90 days ago' +%Y-%m-%d 2>/dev/null || date -u -v-90d +%Y-%m-%d)
aws ec2 describe-snapshots --region "$REGION" --owner-ids self \
  --query "Snapshots[?StartTime<='${cutoff}'].{ID:SnapshotId,Started:StartTime}" \
  --output table

echo "== Target groups with zero healthy targets (likely idle load balancers) =="
for tg in $(aws elbv2 describe-target-groups --region "$REGION" \
    --query 'TargetGroups[].TargetGroupArn' --output text); do
  healthy=$(aws elbv2 describe-target-health --region "$REGION" \
    --target-group-arn "$tg" \
    --query 'length(TargetHealthDescriptions[?TargetHealth.State==`healthy`])' \
    --output text)
  # Use if rather than `[ ... ] && echo`: a false test on the last target group
  # would otherwise make the loop, and the script, exit non-zero in CI.
  if [ "$healthy" = "0" ]; then
    echo "  EMPTY: $tg"
  fi
done
Run it in CI on a cron, post the output to Slack, and make someone own the triage. The first run on a neglected account is always a horror show: I once found 340 GB of unattached volumes from instances terminated a year prior.
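For the Slack post, a one-line summary tends to get read where a table does not. The sketch below condenses the JSON form of the volume listing (`aws ec2 describe-volumes --output json`) into a count, a total size, and an oldest-first list. The function name and the sample volume IDs are hypothetical; the `Volumes`, `State`, `Size`, `VolumeId`, and `CreateTime` fields are the ones that response uses.

```python
import json


def summarize_unattached(describe_volumes_json: str) -> dict:
    """Summarize unattached volumes from describe-volumes JSON output."""
    volumes = json.loads(describe_volumes_json).get("Volumes", [])
    # "available" means the volume is not attached to any instance.
    idle = [v for v in volumes if v.get("State") == "available"]
    # Timestamps in one response share a format, so string order is time order.
    idle.sort(key=lambda v: v["CreateTime"])
    return {
        "count": len(idle),
        "total_gib": sum(v["Size"] for v in idle),
        "oldest_first": [v["VolumeId"] for v in idle],
    }


# Hypothetical sample shaped like the CLI's JSON output.
sample = json.dumps({
    "Volumes": [
        {"VolumeId": "vol-0aaa", "Size": 100, "State": "available",
         "CreateTime": "2023-06-01T09:00:00.000Z"},
        {"VolumeId": "vol-0bbb", "Size": 50, "State": "in-use",
         "CreateTime": "2022-01-15T09:00:00.000Z"},
        {"VolumeId": "vol-0ccc", "Size": 240, "State": "available",
         "CreateTime": "2023-02-10T09:00:00.000Z"},
    ]
})
print(summarize_unattached(sample))
```

It filters on `State` itself, so it also works on an unfiltered listing.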
Egress and cross-AZ transfer
Cross-AZ traffic costs $0.01/GB each way in most AWS regions. A chatty microservice mesh spanning three AZs can quietly spend thousands moving data between replicas of the same service. Egress to the internet is worse — $0.05–$0.09/GB — and it's the one your CDN bill should be absorbing but often isn't.
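A quick way to see how "quietly" adds up is to put a number on it. This is a back-of-the-envelope sketch using the per-GB rate quoted above; the traffic volume in the example is hypothetical, and real rates vary by region.

```python
CROSS_AZ_USD_PER_GB_EACH_WAY = 0.01  # rate quoted above; check your region


def cross_az_monthly_cost(gb_per_month: float) -> float:
    """Cost of traffic crossing an AZ boundary.

    The per-GB charge applies on both the sending and the receiving side,
    so every gigabyte that crosses is effectively billed twice.
    """
    return gb_per_month * CROSS_AZ_USD_PER_GB_EACH_WAY * 2


# Hypothetical mesh moving 100 TB a month between AZs.
print(f"${cross_az_monthly_cost(100_000):,.2f} per month")
```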
Two fixes pay for themselves fast. First, VPC Gateway Endpoints for S3 and DynamoDB. They're free and route traffic off the NAT Gateway entirely:
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.eu-west-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}
That single resource is what killed our $11k NAT problem. S3 traffic stopped traversing the gateway, and processing charges dropped by 80%. Second, cache aggressively at the edge so repeated reads never hit your origin: a CDN such as CloudFront or Fastly sustaining a 90% cache-hit ratio means roughly 90% less origin egress.
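The cache-hit arithmetic is simple enough to sketch. The version below assumes the hit ratio is measured in bytes rather than requests, which is an assumption on my part (a few large uncached objects can make the byte ratio much worse than the request ratio). The traffic figure is hypothetical.

```python
def origin_egress_gb(total_gb_served: float, cache_hit_ratio: float) -> float:
    """Gigabytes the origin still has to serve after the CDN absorbs cache hits."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return total_gb_served * (1.0 - cache_hit_ratio)


# Hypothetical: 50 TB served to users per month at a 90% byte hit ratio.
remaining = origin_egress_gb(50_000, 0.90)
print(f"origin egress: {remaining:,.0f} GB of 50,000 GB served")
```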
Over-replicated and over-tiered storage
S3 Standard at scale is rarely the right tier for data nobody reads. Lifecycle policies move objects to cheaper tiers automatically. The mistake I see is engineers manually picking tiers — use S3 Intelligent-Tiering and let AWS move objects based on access patterns, or set explicit rules for predictable data like logs:
{
  "Rules": [
    {
      "ID": "logs-tier-down-then-expire",
      "Filter": { "Prefix": "logs/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER_IR" }
      ],
      "Expiration": { "Days": 365 },
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    }
  ]
}
That last AbortIncompleteMultipartUpload rule matters more than it looks: failed multipart uploads leave invisible orphaned parts you pay for indefinitely and can't see in the console object list. I've seen accounts with terabytes of phantom storage from a flaky upload job.
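To sanity-check a rule like the one above before applying it, it can help to trace where an object of a given age ends up. This sketch hard-codes the same thresholds (30, 90, and 365 days); the function name is hypothetical, and real lifecycle transitions run on S3's own schedule, so actual timing can lag by a day or so.

```python
from typing import Optional


def storage_class_for_age(age_days: int) -> Optional[str]:
    """Storage class a logs/ object should be in, or None once it has expired."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days >= 365:
        return None  # Expiration: the object is deleted
    if age_days >= 90:
        return "GLACIER_IR"
    if age_days >= 30:
        return "STANDARD_IA"
    return "STANDARD"


for age in (0, 29, 30, 89, 90, 364, 365):
    print(age, storage_class_for_age(age))
```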
Right-sizing and autoscaling
Compute is usually the biggest line item and, in the table above, the largest share of waste, so right-sizing has the biggest absolute payoff. Pull utilization from CloudWatch (or AWS Compute Optimizer, which does the analysis for you) and resize anything sitting under 40% CPU and memory at peak. Then make it dynamic. For Kubernetes, the Horizontal Pod Autoscaler keyed on real signals beats a fixed replica count every time:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ledger-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ledger-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 65
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
The stabilizationWindowSeconds prevents thrashing: scaling down too eagerly on a brief dip costs you more in cold starts than you save. Pair the HPA with Karpenter or Cluster Autoscaler so nodes follow pods, and use Spot instances for stateless, interruption-tolerant workloads (60–90% cheaper than On-Demand). For your steady-state baseline, Compute Savings Plans lock in 1- or 3-year discounts of up to ~66%. The rule I follow: Savings Plans for the floor you'll always run, Spot for the spiky middle, On-Demand only for the unpredictable top.
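It helps to know what the HPA does with that 65% target. The sketch below follows the scaling formula from the Kubernetes documentation, `ceil(currentReplicas * currentMetric / targetMetric)`, with the default 10% tolerance and the min/max bounds from the manifest above. It is a simplification: the stabilization window, pod readiness, and missing metrics are all ignored.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float = 65.0,
    min_replicas: int = 3,
    max_replicas: int = 20,
    tolerance: float = 0.1,
) -> int:
    """Replica count the HPA formula would ask for, clamped to the bounds."""
    ratio = current_utilization / target_utilization
    # Within tolerance of the target, the HPA leaves the count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(4, 130))   # CPU at double the target
print(desired_replicas(10, 13))   # mostly idle, floor applies
print(desired_replicas(15, 130))  # ceiling applies
```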
Non-prod running overnight
Dev and staging don't need to run nights and weekends. Against a 40-hour working week that's 128 idle hours out of 168, roughly 76% of the instance-hours, for environments nobody touches; even a generous 8am-to-8pm weekday window leaves 108 idle hours, about 64%. A scheduled scale-to-zero on a tag is the highest dollar-per-line-of-code optimization you'll ever write:
# Stop all running instances tagged environment=dev (run this outside business hours)
ids=$(aws ec2 describe-instances --region eu-west-1 \
  --filters "Name=tag:environment,Values=dev" \
            "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId' --output text)
# Skip the call when nothing matches: stop-instances fails on an empty ID list
if [ -n "$ids" ]; then
  aws ec2 stop-instances --region eu-west-1 --instance-ids $ids
fi
Wire it to an EventBridge schedule, stop at 8pm and start again at 8am on weekdays, and your dev compute bill drops by roughly two-thirds overnight. The only objection is "what if someone's working late"; handle it with a self-service "keep alive" tag that exempts an instance for 24 hours. Make the default cheap and the exception easy.
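The savings depend entirely on the window you pick, so it is worth computing before you promise a number. A minimal sketch follows; it counts instance-hours only, and stopped instances still pay for their attached EBS volumes, so the real bill drops by somewhat less.

```python
HOURS_PER_WEEK = 24 * 7


def idle_fraction(hours_on_per_day: float, days_on_per_week: int) -> float:
    """Fraction of the week an environment is stopped under a fixed schedule."""
    running = hours_on_per_day * days_on_per_week
    if not 0 <= running <= HOURS_PER_WEEK:
        raise ValueError("schedule must fit within one week")
    return (HOURS_PER_WEEK - running) / HOURS_PER_WEEK


print(f"8h x 5 days:  {idle_fraction(8, 5):.1%} idle")
print(f"12h x 5 days: {idle_fraction(12, 5):.1%} idle")
```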
Guardrails: Budgets and Alerts
Optimization without guardrails regresses. New services ship, someone forgets a lifecycle rule, and you're back where you started. AWS Budgets with alerts is the cheapest insurance in the cloud. Set one per team, tied to the tags you enforced, with a forecasted-spend trigger so you hear about overruns before month-end:
{
  "BudgetName": "payments-monthly",
  "BudgetLimit": { "Amount": "25000", "Unit": "USD" },
  "TimeUnit": "MONTHLY",
  "BudgetType": "COST",
  "CostFilters": { "TagKeyValue": ["user:team$payments"] },
  "NotificationsWithSubscribers": [
    {
      "Notification": {
        "NotificationType": "FORECASTED",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 90,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [
        { "SubscriptionType": "SNS", "Address": "arn:aws:sns:eu-west-1:111122223333:cost-alerts" }
      ]
    }
  ]
}
Route the SNS topic to Slack, not email: email cost alerts go to the same graveyard as everything else.
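The trigger condition in that notification is worth spelling out, because the comparison is strict and is made against the forecast, not the spend so far. A small sketch of the same check follows; the function name is hypothetical and the forecast values are made up.

```python
def forecast_breaches(
    forecast_usd: float,
    limit_usd: float = 25_000.0,
    threshold_pct: float = 90.0,
) -> bool:
    """True when forecasted spend is strictly greater than the threshold."""
    return forecast_usd > limit_usd * threshold_pct / 100.0


print(forecast_breaches(23_000))  # above 90% of the 25,000 limit
print(forecast_breaches(22_500))  # exactly at the threshold, no alert
```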
The Honest Tradeoff
Here's what most FinOps content won't tell you: engineering time is more expensive than most of the waste you'll find. A senior engineer costs roughly $100/hour fully loaded. If a week-long optimization (about 40 hours, or $4,000) saves $200/month, the payback is 20 months. You'd have been better off shipping features.
So I triage by annualized savings versus hours of effort. Anything above $1,000/month saved for under a day of work, I do immediately. Below $200/month and more than a day, I leave it unless it's a tagging or guardrail change that prevents future waste. The cheapest wins are almost always the scripted, repeatable ones — delete-the-zombies, schedule-the-dev-boxes, add-the-endpoint — not heroic re-architectures.
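The triage rule is mechanical enough to write down. This is one possible reading of it, not a formula to follow blindly: the 40-hour week, the function names, and the three labels are my own assumptions, and everything between the two thresholds is left as a judgment call.

```python
def payback_months(
    effort_hours: float, hourly_rate_usd: float, monthly_savings_usd: float
) -> float:
    """Months of savings needed to repay the engineering time spent."""
    return effort_hours * hourly_rate_usd / monthly_savings_usd


def triage(
    monthly_savings_usd: float,
    effort_days: float,
    prevents_future_waste: bool = False,
) -> str:
    """Classify a candidate optimization as 'do', 'skip', or 'judge'."""
    if monthly_savings_usd > 1000 and effort_days < 1:
        return "do"
    if monthly_savings_usd < 200 and effort_days > 1:
        # Tagging and guardrail changes are the exception to skipping.
        return "do" if prevents_future_waste else "skip"
    return "judge"


# A week of work (assumed 40 hours) at $100/hour for $200/month saved.
print(payback_months(40, 100, 200), "months to break even")
print(triage(1500, 0.5))
print(triage(150, 3))
print(triage(150, 3, prevents_future_waste=True))
```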
The other honest tradeoff is reliability. Spot instances save 70% until a workload that can't tolerate interruption gets evicted mid-transaction. Aggressive autoscaling saves money until a cold start adds latency your SLO can't absorb. Cost optimization that degrades the product isn't optimization, it's just a different bill paid by your users.
A Checklist You Can Run This Quarter
- Enforce tags (team, service, environment, cost-center) via Terraform default tags and an SCP. Activate them as cost allocation tags.
- Run the zombie sweep weekly in CI. Delete unattached volumes, orphan snapshots, idle EIPs.
- Add S3/DynamoDB Gateway Endpoints to kill NAT processing charges — free, immediate.
- Set lifecycle policies on every log and backup bucket, including the abort-multipart rule.
- Right-size anything under 40% peak utilization, then add an HPA + Karpenter.
- Buy Savings Plans for your steady-state floor; move stateless work to Spot.
- Schedule non-prod to stop overnight and weekends.
- Set per-team Budgets with forecasted alerts routed to Slack.
Do these in order and the first four pay for the rest within a billing cycle. The bill is a feedback loop — the teams that win at cost aren't the ones who optimize hardest once, they're the ones who made waste visible and kept it that way.
Further reading
- AWS Cost Explorer and AWS Budgets documentation (docs.aws.amazon.com)
- The FinOps Foundation Framework (finops.org)
- Kubernetes Horizontal Pod Autoscaler documentation (kubernetes.io)
- Karpenter project site (karpenter.sh)