Cloud Cost Optimization
Reduce cloud infrastructure costs with right-sizing, reserved instances, spot instances, and automated resource scheduling across AWS, GCP, and Azure.
Overview
Cloud costs can spiral unexpectedly: unused resources, oversized instances, and forgotten development environments silently drain budgets. Cost optimization isn’t just about cutting spending; it’s about aligning infrastructure capacity with actual demand. This resource covers right-sizing, purchasing strategies (reserved vs. spot), automated scheduling, and FinOps practices that reduce waste without impacting reliability.
When to Use
Use this resource when:
- Monthly cloud bills are growing faster than user traffic
- Development and staging environments run 24/7 despite only being used during business hours
- You’re paying for overprovisioned instances that use <20% CPU
- You need to justify infrastructure costs to finance or leadership
Solution
AWS Cost Explorer Analysis (AWS CLI)
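# Find services that cost more than $100 in any monthly bucket of the last 30 days
# Note: `date -d` is GNU date syntax; on macOS/BSD use `date -v-30d +%Y-%m-%d` instead
# Amounts are returned as strings, so convert with to_number() before comparing
aws ce get-cost-and-usage \
  --time-period Start=$(date -d '30 days ago' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE \
  --query 'ResultsByTime[].Groups[?to_number(Metrics.BlendedCost.Amount) > `100`][].Keys[0]'

# Find unattached EBS volumes (status "available" means not attached to any instance)
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].[VolumeId,Size,CreateTime]'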
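The same filter can be applied client-side when the raw JSON response is saved to a file or fetched through an SDK. The following is a minimal sketch, assuming the response shape that `get-cost-and-usage` returns when grouped by service; the function name `services_over_threshold` and the sample figures are hypothetical.

```python
from collections import defaultdict


def services_over_threshold(response, threshold_usd=100.0, metric="BlendedCost"):
    """Sum cost per service across all time buckets and keep those above the threshold."""
    totals = defaultdict(float)
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]
            # Cost Explorer returns amounts as strings, so convert before adding.
            totals[service] += float(group["Metrics"][metric]["Amount"])
    return {name: round(total, 2) for name, total in totals.items() if total > threshold_usd}


# Hypothetical sample spanning two monthly buckets (not real billing data).
sample = {
    "ResultsByTime": [
        {"Groups": [
            {"Keys": ["EC2"], "Metrics": {"BlendedCost": {"Amount": "80.5", "Unit": "USD"}}},
            {"Keys": ["S3"], "Metrics": {"BlendedCost": {"Amount": "12.0", "Unit": "USD"}}},
        ]},
        {"Groups": [
            {"Keys": ["EC2"], "Metrics": {"BlendedCost": {"Amount": "40.25", "Unit": "USD"}}},
            {"Keys": ["RDS"], "Metrics": {"BlendedCost": {"Amount": "150.0", "Unit": "USD"}}},
        ]},
    ]
}

print(services_over_threshold(sample))
```

Unlike the CLI query, this version sums each service across both buckets before comparing, so a service that stays under the threshold in each partial month can still be reported.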
Terraform Scheduled Scaling
resource "aws_autoscaling_schedule" "dev_office_hours" {
  scheduled_action_name  = "dev-office-hours"
  min_size               = 1
  max_size               = 3
  desired_capacity       = 2
  recurrence             = "0 9 * * MON-FRI" # 9 AM UTC
  autoscaling_group_name = aws_autoscaling_group.dev.name
}

resource "aws_autoscaling_schedule" "dev_night_shutdown" {
  scheduled_action_name  = "dev-night-shutdown"
  min_size               = 0
  max_size               = 0
  desired_capacity       = 0
  recurrence             = "0 18 * * MON-FRI" # 6 PM UTC
  autoscaling_group_name = aws_autoscaling_group.dev.name
}
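To get a rough sense of what this schedule saves, the weekly running hours can be compared against an always-on week of 168 hours. The sketch below is a back-of-the-envelope calculation only; the function names are hypothetical, and it ignores any charges that continue while the group is at zero.

```python
HOURS_PER_WEEK = 7 * 24  # 168


def weekly_uptime_hours(start_hour, stop_hour, days_per_week):
    """Hours per week the environment runs under a daily start/stop schedule."""
    if not 0 <= start_hour < stop_hour <= 24:
        raise ValueError("expected 0 <= start_hour < stop_hour <= 24")
    if not 0 <= days_per_week <= 7:
        raise ValueError("expected 0 <= days_per_week <= 7")
    return (stop_hour - start_hour) * days_per_week


def schedule_savings_fraction(start_hour, stop_hour, days_per_week):
    """Fraction of compute hours avoided compared with running 24/7."""
    return 1 - weekly_uptime_hours(start_hour, stop_hour, days_per_week) / HOURS_PER_WEEK


# The schedule above: up at 09:00, down at 18:00, Monday to Friday.
hours = weekly_uptime_hours(9, 18, 5)
savings = schedule_savings_fraction(9, 18, 5)
print(f"{hours} h/week running, {savings:.1%} of compute hours avoided")
```

For the 09:00 to 18:00 weekday schedule this gives 45 running hours per week, or about 73% fewer compute hours than an always-on environment.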
Spot Instance with Fallback (Kubernetes)
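apiVersion: apps/v1
kind: Deployment
metadata:
  name: spot-workload
spec:
  replicas: 5
  selector:                # required for apps/v1; must match the template labels
    matchLabels:
      app: spot-workload   # label name chosen for this example
  template:
    metadata:
      labels:
        app: spot-workload
    spec:
      affinity:
        nodeAffinity:
          # Soft preference: use spot nodes when available, otherwise fall back to other nodes
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: node-type
                    operator: In
                    values: [spot]
      tolerations:
        # Lets the pods land on nodes tainted spot=true:NoSchedule
        - key: spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: app
          image: myapp:latest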
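The fallback behaviour in this manifest comes from combining a soft node preference with a toleration. The following toy model illustrates that idea only; it is not the kube-scheduler algorithm, and the `Node`, `PodSpec`, and `pick_node` names are hypothetical. It assumes tolerations match taints exactly (the `Equal` operator) and that only `NoSchedule` taints block placement.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    labels: dict = field(default_factory=dict)
    taints: set = field(default_factory=set)  # set of (key, value, effect) tuples


@dataclass
class PodSpec:
    preferred_label: tuple  # (key, value) from the preferred node affinity term
    weight: int
    tolerations: set = field(default_factory=set)  # set of (key, value, effect) tuples


def pick_node(pod, nodes):
    """Filter out nodes with untolerated NoSchedule taints, then prefer matching labels."""
    feasible = [
        node for node in nodes
        if all(t in pod.tolerations for t in node.taints if t[2] == "NoSchedule")
    ]
    if not feasible:
        return None

    def score(node):
        key, value = pod.preferred_label
        return pod.weight if node.labels.get(key) == value else 0

    # max() keeps the first node on ties, which is enough for this illustration.
    return max(feasible, key=score).name


spot_taint = ("spot", "true", "NoSchedule")
pod = PodSpec(preferred_label=("node-type", "spot"), weight=100, tolerations={spot_taint})
on_demand = Node("on-demand-1", labels={"node-type": "on-demand"})
spot = Node("spot-1", labels={"node-type": "spot"}, taints={spot_taint})

print(pick_node(pod, [on_demand, spot]))  # spot node is preferred when present
print(pick_node(pod, [on_demand]))        # falls back when no spot node exists
```

Because the preference is soft, the pod is still placed when no spot node exists. Without the toleration, the tainted spot node would be filtered out before scoring and the preference would never apply.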
Explanation
Four pillars of cloud cost optimization:
- Right-size: Match instance type to actual usage; downsize overprovisioned resources
- Reserved capacity: Commit to 1- or 3-year reserved instances for predictable workloads (typically 40-60% savings)
- Spot/preemptible: Use interruptible instances for fault-tolerant batch jobs (60-90% savings)
- Auto-scheduling: Turn off dev/staging environments nights and weekends
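As an illustration of the right-sizing pillar, the sketch below flags instances whose CPU usage stays low. It assumes utilization samples (in percent) have already been exported from a monitoring system; the 20% average threshold echoes the figure used earlier in this resource, while the 50% peak threshold, the function name, and the sample data are assumptions made for the example.

```python
from statistics import mean


def rightsizing_candidates(cpu_samples_by_instance, avg_threshold=20.0, peak_threshold=50.0):
    """Return instance IDs whose average and peak CPU both stay below the thresholds."""
    candidates = []
    for instance_id, samples in cpu_samples_by_instance.items():
        if not samples:
            continue  # no data: do not guess
        if mean(samples) < avg_threshold and max(samples) < peak_threshold:
            candidates.append(instance_id)
    return sorted(candidates)


# Hypothetical CPU utilization samples in percent.
usage = {
    "i-idle": [3.0, 5.0, 4.0, 6.0],
    "i-bursty": [5.0, 4.0, 92.0, 6.0],   # low average, but peaks need headroom
    "i-busy": [61.0, 70.0, 66.0, 58.0],
    "i-nodata": [],
}

print(rightsizing_candidates(usage))
```

Checking the peak as well as the average avoids downsizing an instance that is idle most of the time but needs its full capacity during bursts.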
FinOps lifecycle:
- Inform: Visibility into cloud spend per team, project, and environment
- Optimize: Technical and rate optimizations (RI, spot, rightsizing)
- Operate: Continuous governance, budgets, and automated policies
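The Inform phase depends on being able to attribute each line item to an owner. A minimal sketch of that grouping follows, assuming billing line items have been exported as dictionaries with a `cost_usd` value and a `tags` mapping; both field names and the sample data are hypothetical.

```python
from collections import defaultdict


def spend_by_tag(line_items, tag_key, untagged_label="(untagged)"):
    """Total cost per value of one cost allocation tag, with untagged spend kept visible."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, untagged_label)
        totals[owner] += item["cost_usd"]
    return dict(totals)


# Hypothetical billing export.
items = [
    {"resource": "i-001", "cost_usd": 120.0, "tags": {"team": "payments", "environment": "prod"}},
    {"resource": "i-002", "cost_usd": 30.0, "tags": {"team": "payments", "environment": "dev"}},
    {"resource": "vol-9", "cost_usd": 45.5, "tags": {"team": "search"}},
    {"resource": "nat-1", "cost_usd": 64.0},
]

print(spend_by_tag(items, "team"))
```

Reporting untagged spend as its own bucket, rather than dropping it, makes tagging gaps visible in the same report.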
Variants
| Strategy | Savings | Effort | Risk |
|---|---|---|---|
| Reserved instances | 40-60% | Low | Commitment lock-in |
| Spot instances | 60-90% | Medium | Interruption |
| Scheduled shutdown | 50-70% | Low | Manual oversight |
| Storage tiering | 30-50% | Low | Access latency |
| Serverless | Variable | Medium | Cold start |
Best Practices
- Tag everything: Cost allocation tags (team, project, environment) enable chargeback
- Set budgets and alerts: Alert at 80% of monthly budget; investigate immediately
- Review unused resources weekly: Dangling IPs, orphaned volumes, and stale snapshots add up
- Use Savings Plans over RIs: More flexible; Compute Savings Plans apply across instance families and regions
- Implement auto-scaling: Scale to zero for dev environments; scale up for production peaks. See autoscaling policies.
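The 80% alert rule above can be expressed as a small check, optionally paired with a straight-line projection of month-end spend. This is a sketch under simple assumptions (spend accrues evenly through the month); the function names and status labels are hypothetical and not part of any cloud provider's budgeting API.

```python
import calendar
from datetime import date


def budget_status(spend_to_date, monthly_budget, alert_ratio=0.8):
    """Classify current spend against the monthly budget."""
    if monthly_budget <= 0:
        raise ValueError("monthly_budget must be positive")
    ratio = spend_to_date / monthly_budget
    if ratio >= 1.0:
        return "over_budget"
    if ratio >= alert_ratio:
        return "alert"
    return "ok"


def projected_month_end_spend(spend_to_date, today):
    """Straight-line projection: assumes the daily spend rate so far continues."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


print(budget_status(8000.0, 10000.0))
print(projected_month_end_spend(4500.0, date(2024, 6, 15)))
```

The projection is useful because a bill at 45% of budget halfway through the month looks fine on the threshold check alone, yet is on track to land close to the limit.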
Common Mistakes
- No cost ownership: When engineering doesn’t see the bill, waste accumulates
- Overcommitting to reserved instances: Buying 3-year RIs for workloads that may migrate to serverless
- Ignoring data transfer costs: NAT Gateway, cross-AZ traffic, and egress can exceed compute costs
- Leaving preview resources running: POCs and experiments that become permanent line items
- One-size-fits-all pricing: Production needs stability; dev can tolerate spot interruptions
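The overcommitment mistake can be checked with simple arithmetic. If an instance runs a fraction u of the time, on-demand costs are proportional to u, while a commitment at discount d costs (1 - d) regardless of usage, so the commitment only pays off when u is greater than 1 - d. The sketch below encodes that break-even under the assumption of a flat hourly rate and no upfront-payment effects; the function names are hypothetical.

```python
def breakeven_utilization(discount):
    """Minimum fraction of hours a workload must run for a commitment to beat on-demand."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def commitment_saves_money(expected_utilization, discount):
    """True when paying (1 - discount) for every hour is cheaper than on-demand for used hours."""
    return expected_utilization > breakeven_utilization(discount)


# With a 40% discount, the workload must run more than 60% of the committed hours.
print(f"{breakeven_utilization(0.40):.0%}")
print(commitment_saves_money(0.50, 0.40))  # runs half the time: on-demand is cheaper
print(commitment_saves_money(0.90, 0.40))  # runs most of the time: commitment is cheaper
```

A workload that may be retired or migrated partway through the term effectively has a lower utilization over the commitment period, which is why long commitments on uncertain workloads tend to lose money.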
Frequently Asked Questions
Q: Should I use spot instances for production? A: Only for stateless, fault-tolerant workloads with proper fallback to on-demand. Never for databases or singleton services.
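Q: How do I prevent developers from creating expensive resources? A: On AWS, SCPs (Service Control Policies) can restrict instance types per organizational unit (OU). Policy-as-code checks on Terraform plans (for example, Sentinel or Open Policy Agent) can enforce approved instance families before anything is provisioned.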
Q: What’s the difference between FinOps and DevOps? A: DevOps optimizes for speed and reliability. FinOps adds cost as a first-class metric, with cross-functional accountability.
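To make the SCP answer above concrete, the sketch below builds a policy document that denies launching EC2 instances outside an approved list of types. It follows the general JSON shape of AWS IAM policy documents; the `Sid`, the helper name, and the allowed types are example values, and the policy should be reviewed and tested in a sandbox OU before being attached anywhere.

```python
import json


def build_instance_type_scp(allowed_types):
    """Build an SCP document denying ec2:RunInstances for instance types not in the list."""
    if not allowed_types:
        raise ValueError("allowed_types must not be empty")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnapprovedInstanceTypes",  # example identifier
                "Effect": "Deny",
                "Action": "ec2:RunInstances",
                "Resource": "arn:aws:ec2:*:*:instance/*",
                "Condition": {
                    # Deny applies when the requested type matches none of the listed values.
                    "StringNotEquals": {"ec2:InstanceType": sorted(allowed_types)}
                },
            }
        ],
    }


policy = build_instance_type_scp(["t3.small", "t3.micro"])
print(json.dumps(policy, indent=2))
```

Generating the document from a list keeps the approved instance types in one place, so the same list can feed both the SCP and any Terraform plan checks.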
Related Resources
Capacity Planning Template
A reusable template for planning system capacity, estimating growth, and preventing performance bottlenecks before they happen.
Recipe: Deploy Applications to Kubernetes with Helm Charts
Package, version, and deploy Kubernetes applications using Helm charts with value overrides, template functions, and release management for reproducible infrastructure.
Recipe: Provision an AWS VPC with Terraform
How to use Terraform to provision a production-ready AWS VPC with public and private subnets, NAT gateways, and security groups.
Recipe: Local Microservices Development with Docker Compose
Orchestrate multi-service local environments with Docker Compose including databases, caches, message brokers, and reverse proxies with hot reload and shared networks.
Recipe: Canary Deployments with Istio Service Mesh
How to use Istio traffic splitting to perform safe canary deployments by gradually shifting user traffic between application versions.