Auto-Scaling Policy Template
A template for documenting scale-up and scale-down rules for cloud infrastructure.
Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.
Overview
Auto-scaling is the bridge between cost efficiency and availability. Scale too late and your service crashes under load; scale too early and you burn money on idle capacity. This template documents the exact rules, thresholds, and procedures your infrastructure team uses to scale workloads up and down automatically.
When to Use
Use this resource when:
- Defining scaling rules for a new service deployed to the cloud
- Auditing why an auto-scaling event caused an outage or excessive cost
- Migrating from static instance sizes to dynamic scaling
Solution
# Auto-Scaling Policy: `<Service Name>`
## 1. Service Metadata
| Field | Value |
|-------|-------|
| Service | `name` |
| Platform | `AWS / GCP / Azure / Kubernetes` |
| Owner Team | `@team-name` |
| Last Reviewed | `YYYY-MM-DD` |
## 2. Scale-Up Policy
### 2.1. Triggers
| Metric | Threshold | Duration | Scale Action | Cooldown |
|--------|-----------|----------|--------------|----------|
| CPU utilization | > 60% | 2 minutes | Add 1 instance | 3 minutes |
| Memory utilization | > 70% | 2 minutes | Add 1 instance | 3 minutes |
| Request rate | > 5,000 RPS | 1 minute | Add 2 instances | 5 minutes |
| Queue depth | > 100 messages | 3 minutes | Add 1 instance | 3 minutes |
| Latency p95 | > 500 ms | 2 minutes | Add 2 instances | 5 minutes |
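
The Threshold and Duration columns together mean "breached continuously for this long", not "breached once". A minimal sketch of that check, assuming metrics are sampled once per minute; the `Trigger` class, the `breached` function, and the metric name `cpu_percent` are hypothetical names for illustration, not any cloud provider's API:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    metric: str         # hypothetical metric name, e.g. "cpu_percent"
    threshold: float    # breach when a sample is strictly above this value
    duration_s: int     # how long the breach must be sustained
    add_instances: int  # scale action from the table


def breached(trigger, samples, interval_s=60):
    """Return True if every sample in the trailing window exceeds the threshold.

    `samples` holds metric readings, oldest first, taken every `interval_s`
    seconds (an assumed sampling interval).
    """
    needed = max(1, trigger.duration_s // interval_s)
    if len(samples) < needed:
        return False  # not enough history to call the breach sustained
    return all(value > trigger.threshold for value in samples[-needed:])


# CPU row from the table: > 60% for 2 minutes, add 1 instance.
cpu = Trigger("cpu_percent", threshold=60.0, duration_s=120, add_instances=1)
print(breached(cpu, [55.0, 62.0, 71.0]))  # True: last two samples are above 60
print(breached(cpu, [71.0, 62.0, 58.0]))  # False: most recent sample dipped below
```

Requiring every sample in the window to breach is the strictest reading of "Duration"; some autoscalers instead allow "M out of N" datapoints, so record which one your platform uses.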
### 2.2. Limits
| Limit | Value | Rationale |
|-------|-------|-----------|
| Max instances | 20 | Cost ceiling, database connection limit |
| Max scale-up per event | 50% of current | Prevent thundering herd on cold start |
| Scale-up cooldown | 3 to 5 minutes (per trigger, see 2.1) | Allow metric stabilization |
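
The limits interact with the per-trigger scale actions: a trigger may ask for 2 instances, but the per-event cap and the instance ceiling decide what is actually granted. A rough sketch of that clamping, assuming the 50% cap is rounded down and never drops below 1 instance (the template does not specify a rounding rule); `clamp_scale_up` is a hypothetical helper:

```python
import math


def clamp_scale_up(current, requested_add, max_instances=20, max_step_fraction=0.5):
    """Return the new instance count after applying the section 2.2 limits."""
    # Assumption: round the per-event cap down, but always allow at least 1.
    step_cap = max(1, math.floor(current * max_step_fraction))
    allowed = min(requested_add, step_cap)
    return min(current + allowed, max_instances)


print(clamp_scale_up(3, 2))   # 4: the 50% cap on 3 instances allows only 1
print(clamp_scale_up(10, 2))  # 12: cap is 5, so the full request is granted
print(clamp_scale_up(19, 2))  # 20: the max-instances ceiling wins
```

Note that at the minimum fleet size of 3, an "Add 2 instances" action is cut to 1 under this rounding rule, which may be worth calling out when choosing the values above.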
## 3. Scale-Down Policy
### 3.1. Triggers
| Metric | Threshold | Duration | Scale Action | Cooldown |
|--------|-----------|----------|--------------|----------|
| CPU utilization | < 30% | 10 minutes | Remove 1 instance | 5 minutes |
| Memory utilization | < 30% | 10 minutes | Remove 1 instance | 5 minutes |
| Request rate | < 1,000 RPS | 10 minutes | Remove 1 instance | 5 minutes |
### 3.2. Limits
| Limit | Value | Rationale |
|-------|-------|-----------|
| Min instances | 3 | Redundancy, rolling deployment buffer |
| Max scale-down per event | 25% of current | Avoid over-correction |
| Scale-down cooldown | 5 minutes | Allow metric stabilization |
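
The scale-down limits can be read as a floor plus a per-event cap. A small sketch under the same kind of assumption as any percentage limit (the 25% cap is rounded down but never below 1 instance, a rounding rule the template leaves open); `clamp_scale_down` is a hypothetical helper:

```python
import math


def clamp_scale_down(current, requested_remove, min_instances=3, max_step_fraction=0.25):
    """Return the new instance count after applying the section 3.2 limits."""
    # Assumption: round the per-event cap down, but always allow at least 1.
    step_cap = max(1, math.floor(current * max_step_fraction))
    allowed = min(requested_remove, step_cap)
    return max(current - allowed, min_instances)


print(clamp_scale_down(20, 10))  # 15: at most 25% of 20 leaves in one event
print(clamp_scale_down(4, 1))    # 3
print(clamp_scale_down(3, 1))    # 3: the min-instances floor wins
```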
## 4. Instance Requirements
### 4.1. Health Checks
- [ ] Load balancer health check passes before instance receives traffic
- [ ] Instance must serve traffic for a minimum of 5 minutes before it becomes eligible for scale-down
- [ ] Connection draining allows in-flight requests to complete (30 seconds)
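
The 5-minute eligibility rule can be expressed as a filter over the fleet before the autoscaler picks an instance to drain. This is an illustrative sketch only: the instance IDs, the `in_service_since` mapping, and the oldest-first ordering are assumptions, not part of the checklist.

```python
from datetime import datetime, timedelta, timezone

MIN_SERVING_TIME = timedelta(minutes=5)  # from the checklist above


def removable_instances(in_service_since, now):
    """Return IDs of instances that have served at least 5 minutes, oldest first."""
    eligible = [
        (started, instance_id)
        for instance_id, started in in_service_since.items()
        if now - started >= MIN_SERVING_TIME
    ]
    return [instance_id for _, instance_id in sorted(eligible)]


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
fleet = {  # hypothetical instance IDs and start times
    "i-aaa": now - timedelta(minutes=42),
    "i-bbb": now - timedelta(minutes=5),
    "i-ccc": now - timedelta(minutes=2),
}
print(removable_instances(fleet, now))  # ['i-aaa', 'i-bbb']
```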
### 4.2. Warm-Up
- [ ] New instances complete initialization (app start, cache warm-up) before joining pool
- [ ] Warm-up time documented: `60 seconds`
- [ ] Startup probe / readiness probe configured in orchestrator
## 5. Cost Controls
| Control | Value | Notes |
|---------|-------|-------|
| Max hourly spend | $500 | Alert if exceeded |
| Instance type | `c5.large` | Compute-optimized for API workload |
| Spot / Preemptible | 50% of instances | Use for non-critical batch processing only |
| Reserved capacity | Baseline of 3 instances | Commitment discount for minimum |
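
A quick way to sanity-check the spend limit is to price the fleet at its maximum size and compare that to the alert threshold. The sketch below assumes a flat per-instance hourly price and ignores reserved-capacity discounts; the prices are placeholders, not real quotes, and both function names are hypothetical.

```python
def estimated_hourly_spend(instances, on_demand_price, spot_price, spot_fraction=0.5):
    """Estimate hourly cost for a fleet split between on-demand and spot."""
    spot_count = int(instances * spot_fraction)
    on_demand_count = instances - spot_count
    return on_demand_count * on_demand_price + spot_count * spot_price


def over_budget(spend, max_hourly_spend=500.0):
    """True when the estimate exceeds the 'Max hourly spend' control."""
    return spend > max_hourly_spend


# Placeholder prices in USD per instance-hour; substitute your provider's rates.
ON_DEMAND_PRICE = 0.10
SPOT_PRICE = 0.04

worst_case = estimated_hourly_spend(20, ON_DEMAND_PRICE, SPOT_PRICE)
print(f"worst case: ${worst_case:.2f}/hour, over budget: {over_budget(worst_case)}")
```

If the worst case at max instances is far below the alert threshold, the alert will never fire for this service alone, which suggests either lowering the threshold or scoping it to a wider group of services.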
## 6. Incident Response
| Scenario | Action | Owner |
|----------|--------|-------|
| Scale-up fails (quota exceeded) | Page on-call, escalate to cloud admin | SRE |
| Scale-down causes errors | Pause auto-scaling, investigate | Platform |
| Costs spike > 2x baseline | Review policy, check for runaway jobs | Finance + SRE |
| Latency rises despite scaling | Alert: likely database bottleneck, not compute | DBA + App Team |
Explanation
The template separates scale-up (fast, aggressive) from scale-down (slow, conservative). Scale-up triggers use shorter durations because you need capacity before failures happen. Scale-down uses longer durations to avoid thrashing instances in and out during normal traffic jitter. The cooldown prevents the autoscaler from reacting to metric noise caused by the scaling event itself. Min instances exist for redundancy: even at zero traffic, you need enough instances to survive a rolling deployment without downtime.
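
The effect of the cooldown is easiest to see in a toy simulation. The sketch below assumes one CPU sample per minute and a single CPU trigger; `simulate` is a hypothetical function for illustration and does not model any specific autoscaler.

```python
def simulate(cpu_series, start=3, threshold=60.0, cooldown_ticks=3, max_instances=20):
    """Return the instance count after each one-minute tick."""
    instances = start
    last_scale_tick = None
    history = []
    for tick, cpu in enumerate(cpu_series):
        in_cooldown = (
            last_scale_tick is not None and tick - last_scale_tick < cooldown_ticks
        )
        if cpu > threshold and not in_cooldown and instances < max_instances:
            instances += 1
            last_scale_tick = tick
        history.append(instances)
    return history


# CPU stays high for five minutes; the 3-minute cooldown spaces out the adds.
print(simulate([70, 72, 71, 69, 68, 50]))                    # [4, 4, 4, 5, 5, 5]
print(simulate([70, 72, 71, 69, 68, 50], cooldown_ticks=0))  # [4, 5, 6, 7, 8, 8]
```

Without the cooldown the fleet grows on every sample, even though the earlier additions have not yet had time to pull the average down.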
Variants
| Context | Approach | Notes |
|---|---|---|
| Kubernetes HPA | Metrics via custom metrics API | Scale on custom metrics (queue length, request latency) |
| AWS EC2 Auto Scaling | CloudWatch alarms | Use predictive scaling for known patterns |
| Serverless (Lambda) | Concurrency limits | No instance-level policy to write; the platform scales per request, so manage the concurrency ceiling and reserved concurrency |
| GPU workloads | Scale on GPU utilization | Longer warm-up, higher cost; avoid spot instances |
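
For the Kubernetes HPA row, the Kubernetes documentation describes target tracking rather than fixed steps: the desired replica count is the current count scaled by the ratio of the observed metric to its target, rounded up, with no change when the ratio is within a tolerance (10% by default). A simplified sketch of that calculation, which ignores pod readiness, missing metrics, and stabilization windows; `hpa_desired_replicas` is a hypothetical name:

```python
import math


def hpa_desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Approximate the HPA rule: ceil(current_replicas * current / target)."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target; do nothing
    return math.ceil(current_replicas * ratio)


print(hpa_desired_replicas(4, 90, 60))   # 6: 50% over target
print(hpa_desired_replicas(4, 63, 60))   # 4: within the 10% tolerance
print(hpa_desired_replicas(10, 20, 60))  # 4: well under target
```

When several metrics are configured, the HPA computes a desired count for each and uses the largest one.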
Best Practices
- Always set max instance limits to prevent runaway scaling from infinite loops or DDoS
- Use predictive scaling for predictable traffic patterns (nightly batch, business hours)
- Test scale-up and scale-down events in staging before production
- Monitor scaling event frequency; frequent back-and-forth events usually mean the scale-up and scale-down thresholds are too close together or the cooldowns are too short
- Document why each threshold was chosen so future teams can tune intelligently
Common Mistakes
- Setting CPU threshold at 80% or higher, leaving no headroom for spikes
- Using the same policy for all services regardless of their workload patterns
- Forgetting connection draining, causing dropped requests during scale-down
- Scaling only on CPU and ignoring memory, network, or custom metrics
- Allowing scale-down to zero for stateful services that need persistent connections
Frequently Asked Questions
Should I scale on CPU or requests per second?
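CPU works for compute-bound workloads (image processing, ML inference). RPS works for I/O-bound workloads (APIs, proxies). Use custom metrics (queue depth, latency) when neither CPU nor RPS correlates with user experience. The best policies use multiple metrics with OR logic for scale-up, so that a breach on any one metric adds capacity; for scale-down, it is safer to wait until every metric is below its threshold.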
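
A minimal sketch of combining several metrics: scale up when any metric breaches (OR logic), and, as an assumption that goes slightly beyond the text, scale down only when every scale-down metric is below its threshold. The metric names and the `scaling_decision` function are hypothetical; the thresholds mirror the template tables.

```python
UP_THRESHOLDS = {"cpu_percent": 60, "rps": 5000, "p95_ms": 500}
DOWN_THRESHOLDS = {"cpu_percent": 30, "rps": 1000}


def scaling_decision(readings, up=UP_THRESHOLDS, down=DOWN_THRESHOLDS):
    """Return 'scale_up', 'scale_down', or 'hold' for one set of readings."""
    if any(readings[metric] > limit for metric, limit in up.items()):
        return "scale_up"  # OR logic: one breached metric is enough
    if all(readings[metric] < limit for metric, limit in down.items()):
        return "scale_down"  # assumed AND logic: every metric must be quiet
    return "hold"


print(scaling_decision({"cpu_percent": 35, "rps": 800, "p95_ms": 650}))   # scale_up
print(scaling_decision({"cpu_percent": 20, "rps": 800, "p95_ms": 120}))   # scale_down
print(scaling_decision({"cpu_percent": 20, "rps": 2500, "p95_ms": 120}))  # hold
```

The first case shows why OR logic matters: CPU and request rate look healthy, but latency alone signals that users are already affected.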
What is predictive scaling and when should I use it?
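Predictive scaling (AWS, GCP) uses historical traffic to pre-warm instances before the spike arrives. Use it for recurring patterns such as daily peaks or weekly batch jobs. For one-off planned events such as marketing campaigns, scheduled scaling is usually a better fit because there is no history to learn from. Do not rely on predictive scaling for unpredictable viral traffic.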
How do I prevent cost explosions from auto-scaling?
Set a hard max instance count. Use budget alerts. Review instance types quarterly (a newer generation may be cheaper and faster). Use reserved instances for baseline capacity and auto-scaling for overflow. Tag instances by service so finance can attribute costs accurately.