Retry with Exponential Backoff
Implement resilient retry strategies with exponential backoff, jitter, and circuit breaker integration for transient failure recovery.
Note: This guide follows English-language naming conventions and terminology standards common in international development teams. Examples use English identifiers and comments to maximize compatibility across codebases and tooling.
Overview
Retry with exponential backoff is the foundational pattern for handling transient failures in distributed systems. Instead of immediately failing when a network hiccup or temporary overload occurs, the client waits progressively longer between attempts. Adding jitter prevents synchronized retries from creating a thundering herd that overwhelms the recovering service.
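To make the thundering-herd point concrete, here is a minimal sketch that compares the retry schedules of three clients with and without jitter. The function names, the 1-second base delay, and the 60-second cap are illustrative assumptions, not part of any library.

```python
import random

def backoff_delay(attempt, base_delay=1.0, max_delay=60.0):
    """Un-jittered exponential delay for a 1-based attempt number."""
    return min(base_delay * 2 ** (attempt - 1), max_delay)

def retry_times(attempts, jitter, rng):
    """Return the cumulative times (seconds) at which one client retries."""
    elapsed, times = 0.0, []
    for attempt in range(1, attempts + 1):
        delay = backoff_delay(attempt)
        if jitter:
            delay = rng.uniform(0, delay)  # full jitter: anywhere in [0, delay]
        elapsed += delay
        times.append(round(elapsed, 3))
    return times

rng = random.Random(42)  # seeded only so the sketch is repeatable
no_jitter = [retry_times(4, jitter=False, rng=rng) for _ in range(3)]
with_jitter = [retry_times(4, jitter=True, rng=rng) for _ in range(3)]

print(no_jitter)    # every client retries at the same instants: 1s, 3s, 7s, 15s
print(with_jitter)  # each client ends up with its own schedule
```

Without jitter, all clients that failed together also retry together. With jitter, the same load is spread across each backoff window.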
When to Use
Use this resource when:
- Calling external APIs or services over unreliable networks
- Database connections occasionally time out under load
- You need to distinguish transient errors (retryable) from permanent failures
- Integrating with cloud services that throttle or have regional outages
Solution
Exponential Backoff with Jitter (Python)
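import random
import time
from functools import wraps

import requests


def retry(max_attempts=5, base_delay=1, max_delay=60, exceptions=(Exception,)):
    """Retry the wrapped function on the given exceptions, sleeping between attempts."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    # Exponential backoff with full jitter: sleep a random time
                    # in [0, min(base_delay * 2^(attempt - 1), max_delay)]
                    delay = min(base_delay * (2 ** (attempt - 1)), max_delay)
                    time.sleep(random.uniform(0, delay))
        return wrapper
    return decorator


# requests.ConnectionError does not subclass the built-in ConnectionError,
# so name the requests exception types explicitly.
@retry(max_attempts=5, base_delay=1,
       exceptions=(requests.ConnectionError, requests.Timeout))
def fetch_data(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()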
Resilience4j Circuit Breaker + Retry (Java)
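import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import io.github.resilience4j.retry.annotation.Retry;
import org.springframework.stereotype.Service;

@Service
public class PaymentService {

    // PaymentClient is assumed to be an application-defined bean.
    private final PaymentClient paymentClient;

    public PaymentService(PaymentClient paymentClient) {
        this.paymentClient = paymentClient;
    }

    @Retry(name = "paymentRetry", fallbackMethod = "fallback")
    @CircuitBreaker(name = "paymentCircuit")
    public PaymentResult charge(PaymentRequest request) {
        return paymentClient.charge(request);
    }

    // Called once retries are exhausted; takes the same arguments plus the exception.
    private PaymentResult fallback(PaymentRequest request, Exception ex) {
        return PaymentResult.declined("Service temporarily unavailable");
    }
}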
# application.yml
resilience4j:
  retry:
    configs:
      default:
        maxAttempts: 5
        waitDuration: 1s
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        retryExceptions:
          - java.net.ConnectException
          - java.net.SocketTimeoutException
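For readers who want to see what the circuit breaker half of this pairing does, the following is a minimal Python sketch of the idea. It is not how Resilience4j implements it. The class name, the thresholds, and the injected `clock` parameter are all assumptions made for illustration.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without attempting it."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

When combining this with a retry loop, one reasonable design is to leave `CircuitOpenError` out of the retryable exceptions, so an open circuit ends the retry sequence immediately instead of consuming the remaining attempts.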
Polly Retry Policy (C#)
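using System.Net;
using Polly;

var retryPolicy = Policy
    .Handle<HttpRequestException>(ex =>
        ex.StatusCode == HttpStatusCode.ServiceUnavailable ||
        ex.StatusCode == HttpStatusCode.TooManyRequests)
    .WaitAndRetryAsync(
        retryCount: 5,
        // 2s, 4s, 8s, ... plus up to 1s of random jitter (Random.Shared needs .NET 6+)
        sleepDurationProvider: retryAttempt =>
            TimeSpan.FromSeconds(Math.Pow(2, retryAttempt))
            + TimeSpan.FromMilliseconds(Random.Shared.Next(0, 1000)),
        onRetry: (exception, timeSpan, retryCount, context) =>
        {
            logger.LogWarning("Retry {RetryCount} after {Delay}s due to {Message}",
                retryCount, timeSpan.TotalSeconds, exception.Message);
        });

var result = await retryPolicy.ExecuteAsync(async () =>
{
    var response = await httpClient.GetAsync(url);
    // GetAsync does not throw on 4xx/5xx; this raises HttpRequestException with StatusCode set.
    response.EnsureSuccessStatusCode();
    return response;
});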
Explanation
Backoff strategies:
| Strategy | Delay Pattern | Use Case |
|---|---|---|
| Fixed | 1s, 1s, 1s | Predictable retry intervals |
| Linear | 1s, 2s, 3s | Moderate increase |
| Exponential | 1s, 2s, 4s, 8s | Fast escape from overload |
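| Full jitter | Random in [0, 2^n] | Prevents thundering herd |
| Decorrelated jitter | Random in [base, 3 × previous delay], capped | Spreads retries without tracking the attempt count |
| Equal jitter | (2^n)/2 + random in [0, (2^n)/2] | Balanced spread with a guaranteed minimum wait |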
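The strategies can be written as small delay functions. This is a sketch under stated assumptions: attempts are 1-based, the base delay is 1 second, and the cap is 60 seconds. The jitter variants follow the commonly cited formulations: full jitter draws from [0, d], equal jitter keeps d/2 and randomizes the other half, and decorrelated jitter draws from [base, 3 × previous delay].

```python
import random

def fixed(attempt, base=1.0):
    return base

def linear(attempt, base=1.0):
    return base * attempt

def exponential(attempt, base=1.0, cap=60.0):
    # attempt 1 -> base, attempt 2 -> 2 * base, attempt 3 -> 4 * base, ...
    return min(cap, base * 2 ** (attempt - 1))

def full_jitter(attempt, base=1.0, cap=60.0, rng=random):
    # Anywhere between zero and the exponential delay.
    return rng.uniform(0, exponential(attempt, base, cap))

def equal_jitter(attempt, base=1.0, cap=60.0, rng=random):
    # Half the exponential delay is guaranteed, the other half is random.
    half = exponential(attempt, base, cap) / 2
    return half + rng.uniform(0, half)

def decorrelated_jitter(previous, base=1.0, cap=60.0, rng=random):
    # Depends on the previous delay rather than on the attempt number.
    return min(cap, rng.uniform(base, previous * 3))

print([exponential(n) for n in range(1, 5)])  # [1.0, 2.0, 4.0, 8.0]
```

For decorrelated jitter, the caller starts with `previous = base` and feeds each returned delay back in as `previous` on the next attempt.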
When NOT to retry:
- HTTP 400 (client error — retry won’t fix)
- HTTP 401/403 (auth issues)
- HTTP 404 (resource doesn’t exist)
- Business logic errors (insufficient funds, invalid input)
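One way to encode this distinction is a small classifier that maps HTTP status codes to retryable or permanent exceptions. The set of retryable codes and the exception class names below are assumptions for this sketch. Whether a plain 500 should be retried is a judgment call that depends on the API, so it is left out here.

```python
# Assumed policy for this sketch; adjust to the API you are calling.
RETRYABLE_STATUS = frozenset({408, 429, 502, 503, 504})

class TransientError(Exception):
    """Hypothetical marker for failures worth retrying."""

class PermanentError(Exception):
    """Hypothetical marker for failures that a retry cannot fix."""

def is_retryable_status(status_code):
    """Return True when an HTTP status suggests a transient failure."""
    return status_code in RETRYABLE_STATUS

def check_response(status_code):
    """Raise the matching exception type for an error status; do nothing on success."""
    if status_code < 400:
        return
    if is_retryable_status(status_code):
        raise TransientError(f"transient HTTP {status_code}")
    raise PermanentError(f"permanent HTTP {status_code}")
```

A retry loop can then list only `TransientError` among its retryable exceptions, so 400, 401, 403, and 404 responses fail on the first attempt.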
Variants
| Library | Language | Notable |
|---|---|---|
| Resilience4j | Java | Retry, CB, rate limiter, bulkhead |
| Polly | C# | Comprehensive; async support |
| tenacity | Python | Decorators; jitter support |
| cockroachdb/errors | Go | Structured errors; markers for classifying retryable errors (not a retry loop itself) |
| axios-retry | JavaScript | Axios plugin; configurable |
Best Practices
- Set a maximum delay: Without a cap, backoff can grow to hours
- Use idempotency keys: Retrying POST requests without them creates duplicates. See message idempotency.
- Circuit breaker integration: Stop retrying when the service is clearly down. Integrate with circuit breaker.
- Log every retry: Silent retries hide systemic issues
- Respect Retry-After headers: HTTP 429/503 often include recommended wait times
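As an illustration of the Retry-After practice, the header can carry either a number of seconds or an HTTP date. The sketch below parses both forms using only the standard library. The function names and the 60-second cap are assumptions for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_seconds(header_value, now=None):
    """Parse a Retry-After header into seconds to wait, or None if unusable."""
    if header_value is None:
        return None
    value = header_value.strip()
    if value.isascii() and value.isdigit():
        return float(value)                      # delta-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)      # HTTP-date form
    except (TypeError, ValueError):
        return None                              # unparseable: ignore the header
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())

def choose_delay(backoff, retry_after, max_delay=60.0):
    """Use the server's hint when it is longer than our own backoff, up to a cap."""
    if retry_after is None:
        return min(backoff, max_delay)
    return min(max(backoff, retry_after), max_delay)
```

Capping the server's hint is a design choice: it keeps a misconfigured upstream from stalling the client for hours, at the cost of occasionally retrying earlier than requested.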
Common Mistakes
- Retrying everything: Non-idempotent operations and client errors should fail fast
- No jitter: Synchronized retries from multiple clients recreate the original overload
- Infinite retries: A client that retries forever becomes a denial-of-service source
- Blocking the caller: Synchronous retries in request handlers increase response times
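- Retrying inside transactions: Sleeping between attempts while a database transaction is open holds its locks for the whole backoff period and increases contention. Roll back and retry the entire transaction instead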
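To guard against unbounded retries and long blocking, a retry loop can enforce a total time budget in addition to an attempt limit. This is a minimal sketch. The function name, the 30-second default deadline, and the injectable `sleep` and `clock` parameters are assumptions made for illustration.

```python
import random
import time

def retry_with_deadline(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                        deadline=30.0, exceptions=(Exception,),
                        sleep=time.sleep, clock=time.monotonic):
    """Call func until it succeeds, attempts run out, or the time budget is spent."""
    start = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == max_attempts:
                raise  # attempt limit reached
            delay = random.uniform(0, min(base_delay * 2 ** (attempt - 1), max_delay))
            if clock() - start + delay > deadline:
                raise  # the next attempt would overrun the time budget
            sleep(delay)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting. In an async request handler the same structure applies with a non-blocking sleep.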
Frequently Asked Questions
Q: What’s the right number of retries?
A: Usually 3-5. More retries increase latency without significantly improving success rates.
Q: Should I retry in the client or use a message queue?
A: For synchronous APIs: retry in client. For background jobs: use a queue with built-in retry.
Q: How do I handle idempotency for retries?
A: Generate a unique Idempotency-Key header. The server checks if it has processed this key before. Learn more in message idempotency.
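The server-side check can be sketched with an in-memory store. The `PaymentServer` class and its `charge` method are hypothetical. A real service would need a persistent store, an atomic check-and-set, and key expiry, all of which this sketch omits.

```python
import uuid

class PaymentServer:
    """Hypothetical in-memory server that deduplicates by idempotency key."""

    def __init__(self):
        self._responses = {}   # idempotency key -> stored response
        self.charges = []      # side effects actually performed

    def charge(self, idempotency_key, amount):
        if idempotency_key in self._responses:
            # Seen before: replay the stored result, do not charge again.
            return self._responses[idempotency_key]
        self.charges.append(amount)
        response = {"status": "charged", "amount": amount}
        self._responses[idempotency_key] = response
        return response

server = PaymentServer()
key = str(uuid.uuid4())         # generated once per logical operation
first = server.charge(key, 25)
retried = server.charge(key, 25)  # a retry reuses the same key
print(len(server.charges))      # 1
```

The important detail on the client side is that the key is created before the first attempt and reused on every retry. Generating a fresh key per attempt defeats the check.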