What happens when your dependencies fail in production
Your WebAPI controller is perfect.
DTOs protect your boundaries. Validation catches bad requests before they hit the database. Status codes communicate clearly. Async operations scale under load. Everything from Part 1 is implemented.
You shipped it six months ago. It's been rock solid.
Then Black Friday hits. The payment API goes down. And your perfect controller — the one that passed every code review, the one with 94% test coverage — takes the entire site with it.
Gilligan didn't sink the Minnow because he was incompetent. He sank it because nobody planned for bad weather. Your controller is Gilligan. It means well. It just doesn't know what to do when the three-hour tour hits a storm.
Your controller was correct. The problem was you didn't design for failure.
💥 The Failure Cascade
Your "perfect" controller:
```csharp
[ApiController]
[Route("api/v{version:apiVersion}/orders")]
public class OrdersController : ControllerBase
{
    private readonly IOrderService _orderService;

    public OrdersController(IOrderService orderService)
    {
        _orderService = orderService;
    }

    [HttpPost]
    public async Task<IActionResult> CreateOrder(
        CreateOrderRequest request,
        CancellationToken cancellationToken)
    {
        var result = await _orderService.CreateOrderAsync(request, cancellationToken);

        if (!result.IsSuccess)
            return BadRequest(result.Error);

        return CreatedAtAction(
            nameof(GetOrder),
            new { id = result.Value.Id },
            result.Value);
    }
}
```
Your service layer:
```csharp
public class OrderService : IOrderService
{
    private readonly IPaymentClient _paymentClient;
    private readonly IOrderRepository _orderRepository;

    public OrderService(
        IPaymentClient paymentClient,
        IOrderRepository orderRepository)
    {
        _paymentClient = paymentClient;
        _orderRepository = orderRepository;
    }

    public async Task<Result<OrderDto>> CreateOrderAsync(
        CreateOrderRequest request,
        CancellationToken cancellationToken)
    {
        // Charge customer through external payment API
        var paymentResult = await _paymentClient.ChargeCustomerAsync(
            request.CustomerId,
            request.Total,
            cancellationToken);

        if (!paymentResult.IsSuccess)
            return Result<OrderDto>.Failure(paymentResult.Error);

        // Save order
        var order = await _orderRepository.CreateAsync(request, cancellationToken);
        return Result<OrderDto>.Success(order);
    }
}
```
Note: `Result<T>` is a custom type for handling errors without exceptions — we'll build it in the exception handling post. For now, just know `.IsSuccess` tells you whether the operation worked, and `.Value` or `.Error` gives you the outcome.
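As a rough sketch (not the full implementation from that post — the real one may add richer error types and implicit conversions), such a type might look like:

```csharp
// Hypothetical minimal sketch of Result<T>; the real implementation
// is covered in the exception handling post and may differ.
public class Result<T>
{
    public bool IsSuccess { get; }
    public T? Value { get; }
    public string? Error { get; }

    private Result(bool isSuccess, T? value, string? error)
    {
        IsSuccess = isSuccess;
        Value = value;
        Error = error;
    }

    public static Result<T> Success(T value) => new(true, value, null);
    public static Result<T> Failure(string error) => new(false, default, error);
}
```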
Looks good, right?
DTOs at the boundary. Proper status codes. Error handling with `Result<T>`.
What happens on Black Friday when the payment API goes down:
11:47 AM - Payment API starts responding slowly (5 seconds per request)
11:49 AM - Payment API stops responding entirely
11:50 AM - First order fails after 100-second timeout (default HttpClient timeout)
11:51 AM - 50 concurrent orders, all waiting 100 seconds
11:52 AM - Thread pool exhausted (all threads blocked waiting for payment API)
11:53 AM - New requests queue (no threads available)
11:54 AM - Site stops responding
11:55 AM - Load balancer removes your instances (health checks failing)
11:56 AM - Black Friday revenue: $0
11:57 AM - You're explaining to the CEO why the site is down and customers are shopping at competitors
Your controller was perfect. Your dependency wasn't.
🎯 The Shift
From: "My code is correct, dependencies should work"
To: "Dependencies will fail, my code should handle it"
I've debugged production APIs at scale — healthcare systems processing millions of claims, real estate platforms serving tens of millions of homes. The pattern repeats:
- Payment gateway has an outage
- Database primary fails over to replica
- Third-party API rate-limits you
- Network partition between services
- DNS resolution fails temporarily
- Cloud provider has a regional issue
Your code doesn't control when dependencies fail. But you control what happens when they do.
🛡️ Pattern 1: Timeout (Fail Fast)
The trap: Waiting forever for a dead dependency.
What happens:
```csharp
// HttpClient default timeout: 100 seconds
var response = await _httpClient.PostAsync(url, content, cancellationToken);
```
When the payment API hangs:
- Request waits 100 seconds before timing out
- Thread blocked for 100 seconds (wasted)
- Under load: all threads blocked
- Thread pool exhausted
- New requests can't execute
- Site goes down
The fix: Fail fast. Don't wait 100 seconds for a dead API.
.NET 8+ with `Microsoft.Extensions.Http.Resilience`:
```csharp
// Program.cs
builder.Services.AddHttpClient<IPaymentClient, PaymentClient>()
    .ConfigureHttpClient(client =>
    {
        client.BaseAddress = new Uri("https://payment-api.example.com");
    })
    .AddStandardResilienceHandler(options =>
    {
        // Total timeout for the entire request including retries
        options.TotalRequestTimeout = new HttpTimeoutStrategyOptions
        {
            Timeout = TimeSpan.FromSeconds(10)
        };

        // Timeout per individual attempt
        options.AttemptTimeout = new HttpTimeoutStrategyOptions
        {
            Timeout = TimeSpan.FromSeconds(3)
        };
    });
```
What this does:
- Each attempt times out in 3 seconds (not 100)
- Total request (including retries) times out in 10 seconds
- Thread freed after timeout (not blocked forever)
- Application stays responsive even when dependency is down
Healthcare example:
Patient at pharmacy counter. Prescription pricing API is down.
Without timeout: The patient waits 100 seconds while you stare at a loading spinner. Awkward. They leave without medication.
With 3-second timeout: System fails in 3 seconds, shows cached pricing or manual override. Patient gets medication. Revenue protected.
🔄 Pattern 2: Retry (Handle Transient Failures)
Not every failure is permanent. Sometimes a packet drops. Sometimes a load balancer hiccups. Sometimes a container is mid-restart when your request arrives. These are transient failures — they fix themselves in seconds, but your user still gets an error page.
A single network blip shouldn't cost you an order.
.NET 8+ Standard Resilience Handler includes retry automatically:
```csharp
builder.Services.AddHttpClient<IPaymentClient, PaymentClient>()
    .AddStandardResilienceHandler(options =>
    {
        // Retry configuration
        options.Retry = new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 3,
            BackoffType = DelayBackoffType.Exponential,
            Delay = TimeSpan.FromSeconds(1),
            UseJitter = true,
            ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                .Handle<HttpRequestException>()
                .HandleResult(response =>
                    response.StatusCode >= HttpStatusCode.InternalServerError ||
                    response.StatusCode == HttpStatusCode.RequestTimeout)
        };
    });
```
What this does:
- Request fails with 503 Service Unavailable
- Wait 1 second (with jitter), retry
- Fails again → wait ~2 seconds (exponential backoff + jitter), retry
- Fails third time → wait ~4 seconds (exponential backoff + jitter), retry
- Still failing → propagate error
Jitter prevents thundering herd: If 1,000 requests fail at the same moment, they don't all retry at the same moment (which would DDoS the recovering service). Jitter adds randomness to retry delays.
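To make the jitter idea concrete, here's a rough sketch of how a delay schedule like this could be computed. This is illustrative only — Polly's actual implementation uses a more sophisticated decorrelated-jitter algorithm, and the 0.5x–1.5x factor here is an arbitrary choice:

```csharp
// Illustrative only: exponential backoff with randomized jitter.
// With baseDelay = 1s: attempt 0 -> ~1s, attempt 1 -> ~2s, attempt 2 -> ~4s,
// each randomized so 1,000 clients don't retry in lockstep.
static TimeSpan BackoffWithJitter(int attempt, TimeSpan baseDelay, Random random)
{
    var exponential = baseDelay * Math.Pow(2, attempt);   // 1s, 2s, 4s, ...
    var jitterFactor = 0.5 + random.NextDouble();         // randomize 0.5x to 1.5x
    return TimeSpan.FromMilliseconds(exponential.TotalMilliseconds * jitterFactor);
}
```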
Note: The `.Handle<HttpRequestException>()` line is critical. Without it, you only catch HTTP error responses — not network failures where the request never gets a response at all. DNS timeouts, connection resets, TLS handshake failures — these throw `HttpRequestException`, and if your predicate doesn't handle it, your retry policy ignores the failures you most need to retry.
Healthcare example:
Claims processing. Pharmacy benefit manager API hiccups.
Without retry: Claim fails. Patient's medication shows "not covered." Pharmacy calls insurance. 20-minute hold time. Patient leaves.
With retry: Transient failure retries automatically. Claim processes. Patient gets medication. Nobody notices the hiccup.
⚡ Pattern 3: Circuit Breaker (Stop the Bleeding)
Timeouts and retries handle the symptoms. Circuit breaker handles the disease.
Here's the scenario: payment API goes down completely. You've got timeouts — great, each request only wastes 3 seconds instead of 100. You've got retries — great, each request wastes 3 seconds four times. You're still burning threads on a dependency you already know is dead.
- Request 1: Tries payment API, times out after 3 seconds
- Request 2: Tries payment API, times out after 3 seconds
- Request 3: Tries payment API, times out after 3 seconds
- ... (repeat for every order)
- 1,000 orders: 1,000 × 3 seconds = 3,000 seconds of wasted time
- Payment API gets hammered with requests it can't fulfill
- Your threads are wasted on calls you know will fail
The fix: After N failures, stop trying.
.NET 8+ Standard Resilience Handler includes circuit breaker:
```csharp
builder.Services.AddHttpClient<IPaymentClient, PaymentClient>()
    .AddStandardResilienceHandler(options =>
    {
        // Circuit breaker configuration
        options.CircuitBreaker = new HttpCircuitBreakerStrategyOptions
        {
            FailureRatio = 0.5,           // Open if 50% of requests fail
            MinimumThroughput = 10,       // Need at least 10 requests to evaluate
            SamplingDuration = TimeSpan.FromSeconds(30),
            BreakDuration = TimeSpan.FromSeconds(30),
            ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                .Handle<HttpRequestException>()
                .HandleResult(response =>
                    response.StatusCode >= HttpStatusCode.InternalServerError ||
                    response.StatusCode == HttpStatusCode.RequestTimeout)
        };
    });
```
How circuit breaker works:
Closed State (Normal Operation)
- Requests go through normally
- Successes and failures are tracked
Open State (Dependency is Down)
- After 50% failure rate within 30-second window (with minimum 10 requests)
- Circuit opens
- All requests fail immediately without calling the payment API
- The caller sees a `BrokenCircuitException` instead of a slow timeout
- Stays open for 30 seconds
Half-Open State (Testing Recovery)
- After 30 seconds, circuit moves to half-open
- Allows one test request through
- If successful: circuit closes (back to normal)
- If fails: circuit re-opens for another 30 seconds
What this does:
Payment API goes down. After 5 failures out of 10 requests:
- Circuit opens
- Next 1,000 orders fail immediately (no 3-second timeout wasted)
- Payment API is not hammered with requests it can't handle
- After 30 seconds, try one request to see if API recovered
- If recovered: resume normal operation
- If still down: fail fast for another 30 seconds
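To demystify the state machine, here's a stripped-down sketch of the closed → open → half-open logic. Polly's real implementation adds sampling windows, failure ratios, and thread safety; this hypothetical class only shows the core transitions:

```csharp
// Hypothetical minimal circuit breaker, illustrating state transitions only.
public class SimpleCircuitBreaker
{
    private enum State { Closed, Open, HalfOpen }

    private State _state = State.Closed;
    private int _failureCount;
    private DateTime _openedAt;

    private readonly int _failureThreshold;
    private readonly TimeSpan _breakDuration;

    public SimpleCircuitBreaker(int failureThreshold, TimeSpan breakDuration)
    {
        _failureThreshold = failureThreshold;
        _breakDuration = breakDuration;
    }

    public async Task<T> ExecuteAsync<T>(Func<Task<T>> action)
    {
        if (_state == State.Open)
        {
            if (DateTime.UtcNow - _openedAt < _breakDuration)
                throw new InvalidOperationException("Circuit is open: failing fast.");

            _state = State.HalfOpen; // break elapsed: allow one probe request
        }

        try
        {
            var result = await action();
            _failureCount = 0;
            _state = State.Closed;   // success closes the circuit
            return result;
        }
        catch
        {
            _failureCount++;
            if (_state == State.HalfOpen || _failureCount >= _failureThreshold)
            {
                _state = State.Open; // probe failed or threshold hit: open
                _openedAt = DateTime.UtcNow;
            }
            throw;
        }
    }
}
```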
Healthcare example:
Prescription pricing API goes down. You're processing 500 claims per minute.
Without circuit breaker:
- 500 claims/min × 3-second timeouts = 25 minutes of thread-time burned every 60 seconds of wall-clock — you'll never catch up
- Prescription API gets hammered with 500 requests/min it can't answer
- Your system is unusable
With circuit breaker:
- First 5 failures out of 10 → circuit opens
- Next 495 claims fail immediately (in milliseconds, not 3 seconds)
- System shows cached pricing or routes to manual review
- Prescription API gets a break (not hammered with requests)
- After 30 seconds, test if the API recovered
- Claims keep processing (degraded mode, but operational)
🏗️ Combining All Three Patterns
In production, you use all three together:
```csharp
// Program.cs - .NET 8+
builder.Services.AddHttpClient<IPaymentClient, PaymentClient>()
    .ConfigureHttpClient(client =>
    {
        client.BaseAddress = new Uri(
            builder.Configuration["PaymentApi:BaseUrl"]
            ?? throw new InvalidOperationException("PaymentApi:BaseUrl not configured"));
    })
    .AddStandardResilienceHandler(options =>
    {
        // Timeout: Fail fast
        options.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(10);
        options.AttemptTimeout.Timeout = TimeSpan.FromSeconds(3);

        // Retry: Handle transient failures
        options.Retry.MaxRetryAttempts = 3;
        options.Retry.BackoffType = DelayBackoffType.Exponential;
        options.Retry.Delay = TimeSpan.FromSeconds(1);
        options.Retry.UseJitter = true;

        // Circuit Breaker: Stop hammering dead dependencies
        options.CircuitBreaker.FailureRatio = 0.5;
        options.CircuitBreaker.MinimumThroughput = 10;
        options.CircuitBreaker.SamplingDuration = TimeSpan.FromSeconds(30);
        options.CircuitBreaker.BreakDuration = TimeSpan.FromSeconds(30);
    });
```
How they work together:
- Request fails (payment API slow)
- Timeout kicks in after 3 seconds (don't wait forever)
- Retry waits 1 second, tries again
- Timeout applies to retry too (3 seconds max)
- After 3 retry attempts, if still failing → propagate error
- Circuit breaker watches: if 50% of requests fail → open circuit
- Once open, all requests fail immediately (no timeout, no retry)
- After 30 seconds, circuit allows one test request
- If successful → back to normal
- If fails → stay open for 30 more seconds
Order matters in the pipeline:
`AddStandardResilienceHandler()` configures the pipeline in this order (innermost to outermost):
1. Attempt Timeout (innermost - per individual call)
2. Circuit Breaker (trips when enough individual calls fail)
3. Retry (retries the circuit-breaker-wrapped call)
4. Total Request Timeout (outermost - absolute limit across all attempts)

(The standard handler also adds a rate limiter outside the total timeout, omitted here for clarity.)
This is the correct order. Retry wraps circuit breaker, which wraps timeout. When the circuit opens, retry sees the immediate rejection and burns through attempts instantly — failing fast instead of waiting through backoff delays for a dependency you already know is down.
🔧 Custom Resilience Pipeline with Monitoring
Standard resilience handler is great for most cases. But sometimes you need custom behavior with logging to monitor what's happening in production:
```csharp
builder.Services.AddHttpClient<IPaymentClient, PaymentClient>()
    .ConfigureHttpClient(client =>
    {
        client.BaseAddress = new Uri(
            builder.Configuration["PaymentApi:BaseUrl"]
            ?? throw new InvalidOperationException("PaymentApi:BaseUrl not configured"));
    })
    .AddResilienceHandler("payment-pipeline", (pipelineBuilder, context) =>
    {
        var logger = context.ServiceProvider
            .GetRequiredService<ILogger<PaymentClient>>();

        // Outermost: Total timeout across all attempts
        pipelineBuilder.AddTimeout(TimeSpan.FromSeconds(10));

        // Retry with logging
        pipelineBuilder.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 3,
            Delay = TimeSpan.FromSeconds(1),
            BackoffType = DelayBackoffType.Exponential,
            UseJitter = true,
            ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                .Handle<HttpRequestException>()
                .HandleResult(response =>
                    response.StatusCode >= HttpStatusCode.InternalServerError ||
                    response.StatusCode == HttpStatusCode.RequestTimeout),
            OnRetry = args =>
            {
                logger.LogWarning(
                    "Payment API retry {Attempt}/{MaxAttempts}. Status: {Status}, Delay: {Delay}ms",
                    args.AttemptNumber,
                    3,
                    args.Outcome.Result?.StatusCode,
                    args.RetryDelay.TotalMilliseconds);
                return ValueTask.CompletedTask;
            }
        });

        // Circuit breaker with logging
        pipelineBuilder.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            FailureRatio = 0.5,
            MinimumThroughput = 10,
            SamplingDuration = TimeSpan.FromSeconds(30),
            BreakDuration = TimeSpan.FromSeconds(30),
            ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                .Handle<HttpRequestException>()
                .HandleResult(response =>
                    response.StatusCode >= HttpStatusCode.InternalServerError ||
                    response.StatusCode == HttpStatusCode.RequestTimeout),
            OnOpened = args =>
            {
                logger.LogError(
                    "Payment API circuit breaker OPENED. Break duration: {Duration}s",
                    args.BreakDuration.TotalSeconds);
                return ValueTask.CompletedTask;
            },
            OnClosed = args =>
            {
                logger.LogInformation(
                    "Payment API circuit breaker CLOSED. Normal operation resumed.");
                return ValueTask.CompletedTask;
            },
            OnHalfOpened = args =>
            {
                logger.LogInformation(
                    "Payment API circuit breaker HALF-OPEN (testing recovery)");
                return ValueTask.CompletedTask;
            }
        });

        // Innermost: Timeout per attempt
        pipelineBuilder.AddTimeout(TimeSpan.FromSeconds(3));
    });
```
Critical detail: In Polly v8, the first strategy added is the outermost. The order above matches what `AddStandardResilienceHandler()` does internally: total timeout (outermost) → retry → circuit breaker → attempt timeout (innermost). When the circuit opens, retry sees the immediate rejection and fails fast.
When to use custom pipelines:
- Need logging on retry/circuit breaker events for production monitoring
- Different retry strategies for different error types
- More complex fallback logic
- Integration with monitoring/alerting systems
What to monitor in production:
- Retry attempt count (spike = dependency issues)
- Circuit breaker state changes (opened = dependency down)
- Timeout frequency (increase = dependency slow)
- Request success/failure ratio
- Total request duration (including retries)
Alert on:
- Circuit breaker opens (dependency is down)
- Retry rate >20% (dependency degraded)
- Timeout rate >10% (dependency slow)
⚠️ Common Gotchas
1. No Timeout = Thread Pool Exhaustion
The trap:
```csharp
// No timeout configured
builder.Services.AddHttpClient<IPaymentClient, PaymentClient>();
```
What happens:
The default `HttpClient` timeout is 100 seconds. The dependency hangs. Every thread blocks for 100 seconds. The thread pool exhausts. The site crashes.
The fix: Always configure a timeout. 3-5 seconds suits most external APIs.
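If you haven't adopted the resilience package yet, even the built-in `HttpClient.Timeout` property is far better than nothing. Note it covers the whole request, and a timeout surfaces as a `TaskCanceledException`:

```csharp
// Minimal stopgap: shrink HttpClient's own timeout from 100s to 5s.
builder.Services.AddHttpClient<IPaymentClient, PaymentClient>()
    .ConfigureHttpClient(client =>
    {
        // Applies to the entire request; fires as a TaskCanceledException
        client.Timeout = TimeSpan.FromSeconds(5);
    });
```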
2. Retry Without Backoff = DDoS Yourself
The trap:
```csharp
// Immediate retry (no delay)
options.Retry.Delay = TimeSpan.Zero;
options.Retry.BackoffType = DelayBackoffType.Constant;
```
What happens:
Dependency has brief outage. 1,000 requests fail simultaneously. All 1,000 retry immediately. Dependency gets hammered with 3,000 requests in 1 second. You DDoS the service you're trying to use.
The fix: Exponential backoff with jitter. Standard resilience handler does this by default.
3. Circuit Breaker Too Sensitive
The trap:
```csharp
// Opens after a single failure
options.CircuitBreaker.FailureRatio = 0.0; // Any failure opens it
options.CircuitBreaker.MinimumThroughput = 1;
```
What happens:
One transient failure. Circuit opens. All requests fail fast. Dependency recovers immediately but circuit stays open for 30 seconds. False positive outage.
The fix: Reasonable thresholds. A 50% failure ratio with a minimum of 10 requests is a good default.
4. No Fallback Strategy
The trap:
```csharp
var paymentResult = await _paymentClient.ChargeCustomerAsync(...);
if (!paymentResult.IsSuccess)
    return Result<OrderDto>.Failure("Payment failed"); // Dead end
```
What happens:
Payment API is down. Circuit breaker is open. Every order fails with "Payment failed." Revenue: $0.
The fix: Fallback strategy.
```csharp
var paymentResult = await _paymentClient.ChargeCustomerAsync(...);
if (!paymentResult.IsSuccess)
{
    // Fallback: Save the order, queue payment for later
    var pendingOrder = await _orderRepository.CreatePendingAsync(request);
    await _paymentQueue.EnqueueAsync(pendingOrder.Id);

    return Result<OrderDto>.Success(new OrderDto
    {
        Id = pendingOrder.Id,
        Status = "Pending Payment", // Background job will retry
        Message = "Order received. Payment processing..."
    });
}
```
Healthcare example:
Prescription pricing API down. Don't tell the patient "system unavailable, go away."

Fallback options:
1. Use cached pricing from the last successful call
2. Use average pricing for this medication
3. Flag for manual pharmacist review
4. Process the claim, adjust pricing later if needed
Patient gets medication. Revenue protected. System operational.
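The queued-payment fallback above implies a background job that drains the queue once the payment API recovers. Here's a rough sketch using a hosted service. The `IPaymentQueue` shape, plus `DequeueAsync`, `GetAsync`, and `MarkPaidAsync`, are hypothetical helpers invented for this illustration:

```csharp
// Sketch of a background retry worker for queued payments (assumed interfaces).
public class PendingPaymentWorker : BackgroundService
{
    private readonly IPaymentQueue _queue;
    private readonly IPaymentClient _paymentClient;
    private readonly IOrderRepository _orders;

    public PendingPaymentWorker(
        IPaymentQueue queue,
        IPaymentClient paymentClient,
        IOrderRepository orders)
    {
        _queue = queue;
        _paymentClient = paymentClient;
        _orders = orders;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            // Hypothetical queue API: blocks until a pending order ID arrives
            var orderId = await _queue.DequeueAsync(stoppingToken);
            var order = await _orders.GetAsync(orderId, stoppingToken);

            var payment = await _paymentClient.ChargeCustomerAsync(
                order.CustomerId, order.Total, stoppingToken);

            if (payment.IsSuccess)
                await _orders.MarkPaidAsync(orderId, stoppingToken);
            else
                await _queue.EnqueueAsync(orderId); // still failing: requeue for later
        }
    }
}
```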
🎓 Why These Patterns Matter
Every production outage I've investigated follows the same script:
Dependency fails → No resilience patterns → Cascade failure → Site down → Revenue lost
It doesn't matter if the dependency is a payment gateway, a claims processor, or a pricing API. The cascade is always the same. And every time, someone says "but the code was fine." The code was fine. The architecture wasn't.
Picture two e-commerce sites on Black Friday. Same payment gateway. Same 15-minute outage.
Site A has no resilience patterns. 100-second timeouts drain the thread pool in two minutes. The site goes dark for the entire outage plus recovery time. Customer support lines light up. Social media does the rest.
Site B has the patterns from this post. 3-second timeouts keep threads alive. Circuit breaker trips after the first wave of failures. Orders queue for later processing. The site stays up — degraded, not dead. When the gateway recovers, queued orders process automatically. Most customers never notice.
Same outage. One site lost its peak revenue hour. The other lost nothing.
Your controller is the last line of defense. When dependencies fail — and they will — your API can either cascade or degrade gracefully. These three patterns are the difference.
🧭 Key Takeaways
- Dependencies will fail in production — design for it
- Timeout: Fail fast (3-5 seconds), don't wait forever
- Retry: Handle transient failures with exponential backoff + jitter
- Circuit Breaker: Stop hammering dead dependencies
- Combine them: Use `AddStandardResilienceHandler()` in .NET 8+
- Monitor: Log when patterns activate, alert on circuit breaker opens
- Fallback: Don't just fail — queue, cache, or degrade gracefully
🚀 Next Steps
Review your HTTP clients:
- Do you have timeouts? Default 100 seconds will kill your thread pool
- Do you retry transient failures? Network blips shouldn't fail orders
- Do you have circuit breakers? Stop hammering dependencies that are down
- Can you monitor resilience? Log retry attempts, circuit breaker state
- What's your fallback? When payment API is down, what happens to orders?
Start with your most critical dependency (payment, auth, inventory). Add resilience. Test it (disable the dependency, watch your circuit breaker work).
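One low-effort way to run that test locally is a fault-injecting `DelegatingHandler` registered only in development. This is a sketch; the `ChaosHandler` name and the 50% failure rate are arbitrary choices:

```csharp
// Hypothetical fault injector for local testing. Register it *after*
// AddStandardResilienceHandler so the resilience pipeline wraps it and
// reacts to the injected failures:
//
//   builder.Services.AddTransient<ChaosHandler>();
//   builder.Services.AddHttpClient<IPaymentClient, PaymentClient>()
//       .AddStandardResilienceHandler()
//       .AddHttpMessageHandler<ChaosHandler>(); // development only
public class ChaosHandler : DelegatingHandler
{
    private static readonly Random _random = new();

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Fail ~50% of requests with 503 to exercise retries and the circuit breaker
        if (_random.NextDouble() < 0.5)
            return new HttpResponseMessage(HttpStatusCode.ServiceUnavailable);

        return await base.SendAsync(request, cancellationToken);
    }
}
```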
The APIs that survive Black Friday implement these patterns. The ones that crash don't.
Your controller is correct. Now make it resilient.
Related Posts:
- Building Professional WebAPI Controllers (Part 1) - boundaries, DTOs, validation, status codes
- Dependency Injection in ASP.NET Core - foundation for injecting HttpClients
- Configuration Management That Won't Get You Fired - managing API endpoints and timeouts via configuration
In the next post: Exception handling in production — why exceptions for control flow fail audits, how `Result<T>` patterns prevent it, and building error handling that survives regulatory review.