What happens when your dependencies fail in production
Your WebAPI controller is perfect.
DTOs protect your boundaries. Validation catches bad requests before they hit the database. Status codes communicate clearly. Async operations scale under load. Everything from Part 1 is implemented.
You shipped it six months ago. It's been rock solid.
Then Black Friday hits. The payment API goes down. And your perfect controller — the one that passed every code review, the one with 94% test coverage — takes the entire site with it.
Gilligan didn't sink the Minnow because he was incompetent. He sank it because nobody planned for bad weather. Your controller is Gilligan. It means well. It just doesn't know what to do when the three-hour tour hits a storm.
Your controller was correct. The problem was you didn't design for failure.
💥 The Failure Cascade
Your "perfect" controller:
```csharp
[ApiController]
[Route("api/v{version:apiVersion}/orders")]
public class OrdersController : ControllerBase
{
    private readonly IOrderService _orderService;

    public OrdersController(IOrderService orderService)
    {
        _orderService = orderService;
    }

    [HttpPost]
    public async Task<IActionResult> CreateOrder(
        CreateOrderRequest request,
        CancellationToken cancellationToken)
    {
        var result = await _orderService.CreateOrderAsync(request, cancellationToken);

        if (!result.IsSuccess)
            return BadRequest(result.Error);

        return CreatedAtAction(
            nameof(GetOrder),
            new { id = result.Value.Id },
            result.Value);
    }
}
```
Your service layer:
```csharp
public class OrderService : IOrderService
{
    private readonly IPaymentClient _paymentClient;
    private readonly IOrderRepository _orderRepository;

    public OrderService(
        IPaymentClient paymentClient,
        IOrderRepository orderRepository)
    {
        _paymentClient = paymentClient;
        _orderRepository = orderRepository;
    }

    public async Task<Result<OrderDto>> CreateOrderAsync(
        CreateOrderRequest request,
        CancellationToken cancellationToken)
    {
        // Charge customer through external payment API
        var paymentResult = await _paymentClient.ChargeCustomerAsync(
            request.CustomerId,
            request.Total,
            cancellationToken);

        if (!paymentResult.IsSuccess)
            return Result<OrderDto>.Failure(paymentResult.Error);

        // Save order
        var order = await _orderRepository.CreateAsync(request, cancellationToken);
        return Result<OrderDto>.Success(order);
    }
}
```
Note: `Result<T>` is a custom type for handling errors without exceptions — we'll build it in the exception handling post. For now, just know `.IsSuccess` tells you whether the operation worked, and `.Value` or `.Error` gives you the outcome.
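As a rough sketch (not the full implementation from that post — the real one may add richer error types and implicit conversions), such a type might look like:

```csharp
// Hypothetical minimal sketch of Result<T>; the real implementation
// is covered in the exception handling post and may differ.
public class Result<T>
{
    public bool IsSuccess { get; }
    public T? Value { get; }
    public string? Error { get; }

    private Result(bool isSuccess, T? value, string? error)
    {
        IsSuccess = isSuccess;
        Value = value;
        Error = error;
    }

    public static Result<T> Success(T value) => new(true, value, null);
    public static Result<T> Failure(string error) => new(false, default, error);
}
```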
Looks good, right?
DTOs at the boundary. Proper status codes. Error handling with `Result<T>`.
What happens on Black Friday when the payment API goes down:
11:47 AM - Payment API starts responding slowly (5 seconds per request)
11:49 AM - Payment API stops responding entirely
11:50 AM - First order fails after 100-second timeout (default HttpClient timeout)
11:51 AM - 50 concurrent orders, all waiting 100 seconds
11:52 AM - Thread pool exhausted (all threads blocked waiting for payment API)
11:53 AM - New requests queue (no threads available)
11:54 AM - Site stops responding
11:55 AM - Load balancer removes your instances (health checks failing)
11:56 AM - Black Friday revenue: $0
11:57 AM - You're explaining to the CEO why the site is down and customers are shopping at competitors
Your controller was perfect. Your dependency wasn't.
🎯 The Shift
From: "My code is correct, dependencies should work"
To: "Dependencies will fail, my code should handle it"
I've debugged production APIs at scale — healthcare systems processing millions of claims, real estate platforms serving tens of millions of homes. The pattern repeats:
- Payment gateway has an outage
- Database primary fails over to replica
- Third-party API rate-limits you
- Network partition between services
- DNS resolution fails temporarily
- Cloud provider has a regional issue
Your code doesn't control when dependencies fail. But you control what happens when they do.
🛡️ Pattern 1: Timeout (Fail Fast)
The trap: Waiting forever for a dead dependency.
What happens:
```csharp
// HttpClient default timeout: 100 seconds
var response = await _httpClient.PostAsync(url, content, cancellationToken);
```
When the payment API hangs:
- Request waits 100 seconds before timing out
- Thread blocked for 100 seconds (wasted)
- Under load: all threads blocked
- Thread pool exhausted
- New requests can't execute
- Site goes down
The fix: Fail fast. Don't wait 100 seconds for a dead API.
.NET 8+ with `Microsoft.Extensions.Http.Resilience`:
```csharp
// Program.cs
builder.Services.AddHttpClient<IPaymentClient, PaymentClient>()
    .ConfigureHttpClient(client =>
    {
        client.BaseAddress = new Uri("https://payment-api.example.com");
    })
    .AddStandardResilienceHandler(options =>
    {
        // Total timeout for the entire request including retries
        options.TotalRequestTimeout = new HttpTimeoutStrategyOptions
        {
            Timeout = TimeSpan.FromSeconds(10)
        };

        // Timeout per individual attempt
        options.AttemptTimeout = new HttpTimeoutStrategyOptions
        {
            Timeout = TimeSpan.FromSeconds(3)
        };
    });
```
What this does:
- Each attempt times out in 3 seconds (not 100)
- Total request (including retries) times out in 10 seconds
- Thread freed after timeout (not blocked forever)
- Application stays responsive even when dependency is down
Healthcare example:
Patient at pharmacy counter. Prescription pricing API is down.
Without timeout: The patient waits 100 seconds while you stare at a loading spinner. Awkward. They leave without medication.
With 3-second timeout: System fails in 3 seconds, shows cached pricing or manual override. Patient gets medication. Revenue protected.
🔄 Pattern 2: Retry (Handle Transient Failures)
Not every failure is permanent. Sometimes a packet drops. Sometimes a load balancer hiccups. Sometimes a container is mid-restart when your request arrives. These are transient failures — they fix themselves in seconds, but your user still gets an error page.
A single network blip shouldn't cost you an order.
.NET 8+ Standard Resilience Handler includes retry automatically:
```csharp
builder.Services.AddHttpClient<IPaymentClient, PaymentClient>()
    .AddStandardResilienceHandler(options =>
    {
        // Retry configuration
        options.Retry = new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 3,
            BackoffType = DelayBackoffType.Exponential,
            Delay = TimeSpan.FromSeconds(1),
            UseJitter = true,
            ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                .Handle<HttpRequestException>()
                .HandleResult(response =>
                    response.StatusCode >= HttpStatusCode.InternalServerError ||
                    response.StatusCode == HttpStatusCode.RequestTimeout)
        };
    });
```
What this does:
- Request fails with 503 Service Unavailable
- Wait 1 second (with jitter), retry
- Fails again → wait ~2 seconds (exponential backoff + jitter), retry
- Fails third time → wait ~4 seconds (exponential backoff + jitter), retry
- Still failing → propagate error
Jitter prevents thundering herd: If 1,000 requests fail at the same moment, they don't all retry at the same moment (which would DDoS the recovering service). Jitter adds randomness to retry delays.
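To make the jitter idea concrete, here's a rough sketch of how a delay schedule like this could be computed. This is illustrative only — Polly's actual implementation uses a more sophisticated decorrelated-jitter algorithm, and the 0.5x–1.5x factor here is an arbitrary choice:

```csharp
// Illustrative only: exponential backoff with randomized jitter.
// With baseDelay = 1s: attempt 0 -> ~1s, attempt 1 -> ~2s, attempt 2 -> ~4s,
// each randomized so 1,000 clients don't retry in lockstep.
static TimeSpan BackoffWithJitter(int attempt, TimeSpan baseDelay, Random random)
{
    var exponential = baseDelay * Math.Pow(2, attempt);   // 1s, 2s, 4s, ...
    var jitterFactor = 0.5 + random.NextDouble();         // randomize 0.5x to 1.5x
    return TimeSpan.FromMilliseconds(exponential.TotalMilliseconds * jitterFactor);
}
```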
Note: The `.Handle<HttpRequestException>()` line is critical. Without it, you only catch HTTP error responses — not network failures where the request never gets a response at all. DNS timeouts, connection resets, TLS handshake failures — these throw `HttpRequestException`, and if your predicate doesn't handle it, your retry policy ignores the failures you most need to retry.
Healthcare example:
Claims processing. Pharmacy benefit manager API hiccups.
Without retry: Claim fails. Patient's medication shows "not covered." Pharmacy calls insurance. 20-minute hold time. Patient leaves.
With retry: Transient failure retries automatically. Claim processes. Patient gets medication. Nobody notices the hiccup.
⚡ Pattern 3: Circuit Breaker (Stop the Bleeding)
Timeouts and retries handle the symptoms. Circuit breaker handles the disease.
Here's the scenario: payment API goes down completely. You've got timeouts — great, each request only wastes 3 seconds instead of 100. You've got retries — great, each request wastes 3 seconds four times. You're still burning threads on a dependency you already know is dead.
- Request 1: Tries payment API, times out after 3 seconds
- Request 2: Tries payment API, times out after 3 seconds
- Request 3: Tries payment API, times out after 3 seconds
- ... (repeat for every order)
- 1,000 orders: 1,000 × 3 seconds = 3,000 seconds of wasted time
- Payment API gets hammered with requests it can't fulfill
- Your threads are wasted on calls you know will fail
The fix: After N failures, stop trying.
.NET 8+ Standard Resilience Handler includes circuit breaker:
```csharp
builder.Services.AddHttpClient<IPaymentClient, PaymentClient>()
    .AddStandardResilienceHandler(options =>
    {
        // Circuit breaker configuration
        options.CircuitBreaker = new HttpCircuitBreakerStrategyOptions
        {
            FailureRatio = 0.5,           // Open if 50% of requests fail
            MinimumThroughput = 10,       // Need at least 10 requests to evaluate
            SamplingDuration = TimeSpan.FromSeconds(30),
            BreakDuration = TimeSpan.FromSeconds(30),
            ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                .Handle<HttpRequestException>()
                .HandleResult(response =>
                    response.StatusCode >= HttpStatusCode.InternalServerError ||
                    response.StatusCode == HttpStatusCode.RequestTimeout)
        };
    });
```
How circuit breaker works:
Closed State (Normal Operation)
- Requests go through normally
- Successes and failures are tracked
Open State (Dependency is Down)
- After 50% failure rate within 30-second window (with minimum 10 requests)
- Circuit opens
- All requests fail immediately without calling the payment API
- The caller sees a `BrokenCircuitException` instead of a slow timeout
- Stays open for 30 seconds
Half-Open State (Testing Recovery)
- After 30 seconds, circuit moves to half-open
- Allows one test request through
- If successful: circuit closes (back to normal)
- If fails: circuit re-opens for another 30 seconds
What this does:
Payment API goes down. After 5 failures out of 10 requests:
- Circuit opens
- Next 1,000 orders fail immediately (no 3-second timeout wasted)
- Payment API is not hammered with requests it can't handle
- After 30 seconds, try one request to see if API recovered
- If recovered: resume normal operation
- If still down: fail fast for another 30 seconds
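To demystify the state machine, here's a stripped-down sketch of the closed → open → half-open logic. Polly's real implementation adds sampling windows, failure ratios, and thread safety; this hypothetical class only shows the core transitions:

```csharp
// Hypothetical minimal circuit breaker, illustrating state transitions only.
public class SimpleCircuitBreaker
{
    private enum State { Closed, Open, HalfOpen }

    private State _state = State.Closed;
    private int _failureCount;
    private DateTime _openedAt;

    private readonly int _failureThreshold;
    private readonly TimeSpan _breakDuration;

    public SimpleCircuitBreaker(int failureThreshold, TimeSpan breakDuration)
    {
        _failureThreshold = failureThreshold;
        _breakDuration = breakDuration;
    }

    public async Task<T> ExecuteAsync<T>(Func<Task<T>> action)
    {
        if (_state == State.Open)
        {
            if (DateTime.UtcNow - _openedAt < _breakDuration)
                throw new InvalidOperationException("Circuit is open: failing fast.");

            _state = State.HalfOpen; // break elapsed: allow one probe request
        }

        try
        {
            var result = await action();
            _failureCount = 0;
            _state = State.Closed;   // success closes the circuit
            return result;
        }
        catch
        {
            _failureCount++;
            if (_state == State.HalfOpen || _failureCount >= _failureThreshold)
            {
                _state = State.Open; // probe failed or threshold hit: open
                _openedAt = DateTime.UtcNow;
            }
            throw;
        }
    }
}
```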
Healthcare example:
Prescription pricing API goes down. You're processing 500 claims per minute.
Without circuit breaker:
- 500 claims/min × 3-second timeouts = 25 minutes of thread-time burned every 60 seconds of wall-clock — you'll never catch up
- Prescription API gets hammered with 500 requests/min it can't answer
- Your system is unusable
With circuit breaker:
- First 5 failures out of 10 → circuit opens
- Next 495 claims fail immediately (in milliseconds, not 3 seconds)
- System shows cached pricing or routes to manual review
- Prescription API gets a break (not hammered with requests)
- After 30 seconds, test if the API recovered
- Claims keep processing (degraded mode, but operational)
🏗️ Combining All Three Patterns
In production, you use all three together:
```csharp
// Program.cs - .NET 8+
builder.Services.AddHttpClient<IPaymentClient, PaymentClient>()
    .ConfigureHttpClient(client =>
    {
        client.BaseAddress = new Uri(
            builder.Configuration["PaymentApi:BaseUrl"]
            ?? throw new InvalidOperationException("PaymentApi:BaseUrl not configured"));
    })
    .AddStandardResilienceHandler(options =>
    {
        // Timeout: Fail fast
        options.TotalRequestTimeout.Timeout = TimeSpan.FromSeconds(10);
        options.AttemptTimeout.Timeout = TimeSpan.FromSeconds(3);

        // Retry: Handle transient failures
        options.Retry.MaxRetryAttempts = 3;
        options.Retry.BackoffType = DelayBackoffType.Exponential;
        options.Retry.Delay = TimeSpan.FromSeconds(1);
        options.Retry.UseJitter = true;

        // Circuit Breaker: Stop hammering dead dependencies
        options.CircuitBreaker.FailureRatio = 0.5;
        options.CircuitBreaker.MinimumThroughput = 10;
        options.CircuitBreaker.SamplingDuration = TimeSpan.FromSeconds(30);
        options.CircuitBreaker.BreakDuration = TimeSpan.FromSeconds(30);
    });
```
How they work together:
- Request fails (payment API slow)
- Timeout kicks in after 3 seconds (don't wait forever)
- Retry waits 1 second, tries again
- Timeout applies to retry too (3 seconds max)
- After 3 retry attempts, if still failing → propagate error
- Circuit breaker watches: if 50% of requests fail → open circuit
- Once open, all requests fail immediately (no timeout, no retry)
- After 30 seconds, circuit allows one test request
- If successful → back to normal
- If fails → stay open for 30 more seconds
Order matters in the pipeline:
`AddStandardResilienceHandler()` configures the pipeline in this order (innermost to outermost):
1. Attempt Timeout (innermost - per individual call)
2. Circuit Breaker (trips when enough individual calls fail)
3. Retry (retries the circuit-breaker-wrapped call)
4. Total Request Timeout (outermost - absolute limit across all attempts)

(The standard handler also adds a rate limiter outside the total timeout, omitted here for clarity.)
This is the correct order. Retry wraps circuit breaker, which wraps timeout. When the circuit opens, retry sees the immediate rejection and burns through attempts instantly — failing fast instead of waiting through backoff delays for a dependency you already know is down.
🔧 Custom Resilience Pipeline with Monitoring
Standard resilience handler is great for most cases. But sometimes you need custom behavior with logging to monitor what's happening in production:
```csharp
builder.Services.AddHttpClient<IPaymentClient, PaymentClient>()
    .ConfigureHttpClient(client =>
    {
        client.BaseAddress = new Uri(
            builder.Configuration["PaymentApi:BaseUrl"]
            ?? throw new InvalidOperationException("PaymentApi:BaseUrl not configured"));
    })
    .AddResilienceHandler("payment-pipeline", (pipelineBuilder, context) =>
    {
        var logger = context.ServiceProvider
            .GetRequiredService<ILogger<PaymentClient>>();

        // Outermost: Total timeout across all attempts
        pipelineBuilder.AddTimeout(TimeSpan.FromSeconds(10));

        // Retry with logging
        pipelineBuilder.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 3,
            Delay = TimeSpan.FromSeconds(1),
            BackoffType = DelayBackoffType.Exponential,
            UseJitter = true,
            ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                .Handle<HttpRequestException>()
                .HandleResult(response =>
                    response.StatusCode >= HttpStatusCode.InternalServerError ||
                    response.StatusCode == HttpStatusCode.RequestTimeout),
            OnRetry = args =>
            {
                logger.LogWarning(
                    "Payment API retry {Attempt}/{MaxAttempts}. Status: {Status}, Delay: {Delay}ms",
                    args.AttemptNumber,
                    3,
                    args.Outcome.Result?.StatusCode,
                    args.RetryDelay.TotalMilliseconds);
                return ValueTask.CompletedTask;
            }
        });

        // Circuit breaker with logging
        pipelineBuilder.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            FailureRatio = 0.5,
            MinimumThroughput = 10,
            SamplingDuration = TimeSpan.FromSeconds(30),
            BreakDuration = TimeSpan.FromSeconds(30),
            ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
                .Handle<HttpRequestException>()
                .HandleResult(response =>
                    response.StatusCode >= HttpStatusCode.InternalServerError ||
                    response.StatusCode == HttpStatusCode.RequestTimeout),
            OnOpened = args =>
            {
                logger.LogError(
                    "Payment API circuit breaker OPENED. Break duration: {Duration}s",
                    args.BreakDuration.TotalSeconds);
                return ValueTask.CompletedTask;
            },
            OnClosed = args =>
            {
                logger.LogInformation(
                    "Payment API circuit breaker CLOSED. Normal operation resumed.");
                return ValueTask.CompletedTask;
            },
            OnHalfOpened = args =>
            {
                logger.LogInformation(
                    "Payment API circuit breaker HALF-OPEN (testing recovery)");
                return ValueTask.CompletedTask;
            }
        });

        // Innermost: Timeout per attempt
        pipelineBuilder.AddTimeout(TimeSpan.FromSeconds(3));
    });
```
Critical detail: In Polly v8, the first strategy added is the outermost. The order above matches what `AddStandardResilienceHandler()` does internally: total timeout (outermost) → retry → circuit breaker → attempt timeout (innermost). When the circuit opens, retry sees the immediate rejection and fails fast.
When to use custom pipelines:
- Need logging on retry/circuit breaker events for production monitoring
- Different retry strategies for different error types
- More complex fallback logic
- Integration with monitoring/alerting systems
What to monitor in production:
- Retry attempt count (spike = dependency issues)
- Circuit breaker state changes (opened = dependency down)
- Timeout frequency (increase = dependency slow)
- Request success/failure ratio
- Total request duration (including retries)
Alert on:
- Circuit breaker opens (dependency is down)
- Retry rate >20% (dependency degraded)
- Timeout rate >10% (dependency slow)
⚠️ Common Gotchas
1. No Timeout = Thread Pool Exhaustion
The trap:
```csharp
// No timeout configured
builder.Services.AddHttpClient<IPaymentClient, PaymentClient>();
```
What happens:
The default `HttpClient` timeout is 100 seconds. The dependency hangs. Every thread blocks for 100 seconds. The thread pool exhausts. The site crashes.
The fix: Always configure a timeout. 3-5 seconds suits most external APIs.
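If you haven't adopted the resilience package yet, even the built-in `HttpClient.Timeout` property is far better than nothing. Note it covers the whole request, and a timeout surfaces as a `TaskCanceledException`:

```csharp
// Minimal stopgap: shrink HttpClient's own timeout from 100s to 5s.
builder.Services.AddHttpClient<IPaymentClient, PaymentClient>()
    .ConfigureHttpClient(client =>
    {
        // Applies to the entire request; fires as a TaskCanceledException
        client.Timeout = TimeSpan.FromSeconds(5);
    });
```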
2. Retry Without Backoff = DDoS Yourself
The trap:
```csharp
// Immediate retry (no delay)
options.Retry.Delay = TimeSpan.Zero;
options.Retry.BackoffType = DelayBackoffType.Constant;
```
What happens:
Dependency has brief outage. 1,000 requests fail simultaneously. All 1,000 retry immediately. Dependency gets hammered with 3,000 requests in 1 second. You DDoS the service you're trying to use.
The fix: Exponential backoff with jitter. Standard resilience handler does this by default.
3. Circuit Breaker Too Sensitive
The trap:
```csharp
// Opens after a single failure
options.CircuitBreaker.FailureRatio = 0.0; // Any failure opens it
options.CircuitBreaker.MinimumThroughput = 1;
```
What happens:
One transient failure. Circuit opens. All requests fail fast. Dependency recovers immediately but circuit stays open for 30 seconds. False positive outage.
The fix: Reasonable thresholds. A 50% failure ratio with a minimum of 10 requests is a good default.
4. No Fallback Strategy
The trap:
```csharp
var paymentResult = await _paymentClient.ChargeCustomerAsync(...);
if (!paymentResult.IsSuccess)
    return Result<OrderDto>.Failure("Payment failed"); // Dead end
```
What happens:
Payment API is down. Circuit breaker is open. Every order fails with "Payment failed." Revenue: $0.
The fix: Fallback strategy.
```csharp
var paymentResult = await _paymentClient.ChargeCustomerAsync(...);
if (!paymentResult.IsSuccess)
{
    // Fallback: Save the order, queue payment for later
    var pendingOrder = await _orderRepository.CreatePendingAsync(request);
    await _paymentQueue.EnqueueAsync(pendingOrder.Id);

    return Result<OrderDto>.Success(new OrderDto
    {
        Id = pendingOrder.Id,
        Status = "Pending Payment", // Background job will retry
        Message = "Order received. Payment processing..."
    });
}
```
Healthcare example:
Prescription pricing API down. Don't tell the patient "system unavailable, go away."

Fallback options:
1. Use cached pricing from the last successful call
2. Use average pricing for this medication
3. Flag for manual pharmacist review
4. Process the claim, adjust pricing later if needed
Patient gets medication. Revenue protected. System operational.
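The queued-payment fallback above implies a background job that drains the queue once the payment API recovers. Here's a rough sketch using a hosted service. The `IPaymentQueue` shape, plus `DequeueAsync`, `GetAsync`, and `MarkPaidAsync`, are hypothetical helpers invented for this illustration:

```csharp
// Sketch of a background retry worker for queued payments (assumed interfaces).
public class PendingPaymentWorker : BackgroundService
{
    private readonly IPaymentQueue _queue;
    private readonly IPaymentClient _paymentClient;
    private readonly IOrderRepository _orders;

    public PendingPaymentWorker(
        IPaymentQueue queue,
        IPaymentClient paymentClient,
        IOrderRepository orders)
    {
        _queue = queue;
        _paymentClient = paymentClient;
        _orders = orders;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            // Hypothetical queue API: blocks until a pending order ID arrives
            var orderId = await _queue.DequeueAsync(stoppingToken);
            var order = await _orders.GetAsync(orderId, stoppingToken);

            var payment = await _paymentClient.ChargeCustomerAsync(
                order.CustomerId, order.Total, stoppingToken);

            if (payment.IsSuccess)
                await _orders.MarkPaidAsync(orderId, stoppingToken);
            else
                await _queue.EnqueueAsync(orderId); // still failing: requeue for later
        }
    }
}
```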
🎓 Why These Patterns Matter
Every production outage I've investigated follows the same script:
Dependency fails → No resilience patterns → Cascade failure → Site down → Revenue lost
It doesn't matter if the dependency is a payment gateway, a claims processor, or a pricing API. The cascade is always the same. And every time, someone says "but the code was fine." The code was fine. The architecture wasn't.
Picture two e-commerce sites on Black Friday. Same payment gateway. Same 15-minute outage.
Site A has no resilience patterns. 100-second timeouts drain the thread pool in two minutes. The site goes dark for the entire outage plus recovery time. Customer support lines light up. Social media does the rest.
Site B has the patterns from this post. 3-second timeouts keep threads alive. Circuit breaker trips after the first wave of failures. Orders queue for later processing. The site stays up — degraded, not dead. When the gateway recovers, queued orders process automatically. Most customers never notice.
Same outage. One site lost its peak revenue hour. The other lost nothing.
Your controller is the last line of defense. When dependencies fail — and they will — your API can either cascade or degrade gracefully. These three patterns are the difference.
🧭 Key Takeaways
- Dependencies will fail in production — design for it
- Timeout: Fail fast (3-5 seconds), don't wait forever
- Retry: Handle transient failures with exponential backoff + jitter
- Circuit Breaker: Stop hammering dead dependencies
- Combine them: Use `AddStandardResilienceHandler()` in .NET 8+
- Monitor: Log when patterns activate, alert on circuit breaker opens
- Fallback: Don't just fail — queue, cache, or degrade gracefully
🚀 Next Steps
Review your HTTP clients:
- Do you have timeouts? Default 100 seconds will kill your thread pool
- Do you retry transient failures? Network blips shouldn't fail orders
- Do you have circuit breakers? Stop hammering dependencies that are down
- Can you monitor resilience? Log retry attempts, circuit breaker state
- What's your fallback? When payment API is down, what happens to orders?
Start with your most critical dependency (payment, auth, inventory). Add resilience. Test it (disable the dependency, watch your circuit breaker work).
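One low-effort way to run that test locally is a fault-injecting `DelegatingHandler` registered only in development. This is a sketch; the `ChaosHandler` name and the 50% failure rate are arbitrary choices:

```csharp
// Hypothetical fault injector for local testing. Register it *after*
// AddStandardResilienceHandler so the resilience pipeline wraps it and
// reacts to the injected failures:
//
//   builder.Services.AddTransient<ChaosHandler>();
//   builder.Services.AddHttpClient<IPaymentClient, PaymentClient>()
//       .AddStandardResilienceHandler()
//       .AddHttpMessageHandler<ChaosHandler>(); // development only
public class ChaosHandler : DelegatingHandler
{
    private static readonly Random _random = new();

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Fail ~50% of requests with 503 to exercise retries and the circuit breaker
        if (_random.NextDouble() < 0.5)
            return new HttpResponseMessage(HttpStatusCode.ServiceUnavailable);

        return await base.SendAsync(request, cancellationToken);
    }
}
```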
The APIs that survive Black Friday implement these patterns. The ones that crash don't.
Your controller is correct. Now make it resilient.
Related Posts:
- Building Professional WebAPI Controllers (Part 1) - boundaries, DTOs, validation, status codes
- Dependency Injection in ASP.NET Core - foundation for injecting HttpClients
- Configuration Management That Won't Get You Fired - managing API endpoints and timeouts via configuration
In the next post: Exception handling in production — why exceptions for control flow fail audits, how `Result<T>` patterns prevent it, and building error handling that survives regulatory review.