🧨 Exception Handling That Survives Production

When exceptions aren't exceptional

The Minnow had a perfectly good weather forecast. The tour was planned for three hours. The Skipper knew the route. But when the weather turned, there was no graceful degradation — just a catastrophic failure that stranded seven people on an island for three years.

If only the Skipper had used Result<WeatherForecast> instead of throwing StormException.

Your production API makes the same mistake every day. Not with weather forecasts — with user lookups, database queries, payment processing, claim validation. Every time you throw an exception for something that's not exceptional, you're betting your three-hour tour won't hit a storm.

Here's what actually happens when inexperienced developers handle exceptions the "obvious" way:

2:47 PM - Claims processing API running smoothly
2:52 PM - Database experiences momentary connection hiccup
2:53 PM - 500 concurrent claim lookups all throw ClaimNotFoundException
2:54 PM - Exception handler logs 500 stack traces (each 2KB)
2:55 PM - Log aggregation system falls behind processing 1MB of exception data
2:56 PM - Thread pool exhausted from exception overhead
2:57 PM - API response time degrades from 45ms to 3,500ms
2:58 PM - Load balancer removes instances (health checks timing out)
2:59 PM - Pharmacy staff calling support: "System is down, patients waiting"
3:00 PM - You're explaining to leadership why a temporary database hiccup took down claims processing

Your exception handling was "correct." The problem was you treated expected failures as exceptions.

💥 The Five Mistakes That Sink Production APIs

Every exception-handling outage I've investigated follows the same pattern. Smart, motivated, but inexperienced developers make the same five mistakes, not because they're careless, but because exception handling is one of those areas where the "obvious" approach quietly destroys reliability, observability, and maintainability.

Let's break down what goes wrong, why it matters, and how to fix it.

❌ Mistake #1: Catching Exception Everywhere

The rookie instinct:
"If I catch everything, nothing can break."

What inexperienced developers write:

[HttpGet("claims/{claimId}")]
public async Task<IActionResult> GetClaim(string claimId)
{
    try
    {
        var claim = await _claimService.GetClaimAsync(claimId);
        return Ok(claim);
    }
    catch (Exception ex)  // ❌ Catches EVERYTHING
    {
        _logger.LogError(ex, "Error getting claim");
        return BadRequest("Unable to retrieve claim");
    }
}

Looks defensive, right? You're catching exceptions, logging them, returning an error response. Code review approves it. It ships.

What actually happens in production:

This code catches:
- ClaimNotFoundException (expected - claim doesn't exist)
- SqlException (unexpected - database timeout)
- OutOfMemoryException (critical - process running out of resources)
- OperationCanceledException (expected - request cancelled)
- InvalidOperationException (could be anything - you can't tell)

You just treated "claim not found" the same as "server running out of memory."

Why it's harmful:

You lose the actual failure mode. Was it a bad claim ID? Database timeout? Network partition? The log says "Error getting claim" — useless.
You prevent higher-level handlers from doing their job. ASP.NET Core's exception-handling middleware can map exceptions to proper HTTP status codes, log them consistently, and handle other cross-cutting concerns in one place. You just bypassed all of it.
You catch exceptions that should never be caught. OutOfMemoryException is rarely recoverable. StackOverflowException terminates the process regardless — you can't even catch it. Blanket catch (Exception) gives you the illusion of safety while hiding critical failures.

The fix: Catch specific exceptions you can handle

[HttpGet("claims/{claimId}")]
public async Task<IActionResult> GetClaim(string claimId)
{
    // Don't catch anything here - let it bubble to middleware
    var claim = await _claimService.GetClaimAsync(claimId);
    return Ok(claim);
}

// Service layer - handle specific, recoverable exceptions
public async Task<Claim> GetClaimAsync(string claimId)
{
    try
    {
        return await _repository.GetClaimAsync(claimId);
    }
    catch (SqlException ex) when (ex.Number == -2) // Timeout
    {
        // Timeout is recoverable - retry or fail with context
        throw new ClaimServiceException(
            $"Database timeout retrieving claim {claimId}", ex);
    }
    // Let everything else bubble to global handler
}

Or better: Don't throw exceptions for expected outcomes at all. We'll get there.
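
To see the exception filter from the fix in isolation, here's a minimal sketch that runs without a database. StoreException and its IsTransient flag are hypothetical stand-ins for SqlException and its timeout error number, and ClaimServiceException is assumed to be a plain wrapper exception; none of these come from a real library.

```csharp
using System;

// Usage: the transient failure is wrapped with context; anything else bubbles untouched.
try
{
    ClaimLookup.Get("CLM-1", () => throw new StoreException("timed out", isTransient: true));
}
catch (ClaimServiceException ex)
{
    Console.WriteLine($"{ex.Message} (inner: {ex.InnerException?.GetType().Name})");
}

// Hypothetical stand-in for SqlException so the sketch runs without a database driver.
public class StoreException : Exception
{
    public bool IsTransient { get; }

    public StoreException(string message, bool isTransient) : base(message)
        => IsTransient = isTransient;
}

// Hypothetical service-level exception that carries context plus the original failure.
public class ClaimServiceException : Exception
{
    public ClaimServiceException(string message, Exception inner) : base(message, inner) { }
}

public static class ClaimLookup
{
    public static string Get(string claimId, Func<string> load)
    {
        try
        {
            return load();
        }
        // The filter decides whether this catch block applies at all.
        // A StoreException with IsTransient == false is never caught here.
        catch (StoreException ex) when (ex.IsTransient)
        {
            throw new ClaimServiceException(
                $"Transient store failure retrieving claim {claimId}", ex);
        }
    }
}
```

The point of the filter is that the catch block only ever sees the one failure it knows how to describe. Everything else keeps its original type and stack trace on the way to the global handler.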

🧹 Mistake #2: Swallowing Exceptions Without Logging or Action

The rookie instinct:
"If I catch it and keep going, the user won't notice."

What inexperienced developers write:

public async Task ProcessClaimsAsync(List<string> claimIds)
{
    foreach (var claimId in claimIds)
    {
        try
        {
            await ProcessClaimAsync(claimId);
        }
        catch (Exception ex)
        {
            // ❌ Swallowed - no log, no action, no indication this failed
            continue;
        }
    }
}

The thinking: "If one claim fails, keep processing the others. Don't let one bad claim stop the whole batch."

What actually happens in production:

A swallowed exception is a lost root cause. Three hours later, something else breaks and nobody can trace it back to the original failure.

Real scenario from healthcare claims processing:

Batch processes 10,000 prescription claims
Claim #3,847 has invalid formatting (exception thrown and swallowed)
System continues, marks batch as "complete"
Patient goes to pharmacy to pick up prescription
Pharmacy system shows "not covered" (claim never processed)
Patient calls insurance, 45-minute hold time
Customer service can't find the claim (it was swallowed)
Patient leaves without medication

Nobody knew claim #3,847 failed. The exception was swallowed. The batch was marked complete. The system continued in an undefined state.

Why it's harmful:

Debugging becomes guesswork. No log entry, no indication the failure occurred
State corruption becomes possible. System assumes operation succeeded when it didn't
Production issues become intermittent and impossible to reproduce. The failure happened hours ago; the symptom appears now

The fix: If you catch, you must act

public async Task ProcessClaimsAsync(List<string> claimIds)
{
    var failures = new List<(string ClaimId, Exception Error)>();

    foreach (var claimId in claimIds)
    {
        try
        {
            await ProcessClaimAsync(claimId);
        }
        catch (Exception ex)
        {
            // ✅ Log it
            _logger.LogError(ex, 
                "Failed to process claim {ClaimId}. Continuing batch.", claimId);

            // ✅ Track it
            failures.Add((claimId, ex));

            // ✅ Decide what to do
            // Option 1: Continue (logged for investigation)
            // Option 2: Add to dead letter queue for retry
            // Option 3: Fail fast if critical
        }
    }

    if (failures.Any())
    {
        // Report partial batch failure
        throw new BatchProcessingException(
            $"Batch completed with {failures.Count} failures", failures);
    }
}

If you catch an exception, you must:
- Log it, OR
- Transform it (wrap with context), OR
- Retry it, OR
- Add to dead letter queue, OR
- Fail fast

Doing nothing is never acceptable.
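
The fix above throws a BatchProcessingException that this post never defines. One possible shape for it, as a sketch: the type and its members are my assumption, built to match the constructor call in the fix, and the AggregateException inner exception is a design choice, not a requirement.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Usage: one failed claim out of a batch, reported with its ID and cause.
var failures = new List<(string ClaimId, Exception Error)>
{
    ("CLM-3847", new FormatException("Invalid member ID format"))
};

var batchError = new BatchProcessingException(
    $"Batch completed with {failures.Count} failures", failures);

Console.WriteLine(batchError.Message);
Console.WriteLine(string.Join(", ", batchError.FailedClaimIds));

// Hypothetical exception type: keeps every failed claim ID next to its cause.
public class BatchProcessingException : Exception
{
    public IReadOnlyList<(string ClaimId, Exception Error)> Failures { get; }

    public BatchProcessingException(
        string message,
        IReadOnlyCollection<(string ClaimId, Exception Error)> failures)
        // AggregateException keeps the original exceptions visible to loggers
        // that walk InnerException.
        : base(message, new AggregateException(failures.Select(f => f.Error)))
    {
        Failures = failures.ToList();
    }

    // Convenience view for alerts, dashboards, or a retry queue.
    public IEnumerable<string> FailedClaimIds => Failures.Select(f => f.ClaimId);
}
```

With something like this, whoever catches the batch failure can answer "which claims?" without digging through logs.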

🔄 Mistake #3: Using Exceptions for Control Flow

This is the big one. This is why we need Result.

The rookie instinct:
"Exceptions are just another way to branch logic."

What inexperienced developers write:

public async Task<Claim> GetClaimAsync(string claimId)
{
    var claim = await _repository.FindAsync(claimId);

    if (claim == null)
        throw new ClaimNotFoundException(claimId);  // ❌ Expected outcome

    return claim;
}

public async Task<Prescription> GetPrescriptionAsync(int prescriptionId)
{
    var prescription = await _repository.FindAsync(prescriptionId);

    if (prescription == null)
        throw new PrescriptionNotFoundException(prescriptionId);  // ❌ Expected outcome

    if (prescription.Refills == 0)
        throw new NoRefillsRemainingException(prescriptionId);  // ❌ Expected outcome

    if (prescription.ExpirationDate < DateTime.UtcNow)
        throw new PrescriptionExpiredException(prescriptionId);  // ❌ Expected outcome

    return prescription;
}

Looks like error handling, right? You're checking conditions, throwing typed exceptions, providing context.

What actually happens in production:

Healthcare claims system processes prescription lookups. During peak hours (4-6 PM, people picking up medications after work):

10,000 prescription lookups per hour
15% don't exist in system (entered wrong ID, typo, etc.)
1,500 PrescriptionNotFoundException exceptions per hour
Each exception: Stack unwind, object allocation, logging overhead
Performance degrades from 45ms → 180ms per lookup
Thread pool pressure increases
Pharmacy staff waiting
Patients waiting

You just made "prescription not found" as expensive as a database timeout.

Why it's harmful:

Performance under load. Exceptions are expensive: every throw pays for stack unwinding, object allocation, and usually logging. Using them for expected outcomes (user not found, item out of stock, validation failure) destroys performance.
Harder-to-read logic. Is "not found" exceptional? Or is it a normal outcome that should be handled in normal flow?
Confusion between "expected" and "unexpected." When you use exceptions for both, you lose the signal. Every exception becomes noise.
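
If you want to see the cost gap on your own hardware, here's a rough sketch that times both ways of signaling "not found". PrescriptionLookup and both of its methods are hypothetical, and a Stopwatch loop is not a rigorous benchmark, so treat the numbers as a relative comparison only and reach for a tool like BenchmarkDotNet for real measurements.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

const int Iterations = 100_000;

// Path 1: signal "not found" by throwing and catching.
var thrownMisses = 0;
var throwTimer = Stopwatch.StartNew();
for (var i = 0; i < Iterations; i++)
{
    try { PrescriptionLookup.FindOrThrow(i); }
    catch (KeyNotFoundException) { thrownMisses++; }
}
throwTimer.Stop();

// Path 2: signal "not found" through the return value.
var returnedMisses = 0;
var returnTimer = Stopwatch.StartNew();
for (var i = 0; i < Iterations; i++)
{
    if (!PrescriptionLookup.TryFind(i, out _)) returnedMisses++;
}
returnTimer.Stop();

Console.WriteLine($"throw/catch: {thrownMisses} misses in {throwTimer.ElapsedMilliseconds} ms");
Console.WriteLine($"return:      {returnedMisses} misses in {returnTimer.ElapsedMilliseconds} ms");

// Hypothetical lookup where every ID is missing, so each call takes the failure path.
public static class PrescriptionLookup
{
    public static string FindOrThrow(int id)
        => throw new KeyNotFoundException($"Prescription {id} not found");

    public static bool TryFind(int id, out string value)
    {
        value = string.Empty;
        return false;
    }
}
```

Run it in a Release build without a debugger attached. Debuggers make thrown exceptions far slower and would exaggerate the difference.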

The shift:

From: "Exceptions for everything"
To: "Exceptions for unexpected failures, Result for expected outcomes"

Ask yourself: "If this happens 100 times per hour during normal operation, is it exceptional?"

User enters wrong claim ID → Not exceptional, happens constantly
Database connection timeout → Exceptional, should rarely happen
Prescription has zero refills → Not exceptional, normal validation
OutOfMemoryException → Exceptional, critical failure

The fix: Result for expected failures

public async Task<Result<Claim>> GetClaimAsync(string claimId)
{
    var claim = await _repository.FindAsync(claimId);

    if (claim == null)
        return Result<Claim>.Failure($"Claim {claimId} not found");  // ✅ Expected outcome

    return Result<Claim>.Success(claim);
}

public async Task<Result<Prescription>> GetPrescriptionAsync(int prescriptionId)
{
    var prescription = await _repository.FindAsync(prescriptionId);

    if (prescription == null)
        return Result<Prescription>.Failure(
            $"Prescription {prescriptionId} not found");

    if (prescription.Refills == 0)
        return Result<Prescription>.Failure(
            $"Prescription {prescriptionId} has no refills remaining");

    if (prescription.ExpirationDate < DateTime.UtcNow)
        return Result<Prescription>.Failure(
            $"Prescription {prescriptionId} expired on {prescription.ExpirationDate:d}");

    return Result<Prescription>.Success(prescription);
}

What this does:

Expected failures: Cheap, lightweight, fast
Clear intent: Caller knows to check IsSuccess
No performance penalty: No stack unwinding, minimal allocation
Better logging: Only log actual exceptions (unexpected failures)
Cleaner code: Success/failure is part of normal flow

Controller integration:

[HttpGet("prescriptions/{prescriptionId}")]
public async Task<IActionResult> GetPrescription(int prescriptionId)
{
    var result = await _prescriptionService.GetPrescriptionAsync(prescriptionId);

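    // Teaching shortcut: every failure maps to 404 here, including
    // "no refills remaining" and "expired", which are not really "not found".
    if (!result.IsSuccess)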
        return NotFound(new ProblemDetails 
        { 
            Title = "Prescription not found",
            Detail = result.Error,
            Status = 404
        });

    return Ok(result.Value);
}

Now let's build Result properly.

🏗️ Building Result: The Right Way

You've seen Result<T> used in the WebAPI posts. Time to build it.

What Result solves:

Expected failures (not found, validation, business rule violations) without exceptions
Explicit success/failure handling in type system
Cheap and fast: no stack unwinding, just a small object allocation
Makes both paths visible to callers

Complete implementation:

public class Result<T>
{
    public bool IsSuccess { get; }
    public T Value { get; }
    public string Error { get; }

    private Result(bool isSuccess, T value, string error)
    {
        IsSuccess = isSuccess;
        Value = value;
        Error = error;
    }

    public static Result<T> Success(T value)
    {
        return new Result<T>(true, value, null);
    }

    public static Result<T> Failure(string error)
    {
        return new Result<T>(false, default, error);
    }
}

Note: This is the teaching version. In production, you'd want accessing .Value on a failed result to throw InvalidOperationException — forcing callers to check IsSuccess first instead of silently getting default(T). Libraries like FluentResults or ErrorOr handle this and more.
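
Here's a sketch of what that guarded version might look like if you'd rather write it yourself than pull in a library. It's one reasonable shape, not how FluentResults or ErrorOr do it. The empty-string Error on success and the argument check in Failure are my own choices.

```csharp
using System;

// Usage: reading Value is only safe after checking IsSuccess.
var found = Result<string>.Success("CLM-38474");
var missing = Result<string>.Failure("Claim CLM-00000 not found");

Console.WriteLine(found.IsSuccess ? found.Value : found.Error);
Console.WriteLine(missing.IsSuccess ? missing.Value : missing.Error);
// Reading missing.Value directly would throw InvalidOperationException.

public sealed class Result<T>
{
    private readonly T _value;

    public bool IsSuccess { get; }
    public string Error { get; }

    // Guarded accessor: a failed result has no value to hand out.
    public T Value => IsSuccess
        ? _value
        : throw new InvalidOperationException(
            $"Cannot read Value of a failed result: {Error}");

    private Result(bool isSuccess, T value, string error)
    {
        IsSuccess = isSuccess;
        _value = value;
        Error = error;
    }

    public static Result<T> Success(T value)
        => new Result<T>(true, value, string.Empty);

    public static Result<T> Failure(string error)
    {
        // A failure without a reason is as unhelpful as an empty exception message.
        if (string.IsNullOrWhiteSpace(error))
            throw new ArgumentException("A failure needs an error message.", nameof(error));

        return new Result<T>(false, default!, error);
    }
}
```

The throw inside Value is an exception for a programming mistake (forgetting to check IsSuccess), which is exactly the kind of unexpected failure exceptions are for.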

Usage patterns:

// Service layer - return Result<T>
public async Task<Result<Claim>> GetClaimAsync(string claimId)
{
    if (string.IsNullOrWhiteSpace(claimId))
        return Result<Claim>.Failure("Claim ID is required");

    var claim = await _repository.FindAsync(claimId);

    if (claim == null)
        return Result<Claim>.Failure($"Claim {claimId} not found");

    if (claim.Status == ClaimStatus.Cancelled)
        return Result<Claim>.Failure($"Claim {claimId} has been cancelled");

    return Result<Claim>.Success(claim);
}

// Controller - handle Result<T>
[HttpGet("claims/{claimId}")]
public async Task<IActionResult> GetClaim(string claimId)
{
    var result = await _claimService.GetClaimAsync(claimId);

    if (!result.IsSuccess)
        return NotFound(new ProblemDetails
        {
            Title = "Claim not found",
            Detail = result.Error,
            Status = 404
        });

    return Ok(result.Value);
}
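
One gap in the controller above: it returns 404 for every failure, including "Claim ID is required" and "has been cancelled". A string-only error can't tell those apart. Here's a sketch of one way around that, assuming you add an error kind next to the message. ErrorKind, Outcome, ClaimRules, and StatusCodeMap are all hypothetical names, and mapping "cancelled" to 409 is one reasonable choice among several.

```csharp
using System;

// Usage: a cancelled claim is reported as a conflict, not as "not found".
var outcome = ClaimRules.Check(claimId: "CLM-1", exists: true, cancelled: true);
Console.WriteLine($"{StatusCodeMap.For(outcome.Kind)}: {outcome.Message}");

public enum ErrorKind { None, Validation, NotFound, Conflict }

// Hypothetical result type that records what kind of failure occurred.
public sealed class Outcome
{
    public ErrorKind Kind { get; }
    public string Message { get; }
    public bool IsSuccess => Kind == ErrorKind.None;

    private Outcome(ErrorKind kind, string message)
    {
        Kind = kind;
        Message = message;
    }

    public static Outcome Ok() => new Outcome(ErrorKind.None, string.Empty);
    public static Outcome Fail(ErrorKind kind, string message) => new Outcome(kind, message);
}

public static class ClaimRules
{
    // Same checks as the service method above, with the lookup result passed in
    // so the sketch runs without a repository.
    public static Outcome Check(string claimId, bool exists, bool cancelled)
    {
        if (string.IsNullOrWhiteSpace(claimId))
            return Outcome.Fail(ErrorKind.Validation, "Claim ID is required");
        if (!exists)
            return Outcome.Fail(ErrorKind.NotFound, $"Claim {claimId} not found");
        if (cancelled)
            return Outcome.Fail(ErrorKind.Conflict, $"Claim {claimId} has been cancelled");
        return Outcome.Ok();
    }
}

public static class StatusCodeMap
{
    // One place that decides which HTTP status each kind of failure gets.
    public static int For(ErrorKind kind) => kind switch
    {
        ErrorKind.None => 200,
        ErrorKind.Validation => 400,
        ErrorKind.NotFound => 404,
        ErrorKind.Conflict => 409,
        _ => 500
    };
}
```

In a real controller the switch would return BadRequest, NotFound, or Conflict results instead of raw integers. The idea is the same: the service says what went wrong, the controller decides how HTTP should express it.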

When to use Result vs exceptions:

Use Result for:
- Not found (user, claim, prescription, order)
- Validation failures (invalid input, business rule violations)
- Business logic failures (insufficient funds, expired prescription, duplicate submission)
- Any failure that happens during normal operation

Use exceptions for:
- Database connection failures
- Network timeouts
- OutOfMemoryException (you can't recover anyway, and StackOverflowException kills the process before you get the chance)
- Third-party API failures (unexpected)
- File system failures (disk full, permission denied)
- Any failure that indicates the system is in an abnormal state

The test: If it happens 100+ times per hour during normal operation, it's not exceptional — use Result.

🧱 Mistake #4: Re-Throwing Exceptions Incorrectly

The rookie instinct:
"I'll just rethrow it with throw ex;"

What inexperienced developers write:

public async Task<Claim> ProcessClaimAsync(ClaimRequest request)
{
    try
    {
        var claim = await _externalClaimService.SubmitClaimAsync(request);
        return claim;
    }
    catch (Exception ex)
    {
        _logger.LogError(ex, "Failed to process claim");
        throw ex;  // ❌ DESTROYS STACK TRACE
    }
}

What happens:

System.Exception: Claim service unavailable
   at ClaimService.ProcessClaimAsync(ClaimRequest request) in ClaimService.cs:line 47

You just lost the original stack trace. The log points to line 47 (your rethrow), not the actual failure point deep inside _externalClaimService. The real failure could be on line 1,247 of a third-party library — you'll never know.

Why it's harmful:

Logs point to the wrong location
Root causes become harder to identify
Debugging time increases dramatically
Production issues take longer to resolve

The fix: Preserve the stack trace

Option 1: Just throw (no variable)

try
{
    var claim = await _externalClaimService.SubmitClaimAsync(request);
    return claim;
}
catch (Exception ex)
{
    _logger.LogError(ex, "Failed to process claim for {ClaimId}", request.ClaimId);
    throw;  // ✅ Preserves full stack trace
}

Option 2: Wrap with context (keep inner exception)

try
{
    var claim = await _externalClaimService.SubmitClaimAsync(request);
    return claim;
}
catch (Exception ex)
{
    throw new ClaimProcessingException(
        $"Failed to process claim {request.ClaimId} for patient {request.PatientId}",
        ex);  // ✅ Wraps with context, preserves original as InnerException
}

Now your log shows:

ClaimProcessingException: Failed to process claim CLM-38474 for patient PT-92847
  InnerException: HttpRequestException: Connection timeout
    at ExternalClaimService.SubmitClaimAsync() in ExternalClaimService.cs:line 1247
    at ClaimService.ProcessClaimAsync() in ClaimService.cs:line 47

You have both: The context (which claim, which patient) AND the original failure point (line 1247).
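
You can check the difference between throw; and throw ex; yourself with a small experiment. This sketch uses made-up method names and simply looks for the originating method's name in the stack trace text. NoInlining is there so the JIT can't fold the failing method into its caller and blur the result.

```csharp
using System;
using System.Runtime.CompilerServices;

Console.WriteLine($"throw;    keeps origin frame: {Rethrow.TraceMentionsOrigin(Rethrow.WithThrow)}");
Console.WriteLine($"throw ex; keeps origin frame: {Rethrow.TraceMentionsOrigin(Rethrow.WithThrowEx)}");

public static class Rethrow
{
    // Stands in for the real failure point deep inside another component.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void FailDeepInside()
        => throw new InvalidOperationException("Claim service unavailable");

    public static void WithThrow()
    {
        try { FailDeepInside(); }
        catch (Exception) { throw; }       // rethrows, original frames kept
    }

    public static void WithThrowEx()
    {
        try { FailDeepInside(); }
        catch (Exception ex) { throw ex; } // resets the trace to this line
    }

    // Runs the action and reports whether the caught exception's stack trace
    // still names the method where the failure started.
    public static bool TraceMentionsOrigin(Action rethrowing)
    {
        try
        {
            rethrowing();
            return false;
        }
        catch (Exception ex)
        {
            return ex.StackTrace?.Contains(nameof(FailDeepInside)) ?? false;
        }
    }
}
```

If you ever need to rethrow a captured exception somewhere other than inside its catch block, System.Runtime.ExceptionServices.ExceptionDispatchInfo.Capture(ex).Throw() preserves the original trace too.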

🧪 Mistake #5: Handling Exceptions at the Wrong Layer

The rookie instinct:
"Every layer should handle its own exceptions."

What inexperienced developers write:

// Repository layer
public async Task<Claim> GetClaimAsync(string claimId)
{
    try
    {
        return await _context.Claims.FindAsync(claimId);
    }
    catch (SqlException ex)
    {
        // ❌ Repository throws an HTTP exception type?
        throw new HttpRequestException("Database error", ex);
    }
}

// Service layer
public async Task<Claim> GetClaimAsync(string claimId)
{
    try
    {
        return await _repository.GetClaimAsync(claimId);
    }
    catch (Exception ex)
    {
        // ❌ Log-and-rethrow in every service method duplicates logging logic
        _logger.LogError(ex, "Error");
        throw;
    }
}

// Controller layer
[HttpGet("claims/{claimId}")]
public async Task<IActionResult> GetClaim(string claimId)
{
    try
    {
        var claim = await _claimService.GetClaimAsync(claimId);
        return Ok(claim);
    }
    catch (Exception ex)
    {
        // ❌ Every failure becomes 400, and ex.Message leaks to the client
        return BadRequest(ex.Message);
    }
}

What's wrong here:

Repository layer knows about HTTP concepts (HttpRequestException)
Service layer duplicates logging logic
Controller layer returns BadRequest for everything (database timeout = 400?)
Every layer handles exceptions differently
No consistent error response format

Why it's harmful:

Error handling becomes scattered and inconsistent
Layers leak implementation details
You lose the ability to enforce global policies
Impossible to change exception handling strategy without touching every layer

The fix: Handle exceptions at the highest appropriate level

Use middleware for truly unexpected exceptions: the ones that shouldn't happen during normal operation. Expected failures (not found, validation) are handled by Result<T> in the service layer and never reach middleware.

// Program.cs - Global exception handler for unexpected failures only
app.UseExceptionHandler(errorApp =>
{
    errorApp.Run(async context =>
    {
        var exceptionHandler = context.Features.Get<IExceptionHandlerFeature>();
        var exception = exceptionHandler?.Error;
        var logger = context.RequestServices.GetRequiredService<ILogger<Program>>();

        logger.LogError(exception,
            "Unhandled exception on {Method} {Path}",
            context.Request.Method, context.Request.Path);

        var problemDetails = exception switch
        {
            SqlException ex when ex.Number == -2 => new ProblemDetails
            {
                Title = "Service temporarily unavailable",
                Detail = "Please retry in a few seconds",
                Status = StatusCodes.Status503ServiceUnavailable
            },
            _ => new ProblemDetails
            {
                Title = "Internal server error",
                Detail = "An unexpected error occurred",
                Status = StatusCodes.Status500InternalServerError
            }
        };

        context.Response.StatusCode = problemDetails.Status ?? 500;
        await context.Response.WriteAsJsonAsync(problemDetails);
    });
});
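
The switch inside that handler is the part most worth unit testing, and it's easier to test once it lives in a plain function. Here's a sketch under a few assumptions: ErrorMapping is a hypothetical helper, TimeoutException stands in for the SqlException timeout case so the code runs without a database driver, and 499 for cancelled requests is a non-standard convention borrowed from nginx, not an official HTTP status.

```csharp
using System;

// Usage: the handler would call this and copy the values into ProblemDetails.
var (status, title) = ErrorMapping.ToStatus(new TimeoutException("store timed out"));
Console.WriteLine($"{status} {title}");

public static class ErrorMapping
{
    // Pure function: exception in, HTTP status and title out. No HttpContext needed.
    public static (int Status, string Title) ToStatus(Exception exception) => exception switch
    {
        // Stand-in for "SqlException ex when ex.Number == -2" in the handler above.
        TimeoutException => (503, "Service temporarily unavailable"),

        // The client went away; 499 is an nginx convention, not a standard code.
        OperationCanceledException => (499, "Client closed request"),

        // Anything else (including null) is an unexpected failure.
        _ => (500, "Internal server error")
    };
}
```

Keeping the mapping separate from the HTTP plumbing means a new exception type is a one-line change with a one-line test.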

Now your layers are clean:

// Repository - just data access
public async Task<Claim?> FindClaimAsync(string claimId)
{
    return await _context.Claims.FindAsync(claimId);
    // Let exceptions bubble - repository doesn't handle them
}

// Service - business logic + Result<T>
public async Task<Result<Claim>> GetClaimAsync(string claimId)
{
    var claim = await _repository.FindClaimAsync(claimId);

    if (claim == null)
        return Result<Claim>.Failure($"Claim {claimId} not found");

    if (claim.Status == ClaimStatus.Cancelled)
        return Result<Claim>.Failure($"Claim {claimId} has been cancelled");

    return Result<Claim>.Success(claim);
}

// Controller - just HTTP concerns
[HttpGet("claims/{claimId}")]
public async Task<IActionResult> GetClaim(string claimId)
{
    var result = await _claimService.GetClaimAsync(claimId);

    if (!result.IsSuccess)
        return NotFound(new ProblemDetails 
        { 
            Title = "Claim not found",
            Detail = result.Error 
        });

    return Ok(result.Value);
}

What this gives you:

Consistent error responses across entire API
Single place to change exception → HTTP status mapping
Centralized logging
Clean layer separation
Easy to add retry policies, circuit breakers, etc.

⚠️ Bonus Mistake: Useless Exception Messages

This one isn't talked about in enterprise circles because it feels too "basic," but it's everywhere:

throw new Exception();  // ❌
throw new Exception("Error occurred");  // ❌
throw new Exception("Something went wrong");  // ❌
throw new InvalidOperationException("Invalid operation");  // ❌

These tell you nothing. They're worse than no exception because they give the illusion of information without providing any.

What happens in production:

[Error] System.Exception: Error occurred
   at ClaimService.ProcessClaimAsync() in ClaimService.cs:line 847

You know:
- Something failed on line 847
- It was "an error"

You don't know:
- Which claim ID
- What operation was being performed
- What state the system was in
- Why it failed
- What to do about it

The fix: Actionable, contextual messages

throw new InvalidClaimStateException(
    $"Claim {claimId} cannot transition from {currentState} to {requestedState}. " +
    $"Valid transitions from {currentState} are: {string.Join(", ", validTransitions)}");

Now your log shows:

[Error] InvalidClaimStateException: 
  Claim CLM-38474 cannot transition from Submitted to Cancelled. 
  Valid transitions from Submitted are: Pending, Approved, Denied
   at ClaimService.ProcessClaimAsync() in ClaimService.cs:line 847

You know:
- Which claim (CLM-38474)
- What was attempted (Submitted → Cancelled)
- Why it failed (invalid transition)
- What's valid (Pending, Approved, Denied)
- Where to look (line 847)

A good exception message answers:
- What failed: "Claim CLM-38474 transition"
- Why it failed: "Cannot transition from Submitted to Cancelled"
- What was being attempted: "Transition to Cancelled state"
- What's valid: "Valid transitions: Pending, Approved, Denied"
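
One way to make messages like this the default is to have the exception type build them. Here's a sketch of what InvalidClaimStateException could look like. The constructor shape and the properties are my assumptions; states are plain strings here to keep the example small.

```csharp
using System;
using System.Collections.Generic;

var error = new InvalidClaimStateException(
    "CLM-38474", "Submitted", "Cancelled", new[] { "Pending", "Approved", "Denied" });

Console.WriteLine(error.Message);

// Hypothetical exception type: callers pass facts, the type writes the message.
public class InvalidClaimStateException : Exception
{
    public string ClaimId { get; }
    public string CurrentState { get; }
    public string RequestedState { get; }
    public IReadOnlyList<string> ValidTransitions { get; }

    public InvalidClaimStateException(
        string claimId,
        string currentState,
        string requestedState,
        IReadOnlyList<string> validTransitions)
        : base($"Claim {claimId} cannot transition from {currentState} to {requestedState}. " +
               $"Valid transitions from {currentState} are: {string.Join(", ", validTransitions)}")
    {
        // Keep the facts as properties too, so structured logging can index them.
        ClaimId = claimId;
        CurrentState = currentState;
        RequestedState = requestedState;
        ValidTransitions = validTransitions;
    }
}
```

Nobody can throw this exception with "Error occurred" as the message, because the constructor doesn't accept a message at all.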

This is the difference between "junior code" and "production-ready engineering."

🎓 Why These Patterns Matter

I've debugged production incidents at healthcare companies processing millions of claims and at real estate platforms serving tens of millions of homes. Every exception-handling outage follows the same script:

Expected failure treated as exception → Exception storm under load → Performance degrades → System fails

Picture two healthcare claims APIs on a busy Monday morning. Same database timeout affecting both.

System A uses exceptions for everything:
- "Claim not found" throws ClaimNotFoundException
- Database timeout throws SqlException
- Both handled the same way (catch Exception)
- Under load: 5,000 "not found" lookups + 50 database timeouts = 5,050 exceptions
- Performance tanks from exception overhead
- Thread pool pressure from stack unwinding
- Logs flooded with stack traces
- Pharmacy staff can't process claims
- Patients waiting for prescriptions

System B uses Result for expected failures:
- "Claim not found" returns Result<Claim>.Failure()
- Database timeout throws SqlException (unexpected, should be rare)
- Under load: 5,000 "not found" lookups return fast (no exception overhead) + 50 timeouts logged and handled by middleware
- Performance stays stable
- Logs show only actual problems (50 timeouts)
- System operational (degraded but functional)
- Patients get prescriptions

Same database timeout. One system treats expected failures cheaply. The other treats everything as exceptional.

Your exception handling strategy is the difference.

🧭 Key Takeaways

Don't catch Exception — catch specific exceptions you can handle, let the rest bubble
Don't swallow exceptions — log, transform, retry, or fail fast. Never do nothing.
Don't use exceptions for control flow — if it happens 100+ times/hour, it's not exceptional
Use Result for expected failures — not found, validation, business rules
Use exceptions for unexpected failures — database timeouts, network failures, system errors
Preserve stack traces — use throw; not throw ex;
Handle exceptions at the right layer — middleware for cross-cutting, not in every method
Write actionable exception messages — what failed, why, what was attempted, what's valid

🚀 Next Steps

Review your exception handling:

Are you catching Exception everywhere? Replace with specific types or remove the catch
Are you throwing exceptions for "not found"? Replace with Result
Are you swallowing exceptions? Add logging, dead letter queues, or fail fast
Are your exception messages useful? Add context: what, why, where
Is exception handling scattered across layers? Move to middleware

Start with your most-called endpoints (user lookup, item search, claim validation). Replace exception-based "not found" with Result. Watch your logs get cleaner and your performance improve.

The APIs that survive production don't throw exceptions at expected failures. They use the right tool for the job.

The Skipper's exception handling sank the Minnow. Don't let yours sink your API.

Related Posts:
- Building Professional WebAPI Controllers - See Result used in production controllers
- When the CRUD Hits the Fan: Resilient Controllers - Exception handling + resilience patterns
- Dependency Injection in ASP.NET Core - Foundation for injecting services properly

In the next post: Unit testing with NSubstitute — testing code that has dependencies without mocking everything, and the patterns that make tests maintainable instead of brittle.
