Fail Fast or Fail Smart? Lessons from Real-World Software Delivery

Hossein Bakhtiari

Jun 06, 2025

Failing is an essential part of almost every learning process. When it’s planned and under control, we learn safely, even gently. But when failure is unplanned, it becomes unpleasant, chaotic, and sometimes a nightmare.

Take flight training as an example. Failing in a jet simulator is expected—it’s where real learning happens. You can crash, review, and try again. But if you make the same mistake in a real jet, it’s not a learning experience anymore—it’s a disaster. The cost is too high.

The same is true in software delivery.

Automated tests and Continuous Delivery pipelines are our flight simulators. They give us a safe zone to fail, catch mistakes early, and fix them with confidence.

But when we don’t catch issues until it’s too late, those failures become painful and costly.

Think about that moment when you say in the daily stand-up:

“My story is ready to deploy.”

And just hours later, you realize something's broken—maybe a test was skipped, maybe the integration wasn’t complete. You end up spending the entire afternoon under pressure, trying to patch things together before production explodes.

There’s no time to reflect, no room to debug carefully. It’s not just frustrating—it’s stressful, and it teaches the wrong lesson:

Don’t trust deployments. Don’t deploy on Fridays. Or Mondays. Or without someone holding their breath.

That kind of failure doesn't improve the system. It just makes us more afraid to change it.

This is why failing fast is important—but failing smart is critical.

In the rest of this post, I’ll explore that difference by walking through:

  1. How we approached failure and learning in a complex model-driven system using tests.
  2. How we applied the same mindset to optimize our CI/CD pipeline for insight, not just speed.

Failing Fast in Long-Running Models

Recently, I’ve been working on a modeling system that involves a huge number of calculations. These calculations don’t just run once—they span days, sometimes even years of simulation.

We had unit tests for individual calculations, and they did a good job of verifying isolated methods. But in practice, that wasn’t enough.

Why? Because most calculations are interdependent. One incorrect result doesn’t stay isolated—it propagates forward. If a calculation goes wrong on Day 1, multiple calculations on Day 2 will be wrong as well, and the issue snowballs. A single defect can contaminate the entire simulation output.

Our system consists of multiple services running in separate containers in a Kubernetes cluster. Given an input, the model runs over time and produces an output:

  • One row per day
  • Multiple columns per day, one for each calculation

Each user story is validated using Excel files that define the expected results for given inputs.
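As a rough illustration, one row of that output could be modeled like this (the type and property names below are hypothetical, not the project’s actual code):

```csharp
// Hypothetical shape of one output row: one row per simulated day,
// with one value per calculation column, mirroring the story's Excel sheet.
public sealed record DailyResult(
    DateOnly Day,
    IReadOnlyDictionary<string, decimal> Calculations);
```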

We decided to build reusable integration tests that run across the necessary components and validate the output against the Excel sheets.

But here’s the challenge:

Speed matters. If the tests are slow, you can’t write many of them. If you can’t write many, you limit your ability to fail fast.

We initially considered spinning up the full Kubernetes cluster—including all containers—to simulate these runs. Even using TestContainers was too slow. And not all services were relevant; many were just infrastructure or orchestration layers.

So we rethought our approach.

Instead of deploying everything, we used .NET’s WebApplicationFactory to spin up only the essential components directly in memory and let them communicate via HTTP as they normally would. The result?

We could simulate multiple components over long time spans and complex conditions in milliseconds.
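To give a feel for the setup, here’s a minimal sketch of such a test. The entry-point type, route, and payload are placeholders rather than the project’s real code:

```csharp
using System.Collections.Generic;
using System.Net.Http.Json;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc.Testing;
using Xunit;

public class StoryAcceptanceTests
{
    [Fact]
    public async Task Two_year_simulation_matches_the_expected_results()
    {
        // Host only the essential component in-memory: no cluster, no containers.
        // "ModelService.Program" stands in for the real entry point.
        using var factory = new WebApplicationFactory<ModelService.Program>();
        using var client = factory.CreateClient(); // in-memory HTTP, same contract as production

        var response = await client.PostAsJsonAsync("/simulations", new { Days = 730 });
        response.EnsureSuccessStatusCode();

        var rows = await response.Content.ReadFromJsonAsync<List<DailyResult>>();
        Assert.NotNull(rows);
        // ...compare each row against the expected values from the story's Excel sheet.
    }
}
```

When several components need to talk to each other, each one can be hosted by its own factory and wired together through the in-memory handlers those factories expose, so the HTTP calls never leave the process.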

That speed unlocked a powerful capability:

  • We could run many acceptance tests.
  • Each test mapped directly to a user story.
  • Each test compared the system’s output to the user story’s expected Excel result.
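The expected values come straight from the story’s spreadsheet, so each test needs a small loader. This post doesn’t say how the Excel files are read; the sketch below assumes a library such as ClosedXML and reuses the hypothetical DailyResult record from earlier:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using ClosedXML.Excel;

public static class ExpectedResults
{
    // Illustrative loader for a sheet shaped like: Day | Calculation A | Calculation B | ...
    public static IReadOnlyList<DailyResult> Load(string path)
    {
        using var workbook = new XLWorkbook(path);
        var sheet = workbook.Worksheet(1);
        var headers = sheet.Row(1).CellsUsed().Select(c => c.GetString()).ToList();

        var rows = new List<DailyResult>();
        foreach (var row in sheet.RowsUsed().Skip(1)) // skip the header row
        {
            var calculations = new Dictionary<string, decimal>();
            for (var column = 2; column <= headers.Count; column++)
                calculations[headers[column - 1]] = row.Cell(column).GetValue<decimal>();

            rows.Add(new DailyResult(DateOnly.FromDateTime(row.Cell(1).GetDateTime()), calculations));
        }

        return rows;
    }
}
```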

We had built a fast, lightweight simulation framework. We could now fail fast—but after a while, I started asking a new question:

Are We Really Learning from These Failures?

The tests were fast—but the feedback wasn’t always helpful.

Here’s an example of a failure message:

"The result does not match for Day 2, field 'Field One'. Expected X, but found Y."

That’s fine for quick checks. But when a failure occurred, I often had to debug, step through logic, and figure out the root cause manually. That felt like wasted time.

The more I rely on the debugger, the more I feel there’s something wrong with my design.

So I asked myself:

Is failing fast enough? Or do I need to fail smarter?

Moving from Fast Failures to Smart Failures

I didn’t want to just see what failed—I wanted to understand why.

The breakthrough was simple: don’t stop at the first mismatch. Instead, validate the entire result set and aggregate the failures.

Suddenly, I had more insight:

  • If Calculation A and Calculation B failed, and B depends on A, I could skip investigating B—because the root cause was A.
  • If Day 2 failed, but Day 1 was fine, I had a starting point to isolate the issue.

In C#, there are a couple of ways to do this:

  • If you use FluentAssertions, you can wrap your assertions in an AssertionScope, which collects all failures and throws them together.
  • If you’re using xUnit without FluentAssertions, you can collect failure messages in a list and throw them as a combined error.

This is failing smarter. And it leads to faster debugging, clearer insight, and stronger confidence in the system.
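To make that concrete, here’s a minimal sketch of both options. The data and the RunSimulation placeholder are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using FluentAssertions;
using FluentAssertions.Execution;
using Xunit;

public class AggregatedFailureExamples
{
    // Hypothetical expected results: day -> calculation name -> value.
    private static readonly Dictionary<int, Dictionary<string, decimal>> Expected = new()
    {
        [1] = new() { ["Field One"] = 10m, ["Field Two"] = 20m },
        [2] = new() { ["Field One"] = 11m, ["Field Two"] = 22m },
    };

    private static Dictionary<int, Dictionary<string, decimal>> RunSimulation() => Expected; // placeholder

    [Fact]
    public void With_FluentAssertions_an_AssertionScope_reports_every_mismatch_at_once()
    {
        var actual = RunSimulation();

        // Every failed assertion inside the scope is collected and thrown as one combined failure.
        using var scope = new AssertionScope();
        foreach (var (day, fields) in Expected)
            foreach (var (field, value) in fields)
                actual[day][field].Should().Be(value, "day {0}, field '{1}' is defined in the story's Excel sheet", day, field);
    }

    [Fact]
    public void With_plain_xUnit_collect_the_messages_and_fail_once()
    {
        var actual = RunSimulation();
        var failures = new List<string>();

        foreach (var (day, fields) in Expected)
            foreach (var (field, value) in fields)
                if (actual[day][field] != value)
                    failures.Add($"Day {day}, field '{field}': expected {value}, found {actual[day][field]}");

        // One combined report instead of stopping at the first mismatch.
        Assert.True(failures.Count == 0, string.Join(Environment.NewLine, failures));
    }
}
```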

Failing Smart in the Pipeline

Let’s now zoom out and apply the same mindset to our build pipelines.

A basic delivery pipeline often looks like this:

Run Unit Tests → Run Integration Tests → Run Acceptance Tests → Create Release Candidate

A real pipeline may also include:

  • Static code analysis
  • Performance profiling
  • Security scanning
  • Artifact signing
  • Deployment staging

The goal of every pipeline is to:

Catch failure before it reaches production.

And the faster the pipeline, the more frequently you can run it. That’s why we aim to keep pipelines lean and fast. Techniques like build once, reuse everywhere are well-known and effective.

But here’s another common anti-pattern:

Stopping the pipeline at the first test failure.

At first glance, this makes sense: a failure = stop and fix.

But is that always the smartest thing to do?

Just like in our modeling system, this approach hides other potential failures.

Maybe:

  • Unit tests passed, but integration tests would have failed.
  • Acceptance tests would have shown deeper problems.
  • Multiple components had regression issues.

If the pipeline stops too soon, you get only partial feedback.

Instead, consider this approach:

  • Continue running all stages in isolation.
  • Aggregate all failures at the end.
  • Present a full, consolidated picture of what’s broken.
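Most CI systems express this in their own configuration, typically by letting every stage run regardless of earlier failures and adding a final reporting step. The idea itself can be sketched in a few lines of C#; the stage names and commands below are placeholders, not a real pipeline:

```csharp
using System;
using System.Collections.Generic;

// Conceptual sketch of a pipeline that runs every stage and reports all failures together.
var stages = new (string Name, Action Run)[]
{
    ("Unit tests",        () => RunStage("dotnet test ./tests/Unit")),
    ("Integration tests", () => RunStage("dotnet test ./tests/Integration")),
    ("Acceptance tests",  () => RunStage("dotnet test ./tests/Acceptance")),
    ("Static analysis",   () => RunStage("dotnet format --verify-no-changes")),
};

var failures = new List<string>();

foreach (var stage in stages)
{
    try
    {
        stage.Run(); // each stage runs in isolation...
    }
    catch (Exception ex)
    {
        failures.Add($"{stage.Name}: {ex.Message}"); // ...and a failure is recorded, not fatal
    }
}

// One consolidated picture instead of stopping at the first red stage.
if (failures.Count > 0)
{
    Console.Error.WriteLine("Failed stages:" + Environment.NewLine + string.Join(Environment.NewLine, failures));
    Environment.Exit(1);
}

static void RunStage(string command)
{
    // Placeholder: a real runner would shell out here and throw on a non-zero exit code.
    Console.WriteLine($"running: {command}");
}
```

In practice you would get the same effect with your CI tool’s “always run” semantics rather than custom code; the point is the shape of the feedback, not the implementation.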

This gives your team a smarter feedback loop:

  • You fix once, not multiple times.
  • You plan better.
  • You learn more per failure.

(Diagram: two distinct approaches to implementing your pipelines.)

Conclusion

Failing fast is not the end goal. Learning fast is.

Automated testing and continuous delivery are our simulation environment. They allow us to crash safely, experiment freely, and recover quickly.

But speed alone isn’t enough.

Failing smart means aggregating insights, understanding dependencies, and giving ourselves the tools to improve—not just react.

Whether it’s in a test method or in your CI/CD pipeline:

  • Don’t just ask “How fast can I catch a failure?”
  • Ask “How much can I learn from it?”

Because in software development, the quality of failure defines the speed of progress.
