Pros and Cons of Backdoor Stubbing

Testing around effects, Part 2

Dec 03, 2023

Hello! This is the second post in my series on “testing around effects.” Part 1 was here. If you’re not sure what an “effect” is, start with my post on effects.

In Part 1, we learned how to stub out the effectful dependencies of the code we want to test. In this post, I want to analyze the pros and cons of testing this way. While backdoor stubbing often isn’t the best approach to testing around effects, there are circumstances where it’s very helpful, so it’s important to understand when and where it’s useful.

A double-edged sword

Steampunk Link wielding a sword that crackles with energy — "Steampunk Link" by Wild Guru Larry is licensed under CC BY 2.0.

The strength of backdoor stubbing is that you can apply it without making invasive changes to the code under test. Recall that in order to test sendSignupReport, the only change we had to make to the production code was to replace sendEmail() with Emails.send(). This simple change allowed us to swap in a stub for the send method in our tests.
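To see why that small change is enough, here is a minimal sketch of the mechanism. The replace helper is only a guess at the shape of Doubles.replace (swap a property, hand back an undo function), and the Emails object and notifyAdmin function are hypothetical stand-ins, not the code from Part 1.

```javascript
// Hypothetical stand-in for the production Emails module.
const Emails = {
  send(message) {
    throw new Error("would contact a real mail server");
  },
};

// Assumed shape of a Doubles.replace-style helper: swap a property on an
// object and return a function that puts the original back.
function replace(object, methodName, replacement) {
  const original = object[methodName];
  object[methodName] = replacement;
  return function restore() {
    object[methodName] = original;
  };
}

// Hypothetical code under test. It looks up Emails.send at call time, so it
// uses whatever function currently lives on the Emails object.
function notifyAdmin(text) {
  return Emails.send({ to: "eliza@example.com", body: text });
}

// In the test: install a stub that records messages instead of sending them.
const sent = [];
const restore = replace(Emails, "send", (message) => {
  sent.push(message);
  return Promise.resolve();
});

notifyAdmin("3 new signups");
restore(); // put the real send method back for later tests
```

A bare sendEmail() call gives the test no such seam: the test cannot reassign a name that is private to the production code, whereas a property on a shared object can be swapped from outside.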

Backdoor stubbing gives you a way to test any code, no matter its shape. If you’re trying to retrofit tests around legacy code, this is a huge advantage. However, the resulting tests are not always pleasant to read.

The example we saw in Part 1 was in many ways a best-case scenario. What if instead, the API for sending an email looked like this?

Emails.builder()
  .from("noreply@example.com")
  .to("eliza@example.com")
  .subject("Weekly signup report")
  .body(reportHtml)
  .send();

Here’s how we might write a stub for that in our tests:

const sentEmails = [];
Doubles.replace(Emails, "builder", () => ({
  from(sender) {
    return {
      to(recipient) {
        return {
          subject(subjectLine) {
            return {
              body(html) {
                return {
                  async send() {
                    sentEmails.push({
                      from: sender,
                      to: recipient,
                      subject: subjectLine,
                      body: html,
                    });
                    return Promise.resolve();
                  },
                };
              },
            };
          },
        };
      },
    };
  },
}));

// ...

expect(sentEmails, contains, {
  from: "noreply@example.com",
  to: "eliza@example.com",
  subject: "Weekly signup report",
  body: "<li>US - 1</li>",
});

This is a classic pyramid of doom—the kind of code that will make the unfortunate soul who next reads it run git blame and curse your name.

(It’s also kind of a strawman—there are better ways to write this test double, though those arguably veer towards being fakes, not stubs. Still, the code above is what I’d expect an automatic stub-generator to produce.)
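For comparison, here is one possible shape for such a flatter double. It is a sketch, not a recommendation from Part 1: makeRecordingBuilder is a hypothetical name, and the Emails object is a stand-in that is swapped by plain assignment where the real test would use Doubles.replace.

```javascript
// Hypothetical fake for the builder-style Emails API. Each setter records
// its argument on a shared draft and returns the same builder object.
function makeRecordingBuilder(sentEmails) {
  const draft = {};
  const builder = {
    from(sender) { draft.from = sender; return builder; },
    to(recipient) { draft.to = recipient; return builder; },
    subject(subjectLine) { draft.subject = subjectLine; return builder; },
    body(html) { draft.body = html; return builder; },
    send() {
      // Record a copy of the draft instead of sending anything.
      sentEmails.push({ ...draft });
      return Promise.resolve();
    },
  };
  return builder;
}

const sentEmails = [];

// Stand-in for the production Emails object. In the post's tests this swap
// would be done with Doubles.replace(Emails, "builder", ...).
const Emails = { builder: () => makeRecordingBuilder(sentEmails) };

Emails.builder()
  .from("noreply@example.com")
  .to("eliza@example.com")
  .subject("Weekly signup report")
  .body("<li>US - 1</li>")
  .send();
```

This version is much shorter, but it now contains a little logic of its own (it accumulates state across calls), which is why it arguably counts as a fake rather than a stub.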

This stub is far harder to understand than the code it enables you to test, and that’s never a good sign. At some point, we have to question whether this test is worth the cost of writing and maintaining it.

Thus, backdoor stubbing is a double-edged sword: a hero programmer’s weapon. Armed with it, you can dive into messy code and very quickly get a test running. Just know that by doing so, you are treating symptoms, not causes, and you may well end up making the mess worse.

For this reason, I recommend backdoor stubbing only as a first step toward remediating legacy code. Once you have passing tests, you can begin to refactor the code into a more test-friendly shape.

Missing coverage

The second disadvantage of backdoor stubbing is that it’s easy to leave code untested without realizing it. Admittedly, missing coverage is always a risk when testing legacy code, but backdoor stubbing makes the risk worse.

Let me digress for a moment here and clarify what I mean by “coverage.” I do not mean “coverage” as typically measured by test-coverage tools. Those tools consider an expression covered if it is executed by any test. I only consider an expression to be covered to the extent that behavior-altering changes to that expression can cause a test failure.

To see the distinction, imagine a codebase with 100% test coverage, as measured by a tool. That is, every single statement is executed by at least one test. Now suppose we delete the assertions from all the tests. Still 100% measured coverage, but now, we can pick any arbitrary expression we like, tweak it to introduce a bug (e.g. by inverting a conditional) and no tests will fail. The codebase now has essentially zero coverage according to my stricter definition.
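Here is a tiny, made-up illustration of the difference. The isAdult function and both test functions are hypothetical; the "mutant" is the same function with its comparison inverted, as in the thought experiment above.

```javascript
// Hypothetical production function.
function isAdult(age) {
  return age >= 18;
}

// The same function with a deliberately introduced bug (inverted comparison).
function isAdultMutant(age) {
  return age < 18;
}

// Executes the code but asserts nothing, so it can only fail if the code throws.
function testWithoutAssertions(implementation) {
  implementation(30);
  implementation(10);
  return "pass";
}

// Checks the results, so a behavior-altering change is detected.
function testWithAssertions(implementation) {
  const ok = implementation(30) === true && implementation(10) === false;
  return ok ? "pass" : "fail";
}

const results = {
  noAssertionsOriginal: testWithoutAssertions(isAdult),     // "pass"
  noAssertionsMutant: testWithoutAssertions(isAdultMutant), // "pass": bug unnoticed
  assertionsOriginal: testWithAssertions(isAdult),          // "pass"
  assertionsMutant: testWithAssertions(isAdultMutant),      // "fail": bug caught
};
```

Both test functions execute every statement of the function they are given, so a coverage tool would score them the same. Only the second one provides coverage in the stricter sense.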

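(Why don’t coverage tools measure this stricter type of coverage? Some do: this is roughly what mutation testing tools estimate, by introducing small changes to the code and checking whether any test fails. But it’s very computationally expensive, since the tests have to be rerun for each change.)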

Equipped with that definition, let’s look at an example.

In our tests for sendSignupReport(), we left the call to weeksAgo() uncovered. Any changes to weeksAgo, or to the argument we’re passing, won’t cause any test failures. We could even delete weeksAgo and pass a hardcoded time value to the query instead, and our tests would be fine with it.

Now, this isn’t necessarily a bad thing. The choice to cover code with tests is always a judgment call. Are the safety and knowledge that we gain by testing worth the cost of maintaining and running the test? It’s fine to consider this question and answer “no,” but if we never consider it, our testing choices are unlikely to be wise.

So let’s ask ourselves: if we did want our test to assert that weeksAgo was called with the right argument, how would we do it?

To recap, here is the relevant production code:

const newUsers = await database.run(
  queryFrom("users").where("signup_date", ">", weeksAgo(1)),
);

And here is where we left the stub for database.run():

Doubles.replace(database, "run", () => Promise.resolve(newUsers));

In the best case—assuming our query object lets us get at the where filters somehow—we might be able to do this:

let query;
Doubles.replace(database, "run", (arg) => {
  query = arg;
  return Promise.resolve(newUsers);
});

// ...

expect(query.whereFilters[0].operand, isNoLaterThan, weeksAgo(1));

(Why am I using isNoLaterThan (a function I just made up on the spot, btw) in the assertion? Why not equals? Well, time is passing as our test runs. The call to weeksAgo in production will, in general, return a different value than the one in our assertion.)

However, this couples our test to details of the query object, which could have disadvantages. E.g. if the query object comes from a library, our test might break when we upgrade that library.

The revised test also leaves us open to a class of bugs. If we replace weeksAgo(1) with weeksAgo(2) in the production code, the test won’t fail, since two weeks ago is certainly no later than one week ago.
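To make the gap concrete, here is a sketch with assumed implementations of both functions. Neither comes from Part 1: weeksAgo is assumed to return a Date, and isNoLaterThan is assumed to be a simple comparison of timestamps.

```javascript
const MS_PER_WEEK = 7 * 24 * 60 * 60 * 1000;

// Assumed implementation: returns a Date n weeks before the current moment.
function weeksAgo(n) {
  return new Date(Date.now() - n * MS_PER_WEEK);
}

// Assumed implementation of the made-up matcher: true when `actual` is at
// or before `limit`.
function isNoLaterThan(actual, limit) {
  return actual.getTime() <= limit.getTime();
}

// What correct production code would pass to the query.
const correctOperand = weeksAgo(1);
// What buggy production code would pass instead.
const buggyOperand = weeksAgo(2);

// The assertion runs afterwards, so its own weeksAgo(1) is computed at the
// same moment as correctOperand or slightly later.
const limit = weeksAgo(1);

const correctPasses = isNoLaterThan(correctOperand, limit); // true
const buggyPasses = isNoLaterThan(buggyOperand, limit);     // also true: the bug slips through
```

Any operand further in the past than one week satisfies the matcher, so the assertion only rules out values that are too recent.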

So it seems like in order to really test this properly, we have to stub time somehow. Some JavaScript test tools let you do exactly that (Jest, for example, has jest.useFakeTimers()), but if for some reason those aren’t an option, we’d have to do something like the following.

First, in our production code, we’d refactor weeksAgo to Time.weeksAgo, and queryFrom to Query.from, so we have a place to insert stubs:

const newUsers = await database.run(
  Query.from("users").where("signup_date", ">", Time.weeksAgo(1)),
);

Then, in our test:

Doubles.replace(Time, "weeksAgo", (n) =>
  `return value from weeksAgo(${n})`);

let queryOperand;
Doubles.replace(Query, "from", () => ({
  where(field, operator, operand) {
    queryOperand = operand;
  },
}));

// ...

expect(queryOperand, is, "return value from weeksAgo(1)");

The string "return value from weeksAgo(1)" is a kind of test double—a dummy—that we haven’t yet been introduced to. Dummies are values that pass through the production code but are not actually used by it. This dummy allows our test to verify two things: first, that weeksAgo was called with the right argument, and second, that the value returned by weeksAgo got passed to the query.

With this change in place, we’ve guarded against several possible bugs related to weeksAgo. If weeksAgo isn’t called, or is called with the wrong argument, our test will fail. If it’s called with the right argument but the return value isn’t used, the test will fail. Our coverage has improved.
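The sketch below runs that idea end to end on just the query-building expression. It leaves out database.run and the rest of sendSignupReport, installs the stubs by plain assignment where the real test would use Doubles.replace, and uses hypothetical stand-ins for the Time and Query modules.

```javascript
const MS_PER_WEEK = 7 * 24 * 60 * 60 * 1000;

// Hypothetical stand-ins for the production Time and Query modules.
const Time = {
  weeksAgo(n) {
    return new Date(Date.now() - n * MS_PER_WEEK);
  },
};
const Query = {
  from(table) {
    return {
      where(field, operator, operand) {
        return { table, field, operator, operand };
      },
    };
  },
};

// Two versions of the production expression: the intended one and a buggy one.
const correctCode = () =>
  Query.from("users").where("signup_date", ">", Time.weeksAgo(1));
const buggyCode = () =>
  Query.from("users").where("signup_date", ">", Time.weeksAgo(2));

// Installs both stubs, runs the code, restores the originals, and reports
// the operand that the stubbed `where` received.
function operandSeenBy(code) {
  const originalWeeksAgo = Time.weeksAgo;
  const originalFrom = Query.from;
  let queryOperand;
  Time.weeksAgo = (n) => `return value from weeksAgo(${n})`;
  Query.from = () => ({
    where(field, operator, operand) {
      queryOperand = operand;
    },
  });
  try {
    code();
  } finally {
    Time.weeksAgo = originalWeeksAgo;
    Query.from = originalFrom;
  }
  return queryOperand;
}

const seenFromCorrect = operandSeenBy(correctCode); // "return value from weeksAgo(1)"
const seenFromBuggy = operandSeenBy(buggyCode);     // "return value from weeksAgo(2)"
```

Because the dummy string embeds the argument, the buggy version produces a different operand, and an exact-match assertion against "return value from weeksAgo(1)" fails for it.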

What I hope you take away from this example is that revising our test to cover weeksAgo is nontrivial, and not the sort of thing that will fall naturally out of following “best practices”. We had to notice the coverage problem, decide it was worth fixing, and figure out how to test it. It’s extremely easy to overlook coverage problems of this kind, since the test with missing coverage doesn’t look wrong or incomplete in any way. In real-world systems, coverage problems that arise from backdoor stubbing often don’t get fixed.

Coupling to implementation

A further disadvantage of stubbing is that it can couple your tests to incidental implementation details of the code.

Ideally, tests fail when the behavior of our software regresses, and pass when we make structural changes that don’t affect behavior. When tests are coupled to implementation details, changes that are purely structural can cause false-alarm test failures.

As an example, consider the revised Emails API from earlier:

Emails.builder()
  .from("noreply@example.com")
  .to("eliza@example.com")
  .subject("Weekly report")
  .body(html)
  .send();

This API appears to be using the Builder pattern, which means we should be able to call from, to, subject, and body in any order before calling send, and the resulting behavior should be identical.

E.g., maybe we want to put the subject line first:

Emails.builder()
  .subject("Weekly report")
  .from("noreply@example.com")
  .to("eliza@example.com")
  .body(html)
  .send();

If our tests are using the pyramid-of-doom stub, this code will fail the test. The object returned from Emails.builder() doesn’t have a subject method; it expects to receive a call to from, and will throw an exception for anything else.
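The failure is easy to reproduce. This sketch uses a trimmed copy of the nested stub (arguments are ignored and send() is a no-op) on a hypothetical Emails object, installed by plain assignment where the real test would use Doubles.replace.

```javascript
// Hypothetical Emails object whose builder is the nested stub, trimmed down:
// each level exposes only the one method the stub author expected next.
const Emails = {
  builder: () => ({
    from() {
      return {
        to() {
          return {
            subject() {
              return {
                body() {
                  return { send() {} };
                },
              };
            },
          };
        },
      };
    },
  }),
};

// Production code after the purely structural change: subject comes first.
let failure;
try {
  Emails.builder()
    .subject("Weekly report")
    .from("noreply@example.com")
    .to("eliza@example.com")
    .body("<li>US - 1</li>")
    .send();
} catch (error) {
  failure = error;
}
// `failure` is now a TypeError, because the object returned by builder()
// has no subject method.
```

Nothing about the email that would be sent has changed, yet the test fails before it ever reaches an assertion.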

Tests are supposed to catch our mistakes while allowing us to refactor and add functionality. Tests with implementation-coupled stubs work against the second half of that goal, by preventing us from making purely structural or additive changes. It’s all too common for complex stubs to be overly coupled to implementation details.

Wrap-up and final score

In conclusion, I rate backdoor stubbing 6 out of 10. Nah, just kidding—trying to give it a numeric score is kind of silly. Backdoor stubbing is a good technique for getting tests around legacy code, as a first step toward refactoring. However, its downsides mean that I generally don’t use it unless my hand is forced.

One exception: tests that involve a single backdoor stub, for a simple function, are generally fine. Those tests aren’t hard to read, and they’re easy to write. But if you have multiple stubs, especially if they’re nested as in the pyramid-of-doom example, things can get out of hand quickly. Beware!

In the next episode of “testing around effects,” we’ll explore an alternative to backdoor stubbing: redesigning the code to solve a more general problem. Stay tuned.

Ben’s Guide to Software Development
