Wally’s comment is insightful. Giving a thing a name makes it possible to discuss that thing. Witness the enormous discussion of Test Driven Development on the internet.

The purpose of this post is to give a name to a testing practice I have found useful, and to explain why it is useful. I call the practice Head to Head testing, or HTH. I propose HTH as a solution to a problem that arises in test driven development, and will introduce HTH by first explaining the problem. My explanation will use functional programming notation, but the underlying issues are the same in object-oriented or any other form of programming. This is because the issues are mathematical in nature.

Suppose my task is to write a function to compute logarithms base 10. TDD demands that I begin by constructing some tests. OK; here they are.

```
(assert (= (log 1) 0))
(assert (= (log 10) 1))
(assert (= (log 100) 2))
(assert (= (log 1000) 3))
```

What do these tests accomplish? A function that fails these tests is certainly an inadequate log function, but passing this suite proves little. Infinitely many functions satisfy these tests. Let’s step back and consider the logic of testing.

Suppose we are testing the implementation of a function F defined on a finite domain D. An ideal test suite would have one test case for each element of D, and would exhaustively check whether the implementation of F was correct at each point in D. An implementation that passed this ideal test suite would be demonstrably correct.
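To make the idea of an ideal test suite concrete, here is a minimal sketch in Python of exhaustive testing over a small finite domain. The function and its specification are hypothetical examples of my own, not from the post: a saturating 8-bit increment, whose domain of 256 values is small enough to check completely.

```python
def saturating_inc(x):
    """The implementation under test: increment an 8-bit value, clamping at 255."""
    return min(x + 1, 255)

def spec(x):
    """An independently written statement of the intended behavior."""
    return x + 1 if x < 255 else 255

# The domain D = {0, ..., 255} is finite and tiny, so the ideal test
# suite is attainable: one check per element of D.
for d in range(256):
    assert saturating_inc(d) == spec(d), f"mismatch at {d}"
```

An implementation that survives this loop is demonstrably correct on its entire domain; the rest of the post is about what to do when the domain makes such a loop impossible.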

The problem is that exhaustive testing is impossible in practice. Many functions have infinite domains, or domains so large that exhaustive testing is not physically feasible. Since we cannot test our implementation at each point in its domain, we are forced to settle for testing on a subset of the domain. But which subset? I contend that we have now entered the realm of statistical sampling.

My test suite for the log function proposed to test at the points {1, 10, 100, 1000}. Looked at through the lens of statistical sampling, this test suite suffers from two defects. The lesser defect is that we have no idea whether a test of size N=4 is sufficient. The fatal defect is that the points chosen for testing were not chosen randomly. This makes the test suite statistically worthless.

Imagine by way of analogy that you engaged me to conduct a survey. I offer to use my friends as the sample population. You ought to conclude either that I am pulling your leg or that I am an incompetent statistician. But in the programming world, this approach to sampling is commonly accepted. Test suites are hand-written by programmers who write whatever test cases they deem appropriate. This flies in the face of statistical practice. A statistically valid test requires randomly sampled data, not data hand-chosen by a human being.

It could be argued that as software testers we are not in search of statistically valid tests, but of errors. And that is true as far as it goes, but the search for errors demands randomized data for two reasons. The computer can generate untold millions of random test cases for each one a tester writes by hand. But the deeper purpose of randomization is to avoid blind spots or bias in the selection of test data. If I design the tests, presumably I choose the data based on what I think might go wrong in the software. What about all the things that never occur to me to test? In the long run, randomized tests go everywhere. They have no long-term biases or blind spots.

Suppose I decide to do the right thing by testing the log function on randomly sampled data. How could I do that? It’s not hard. Here’s how to conduct a single trial. Randomly pick two positive real numbers x and y. The corresponding test is this:

```
(assert (= (log (* x y)) (+ (log x) (log y))))
```

The test exploits the fact that log(x * y) must equal log(x) + log(y). It would be a simple matter to automate this test by constructing a stream of randomly generated pairs (x y). Doing this would make it possible to test the log function at hundreds of millions of points rather than the relatively few a developer or tester could test using hand-written tests.
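The automation described above can be sketched in Python. One caveat my Lisp-style one-liner glossed over: in floating-point arithmetic, exact equality of log(x * y) and log(x) + log(y) is too strict, so this sketch compares within a small tolerance. The `log10` wrapper stands in for the implementation under test; here it simply delegates to the standard library.

```python
import math
import random

def log10(x):
    """Stand-in for the log implementation under test."""
    return math.log10(x)

random.seed(0)  # a fixed seed makes any failure reproducible for debugging

for _ in range(100_000):
    x = random.uniform(1e-6, 1e6)
    y = random.uniform(1e-6, 1e6)
    # Floating-point rounding means the two sides agree only approximately,
    # so compare within a relative tolerance (plus a tiny absolute one for
    # results near zero) rather than demanding exact equality.
    assert math.isclose(log10(x * y), log10(x) + log10(y),
                        rel_tol=1e-9, abs_tol=1e-12), (x, y)
```

Each iteration is one randomized trial; raising the loop bound tests the function at as many points as the hardware will tolerate.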

That’s all well and good, but my example is exceptionally well suited to randomized testing. In my example I could exploit a mathematical property of the log function to construct tests with known expected outcomes. It is not obvious how to do this while testing non-mathematical software. That is where Head to Head testing enters the picture.

The HTH approach is as follows. Suppose we are implementing a function F. We build not one but two independent implementations of F; call the implementations g and h. We also create a suitable set of randomized data D, and test at each point d in D whether g(d) = h(d). If g and h disagree at d, then at least one of them is in error. Notice that we do not have to trust that either g or h is correct, nor do we need to know the correct value of F at d. A disagreement tells us only that at least one implementation is wrong, so both must be investigated. This is a feature, not a bug. Examining the differences between the two implementations will lead us to the flaw much more quickly and easily than would the study of a single implementation.
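Here is a minimal head-to-head harness sketched in Python. The function F and its two implementations are hypothetical examples of my own: the median of a list, computed once by sorting and once by a deliberately different route through heap selection. Neither implementation is trusted; the harness only looks for disagreement.

```python
import heapq
import random

def median_sort(xs):
    """Implementation g: sort the list and index into it."""
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

def median_select(xs):
    """Implementation h: select the smallest elements with a heap instead."""
    n = len(xs)
    small = heapq.nsmallest(n // 2 + 1, xs)
    return small[-1] if n % 2 == 1 else (small[-2] + small[-1]) / 2

random.seed(1)  # reproducible trials
disagreements = []
for _ in range(10_000):
    # One randomized data point d: a list of random length and contents.
    d = [random.randint(-100, 100) for _ in range(random.randint(1, 50))]
    if median_sort(d) != median_select(d):
        disagreements.append(d)  # a concrete failing input, ready for debugging

assert not disagreements
```

Note that nowhere does the harness compute the "true" median; it needs only the two implementations and a stream of random inputs.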

A nice side effect of this approach is that a failed HTH test does not result in a mere assertion of failure. We are handed a test case known to trigger the failure, a very useful thing for debugging. And it is rare to find a lone bug. Very likely running more tests will result in a growing collection of data points that trigger failures.

Return for a moment to the question of sample size. How many test cases are appropriate? As software testers we have a tremendous advantage over statisticians. In general it is very difficult to determine how large a sample size N is required to create a statistical test of known power. But we are not doing mathematical statistics, we are testing software. Just make N huge. Better still, make N infinite. Run HTH tests continuously. Many advocate continuous builds; why not continuous testing? By continuous testing I don’t mean a test suite that runs as part of the build, with successful completion acting as a gate on code check-in. I mean tests that run 24/7 on dedicated hardware, never completing, just searching tirelessly for bugs. Machines are cheap and bugs are expensive.
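A continuous HTH runner can be sketched as a loop with no terminating condition. All names here are hypothetical illustrations: `g` and `h` are the two implementations, `gen_input` produces random data points, and `report` records each disagreement (to a log, a bug tracker, whatever suits). Passing `trials=None` gives the never-completing, 24/7 mode; a finite `trials` is handy for trying the runner out.

```python
import itertools

def run_hth(g, h, gen_input, report, trials=None):
    """Compare g and h head to head on random inputs.

    trials=None runs forever (continuous testing); an integer runs
    that many trials and returns the number of disagreements found.
    """
    counter = itertools.count() if trials is None else range(trials)
    failures = 0
    for t in counter:
        d = gen_input()
        if g(d) != h(d):
            failures += 1
            report(t, d)  # hand the debugger a concrete failing input
    return failures
```

The loop never has to stop because it never runs out of data; the random generator supplies a fresh point every iteration.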