3May/10Off

53,000 Tests

by Dan

One of the things that we spend a tremendous amount of time on is in testing CULA. In this post, I'm going to discuss some of the challenges of thoroughly testing a large library and the methodologies we have come up with to meet these challenges.

Perhaps the biggest challenge in testing CULA is in fully testing the breadth of functions we support. CULA of has over 150 different API functions. To complicate matters, some routines use job codes to change their functionality, which multiplies the number of tests by the number of combinations of job codes. For example, many functions have three job codes, leading to eight different variants of the function that need to be tested.

In total, a full test run of CULA includes 53,000 tests. This number comes from multiplying the number of functions (with all their variants) against a set of hundreds of inputs. Such a huge number of tests leads to many interesting challenges, the most critical of which is that there simply aren't enough hours in a day to run a full suite of tests.

24 hours = 86400 seconds
6400 seconds / 53000 tests = 1.6 seconds per test

For many tests, an allotment of 1.6 seconds to tests is plenty of time. For others, however, this doesn't even come close. Take SVD, for instance. For an SVD of appreciable size, say 4K by 4K, CULA's runtime is 42 seconds. This is actually a great result, because the runtime of MKL is three times longer.

Beyond just a performance number, however, we need to make sure that our results are correct by comparing them with the results of other functions. Although each routine is different, we run through a battery of tests for each including norm comparisons, matrix reconstructions, and quality analyses (e.g. identity, orthogonality, etc.). As an example, after a full SVD run we perform as many of the following tests as possible for the job codes under test:

Analyze the orthogonality (or unitarity) of U and V by multiplying U*U' and V*V'. Both of these are norm-tested against the identity matrix.
Reconstruct the original matrix by multiplying U*S*V' and norm-test the reconstructed version against the original.
The above two bullets indicate a successful singular value decomposition, but we also do further tests. For instance, we compare the quality of our obtained singular values to that of MKL.
These tests add more time to a total runtime that has already greatly exceeded a potential allotment. Given this, how do we compensate for having so many tests that take such an extreme time to run? The answer to this is that we run a smaller set of tests during the day and an expanded, rotating set of tests each night. Our daily tests run every time we make a commit to our software version control system. In this way, we learn quickly of problems as they arise. Our nightly tests, on the other hand, start automatically each evening and exercise CULA exhaustively with a rotating problem set that includes larger problems, non-square matrices of (wildly) varying shapes, matrices designed to test against regressions, and matrices that have been specially formulated to exercise corner cases.

While this combination of quick daily tests and exhaustive nightly tests is an effective solution to ensuring that a full test suite can be run, it doesn't actually reduce the time an individual test run takes. To compensate for this, we use a caching technique to further reduce the demanding runtimes of our full test suite. Changes to the CULA codebase happen regularly, but changes to the tools we compare against happen far less frequently. With this in mind, why rerun the tests for these tools when the results never change? In short, we don't. On their first run, we cache the results of these runs to disk along with their runtimes. When a future test is run, instead of re-computing answers, we read the answers from disk. In this way, we trade a long computational runtime for a simple read from disk, which, for even large matrices, takes less than a second. For a function that would have taken 2 hours to run, reducing it a second leads to a speedup of 7000 times! That's right, we accelerate testing too.

Of course, this speedup does not mean anything outside of this example. But what it does mean is that it allows us to run far more tests than would be practically possible in the small amount of time that we have. 53,000 tests, to be specific, and this number grows quadratically as we expand both the testing data and the number of routines.

When designing CULA, we strive to make the fastest linear algebra library available. However, we don't just care about speed; we also care about ease-of-use, quality documentation, operating system availability, and excellent support. And most importantly of all, we care about correctness. Correctness in all circumstances, including corner cases, always trumps speed -- a fast, but incorrect implementation is useless.

Whenever you read a paper or post talking about performance numbers, question how they test their code. We've laid out our entire testing strategy because we're proud of the focus we place on making an error-free product, not just for one or two routines, but for every routine we release . Testing isn't easy, but we do the hard work to make sure that when you use CULA, you are using a product that you can rely on.