Benchmarking and CULA

by John

We often get asked exactly how we produce our publicly displayed numbers. This is a fair question, and one that's always good to see asked. Benchmarking is hard to do fairly, and even more so when the benchmarks will be used for promotion or compared against other, similar benchmarks. As a developer of GPU libraries, we're constantly on the lookout for benchmarks related to GPU programming, and we're surprised by the errors many programmers make when benchmarking their code. Today I'm going to talk about the benchmarking policies we've put in place for CULA and discuss how we avoid the pitfalls of benchmarking a GPU program.

When we designed our benchmarking policies for CULA we wanted our numbers to be unimpeachable, and so we decided to show our best performance but only in practical circumstances and only compared to high-quality competitors.  Although we compare CULA against many different packages, we choose only to publish benchmarks against the most highly tuned ones.  Believe me, it feels great to see 100x performance out of CULA against packages like a single-threaded LAPACK implementation, but it would be unfair to lean on these when better implementations exist.  In many other fields this would be a perfectly adequate comparison - for example, sometimes our customers bring us custom, untuned code and we get performance numbers like 100x on a regular basis. But it's a different story in the linear algebra field, where the algorithms are well known, highly modular, and there are incredibly well tuned multi-core implementations out there.

After choosing the packages you want to compare against, there is a second level, which is how to set up the problem. Too often GPU code is benchmarked in a loop of a hundred iterations, which may not be how a kernel is called in the real world, hiding the true cost of using the code. Or worse -- a user omits a cudaThreadSynchronize before marking the end of the test; it will probably look like they're getting great performance, but they're not counting the execution fairly, because kernel launches are asynchronous and may not have completed yet. Also, did they check for errors or not? That takes time too! And in almost every circumstance we've encountered, memory allocation time is not counted, even though it's quite necessary for functionality (the rationale, of course, being that the hundreds of loop iterations amortize this cost to zero, but it's hard to be certain without benchmarking). As a library vendor, we want our users to have the best experience possible when using our code, so we always synchronize and check for errors at the library boundary so that we can do as much with these errors as possible -- instead of putting the checking and error handling on our users. Because we do this (and we feel that any good code should), we choose to count all of these "overheads" in the times we report in our benchmarks.
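To make this concrete, here is a minimal sketch of what we would consider a fair end-to-end timing of a GPU routine. The kernel `scaleKernel` is purely hypothetical and stands in for whatever work is being measured; the point is that allocation, transfers, the synchronize (shown here as cudaDeviceSynchronize, the newer name for cudaThreadSynchronize), and the error check all sit inside the timed region:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the routine under test.
__global__ void scaleKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float* hostData = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i)
        hostData[i] = 1.0f;

    // Start the clock *before* allocation and transfer -- these are
    // real costs every caller pays, so a fair benchmark counts them.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    float* deviceData = 0;
    cudaMalloc((void**)&deviceData, n * sizeof(float));
    cudaMemcpy(deviceData, hostData, n * sizeof(float),
               cudaMemcpyHostToDevice);

    scaleKernel<<<(n + 255) / 256, 256>>>(deviceData, n);

    // Kernel launches are asynchronous: without this synchronize,
    // the timer would stop long before the kernel actually finishes.
    cudaDeviceSynchronize();

    // Error checking takes time too, so it belongs inside the timed region.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

    cudaMemcpy(hostData, deviceData, n * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(deviceData);

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("end-to-end time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    free(hostData);
    return 0;
}
```

Dropping the synchronize, or moving the cudaMalloc and cudaMemcpy calls outside the events, will produce a much smaller number -- but not one that reflects what a caller actually experiences.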

Beyond actually timing your code, there is also the concern of fairness in selecting which results to present. It can be easy to cherry-pick your best results, presenting a false impression of how well your code really runs. To be as fair as possible, our method is to select the tests we want to present before running any tests. For each routine, we choose the parameters and job combinations we feel best represent a common, "real world" usage pattern. Only after this choice do we run our tests and present our findings. So you can trust that there is no funny business involved, such as "I'm getting 1.2x on average, but my code runs great for matrices sized precisely 1040x1040, so I'll use that."

What's really great is that while we tend to present conservative numbers for our own code, our users are under no such constraints. We're always happy when we see someone with a top-of-the-line GPU and an aging CPU benchmark our code and get a speedup of 10-15x! For obvious reasons we can't market ourselves with that kind of testing, but it's always good to get a reminder that in real-world usage everyone has a different setup and a different set of needs. Although we release benchmarks on a top-of-the-line machine, as a cross-OS, cross-GPU-generation, cross-CPU, cross-language library vendor, we try to get the best performance for all of our users, and you can be sure that CULA will perform great no matter what you're using.

Comments (2)
  1. Can CULA work on multi GPU systems like GPU based Supercomputers?

  2. The current version of CULA 2.0 can leverage multiple GPUs for task-level parallelism. If your algorithm requires multiple independent function calls, CULA can support this. Future versions of CULA will provide functions that can use multiple GPUs to accelerate a single routine.
