Today NVIDIA announced that on March 4th they will be releasing a CUDA 4.0 release candidate to those in their registered developer program. With this release, NVIDIA has endeavored to simplify programming with CUDA, especially in the area of multi-GPU designs. The new capabilities this version introduces drastically simplify the task of writing a general purpose, multi-GPU library, and we’re happy to say that we’ll be benefiting from these new features in our next release.
You can read more about the new version at NVIDIA’s press center.
In the previous post, we took the time to describe the state of the art in performance for GPU-assisted linear algebra computations. While performance is a huge motivating factor for the adoption of GPU code, there is also a lot to be said for the usability and capabilities of a library. We will take this post to highlight some of our favorite features.
Just as important as speed is the question "are all my routines supported?", a critical first question when evaluating an alternative library. Counting precision variants, our present routine roster is at 158 LAPACK routines and 34 BLAS routines (see here for info on our BLAS system). In comparison, MAGMA has roughly 100 routines (ignoring non-LAPACK variants), and not all of them have both CPU and GPU interfaces, whereas every CULA routine does. This is a point of pride for us; we want to provide a consistent and confusion-free experience across all platforms, all interfaces, and as many languages as possible.
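For readers unfamiliar with what "counting precision variants" means: LAPACK and BLAS routine names begin with a letter that identifies the data type (S, D, C, or Z), so a single algorithm such as GETRF can appear as up to four routines. The short sketch below illustrates that naming convention only. The helper name `precision_variants` is hypothetical, and not every routine exists in all four precisions (symmetric solvers, for example, use different base names for real and complex data), so this is not a way to reproduce the counts quoted above.

```python
# Standard LAPACK/BLAS type prefixes and the data type each one denotes.
PRECISIONS = {
    "S": "single-precision real",
    "D": "double-precision real",
    "C": "single-precision complex",
    "Z": "double-precision complex",
}


def precision_variants(base_name):
    """Return the four prefixed names for a base routine name.

    Hypothetical helper for illustration; it does not check whether
    a given variant actually exists in any particular library.
    """
    base = base_name.upper()
    return [prefix + base for prefix in PRECISIONS]


if __name__ == "__main__":
    for name in precision_variants("getrf"):
        print(name, "-", PRECISIONS[name[0]])
```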
Speaking of interfaces, we provide many interfaces into CULA in order to match as many programming styles as possible. We have the basic C bindings that most libraries support, as well as type-overloaded calls in the C++ headers, and both of these have host memory and device memory interfaces. We also have Fortran bindings for gfortran, Intel Fortran, and PGI Fortran. Our Bridge interface is a very low-effort way to quickly try out CULA's host interface for ALL of the supported LAPACK and BLAS calls in your whole program! For all the MATLAB users out there, we have demonstrated how to call CULA functions in your MATLAB MEX routines. In comparison, MAGMA supports only plain C interfaces for host and device calls, so the integration effort is placed on the user.
This isn't to say we're perfect, but if you check out our forums, you will see that we make an effort to aid users in their integration, and when bugs are discovered we attempt to correct them very quickly (see, as an example, this post). We feel strongly that CULA provides the best user experience, and heartily encourage you to take it for a test drive, starting with the free CULA Basic version.
|Feature||MAGMA||CULA|
|Number of Unique LAPACK Routines||100||158|
|Number of Unique BLAS Routines||36||34|
|Optimized SVD Solver|| || |
|Optimized Symmetric Eigenvalue Solver|| || |
|Check and Report Errors|| || |
|Host Memory Interface||Partial|| |
|Device Memory Interface||Partial|| |
|Compiles Easily||Requires edits|| |
We are often asked about the differences between CULA and the University of Tennessee's GPU linear algebra package, MAGMA. The simple answer we normally give is that CULA is a commercial product developed for deployment while MAGMA is an academic project focused on research. As a commercial product, we strive to produce a cutting-edge library that is well supported, feature-rich, easy to use, and regularly updated.
Performance-wise, both CULA and MAGMA provide substantial speedups compared to the CPU. Since both libraries provide a wide range of routines, it's better to analyze them on an individual basis rather than generalize about the entire library.
The following graph shows that the performance of the popular routine DGETRF is fairly consistent between the two libraries, with CULA pulling ahead at large matrix sizes. We have seen a similar pattern for other routines such as Cholesky and QR factorization. These tests were performed using an Intel Core i7 and an NVIDIA Tesla C2070, with MAGMA compiled against MKL with full threading enabled.
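As a reminder of what DGETRF computes, here is a minimal pure-Python sketch of LU factorization with partial pivoting, plus a solve that uses the factors. It is a textbook reference version for square matrices only, not the CULA or MAGMA implementation; the function names `lu_factor` and `lu_solve` are our own, the pivot indices are 0-based (LAPACK's `ipiv` is 1-based), and it raises an exception on an exactly zero pivot where LAPACK would instead report this through its `info` return value.

```python
def lu_factor(a):
    """Factor a square matrix as P*A = L*U using partial pivoting.

    Returns (lu, piv). lu stores U on and above the diagonal and the
    multipliers of the unit lower-triangular L below it. piv[k] is the
    row that was swapped with row k at step k (0-based).
    """
    n = len(a)
    lu = [list(row) for row in a]  # work on a copy of the input
    piv = list(range(n))
    for k in range(n):
        # Pick the entry of largest magnitude in column k as the pivot.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        piv[k] = p
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
        # Eliminate below the pivot, storing each multiplier in place.
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            m = lu[i][k]
            for j in range(k + 1, n):
                lu[i][j] -= m * lu[k][j]
    return lu, piv


def lu_solve(lu, piv, b):
    """Solve A*x = b given the output of lu_factor."""
    n = len(lu)
    x = list(b)
    # Apply the recorded row swaps to the right-hand side.
    for k in range(n):
        x[k], x[piv[k]] = x[piv[k]], x[k]
    # Forward substitution with unit lower-triangular L.
    for i in range(n):
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    # Back substitution with upper-triangular U.
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


if __name__ == "__main__":
    A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
    b = [7.0, 19.0, 49.0]
    lu, piv = lu_factor(A)
    print(lu_solve(lu, piv, b))
```

The triple loop is the part that GPU libraries restructure into blocked matrix-matrix operations; the sketch keeps the unblocked form because it is easier to read.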
While the performance of GETRF is at parity for both libraries, for other routines the performance of CULA is leaps and bounds ahead of the competition. The following chart shows that CULA's SGESVD is far ahead of MAGMA's when finding both the U and V unitary matrices. This performance gap exists because CULA contains a parallelized implementation of the step that generates the unitary matrices, where other implementations have left it as a CPU operation. This step consumes significant processing time and must be implemented for the GPU in order to see a speedup! The same holds true for the symmetric eigenvalue solver: CULA contains accelerated implementations of both the tridiagonalization step and the eigenvector generation stage.
While we strive for high performance in CULA, we'd like to reiterate that the CULA design philosophy is first and foremost focused on an error-free and accurate solver under any circumstances. As we have discussed before, here, here, and here, we are constantly testing our entire code base to make sure that every code change preserves stable and accurate results. Using these extensive tests, we are able to find and fix bugs and inaccuracies before we release the final product.
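To give a flavor of what an accuracy check for a linear solver can look like, the sketch below computes a scaled residual for a computed solution of A*x = b, in the style of the ratios commonly used when testing LAPACK-like routines. This is an illustrative example only and not a description of our actual test suite; the function name `scaled_residual` and the threshold of 30 are assumptions chosen for the example.

```python
import sys


def inf_norm_vec(v):
    """Infinity norm of a vector: largest absolute entry."""
    return max(abs(x) for x in v)


def inf_norm_mat(a):
    """Infinity norm of a matrix: largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in a)


def scaled_residual(a, x, b):
    """Return ||A*x - b|| / (||A|| * ||x|| * n * eps) in the infinity norm.

    Values of order 1 indicate a solution that is accurate to working
    precision; very large values indicate a wrong or unstable result.
    Assumes A and x are nonzero.
    """
    n = len(a)
    r = [sum(a[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
    eps = sys.float_info.epsilon  # double-precision machine epsilon
    return inf_norm_vec(r) / (inf_norm_mat(a) * inf_norm_vec(x) * n * eps)


if __name__ == "__main__":
    THRESHOLD = 30.0  # assumed pass/fail cutoff for this example
    A = [[4.0, 1.0], [2.0, 3.0]]
    b = [6.0, 8.0]
    good = [1.0, 2.0]    # exact solution
    bad = [1.001, 2.0]   # slightly perturbed solution
    for label, x in (("good", good), ("bad", bad)):
        ratio = scaled_residual(A, x, b)
        print(label, "PASS" if ratio < THRESHOLD else "FAIL")
```

Scaling by the matrix norm, the solution norm, the problem size, and machine epsilon makes the same threshold meaningful across matrices of very different sizes and magnitudes.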