We've received a number of questions regarding the performance of our latest CULA Sparse release. Unlike the dense domain, the performance of sparse problems can change drastically depending on the structure and size of the matrix. In this blog post, we'll analyze the performance of a large real-world problem that was a perfect candidate for GPU acceleration.
Obtained from the University of Florida Sparse Matrix Collection, the matrix Schmid/thermal2 represents a steady-state thermal problem (FEM) on an unstructured grid. This is a fairly large matrix with 1.2 million rows and 8.5 million non-zero elements. It's worth noting that this problem needs only about 100 MB of storage, so it easily fits on even entry-level GPU offerings.
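As a quick sanity check on that figure, a back-of-the-envelope estimate of compressed sparse row (CSR) storage lands right around 100 MB. This assumes 8-byte double-precision values and 4-byte integer indices; the exact layout CULA Sparse uses internally may differ:

```python
# Rough CSR storage estimate for Schmid/thermal2
# (assumed layout: 8-byte doubles, 4-byte column indices and row pointers).
rows = 1_200_000
nnz = 8_500_000

values_bytes = nnz * 8          # one double per non-zero
col_idx_bytes = nnz * 4         # one 32-bit column index per non-zero
row_ptr_bytes = (rows + 1) * 4  # one 32-bit offset per row, plus one

total_mb = (values_bytes + col_idx_bytes + row_ptr_bytes) / 1e6
print(f"~{total_mb:.0f} MB")    # prints ~107 MB
```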
Like many FEM problems, the resulting matrix is symmetric positive definite, so the conjugate gradient (CG) solver was chosen. Using this solver, we tried all of the preconditioners available in CULA Sparse.
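For readers unfamiliar with the method, unpreconditioned CG is short enough to sketch in a few lines. This is a toy pure-Python version on a tiny dense matrix, purely illustrative; CULA Sparse works on sparse storage formats and runs on the GPU:

```python
def conjugate_gradient(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A (dense, list-of-lists)."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                      # residual r = b - A x (x starts at zero)
    p = r[:]                      # initial search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:          # converged: residual is small enough
            break
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x

# 2x2 SPD example; the exact solution is [1/11, 7/11]
x = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

A preconditioner (ILU0, Jacobi, etc.) reshapes the system so fewer iterations are needed, at the cost of extra work per iteration; that trade-off is exactly what the results below illustrate.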
| Preconditioner | CPU time (s) | GPU time (s) | CPU iterations | GPU iterations |
|----------------|--------------|--------------|----------------|----------------|
| ILU + Reorder  | 211.2        | 54.04        | 1789           | 1789           |
As demonstrated above, the GPU showed an appreciable speedup for all of the preconditioner methods. In the best case, with no preconditioner selected, the GPU was over 10x faster than the CPU! However, on the more serial CPU, the best time was achieved using the ILU0 preconditioner. Interestingly, ILU0 was not the best choice on the GPU: while it halved the number of iterations, the overhead it introduced became a bottleneck, and the un-preconditioned version achieved the best wall clock time. Comparing the best GPU algorithm to the best CPU algorithm, we still see an 8.5x speedup!
All timing benchmarks in this example were performed on an NVIDIA Tesla C2050 GPU and an Intel Xeon X5660 CPU. The CPU results were obtained with fully optimized MKL libraries, while the GPU results were obtained with CULA Sparse S1. All transfer overheads are included.
On Thursday, July 21, at 9 AM Pacific time, I will be presenting a session in NVIDIA's CUDA Webinar Series. I will cover the basics of CULA, with emphasis on the new link interface feature from R12. We'll even demo some live applications to show the power of this interface, followed by live Q&A with the attendees. You must register for this event at NVIDIA's page.
One of the exciting new features in CULA R12 is the link interface. In a previous blog post we introduced the features of this new tool, and today we'll demonstrate how to use this interface with the popular computing tool MATLAB.
MATLAB has a feature that allows you to externally specify a library for your LAPACK and BLAS calls. Typically this feature is used when your architecture does not perform well with the libraries included with MATLAB. However, you can also use it to route those calls through the GPU-accelerated CULA libraries and boost performance! This is achieved by simply changing a few environment variables -- there are no MEX files to compile, no clunky gpuArray objects, and no changes to MATLAB function names!
The first variables that need to be set are LAPACK_VERSION and BLAS_VERSION. These are specific to MATLAB and should each point to the cula_lapack_link.dll file (cula_lapack_link.so on Linux).
The next variables relate to the CULA link library. A useful option is the CULA_DEBUG_LOG environment variable; when set, it writes messages to a log file so you can see which functions the CULA library handled. For 64-bit versions of MATLAB, also set the CULA_ILP64 flag, because MATLAB uses 64-bit integers internally.
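Putting the pieces together, the environment can be prepared by any launcher you like; here is a small Python sketch. The library path is a hypothetical placeholder for your install location, and the flag values of "1" are an assumption (the variables only need to be set):

```python
import os

# Hypothetical install path -- adjust for your system.
CULA_LIB = "/usr/local/cula/lib64/cula_lapack_link.so"

def cula_matlab_env(lib_path):
    """Build an environment that points MATLAB's BLAS/LAPACK at CULA."""
    env = os.environ.copy()
    env["LAPACK_VERSION"] = lib_path  # MATLAB: route LAPACK calls here
    env["BLAS_VERSION"] = lib_path    # MATLAB: route BLAS calls here
    env["CULA_ILP64"] = "1"           # 64-bit MATLAB uses 64-bit integers
    env["CULA_DEBUG_LOG"] = "1"       # log which functions CULA handles
    return env

env = cula_matlab_env(CULA_LIB)
# import subprocess
# subprocess.run(["matlab"], env=env)  # uncomment to launch MATLAB
```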
On Windows, an easy way to use CULA-accelerated MATLAB is through a batch file. Simply create a new .bat file that sets the environment variables and launches the MATLAB executable. For convenience, we have provided a Windows batch file to do just that. Simply place this file in your MATLAB bin folder alongside the standard matlab.bat file. Be sure that the CULA bin path is also in your Windows path so the appropriate libraries can be loaded.
Running the new batch file will launch MATLAB with CULA acceleration enabled. Running a few simple commands, we can see that our linear algebra functions (matrix multiplication, QR decomposition, and a linear solve) are running faster:
>> tic; A = A*A'; toc;
Elapsed time is 3.414187 seconds.
>> tic; [q,r] = qr(B); toc;
Elapsed time is 11.318329 seconds.
>> tic; x = C \ b; toc;
Elapsed time is 19.133406 seconds.
Contrast this with the CPU implementation, where the same operations take up to 8x as long to complete!
>> tic; C = A*A'; toc;
Elapsed time is 7.035089 seconds.
>> tic; [q,r] = qr(B); toc;
Elapsed time is 49.837156 seconds.
>> tic; x = C \ b; toc;
Elapsed time is 151.153907 seconds.
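The per-operation speedups behind that "up to 8x" claim follow directly from the elapsed times reported above:

```python
# Elapsed times (seconds) reported above: (CPU, GPU) per operation.
timings = {
    "A*A'":  (7.035089, 3.414187),
    "qr(B)": (49.837156, 11.318329),
    "C \\ b": (151.153907, 19.133406),
}
for op, (cpu, gpu) in timings.items():
    print(f"{op}: {cpu / gpu:.1f}x")  # prints 2.1x, 4.4x, 7.9x
```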
Many functions in MATLAB use LAPACK under the hood. Other MATLAB routines that will automatically be accelerated include (but are not limited to):
- matrix multiply (*)
- matrix solve (\)
More information about the link interface can be found in the link_interface.txt file contained in the doc folder of your CULA install.
If you have any questions, please ask on our forums!
Edited on January 23, 2012 to update all occurrences of cula_link to cula_lapack_link.