## CULA Sparse – Performance

We've had plenty of questions regarding the performance of the upcoming CULA Sparse package - hopefully the following performance plot will answer some of those questions!

Here, we have plotted the performance of CULA Sparse (beta-1) against the performance of another GPU library, CUSP (0.2), and an optimized CPU library, Intel MKL (10.3). As you can see, the GPU accelerated libraries perform over **a magnitude faster** than the CPU counterpart with CULA coming out about 10-20% faster than CUSP!

For this benchmark, we measured the throughput of the conjugate gradient (CG) iterative solver in GB/s such that the execution time is related to the size of the matrix. The CPU benchmarks were obtained using dual hex-core Intel Xeon X5560s (all 12 cores active) and the GPU benchmarks were obtained using an NVIDIA C2050. No preconditioners were used and all solvers converged within very similar iteration counts.

Stay tuned for more performance numbers and the upcoming CULA Sparse (beta-2) release!

## Sparse 101: Matrix Formats

With the release of the CULA Sparse beta, we thought it would be useful to present an introduction to sparse matrix formats. Traditionally, a matrix is considered sparse when the number of non-zero elements is significantly less than the number of zero elements. When represented in a program, sparse matrices, unlike the dense matrices used in CULA’s LAPACK functions, are a logical, compressed representation of a matrix. Whereas a dense matrix represents all elements of a matrix, including zeroes, a sparse matrix will represent only non-zero elements. An immediate benefit of this approach is that algorithmic speed improvements can be made by disregarding the zero elements for many operations. An arguably more important benefit (and a focus of this article) is that a representation that stores only non-zero elements allows the total memory used by a sparse matrix to be significantly less than it would be if it were stored densely.

Consider an 8x8 matrix, shown to the right. In this matrix, only 12 of the 64 entries (18%) are populated. If we were to adopt a sparse storage format for this matrix, we could reduce the storage by ~60%, from 512 bytes down to 192 with a compressed format.

The simplest compressed format, coordinate (COO), represents a matrix by its non-zero values and an index at which each non-zero is located. For example, in the example matrix, the value 3.0 is located at (2,2) using 1-based indexing. These indices do add to the storage cost for the matrix, but because the number of non-zeros is small, there is a net gain when compared with a dense representation. The full representation of this matrix in COO is the following:

values = [ 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 ] column index = [ 1 1 2 2 3 3 4 4 5 6 7 8 ] row index = [ 1 6 2 8 1 4 5 7 3 6 2 8 ]

There are several other popular sparse matrix formats in addition to coordinate format. Although coordinate format is the easiest to understand and implement, it is not always the preferred format, because other formats, such as compressed sparse row format (CSR), can increase the compression at the expense of a little bit more work. In fact, CSR is the preferred format for CULA Sparse, because of its size advantage and amenability to GPU acceleration.

In the above example, we showed only 18% of the entries as non-zero, but it is common for the sparsity (number of non-zeros) of the matrix to be much larger for some problem domains. Lower sparsity leads to lower storage requirements, which means that we can fit larger and larger problem within the memory that is available to us. For example, whereas in CULA matrices that can be solved on a typical workstation typically max out at about 16K by 16K, in CULA Sparse matrices can be as large as 100M by 100M, depending on its sparsity.

## Accelerate MATLAB with the CULA Link Interface

One the exciting new features in CULA R12 is the link interface. In a previous blog post we introduced the features of this new tool and today we'll demonstrate how to easily use this interface with the popular computing tool MATLAB.

MATLAB has a feature that allows you to externally specify a library for your LAPACK and BLAS calls. Typically this feature is used if your architecture does not perform well with the libraries included with MATLAB. However, you can also use this feature to utilize GPU accelerated CULA libraries to boost performance! This is achieved by simply changing a few environment variables -- there are no MEX files to compile, no clunky gpuArray objects, and no changes MATLAB function names!

The first variables that need to be set are: LAPACK_VERSION and BLAS_VERSION. These are specific to MATLAB and should each be pointed to the 'cula_lapack_link.dll' file (cula_lapack_link.so on Linux).

The next variables that should be set are related to the CULA link library. A useful option is the CULA_DEBUG_LOG environment variable, which when set will write messages to a log file that will allow you see to see for which functions the CULA library is called. For 64-bit versions of MATLAB, set the CULA_ILP64 flag because MATLAB uses 64-bit integers internally.

On Windows, an easy way to use CULA-accelerated MATLAB is through the use of a batch file. Simply create a new .bat file with to set the environment variables and launch the MATLAB executable. For convenience, we have provided a Windows batch file to do just that. Simply place this file in your MATLAB bin folder alongside the standard matlab.bat file. Be sure that the CULA bin path is also in your Windows path so the appropriate libraries can be loaded.

Running the new batch file will launch MATLAB with CULA acceleration enabled. Running a few simple commands we can see that our linear algebra functions (matrix multiplication, QR, and SVD decomposition) are running faster:

>> tic; A = A*A'; toc; Elapsed time is 3.414187 seconds. >> tic; [q,r] = qr(B); toc; Elapsed time is 11.318329 seconds. >> tic; x = C \ b; toc; Elapsed time is 19.133406 seconds.

Contrast this to the CPU implementation where the operations take up to 8x as long to complete!

>> tic; C = A*A'; toc; Elapsed time is 7.035089 seconds. >> tic; [q,r] = qr(B); toc; Elapsed time is 49.837156 seconds. >> tic; x = C \ b; toc; Elapsed time is 151.153907 seconds.

Many functions in MATLAB use LAPACK under the hood. Other MATLAB routines that will automatically be accelerated include (but are not limited to):

- matrix multiply (*)
- matrix solve (\)
- svd
- eig
- inv

More information about the link interface can be found in the link_interface.txt file contained in the doc folder of your CULA install.

If you have any questions, please ask on our forums!

Edited on January 23, 2012 to update all occurrences of cula_link to cula_lapack_link.