One of the exciting new features in CULA R12 is the link interface. In a previous blog post we introduced the features of this new tool, and today we'll demonstrate how to easily use this interface with the popular computing tool MATLAB.
MATLAB has a feature that allows you to externally specify a library for your LAPACK and BLAS calls. Typically this feature is used if your architecture does not perform well with the libraries included with MATLAB. However, you can also use this feature to utilize GPU-accelerated CULA libraries to boost performance! This is achieved by simply changing a few environment variables -- there are no MEX files to compile, no clunky gpuArray objects, and no changes to MATLAB function names!
The first variables that need to be set are: LAPACK_VERSION and BLAS_VERSION. These are specific to MATLAB and should each be pointed to the 'cula_lapack_link.dll' file (cula_lapack_link.so on Linux).
The next variables that should be set are related to the CULA link library. A useful option is the CULA_DEBUG_LOG environment variable, which when set will write messages to a log file that lets you see which functions are handled by the CULA library. For 64-bit versions of MATLAB, set the CULA_ILP64 flag because MATLAB uses 64-bit integers internally.
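On Linux, for example, these variables could be set from the shell before starting MATLAB. The following is a minimal sketch; the install path /usr/local/cula is an assumption for illustration, so adjust it to your actual CULA location:

```shell
#!/bin/sh
# Point MATLAB's externally loaded LAPACK and BLAS at the CULA link library.
# The install path below is an example; use your actual CULA location.
export LAPACK_VERSION=/usr/local/cula/lib64/cula_lapack_link.so
export BLAS_VERSION=/usr/local/cula/lib64/cula_lapack_link.so

# Optional: have CULA write a log of the functions it handles.
export CULA_DEBUG_LOG=1

# Required for 64-bit MATLAB, which uses 64-bit integers internally.
export CULA_ILP64=1

echo "LAPACK_VERSION=$LAPACK_VERSION"
```

With these exported, any MATLAB session started from the same shell will route its LAPACK and BLAS calls through the CULA link library.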
On Windows, an easy way to use CULA-accelerated MATLAB is through the use of a batch file. Simply create a new .bat file to set the environment variables and launch the MATLAB executable. For convenience, we have provided a Windows batch file to do just that. Simply place this file in your MATLAB bin folder alongside the standard matlab.bat file. Be sure that the CULA bin path is also in your Windows path so the appropriate libraries can be loaded.
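The batch-file approach has a direct counterpart on Linux: a small launcher script that sets the variables and then starts MATLAB. This is a sketch rather than a shipped file, and the install paths are assumptions for illustration:

```shell
#!/bin/sh
# Launch MATLAB with the CULA link library enabled (illustrative paths).
CULA_ROOT=/usr/local/cula

export LAPACK_VERSION="$CULA_ROOT/lib64/cula_lapack_link.so"
export BLAS_VERSION="$CULA_ROOT/lib64/cula_lapack_link.so"
export CULA_ILP64=1   # needed for 64-bit MATLAB

# The Linux equivalent of adding the CULA bin path to the Windows PATH:
# make sure the loader can find CULA's shared libraries.
export LD_LIBRARY_PATH="$CULA_ROOT/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

if command -v matlab >/dev/null 2>&1; then
    exec matlab "$@"
else
    echo "matlab not found on PATH; environment is set up" >&2
fi
```

Saving this as, say, matlab-cula and marking it executable gives a one-command way to start an accelerated session, mirroring the Windows batch file.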
Running the new batch file will launch MATLAB with CULA acceleration enabled. Running a few simple commands, we can see that our linear algebra operations (matrix multiplication, QR decomposition, and a linear system solve) are running faster:
>> tic; A = A*A'; toc;
Elapsed time is 3.414187 seconds.
>> tic; [q,r] = qr(B); toc;
Elapsed time is 11.318329 seconds.
>> tic; x = C \ b; toc;
Elapsed time is 19.133406 seconds.
Contrast this to the CPU implementation, where the same operations take up to 8x as long to complete!
>> tic; C = A*A'; toc;
Elapsed time is 7.035089 seconds.
>> tic; [q,r] = qr(B); toc;
Elapsed time is 49.837156 seconds.
>> tic; x = C \ b; toc;
Elapsed time is 151.153907 seconds.
Many functions in MATLAB use LAPACK under the hood. MATLAB routines that will automatically be accelerated include (but are not limited to):
- matrix multiply (*)
- matrix solve (\)
More information about the link interface can be found in the link_interface.txt file contained in the doc folder of your CULA install.
If you have any questions, please ask on our forums!
Edited on January 23, 2012 to update all occurrences of cula_link to cula_lapack_link.
We are very pleased to announce that CULA R12, based on CUDA 4.0, is available immediately at our downloads page.
Besides CUDA 4.0 support, this release also introduces the new link-compatible interface, which allows for zero-effort porting of existing codes which use LAPACK routines. For more information, please see the documentation included or read our blog post about the feature.
Michael Feldman from HPCWire wrote a very interesting piece on CRAY's first GPU Supercomputer - the XK6, a pretty impressive system that combines AMD X86 processors with NVIDIA GPUs. In fact, the news is being covered by all major media outlets and you may have already read about it either directly from CRAY or your favorite news site.
We enjoyed reading Feldman's coverage of the story because he mentioned some details in addition to the information provided by CRAY on their May 24 press release. One of the details he mentioned was about CRAY's plans to offer third-party GPU software libraries like CULA:
"Cray also will be developing additional GPU compilers, runtime libraries, and tools, as well as bringing in third-party software, such as EM Photonics' CULA library, to make the environment richer and more productive. The idea here is to bring GPU acceleration in line with its Adaptive Supercomputing approach. The ultimate goal is to be able to write source code that could automatically be transformed to run on either CPUs, GPUs or some mix of the two. The goal is not just to deliver performance, says (Barry) Bolding, but to "get your codes to better performance faster."
Read HPCWire's full story and feel free to ask us any questions that you may have.