pCULA: Multi-GPU Support

by Kyle

CULA Dense R14 previews an exciting new feature in the CULA libraries - multi-GPU support through the new pCULA function family!

The pCULA routines found within the CULA Dense library attempt to utilize multiple GPUs and CPUs within a single system in an effort to increase both performance and maximum problem sizes when compared to the standard CULA Dense library. This is achieved by utilizing different algorithms to distribute linear algebra problems across the GPUs and CPUs of a system.

IMPORTANT!  Please note that the pCULA routines are in an alpha state and should not be used in any production code. They are merely a preview to demonstrate a sub-set of multi-GPU routines to be included in a future release of CULA Dense. It is expected that performance, accuracy, hardware requirements, and interfaces will change between the alpha and final releases.

While pCULA is still in an alpha state, the basic functionality will not change much between now and the final release. We aim to provide an interface that is simple to use, yet customizable for users who need fine-grained control over multiple devices. For example, the following code snippet shows all that is needed to call a pCULA function; the only added step is creating and initializing the control structure.

#include "cula_scalapack.h"
// ...
pculaConfiguration config;
culaStatus status;

// initialize the control structure with default settings
status = pculaConfigInit( &config );

// multi-GPU LU factorization; check status after each call
status = pculaDgetrf( &config, m, n, data, ld, IPIV );

The performance of pCULA is designed to scale well for multi-GPU systems. The following chart shows the performance of a double precision Cholesky factorization (pPOTRF) when using an additional GPU.

It can be expected that as the pCULA routines move toward a final release, more functions, better performance, and new features will be added! If you have any questions, comments, or concerns about the pCULA routines, please visit our forums.


CULA Dense R14 and Sparse S2 – Now Supporting CUDA 4.1

by John

We're pleased to announce the release of our latest CULA Dense and Sparse versions, with full compatibility for CUDA 4.1. A major highlight of R14 is the inclusion of a preview of multi-GPU LAPACK routines, called the pCULA branch of CULA Dense. Again, this is a preview designed to show potential performance as well as an interface which will likely continue to evolve over time. The new multi-GPU routines are:

pculaGetrf (LU decomposition)
pculaGetrs (LU solve)
pculaGesv (general system solve via LU)
pculaPotrf (Cholesky decomposition)
pculaPotrs (Cholesky solve)
pculaPosv (Hermitian/symmetric positive-definite system solve)
pculaTrsm (BLAS triangular system solve)
pculaGemm (BLAS general matrix multiply)

An upcoming blog post will contain more on the usage and expectations of these routines, but a simple example is quite easy to create:


pculaConfiguration config;
culaStatus status = pculaConfigInit(&config);
// some users may wish to tweak the default options here;
// the default is to use all CUDA devices and to allow the routine
// to select the parameters it feels are best

status = pculaPotrf(&config, m, n, A, lda);

As always, the new features are accompanied by bug fixes and speed and stability improvements. The full release notes for R14 and S2 are available at the dense downloads page and the sparse downloads page, respectively.


Debugging with CULA Sparse

by Dan

CULA Sparse offers a unique debugging feature. When enabled, this feature performs extra checks on your matrix. Our recommended use case is to enable debugging mode when getting started with the library or when you run into a problem. Once you have fixed any issues you encounter (if you encounter none, good for you!), you can switch debugging mode off to make sure you are running at full performance.

Currently, one of the most important things that debugging mode enables is a check to ensure that your matrix is well-formed. In a previous post, I discussed sparse matrix formats. CULA Sparse, being flexible, provides an indexing parameter for you to specify whether your data is one- or zero-based. It is a very common error, however, that users do not specify their index or matrix data correctly when they use the library. Debugging mode helps here because it can identify when there is a mismatch between the actual matrix data and the specified indexing.

In future revisions of CULA Sparse, there is an opportunity to introduce even more options, such as introducing a check that helps to steer you towards a good solver. For example, BiCG is intended only for symmetric matrices; if you use a non-symmetric matrix with it, you are likely to get poor performance. In a future release, we may check for this case and report to you if you are using a solver incorrectly.

We think that providing developer-oriented and ease-of-use features is just as important as performance, although of course we provide that in spades. If you haven’t tried CULA Sparse yet, try out the demo and see how our combination of performance and ease-of-use works for you!