pCULA: Multi-GPU Support

by Kyle

CULA Dense R14 previews an exciting new feature in the CULA libraries - multi-GPU support through the new pCULA function family!

The pCULA routines found within the CULA Dense library attempt to utilize multiple GPUs and CPUs within a single system in an effort to increase both performance and maximum problem sizes when compared to the standard CULA Dense library. This is achieved by utilizing different algorithms to distribute linear algebra problems across the GPUs and CPUs of a system.

IMPORTANT!  Please note that the pCULA routines are in an alpha state and should not be used in any production code. They are merely a preview to demonstrate a sub-set of multi-GPU routines to be included in a future release of CULA Dense. It is expected that performance, accuracy, hardware requirements, and interfaces will change between the alpha and final releases.

While pCULA is still in alpha state, the basic functionality will not change much between now and the final release. We aim to provide a simple to use interface that will be easy to use, yet customizatable for user that need fine grain control over multiple devices. For example, the following code snippet shows all that is needed to utilize a pCULA function. The only added step is the creation and initializing of the control structure.

#include "cula_scalapack.h"
// ...
pculaConfiguration config;
culaStatus status;
status = pculaConfigInit( &config );
status = pculaDgetrf( &config, m, n, data, ld, IPIV );

The performance of pCULA is designed to scale well for multi-GPU systems. The following chart shows the performance of a double precision Cholesky factorization (pPOTRF) when using an addition GPU.

It can be expected that as the pCULA routines  move towards a final release more functions, performance, and features will be added! If you have any questions, comments, or concerns about the pCULA routines, please visit our forums.


Debugging with CULA Sparse

by Dan

CULA Sparse offers a unique debugging feature. When enabled, this feature allows you to perform extra checks on your matrix. Our recommended use case is to use debugging mode when getting started running the library or if you run into a problem. Once you have fixed any any issues you might encounter (if you encounter none, good for you!), you can switch off debugging mode to make sure you are running at full performance.

Currently, one of the most important things that debugging mode enables is a check to ensure that your matrix is well-formed. In a previous post, I discussed sparse matrix formats. CULA Sparse, being flexible, provides an indexing parameter for you to specify whether your data is one- or zero-based. It is a very common error, however, that users do not specify their index or matrix data correctly when they use the library. Debugging mode helps here because it can identify when there is a mismatch between the actual matrix data and the specified indexing.

In future revisions of CULA Sparse, there is an opportunity to introduce even more options, such as introducing a check that helps to steer you towards a good solver. For example, BiCG is intended only for symmetric matrices; if you use a non-symmetric matrix with it, you are likely to get poor performance. In a future release, we may check for this case and report to you if you are using a solver incorrectly.

We think that providing developer-oriented features and ease-of-use features are just as important as performance, although of course we provide that in spades. If you haven’t tried CULA Sparse yet, try out the demo and see how our combination or performance and ease-of-use work for you!


Batched Operations

by John

Readers of our forum will know that we often receive questions about the batch case for dense linear algebra operators. These are cases where the user has many very small matrices which can be solved simultaneously. Some areas where we have seen this are in image processing (inverse or SVD either per-pixel or in a stencil region) and fluid dynamics (solve a 5x5 matrix at each node).

It's worth starting with a statement regarding CPU performance for these problems. If you consider that each of the problems is O(N3), you find that a 5x5 inverse requires only a hundred or so FLOPS, which puts the solution time for a single matrix into the microseconds regime. The process of even initiating GPU work is an order of magnitude or two larger than this, which suggests that there must be a significant number of small operations before the GPU can make a noticeable difference in overall performance.

In a similar vein, the GPU prefers having tens of thousands of simultaneous threads in order to get peak performance. If you consider the 5x5 matrix, the theoretical maximum number of usable threads would be one per matrix element, which is 25 in this case. This would lead to needing over 400 simultaneous matrices for performance. In reality, the number of practical threads to use per matrix is more like 1 or 5, meaning many thousands of simultaneous matrices are needed.

At this point, we can illustrate why CUDA streams aren't the proper solution to this particular problem. For background, CUDA streams allow the programmer to specify that two kernels launched are independent (in terms of data required) and thus the card is free to execute them simultaneously. One common use for this is to eliminate the so-called "tail effect" which is where the first kernel is finishing and so some of the hardware is idle and waiting for the rest to finish - streaming allows the next kernel to begin using that idle hardware before the first kernel has fully completed. A second use is to allow two or more kernels to occupy the card simultaneously, which is good if neither using all of the hardware. The batching case we are describing certainly falls into the latter category.

One could, in theory, use a CUDA stream per problem and launch one problem at a time. This would be ill-performing for two reasons. First is that the number of threads per block would be far too low; we typically want no fewer than 32 for that, and we have already motivated why that is not practical for one matrix. Second is that the overhead incurred by launching thousands of operations in this manner would be unacceptable, because the launch code is as expensive (if not more expensive) as just performing the matrix on the CPU. The realistic approach here would be to collect elements and then group them into thread blocks, but again the time to form the batches from the streams would be more expensive than just performing the operation.

In response to this problem, NVIDIA's CUBLAS has introduced a feature called Batch Mode in the newest developer versions. At present this is available for the matrix multiply routine. The Batch mode is intended for solving this problem effectively by allowing the user to request solution for a group of identical problems at once, albeit on different data. This is a SIMD operation, just at a slightly higher level than we normally associate with that term. Our experience with this interface is that it is competitive with the CPU for a specific set of circumstances (indeed, these are the circumstances for which the CUBLAS Batch Mode was intended) - which are very small problems (N<32) and very large batch sizes (N>1000).

As for CULA, we have considered this approach but found that batch sizes on the order of thousands are required in order to gain competitive performance versus the CPU, which is why such solvers are not generally available in the package. It is our hope that some day we can find a solution that gets good performance on the GPU for a wide range of matrix sizes as well as varying batch sizes, but for now we are pursuing this work only on a case-by-case basis, tuned for a user's exact needs. For more information, please see our contact page.