CULA Dense R14 previews an exciting new feature in the CULA libraries: multi-GPU support through the new pCULA function family!
The pCULA routines in the CULA Dense library attempt to use multiple GPUs and CPUs within a single system to increase both performance and the maximum problem size compared with the standard CULA Dense library. They do this by using different algorithms to distribute linear algebra problems across the system's GPUs and CPUs.
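The announcement does not say which distribution scheme pCULA uses. As a minimal sketch of one common way to spread a matrix across several devices, the hypothetical C helpers below (`owner_of_column` and `columns_on_device`, which are not part of the CULA API) assign column blocks to devices in a one-dimensional block-cyclic pattern, a simplified version of the layout used by ScaLAPACK-style libraries.

```c
/* Hypothetical helper (not part of pCULA): under a 1D block-cyclic
   layout with block width nb, column block (col / nb) is assigned to
   device (col / nb) % ndev, so the blocks cycle over all devices. */
int owner_of_column(int col, int nb, int ndev)
{
    return (col / nb) % ndev;
}

/* Hypothetical helper: count how many of the n columns device dev owns. */
int columns_on_device(int n, int nb, int ndev, int dev)
{
    int count = 0;
    for (int col = 0; col < n; ++col)
        if (owner_of_column(col, nb, ndev) == dev)
            ++count;
    return count;
}
```

For example, with a block width of 2 and two devices, columns 0 and 1 go to device 0, columns 2 and 3 to device 1, columns 4 and 5 back to device 0, and so on. Cycling blocks, rather than cutting the matrix into one contiguous piece per device, keeps every device busy as a factorization's active region shrinks toward the lower-right corner.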
IMPORTANT! Please note that the pCULA routines are in an alpha state and should not be used in production code. They are merely a preview of a subset of the multi-GPU routines to be included in a future release of CULA Dense. Expect performance, accuracy, hardware requirements, and interfaces to change between the alpha and final releases.
While pCULA is still in an alpha state, its basic functionality will not change much between now and the final release. We aim to provide an interface that is simple to use, yet customizable for users who need fine-grained control over multiple devices. For example, the following code snippet shows everything needed to call a pCULA function; the only added step is creating and initializing the control structure.
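```c
#include "cula_scalapack.h"

// ...

pculaConfiguration config;
culaStatus status;

// Create the control structure and fill it with default settings
status = pculaConfigInit( &config );

// Multi-GPU LU factorization of the m x n matrix "data" (leading dimension ld);
// row interchanges are returned in IPIV
status = pculaDgetrf( &config, m, n, data, ld, IPIV );
```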
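Judging by its name and arguments, `pculaDgetrf` appears to follow the LAPACK `DGETRF` convention: an LU factorization with partial pivoting, with row interchanges returned in `IPIV`. That is an assumption, not something stated above. To show what such a call computes, here is a hypothetical single-threaded reference, `ref_dgetrf` (not CULA code), assuming column-major storage and 1-based pivot indices as in LAPACK. It is only meant for checking small results by hand.

```c
#include <math.h>

/* Hypothetical reference (not a CULA routine): computes A = P * L * U in
   place, column-major, LAPACK-style. L is unit lower triangular (its unit
   diagonal is not stored); ipiv[j] holds the 1-based row swapped with row
   j+1. Returns 0, or j+1 for the first column j whose pivot is exactly 0. */
int ref_dgetrf(int m, int n, double *a, int lda, int *ipiv)
{
    int info = 0;
    int kmax = m < n ? m : n;
    for (int j = 0; j < kmax; ++j) {
        /* Partial pivoting: pick the largest |a(i,j)| for i >= j. */
        int p = j;
        for (int i = j + 1; i < m; ++i)
            if (fabs(a[i + j * lda]) > fabs(a[p + j * lda]))
                p = i;
        ipiv[j] = p + 1; /* 1-based, as in LAPACK */

        if (a[p + j * lda] == 0.0) {
            /* Column is all zero below the diagonal: record and move on. */
            if (info == 0)
                info = j + 1;
            continue;
        }
        /* Swap rows j and p across all n columns. */
        if (p != j)
            for (int c = 0; c < n; ++c) {
                double t = a[j + c * lda];
                a[j + c * lda] = a[p + c * lda];
                a[p + c * lda] = t;
            }
        /* Scale the subdiagonal entries to form column j of L. */
        for (int i = j + 1; i < m; ++i)
            a[i + j * lda] /= a[j + j * lda];
        /* Rank-1 update of the trailing submatrix. */
        for (int c = j + 1; c < n; ++c)
            for (int i = j + 1; i < m; ++i)
                a[i + c * lda] -= a[i + j * lda] * a[j + c * lda];
    }
    return info;
}
```

For the 2 x 2 matrix [[2, 1], [4, 3]], stored column-major as {2, 4, 1, 3}, the routine swaps the two rows, stores U = [[4, 3], [0, -0.5]] with the multiplier 0.5 below the diagonal, and sets the pivots to {2, 2}.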
The performance of pCULA is designed to scale well on multi-GPU systems. The following chart shows the performance of a double-precision Cholesky factorization (pPOTRF) when an additional GPU is used.
As the pCULA routines move toward a final release, expect more functions, better performance, and additional features! If you have any questions, comments, or concerns about the pCULA routines, please visit our forums.