New CULA Feature! Banded Solvers

by Kyle
Banded Matrix

A banded matrix has non-zero values only on the main diagonal and on a limited number of diagonals directly above and below it
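To make the picture concrete, here is a small sketch of how a banded matrix can be packed into compact band storage, following the LAPACK general-band convention AB(ku+1+i-j, j) = A(i, j). The helper name `dense_to_band` and the example matrix are made up for illustration and are not part of CULA.

```python
def dense_to_band(a, kl, ku):
    """Pack a dense n x n matrix into (kl + ku + 1) x n band storage.

    kl and ku are the number of sub- and super-diagonals. Entry a[i][j]
    is stored at ab[ku + i - j][j] (0-based indices), which mirrors the
    LAPACK convention AB(ku+1+i-j, j) = A(i, j) (1-based indices).
    """
    n = len(a)
    ab = [[0.0] * n for _ in range(kl + ku + 1)]
    for j in range(n):
        # Only rows inside the band of column j are copied.
        for i in range(max(0, j - ku), min(n, j + kl + 1)):
            ab[ku + i - j][j] = a[i][j]
    return ab


# Hypothetical tridiagonal example: one sub-diagonal, one super-diagonal.
a = [
    [4.0, 1.0, 0.0, 0.0],
    [2.0, 5.0, 1.0, 0.0],
    [0.0, 2.0, 6.0, 1.0],
    [0.0, 0.0, 2.0, 7.0],
]
ab = dense_to_band(a, kl=1, ku=1)
for row in ab:
    print(row)
```

Each row of `ab` holds one diagonal of the original matrix, so storage grows with the bandwidth times n rather than n squared. Note that LAPACK's xGBTRF expects kl extra rows on top of this layout to hold fill-in from pivoting.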

In the upcoming CULA release, we are pleased to announce our first offering of GPU-accelerated banded matrix solvers! As far as we know, these are the first GPU-accelerated banded solvers publicly available. The new functions are based on the LAPACK routines xGBTRF and xPBTRF, which perform triangular factorization on general band matrices and symmetric positive definite band matrices, respectively. Once factorized, these matrices can be easily solved by xTBTRS and xPBTRS.
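The factorize-then-solve pattern can be sketched in plain Python for the symmetric positive definite case. This is only an illustration of the idea behind an xPBTRF-style Cholesky factorization followed by an xPBTRS-style pair of triangular solves; it is not CULA's or LAPACK's implementation, it uses dense storage for readability, and the function names and test matrix are assumptions made for the example.

```python
import math


def band_cholesky(a, k):
    """Return lower-triangular L with A = L * L^T.

    Assumes A is symmetric positive definite with half-bandwidth k
    (a[i][j] == 0 whenever |i - j| > k). All loops stay inside the band.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = a[j][j] - sum(l[j][p] ** 2 for p in range(max(0, j - k), j))
        l[j][j] = math.sqrt(s)
        for i in range(j + 1, min(n, j + k + 1)):
            s = a[i][j] - sum(l[i][p] * l[j][p] for p in range(max(0, i - k), j))
            l[i][j] = s / l[j][j]
    return l


def band_cholesky_solve(l, k, b):
    """Solve A x = b given the band Cholesky factor L of A."""
    n = len(l)
    # Forward substitution: L y = b.
    y = [0.0] * n
    for i in range(n):
        s = b[i] - sum(l[i][p] * y[p] for p in range(max(0, i - k), i))
        y[i] = s / l[i][i]
    # Back substitution: L^T x = y.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = y[i] - sum(l[p][i] * x[p] for p in range(i + 1, min(n, i + k + 1)))
        x[i] = s / l[i][i]
    return x


# Hypothetical SPD tridiagonal system (k = 1) with known solution [1, 2, 3, 4].
a = [
    [4.0, 1.0, 0.0, 0.0],
    [1.0, 4.0, 1.0, 0.0],
    [0.0, 1.0, 4.0, 1.0],
    [0.0, 0.0, 1.0, 4.0],
]
b = [6.0, 12.0, 18.0, 19.0]
l = band_cholesky(a, k=1)
x = band_cholesky_solve(l, k=1, b=b)
print(x)
```

The factorization is the expensive step and can be reused for any number of right-hand sides, which is why LAPACK separates the factorization routines from the solve routines.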

Unlike the general matrix solvers, the performance of these banded solvers scales with the bandwidth of the matrix rather than its size.  This scaling is a result of the BLAS-based implementation, which breaks the band into large square and triangular chunks that are worked on separately.  Because of this segmentation, the performance curve looks very similar to that of the general matrix solver, xGETRF.  You'll need a bandwidth of at least 700 before the GPU outperforms the CPU.  However, at large bandwidths over 5000, the GPU reaches speedups of over 10x compared to the CPU!
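For a rough sense of how much arithmetic a band factorization involves compared with a dense one, the textbook leading-order estimates are about 2*n*kl*ku operations for a band LU without pivoting (valid when n is much larger than kl and ku) and about (2/3)*n^3 for a dense LU. The sketch below simply evaluates those two formulas; the matrix size and the choice kl = ku are assumptions for illustration, and the output says nothing about measured CULA timings.

```python
def dense_lu_flops(n):
    """Leading-order operation count for a dense n x n LU factorization."""
    return 2.0 * n**3 / 3.0


def band_lu_flops(n, kl, ku):
    """Leading-order operation count for a band LU without pivoting.

    kl and ku are the number of sub- and super-diagonals; the estimate
    assumes n is much larger than both.
    """
    return 2.0 * n * kl * ku


n = 100_000  # hypothetical matrix size
for bw in (100, 700, 5000):
    band = band_lu_flops(n, bw, bw)
    dense = dense_lu_flops(n)
    print(f"kl = ku = {bw}: band/dense operation ratio = {band / dense:.2e}")
```

For a fixed n, the operation count grows with the product of the lower and upper bandwidths, so wider bands give each square or triangular chunk more work to do.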

Since the performance of these functions scales with bandwidth, we are calling them the "large band solvers".  We also plan to add other banded solvers that use different algorithms, which scale with matrix size rather than bandwidth. These will be known as the "thin band solvers" and will be available in a future CULA release.

Banded matrices are common in many fields of scientific computing that require solving large coupled systems, such as computational fluid dynamics, optimization, and structural engineering.  If you find any of these solvers useful, please leave us feedback and let us know!

Comments (2)
  1. CUDA 3.2 and CULA R10, along with porting from Windows 7 x64 to Fedora 13 x64, gave a GETRF speedup of a factor of ten over the previous release (5000 to 500 milliseconds). I don’t know which part contributed how much.
    I hope the new banded solver contributes even more gains. The fact that these matrices can be easily solved by xTBTRS and xPBTRS is not apparent in the manual though. Each routine is independently documented so the association is up to user knowledge and experience.

  2. I’d imagine the R10 upgrade would be responsible for the speed-up. We typically see the same performance on Windows and Linux when using Tesla-line cards.
