4Nov/11Off

CULA Sparse – Real World Results

by Kyle

We've received a number of questions regarding the performance of our latest CULA Sparse release. Unlike the dense domain, the performance of sparse problems can change drastically depending on the structure and size of the matrix. In this blog post, we'll analyze the performance of a large real-world problem that was a perfect candidate for GPU acceleration.

Obtained from the The University of Florida Sparse Matrix Collection, the matrix Schmid/thermal2 is a steady state thermal problem (FEM) on an unstructured grid. This is a fairly large matrix with 1.2 million rows and 8.5 million non-zero elements. It's worth noting that this problem only needs about 100 MB of storage so it can easily fit on even an entry level GPU offerings.

Like many FEM problems, the resulting matrix representation is positive definite so the conjugate gradient (CG) solver was chosen. Using this solver, we tried all of the available preconditioners available in CULA Sparse.

Time Iterations
Method CPU GPU CPU GPU
None 246.6 24.57 4589 4589
ILU 208.5 74.61 1946 1947
ILU + Reorder 211.2 54.04 1789 1789
Jacobi 250.0 29.49 4558 4555
Block Jacobi 271.9 31.99 4694 4694

As demonstrated above, the GPU showed an appreciable speedup for all of the preconditioner methods. In the best case, with no preconditioner selected, the GPU was over 10x faster than the CPU! However, on the more serial CPU, the best time was achieved using the ILU0 preconditioner. Interestingly enough, the ILU0 preconditioner was not the best choice on the GPU. While this preconditioner did half the number of iterations, the overhead introduced became a bottleneck and the un-preconditioned version has the lowest wall clock performance. Comparing the best GPU algorithm to the best CPU algorithm we still see an 8.5x speedup!

All timing benchmarks obtained in this example were performed using an NVIDIA C2050 and an Intel X5660. The CPU results were calculated using fully optimized MKL libraries while the GPU results were obtained with CULA Sparse S1. All transfer overheads are included.

Comments (0) Trackbacks (0)

Sorry, the comment form is closed at this time.

Trackbacks are disabled.