29 Nov 2011

Introducing the CULA Sparse Demo

by Dan

We are very pleased to announce that we have recently released a free demo for CULA Sparse. The demo is a standalone, command-line driven program that lets you choose your options and see the performance of a particular routine. All solvers and most features provided by CULA Sparse are supported.

For example, to run the demo with a CG solver and Jacobi preconditioner, you can use the command below. The demo accepts matrices in the Matrix Market format (.mtx). For information on this format, see the resources provided by NIST's Matrix Market site.

iterativeBenchmark solver=cg preconditioner=jacobi A=myfile.mtx b=ones tolerance=1e-5
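For reference, a Matrix Market coordinate file is plain text: a header line, optional comment lines starting with `%`, a size line (rows, columns, non-zeros), and then one `row col value` triple per line with 1-based indices. The following Python sketch parses a tiny, made-up example file (the contents and function name are illustrative, not part of the demo):

```python
# Parse a tiny Matrix Market (.mtx) coordinate file into COO triplets.
# The file contents below are a made-up example, not a real demo input.
mtx_text = """%%MatrixMarket matrix coordinate real general
% comment lines start with '%'
3 3 4
1 1 2.0
2 2 3.0
3 3 4.0
1 3 -1.0
"""

def parse_mtx(text):
    # Drop the banner and comment lines; what remains is the size line
    # followed by one entry per line.
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("%")]
    n_rows, n_cols, nnz = map(int, lines[0].split())
    entries = []
    for ln in lines[1:1 + nnz]:
        r, c, v = ln.split()
        entries.append((int(r) - 1, int(c) - 1, float(v)))  # 1-based -> 0-based
    return n_rows, n_cols, entries

n_rows, n_cols, coo = parse_mtx(mtx_text)
print(n_rows, n_cols, len(coo))  # -> 3 3 4
```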



The CULA Sparse demo is powerful because it allows you to easily try out several different solvers, preconditioners, and other features without coding or building any software. And once you've found the combination of inputs that is ideal for you, you can easily transition this knowledge into your CULA Sparse implementation.

Download the CULA Sparse demo today and see how our GPU-accelerated solvers can work for you.

15 Nov 2011

Directly from SC11

by Liana

The entire CULA team is here in Seattle, and everyone is pumped up for the first big day of action. Last night, at the opening gala, we were pleased to see familiar faces all around us. It's not an easy showroom to navigate, but we hope our users will find us at booth #244. A number of people came by our booth to ask about CULA Sparse, as well as a few scavenger hunters (fun!), and we hope this will be another great show for everyone. Today we will be catching up with our partners to find out what their vision of the SC market is and how we can work together and contribute to their strategies.

By the way, it is TODAY that John Humphrey will be giving his presentation on CULA Sparse and all of the great features added to the CULA Dense library!  We hope you can make it!

What: Exhibitor Forums: Advances in the CULA Linear Algebra Library 
Where: 613/614

Enjoy the show!

4 Nov 2011

CULA Sparse – Real World Results

by Kyle

We've received a number of questions regarding the performance of our latest CULA Sparse release. Unlike the dense domain, the performance of sparse problems can change drastically depending on the structure and size of the matrix. In this blog post, we'll analyze the performance of a large real-world problem that was a perfect candidate for GPU acceleration.

Obtained from the University of Florida Sparse Matrix Collection, the matrix Schmid/thermal2 represents a steady-state thermal problem (FEM) on an unstructured grid. This is a fairly large matrix, with 1.2 million rows and 8.5 million non-zero elements. It's worth noting that this problem needs only about 100 MB of storage, so it easily fits on even entry-level GPU offerings.
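The ~100 MB figure is easy to sanity-check with back-of-the-envelope arithmetic, assuming a CSR layout with double-precision values (8 bytes each) and 32-bit integer indices (4 bytes each):

```python
# Rough CSR storage estimate for a 1.2M-row matrix with 8.5M non-zeros,
# assuming 8-byte double values and 4-byte integer indices.
rows = 1_200_000
nnz = 8_500_000
values_bytes  = nnz * 8          # one double per non-zero
col_idx_bytes = nnz * 4          # one column index per non-zero
row_ptr_bytes = (rows + 1) * 4   # one row pointer per row, plus one
total_mb = (values_bytes + col_idx_bytes + row_ptr_bytes) / 1e6
print(round(total_mb), "MB")  # -> 107 MB
```

The exact total depends on the index width and storage format, but any reasonable choice lands in the same ~100 MB ballpark.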

Like many FEM problems, the resulting matrix is symmetric positive definite, so the conjugate gradient (CG) solver was chosen. Using this solver, we tried all of the preconditioners available in CULA Sparse.

Preconditioner    Time (s)           Iterations
                  CPU      GPU       CPU     GPU
None              246.6    24.57     4589    4589
ILU0              208.5    74.61     1946    1947
ILU0 + Reorder    211.2    54.04     1789    1789
Jacobi            250.0    29.49     4558    4555
Block Jacobi      271.9    31.99     4694    4694

As demonstrated above, the GPU showed an appreciable speedup for every preconditioner method. In the best case, with no preconditioner selected, the GPU was over 10x faster than the CPU! On the more serial CPU, however, the best time was achieved using the ILU0 preconditioner. Interestingly, ILU0 was not the best choice on the GPU: while it more than halved the number of iterations, the overhead it introduced became a bottleneck, and the un-preconditioned version had the lowest wall-clock time. Comparing the best GPU algorithm to the best CPU algorithm, we still see an 8.5x speedup!
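To illustrate what the Jacobi preconditioner in the table actually does, here is a minimal, pure-Python sketch of preconditioned CG on a small dense SPD system. Jacobi preconditioning simply scales by the inverse of the matrix diagonal; this is an illustration of the algorithm only, not CULA Sparse's sparse, GPU-based implementation.

```python
# Minimal Jacobi-preconditioned conjugate gradient (CG) on a small dense
# symmetric positive definite system. Illustrative sketch only.

def cg_jacobi(A, b, tol=1e-5, max_iter=100):
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]   # Jacobi: M^-1 = 1/diag(A)
    x = [0.0] * n
    r = b[:]                                       # r = b - A*x with x = 0
    z = [inv_diag[i] * r[i] for i in range(n)]     # z = M^-1 * r
    p = z[:]
    rz = sum(r[i] * z[i] for i in range(n))
    for it in range(1, max_iter + 1):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rz / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        if sum(ri * ri for ri in r) ** 0.5 < tol:  # converged: ||r|| < tol
            return x, it
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = sum(r[i] * z[i] for i in range(n))
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x, max_iter

# A small SPD test system (made up for illustration).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 5.0]]
b = [1.0, 2.0, 3.0]
x, iters = cg_jacobi(A, b)
```

A stronger preconditioner such as ILU0 cuts iterations further, but each application costs more per iteration, which is exactly the trade-off visible in the table above.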

All timing benchmarks obtained in this example were performed using an NVIDIA C2050 and an Intel X5660. The CPU results were calculated using fully optimized MKL libraries while the GPU results were obtained with CULA Sparse S1. All transfer overheads are included.