## CULA Sparse – Real World Results

We've received a number of questions regarding the performance of our latest CULA Sparse release. Unlike the dense domain, the performance of sparse problems can change drastically depending on the structure and size of the matrix. In this blog post, we'll analyze the performance of a large real-world problem that was a perfect candidate for GPU acceleration.

Obtained from the University of Florida Sparse Matrix Collection, the matrix Schmid/thermal2 comes from a steady-state thermal problem (FEM) on an unstructured grid. This is a fairly large matrix, with 1.2 million rows and 8.5 million non-zero elements. It's worth noting that this problem needs only about 100 MB of storage, so it fits easily on even entry-level GPU offerings.

Like many FEM problems, the resulting matrix is positive definite, so the conjugate gradient (CG) solver was chosen. Using this solver, we tried all of the preconditioners available in CULA Sparse.

| Method | CPU Time | GPU Time | CPU Iterations | GPU Iterations |
|---|---|---|---|---|
| None | 246.6 | 24.57 | 4589 | 4589 |
| ILU | 208.5 | 74.61 | 1946 | 1947 |
| ILU + Reorder | 211.2 | 54.04 | 1789 | 1789 |
| Jacobi | 250.0 | 29.49 | 4558 | 4555 |
| Block Jacobi | 271.9 | 31.99 | 4694 | 4694 |

As the table shows, the GPU achieved an appreciable speedup for every preconditioner method. In the best case, with no preconditioner selected, the GPU was **over 10x faster** than the CPU! On the more serial CPU, however, the best time was achieved with the ILU0 preconditioner. Interestingly, ILU0 was not the best choice on the GPU: while this preconditioner roughly halved the number of iterations, the overhead it introduced became a bottleneck, and the un-preconditioned version had the best wall-clock time. Comparing the best GPU algorithm to the best CPU algorithm, we still see an **8.5x speedup**!

All timing benchmarks in this example were performed using an NVIDIA C2050 and an Intel X5660. The CPU results were obtained with fully optimized MKL libraries, and the GPU results with CULA Sparse S1. All transfer overheads are included.

## Interpreting CULA Sparse Results

One design goal for CULA Sparse was to give the user informative output, so as to avoid the need for verbose checking routines. The routine culaIterativeResultString() is key here. It accepts a culaIterativeResult structure, which is an output from each CULA Sparse solver (it is the last parameter). The output produced is shown below:

```
Solver:      Cg
Precond:     Block Jacobi (block size 16)
Flag:        Converged successfully in 27 iterations
Residual:    8.424304e-07
Total Time:  0.02827s (overhead + precond + solve)
   Overhead: 0.000569s
   Precond:  2.8e-05s
   Solve:    0.02767s
```

You will notice that basic stats are produced, such as the solver and preconditioner used. The Flag field helps to interpret the mathematical status of the solve process. The example here shows a successful convergence in 27 iterations, but the Flag can also indicate conditions such as solver stagnation (failing to make progress for several consecutive iterations) or numerical breakdown. The Residual field indicates the quality of the final answer.

There is then a timing output block, which shows a total execution time plus a breakdown of where the time was spent. The Overhead field shows time spent for GPU-specific operations such as device memory allocation and transfer. The Precond field shows the total time required to *generate* the preconditioner, because the time required to generate a given preconditioner can vary wildly among different matrices and different preconditioners. The final field, Solve, shows the time taken for the actual system solution.

In addition to the culaIterativeResult output, each solver *returns* a culaStatus that is used to indicate important runtime information, such as incorrect parameters (specifying a matrix size less than zero, for example) or not having the proper version of the CUDA driver installed. Users of CULA Dense will already be familiar with this parameter. In all cases, it is recommended to first check the returned status and then obtain the iterative result string. The examples in your CULA Sparse installation clearly show how to integrate this into your code.

## Sparse 101: Calling a Sparse System Solve

In this post, I’ll show you how to call a matrix solve using our soon-to-be-released CULA Sparse package. For this example, I will show the main steps in calling our Cg solver with a Block Jacobi preconditioner for a CSR matrix with double-precision data.

The first set of parameters to consider is the matrix system to be solved, Ax=b. For these inputs, you need to consider which matrix formats and which precision you are using; see this blog post for a discussion of matrix formats. The relevant parameters for this system are:

- n - size of the matrix system
- nnz - number of non-zero elements
- A, colInd, rowPtr - a CSR representation of the matrix to be solved
- x - the solution vector
- b - the right-hand-side to solve against

These parameters will be passed directly to a function with “Dcsr” in its name to denote the double-precision data (D) and CSR representation (csr), such as in the line below:

```c
culaDcsr{Function}(..., N, nnz, A, colInd, rowPtr, x, b, ...);
```

Now that I’ve discussed the matrix system, the next parameter to consider is the configuration structure, which sets options common to all the solvers. Among these options are the relative solution tolerance, maximum number of iterations, maximum runtime, indexing format, and whether debugging mode is enabled. An example configuration may look like:

```c
culaIterativeConfig config;
culaIterativeConfigInit(&config);
config.maxIterations = 1000;
config.tolerance = 1e-6;
config.indexing = 1;
```

As you can see, setting up problem options is an easy task. Each option is clear and self-documenting. The config initialization routine sets sensible defaults for most problems, but it’s worth double-checking that they meet your needs and overriding them where necessary, as we have done above.

The last set of options to consider are the solver- and preconditioner-specific options. These are set with a pair of structures that are initialized similarly to the general configuration structure. To use the Cg solver with a Block Jacobi preconditioner, you would write:

```c
culaCgOptions solverOptions;
culaCgOptionsInit(&solverOptions);

culaBlockjacobiOptions preOptions;
culaBlockjacobiOptionsInit(&preOptions);
preOptions.blockSize = 4;
```

Above, we have default-initialized both structures and then overridden the Block Jacobi preconditioner's block size. Because every solver and preconditioner is initialized in the same way, trying out different solvers and preconditioners is an easy task.

Putting all of the parameters together, we end up with the following call:

```c
culaDcsrCgBlockjacobi(&config, &solverOptions, &preOptions,
                      N, nnz, A, colInd, rowPtr, x, b, &result);
```

That’s it! We built CULA Sparse so that it would be easy to set up and work with different options while making sure that the code is always clear about what it is doing.