Debugging with CULA Sparse

CULA Sparse offers a unique debugging feature. When enabled, this feature allows you to perform extra checks on your matrix. Our recommended use case is to use debugging mode when getting started running the library or if you run into a problem. Once you have fixed any any issues you might encounter (if you encounter none, good for you!), you can switch off debugging mode to make sure you are running at full performance.
Currently, one of the most important things that debugging mode enables is a check to ensure that your matrix is well-formed. In a previous post, I discussed sparse matrix formats. CULA Sparse, being flexible, provides an indexing parameter for you to specify whether your data is one- or zero-based. It is a very common error, however, that users do not specify their index or matrix data correctly when they use the library. Debugging mode helps here because it can identify when there is a mismatch between the actual matrix data and the specified indexing.
In future revisions of CULA Sparse, there is an opportunity to introduce even more options, such as introducing a check that helps to steer you towards a good solver. For example, BiCG is intended only for symmetric matrices; if you use a non-symmetric matrix with it, you are likely to get poor performance. In a future release, we may check for this case and report to you if you are using a solver incorrectly.
We think that providing developer-oriented features and ease-of-use features are just as important as performance, although of course we provide that in spades. If you haven’t tried CULA Sparse yet, try out the demo and see how our combination or performance and ease-of-use work for you!
Reordering to increase parallelism
Solving a dense triangular matrix (such as those produced by an LU factorization) is a serial process – each row must be solved before moving on to the next row. However, when solving a sparse triangular matrix, it’s possible to group these operations into levels where an entire level of unknowns can be solved in parallel before moving on to the next level. In the worst case, there are n levels and the entire process must be serialized. In the best case, there is one level and the entire process can be parallelized (this would only happen in a diagonal matrix).
As expected, one might seek to the decrease the number of levels in their sparse matrix in an effort to increase parallelism and (potentially) decrease the time it takes to solve their sparse triangular matrix. In many cases this can be achieved by reordering the sparse matrix. A matrix reordering simply involves swapping a number of rows and/or columns to alter the structure of the matrix while preserving the values. In many cases, reordering is performed to increase accuracy or reduce the fill-in of direct solve methods. However, in the case of CULA Sparse, we reorder the matrix to increase parallelism. In the sparse linear algebra domain, there exist a number of different reordering schemes (such as minimum degree, approximate minimum degree) but they are beyond the scope of this blog post.

This image illustrates how matrix reordering is applied to a sparse matrix. In this case, a large circuit simulation problem (AMD/G3_circuit) is reordered using the symmetric minimum degree reordering method.
One of the options for CULA Spare’s ILU0 preconditioner is reordering. This will (potentially) reduce the levels in the lower and upper triangular factors produced by the factorization in an effort to increase the parallelism. Since applying the ILU0 preconditioner through two triangular solves is typically a massive bottleneck, any speed-up here will directly decrease the total time needed to converge on a solution.
When applying matrix reordering to a real world matrix such as the circuit simulation problem introduced above, we can decrease the number of levels from 2594 to 15. This decreases the time to solve the triangular matrixes from 24.2 ms to 8.5 ms - a 2.8x speedup! When using the reordered ILU0 preconditioner with the conjugate gradient solver, we see the total time per iteration drop to 53.4 ms to 22.1 ms – 2.4x speedup!
In conclusion, CULA Sparse’s ILU0 reordering option (more info) can be used to drastically reduce the time it takes to apply the triangular factors produced by the LU factorization. However, one must also consider that the reordering step has a steep calculation overhead. Additionally, since the structure of the matrix is changing, some of the conjugate-based methods will now take a different number of iterations to converge on a solution.
Interpreting CULA Sparse Results
One design goal for CULA Sparse was to give the user informative output so to avoid the user having to write verbose checking routines. The routine culaIterativeResultString() is key here. This routine accepts a culaIterativeResult structure which is an output from each CULA Sparse solver (it is the last parameter). The output produced is shown below:
Solver: Cg Precond: Block Jacobi (block size 16) Flag: Converged successfully in 27 iterations Residual: 8.424304e-07 Total Time: 0.02827s (overhead + precond + solve) Overhead: 0.000569s Precond: 2.8e-05s Solve: 0.02767s
You will notice that basic stats are produced, such as the solver and preconditioner used. The Flag field helps to interpret the mathematical status of the solve process. The example here shows a successful convergence in 27 iterations, but the Flag can also indicate conditions such as solver stagnation (failing to make progress for several consecutive iterations) or numerical breakdown. The Residual field indicates the quality of the final answer.
There is then a timing output block, which shows a total execution time plus a breakdown of where the time was spent. The Overhead field shows time spent for GPU-specific operations such as device memory allocation and transfer. The Precond field shows the total time required to generate the preconditioner, because the time required to generate a given preconditioner can vary wildly among different matrices and different preconditioners. The final field, Solve, shows the time taken for the actual system solution.
In addition to the culaIterativeResult field, each solver returns a culaStatus that is used to indicate important runtime information, such as incorrect parameters (specifying a matrix size less than zero, for example) or not having the proper version of the CUDA driver installed. Users of CULA Dense will already be familiar with this parameter. In all cases, it is recommended to first check the returned status, followed then by obtaining the iterative result string. The examples in your CULA Sparse installation clearly show how to integrate this into your code.
