## Reordering to increase parallelism

Solving a dense triangular system (such as one produced by an LU factorization) is an inherently serial process – each row must be solved before moving on to the next. When solving a *sparse* triangular system, however, it's possible to group the unknowns into *levels*, where an entire level can be solved in parallel before moving on to the next level. In the worst case there are *n* levels and the entire process is serialized. In the best case there is a single level and the entire process can be parallelized (this happens only for a diagonal matrix).
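To make the idea of levels concrete, the following sketch (plain C, not CULA Sparse code) computes the level of every row of a lower triangular matrix stored in CSR format. All rows that share a level have no dependencies on one another and can be solved in parallel.

```c
/* Sketch: assign a level to each row of a sparse lower-triangular matrix
 * stored in CSR format (rowPtr/colInd include the diagonal entry).
 * Row i's level is one more than the deepest level it depends on. */
int compute_levels(int n, const int *rowPtr, const int *colInd, int *level)
{
    int numLevels = 0;
    for (int i = 0; i < n; ++i)
    {
        int lvl = 0;
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
        {
            int j = colInd[k];
            if (j < i && level[j] + 1 > lvl)   /* row i depends on row j */
                lvl = level[j] + 1;
        }
        level[i] = lvl;
        if (lvl + 1 > numLevels)
            numLevels = lvl + 1;
    }
    /* n levels in the worst case; 1 level for a diagonal matrix */
    return numLevels;
}
```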

As expected, one might seek to decrease the number of levels in a sparse matrix in an effort to increase parallelism and (potentially) decrease the time it takes to solve the sparse triangular systems. In many cases this can be achieved by reordering the sparse matrix. A matrix reordering simply swaps a number of rows and/or columns, altering the structure of the matrix while preserving its values. Reordering is often performed to improve accuracy or to reduce the fill-in of direct solve methods; in the case of CULA Sparse, however, we reorder the matrix to increase *parallelism*. The sparse linear algebra domain offers a number of different reordering schemes (such as minimum degree and approximate minimum degree), but they are beyond the scope of this blog post.

One of the options for CULA Sparse's ILU0 preconditioner is *reordering*. This will (potentially) reduce the number of levels in the lower and upper triangular factors produced by the factorization, increasing the available parallelism. Since applying the ILU0 preconditioner through two triangular solves is typically a massive bottleneck, any speedup here directly decreases the total time needed to converge on a solution.

When applying matrix reordering to a real-world matrix such as the circuit simulation problem introduced above, we can decrease the number of levels from 2594 to 15. This decreases the time to solve the triangular systems from 24.2 ms to 8.5 ms – a 2.8x speedup! When using the reordered ILU0 preconditioner with the conjugate gradient solver, the total time per iteration drops from 53.4 ms to 22.1 ms – a 2.4x speedup!

In conclusion, CULA Sparse's ILU0 reordering option can be used to drastically reduce the time it takes to apply the triangular factors produced by the incomplete LU factorization. However, one must also consider that the reordering step carries a nontrivial computational overhead of its own. Additionally, since the structure of the matrix changes, the iterative methods may now take a different number of iterations to converge on a solution.

## Batched Operations

Readers of our forum will know that we often receive questions about the batch case for dense linear algebra operations. These are cases where the user has many very small matrices that can be solved simultaneously. Some areas where we have seen this are image processing (an inverse or SVD either per pixel or in a stencil region) and fluid dynamics (solving a 5x5 matrix at each node).

It's worth starting with a statement regarding CPU performance for these problems. If you consider that each of these problems is O(N³), you find that a 5x5 inverse requires only a hundred or so FLOPs, which puts the solution time for a single matrix into the microseconds regime. The process of even initiating GPU work is an order of magnitude or two larger than this, which suggests that there must be a significant number of small operations before the GPU can make a noticeable difference in overall performance.

In a similar vein, the GPU prefers to have tens of thousands of simultaneous threads in flight in order to reach peak performance. If you consider the 5x5 matrix, the theoretical maximum number of usable threads is one per matrix element, which is 25 in this case. This would mean needing over 400 simultaneous matrices just to fill the machine. In reality, the practical number of threads per matrix is more like 1 to 5, meaning many thousands of simultaneous matrices are needed.

At this point, we can illustrate why CUDA streams aren't the proper solution to this particular problem. For background, CUDA streams allow the programmer to specify that two launched kernels are independent (in terms of the data they require) and thus the card is free to execute them simultaneously. One common use is to eliminate the so-called "tail effect," where the first kernel is finishing and some of the hardware sits idle waiting for the rest to complete – streaming allows the next kernel to begin using that idle hardware before the first kernel has fully finished. A second use is to allow two or more kernels to occupy the card simultaneously, which is beneficial if neither kernel uses all of the hardware on its own. The batching case we are describing certainly falls into the latter category.

One could, in theory, use a CUDA stream per problem and launch one problem at a time. This would perform poorly for two reasons. First, the number of threads per block would be far too low; we typically want no fewer than 32, and we have already shown why that is not practical for a single matrix. Second, the overhead incurred by launching thousands of operations in this manner would be unacceptable, because the launch cost is as expensive (if not more expensive) than simply processing the matrix on the CPU. The realistic approach would be to collect the matrices and group them into thread blocks, but again, the time to form these batches from the streams would be more expensive than just performing the operation.

In response to this problem, NVIDIA's CUBLAS has introduced a feature called Batch Mode in the newest developer versions. At present this is available for the matrix multiply routine. Batch Mode is intended to address this problem by allowing the user to request the solution of a group of identically sized problems at once, albeit on different data. This is a SIMD operation, just at a slightly higher level than we normally associate with that term. Our experience with this interface is that it is competitive with the CPU for a specific set of circumstances (indeed, the circumstances for which the CUBLAS Batch Mode was intended) – very small problems (N < 32) and very large batch counts (over 1000 matrices).
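For readers who want to experiment with this interface, the sketch below shows the general shape of a batched matrix multiply using cublasSgemmBatched with a hypothetical batch of 10,000 5x5 matrices. Error checking and data initialization are omitted, and the exact pointer-array convention should be checked against the CUBLAS documentation for your CUDA version, since this interface is new.

```c
/* Minimal sketch of CUBLAS Batch Mode: C_i = A_i * B_i for many small matrices.
 * Data initialization and error checking are omitted for brevity. */
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <stdlib.h>

int main(void)
{
    const int n = 5;           /* each matrix is 5x5                    */
    const int batch = 10000;   /* thousands of matrices are needed for speed */
    const float alpha = 1.0f, beta = 0.0f;

    /* One contiguous buffer per operand; matrix i starts at offset i*n*n. */
    float *dA, *dB, *dC;
    cudaMalloc((void**)&dA, sizeof(float) * n * n * batch);
    cudaMalloc((void**)&dB, sizeof(float) * n * n * batch);
    cudaMalloc((void**)&dC, sizeof(float) * n * n * batch);
    /* ... fill dA and dB with the input matrices here ... */

    /* The batched routine takes device arrays of per-matrix pointers. */
    const float **hA = (const float**)malloc(batch * sizeof(float*));
    const float **hB = (const float**)malloc(batch * sizeof(float*));
    float **hC = (float**)malloc(batch * sizeof(float*));
    for (int i = 0; i < batch; ++i) {
        hA[i] = dA + i * n * n;
        hB[i] = dB + i * n * n;
        hC[i] = dC + i * n * n;
    }
    const float **dAarr; const float **dBarr; float **dCarr;
    cudaMalloc((void**)&dAarr, batch * sizeof(float*));
    cudaMalloc((void**)&dBarr, batch * sizeof(float*));
    cudaMalloc((void**)&dCarr, batch * sizeof(float*));
    cudaMemcpy(dAarr, hA, batch * sizeof(float*), cudaMemcpyHostToDevice);
    cudaMemcpy(dBarr, hB, batch * sizeof(float*), cudaMemcpyHostToDevice);
    cudaMemcpy(dCarr, hC, batch * sizeof(float*), cudaMemcpyHostToDevice);

    /* A single call performs all 10,000 multiplies. */
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       n, n, n, &alpha,
                       dAarr, n, dBarr, n, &beta, dCarr, n, batch);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cudaFree((void*)dAarr); cudaFree((void*)dBarr); cudaFree((void*)dCarr);
    free(hA); free(hB); free(hC);
    return 0;
}
```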

As for CULA, we have considered this approach but found that batch sizes on the order of thousands are required in order to gain competitive performance versus the CPU, which is why such solvers are not generally available in the package. It is our hope that some day we can find a solution that gets good performance on the GPU for a wide range of matrix sizes as well as varying batch sizes, but for now we are pursuing this work only on a case-by-case basis, tuned for a user's exact needs. For more information, please see our contact page.

## Interpreting CULA Sparse Results

One design goal for CULA Sparse was to give the user informative output so as to avoid the user having to write verbose checking routines. The routine culaIterativeResultString() is key here. This routine accepts a culaIterativeResult structure, which is an output from each CULA Sparse solver (it is the last parameter). The output produced is shown below:

```
Solver:      Cg
Precond:     Block Jacobi (block size 16)
Flag:        Converged successfully in 27 iterations
Residual:    8.424304e-07
Total Time:  0.02827s (overhead + precond + solve)
  Overhead:  0.000569s
  Precond:   2.8e-05s
  Solve:     0.02767s
```

You will notice that basic stats are produced, such as the solver and preconditioner used. The Flag field helps to interpret the mathematical status of the solve process. The example here shows a successful convergence in 27 iterations, but the Flag can also indicate conditions such as solver stagnation (failing to make progress for several consecutive iterations) or numerical breakdown. The Residual field indicates the quality of the final answer.

There is then a timing output block, which shows the total execution time plus a breakdown of where the time was spent. The Overhead field shows time spent on GPU-specific operations such as device memory allocation and transfer. The Precond field shows the total time required to *generate* the preconditioner; this is reported separately because generation time can vary wildly among different matrices and different preconditioners. The final field, Solve, shows the time taken for the actual system solution.

In addition to the culaIterativeResult field, each solver *returns* a culaStatus that is used to indicate important runtime information, such as incorrect parameters (specifying a matrix size less than zero, for example) or not having the proper version of the CUDA driver installed. Users of CULA Dense will already be familiar with this parameter. In all cases, it is recommended to first check the returned status and then obtain the iterative result string. The examples in your CULA Sparse installation clearly show how to integrate this into your code.
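To illustrate the recommended flow, here is a minimal sketch. The solver call itself is elided, the header name is assumed, and the argument order of culaIterativeResultString() (result structure, character buffer, buffer size) is an assumption patterned after other CULA string routines – check the reference manual for the exact prototypes.

```c
#include <stdio.h>
#include <cula_sparse.h>   /* assumed header name for the CULA Sparse API */

void report(culaStatus status, const culaIterativeResult* result)
{
    /* 1) Check the runtime status first: bad parameters, driver or library
     *    problems, and other non-mathematical failures show up here. */
    if (status != culaNoError)
    {
        printf("CULA Sparse reported a runtime error (status %d)\n", (int)status);
        return;
    }

    /* 2) Then interpret the mathematical outcome of the solve. */
    char buf[1024];
    culaIterativeResultString(result, buf, sizeof(buf)); /* assumed argument order */
    printf("%s\n", buf);
}
```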