## Not enough HPC programmers. How to fill the gap?

Engineers with top-notch parallel programming experience are in high demand in the U.S. This fact was recently pointed out in stories published by the mainstream Daily Beast, as well as HPC Wire. A quote from Stan Ahalt in the Daily Beast story caught my attention: “It’s not enough to keep building powerful supercomputers unless we have the brains. Think of a supercomputer as a very fast racing engine. We need more drivers to use those engines.” Stan is the director of a supercomputing center at the University of North Carolina at Chapel Hill.

Programming supercomputers is hard work. Those involved in programming large HPC systems go through in-depth training and spend months (sometimes years) fine-tuning their algorithms until they fully leverage the massive computing power these machines offer. There is a growing number of tools and libraries for HPC programmers, but they are not necessarily suitable for all levels of computer engineers. For non-HPC experts, programming small- to mid-scale systems can be a challenging and time-consuming task, something we hear quite often from our customers and partners.

**Where EM Photonics Can Make a Difference**

Companies with recently installed small- to mid-scale supercomputing systems often need help porting their applications to their new machines. This is where we bring tremendous value. We are easy to engage with and offer an in-depth understanding of parallel architectures. On top of parallel programming expertise, we bring knowledge and experience in physics-based modeling and simulation, image processing, life sciences, finance, and military and defense applications. (Typically, the bigger the problem, the greater the fun!)

We encourage you to take a peek at our EM Photonics site to learn more about our consulting services, as well as our current research projects and published papers. We have a team of talented engineers looking forward to tackling new challenges. Just let us know how we can help!

## Reordering to increase parallelism

Solving a dense triangular matrix (such as those produced by an LU factorization) is a serial process – each row must be solved before moving on to the next row. However, when solving a sparse triangular matrix, it’s possible to group these operations into *levels*, where an entire level of unknowns can be solved in parallel before moving on to the next level. In the worst case, there are *n* levels and the entire process must be serialized. In the best case, there is one level and the entire process can be parallelized (this would only happen with a diagonal matrix).
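To make the level idea concrete, here is a minimal Python sketch (illustrative only, not CULA code) that assigns each row of a lower-triangular sparsity pattern to a level. A row's level is one more than the highest level among the rows it depends on; rows within the same level have no dependencies on each other and can be solved in parallel.

```python
# Level scheduling for a sparse lower-triangular solve (illustrative sketch).
# rows[i] is the set of column indices j < i where L has a nonzero,
# i.e. the rows that row i depends on.

def compute_levels(rows):
    level = {}
    for i in range(len(rows)):
        deps = rows[i]
        # A row with no off-diagonal dependencies goes into level 1.
        level[i] = 1 + max((level[j] for j in deps), default=0)
    return level

# Hypothetical 5x5 lower-triangular pattern (diagonal nonzeros implied):
# row 1 depends on row 0; row 3 depends on row 2; row 4 on rows 1 and 3.
pattern = [set(), {0}, set(), {2}, {1, 3}]
levels = compute_levels(pattern)
num_levels = max(levels.values())
print(levels)      # {0: 1, 1: 2, 2: 1, 3: 2, 4: 3}
print(num_levels)  # 3 levels instead of 5 fully serial steps
```

Here the worst case would be 5 serial steps, but the dependency structure collapses them into 3 levels, two of which contain rows that can be solved simultaneously.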

As expected, one might seek to decrease the number of levels in their sparse matrix in an effort to increase parallelism and (potentially) decrease the time it takes to solve their sparse triangular matrix. In many cases this can be achieved by reordering the sparse matrix. A matrix reordering simply involves swapping a number of rows and/or columns to alter the structure of the matrix while preserving the values. Often, reordering is performed to increase accuracy or reduce the fill-in of direct solve methods. However, in the case of CULA Sparse, we reorder the matrix to increase *parallelism*. In the sparse linear algebra domain, there exist a number of different reordering schemes (such as minimum degree and approximate minimum degree), but they are beyond the scope of this blog post.
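As a sketch of the idea (illustrative only, not CULA Sparse's actual algorithm), a symmetric permutation can group rows of the same level together so that each level's unknowns are contiguous and can be dispatched as one parallel batch. The `levels` mapping is assumed to come from a level-scheduling analysis of the triangular factor.

```python
# Illustrative sketch: build a symmetric row/column permutation that groups
# rows of the same level together. `levels` maps row index -> level number
# (assumed output of a level-scheduling analysis of the triangular factor).

def level_permutation(levels):
    # Stable sort by level: ties keep their original row order.
    return sorted(levels, key=lambda i: levels[i])

levels = {0: 1, 1: 2, 2: 1, 3: 2, 4: 3}  # hypothetical 5-row example
perm = level_permutation(levels)
print(perm)  # [0, 2, 1, 3, 4] -- level-1 rows first, then level 2, then 3
```

After permuting rows and columns by `perm`, each level occupies a contiguous block of unknowns, which maps naturally onto a parallel solve of one block at a time.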

One of the options for CULA Sparse’s ILU0 preconditioner is *reordering*. This will (potentially) reduce the number of levels in the lower and upper triangular factors produced by the factorization in an effort to increase parallelism. Since applying the ILU0 preconditioner through two triangular solves is typically a massive bottleneck, any speed-up here directly decreases the total time needed to converge on a solution.

When applying matrix reordering to a real-world matrix such as the circuit simulation problem introduced above, we can decrease the number of levels from 2594 to 15. This decreases the time to solve the triangular matrices from 24.2 ms to 8.5 ms – a 2.8x speedup! When using the reordered ILU0 preconditioner with the conjugate gradient solver, we see the total time per iteration drop from 53.4 ms to 22.1 ms – a 2.4x speedup!

In conclusion, CULA Sparse’s ILU0 reordering option can be used to drastically reduce the time it takes to apply the triangular factors produced by the LU factorization. However, one must also consider that the reordering step itself carries a significant computational overhead. Additionally, since the structure of the matrix changes, some of the conjugate gradient-based methods may take a different number of iterations to converge on a solution.

## Batched Operations

Readers of our forum will know that we often receive questions about the batch case for dense linear algebra operations. These are cases where the user has many very small matrices that can be solved simultaneously. Some areas where we have seen this are image processing (inverse or SVD, either per-pixel or in a stencil region) and fluid dynamics (solving a 5x5 matrix at each node).

It's worth starting with a statement regarding CPU performance for these problems. If you consider that each of these problems is O(N³), you find that a 5x5 inverse requires only a hundred or so floating-point operations, which puts the solution time for a single matrix into the microsecond regime. The cost of even initiating GPU work is an order of magnitude or two larger than this, which suggests that there must be a significant number of small operations before the GPU can make a noticeable difference in overall performance.
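The back-of-the-envelope numbers behind that claim can be checked directly. The CPU rate and launch latency below are assumed round figures for illustration, not measurements:

```python
# Rough cost model for the argument above (all figures are assumptions).
# A dense NxN inverse costs on the order of N**3 flops, so a 5x5 inverse
# is ~125 flops and runs in well under a microsecond on a modern CPU,
# while a single GPU kernel launch costs on the order of ~10 microseconds.

N = 5
flops = N ** 3                  # order-of-magnitude cost of one small inverse
cpu_flops_per_sec = 10e9        # assumed sustained CPU rate
cpu_time_us = flops / cpu_flops_per_sec * 1e6
launch_overhead_us = 10.0       # assumed GPU kernel-launch latency

# Matrices needed just to amortize a single kernel launch:
break_even = launch_overhead_us / cpu_time_us
print(f"{cpu_time_us:.4f} us per matrix on the CPU")   # 0.0125 us
print(f"~{break_even:.0f} matrices to amortize one launch")  # ~800
```

Under these assumptions, hundreds of matrices are needed before the launch overhead alone is paid for, before any actual GPU compute advantage appears.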

In a similar vein, the GPU prefers having tens of thousands of simultaneous threads in order to reach peak performance. If you consider the 5x5 matrix, the theoretical maximum number of usable threads would be one per matrix element, which is 25 in this case. This would lead to needing over 400 simultaneous matrices for performance. In reality, the practical number of threads per matrix is more like one to five, meaning many thousands of simultaneous matrices are needed.
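The occupancy arithmetic works out as follows (the thread target is an assumed round number standing in for "tens of thousands"):

```python
# Rough occupancy arithmetic from the paragraph above (assumed figures).
target_threads = 10_000            # threads needed to keep a GPU busy
best_case = target_threads // 25   # one thread per element of a 5x5 matrix
practical = target_threads // 5    # more realistic threads-per-matrix count
print(best_case, practical)        # 400 vs. 2000 simultaneous matrices
```

Even in the unrealistic best case of one thread per element, hundreds of matrices are required; with a practical thread count per matrix, the requirement grows into the thousands.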

At this point, we can illustrate why CUDA streams aren't the proper solution to this particular problem. For background, CUDA streams allow the programmer to specify that two launched kernels are independent (in terms of the data they require), so the card is free to execute them simultaneously. One common use is to eliminate the so-called "tail effect," where the first kernel is finishing and some of the hardware sits idle waiting for the rest to complete – streaming allows the next kernel to begin using that idle hardware before the first kernel has fully finished. A second use is to allow two or more kernels to occupy the card simultaneously, which is beneficial if neither is using all of the hardware. The batching case we are describing certainly falls into the latter category.

One could, in theory, use a CUDA stream per problem and launch one problem at a time. This would perform poorly for two reasons. First, the number of threads per block would be far too low; we typically want no fewer than 32, and we have already shown why that is not practical for a single matrix. Second, the overhead incurred by launching thousands of operations in this manner would be unacceptable, because the launch cost is as expensive (if not more so) than simply processing the matrix on the CPU. The realistic approach would be to collect elements and group them into thread blocks, but again, the time to form the batches from the streams would cost more than just performing the operation.

In response to this problem, NVIDIA's CUBLAS has introduced a feature called Batch Mode in the newest developer versions. At present, this is available for the matrix multiply routine. Batch Mode addresses this problem by allowing the user to request the solution of a group of identically-sized problems at once, albeit on different data. This is a SIMD operation, just at a slightly higher level than we normally associate with that term. Our experience with this interface is that it is competitive with the CPU for a specific set of circumstances (indeed, the circumstances for which CUBLAS Batch Mode was intended): very small problems (N < 32) and very large batch sizes (more than 1000 matrices).
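The shape of such an interface can be sketched in plain Python. This stand-in loop only illustrates the batched idea – one call, many identically-sized independent problems – and is not the CUBLAS API itself:

```python
# Illustrative stand-in for a batched matrix multiply: one call applies the
# same operation to every pair of small matrices (SIMD at the problem level).
# A real implementation would dispatch all batch entries in parallel.

def gemm_batched(As, Bs):
    """Multiply each pair (As[k], Bs[k]) of small square matrices."""
    out = []
    for A, B in zip(As, Bs):
        n = len(A)
        out.append([[sum(A[i][p] * B[p][j] for p in range(n))
                     for j in range(n)] for i in range(n)])
    return out

As = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]   # identity and 2*identity
Bs = [[[3, 4], [5, 6]], [[1, 1], [1, 1]]]
Cs = gemm_batched(As, Bs)
print(Cs)  # [[[3, 4], [5, 6]], [[2, 2], [2, 2]]]
```

The key property is that the batch is expressed as a single request: the caller hands over all the inputs up front, so a GPU implementation can launch one kernel over the whole batch instead of thousands of tiny launches.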

As for CULA, we have considered this approach but found that batch sizes on the order of thousands are required in order to gain competitive performance versus the CPU, which is why such solvers are not generally available in the package. It is our hope that some day we can find a solution that gets good performance on the GPU for a wide range of matrix sizes as well as varying batch sizes, but for now we are pursuing this work only on a case-by-case basis, tuned for a user's exact needs. For more information, please see our contact page.