Readers of our forum will know that we often receive questions about the batch case for dense linear algebra operators. These are cases where the user has many very small matrices which can be solved simultaneously. Some areas where we have seen this are in image processing (inverse or SVD either per-pixel or in a stencil region) and fluid dynamics (solve a 5x5 matrix at each node).
It's worth starting with a statement regarding CPU performance for these problems. If you consider that each of the problems is O(N3), you find that a 5x5 inverse requires only a hundred or so FLOPS, which puts the solution time for a single matrix into the microseconds regime. The process of even initiating GPU work is an order of magnitude or two larger than this, which suggests that there must be a significant number of small operations before the GPU can make a noticeable difference in overall performance.
In a similar vein, the GPU prefers having tens of thousands of simultaneous threads in order to get peak performance. If you consider the 5x5 matrix, the theoretical maximum number of usable threads would be one per matrix element, which is 25 in this case. This would lead to needing over 400 simultaneous matrices for performance. In reality, the number of practical threads to use per matrix is more like 1 or 5, meaning many thousands of simultaneous matrices are needed.
At this point, we can illustrate why CUDA streams aren't the proper solution to this particular problem. For background, CUDA streams allow the programmer to specify that two kernels launched are independent (in terms of data required) and thus the card is free to execute them simultaneously. One common use for this is to eliminate the so-called "tail effect" which is where the first kernel is finishing and so some of the hardware is idle and waiting for the rest to finish - streaming allows the next kernel to begin using that idle hardware before the first kernel has fully completed. A second use is to allow two or more kernels to occupy the card simultaneously, which is good if neither using all of the hardware. The batching case we are describing certainly falls into the latter category.
One could, in theory, use a CUDA stream per problem and launch one problem at a time. This would be ill-performing for two reasons. First is that the number of threads per block would be far too low; we typically want no fewer than 32 for that, and we have already motivated why that is not practical for one matrix. Second is that the overhead incurred by launching thousands of operations in this manner would be unacceptable, because the launch code is as expensive (if not more expensive) as just performing the matrix on the CPU. The realistic approach here would be to collect elements and then group them into thread blocks, but again the time to form the batches from the streams would be more expensive than just performing the operation.
In response to this problem, NVIDIA's CUBLAS has introduced a feature called Batch Mode in the newest developer versions. At present this is available for the matrix multiply routine. The Batch mode is intended for solving this problem effectively by allowing the user to request solution for a group of identical problems at once, albeit on different data. This is a SIMD operation, just at a slightly higher level than we normally associate with that term. Our experience with this interface is that it is competitive with the CPU for a specific set of circumstances (indeed, these are the circumstances for which the CUBLAS Batch Mode was intended) - which are very small problems (N<32) and very large batch sizes (N>1000).
As for CULA, we have considered this approach but found that batch sizes on the order of thousands are required in order to gain competitive performance versus the CPU, which is why such solvers are not generally available in the package. It is our hope that some day we can find a solution that gets good performance on the GPU for a wide range of matrix sizes as well as varying batch sizes, but for now we are pursuing this work only on a case-by-case basis, tuned for a user's exact needs. For more information, please see our contact page.