Feedback for Sparse Solvers

by Dan

Hi everyone,

In the coming weeks we will be making an announcement related to sparse solvers. For those of you who don't visit our forums regularly, I wanted to make you aware that we have put out an official request for feedback on sparse solvers. Specifically, we're looking to answer the following questions:

  • Are your solvers home-grown or do they use a toolkit like MKL, PARDISO, UMFPACK, or ITSOL?
  • Do you primarily use direct or iterative solvers?
  • What speedup would you need to see before you would consider moving to a GPU-accelerated solver for your sparse problems?
  • What would you define as a typical problem size? What would you define as a large problem size?

You can voice your opinion here. We appreciate your feedback because it helps us to deliver solutions that are most relevant to you.


HPCwire on GPU Computing

by Kyle

Dr. Vincent Natoli of Stone Ridge Technology recently published a very good article in HPCwire evaluating 10 common objections to GPU computing. For each of these reasons people have been hesitant to get involved with GPU computing, he provides a counter-argument.

The GPU team at EM Photonics agrees with a number of Dr. Natoli's points, as can be seen in our CULA design principles. For example, his #1 fallacy is:

"I don’t want to rewrite my code or learn a new language."

In CULA, we designed a system that completely abstracts all GPU details from the user - there is no need to learn CUDA to accelerate an existing LAPACK function. You can simply compile your code using CULA functions (or alternatively link against the new link interface) and all details including initialization, memory transfers, and computation are performed with a single function call. This approach allows scientists and developers unfamiliar with GPU programming to quickly accelerate codes with LAPACK functionality.

Another fallacy examined by Dr. Natoli is:

"The PCIe bandwidth will kill my performance"

In CULA, for large enough problems, the PCIe transfer time accounts for less than 1% of the total runtime! For many of the LAPACK functions, the memory requirements are of order O(N^2) while the computation is of order O(N^3). This discrepancy means that the amount of computation needed is growing at a much faster rate than the memory required. While this might not always be true for other domains, it is certainly the case for the majority of CULA. Additionally, through creative implementations it is possible to overlap GPU computation with GPU transfers and CPU computation. This technique is used heavily in CULA to achieve even higher speeds.
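
To make the scaling argument concrete, here is a rough back-of-the-envelope sketch in Python. It is not CULA code and not a measurement: the PCIe bandwidth, the sustained GPU rate, and the helper name transfer_to_compute_ratio are all assumptions chosen for illustration, and the operation count used is the standard leading-order (2/3)N^3 for a dense LU factorization. The only point it shows is that the ratio of transfer time to compute time falls off like 1/N.

```python
def transfer_to_compute_ratio(n, bytes_per_element=8,
                              pcie_bytes_per_s=6e9, gpu_flops_per_s=3e11):
    """Estimate (PCIe transfer time) / (compute time) for an n x n LU factorization.

    All hardware numbers are illustrative placeholders, not benchmarks.
    """
    # Data moved: the n x n matrix goes to the GPU and the factors come back.
    transfer_bytes = 2 * n * n * bytes_per_element          # O(N^2)
    transfer_s = transfer_bytes / pcie_bytes_per_s

    # Work done: leading-order operation count for dense LU.
    flops = (2.0 / 3.0) * n ** 3                            # O(N^3)
    compute_s = flops / gpu_flops_per_s

    return transfer_s / compute_s


if __name__ == "__main__":
    for n in (1000, 2000, 4000, 8000, 16000):
        ratio = transfer_to_compute_ratio(n)
        print(f"N = {n:6d}   transfer/compute = {ratio:.4f}")
```

Whatever constants you plug in, doubling N halves the ratio, which is why the transfer cost fades for large enough problems. Overlapping transfers with computation, as mentioned above, is not modeled here.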

Overall, the article answers some of the most common misconceptions about GPU computing and is a good read for both novices and experts in the area.


User Spotlight: Andrzej Karwowski, Ph.D, D.Sc.

by Liana

From left to right are Dr. Topa, Dr. Karwowski, and Dr. Noga, members of the GPU Computing Team at the Silesian University of Technology in Poland.

Today, we are putting a spotlight on Dr. Andrzej Karwowski and his colleagues Dr. Tomasz Topa and Dr. Artur Noga of the GPU computing group at the Silesian University of Technology, Poland.

Dr. Karwowski is with the Department of Electronics, where he currently holds the position of Professor and leads the Radioelectronics group. Dr. Topa and Dr. Noga are Faculty members who have been working closely with Professor Karwowski. Most of the group's work is in the fields of computational electromagnetics (CEM), electromagnetic compatibility, antennas, and wireless communication. Recently, the focus has been on creating GPU-based low-cost hardware platforms for CEM. Dr. Karwowski and the other researchers are examining the possibilities of accelerating the full-wave method of moments (MoM) by employing CUDA-capable GPUs.

How CULA has helped

“We used CULA in the context of numerical modeling and analysis of electromagnetic radiation and scattering with the use of the so-called method of moments (MoM). Roughly speaking, this latter consists in constructing and then solving a matrix equation that describes a system in question. The key problem here is that even relatively simple structures generate complex-valued dense system matrices whose size can be of the order of thousands. With the use of LU factorization routine available in CULA we were able to offload to the GPU the most intensive computations required for the solution of the matrix equation thus attaining noticeable speedup of MoM simulations,” said Dr. Karwowski.
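
For readers less familiar with the step Dr. Karwowski describes, the sketch below shows what "solving the matrix equation by LU factorization" means for a small complex-valued dense system. It is a plain Python illustration only: it is not the group's code, it does not call CULA, and the function name lu_solve and the 2x2 test matrix are made up for this example. A real MoM matrix would have thousands of rows, and this factorization is the part that gets offloaded to the GPU.

```python
def lu_solve(a, b):
    """Solve a x = b for a dense (possibly complex) matrix using
    LU factorization with partial pivoting. Pure Python, for illustration."""
    n = len(a)
    lu = [row[:] for row in a]   # work on a copy; L and U are stored in place
    x = list(b)

    for k in range(n):
        # Partial pivoting: pick the row with the largest entry in column k.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if abs(lu[p][k]) == 0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            x[k], x[p] = x[p], x[k]
        # Eliminate below the pivot, storing the multipliers (entries of L).
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            m = lu[i][k]
            for j in range(k + 1, n):
                lu[i][j] -= m * lu[k][j]

    # Forward substitution with the unit lower-triangular factor L.
    for i in range(n):
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    # Back substitution with the upper-triangular factor U.
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


if __name__ == "__main__":
    # Hypothetical 2x2 complex system, standing in for a MoM system matrix.
    z = [[2 + 1j, 1 - 1j],
         [1 + 0j, 3 + 2j]]
    v = [2 + 0j, 9 + 2j]
    print(lu_solve(z, v))
```

The triple loop in the factorization is where the O(N^3) work comes from, which is why it dominates the runtime for matrices of the size mentioned in the quote.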

The group has recently published two papers that we suggest reading if you are also looking to accelerate your MoM simulations with CUDA. You will find the abstracts and links for both papers under our Research Papers section.
