Feedback for Sparse Solvers

Feedback for Sparse Solvers

Postby dan » Thu Jun 09, 2011 8:59 am

Hi everyone,

We have had a lot of users asking us about sparse support, and I wanted to let everyone know we are listening. As has been mentioned in a couple of other places, we are looking to introduce sparse solvers with the next major release of CULA. There are several other threads on this forum with questions and comments, but I would like to start a thread that collects all of the information and extends an invitation for any and all feedback.

With that in mind, here are the things we are most interested in getting feedback on:

  • Are your solvers home-grown or do they use a toolkit like MKL, PARDISO, UMFPACK, or ITSOL?
  • Do you primarily use direct or iterative solvers?
  • What speedup would you need to see before you would consider moving to a GPU accelerated solver for your sparse problems?
  • What would you define as a typical problem size? What would you define as a large problem size?

We appreciate any input you might have, because the more information you can provide us with the better we can target our solutions to meet your needs. Of course, any other feedback beyond these questions that you feel is relevant is welcome too.

Dan
dan
Administrator
 
Posts: 61
Joined: Thu Jul 23, 2009 2:29 pm

Re: Feedback for Sparse Solvers

Postby gusgw » Thu Jun 09, 2011 4:48 pm

While I can't comment on the first two points, as I don't write my own routines, I can give some information on the second two.

Typical problems are in the realm of ~1,000 - ~6,000 square, while large problems can be upwards of ~30,000 square. At that size they are too large for current GPUs, but as most entries are zero, a sparse format would make them immensely smaller. I have tried a couple of other sparse routines; apart from not performing as well as I had hoped, they were very hard to call from Fortran. I love how easy CULA is to call, so I would not hesitate to use its sparse routines in my code, even if they were no quicker than other available CPU routines.
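To put the "immensely smaller" in numbers, here is a back-of-the-envelope comparison of dense versus compressed sparse row (CSR) storage for a 30,000-square double-precision matrix. The 10-nonzeros-per-row figure is an assumption for illustration, not from the post:

```python
# Dense vs. CSR storage for a 30,000 x 30,000 double-precision matrix,
# assuming (hypothetically) an average of 10 nonzeros per row.
n = 30_000
nnz = n * 10

dense_bytes = n * n * 8                    # one 8-byte double per entry
# CSR: 8-byte value + 4-byte column index per nonzero, plus n+1 row pointers
csr_bytes = nnz * (8 + 4) + (n + 1) * 4

print(f"dense: {dense_bytes / 1e9:.1f} GB")  # 7.2 GB
print(f"CSR:   {csr_bytes / 1e6:.1f} MB")    # 3.7 MB
```

At that density the sparse format is roughly 2,000x smaller, which is why a ~30,000-square problem that cannot fit on a GPU in dense form fits easily in CSR.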

Nick
gusgw
 
Posts: 21
Joined: Wed Nov 17, 2010 9:50 pm

Re: Feedback for Sparse Solvers

Postby gserban » Tue Jun 14, 2011 12:36 am

I am not currently a CULA user since I require sparse solvers rather than dense ones. However, this might change if you plan to introduce sparse solvers into the package.

To answer your questions: I am using home-grown sparse solvers, which use either home-grown BLAS functions or MKL. I use both direct and iterative solvers. Regarding speedup, I think 2x faster than a 12-core high-end Nehalem CPU blade (e.g., 1 GPU = 24 cores, or 4x X5670 CPUs) would be good.
A typical problem size for me is on the order of 2 million DOF, while a large one would be 10-100x that size. A typical problem can be solved on one node, but you need a not-so-small cluster for the large ones.

As you know, there are many packages out there which offer GPU-accelerated iterative solvers; however, good preconditioners are missing. An SSOR or incomplete Cholesky preconditioner would be very important for me. Also, there are some very ill-conditioned problems where sparse direct solvers are a must, yet very few packages offer them.

To summarize, what I would need is the following:
1. A CG solver with an ILU or IC0 preconditioner
2. A sparse direct solver
Both should work on at least a few million DOFs and, preferably, scale to a GPU cluster.
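For reference, a minimal sketch of the kind of preconditioned CG interface being requested (pure NumPy, with a simple Jacobi/diagonal preconditioner standing in for the requested ILU/IC0; the function names and structure are illustrative, not CULA's API):

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-10, maxiter=1000):
    """Preconditioned conjugate gradient for a symmetric positive
    definite A; M_inv applies the preconditioner to a residual."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxiter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Small SPD test system; Jacobi preconditioner M = diag(A).
n = 50
A = (np.diag(np.full(n, 4.0))
     + np.diag(np.full(n - 1, -1.0), 1)
     + np.diag(np.full(n - 1, -1.0), -1))
b = np.ones(n)
d = np.diag(A)
x = pcg(A, b, lambda r: r / d)
```

An IC0 or ILU preconditioner would slot into the same `M_inv` hook as a pair of triangular solves against the incomplete factors.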

I would be very interested in hearing more about sparse solvers in CULA.

Regards,
Serban
gserban
 
Posts: 1
Joined: Tue Jun 14, 2011 12:15 am

Re: Feedback for Sparse Solvers

Postby dandan » Tue Jun 14, 2011 7:17 am

Hi,

My research is about the development and performance acceleration of an integral-equation-based electromagnetic simulation solver. My equations can be complex (in the frequency domain) or real (in the time domain), and I solve systems of the form "Ax=b" over several iterations or frequency/time steps. I had already developed a sequential version with the free, open-source GMM++ computational library, and had also ported the code to clusters with distributed-memory architectures using ScaLAPACK. Then I used MKL to exploit multi-core machines and got very good results with almost linear speedup. As a next step, because my matrices are usually dense, I applied some safe sparsification techniques to convert my system to a sparse one, then tried PARDISO and MUMPS (as parallel direct solvers) and got extremely good results: my memory usage decreased by up to 90% and my solver became 10 times faster, with lower time complexity compared to the O(n^3) of LAPACK direct solvers. Now I am using Krylov-based methods, namely BiCG and GMRES with an ILU preconditioner, with promising results so far. Now, to your questions:
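(As an aside, a minimal sketch of what a drop-tolerance sparsification step might look like; the actual "safe" criterion used above is not specified, so a simple relative threshold and a toy matrix are assumed here:)

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a dense system matrix dominated by its diagonal;
# off-diagonal entries are small enough to be safely dropped.
n = 200
A = np.eye(n) + 1e-6 * rng.standard_normal((n, n))

# Drop every entry below a relative threshold of the largest magnitude.
tol = 1e-3 * np.abs(A).max()
A_sparse = np.where(np.abs(A) >= tol, A, 0.0)

kept = np.count_nonzero(A_sparse)
print(f"kept {kept} of {n * n} entries ({100 * kept / (n * n):.2f}%)")
```

The resulting matrix would then be converted to a compressed sparse format and handed to a solver like PARDISO; how aggressively one can threshold without harming accuracy is entirely problem-dependent.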


* Are your solvers home-grown or do they use a toolkit like MKL, PARDISO, UMFPACK, or ITSOL?

MKL's support for sparse systems is actually poor! It uses PARDISO as its parallel direct solver, which is not thread-safe, and there is even a bug in the pivoting for highly indefinite systems. I have submitted a bug report; they confirmed it and are working on it. Their iterative solvers are not complete either: only the ILU and ILU0 preconditioners and a GMRES solver are supported, and the solver does not even support complex data types. You can of course transform your system into a structurally symmetric real-valued one, but you will need double the space to store the coefficient matrix. PARDISO itself is a very good solver and is well worth porting to GPUs; I have read some papers indicating that its authors have started to port their code to run on GPUs. MUMPS is another great option, but it is designed around an MPI interface and, unlike PARDISO, was not originally intended for shared-memory architectures. So, as a user, I would prefer PARDISO if I wanted a parallel sparse direct solver on GPUs. Please be aware that the pivoting bug exists in the standalone PARDISO releases as well, not only in MKL's implementation; the PARDISO team is working to fix it.
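The doubled storage comes from the standard real-equivalent embedding of a complex system, which can be sketched as follows (a small random system is assumed for illustration; note the real form holds four n x n real blocks versus two, real and imaginary, for the complex matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4

# Complex system A z = c.
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
c = rng.standard_normal(n) + 1j * rng.standard_normal(n)

# Real-equivalent form:
#   [[Re A, -Im A],    [Re z]   [Re c]
#    [Im A,  Re A]]  x [Im z] = [Im c]
B = np.block([[A.real, -A.imag],
              [A.imag,  A.real]])
rhs = np.concatenate([c.real, c.imag])

y = np.linalg.solve(B, rhs)
z_real_form = y[:n] + 1j * y[n:]

# Matches the direct complex solve.
z_direct = np.linalg.solve(A, c)
```

This lets a real-only solver handle complex systems, at the cost of a 2n x 2n real matrix in place of an n x n complex one.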


* Do you primarily use direct or iterative solvers?

Both! LAPACK direct solvers are great for dense matrices (which I have), and PARDISO for sparse matrices. I would also like to use GMRES, CG, or BiCG iterative solvers on GPUs with a good collection of preconditioners. ILU and ILU0 are very basic, so SPAI (SParse Approximate Inverse) or AMG (Algebraic Multigrid) preconditioners would be very useful.


* What speedup would you need to see before you would consider moving to a GPU accelerated solver for your sparse problems?

Using PARDISO or MUMPS, I got a speedup of 10. With iterative solvers and a good, convergent preconditioner giving low residuals, I would expect a speedup of >10 or even 20 compared to LAPACK's direct solvers for dense systems.


* What would you define as a typical problem size? What would you define as a large problem size?

For a dense solver, a typical problem would have ~80K unknowns (~102 GB of memory); for sparse systems it can be 1-2 million unknowns, which would correspond to ~90 GB of main memory.
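The ~102 GB figure is consistent with dense double-precision complex storage:

```python
# Dense storage for an 80K x 80K matrix with double-precision complex
# entries (16 bytes each: 8-byte real + 8-byte imaginary part).
n = 80_000
bytes_per_entry = 16
total_gb = n * n * bytes_per_entry / 1e9
print(f"{total_gb:.1f} GB")  # 102.4 GB
```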

Needless to say, CULA would be an extremely useful package if the solvers' memory management were changed so that the whole data set is not loaded into the card's memory at once. The card's memory is quite limited, so the package should be able to fall back on the system's main memory and transfer data on demand, since main memory can be 90-100 GB on modern systems.

Regards,

Danesh Daroui
dandan
 
Posts: 16
Joined: Sat Feb 26, 2011 7:30 am

Re: Feedback for Sparse Solvers

Postby mharb » Thu Jun 30, 2011 10:07 pm

We use CULA to accelerate quantum transport calculations. A common task is inverting and multiplying matrices (we are interested in explicit elements of the inverse, so we cannot just solve the system). A 'large' matrix would be 10,000 - 20,000 square with complex entries, but such matrices are usually very sparse, so a sparse LU decomposition solver would be immensely useful for us.
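One common pattern for getting explicit inverse elements without forming the full inverse is to factor once and solve against unit vectors for only the columns needed. A dense NumPy sketch of the idea (a sparse LU, as requested above, would replace the dense solve; the matrix here is a well-conditioned toy):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
A = rng.standard_normal((n, n)) + n * np.eye(n)  # well-conditioned toy matrix

# Columns of inv(A) we actually need.
cols = [0, 7, 42]

# Solve A X = E where E holds the corresponding unit vectors; LAPACK
# factors A once and reuses the factorization for all right-hand sides.
E = np.zeros((n, len(cols)))
E[cols, np.arange(len(cols))] = 1.0
X = np.linalg.solve(A, E)  # X[:, j] is column cols[j] of inv(A)
```

With a sparse LU, the factorization is done once per matrix and each requested column costs only a pair of sparse triangular solves.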

Currently, a single Tesla C2050 is giving us a speedup factor of ~5 over 6 CPU cores for dense routines. It would be nice to see a comparable speedup for sparse routines (a factor of 3?).

Mohammed
mharb
 
Posts: 9
Joined: Wed Feb 23, 2011 11:26 am

