CULA speed-up

General CULA Dense (LAPACK & BLAS) support and troubleshooting. Use this forum if you are having a general problem or have encountered a bug.

CULA speed-up

Postby dandan » Wed Jul 13, 2011 8:51 am


I have recently ported my code, which used Intel MKL, to CULA functions, and tested the new and old implementations on a machine equipped with two quad-core CPUs (eight cores in total) and two Tesla M2050 cards. My questions are:

1. According to my tests, the speed-up factor when comparing the new CULA code with the old MKL version is not linear, or even close to linear. Instead, it depends strongly on the problem size: as the problem size increases, the CULA routines achieve a better speed-up. Is there any explanation for that?

2. Is there any way for the CULA routines to use both GPU devices installed on my test machine? As far as I can tell, CULA only uses one GPU at a time. Does the CULA development team plan to release versions that support more than one GPU, or is that even possible without the programmer writing MPI code and managing the communication manually?
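The size dependence in question 1 can be illustrated with a toy cost model (the constants below are invented for illustration, not measured from CULA or MKL): the GPU path pays a fixed launch overhead plus an O(n²) PCIe transfer cost, which are amortized only when the O(n³) factorization work is large enough.

```python
# Toy cost model (made-up constants, not measured data): why GPU speed-up
# over a CPU BLAS/LAPACK grows with problem size.

def cpu_time(n, cpu_flops=8e10):
    # O(n^3) factorization work on the CPU
    return (2.0 / 3.0) * n**3 / cpu_flops

def gpu_time(n, gpu_flops=4e11, pcie_bytes_per_s=5e9, launch_overhead=1e-3):
    # Host<->device transfer of an n x n double matrix, both directions
    transfer = 2 * 8 * n * n / pcie_bytes_per_s
    # The same O(n^3) work, at the (higher) GPU rate
    compute = (2.0 / 3.0) * n**3 / gpu_flops
    return launch_overhead + transfer + compute

def speedup(n):
    return cpu_time(n) / gpu_time(n)

if __name__ == "__main__":
    for n in (256, 1024, 4096, 16384):
        print(n, round(speedup(n), 2))
```

For small n the fixed overhead and transfer dominate and the GPU can even lose to the CPU; as n grows, speed-up approaches the raw FLOP-rate ratio, which matches the sub-linear, size-dependent behavior described above.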



Re: CULA speed-up

Postby john » Wed Jul 20, 2011 8:33 am

1) Larger matrices offer more opportunities to exploit the massive parallelism required to get good performance from NVIDIA GPUs. We try to be up-front about that, for example on the performance page and in our benchmarking example.

2) At present, CULA is limited to one GPU. Expanding to multiple GPUs is ongoing work for us (for some routines, not all). One of the biggest challenges here is that, to keep the interface sensible, it must be a host-memory interface, whereas users in the GPU community tend strongly to prefer device-memory interfaces.
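Until multi-GPU support lands, one common host-side workaround (an assumption on my part, not a CULA feature) is to run one worker process per GPU and split *independent* problems between them. Each worker pins itself to a device via the real CUDA environment variable `CUDA_VISIBLE_DEVICES` before the GPU library initializes; `solve_one` below is a stand-in for the actual single-GPU library call.

```python
# Sketch: farm independent problems out to one process per GPU.
# solve_one() is a stub standing in for a real single-GPU solver call;
# the partitioning scheme here is illustrative, not part of CULA.

import os
from multiprocessing import Pool

def solve_one(args):
    device, problem_id = args
    # Must be set before the CUDA context is created in this process,
    # so the single-GPU library only ever sees its assigned device.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(device)
    # ... call the single-GPU solver here; stubbed for illustration ...
    return (device, problem_id)

def solve_all(problem_ids, num_gpus=2):
    # Round-robin the independent problems over the available devices.
    work = [(i % num_gpus, p) for i, p in enumerate(problem_ids)]
    with Pool(processes=num_gpus) as pool:
        return pool.map(solve_one, work)

if __name__ == "__main__":
    print(solve_all(range(6)))
```

This only helps when the problems are independent; a single large factorization spread across two GPUs would indeed need explicit communication management, as the question anticipates.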
