CULA speed-up

PostPosted: Wed Jul 13, 2011 8:51 am
by dandan
Hi,

I have recently ported my code, which originally used Intel MKL, to CULA functions, and tested the new and old implementations on a machine equipped with two quad-core CPUs (8 cores in total) and two Tesla M2050 cards. My questions are:

1. According to my tests, the speed-up of the new CULA code over the old MKL version is not linear, or even close to linear. It depends strongly on the problem size: as the problem size increases, the CULA routines achieve a better speed-up. Is there an explanation for that?

2. Is there any way to make the CULA routines use both GPU devices installed on my test machine? As far as I can tell, they only use one GPU at a time. Does the CULA development team plan to release versions that support more than one GPU, or is that even possible without writing MPI code and managing the communication manually?

Regards,

D.

Re: CULA speed-up

PostPosted: Wed Jul 20, 2011 8:33 am
by john
Hello,
1) Larger matrices offer more opportunities to exploit the massive parallelism required to get good performance from NVIDIA GPUs. We try to be up-front about that, for example on our performance page and in our benchmarking example.
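The effect can be sketched with a simple cost model. This is a hedged illustration, not a CULA measurement: it assumes an O(n^3) factorization, a fixed per-call GPU overhead plus a PCIe transfer of the matrix, and purely hypothetical throughput constants.

```python
# Illustrative model of GPU-vs-CPU speedup as a function of matrix size.
# All constants below are assumptions for the sketch, not measured values.

CPU_FLOPS = 8e10        # assumed sustained 8-core CPU (MKL) rate, flop/s
GPU_FLOPS = 4e11        # assumed sustained single-GPU rate, flop/s
FIXED_OVERHEAD = 5e-3   # assumed per-call GPU setup cost, seconds
PCIE_BYTES_PER_S = 5e9  # assumed host<->device bandwidth, bytes/s

def speedup(n):
    flops = (2.0 / 3.0) * n**3                    # LU factorization flop count
    cpu_time = flops / CPU_FLOPS
    transfer = 2 * n * n * 8 / PCIE_BYTES_PER_S   # double-precision matrix, down and back
    gpu_time = FIXED_OVERHEAD + transfer + flops / GPU_FLOPS
    return cpu_time / gpu_time

for n in (256, 1024, 4096, 16384):
    print(n, round(speedup(n), 2))
```

For small n the fixed overhead and transfer dominate, so the GPU can even lose to the CPU; as n grows, the O(n^3) compute swamps the overhead and the speedup approaches the ratio of the raw throughputs.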

2) At present, CULA is limited to one GPU. Expanding to multiple GPUs is ongoing work for us (for some routines, not all). One of the biggest challenges is that, to keep the interface sensible, a multi-GPU routine must take a host-memory interface, whereas in the GPU community users strongly tend to want device-memory interfaces only.
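The manual approach the original question alludes to does not require MPI on a single machine: one host thread per GPU, each bound to its own device and given an independent slice of the work. The sketch below is a hedged stand-in; `solve_on_device` is a hypothetical placeholder for a real per-device library call (it only scales its slice, so the sketch runs anywhere).

```python
# Conceptual sketch: partition independent work across two "devices" using
# one host thread each, with no MPI. solve_on_device() is hypothetical and
# stands in for binding a thread to a GPU and calling a solver there.
from concurrent.futures import ThreadPoolExecutor

def solve_on_device(device_id, rhs_slice):
    # Real code would select device `device_id` in this thread before any
    # library call; here we just do a placeholder computation.
    return [2.0 * x for x in rhs_slice]

def solve_multi_gpu(rhs, num_devices=2):
    # Split independent right-hand sides evenly, one chunk per device.
    chunk = (len(rhs) + num_devices - 1) // num_devices
    slices = [rhs[i * chunk:(i + 1) * chunk] for i in range(num_devices)]
    with ThreadPoolExecutor(max_workers=num_devices) as pool:
        parts = pool.map(solve_on_device, range(num_devices), slices)
    return [x for part in parts for x in part]

print(solve_multi_gpu([1.0, 2.0, 3.0, 4.0]))
```

The pattern only pays off when the slices are truly independent; problems that need communication between the partitions are exactly where a library-managed multi-GPU interface would help.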