Page 5 of 6

Re:sgesv in 1.1 is slow...

PostPosted: Thu Mar 04, 2010 1:40 pm
by dan
Boxed Cylon wrote:
So the question is not so much why sgesv is slow, as why is it slow when the 2nd dimension of B is large. In my own application, this dimension is about 1000. This result seems odd - presumably all the computing is in setting up the inverse; I would have thought the timing would be rather independent of the 2nd dimension of B.

Thanks for this analysis. Here are my results for large right-hand sides:

Code:
>> A = rand(2048,2048, 'single');
>> B = rand(2048,64, 'single');
>> tic; A\B; toc;
Elapsed time is 0.468459 seconds.
>> tic; [x y z] = culaGesv(A,B); toc;
Elapsed time is 0.210827 seconds.
>> A = rand(2048,2048, 'single');
>> B = rand(2048,5000, 'single');
>> tic; A\B; toc;
Elapsed time is 1.570242 seconds.
>> tic; [x y z] = culaGesv(A,B); toc;
Elapsed time is 4.637038 seconds.


As you can see, my results are similar to yours. We've finally got some common ground! =)

It appears that this issue is unrelated to other slowdowns I've discovered in Matlab. As you said, it appears to be slow when the RHS is large. I'll do some more digging into this specific case and let you know what we find.

Dan

Re:sgesv in 1.1 is slow...

PostPosted: Thu Mar 04, 2010 2:05 pm
by jpeinado
Hi:


Boxed Cylon and Dan... I am happy to hear that there is common ground on the problem. I will test on my machine in the next few days to see whether the problem is also a large B.


Thank you very much to all of you


By the way, if the problem is that B is large, then I understand the culprit steps must be:


- swapping the rows of B (using ipiv)


- solving the triangular systems.



Anyway, I ran several tests using sgetrf and it was also slow. Could you please check whether you have problems with sgetrf? I assume that CULA sgesv must be:

sgesv = sgetrf (computing the LU factorization) + sgetrs (swapping the rows of B, solving the triangular systems)
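That two-stage split can be checked outside CULA. Here is a minimal sketch in Python with SciPy (standing in for the MATLAB/CULA calls in this thread), since SciPy exposes the same two LAPACK stages: lu_factor wraps getrf, and lu_solve performs the getrs step of row swaps plus the two triangular solves.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((512, 512))
B = rng.standard_normal((512, 64))

# Stage 1 (getrf): LU factorization of A with partial pivoting.
lu, piv = lu_factor(A)

# Stage 2 (getrs): apply the row swaps recorded in piv to B, then
# solve the lower- and upper-triangular systems.
X = lu_solve((lu, piv), B)

# The two stages together reproduce the one-shot gesv solve of AX = B.
print(np.allclose(A @ X, B))
```

Timing the two stages separately, as jpeinado suggests, would show whether the factorization or the B-dependent getrs step scales badly with the number of right-hand sides.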


jpeinado

Re:sgesv in 1.1 is slow...

PostPosted: Fri Mar 05, 2010 1:31 am
by cjest
Here are the results from my machine, working in single-precision complex:

>> A = rand(2048,2048,'single')+1i*rand(2048,2048,'single');
>> B = rand(2048,64,'single')+1i*rand(2048,64,'single');
>> tic; A\B; toc
Elapsed time is 1.594184 seconds.
>> tic; x = culasv(A,B ); toc;
Elapsed time is 0.338316 seconds.
>> B = rand(2048,2048,'single')+1i*rand(2048,2048,'single');
>> tic; A\B; toc;
Elapsed time is 2.029847 seconds.
>> tic; x = culasv(A,B ); toc;
Elapsed time is 0.770844 seconds.
>> B = rand(2048,5000,'single')+1i*rand(2048,5000,'single');
>> tic; A\B; toc;
Elapsed time is 3.651626 seconds.
>> tic; x = culasv(A,B ); toc;
Elapsed time is 1.451895 seconds.

Note: culasv is a "CULA solver" wrapper using cgesv.
A speedup is obtained even when the size of B is >= the size of A.


BR/CJ

Re:sgesv in 1.1 is slow...

PostPosted: Fri Mar 05, 2010 3:20 am
by jpeinado
My results:

Dan's Routine:

Code:

A=rand(2048,2048,'single');
B=rand(2048,64,'single');

tic;A\B;toc
Elapsed time is 0.629523 seconds.

>> tic;[X]=culaGesv2(A,B) ;toc
Initializing CULA...
$$$$$$$$$$  0.144 s
X-top = -9.435948e-02 3.882708e-01 -6.582417e-02
X-bottom = 4.102392e-01 4.056736e-01 -2.996187e-01
Elapsed time is 1.398374 seconds.



Changing the Matrix B

Code:
>> B=rand(2048,2048,'single');
>> tic;A\B;toc
Elapsed time is 1.220912 seconds.
>> tic;[X]=culaGesv2(A,B) ;toc
Initializing CULA...
$$$$$$$$$$  1.848 s
X-top = 8.216147e-02 -1.769290e-01 3.114136e-01
X-bottom = 8.668031e-02 2.901745e-01 3.632181e-01
Elapsed time is 1.906207 seconds.




Using my routine called culaDeviceSgesv (similar to Boxed Cyclon's routine)

Code:
A=rand(2048,2048,'single');
B=rand(2048,64,'single');

tic;A\B;toc
Elapsed time is 0.579672 seconds.

tic;[X]=culaDeviceSgesv(A,B) ;toc
Elapsed time is 0.170306 seconds.


Changing the Matrix B

Code:
>> B=rand(2048,2048,'single');
>> tic;A\B;toc
Elapsed time is 1.322053 seconds.
>> tic;[X]=culaDeviceSgesv(A,B);toc
Elapsed time is 1.893691 seconds.


My results are very similar to yours.
I used a new machine with a Core 2 Duo processor and a GeForce GTX 280.


It would be very important to test whether culaSgetrf works correctly. I looked through my tests and I have not done this test yet. If culaSgetrf works correctly, then the problem could be in swapping B or in solving the triangular systems.



jpeinado

Re:sgesv in 1.1 is slow...

PostPosted: Fri Mar 05, 2010 8:23 am
by Boxed Cylon
jpeinado wrote:My results:

Dan's Routine:

Code:

A=rand(2048,2048,'single');
B=rand(2048,64,'single');

tic;A\B;toc
Elapsed time is 0.629523 seconds.

>> tic;[X]=culaGesv2(A,B) ;toc
Initializing CULA...
$$$$$$$$$$  0.144 s
X-top = -9.435948e-02 3.882708e-01 -6.582417e-02
X-bottom = 4.102392e-01 4.056736e-01 -2.996187e-01
Elapsed time is 1.398374 seconds.



Changing the Matrix B

Code:
>> B=rand(2048,2048,'single');
>> tic;A\B;toc
Elapsed time is 1.220912 seconds.
>> tic;[X]=culaGesv2(A,B) ;toc
Initializing CULA...
$$$$$$$$$$  1.848 s
X-top = 8.216147e-02 -1.769290e-01 3.114136e-01
X-bottom = 8.668031e-02 2.901745e-01 3.632181e-01
Elapsed time is 1.906207 seconds.


jpeinado


It's important to run the CUDA test at least twice. The second time the device is already initialized, which makes it the better test. In the first test above, the "$$$$$$$$$$ 0.144 s" time is the more accurate measure, rather than the "tic;toc" time of 1.398374 seconds. (I suspect you know this already... :) )
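This warm-up pattern applies to any timing harness, not just CUDA. A hypothetical sketch in Python, with a plain NumPy solve standing in for the GPU call:

```python
import time
import numpy as np

def best_time(solve, warmup=1, reps=3):
    """Time `solve`, discarding warm-up calls so one-time setup cost
    (e.g. device initialization) is excluded from the measurement."""
    for _ in range(warmup):
        solve()                       # warm-up runs: result and time discarded
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        solve()
        times.append(time.perf_counter() - t0)
    return min(times)                 # best of several repeats is most stable

rng = np.random.default_rng(1)
A = rng.standard_normal((256, 256))
B = rng.standard_normal((256, 16))
t = best_time(lambda: np.linalg.solve(A, B))
print(f"{t:.6f} s")
```

Reporting the best of several post-warm-up repeats is the same idea as trusting the second tic/toc run rather than the first.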

Re:sgesv in 1.1 is slow...

PostPosted: Fri Mar 05, 2010 8:42 am
by jpeinado
Boxed Cylon wrote:

It's important to run the CUDA test at least twice. The second time the device is already initialized, which makes it the better test. In the first test above, the "$$$$$$$$$$ 0.144 s" time is the more accurate measure, rather than the "tic;toc" time of 1.398374 seconds. (I suspect you know this already... :) )



Yes. In fact, I did another run beforehand, so the reported first-test time does not include the init time :)


jpeinado

Re:sgesv in 1.1 is slow...

PostPosted: Fri Mar 05, 2010 4:04 pm
by john
Just wanted to update here before the weekend with some good news - we found the problem that was causing the slowdown for the large NRHS. Currently I am solving the 2048 problem (with B sized at 2048x2048) in 0.19 seconds, and I think I still have room to make it go a bit quicker.

Also good news for everyone: gesv is seeing significantly increased speeds across the board (all sizes, all precisions). And more good news is that we should see some related improvements in a few other routines like gels and posv.

You can expect a service release on this one as early as next week. Thank you all for the very detailed feedback - it was very helpful in finding this.

John

Re:sgesv in 1.1 is slow...

PostPosted: Fri Mar 05, 2010 5:40 pm
by Boxed Cylon
Ah ha! :)

The improvements won't be a game changer for me, but it will be nice to have my application run a little faster. And all is right with the universe again...

Re:sgesv in 1.1 is slow...

PostPosted: Sat Mar 06, 2010 7:48 am
by jpeinado
john wrote:Just wanted to update here before the weekend with some good news - we found the problem that was causing the slowdown for the large NRHS




VERY GOOD NEWS!!! John!!! :) :)


For me, it is very important in my algorithms to have a good dgesv routine :) :)


Now I have the UJI CULAPACK sgetrf, but I think the CULA routines could get better results. In my algorithms I have AX=B, where A and B are the same size, so B is as large as A.


Please let us know about your progress.


Thank you very much to all the CULA people... and very especially to Boxed Cylon.


Gracias !!!


jpeinado

Re:sgesv in 1.1 is slow...

PostPosted: Wed Mar 10, 2010 5:46 am
by cjest
Hi,
is it feasible to put gesv in a kernel? If yes, would that be a good way to speed up the "\" operation for smaller matrices?

BR

Re:sgesv in 1.1 is slow...

PostPosted: Thu Mar 11, 2010 10:43 am
by jpeinado
cjest wrote:Hi,
is it feasible to put gesv in a kernel? If yes, would that be a good way to speed up the "\" operation for smaller matrices?

BR


Hi:

gesv (CULA) already runs as GPU kernels. If you want to avoid sending the matrices from the CPU to the GPU in each iteration, you can do that with the culaDeviceSgesv call.

I think it is not possible to speed up \ for smaller matrices by doing what you want to do.

If you want to speed up your computation, say a big loop of \ solves, you need each \ to be independent. If you are solving, for example, an iterative method, this is probably not possible.
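When the solves do happen to be independent and share the same A, the whole loop collapses into one multi-RHS gesv call, which is exactly the large-B case this thread benchmarks. A NumPy sketch (hypothetical sizes, standing in for the MATLAB/CULA calls):

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 300, 50
A = rng.standard_normal((n, n))
bs = [rng.standard_normal(n) for _ in range(k)]   # k independent RHS vectors

# Loop of single-RHS solves: A is factorized k times.
X_loop = np.column_stack([np.linalg.solve(A, b) for b in bs])

# One multi-RHS solve: A is factorized once, all k columns reuse it.
X_batch = np.linalg.solve(A, np.column_stack(bs))

print(np.allclose(X_loop, X_batch))
```

If each iteration has a different A, as in most iterative methods, this batching does not apply, which is jpeinado's point.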

jpeinado

Re:sgesv in 1.1 is slow...

PostPosted: Mon Mar 22, 2010 3:30 pm
by jpeinado
john wrote:Just wanted to update here before the weekend with some good news - we found the problem that was causing the slowdown for the large NRHS. Currently I am solving the 2048 problem (with B sized at 2048x2048) in 0.19 seconds, and I think I still have room to make it go a bit quicker.

Also good news for everyone: gesv is seeing significantly increased speeds across the board (all sizes, all precisions). And more good news is that we should see some related improvements in a few other routines like gels and posv.

You can expect a service release on this one as early as next week. Thank you all for the very detailed feedback - it was very helpful in finding this.

John


Is there any news about this?

Thanks

jpeinado

Re:sgesv in 1.1 is slow...

PostPosted: Tue Mar 30, 2010 7:07 am
by john
Should be very soon. We have expanded the scope of work to have some far-reaching speedups across many of the CULA routines including gels, getrf, posv, and more.

Re:sgesv in 1.1 is slow...

PostPosted: Sun Apr 04, 2010 5:03 am
by jpeinado
Thank you very much


Jesus

Re:sgesv in 1.1 is slow...

PostPosted: Thu Apr 08, 2010 2:06 pm
by john
1.3 is released. I'm anxious to see the new results from this thread!