
### Re:sgesv in 1.1 is slow...

Posted:

**Thu Mar 04, 2010 1:40 pm**
by **dan**

Boxed Cylon wrote:
So the question is not so much why sgesv is slow, as why is it slow when the 2nd dimension of B is large. In my own application, this dimension is about 1000. This result seems odd - presumably all the computing is in setting up the inverse; I would have thought the timing would be rather independent of the 2nd dimension of B.


Thanks for this analysis. Here are my results for large right-hand sides:

```
>> A = rand(2048,2048,'single');
>> B = rand(2048,64,'single');
>> tic; A\B; toc;
Elapsed time is 0.468459 seconds.
>> tic; [x y z] = culaGesv(A,B); toc;
Elapsed time is 0.210827 seconds.

>> A = rand(2048,2048,'single');
>> B = rand(2048,5000,'single');
>> tic; A\B; toc;
Elapsed time is 1.570242 seconds.
>> tic; [x y z] = culaGesv(A,B); toc;
Elapsed time is 4.637038 seconds.
```

As you can see, my results are similar to yours. We've finally got some common ground! =)

It appears that this issue is unrelated to other slowdowns I've discovered in Matlab. As you said, it appears to be slow when the RHS is large. I'll do some more digging into this specific case and let you know what we find.
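For anyone who wants to reproduce the NRHS scaling on the CPU side, here is a minimal timing sketch. This is plain NumPy standing in for the MATLAB/CULA calls: the sizes are the ones used in this thread, but the harness itself is only an illustration, not anyone's actual benchmark code.

```python
import time
import numpy as np

n = 2048
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n)).astype(np.float32)

for nrhs in (64, 1024, 5000):
    B = rng.standard_normal((n, nrhs)).astype(np.float32)
    t0 = time.perf_counter()
    X = np.linalg.solve(A, B)  # backed by LAPACK gesv, like MATLAB's A\B here
    dt = time.perf_counter() - t0
    # sanity check: the solve should produce a small relative residual
    resid = np.linalg.norm(A @ X - B) / np.linalg.norm(B)
    print(f"nrhs={nrhs:5d}  time={dt:.3f}s  rel. residual={resid:.1e}")
```

If the time grows much faster than linearly in NRHS, the extra cost is in the right-hand-side handling rather than in the O(n^3) factorization, which is the behavior reported in this thread.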

Dan

### Re:sgesv in 1.1 is slow...

Posted:

**Thu Mar 04, 2010 2:05 pm**
by **jpeinado**

Hi:

Boxed Cylon and Dan... I am happy to hear that there is common ground on the problem. I will test my machine in the next few days to see whether the problem is a large B.

Thank you very much to all of you

By the way, if the problem is that B is large, then (I understand) the culprit routines must be:

- swapping the rows of B (using ipiv)

- solving the triangular systems.

Anyway, I ran several tests using sgetrf and it was also slow. Could you please check whether you have problems using sgetrf? I suppose that CULA's sgesv must be:

sgesv = sgetrf (compute the LU factorization) + sgetrs (swap the rows of B, solve the triangular systems)
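That split can be checked on the CPU with SciPy's LAPACK wrappers. This is a sketch of the math only, not the CULA API: `lu_factor` plays the role of sgetrf and `lu_solve` the role of sgetrs.

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((512, 512)).astype(np.float32)
B = rng.standard_normal((512, 64)).astype(np.float32)

# getrf step: LU factorization with partial pivoting (packed LU plus ipiv)
lu_piv = lu_factor(A)

# getrs step: apply the ipiv row swaps to B, then the two triangular solves
X = lu_solve(lu_piv, B)

# the two steps together are exactly a gesv-style solve
resid = np.linalg.norm(A @ X - B) / np.linalg.norm(B)
print(f"relative residual: {resid:.1e}")
```

Timing the two calls separately is one way to tell whether the slowdown lives in the factorization or in the right-hand-side stage.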

jpeinado

### Re:sgesv in 1.1 is slow...

Posted:

**Fri Mar 05, 2010 1:31 am**
by **cjest**

Here are the results from my machine, working in single-precision complex:

```
>> A = rand(2048,2048,'single')+1i*rand(2048,2048,'single');
>> B = rand(2048,64,'single')+1i*rand(2048,64,'single');
>> tic; A\B; toc
Elapsed time is 1.594184 seconds.
>> tic; x = culasv(A,B); toc;
Elapsed time is 0.338316 seconds.

>> B = rand(2048,2048,'single')+1i*rand(2048,2048,'single');
>> tic; A\B; toc;
Elapsed time is 2.029847 seconds.
>> tic; x = culasv(A,B); toc;
Elapsed time is 0.770844 seconds.

>> B = rand(2048,5000,'single')+1i*rand(2048,5000,'single');
>> tic; A\B; toc;
Elapsed time is 3.651626 seconds.
>> tic; x = culasv(A,B); toc;
Elapsed time is 1.451895 seconds.
```

Note: culasv is a "CULA solver" wrapper using cgesv.

A speedup is obtained even when B is as large as, or larger than, A.

BR/CJ

### Re:sgesv in 1.1 is slow...

Posted:

**Fri Mar 05, 2010 3:20 am**
by **jpeinado**

My results:

Dan's Routine:

```
A = rand(2048,2048,'single');
B = rand(2048,64,'single');
tic; A\B; toc
Elapsed time is 0.629523 seconds.

>> tic; A\B; toc
>> tic; [X] = culaGesv2(A,B); toc
Initializing CULA...
$$$$$$$$$$ 0.144 s
X-top = -9.435948e-02 3.882708e-01 -6.582417e-02
X-bottom = 4.102392e-01 4.056736e-01 -2.996187e-01
Elapsed time is 1.398374 seconds.
```

Changing the Matrix B

```
>> B = rand(2048,2048,'single');
>> tic; A\B; toc
Elapsed time is 1.220912 seconds.

>> tic; [X] = culaGesv2(A,B); toc
Initializing CULA...
$$$$$$$$$$ 1.848 s
X-top = 8.216147e-02 -1.769290e-01 3.114136e-01
X-bottom = 8.668031e-02 2.901745e-01 3.632181e-01
Elapsed time is 1.906207 seconds.
```

Using my routine, culaDeviceSgesv (similar to Boxed Cylon's routine):

```
A = rand(2048,2048,'single');
B = rand(2048,64,'single');
tic; A\B; toc
Elapsed time is 0.579672 seconds.

tic; [X] = culaDeviceSgesv(A,B); toc
Elapsed time is 0.170306 seconds.
```

Changing the Matrix B

```
>> B = rand(2048,2048,'single');
>> tic; A\B; toc
Elapsed time is 1.322053 seconds.

>> tic; [X] = culaDeviceSgesv(A,B); toc
Elapsed time is 1.893691 seconds.
```

My results are very similar to yours.

I used a new machine with a Core 2 Duo processor and a GeForce GTX 280.

It would be very important to test whether culaSgetrf works correctly. I looked through my tests and I have not run this one yet. If culaSgetrf works correctly, then the problem could be in swapping the rows of B or in solving the triangular systems.
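A quick way to check a getrf implementation is to rebuild A from its factors. This is sketched here with SciPy on the CPU; the same reconstruction test would apply to the output of culaSgetrf.

```python
import numpy as np
from scipy.linalg import lu

rng = np.random.default_rng(1)
A = rng.standard_normal((256, 256)).astype(np.float32)

# getrf computes P*A = L*U; scipy.linalg.lu returns the factors as A = P @ L @ U
P, L, U = lu(A)
err = np.linalg.norm(P @ L @ U - A) / np.linalg.norm(A)
print(f"relative factorization error: {err:.1e}")
```

If this error is near machine epsilon but the full gesv is still slow, the factorization is fine and the time is going into the getrs stage (the row swaps on B and the triangular solves), as suspected above.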

jpeinado

### Re:sgesv in 1.1 is slow...

Posted:

**Fri Mar 05, 2010 8:23 am**
by **Boxed Cylon**

jpeinado wrote:
My results:

Dan's Routine:

```
B = rand(2048,64,'single');
tic; A\B; toc
Elapsed time is 0.629523 seconds.

>> tic; [X] = culaGesv2(A,B); toc
Initializing CULA...
$$$$$$$$$$ 0.144 s
Elapsed time is 1.398374 seconds.
```

It's important to run the CUDA test at least twice. The second time the device is already initialized, which makes it the better test. In the first test above, the time "$$$$$$$$$$ 0.144 s" is the more accurate measure, rather than the tic/toc time of 1.398374 seconds. (I suspect you know this already... :) )
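The warm-up pattern described here applies to any timing harness, not just CUDA. A minimal sketch (plain NumPy on the CPU standing in for the GPU call):

```python
import time
import numpy as np

def timed_solve(A, B):
    t0 = time.perf_counter()
    X = np.linalg.solve(A, B)
    return X, time.perf_counter() - t0

rng = np.random.default_rng(0)
A = rng.standard_normal((1024, 1024)).astype(np.float32)
B = rng.standard_normal((1024, 64)).astype(np.float32)

# Discard the first call: it pays one-time costs (library/GPU context init,
# page faults) that the steady-state runs do not.
timed_solve(A, B)
times = [timed_solve(A, B)[1] for _ in range(3)]
print(f"best warm time: {min(times):.4f} s")
```

Reporting the best of a few warm runs is a common convention; it filters out both the initialization cost and transient system noise.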

### Re:sgesv in 1.1 is slow...

Posted:

**Fri Mar 05, 2010 8:42 am**
by **jpeinado**

Boxed Cylon wrote:
It's important to run the CUDA test at least twice. The second time the device is already initialized, which makes it the better test. In the first test above, the time "$$$$$$$$$$ 0.144 s" is the more accurate measure, rather than the tic/toc time of 1.398374 seconds. (I suspect you know this already... :) )

Yes. In fact, I did another run beforehand to get that first test time, so as not to include the init time :)

jpeinado

### Re:sgesv in 1.1 is slow...

Posted:

**Fri Mar 05, 2010 4:04 pm**
by **john**

Just wanted to update here before the weekend with some good news - we found the problem that was causing the slowdown for the large NRHS. Currently I am solving the 2048 problem (with B sized at 2048x2048) in 0.19 seconds, and I think I still have room to make it go a bit quicker.

Also good news for everyone: gesv is seeing significantly increased speeds across the board (all sizes, all precisions), and we should see related improvements in a few other routines like gels and posv.

You can expect a service release on this one as early as next week. Thank you all for the very detailed feedback - it was very helpful in finding this.

John

### Re:sgesv in 1.1 is slow...

Posted:

**Fri Mar 05, 2010 5:40 pm**
by **Boxed Cylon**

Ah ha! :)

The improvements won't be a game changer for me, but it will be nice to have my application run a little faster. And all is right with the universe again...

### Re:sgesv in 1.1 is slow...

Posted:

**Sat Mar 06, 2010 7:48 am**
by **jpeinado**

john wrote:Just wanted to update here before the weekend with some good news - we found the problem that was causing the slowdown for the large NRHS

VERY GOOD NEWS!!! John!!! :) :)

For me, it is very important in my algorithms to have a good dgesv routine :) :)

Now I am using the UJI CULAPACK sgetrf, but I think the CULA routines could get better results. In my algorithms I solve AX = B, where A and B are the same size, so B is as large as A.

Please let us know about the advances you make.

Thank you very much to all the CULA people... and very especially to Boxed Cylon.

Gracias !!!

jpeinado

### Re:sgesv in 1.1 is slow...

Posted:

**Wed Mar 10, 2010 5:46 am**
by **cjest**

Hi,

Is it feasible to put gesv in a kernel? If yes, would that be a good way to speed up the "\" operation for smaller matrices?

BR

### Re:sgesv in 1.1 is slow...

Posted:

**Thu Mar 11, 2010 10:43 am**
by **jpeinado**

cjest wrote:
Hi,

Is it feasible to put gesv in a kernel? If yes, would that be a good way to speed up the "\" operation for smaller matrices?

BR

Hi:

gesv (CULA) already runs as a kernel. If you want to avoid sending the matrices from CPU to GPU in each iteration, you can do that using the culaDeviceSgesv call.

I do not think it is possible to speed up \ for smaller matrices by doing what you want to do.

If you want to speed up your computation, for example when you have a big loop of \ solves, you need each \ to be independent. If you are solving, for example, an iterative method, this is probably not possible.
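One case where a loop of \ solves does batch well is when every iteration shares the same A: the right-hand sides can be stacked into one B, so a single factorization is amortized over all of them. A sketch in NumPy (illustrative only, not the CULA API):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 256, 100
A = rng.standard_normal((n, n)).astype(np.float32)
bs = [rng.standard_normal(n).astype(np.float32) for _ in range(k)]

# Instead of k separate solves A \ b, stack the b vectors as columns of B:
B = np.stack(bs, axis=1)          # shape (n, k)
X = np.linalg.solve(A, B)         # one factorization, k triangular solves

# column j of X solves A x = bs[j]
worst = max(np.linalg.norm(A @ X[:, j] - bs[j]) for j in range(k))
print(f"worst absolute residual: {worst:.1e}")
```

This is also exactly the large-NRHS shape this thread is about, so it only pays off once the gesv right-hand-side handling is fast.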

jpeinado

### Re:sgesv in 1.1 is slow...

Posted:

**Mon Mar 22, 2010 3:30 pm**
by **jpeinado**

john wrote:Just wanted to update here before the weekend with some good news - we found the problem that was causing the slowdown for the large NRHS. Currently I am solving the 2048 problem (with B sized at 2048x2048) in 0.19 seconds, and I think I still have room to make it go a bit quicker.


Is there any news about this?

Thanks

jpeinado

### Re:sgesv in 1.1 is slow...

Posted:

**Tue Mar 30, 2010 7:07 am**
by **john**

Should be very soon. We have expanded the scope of work to have some far-reaching speedups across many of the CULA routines including gels, getrf, posv, and more.

### Re:sgesv in 1.1 is slow...

Posted:

**Sun Apr 04, 2010 5:03 am**
by **jpeinado**

Thank you very much

Jesus

### Re:sgesv in 1.1 is slow...

Posted:

**Thu Apr 08, 2010 2:06 pm**
by **john**

1.3 is released. I'm anxious to see the new results from this thread!