## sgesv in 1.1 is slow...

### Re:sgesv in 1.1 is slow...

john wrote:

I would like to note that the choice of leading dimension (Lc) isn't well suited to strong performance from either CUBLAS or CULA (I think I mailed Boxed Cylon offline about this topic.) You can observe this in the GEMM results - on my machine with a different LD I see more like a 5-6x speedup at large sizes whereas you are observing a 3x.

Hi again John. I have been trying to contact Boxed Cylon, but so far without success. In any case, could you explain how to choose the leading dimension (Lc), or how you choose it, to get the best performance?

Thank you very much

jpeinado

- jpeinado
**Posts:** 37 **Joined:** Mon Sep 14, 2009 10:48 am

### Re:sgesv in 1.1 is slow...

I too am facing the same problem.

Actually, execution of the culaSgesv(...) function is very fast compared to the normal matlab X=A\B;

I think this is because the variables (A, B) are MATLAB variables by default. When you execute [X]=culasv(single(A)); in MATLAB, it calls the CULA gesv function, which wants the variables to be in GPU memory. I think this is the problem.

If we can find a way to place the variables A and B in the GPU's memory, then execution of culasv(A) from MATLAB will be much faster.

I'm trying this by combining GPUmat and CULA (in MATLAB). Hope it works.

If you find some other way to do it, please post it here.

Any suggestion is appreciated.

- jpmig313
**Posts:** 7 **Joined:** Sat Dec 26, 2009 6:04 am

### Re:sgesv in 1.1 is slow...

jpmig313 wrote:I too am facing the same problem....

Hi:

I'm not sure, but I think this is not the same problem. If you want, you can copy A and B from the CPU to the GPU, since there are several CUDA/CUBLAS routines for this. Then you can call culaDeviceSgesv(), which assumes that the data is already on the GPU.

I have an iterative algorithm, and I only copy A and B from the CPU to the GPU in the first iteration. After that, no further communication is necessary in any iteration (this is the best scenario). My iterative algorithm calls culaDeviceSgesv() many times (and does other things as well).

My problem is that, even without any communication, culaDeviceSgesv() is slow. I don't know whether this routine transfers some data to the CPU and computes there (hybrid routines do this), because I don't have the source code.

Anyway, if I understood you correctly, I think yours is a very different problem.

If culaSgesv() is fast for you, culaDeviceSgesv() must be faster.

Greetings

jpeinado

- jpeinado
**Posts:** 37 **Joined:** Mon Sep 14, 2009 10:48 am

### Re:sgesv in 1.1 is slow...

john wrote:

I would like to note that the choice of leading dimension (Lc) isn't well suited to strong performance from either CUBLAS or CULA (I think I mailed Boxed Cylon offline about this topic.) You can observe this in the GEMM results - on my machine with a different LD I see more like a 5-6x speedup at large sizes whereas you are observing a 3x.

Hi again John. I am trying to contact Boxed Cylon, but I can't reach him. Could you please explain how to choose the leading dimension (Lc), or how you choose it, to get the best performance?

Thank you very much

jpeinado

- jpeinado
**Posts:** 37 **Joined:** Mon Sep 14, 2009 10:48 am

### Re:sgesv in 1.1 is slow...

Hi sorry to jump on your thread, but I am very interested in the Gigaflops Unit you are using, could you provide a link please? Cheers.

- jcurrie
**Posts:** 1 **Joined:** Fri Dec 25, 2009 8:51 am

### Re:sgesv in 1.1 is slow...

jcurrie wrote:Hi sorry to jump on your thread, but I am very interested in the Gigaflops Unit you are using, could you provide a link please? Cheers.

Hi:

I don't understand your question well. Do you want to know:

- What a flop is?
- How to measure an algorithm in flops?
- How to measure the linear system algorithm in flops?

Anyway, you can take a look at this book:

"Matrix Computations", Golub and Van Loan.
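As a rough illustration of the last point, a Gflop/s figure for a dense linear solve is usually obtained by dividing a nominal operation count by the measured time. The sketch below uses the standard leading-order count for an LU-based solve, about (2/3)n^3 flops for the factorization plus 2n^2 per right-hand side. The function names and the timing value are hypothetical and chosen only for the example; they are not taken from CULA or from any benchmark in this thread.

```python
def gesv_flops(n, nrhs=1):
    """Approximate flop count for solving A X = B via LU factorization.

    (2/3) n^3 for the factorization plus 2 n^2 per right-hand side for the
    two triangular solves; lower-order terms are ignored.
    """
    return (2.0 / 3.0) * n**3 + 2.0 * n**2 * nrhs


def gflops(n, seconds, nrhs=1):
    """Achieved rate in Gflop/s for a solve that took `seconds`."""
    return gesv_flops(n, nrhs) / seconds / 1e9


# Hypothetical timing, for illustration only: a 4096 x 4096 system with one
# right-hand side solved in 0.5 s.
print(f"{gflops(4096, 0.5):.1f} Gflop/s")
```

Because the count is nominal, two implementations timed on the same problem can be compared directly by their Gflop/s values, even if one of them internally performs a different number of operations.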

Greetings

jpeinado

- jpeinado
**Posts:** 37 **Joined:** Mon Sep 14, 2009 10:48 am

### Re:sgesv in 1.1 is slow...

Hi,

No post in almost 3 weeks? And sgesv is still SLOW!

I made a new thread for this question, but let me copy-paste it here with some modifications after checking yours.

I am facing almost the same problem as with Sgesv, even worse because my data (coming from Matlab) is complex. I am using culaCgesv to solve a linear system with a complex coefficient matrix. Since the \ (backslash) operation for complex data is not supported in "Jacket" (AccelerEyes), I was trying to use CULA to speed up the execution time. (I got a good picture of it after running the benchmark example: Sgesv 3x speedup.)

I run CULA Premium, which covers all the precisions I need. As a first step, I wrote a MEX file for single/complex precision using culaCgesv, but the execution time is even greater than Matlab's. In the best scenario the speedup ratio is 0.3 compared to Matlab!

Computer specification:

- CPU: Intel Xeon X5450 3.00 GHz
- GPU: GeForce GTX 285

Software:

- WinXP 32
- CULA Premium
- CUDA 2.3
- Matlab R2009b

The code is almost the same as in previous posts; one difference is that I had to separate the real and imaginary parts of each matrix.

Should I provide further information? I'm just wondering whether it's time for me to move on to another toolbox, or whether it is just me who is having this problem.

cheers

- cjest
**Posts:** 12 **Joined:** Wed Feb 10, 2010 3:01 pm

### Re:sgesv in 1.1 is slow...

A rapid reply from Kyle in another thread. Let me carry on my question here.

Hi cjest,

Have you tried running the included benchmark program to estimate the performance of your machine? Certain versions of MATLAB are very sensitive when it comes to external memory management in MEX files.

Also, Jacket (which uses CUDA internally) should support 'mldivide' with CULA's CGESV command. Perhaps it's worth asking the Jacket team about that one as well. This function might be slated for their next release.

-Kyle

Hi Kyle, thank you for your rapid answer.

1- Yes. SGESV benchmark, best case: 3.12x speedup. That's why I use the Premium version. My matrix is not a huge one; max(size) is 100x100.

2- Jacket doesn't support double/complex precisions, so 'mldivide' and its family work only for real, single-precision data, which is not so useful in scientific computing applications. Jacket has poor support for linear algebra/equation functions.

//CJ

- cjest
**Posts:** 12 **Joined:** Wed Feb 10, 2010 3:01 pm

### Re:sgesv in 1.1 is slow...

If your matrix size is only 100x100, you are sadly going to see no speedups using CULA for GESV. That matrix size is well under what CULA is optimized for.

In complex arithmetic, a 100x100 matrix is only about 150 KB and takes milliseconds to solve on the CPU. CULA is designed for matrices that are megabytes (or gigabytes) in size and take many seconds (or minutes) to solve on a CPU.

To put it in perspective, a 150 KB matrix can easily be stored in your CPU's super fast cache memory. On the GPU, this matrix will have to be transferred to the GPU's memory and accessed through a number of kernel launches. The overhead of these operations, while very small, does become a bottleneck when the operation only takes a few milliseconds to run!
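The storage arithmetic behind that size estimate can be checked with a short sketch. It assumes dense storage with no leading-dimension padding, 8 bytes per single-precision complex element (the type culaCgesv works on) and 16 bytes per double-precision complex element; `matrix_bytes` is a hypothetical helper written for this example. Under those assumptions the roughly 150 KB figure matches the double-precision case, and the single-precision matrix is about half that, which only strengthens the point about fitting in cache.

```python
def matrix_bytes(rows, cols, bytes_per_element):
    """Storage for a dense matrix, ignoring any leading-dimension padding."""
    return rows * cols * bytes_per_element


# Element sizes: complex single = 2 x 4 bytes, complex double = 2 x 8 bytes.
single_complex = matrix_bytes(100, 100, 8)
double_complex = matrix_bytes(100, 100, 16)

print(single_complex / 1024, "KiB for single-precision complex")
print(double_complex / 1024, "KiB for double-precision complex")
```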

I hope this answers your question!

-Kyle

- kyle
- Administrator
**Posts:** 301 **Joined:** Fri Jun 12, 2009 7:47 pm

### Re:sgesv in 1.1 is slow...

Hello,

I'd like to add a couple of random notes to this thread as well.

We have been looking into this problem (slowness in Matlab) with varying results. Some of our systems experience the Matlab slowdown while others do not. Since Matlab is not one of our officially supported integrations at this time (we officially support C, C++, and Fortran), we are to some extent relying on the community to help get the data out there. I would reiterate that we are exploring Matlab integration issues, but that we have not yet officially claimed support.

As an aside, if you are a full Premium user (i.e., non-academic) then you will have access to our ticket-based support system at http://www.culatools.com/support. I can't see at the moment whether your account is full premium or whether it is academic premium. We ensure very quick responses there, so if you need an expedited response to an issue such as this then I would recommend using that channel if you have access to it.

Lastly, I will email my contact at Accelereyes to see if their upcoming version will have full support for the mldivide operator.

John

- john
- Administrator
**Posts:** 587 **Joined:** Thu Jul 23, 2009 2:31 pm

### Re:sgesv in 1.1 is slow...

cjest wrote:Also, Jacket (which uses CUDA internally) should support 'mldivide' with CULA's CGESV command. Perhaps it's worth asking the Jacket team about that one as well. This function might be slated for their next release.

Kyle and CJ,

I just wanted to let you know that we've been working on mldivide and we expect it to be part of the Jacket 1.3 release. We are aiming for this release around the end of this month or beginning of next month. I hope this helps.

If you have any specific use cases, you are welcome to send us a bit of your MATLAB code demonstrating the use case and we'd be glad to include it in our QA tests.

Brett Lucey

Chief Software Engineer

AccelerEyes, LLC

- blucey
**Posts:** 1 **Joined:** Wed Nov 11, 2009 3:33 pm

### Re:sgesv in 1.1 is slow...

Just to add to this dialog, I've just run the CUDA profiler on my application; the image showing the timing data is hopefully attached. It's been a while since I ran the profiler.

As you can see, sgesv appears as the calls to KernelCuLargeTrsmLUNN_A1 and KernelCuLargeTrsmLLNU_A (and perhaps KernelLASWP and KernelScaleTrsm_LUNN). These are called many, many times, and these routines take up a little over 20% of the compute time for this application, which is a loop with one call to sgesv and several calls to sgemm. Presumably, the two big time sinks, KernelCuLargeTrsmLUNN and KernelCuLargeTrsmLLNU_A, are associated with the LU decomposition. The largest matrices in my application are around 10,000X10,000; the first two dimensions of the call to culaDeviceSgesv would be around 100-400 (varies) and 10,000.

I still have not learned how it came to pass that when I first started using this routine (culaDeviceSgesv) it was 3X faster than the cpu/matlab, but after various changes (platform, matlab version, CUDA version, etc.) it is now 3X slower than the cpu/matlab... it's a mystery that will be left unsolved. I gather that the slowness of the routine is commonly experienced by others, in any case.

- Boxed Cylon
**Posts:** 48 **Joined:** Fri Oct 16, 2009 8:57 pm

### Re:sgesv in 1.1 is slow...

This reply I think will only apply to Boxed Cylon.

Those two kernels you cited are used solely for the backsolve portion (i.e., after LU is complete) and each should be called a number of times roughly equal to the N parameter. Some of those other kernels on your list are ours as well (some gemms, some of the other trsm). Since your N is small (100-400) and your number of TRSM calls is large, is this being run in a loop of some kind? Can you list for me the exact call parameters shown in that screenshot?

If you are indeed calling in a loop, I just want to note that sgesv factors the matrix each time it is called. If the matrix A doesn't change every time, you can always prefactor via getrf before you enter the loop and then use the factored matrix to solve your system via getrs in the loop.
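The factor-once, solve-many pattern described here can be sketched in a few lines. This is a plain-Python illustration of the idea, not CULA or LAPACK code: `lu_factor` and `lu_solve` are hypothetical names that stand in for the roles of getrf and getrs, and the small matrix is made up for the example.

```python
def lu_factor(a):
    """LU factorization with partial pivoting (the role getrf plays).

    Returns (lu, piv): lu packs L (unit diagonal, below the diagonal) and U
    (on and above the diagonal); piv[k] is the row swapped with row k.
    """
    n = len(a)
    lu = [row[:] for row in a]           # work on a copy of A
    piv = list(range(n))
    for k in range(n):
        # choose the largest entry in column k as pivot for stability
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        piv[k] = p
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]         # multiplier, stored in place of L
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, piv


def lu_solve(lu, piv, b):
    """Solve A x = b from an existing factorization (the role getrs plays)."""
    n = len(lu)
    x = b[:]
    for k in range(n):                   # apply the recorded row swaps to b
        x[k], x[piv[k]] = x[piv[k]], x[k]
    for i in range(n):                   # forward substitution with L
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):         # back substitution with U
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


a = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
lu, piv = lu_factor(a)                   # O(n^3) work, done once
for b in ([4.0, 10.0, 24.0], [1.0, 2.0, 3.0]):
    x = lu_solve(lu, piv, b)             # O(n^2) work per right-hand side
    print(x)
```

The saving only applies when A stays the same between calls; if A changes on every iteration, the factorization has to be redone each time and a combined gesv-style call does the same work.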

I will admit that we have not spent much time considering the case where NRHS >> N. I will look into that.

All that said, I don't find your numbers troubling from the profiler results. Those two TRSM kernels are only adding up to 20% of your runtime and they represent the entire backsolve process after the LU factorize.

Here are my benchmarks for sgesv, from our benchmark precompiled example:

- Code:
`C:\Program Files\CULA\examples\benchmark>benchmark_.exe sgesv`

Initializing CULA...

Initializing MKL...

Benchmarking the following functions: SGESV

-- SGESV Benchmark --

| Size | CULA (s) | MKL (s) | Speedup |
|------|----------|---------|---------|
| 4096 | 0.59 | 1.08 | 1.8226 |
| 5120 | 0.81 | 1.98 | 2.4532 |
| 6144 | 1.20 | 3.42 | 2.8499 |
| 7168 | 1.70 | 5.14 | 3.0336 |
| 8192 | 2.33 | 7.88 | 3.3864 |

(Do note that this is using CULA 1.2, which has a faster factorize and will be available shortly.)

Let's start there. Does this mesh with the numbers you receive from the benchmark example on your machine? If it does, then we know the answer is in the parameters and/or the matlab integrations. If it does not, then the answer could very well be in your hardware or software setup.

- john
- Administrator
**Posts:** 587 **Joined:** Thu Jul 23, 2009 2:31 pm

### Re:sgesv in 1.1 is slow...

I get similar numbers using the benchmark routine:

- Code:
`./benchmark sgesv`

Initializing CULA...

Initializing MKL...

Benchmarking the following functions: SGESV

-- SGESV Benchmark --

| Size | CULA (s) | MKL (s) | Speedup |
|------|----------|---------|---------|
| 4096 | 0.48 | 1.10 | 2.2788 |
| 5120 | 0.80 | 1.38 | 1.7236 |
| 6144 | 1.17 | 2.45 | 2.0849 |
| 7168 | 1.70 | 3.49 | 2.0517 |
| 8192 | 2.38 | 7.39 | 3.1072 |

(AMD 965 on AM3 motherboard and fast RAM vs a GTX260 GPU on Suse linux 11.1)

That said, I am fairly (but not entirely...) sure that sgesv runs fairly slowly compared to the CPU in matlab. It might be that matlab uses fairly efficient BLAS/LAPACK routines; perhaps we should test the cpu benchmark sgesv against matlab's back division "\". I posted comparisons before of matlab and sgesv calculations, which find that matlab on the cpu is faster by 2-3X for matrices N=1000-5000 in size; I've just verified this. I'm fairly sure (but not entirely...) that at one point it was the other way around; but perhaps I am merely losing my mind... :) Matlab version R2009b.

The routine is indeed a loop - actually a loop within a loop - but alas the matrix A is different each time. sgesv is actually called just once in each loop. It is a small part of the calculations of each loop - that it takes up 20% is actually fairly large, it seems to me. But the whole thing runs o.k., so I'm too lazy to try to recode it as I had it before (which involved employing matlab's back division "\" together with copying the required data off and on the GPU). Perhaps with the new version 1.2 the problem (if there even is a problem) will resolve itself.

- Boxed Cylon
**Posts:** 48 **Joined:** Fri Oct 16, 2009 8:57 pm

### Re:sgesv in 1.1 is slow...

MATLAB's mldivide, aka \, calls LAPACK's GESV for dense, square matrices using Intel's MKL LAPACK implementation. If you were to benchmark MATLAB's mldivide, it should be very close to what's reported as MKL's speed in the CULA benchmark for GESV. Can you try benchmarking mldivide in MATLAB, using tic/toc, to make sure we are comparing apples to apples?

One important thing to consider: MATLAB has smart algorithm selection based on matrix properties for mldivide. If your matrix is sparse, banded, symmetric, or positive definite, a different (and faster) algorithm will be used!

Many more details about MATLAB's mldivide can be found here:

http://www.mathworks.com/access/helpdes ... ivide.html

- kyle
- Administrator
**Posts:** 301 **Joined:** Fri Jun 12, 2009 7:47 pm
