sgesv in 1.1 is slow...


Re:sgesv in 1.1 is slow...

Postby jpeinado » Thu Jan 07, 2010 8:42 pm

john wrote:
I would like to note that the choice of leading dimension (Lc) isn't well suited to strong performance from either CUBLAS or CULA (I think I mailed Boxed Cylon offline about this topic.) You can observe this in the GEMM results - on my machine with a different LD I see more like a 5-6x speedup at large sizes whereas you are observing a 3x.



Hi again John. I am trying to contact Boxed Cylon, but so far I have not been able to. In any case, could you explain how you choose the leading dimension (Lc) to get the best performance?
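
John's actual recommendation does not appear in this thread. Purely as an illustration of what "choosing a leading dimension" can mean, here is a minimal sketch of one common heuristic: pad the leading dimension of a column-major matrix up to a multiple of some alignment. The function names and the alignment value of 64 elements are assumptions for this example, not figures from John, CULA, or CUBLAS.

```python
def padded_leading_dimension(rows, alignment=64):
    """Round rows up to the next multiple of alignment (counted in elements).

    The default alignment of 64 is an assumed value for illustration only.
    """
    if rows <= 0 or alignment <= 0:
        raise ValueError("rows and alignment must be positive")
    return ((rows + alignment - 1) // alignment) * alignment


def column_major_offset(i, j, ld):
    """Linear index of element (i, j) in a column-major array with leading dimension ld."""
    return i + j * ld


if __name__ == "__main__":
    rows = 1000
    ld = padded_leading_dimension(rows)
    # Each column now starts ld elements after the previous one;
    # the rows between `rows` and `ld` are unused padding.
    print(rows, ld, column_major_offset(2, 3, ld))
```

With a padded leading dimension, the matrix is allocated as ld x cols elements and ld (rather than rows) is passed as the LDA argument. Whether this helps on a given card would have to be measured.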

Thank you very much

jpeinado
jpeinado
 
Posts: 37
Joined: Mon Sep 14, 2009 10:48 am

Re:sgesv in 1.1 is slow...

Postby jpmig313 » Fri Jan 08, 2010 4:02 am

I too am facing the same problem....

Actually, execution of the culaSgesv(...) function is very fast compared to the normal MATLAB X=A\B;

I think this is because the variables (A, B) are MATLAB variables by default, and when you execute [X]=culasv(single(A)); in MATLAB, it calls the CULA gesv function, which wants the variables to be in GPU memory. I think this is the problem.


If we can find a way to get the variables A and B into the GPU's memory, then executing culasv(A) from MATLAB will be much faster.

I'm trying this by combining GPUmat and CULA (in MATLAB). I hope it works.


If you find some other way to do it, please post it here.

Any suggestion is appreciated...
jpmig313
 
Posts: 7
Joined: Sat Dec 26, 2009 6:04 am

Re:sgesv in 1.1 is slow...

Postby jpeinado » Fri Jan 08, 2010 8:31 am

jpmig313 wrote:I too am facing the same problem....


Hi:

I'm not sure, but I think it is not the same problem. If you want, you can copy A and B from the CPU to the GPU yourself, since there are several CUDA/CUBLAS routines for this. Then you can call culaDeviceSgesv(), which assumes that the data is already on the GPU.

I have an iterative algorithm, and I only copy A and B from the CPU to the GPU in the first iteration. After that, no further communication is needed in later iterations (this is the best-case scenario). My iterative algorithm calls culaDeviceSgesv() many times, among other things.

My problem is that even without any communication, culaDeviceSgesv() is slow. I don't know whether this routine transfers some data to the CPU and computes there (hybrid routines do this), because I don't have the source code.

Anyway, if I understood you correctly, I think yours is a very different problem.

If culaSgesv() is fast for you, culaDeviceSgesv() should be even faster.

Greetings

jpeinado
jpeinado
 
Posts: 37
Joined: Mon Sep 14, 2009 10:48 am

Re:sgesv in 1.1 is slow...

Postby jpeinado » Tue Jan 12, 2010 4:49 am

john wrote:
I would like to note that the choice of leading dimension (Lc) isn't well suited to strong performance from either CUBLAS or CULA (I think I mailed Boxed Cylon offline about this topic.) You can observe this in the GEMM results - on my machine with a different LD I see more like a 5-6x speedup at large sizes whereas you are observing a 3x.



Hi again John. I have been trying to contact Boxed Cylon, but I have not been able to. Could you please explain how you choose the leading dimension (Lc) to get the best performance?

Thank you very much

jpeinado
jpeinado
 
Posts: 37
Joined: Mon Sep 14, 2009 10:48 am

Re:sgesv in 1.1 is slow...

Postby jcurrie » Wed Jan 20, 2010 1:38 am

Hi sorry to jump on your thread, but I am very interested in the Gigaflops Unit you are using, could you provide a link please? Cheers.
jcurrie
 
Posts: 1
Joined: Fri Dec 25, 2009 8:51 am

Re:sgesv in 1.1 is slow...

Postby jpeinado » Wed Jan 20, 2010 9:28 am

jcurrie wrote:Hi sorry to jump on your thread, but I am very interested in the Gigaflops Unit you are using, could you provide a link please? Cheers.



Hi:

I don't understand your question well. Do you want to know:

- What a flop is?

- How to measure an algorithm in flops?

- How to measure the linear system algorithm in flops?


Anyway, you can take a look at this book:

"Matrix Computations", Golub and Van Loan.
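
As a rough sketch of the third point, the usual textbook operation count for solving a dense n x n real system by LU factorization is about 2n^3/3 flops for the factorization plus about 2n^2 flops per right-hand side for the two triangular solves. Dividing that count by the measured time gives a Gflop/s figure. The function names below are made up for this example, and the timing in the usage line is a placeholder, not a measurement.

```python
def gesv_flops(n, nrhs=1):
    """Approximate flop count for solving A X = B via LU (real arithmetic).

    2*n^3/3 for the factorization, plus 2*n^2 per right-hand side for the
    forward and back substitutions. Lower-order terms are ignored.
    """
    return 2.0 * n**3 / 3.0 + 2.0 * n**2 * nrhs


def gflops(n, seconds, nrhs=1):
    """Achieved Gflop/s given the problem size and the measured wall time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return gesv_flops(n, nrhs) / seconds / 1e9


if __name__ == "__main__":
    # Placeholder timing: a 4096 x 4096 solve assumed to take 1.0 s.
    print(f"{gflops(4096, 1.0):.1f} Gflop/s")
```

Complex arithmetic costs more per operation, so the count would need to be scaled for the C and Z routines.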


Greetings

jpeinado
jpeinado
 
Posts: 37
Joined: Mon Sep 14, 2009 10:48 am

Re:sgesv in 1.1 is slow...

Postby cjest » Wed Feb 10, 2010 9:10 am

Hi,
No post in almost 3 weeks?? And sgesv is still SLOW!!
I made a new thread for this question, but let me copy-paste my question here with some modifications after checking yours.

I was facing almost the same problem with Sgesv, even worse because my data (coming from MATLAB) is complex. I am using culaCgesv to solve a linear system with a complex coefficient matrix. Since the \ (backslash) operation is not supported for complex data in "Jacket" (AccelerEyes), I was trying to use CULA to speed up the execution time. (I got a good picture of it after running the benchmark example: Sgesv 3x speedup.)
I run CULA Premium, which covers all the precisions I need. As a first step, I wrote a MEX file for single-precision complex data using culaCgesv, but the execution time is even greater than MATLAB's. In the best scenario the speedup ratio is 0.3 compared to MATLAB!!

Computer Specification:
CPU: intel Xeon X5450 3.00Ghz
GPU: GeForce GTX 285
Soft:
WinXp 32
Cula premium
Cuda 2.3
Matlab R2009b

The code is almost the same as in the previous posts; one difference is that I had to separate the real and imaginary parts of each matrix.

Should I provide further information? I'm just wondering whether it's time for me to move on to another toolbox, or whether I am the only one having this problem.

cheers
cjest
 
Posts: 12
Joined: Wed Feb 10, 2010 3:01 pm

Re:sgesv in 1.1 is slow...

Postby cjest » Wed Feb 10, 2010 9:22 am

A rapid reply from Kyle in another thread. Let me carry on my question here.

Hi cjest,

Have you tried running the included benchmark program to estimate the performance of your machine? Certain versions of MATLAB are very sensitive when it comes to external memory management in MEX files.

Also, Jacket (which uses CUDA internally) should support 'mldivide' with CULA's CGESV command. Perhaps it's worth asking the Jacket team about that one as well. This function might be slated for their next release.

-Kyle

Hi Kyle, thank you for your rapid answer.
1- Yes, I ran the SGESV benchmark.
Best case: 3.12x speedup. That's why I use the Premium version. My matrix is not a huge one; max(size) is 100x100.

2- Jacket doesn't support double or complex precision, so 'mldivide' and its family work only for real single-precision data, which is not so useful in scientific computing applications. Jacket's support for linear algebra functions is poor.

//CJ
cjest
 
Posts: 12
Joined: Wed Feb 10, 2010 3:01 pm

Re:sgesv in 1.1 is slow...

Postby kyle » Wed Feb 10, 2010 9:51 am

If your matrix size is only 100x100, you are sadly going to see no speedups using CULA for GESV. That matrix size is well under what CULA is optimized for.

In complex arithmetic, a 100x100 matrix is only about 150 KB and takes milliseconds to solve on the CPU. CULA is designed for matrices that are megabytes (or gigabytes) in size and take many seconds (or minutes) to solve on a CPU.

To put it in perspective, a 150 KB matrix can easily be stored in your CPU's super fast cache memory. On the GPU, this matrix will have to be transferred to the GPU's memory and accessed through a number of kernel launches. The overhead of these operations, while very small, does become a bottleneck when the operation only takes a few milliseconds to run!
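
As a quick check of the arithmetic, the storage for a dense matrix is simply rows x columns x bytes per element. The sketch below (names are illustrative) shows that a 100x100 matrix is about 78 KiB in single-precision complex and about 156 KiB in double-precision complex, so the "about 150 KB" figure corresponds to double-precision complex. Either way the matrix is small enough for the argument above to hold.

```python
SINGLE_COMPLEX_BYTES = 8    # two 4-byte floats (real and imaginary parts)
DOUBLE_COMPLEX_BYTES = 16   # two 8-byte floats (real and imaginary parts)


def matrix_bytes(rows, cols, bytes_per_element):
    """Storage needed for a dense rows x cols matrix, without padding."""
    return rows * cols * bytes_per_element


if __name__ == "__main__":
    for name, size in (("single complex", SINGLE_COMPLEX_BYTES),
                       ("double complex", DOUBLE_COMPLEX_BYTES)):
        kib = matrix_bytes(100, 100, size) / 1024
        print(f"100x100 {name}: {kib:.1f} KiB")
```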

I hope this answers your question!

-Kyle
kyle
Administrator
 
Posts: 301
Joined: Fri Jun 12, 2009 7:47 pm

Re:sgesv in 1.1 is slow...

Postby john » Wed Feb 10, 2010 10:07 am

Hello,
I'd like to add a couple of random notes to this thread as well.

We have been looking into this problem (slowness in MATLAB) with varying results. Some of our systems experience the MATLAB slowdown while others do not. Since MATLAB is not one of our officially supported integrations at this time (we officially support C, C++, and Fortran), to some extent we are relying on the community to help get the data out there. I would reiterate that we are exploring MATLAB integration issues, but that we have not yet officially claimed support.

As an aside, if you are a full Premium user (ie non academic) then you will have access to our ticket-based support system at http://www.culatools.com/support. I can't see at the moment whether your account is full premium or whether it is academic premium. We ensure very quick responses there, so if you need an expedited response to an issue such as this then I would recommend using that channel if you have access to it.

Lastly, I will email my contact at Accelereyes to see if their upcoming version will have full support for the mldivide operator.

John
john
Administrator
 
Posts: 587
Joined: Thu Jul 23, 2009 2:31 pm

Re:sgesv in 1.1 is slow...

Postby blucey » Wed Feb 10, 2010 1:40 pm

cjest wrote:Also, Jacket (which uses CUDA internally) should support 'mldivide' with CULA's CGESV command. Perhaps it's worth asking the Jacket team about that one as well. This function might be slated for their next release.


Kyle and CJ,

I just wanted to let you know that we've been working on mldivide and we expect it to be part of the Jacket 1.3 release. We are aiming for this release around the end of this month or beginning of next month. I hope this helps.

If you have any specific use cases, you are welcome to send us a bit of your MATLAB code demonstrating the use case and we'd be glad to include it in our QA tests.

Brett Lucey
Chief Software Engineer
AccelerEyes, LLC
blucey
 
Posts: 1
Joined: Wed Nov 11, 2009 3:33 pm

Re:sgesv in 1.1 is slow...

Postby Boxed Cylon » Thu Feb 11, 2010 10:33 pm

Just to add to this dialog, I've just run the CUDA profiler on my application - the image showing the timing data is hopefully attached. It's been a while since I ran the profiler.

As you can see, sgesv appears as the calls to KernelCuLargeTrsmLUNN_A1 and KernelCuLargeTrsmLLNU_A (and perhaps KernelLASWP and KernelScaleTrsm_LUNN). These are called many, many times, and these routines take up a little over 20% of the compute time for this application - this is a loop with one call to sgesv and several calls to sgemm. Presumably, the two big time sinks, KernelCuLargeTrsmLUNN and KernelCuLargeTrsmLLNU_A, are associated with the LU decomposition. The largest matrices in my application are around 10,000 x 10,000; the first two dimensions of the call to culaDeviceSgesv would be around 100-400 (varies) and 10,000.

I still have not learned how it came to pass that when I first started using this routine (culaDeviceSgesv) it was 3X faster than the cpu/matlab, but after various changes (platform, matlab version, CUDA version, etc.) it is now 3X slower than the cpu/matlab... it's a mystery that will be left unsolved. I gather that the slowness of the routine is commonly experienced by others, in any case.

[Image: CUDA profiler timing data]
Boxed Cylon
 
Posts: 48
Joined: Fri Oct 16, 2009 8:57 pm

Re:sgesv in 1.1 is slow...

Postby john » Fri Feb 12, 2010 1:08 pm

This reply I think will only apply to Boxed Cylon.

Those two kernels you cited are used solely for the backsolve portion (ie after LU is complete) and each should be called a number of times roughly equal to the N parameter. Some of those other kernels on your list are ours as well (some gemms, some of the other trsm.) Since your N is small (100-400) and your number of TRSM calls is large, is this being run in a loop of some kind? Can you list for me the exact call parameters shown in that screenshot?

If you are indeed calling in a loop, I just want to note that sgesv factors the matrix each time it is called. If the matrix A doesn't change every time, you can always prefactor via getrf before you enter the loop and then use the factored matrix to solve your system via getrs in the loop.
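
The factor-once, solve-many pattern can be sketched in plain Python. The functions lu_factor and lu_solve below are hypothetical stand-ins written for this example that play the roles of getrf and getrs; they are not CULA's API and are far too slow for real use. The point is only that the O(n^3) factorization happens once, outside the loop, while each iteration pays just the O(n^2) triangular solves.

```python
def lu_factor(a):
    """Factor a square matrix (list of rows) as P*A = L*U with partial pivoting.

    Returns (lu, piv). lu stores L below the diagonal (unit diagonal implied)
    and U on and above it; piv[k] is the row swapped with row k at step k.
    """
    n = len(a)
    lu = [list(row) for row in a]
    piv = list(range(n))
    for k in range(n):
        # Pick the largest entry in column k at or below the diagonal.
        p = max(range(k, n), key=lambda r: abs(lu[r][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        piv[k] = p
        lu[k], lu[p] = lu[p], lu[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, piv


def lu_solve(lu, piv, b):
    """Solve A x = b using the factors from lu_factor."""
    n = len(lu)
    x = list(b)
    for k in range(n):                 # apply the recorded row swaps to b
        x[k], x[piv[k]] = x[piv[k]], x[k]
    for i in range(n):                 # forward substitution with unit-diagonal L
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):       # back substitution with U
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


if __name__ == "__main__":
    a = [[2.0, 1.0], [1.0, 3.0]]
    lu, piv = lu_factor(a)             # done once, before the loop
    for b in ([3.0, 5.0], [1.0, 0.0], [0.0, 1.0]):
        print(lu_solve(lu, piv, b))    # only the cheap solves run per iteration
```

This only pays off when A is the same on every iteration; if A changes each time, the factorization has to be redone anyway.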

I will admit that we have not spent much time considering the case where NRHS >> N. I will look into that.

All that said, I don't find your numbers troubling from the profiler results. Those two TRSM kernels are only adding up to 20% of your runtime and they represent the entire backsolve process after the LU factorize.

Here are my benchmarks for sgesv, from our benchmark precompiled example:

C:\Program Files\CULA\examples\benchmark>benchmark_.exe sgesv
Initializing CULA...
Initializing MKL...

Benchmarking the following functions:
-------------------------------------
             SGESV
-------------------------------------


     -- SGESV Benchmark  --

Size   CULA (s)    MKL (s)   Speedup
------ ---------- ---------- ---------
4096       0.59       1.08    1.8226
5120       0.81       1.98    2.4532
6144       1.20       3.42    2.8499
7168       1.70       5.14    3.0336
8192       2.33       7.88    3.3864

C:\Program Files\CULA\examples\benchmark>


(Do note that this is using CULA 1.2, which has a faster factorize and will be available shortly.)

Let's start there. Does this mesh with the numbers you receive from the benchmark example on your machine? If it does, then we know the answer is in the parameters and/or the matlab integrations. If it does not, then the answer could very well be in your hardware or software setup.
john
Administrator
 
Posts: 587
Joined: Thu Jul 23, 2009 2:31 pm

Re:sgesv in 1.1 is slow...

Postby Boxed Cylon » Sat Feb 13, 2010 3:43 am

I get similar numbers using the benchmark routine:
./benchmark sgesv
Initializing CULA...
Initializing MKL...

Benchmarking the following functions:
-------------------------------------
             SGESV
-------------------------------------


     -- SGESV Benchmark  --

Size   CULA (s)    MKL (s)   Speedup
------ ---------- ---------- ---------
4096       0.48       1.10    2.2788
5120       0.80       1.38    1.7236
6144       1.17       2.45    2.0849
7168       1.70       3.49    2.0517
8192       2.38       7.39    3.1072

(AMD 965 on AM3 motherboard and fast RAM vs a GTX260 GPU on Suse linux 11.1)

That said, I am fairly (but not entirely...) sure that sgesv runs fairly slowly compared to the CPU in matlab. It might be that matlab uses fairly efficient BLAS/LAPACK routines; perhaps we should test the cpu benchmark sgesv against matlab's back division "\". I posted comparisons before of matlab and sgesv calculations, which found that matlab on the cpu is faster by 2-3X for matrices N=1000-5000 in size; I've just verified this. I'm fairly sure (but not entirely...) that at one point it was the other way around; but perhaps I am merely losing my mind... :) Matlab version R2009b.

The routine is indeed a loop - actually a loop within a loop - but alas the matrix A is different each time. sgesv is actually called just once in each loop. It is a small part of the calculations of each loop - that it takes up 20% is actually fairly large, it seems to me. But the whole thing runs OK, so I'm too lazy to try to recode it as I had it before (which involved employing matlab's back division "\" together with copying the required data off and on the GPU). Perhaps with the new version 1.2 the problem (if there even is a problem) will resolve itself.
Boxed Cylon
 
Posts: 48
Joined: Fri Oct 16, 2009 8:57 pm

Re:sgesv in 1.1 is slow...

Postby kyle » Sat Feb 13, 2010 9:23 am

MATLAB's mldivide, aka \, calls LAPACK's GESV for dense, square matrices using Intel's MKL LAPACK implementation. If you were to benchmark MATLAB's mldivide, it should be very close to what's reported as MKL's speed in the CULA benchmark for GESV. Can you try benchmarking mldivide in MATLAB, using tic/toc, to make sure we are comparing apples to apples?

One important thing to consider: MATLAB has smart algorithm selection based on matrix properties for mldivide. If your matrix is sparse, banded, symmetric, or positive definite, a different (and faster) algorithm will be used!

Many more details about MATLAB's mldivide can be found here:

http://www.mathworks.com/access/helpdes ... ivide.html
kyle
Administrator
 
Posts: 301
Joined: Fri Jun 12, 2009 7:47 pm
