sgesv in 1.1 is slow...


Re:sgesv in 1.1 is slow...

Postby john » Tue Mar 02, 2010 1:14 pm

Just wanted to add that the preferred method of reporting your Matlab version is to copy-paste the output of the Matlab "version" command.
john
Administrator
 
Posts: 587
Joined: Thu Jul 23, 2009 2:31 pm

Re:sgesv in 1.1 is slow...

Postby cjest » Wed Mar 03, 2010 2:18 am

Reporting results based on:
Software:
Matlab: R2009b
Windows XP 32-bit
CULA 1.2 Premium

Hardware:
GPU: GeForce GTX 285
CPU: Intel Xeon X5450 @ 3.00 GHz

Problem: Find x in Ax = b; once you can find it, make it faster...

Observation: Using CULA 1.2 (no raw CUDA at all), a speedup is obtained for systems larger than about 1500. Otherwise Matlab is faster; for small systems (e.g. size(A) < 200) Matlab is dramatically faster than CULA.

Q: Is it Matlab (R2009b) that slows down the process?
A: I don't think so. I've tried some other mex functions (other CPU-based methods) for the same problem and have not seen that overhead cost.


Source code:
Note: single-precision complex only (culaCgesv).


// CULASV computes the solution to a system of linear equations A*X = B using CULA.
//
// Input:
//   A: coefficient matrix, single-precision complex, L x L
//   B: right-hand sides, single-precision complex, L x I
// Output:
//   X: solution, single-precision complex, L x I
//
// Calling from Matlab:
//   X = culasv(single(A), single(B))

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#include "mex.h"
#include "culapack.h"


void checkStatus(culaStatus status) {
    if (!status)
        return;

    if (status == culaArgumentError)
        mexPrintf("Invalid value for parameter %d\n", culaGetErrorInfo());
    else if (status == culaRuntimeError)
        mexPrintf("Runtime error (%d)\n", culaGetErrorInfo());
    else
        mexPrintf("%s\n", culaGetStatusString(status));

    culaShutdown();
    mexErrMsgTxt("CULA error!");
}


void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    // Input parameters: size(A) = LxL; size(B) = LxI
    int ii, jj;
    int L, I;
    const mwSize *dims;
    culaFloatComplex* A;
    culaFloatComplex* B;
    float *Ar, *Ai;
    float *Br, *Bi;

    // output X = Xr + Xi*i
    float *Xr, *Xi;

    // CULA variables
    culaInt* ipiv = 0;

    culaStatus status;

    // check the number of arguments
    if (nrhs != 2) {
        mexErrMsgTxt("Need two input arguments.");
    }
    if (nlhs != 1) {
        mexErrMsgTxt("Only one output argument allowed.");
    }
    if (mxGetNumberOfDimensions(prhs[0]) != 2) {
        mexErrMsgTxt("2D matrix required.");
    }

    // Get the dimensions
    dims = mxGetDimensions(prhs[1]);
    L = dims[0];
    I = dims[1];

    // Get pointers to the real and imaginary parts of the inputs
    Ar = (float*)mxGetPr(prhs[0]);
    Ai = (float*)mxGetPi(prhs[0]);
    Br = (float*)mxGetPr(prhs[1]);
    Bi = (float*)mxGetPi(prhs[1]);

    A = (culaFloatComplex*)mxMalloc(L*L*sizeof(culaFloatComplex));
    B = (culaFloatComplex*)mxMalloc(L*I*sizeof(culaFloatComplex));

    // the solution (same dimensions as B)
    plhs[0] = mxCreateNumericArray(2, dims, mxSINGLE_CLASS, mxCOMPLEX);
    Xr = (float*)mxGetPr(plhs[0]);
    Xi = (float*)mxGetPi(plhs[0]);

    // Allocate ipiv - the pivot array used by gesv
    ipiv = (culaInt*)mxMalloc(L*sizeof(culaInt));

    //------------------------
    status = culaInitialize();
    checkStatus(status);

    // Pack A and B into interleaved complex storage
    for (ii = 0; ii < L*L; ii++) {
        A[ii].x = Ar[ii];
        A[ii].y = Ai[ii];
    }
    for (ii = 0; ii < L*I; ii++) {
        B[ii].x = Br[ii];
        B[ii].y = Bi[ii];
    }

    // Set ipiv to 0
    memset(ipiv, 0, L*sizeof(culaInt));

    // Call culaCgesv; B is overwritten with the solution
    status = culaCgesv(L, I, A, L, ipiv, B, L);
    checkStatus(status);

    // Copy the solution into the mex output
    for (ii = 0; ii < I; ii++) {
        for (jj = 0; jj < L; jj++) {
            *Xr++ = B[jj+L*ii].x;
            *Xi++ = B[jj+L*ii].y;
        }
    }

    mxFree(ipiv);
    mxFree(A);
    mxFree(B);
    culaFreeBuffers();
    // Shutdown CULA
    culaShutdown();
}
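One caveat about the wrapper above: culaInitialize() and culaShutdown() run on every call, and the init cost alone is measured at roughly 0.3-0.4 s later in this thread. A minimal sketch of keeping CULA initialized across calls, assuming the same culapack.h interface as above (the persistent flag and the mexAtExit cleanup are an illustrative addition, not part of the posted code):

Code: Select all
#include "mex.h"
#include "culapack.h"

static int culaReady = 0;              /* persists between mex invocations */

static void shutdownCula(void)         /* runs when Matlab clears the mex file */
{
    if (culaReady) {
        culaShutdown();
        culaReady = 0;
    }
}

/* Call this at the top of mexFunction instead of culaInitialize(),
   and drop the per-call culaShutdown() at the end. */
static void ensureCulaInitialized(void)
{
    if (!culaReady) {
        culaStatus status = culaInitialize();
        if (status)                    /* nonzero status means an error */
            mexErrMsgTxt(culaGetStatusString(status));
        mexAtExit(shutdownCula);
        culaReady = 1;
    }
}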


Q: Why am I doing this?
A: I need to speed up Ax = b when A is no larger than 200.

Final Q: Any suggestions for solving this problem (size(A) ~ 200) using GPU computing? jpeinado, would you please give some figures/details on what you mean by: CULAPACK (sgetrf) + CUBLAS (triangular systems) = OK?

A: I leave that one to you...

BR //CJ
cjest
 
Posts: 12
Joined: Wed Feb 10, 2010 3:01 pm

Re:sgesv in 1.1 is slow...

Postby Boxed Cylon » Wed Mar 03, 2010 4:32 am

Just to report that CULA 1.2 sgesv seems to be slow with:
SUSE Linux 11.1 64-bit
CUDA 3.0 (yes, I know this is not yet supported... CUDA 2.3 gave a slow result also, I believe)
GTX 260
Matlab 7.9.0.529 (R2009b) 64-bit

Perhaps it is Matlab 32-bit vs. 64-bit that is the issue? This is a strange problem... (Perhaps there is a 32-bit flag that one could set when compiling the mex file?)
Boxed Cylon
 
Posts: 48
Joined: Fri Oct 16, 2009 8:57 pm

Re:sgesv in 1.1 is slow...

Postby dan » Wed Mar 03, 2010 8:12 am

Q: Why am I doing this?
A: I need to speed up Ax = b when A is no larger than 200.

Final Q: Any suggestions for solving this problem (size(A) ~ 200) using GPU computing?

@cjest,

Unfortunately, this is not something that, in this form, can be solved by GPU computing. There are just too many overheads involved in getting the data to the GPU and back. For example, for a 256x256 matrix, it takes 4 times as long just to download/upload the data to/from the GPU as the CPU takes to complete the calculation. Even if the GPU were infinitely fast in its calculation, the overall computation would be 4 times longer for this problem size.

The only chance that GPU computing would have for problems of this size is if you needed to solve many of these small systems at once. In that case you could share the overhead involved in downloading and uploading, and better utilize the GPU's parallel resources to get an overall speedup.
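For what it's worth, here is a rough sketch of that idea using CULA's device interface, assuming culaDeviceSgesv is available in your CULA build (the function name and header layout may differ between versions), with error handling omitted: copy all of the small systems to the GPU in one shot, solve them back to back on the device, and copy all of the solutions back in one transfer.

Code: Select all
#include <cuda_runtime.h>
#include "culapack.h"   /* plus the device-interface header if your CULA version splits it out */

/* Solve k independent n-by-n single-precision systems A_i x_i = b_i.
   hA holds the k matrices back to back (column-major, n*n floats each);
   hB holds the k right-hand sides (n floats each) and receives the solutions.
   Error handling is omitted to keep the sketch short. */
void solveSmallSystems(const float* hA, float* hB, int n, int k)
{
    float* dA;
    float* dB;
    culaInt* dIpiv;
    int i;

    cudaMalloc((void**)&dA, (size_t)k * n * n * sizeof(float));
    cudaMalloc((void**)&dB, (size_t)k * n * sizeof(float));
    cudaMalloc((void**)&dIpiv, (size_t)n * sizeof(culaInt));

    /* one download for all of the systems */
    cudaMemcpy(dA, hA, (size_t)k * n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, (size_t)k * n * sizeof(float), cudaMemcpyHostToDevice);

    /* back-to-back solves on the device, with no host round trips in between */
    for (i = 0; i < k; i++)
        culaDeviceSgesv(n, 1, dA + (size_t)i * n * n, n, dIpiv, dB + (size_t)i * n, n);

    /* one upload for all of the solutions */
    cudaMemcpy(hB, dB, (size_t)k * n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dIpiv);
}

Note that this only amortizes the transfer and per-call overheads; each small solve still runs on its own and underutilizes the GPU, so it is not a full answer to the parallel-resources point above.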

Why is it that you need to speed up the solve of matrices that are so small?

Dan
dan
Administrator
 
Posts: 61
Joined: Thu Jul 23, 2009 2:29 pm

Re:sgesv in 1.1 is slow...

Postby dan » Wed Mar 03, 2010 8:16 am

Boxed Cylon wrote:Just to report that CULA 1.2 sgesv seems to be slow with:
SUSE Linux 11.1 64-bit
CUDA 3.0 (yes, I know this is not yet supported... CUDA 2.3 gave a slow result also, I believe)
GTX 260
Matlab 7.9.0.529 (R2009b) 64-bit

Perhaps it is Matlab 32-bit vs. 64-bit that is the issue? This is a strange problem... (Perhaps there is a 32-bit flag that one could set when compiling the mex file?)


@Boxed Cylon

My results show good performance on 64-bit Windows, but it's possible this is a 64-bit issue on Linux. We have a 64-bit Ubuntu machine that I can try this on. If we don't see it there, it could be a SUSE-specific issue.

Dan
dan
Administrator
 
Posts: 61
Joined: Thu Jul 23, 2009 2:29 pm

Re:sgesv in 1.1 is slow...

Postby jpeinado » Wed Mar 03, 2010 8:47 am

Hello:

Well, my results were obtained on a 64-bit machine running (I must check which) a CentOS version...

About the MATLAB problem: yes, there is a problem with MATLAB when using the hybrid algorithms, because as far as I know MATLAB uses a special 64-bit LAPACK version. As a result, the hybrid algorithms don't work in MATLAB.


Also, the problem is with CULA, because in the results presented before, the results are bad when using CULA, but good when not using CULA (the UJI libraries only use CUBLAS)...

Cjest, thank you very much for your code. I will test it as soon as possible.


Dan, thank you very much for your work. About my results, they were obtained on:

Software:

CentOS 64-bit
MATLAB R2009b
CUDA 2.3

Hardware:

Intel Xeon E5430 (2.6 GHz)
Quadro FX 5800



jpeinado
jpeinado
 
Posts: 37
Joined: Mon Sep 14, 2009 10:48 am

Re:sgesv in 1.1 is slow...

Postby dan » Wed Mar 03, 2010 10:02 am

jpeinado wrote:About the MATLAB problem: yes, there is a problem with MATLAB when using the hybrid algorithms, because as far as I know MATLAB uses a special 64-bit LAPACK version. As a result, the hybrid algorithms don't work in MATLAB.

@jpeinado

It is true that CULA's routines are hybrid algorithms, but as our results (and some other users') have shown, it is not the case that there is a slowdown in all versions of Matlab. Right now the common thread appears to be 64-bit Linux versions of Matlab. I'm planning on getting this tested ASAP so when I get some results I'll post them here.

Dan
dan
Administrator
 
Posts: 61
Joined: Thu Jul 23, 2009 2:29 pm

Re:sgesv in 1.1 is slow...

Postby jpeinado » Wed Mar 03, 2010 11:53 am

dan wrote:
jpeinado wrote:About the MATLAB problem: yes, there is a problem with MATLAB when using the hybrid algorithms, because as far as I know MATLAB uses a special 64-bit LAPACK version. As a result, the hybrid algorithms don't work in MATLAB.

@jpeinado

It is true that CULA's routines are hybrid algorithms, but as our results (and some other users') have shown, it is not the case that there is a slowdown in all versions of Matlab. Right now the common thread appears to be 64-bit Linux versions of Matlab. I'm planning on getting this tested ASAP so when I get some results I'll post them here.

Dan




Thank you very much Dan

jpeinado
jpeinado
 
Posts: 37
Joined: Mon Sep 14, 2009 10:48 am

Re:sgesv in 1.1 is slow...

Postby dan » Wed Mar 03, 2010 12:58 pm

So, my testing in Matlab 7.9 on Ubuntu 9.10 64-bit has shown no slowdown.

One of the things I've noticed is that the Matlab installer gives you the option of selecting an architecture. I chose x64 (the default) to match the platform; for those who are seeing problems, is it possible you selected something other than this?

Also, when you're doing your runtime link, are you making sure to link against the 64-bit versions of the CULA libs (/usr/local/cula/lib64), as opposed to the 32-bit versions (/usr/local/cula/lib)?

@jpeinado

Can you find out what version of CentOS you are using?
dan
Administrator
 
Posts: 61
Joined: Thu Jul 23, 2009 2:29 pm

Re:sgesv in 1.1 is slow...

Postby jpeinado » Wed Mar 03, 2010 2:36 pm

dan wrote:So, my testing in Matlab 7.9 on Ubuntu 9.10 64-bit has shown no slowdown.


Happy to hear this...


One of the things I've noticed is that the Matlab installer gave you the option of selecting an architecture. I chose x64 (the default) to match the platform; for those that are seeing problems, is it possible you selected something other than this?

No. In fact, the compiled MEX file's name changes if you use the NVIDIA MATLAB plugin with the corresponding Makefile; the extension is .mexa64.


Also, when you're doing your runtime link, are you making sure to link against the 64-bit versions of the CULA libs (/usr/local/cula/lib64), as opposed to the 32-bit versions (/usr/local/cula/lib) ?

I have this pointing to /usr/local/cula/lib64

@jpeinado

Can you find out what version of CentOS you are using?


Yes, I will talk with the system administrator as soon as possible and let you know.

By the way, could you put your programs (makefiles, mex files, etc.) in a zip file so I can run them on my system?

Thank you very much


jpeinado

P.S. In the next few days I will get a machine with a 64-bit Ubuntu version.
jpeinado
 
Posts: 37
Joined: Mon Sep 14, 2009 10:48 am

Re:sgesv in 1.1 is slow...

Postby Boxed Cylon » Wed Mar 03, 2010 10:21 pm

@Dan

One test we could run is to have you compile the attached code on your Ubuntu 64-bit installation and have me run the result on my machine (who knows?). At least it might determine whether the problem lies in the compile environment or the runtime environment.

This code is my debugging/timing routine - it just calculates X = A\B on the GPU. Call it from Matlab with [X] = gpu_sgesv(A,B);

[file name=gpu_sgesv-20100303.txt size=3876]http://www.culatools.com/images/fbfiles/files/gpu_sgesv-20100303.txt[/file]

Rename the file to gpu_sgesv.cu ...
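The attachment itself is not reproduced here, but as a rough idea of what the timing core of such a wrapper looks like (this is a reconstruction based on the printed output later in the thread, not the attached gpu_sgesv.cu; the helper names are made up):

Code: Select all
#include <sys/time.h>
#include <stdio.h>
#include "culapack.h"

static double wallclock(void)
{
    struct timeval tv;
    gettimeofday(&tv, 0);
    return tv.tv_sec + 1e-6 * tv.tv_usec;
}

/* A is n-by-n and B is n-by-nrhs (column-major singles); on return B holds X.
   Prints the "$$$$" timing line and the corners of X, as in the output below. */
static void timedSgesv(float* A, float* B, int* ipiv, int n, int nrhs)
{
    double t0 = wallclock();
    culaSgesv(n, nrhs, A, n, ipiv, B, n);
    printf("$$$$$$$$$$  %.3f s\n", wallclock() - t0);

    printf("X-top = %e %e %e\n", B[0], B[n], B[2*n]);
    printf("X-bottom = %e %e %e\n",
           B[(nrhs-3)*n + n-1], B[(nrhs-2)*n + n-1], B[(nrhs-1)*n + n-1]);
}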
Boxed Cylon
 
Posts: 48
Joined: Fri Oct 16, 2009 8:57 pm

Re:sgesv in 1.1 is slow...

Postby jpeinado » Thu Mar 04, 2010 4:37 am

dan wrote:@jpeinado

Can you find out what version of centos you are using?


5.2

jpeinado
jpeinado
 
Posts: 37
Joined: Mon Sep 14, 2009 10:48 am

Re:sgesv in 1.1 is slow...

Postby cjest » Thu Mar 04, 2010 8:21 am

dan wrote:Why is it that you need to speed up the solve of matrices that are so small?



I have a "for-loop" around the system solver with small matrices, so I do need to solve many of these small systems in parallel. The execution time over the entire loop needs to be reduced.

Any hints?
Can I use Cgesv for that?
cjest
 
Posts: 12
Joined: Wed Feb 10, 2010 3:01 pm

Re:sgesv in 1.1 is slow...

Postby dan » Thu Mar 04, 2010 11:51 am

Boxed Cylon wrote:One test we could run is to have you compile the attached code on your Ubuntu 64-bit installation and have me run the result on my machine (who knows?). At least it might determine whether the problem lies in the compile environment or the runtime environment.

This code is my debugging/timing routine - it just calculates X = A\B on the GPU. Call it from Matlab with [X] = gpu_sgesv(A,B);


@Boxed Cylon

I ran your mex file. Here are the results of an example run:

Code: Select all
>> A = rand(2048,2048,'single');
>> B = rand(2048,64,'single');
>> tic; A\B; toc;
Elapsed time is 0.932283 seconds.
>> tic; [x] = gpu_sgesv(A,B); toc;
Initializing CULA...
$$$$$$$$$$  0.334 s
X-top = -9.435948e-02 3.882708e-01 -6.582417e-02
X-bottom = 4.102392e-01 4.056736e-01 -2.996187e-01
Elapsed time is 0.706791 seconds.
>> tic; [x] = gpu_sgesv(A,B); toc;
Initializing CULA...
$$$$$$$$$$  0.403 s
X-top = -9.435948e-02 3.882708e-01 -6.582417e-02
X-bottom = 4.102392e-01 4.056736e-01 -2.996187e-01
Elapsed time is 0.429575 seconds.


The first run took 0.71 seconds, while the second took 0.43. The difference between the two is the init time, which I've measured at around 0.3-0.4 seconds; these results back that up.

With these results, I think it's fair to say that the problem isn't in your mex file. I'm attaching your compiled mex to this message (note that I've called it culaGesv2 to compare it against our homegrown mex file). Let me know how you make out with this.

Dan [file name=culaGesv2.zip size=3722]http://www.culatools.com/images/fbfiles/files/culaGesv2.zip[/file]
dan
Administrator
 
Posts: 61
Joined: Thu Jul 23, 2009 2:29 pm

Re:sgesv in 1.1 is slow...

Postby Boxed Cylon » Thu Mar 04, 2010 12:53 pm

Hmm... a bit of an answer develops:

My original routine, running your set of Matlab lines:
Code: Select all
A = rand(2048,2048,'single');
B = rand(2048,64,'single');
tic; A\B; toc;
Elapsed time is 0.656199 seconds.
tic; [x] = gpu_sgesv(A,B); toc;
Initializing CULA...
$$$$$$$$$$  0.173 s
X-top = -9.435948e-02 3.882708e-01 -6.582417e-02
X-bottom = 4.102392e-01 4.056736e-01 -2.996187e-01
Elapsed time is 0.593104 seconds.
tic; [x] = gpu_sgesv(A,B); toc;
Initializing CULA...
$$$$$$$$$$  0.179 s
X-top = -9.435948e-02 3.882708e-01 -6.582417e-02
X-bottom = 4.102392e-01 4.056736e-01 -2.996187e-01
Elapsed time is 0.196459 seconds.


Running the same, using your compiled mex:
Code: Select all
tic; A\B; toc;
Elapsed time is 0.664783 seconds.
>> tic; [x] = culaGesv2(A,B); toc;
Initializing CULA...
$$$$$$$$$$  0.196 s
X-top = 9.946308e-01 2.093953e+00 -6.212948e-01
X-bottom = -3.948123e-01 -1.171244e+00 1.621502e+00
Elapsed time is 0.211650 seconds.
>> tic; [x] = culaGesv2(A,B); toc;
Initializing CULA...
$$$$$$$$$$  0.189 s
X-top = 9.946308e-01 2.093953e+00 -6.212948e-01
X-bottom = -3.948123e-01 -1.171244e+00 1.621502e+00
Elapsed time is 0.201372 seconds.


That is all fabulous, and I think consistent with what I had before!

My test routine was doing something like this, the only difference being 64 -> 5000:
Code: Select all
A = rand(2048,2048,'single');
B = rand(2048,5000,'single');
tic; A\B; toc;
Elapsed time is 3.444536 seconds.
tic; A\B; toc;
Elapsed time is 3.426751 seconds.
tic; [x] = culaGesv2(A,B); toc;
Initializing CULA...
$$$$$$$$$$  5.432 s
X-top = 3.081349e-01 1.812777e-01 3.940895e-01
X-bottom = 1.791551e+00 -8.022842e-01 3.020534e-02
Elapsed time is 5.505551 seconds.
tic; [x] = culaGesv2(A,B); toc;
Initializing CULA...
$$$$$$$$$$  5.426 s
X-top = 3.081349e-01 1.812777e-01 3.940895e-01
X-bottom = 1.791551e+00 -8.022842e-01 3.020534e-02
Elapsed time is 5.550902 seconds.


So the question is not so much why sgesv is slow, as why it is slow when the 2nd dimension of B is large. In my own application, this dimension is about 1000. This result seems odd - presumably all the computing is in setting up the inverse; I would have thought the timing would be rather independent of the 2nd dimension of B.
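One possible piece of the puzzle, using nothing more than the standard LAPACK operation counts (textbook figures, not measurements from this thread): for n = 2048 the LU factorization costs about (2/3)*n^3 flops, while back-substituting nrhs right-hand sides costs about 2*n^2*nrhs flops, so with nrhs = 5000 most of the arithmetic is actually in the solve phase rather than the factorization, and some dependence on the second dimension of B is expected.

Code: Select all
#include <stdio.h>

int main(void)
{
    const double n = 2048.0;
    const double factorFlops = (2.0 / 3.0) * n * n * n;   /* LU factorization */

    /* triangular solves: ~2 * n^2 * nrhs flops */
    printf("factorization:      %.1f GFLOP\n", factorFlops / 1e9);
    printf("solves, nrhs=64:    %.1f GFLOP\n", 2.0 * n * n * 64.0 / 1e9);
    printf("solves, nrhs=5000:  %.1f GFLOP\n", 2.0 * n * n * 5000.0 / 1e9);
    return 0;
}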

It's nice to finally have a bit of a handle on the issue!

Just to complete the story, the graph of T_cpu/T_gpu for the case where the second dimension of B is 64 is:

[image: plot of T_cpu/T_gpu vs. matrix size, for B with second dimension 64]

Incidentally, X-top and X-bottom should be equal to X(1,1:3) and X(end,end-2:end) if X = A\B (within the numerical error of single precision, etc.).
Boxed Cylon
 
Posts: 48
Joined: Fri Oct 16, 2009 8:57 pm
