LAPACK and CULA

General discussion for CULA. Use this forum for questions, examples, feedback, and feature requests.

LAPACK and CULA

Postby kolonel » Wed Nov 04, 2009 7:28 pm

Has anybody compared CULA's speed on the GPU against LAPACK's on the CPU? Is such a comparison valid?
University of Alberta
kolonel
 
Posts: 14
Joined: Mon Oct 26, 2009 9:18 pm
Location: Canada

Re: LAPACK and CULA

Postby kyle » Wed Nov 04, 2009 8:54 pm

Hi kolonel,

We have published CULA's speed versus various LAPACK implementations.

At [url="http://www.culatools.com/performance"]this page[/url], you can see CULA's performance versus the reference implementation of LAPACK from Netlib. The Netlib code is a single-threaded Fortran code base that has not been optimized for any specific platform. It is, however, free to download and distribute. Compared to this implementation, CULA is 20-120x faster.

We have also published performance numbers against Intel's multi-core optimized MKL LAPACK implementation on our [url="http://www.culatools.com/mkl"]MKL performance page[/url]. This software from Intel is not free, but it has been highly optimized to perform well on multi-core processors. Most researchers consider MKL's LAPACK the "gold standard" for LAPACK performance. For our comparison against MKL, we benchmarked against a quad-core Intel i7 processor. As you can see, we exhibit a 3-7x speed-up against this modern CPU.

As far as validity goes, we implement the same algorithms featured in LAPACK. No shortcuts were taken that would invalidate the algorithms or make them less accurate or stable.

If you have any other questions, please let us know!
kyle
Administrator
 
Posts: 301
Joined: Fri Jun 12, 2009 7:47 pm

Re: LAPACK and CULA

Postby kolonel » Thu Nov 05, 2009 10:46 am

Hi Kyle,
Thanks for your reply and the information.
I am eager to compare CULA and LAPACK for my simulation case too, but I am really confused about which version of LAPACK is appropriate to install and use with C++. My platform is Windows XP 64-bit, and I use VS 2005 Professional Edition.
Please give me some hints about this.
Thanks.
University of Alberta
kolonel
 
Posts: 14
Joined: Mon Oct 26, 2009 9:18 pm
Location: Canada

Re: LAPACK and CULA

Postby kyle » Thu Nov 05, 2009 11:59 am

If you want a free package, you can try cLAPACK. This distribution is essentially the Fortran LAPACK code run through the 'f2c' Fortran-to-C converter, with some additional small tweaks. Conveniently, they also provide pre-compiled libraries for Visual Studio. Note that this package will likely be slow, but it should provide full functionality.

Another free option is AMD's ACML. We don't have a lot of experience with it, so we can't really comment on its performance.

You could also try MKL, but it does not have a free version available. It will be considerably faster than cLAPACK, but of course not as fast as CULA :) .

Hope this helps,

Kyle
kyle
Administrator
 
Posts: 301
Joined: Fri Jun 12, 2009 7:47 pm

Re: LAPACK and CULA

Postby jacek » Sun Jul 11, 2010 10:27 pm

Hi,
I saw your benchmarks comparing CULA's ssyev/dsyev with MKL 10.2. It looks very interesting. The one thing that puzzles me is how you estimated the cost as 4N^3. Is the cost of dsyev/ssyev approximately equal to just two matrix multiplications? How did you get that estimate? Sorry if this is a trivial question, but I am really very interested. I am working on another GPU algorithm to replace diagonalization, and I plan to test CULA too. Hope you can help me.
Regards,
Jacek
jacek
 
Posts: 3
Joined: Sun Jul 11, 2010 10:19 pm

Re: LAPACK and CULA

Postby john » Mon Jul 12, 2010 8:40 am

The work shown is for determination of eigenvalues only, and when combined with the symmetric nature of the problem (i.e., only accessing half the data that a dense problem would), we end up with a low-looking estimate like that. It is difficult to determine the exact number of operations in an eigenvalue problem, so we are stuck with a ballpark estimate. Note that we used the same metric for the CPU and both GPUs, so it is still useful for comparing relative performance.
john
Administrator
 
Posts: 587
Joined: Thu Jul 23, 2009 2:31 pm

Re: LAPACK and CULA

Postby jacek » Mon Jul 12, 2010 10:32 am

John,

Thank you. I agree with your statement about comparing relative performance. So let's see if I understand this correctly. For M = 4096, your FLOP estimate, 4M^3, gives a total of 4*(4*1024)^3 = 256*(1024)^3 = 256 GFLOP. Now, from your plots we see that the performance of CULA (on the C1060) and of MKL 10.2 for ssyev is around 32 GFLOPS. This means the total diagonalization time is about 8 seconds (256 GFLOP / 32 GFLOPS = 8 s) on both the CPU and the C1060. Am I correct?

I also have some other questions:
1) Are the MKL results for a multi-threaded run or a single-threaded one? How many threads did you use on the CPU?
2) Do the GPU results include memory transfers between CPU and GPU?
3) What happens to your timing when you also request eigenvectors (not just eigenvalues)? What happens if you also request that the eigenvectors be transferred back to the CPU? How much does that increase the total time?


Jacek
jacek
 
Posts: 3
Joined: Sun Jul 11, 2010 10:19 pm

Re: LAPACK and CULA

Postby kyle » Tue Jul 13, 2010 7:44 am

1) Are the MKL results for a multi-threaded run or a single-threaded one? How many threads did you use on the CPU?

All of our MKL results use full multi-threading: 4 cores and 8 threads.
2) Do the GPU results include memory transfers between CPU and GPU?

All timings use the device interface, which excludes transfers. For the more computationally complex routines, such as symmetric eigenvalues, the transfer time is about 1% of the total runtime.
3) What happens to your timing when you also request eigenvectors (not just eigenvalues)? What happens if you also request that the eigenvectors be transferred back to the CPU? How much does that increase the total time?

If eigenvectors are requested, we use a completely different algorithm under the hood. This algorithm is still being tuned by our developers and hasn't reached full GPU potential. As of CULA 2.0, the symmetric eigenvector routine is only about 1.5-2x faster than its MKL equivalent, compared to 3-6x for eigenvalues. As for transfer times, they will account for well under 1% of the total runtime for this routine when vectors are requested.
kyle
Administrator
 
Posts: 301
Joined: Fri Jun 12, 2009 7:47 pm

Re: LAPACK and CULA

Postby jacek » Tue Jul 13, 2010 11:55 am

Thanks Kyle.

I take it, then, that my 8-second estimate for dsyev (both MKL and CULA on the C1060) with M=4096 is correct. Am I right?

Is there a paper in which you published your CULA results? If I want to cite your work, how should I do it?

Jacek
jacek
 
Posts: 3
Joined: Sun Jul 11, 2010 10:19 pm

Re: LAPACK and CULA

Postby wave911 » Thu Mar 17, 2011 12:14 am

Hello.
I used CULA to solve the Navier-Stokes equations with the finite element method and noticed that the function culaSgesv works much slower than LAPACK's sgesv. For example, for a matrix of size 5661 x 5661 (data type float), solving the linear system takes:
culaSgesv 6.74 sec, LAPACK sgesv 5.6 sec. GPU: Nvidia GTS250; CPU: Intel i3 2.6 GHz; CUDA version 3.2 (Jan. 2011); CULA R10 Basic.

Why is CULA so slow? (((

Sorry for my poor English.
wave911
 
Posts: 5
Joined: Tue Feb 01, 2011 10:41 pm

Re: LAPACK and CULA

Postby kyle » Thu Mar 17, 2011 5:32 am

wave911 wrote:For example, for a matrix of size 5661 x 5661 [...] culaSgesv 6.74 sec, LAPACK sgesv 5.6 sec. GPU: Nvidia GTS250 [...] Why is CULA so slow?

While the GTS250 isn't a very powerful CUDA device, performance should still be much higher. Have you tried the CULA benchmark program to test performance outside of your program?
kyle
Administrator
 
Posts: 301
Joined: Fri Jun 12, 2009 7:47 pm

Re: LAPACK and CULA

Postby wave911 » Thu Mar 17, 2011 9:10 am

Thank you for quick reply!
Yes, I have tried the benchmark, and in all its tests CULA is faster than MKL. So I'm puzzled: when I use culaSgesv in my application (1.125 sec), LAPACK is much faster (0.578 sec). I also tried the same code on a GTX275 GPU, and LAPACK was faster again, even though in the benchmark CULA was faster than MKL.
wave911
 
Posts: 5
Joined: Tue Feb 01, 2011 10:41 pm

Re: LAPACK and CULA

Postby john » Wed Mar 23, 2011 2:08 pm

I doubt that LAPACK is correctly solving a 5600-sized system in 0.578 seconds. Can you post your timing code? I suspect a problem there.
john
Administrator
 
Posts: 587
Joined: Thu Jul 23, 2009 2:31 pm

Re: LAPACK and CULA

Postby wave911 » Thu Apr 14, 2011 1:58 am

This is my code:
void CFEM::LapackSLAU(float *M, float *B, int count, int ValType)
{
    int status = 0;
    int NRHS = 1;
    int* IPIV = NULL;

    if ((!M) || (!B))
        return;

    IPIV = (int*)calloc(count, sizeof(int));
    if (!IPIV)
        return;

    int N = count;
    int lda = count;
    int ldb = count;

    /* sgesv_ overwrites B with the solution of M * x = B */
    sgesv_((integer*)&N, (integer*)&NRHS, M, (integer*)&lda, (integer*)IPIV, B, (integer*)&ldb, (integer*)&status);

    free(IPIV);

    if (ValType == Pressure)
        memcpy(P, B, count * sizeof(float));
    if (ValType == V)
        memcpy(U, B, count * sizeof(float));
    return;
}

-------------------------------------------------------------------------------------------------------------
void CFEM::CulacheckStatus(culaStatus status)
{
    char buf[80];

    if (!status)
        return;

    culaGetErrorInfoString(status, culaGetErrorInfo(), buf, sizeof(buf));
    printf("%s\n", buf);

    culaShutdown();
    exit(EXIT_FAILURE);
}

void CFEM::CulaSLAU(float *M, float *B, int count, int ValType)
{
    culaStatus status;
    int NRHS = 1;
    culaInt* IPIV = NULL;

    if ((!M) || (!B))
        return;

    IPIV = (culaInt*)malloc(count * sizeof(culaInt));
    if (!IPIV)
        return;

    /* culaSgesv overwrites B with the solution of M * x = B */
    status = culaSgesv(count, NRHS, M, count, IPIV, B, count);
    CulacheckStatus(status);

    free(IPIV);

    if (ValType == Pressure)
        memcpy(P, B, count * sizeof(float));
    if (ValType == V)
        memcpy(U, B, count * sizeof(float));

    return;
}
-------------------------------------------------------------------------------------------------------------
timing code for CULA
start = clock();
fem->CulaSLAU(fem->KKT, fem->FF, fem->pointcount * 3, V);
finish = clock();
printf ("SLAU V time=%f\n", double(finish - start)/CLOCKS_PER_SEC);

timing code for LAPACK
start = clock();
fem->LapackSLAU(fem->KKT, fem->FF, fem->pointcount * 3, V);
finish = clock();
printf ("SLAU V time=%f\n", double(finish - start)/CLOCKS_PER_SEC);


Thanks a lot for any help!!!

Why is LAPACK so much faster than CULA? ((
wave911
 
Posts: 5
Joined: Tue Feb 01, 2011 10:41 pm

Re: LAPACK and CULA

Postby kyle » Thu Apr 14, 2011 4:42 am

wave911 wrote:Why is LAPACK so much faster than CULA? ((


Your calling code looks OK.

What's your matrix size?

What's your GPU / CPU?
kyle
Administrator
 
Posts: 301
Joined: Fri Jun 12, 2009 7:47 pm
