## matlab svd (cula_link) fails (sgesdd vs sgesvd)

### matlab svd (cula_link) fails (sgesdd vs sgesvd)

No one seems to have noticed (or am I doing something wrong?):

The CULA link interface for MATLAB (as discussed at http://www.culatools.com/blog/category/interfaces/) fails to engage the GPU when calling the MATLAB svd command.

The log that the interface dumps shows that the MATLAB svd command is now calling LAPACK sgesdd instead of sgesvd. Looking at the supported LAPACK SVD functions in CULA, it seems that sgesdd is not one of the supported/implemented routines, so it makes sense that the link interface just falls back to the Intel MKL BLAS/LAPACK for sgesdd (on the CPU only, so no GPU speedup for MATLAB svd, for now).

It seems that sgesdd is not even supported in CULA Premium, which is especially disappointing.

Has anyone else noticed this problem? Is anyone using CULA link with MATLAB?

Given that sgesvd is slower than sgesdd, I'm also noticing that the user-defined MEX routines (also mentioned on the above interfaces blog page) are not faster than the MATLAB svd command. That is, the culaSvd(A) MEX/CULA routine is not faster than the default MATLAB svd(A) (with the link interface not used).

So, there is no version of CULA (link interface or user MEX call to CULA routines) that I know of that is faster than MATLAB's (version 2011a) own built-in svd (running on 8 cores on a dual Xeon), even with high-end GPU cards (I tried both a Tesla 2050 and a GTX 580). Is this really true, or am I doing something wrong?


- chester1248
**Posts:** 5 **Joined:** Tue Oct 06, 2009 6:38 pm

### Re: matlab svd (cula_link) fails (sgesdd vs sgesvd)

Are you calling a full SVD, [U,S,V] = svd(A), or a partial one, [~,S,~] = svd(A)? Also, what matrix size are you using?

I don't have access to MATLAB at the moment, but I know off the top of my head that in MATLAB 2011a, a full SVD will call xGESVD under the hood.


- kyle
- Administrator
**Posts:** 301 **Joined:** Fri Jun 12, 2009 7:47 pm

### Re: matlab svd (cula_link) fails (sgesdd vs sgesvd)

Thanks for the quick feedback -- but I do think sgesdd is being called in 2011a:

I don't have access to my Windows 7 x64 machine running MATLAB 2011a right now, but I just checked and I get the same result on my Mac OS X (Snow Leopard) machine at home, also running MATLAB 2011a. So here are some details (which look identical to what I get on Windows):

Using the CULA link interface (latest CULA R12 Free, for now) when starting up MATLAB:

>> randn('seed',1);A=randn(2000,2000,'single');

>> tic;svd(A);toc

cpu_id: x86 Family 6 Model 5 Stepping 5, GenuineIntel

libmwlapack: trying environment...

libmwlapack: loading libcula_link.dylib

libmwlapack: loaded libcula_link.dylib@0x1305950c0

libmwlapack: libcula_link.dylib is not a compatibility layer.

Elapsed time is 4.572858 seconds.

>> tic;svd(A);toc

Elapsed time is 2.981600 seconds.

>> tic;[u,s,v]=svd(A);toc

Elapsed time is 5.100251 seconds.

>> tic;[u,s,v]=svd(A);toc

Elapsed time is 5.145978 seconds.

The log dumped by the CULA link interface indicates that svd(A) with nargout=0 does call sgesvd, but that [u,s,v]=svd(A) calls sgesdd instead. This might be something MathWorks changed very recently: I think sgesdd is supposed to be faster than sgesvd in many cases, so they probably got wise, and so CULA needs to provide an sgesdd to keep up?

cula info: sgesvd (N, N, 2000, 2000, 0x13407b000, 2000, 0x107553000, 0x0, 2000, 0x0, 2000)

cula info: issuing to CPU (work query)

cula info: CPU library is lapackcpu.dylib

cula info: work query returned 70000

cula info: done

cula info: sgesvd (N, N, 2000, 2000, 0x13407b000, 2000, 0x107553000, 0x0, 2000, 0x0, 2000)

cula info: issuing to GPU (over threshold)

cula info: done

cula info: sgesvd (N, N, 2000, 2000, 0x13407b000, 2000, 0x14337e000, 0x0, 2000, 0x0, 2000)

cula info: issuing to CPU (work query)

cula info: work query returned 70000

cula info: done

cula info: sgesvd (N, N, 2000, 2000, 0x13407b000, 2000, 0x14337e000, 0x0, 2000, 0x0, 2000)

cula info: issuing to GPU (over threshold)

cula info: done

cula info: sgesdd ()

cula info: issuing to CPU (no GPU function available)

cula info: work query returned 12014000

cula info: work query returned 0

cula info: done

cula info: sgesdd ()

cula info: issuing to CPU (no GPU function available)

cula info: done

cula info: sgesdd ()

cula info: issuing to CPU (no GPU function available)

cula info: work query returned 12014000

cula info: work query returned 0

cula info: done

cula info: sgesdd ()

cula info: issuing to CPU (no GPU function available)

cula info: done
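A log like the one above can be tallied mechanically to see which routines reached the GPU. The following is a minimal Python sketch, assuming the log always has the shape shown in this post (a routine line followed by an `issuing to ...` line). The name `summarize_dispatch` and the sample string are made up for this example and are not part of CULA.

```python
import re
from collections import Counter

# A line naming the routine, e.g. "cula info: sgesdd ()" or "cula info: sgesvd (N, N, ...)".
ROUTINE = re.compile(r"^cula info: (\w+) \(.*\)\s*$")
# A dispatch line, e.g. "cula info: issuing to CPU (no GPU function available)".
DISPATCH = re.compile(r"^cula info: issuing to (CPU|GPU) \((.+)\)\s*$")

def summarize_dispatch(log_text):
    """Count (routine, device, reason) triples found in a CULA link log."""
    counts = Counter()
    current = None  # most recent routine name seen
    for line in log_text.splitlines():
        line = line.strip()
        m = ROUTINE.match(line)
        if m:
            current = m.group(1)
            continue
        m = DISPATCH.match(line)
        if m and current is not None:
            counts[(current, m.group(1), m.group(2))] += 1
    return counts

# Excerpt copied from the log in this post.
SAMPLE_LOG = """
cula info: sgesvd (N, N, 2000, 2000, 0x13407b000, 2000, 0x107553000, 0x0, 2000, 0x0, 2000)
cula info: issuing to CPU (work query)
cula info: work query returned 70000
cula info: done
cula info: sgesvd (N, N, 2000, 2000, 0x13407b000, 2000, 0x107553000, 0x0, 2000, 0x0, 2000)
cula info: issuing to GPU (over threshold)
cula info: done
cula info: sgesdd ()
cula info: issuing to CPU (no GPU function available)
cula info: done
"""

for (routine, device, reason), n in sorted(summarize_dispatch(SAMPLE_LOG).items()):
    print(routine, device, reason, n)
```

On the excerpt, only sgesvd has a GPU entry; every sgesdd call is reported as going to the CPU with the reason "no GPU function available".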

BTW, without cula link, matlab gives these times:

>> randn('seed',1); A=randn(2000,2000,'single');

>> tic;svd(A);toc

Elapsed time is 3.238710 seconds.

>> tic;svd(A);toc

Elapsed time is 3.228221 seconds.

>> tic;[u,s,v]=svd(A);toc

Elapsed time is 5.330816 seconds.

>> tic;[u,s,v]=svd(A);toc

Elapsed time is 5.363748 seconds.

So, the GPU (330M, in a 2010 MBP) gets 2.98 secs for the nargout=0 call to svd (the only case which actually runs CULA GPU code), whereas the CPU (dual core, both cores go to 100%) gets 3.2 secs. They are about the same speed, which I think is about right given the peak GFLOPS of the 330M. [On my Windows 8-core dual Xeon machine, the CULA svd (with nargout=0) running on the GPU is about 3x faster than MATLAB's built-in svd; most recently I was testing a GTX 560 Ti, for which CUBLAS SGEMM gets me about 500 GFLOPS.]

This same result occurs for other matrix sizes, on both Windows 7 x64 and OSX -- so I think it is the case that Matlab 2011a *does* now call sgesdd when outputs ARE requested.

Hopefully sgesdd is something that CULA can add in the next release or soon after. I would love a good excuse to buy CULA Premium, and then get dgesdd as well ... [and then hopefully in the near future see CULA get some SpMV routines as well ...]

--Dennis


- chester1248
**Posts:** 5 **Joined:** Tue Oct 06, 2009 6:38 pm

### Re: matlab svd (cula_link) fails (sgesdd vs sgesvd)

BTW, I also just now tried Kyle's [~,s,~]=svd(A) example (I never realized the convenient "~" output syntax existed in Matlab ...), but the CULA log indicates that in that case Matlab's svd also calls sgesdd().

- chester1248
**Posts:** 5 **Joined:** Tue Oct 06, 2009 6:38 pm

### Re: matlab svd (cula_link) fails (sgesdd vs sgesvd)

Hmm, this must be something new in 2011a.

The "SVDD" is simply the divide-and-conquer method. The great majority of the code is similar to a "normal SVD"; that is, the algorithmic flow is the same:

1) Bidiagonalization

2) Orthogonalization

3) Iterative singular value extraction + singular vectors (if requested)

In "SVDD", the 1st and 2nd steps are identical to the "normal SVD". However, more parallelism can be extracted from the 3rd step because, depending on the data, the problem can be broken into independent sub-problems.

So, the point is that we have the majority of the work done. We'll look into the work required for implementing the "divide-and-conquer" portion of step 3.
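Step 1 of the flow above can be illustrated with a textbook Householder (Golub-Kahan) bidiagonalization. This is only a minimal, unblocked Python sketch for small dense matrices, assuming m >= n. It is not how CULA or LAPACK implement the step (LAPACK's xGEBRD is a blocked routine), and the function name `bidiagonalize` is hypothetical.

```python
import math

def bidiagonalize(a):
    """Return the upper-bidiagonal B = Q^T * A * P for an m x n list-of-rows A (m >= n).

    Only B is formed; the orthogonal factors Q and P are applied but not stored.
    """
    m, n = len(a), len(a[0])
    b = [row[:] for row in a]  # work on a copy

    def householder(x):
        # Vector v such that (I - 2 v v^T / v^T v) x is zero after its first entry.
        norm = math.sqrt(sum(t * t for t in x))
        if norm == 0.0:
            return None  # nothing to eliminate
        v = x[:]
        v[0] += math.copysign(norm, x[0])  # sign choice avoids cancellation
        return v

    for k in range(n):
        # Left reflector: zero out column k below the diagonal.
        v = householder([b[i][k] for i in range(k, m)])
        if v is not None:
            vv = sum(t * t for t in v)
            for j in range(k, n):
                f = 2.0 * sum(v[i - k] * b[i][j] for i in range(k, m)) / vv
                for i in range(k, m):
                    b[i][j] -= f * v[i - k]
        # Right reflector: zero out row k to the right of the superdiagonal.
        if k < n - 2:
            v = householder([b[k][j] for j in range(k + 1, n)])
            if v is not None:
                vv = sum(t * t for t in v)
                for i in range(k, m):
                    f = 2.0 * sum(v[j - k - 1] * b[i][j] for j in range(k + 1, n)) / vv
                    for j in range(k + 1, n):
                        b[i][j] -= f * v[j - k - 1]
    return b

# Small made-up example matrix.
A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 10.0], [2.0, 1.0, 0.0]]
for row in bidiagonalize(A):
    print([round(x, 6) for x in row])
```

Both the QR-iteration path and the divide-and-conquer path then operate on this bidiagonal matrix, which is why the two drivers can share the reduction code.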


- kyle
- Administrator
**Posts:** 301 **Joined:** Fri Jun 12, 2009 7:47 pm

### Re: matlab svd (cula_link) fails (sgesdd vs sgesvd)

Also, the scaling of SVD is fairly poor until larger (over 4k) sizes are reached. Below this, memory-bound routines like matrix-vector products dominate the runtime. Speedups at larger sizes are obtained because compute-bound routines like matrix-matrix products begin to dominate the total runtime.
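One way to see the memory-bound versus compute-bound distinction is a rough flops-per-byte estimate. The Python sketch below assumes single precision (4 bytes per element) and an idealized model in which each matrix or vector passes through memory exactly once. Real kernels move more data, so the numbers are upper bounds, not measurements, and the function names are made up for this example.

```python
BYTES_PER_ELEMENT = 4  # single precision, as in the sgesvd/sgesdd calls in this thread

def gemv_intensity(n):
    """Flops per byte for y = A*x with an n x n matrix (idealized traffic)."""
    flops = 2 * n * n  # about one multiply and one add per matrix entry
    bytes_moved = BYTES_PER_ELEMENT * (n * n + 2 * n)  # read A and x, write y
    return flops / bytes_moved

def gemm_intensity(n):
    """Flops per byte for C = A*B with n x n matrices (idealized traffic)."""
    flops = 2 * n ** 3
    bytes_moved = BYTES_PER_ELEMENT * 3 * n * n  # read A and B, write C
    return flops / bytes_moved

for n in (500, 2000, 4000, 8000):
    print(n, round(gemv_intensity(n), 3), round(gemm_intensity(n), 1))
```

Under this model the matrix-vector ratio stays just below 0.5 flops per byte at every size, while the matrix-matrix ratio grows as n/6, so only the latter can keep a GPU's arithmetic units busy as the problem grows.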

- kyle
- Administrator
**Posts:** 301 **Joined:** Fri Jun 12, 2009 7:47 pm

### Re: matlab svd (cula_link) fails (sgesdd vs sgesvd)

It's an interesting design decision from Mathworks, since they have made a change which will result in the users observing different behavior (and possibly different quality of result) from version to version. I'd have preferred a flag or a different routine name, myself, but I guess the routine is called "SVD" not "SVD via GESVD." Interesting find, thanks for writing in.

- john
- Administrator
**Posts:** 587 **Joined:** Thu Jul 23, 2009 2:31 pm

### Re: matlab svd (cula_link) fails (sgesdd vs sgesvd)

Hi Kyle, I am wondering whether the CULA team has any plans to provide an SVDS function like MATLAB's, which performs an SVD but computes only a selected number of singular values. So far I have not found one, and it is really important for us to use this instead of a complete SVD.

Thank you in advance.

---JiFeng


- yujif
**Posts:** 3 **Joined:** Wed Nov 02, 2011 5:08 am

### Re: matlab svd (cula_link) fails (sgesdd vs sgesvd)

To my knowledge, there is no LAPACK equivalent of this. In my testing, at least in MATLAB, it's faster to run the full SVD and then cut the U, S, V matrices down to the number of values that you want than it is to run the SVDS command.

- john
- Administrator
**Posts:** 587 **Joined:** Thu Jul 23, 2009 2:31 pm

### Re: matlab svd (cula_link) fails (sgesdd vs sgesvd)

Hi john, thank you for your concern and quick reply. You are probably talking about an SVD for singular values only, or a small matrix size (say, (100,100)); yes, in that case the full SVD is quicker than SVDS. But if you also need U and V and the matrix size is a little larger, then the full SVD becomes much slower. That's why we really need SVDS. What do you think?

- yujif
**Posts:** 3 **Joined:** Wed Nov 02, 2011 5:08 am

### Re: matlab svd (cula_link) fails (sgesdd vs sgesvd)

SVDS isn't a LAPACK routine under the hood. I believe it calls an iterative routine in a package called PROPACK. On the other hand, SVD calls the LAPACK routine xGESVD or xGESDD.

- kyle
- Administrator
**Posts:** 301 **Joined:** Fri Jun 12, 2009 7:47 pm

### Re: matlab svd (cula_link) fails (sgesdd vs sgesvd)

Yeah, I took a look at the SVDS routine. It calls the function EIGS (similar to EIG), which comes from the ARPACK library instead of LAPACK. I will look into it in more detail. Thanks for the replies.

- yujif
**Posts:** 3 **Joined:** Wed Nov 02, 2011 5:08 am

### Re: matlab svd (cula_link) fails (sgesdd vs sgesvd)

The ARPACK library uses a few LAPACK and BLAS3 calls under the hood. It might (or might not) gain speedups if you try the CULA Link Interface.

- john
- Administrator
**Posts:** 587 **Joined:** Thu Jul 23, 2009 2:31 pm
