## sgesv in 1.1 is slow...

### Re:sgesv in 1.1 is slow...

Kyle mostly covered the high points, but I wanted to add that it's good to see that your GPU is functional and getting reasonable numbers out of our gesv. I did want to point out that internally Matlab does use the same gesv as we show in our benchmark (Intel MKL) so we're pretty close apples to apples here. Next up we'll try to see if we can identify slowdown incompatibilities between our library and Matlab - there had been a suggestion I saw somewhere that allocating memory inside a mex routine without using the mex allocator can cause troubles, and we do end up allocating memory inside CULA. That might be what is causing it.

- john
- Administrator
**Posts:** 587 **Joined:** Thu Jul 23, 2009 2:31 pm

### Re:sgesv in 1.1 is slow...

I had a similar issue with GELS: it was slower than Matlab. On my 295, using only one of the cores, I now generally get a 10x speed increase over Matlab for full, non-square matrices. Since I do this a lot, the program could be about twice as fast again if I used both cores.

My code runs a mex file that takes the matrices from Matlab. But the mex file calls a completely decoupled function written in ANSI C and passes it pointers to the Matlab data. In that function, I allocate (with calloc; malloc would be fine, I'm sure, but calloc helps me debug) a space the same size as the Matlab data and then memmove the data from the Matlab space into the calloced space. Then I call GELS on that calloced space, free the memory for the left matrix, and return a pointer to the result space. So GELS never accesses memory allocated using mxCalloc or by Matlab itself.

I don't know enough about Matlab to know whether this procedure moves all the data outside the memory space of the Matlab process. Probably not. I guess I could look. In any case, CULA never operates on data allocated by Matlab or with mex functions.

Subsequently, I have a function, "mexify_data" or some such, that takes a block of memory and row/column dimensions, uses mxCalloc to allocate memory of the same size, and memmoves my results into it. Then, to keep Matlab utterly happy, I use mxCreateNumericArray and mxSetData to put the mxCalloced data into an mxArray of the appropriate dimensions, and free the mxCalloced temp.

It sounds convoluted, and problematic in that so much is allocated, freed, and moved, but it is actually fast, it works, and it also satisfies one of my design goals, which is to couple my C code only loosely to Matlab, because I want to be able to turn it into a shared library unrelated to Matlab.

Perhaps this approach of keeping the memory spaces separate will help you as well.
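
The copy-in step described in this post can be sketched in plain C roughly as follows. This is a minimal sketch, not the poster's actual code: `solve_fn`, `solve_on_private_copy`, and `fake_solver` are hypothetical names, and the solver is passed in as a function pointer so the example does not depend on the CULA or mex headers. In a real mex file the input pointer would come from Matlab and the callback would wrap the GELS call.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical solver callback: works in place on a rows x cols buffer
 * and returns 0 on success. A real version would wrap the library call. */
typedef int (*solve_fn)(float *data, size_t rows, size_t cols);

/* Copy caller-owned data (e.g. memory owned by Matlab) into a private
 * calloc'd buffer, run the solver on the copy, and return the copy.
 * The caller's buffer is never handed to the solver. Returns NULL on
 * allocation failure or solver error; the caller frees the result. */
float *solve_on_private_copy(const float *src, size_t rows, size_t cols,
                             solve_fn solve)
{
    size_t count = rows * cols;
    float *work;

    assert(src != NULL && solve != NULL);

    work = calloc(count, sizeof *work);
    if (work == NULL)
        return NULL;

    /* Duplicate the input so the solver only ever sees our own memory. */
    memmove(work, src, count * sizeof *work);

    if (solve(work, rows, cols) != 0) {
        free(work);
        return NULL;
    }
    return work;
}

/* Stand-in for the real solver, for illustration only: doubles every
 * element in place. */
int fake_solver(float *data, size_t rows, size_t cols)
{
    size_t i;
    for (i = 0; i < rows * cols; ++i)
        data[i] *= 2.0f;
    return 0;
}
```

Copying the result back into an mxCalloc'd buffer for Matlab would be the mirror image of this step. The extra copies cost memory bandwidth, but they keep the numerical code free of any Matlab dependency.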

- zatak
**Posts:** 1 **Joined:** Sat Jan 16, 2010 11:29 am

### Re:sgesv in 1.1 is slow...

The suggestion is interesting - perhaps Matlab's memory management is really the issue. I think I understand the suggestion of separating the code from the mex file as much as possible, but I'm not sure I would know how to do that (or willing to work through the revisions to my code...).

Here is a graph showing what I am getting. I posted the code and Matlab script that generate it earlier in this thread. It compares Matlab with the culaDeviceSgesv solution of A*X=B, with A NxN and B Nx5000. The matrices are filled using "randn". I tried the host-based culaSgesv as well, with the same result. I've verified that all the compute time is spent in the single call `status = culaDeviceSgesv(L,I,ga,L,ipiv,gb,L);` rather than in, e.g., host-device copies. I run the code with maxNumCompThreads(1); set at the top of the Matlab script so that it uses a single CPU (an AMD Phenom II X4 965 processor in this case), compared against a GTX260.

This result does not agree with the direct CULA benchmark test, of course.

- Boxed Cylon
**Posts:** 48 **Joined:** Fri Oct 16, 2009 8:57 pm

### Re:sgesv in 1.1 is slow...

Dear Zatak,

CULA's GELS is not faster than MKL when I run CULA's example benchmark, and I think it would be much slower if the data came from Matlab. But a 10x speedup from a mex file sounds great. What matrix sizes are you using?

I tried the same thing with GESV, but my Matlab crashes each time. Do you allocate memory twice for each array, once with mxMalloc and once with calloc, and then copy the first into the second?

Perhaps I didn't understand the approach well.

- cjest
**Posts:** 12 **Joined:** Wed Feb 10, 2010 3:01 pm

### Re:sgesv in 1.1 is slow...

Hello folks, I wanted to drop an update here to the community and hit a few different points.

First off, thanks everyone for pitching in with data and stories. We are trying to trace this down in our lab, and we are indeed starting to find some odd behaviors when we're integrating our DLL into Matlab (either directly or via mex.) We are very late into our 1.2 release cycle, but if we find anything curable then we will delay for a day or three and try to integrate it into the 1.2 version.

We have found what appears to be an external factor affecting our times. Oddly, when we instrumented a CULA DLL with timing functions, the CULA routines themselves reported the expected (i.e., fast) times, but tic/toc in Matlab then reported significantly longer times. Anyway, that is where we are, and we'll see what we can get done. Clearly this is an important topic to our users, so we are listening. I'm really hoping to come up with something that avoids the mexifying procedure, but at least we have something that appears to be a potential workaround.

Cjest - what is your GPU? I would normally expect the benchmark to return a speedup unless the GPU is weak or maybe there is a conflict on the system. I ran off a single matrix size and here are my results:

`C:\Program Files\CULA\examples\benchmark>benchmark_ sgels 7000 7001 5`

Initializing CULA...

Initializing MKL...

Benchmarking the following functions: SGELS

-- SGELS Benchmark --

| Size | CULA (s) | MKL (s) | Speedup |
|------|----------|---------|---------|
| 7000 | 4.50 | 8.95 | 1.9880 |

- john
- Administrator
**Posts:** 587 **Joined:** Thu Jul 23, 2009 2:31 pm

### Re:sgesv in 1.1 is slow...

Hi,

GPU: GeForce GTX 285

CPU: Intel Xeon X5450 3.00 GHz

My benchmark result is (single precision):

-- SGELS Benchmark --

| Size | CULA (s) | MKL (s) | Speedup |
|------|----------|---------|---------|
| 7168 | 15.83 | 6.03 | 0.3808 |

Since I work with Sgesv:

-- SGESV Benchmark --

| Size | CULA (s) | MKL (s) | Speedup |
|------|----------|---------|---------|
| 7168 | 1.69 | 3.31 | 1.9657 |

- cjest
**Posts:** 12 **Joined:** Wed Feb 10, 2010 3:01 pm

### Re:sgesv in 1.1 is slow...

Hi cjest,

Could you report your benchmarking numbers for a wider range of inputs? Say 4096-8192?

Dan

- dan
- Administrator
**Posts:** 61 **Joined:** Thu Jul 23, 2009 2:29 pm

### Re:sgesv in 1.1 is slow...

Fresh benchmark

-- SGEQRF Benchmark --

| Size | CULA (s) | MKL (s) | Speedup |
|------|----------|---------|---------|
| 4096 | 1.98 | 1.57 | 0.7901 |
| 5120 | 3.38 | 1.80 | 0.5331 |
| 6144 | 2.14 | 3.05 | 1.4222 |
| 7168 | 2.91 | 4.69 | 1.6142 |
| 8192 | 3.80 | 6.85 | 1.8025 |

-- SGETRF Benchmark --

| Size | CULA (s) | MKL (s) | Speedup |
|------|----------|---------|---------|
| 4096 | 0.35 | 0.90 | 2.5364 |
| 5120 | 0.68 | 1.20 | 1.7602 |
| 6144 | 0.90 | 2.03 | 2.2639 |
| 7168 | 1.33 | 2.92 | 2.1864 |
| 8192 | 1.84 | 4.19 | 2.2754 |

-- SGELS Benchmark --

| Size | CULA (s) | MKL (s) | Speedup |
|------|----------|---------|---------|
| 4096 | 1.66 | 1.28 | 0.7712 |
| 5120 | 1.87 | 1.93 | 1.0302 |
| 6144 | 2.80 | 3.12 | 1.1146 |
| 7168 | 4.03 | 4.92 | 1.2208 |
| 8192 | 5.06 | 7.06 | 1.3962 |

-- SGGLSE Benchmark --

| Size | CULA (s) | MKL (s) | Speedup |
|------|----------|---------|---------|
| 4096 | 1.38 | 6.44 | 4.6689 |
| 5120 | 2.68 | 10.41 | 3.8888 |
| 6144 | 3.30 | 15.22 | 4.6191 |
| 7168 | 4.78 | 21.25 | 4.4422 |
| 8192 | 5.95 | 28.45 | 4.7815 |

-- SGESV Benchmark --

| Size | CULA (s) | MKL (s) | Speedup |
|------|----------|---------|---------|
| 4096 | 0.48 | 1.02 | 2.1041 |
| 5120 | 0.80 | 1.26 | 1.5758 |
| 6144 | 1.19 | 2.10 | 1.7742 |
| 7168 | 1.69 | 3.05 | 1.8020 |
| 8192 | 2.29 | 4.31 | 1.8807 |

-- SGESVD Benchmark --

| Size | CULA (s) | MKL (s) | Speedup |
|------|----------|---------|---------|
| 4096 | 37.30 | 144.21 | 3.8658 |
| 5120 | 60.66 | 270.37 | 4.4569 |

- cjest
**Posts:** 12 **Joined:** Wed Feb 10, 2010 3:01 pm

### Re:sgesv in 1.1 is slow...

@ Boxed Cylon

Can you please explain how to run the CUDA profiler?

I'm using kubuntu 9.03, CUDA 2.3 (with driver, toolkit and SDK), and CULA rhel 1.1b.

I figured out how to start it (by double-clicking it).

By default, the session settings expect the program to be executed to be an .exe file, but Linux has no concept of .exe.

I'm able to run my CUDA sample programs using `./<program name>` from the Konsole.

But how do I make that work in the CUDA profiler? In other words, how can I run my Linux executable in the CUDA profiler?

Please help.

Thank you.

- jpmig313
**Posts:** 7 **Joined:** Sat Dec 26, 2009 6:04 am

### Re:sgesv in 1.1 is slow...

jpmig313 wrote:

> @ Boxed Cylon
>
> Can you please explain how to run the CUDA profiler?
>
> I'm using kubuntu 9.03, CUDA 2.3 (with driver, toolkit and SDK), and CULA rhel 1.1b.
>
> I figured out how to start it (by double-clicking it).
>
> By default, the session settings expect the program to be executed to be an .exe file, but Linux has no concept of .exe.
>
> I'm able to run my CUDA sample programs using `./<program name>` from the Konsole.
>
> But how do I make that work in the CUDA profiler? In other words, how can I run my Linux executable in the CUDA profiler?
>
> Please help.
>
> Thank you.

The Matlab tutorial has a discussion of how to run the profiler in the context of Matlab:

http://forums.nvidia.com/index.php?showtopic=70731

It's straightforward, but a little tricky...

- Boxed Cylon
**Posts:** 48 **Joined:** Fri Oct 16, 2009 8:57 pm

### Re:sgesv in 1.2 is also slow...

Hi:

I have just upgraded my CULA Premium to 1.2. When used with MATLAB, sgesv is still slow (the same as in 1.1).

I read that version 1.2 has a new, faster sgetrf routine.

My question is whether this new routine is used by sgesv. I ask because I assume sgesv is built on sgetrf; in fact, sgesv = sgetrf + sgetrs.
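
The relation sgesv = sgetrf + sgetrs can be illustrated with a small, unoptimized sketch in plain C. This is not CULA's or LAPACK's implementation: `lu_factor`, `lu_solve`, and `simple_gesv` are hypothetical names, and the code uses row-major storage, 0-based pivot indices, and a single right-hand side for brevity. It only shows the structure: factor once with partial pivoting, then apply the row swaps and do two triangular solves.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value helper, to avoid depending on the math library. */
static float absf(float x) { return x < 0.0f ? -x : x; }

/* Factor step (the role sgetrf plays): in-place LU with partial pivoting
 * on a row-major n x n matrix. piv[k] records the row swapped with row k.
 * Returns 0 on success, -1 if a zero pivot is found (singular matrix). */
int lu_factor(float *a, size_t n, size_t *piv)
{
    size_t i, j, k;
    for (k = 0; k < n; ++k) {
        size_t p = k;
        for (i = k + 1; i < n; ++i)       /* find largest entry in column k */
            if (absf(a[i * n + k]) > absf(a[p * n + k]))
                p = i;
        if (a[p * n + k] == 0.0f)
            return -1;
        piv[k] = p;
        if (p != k) {                     /* swap full rows k and p */
            for (j = 0; j < n; ++j) {
                float t = a[k * n + j];
                a[k * n + j] = a[p * n + j];
                a[p * n + j] = t;
            }
        }
        for (i = k + 1; i < n; ++i) {     /* eliminate below the pivot */
            a[i * n + k] /= a[k * n + k];
            for (j = k + 1; j < n; ++j)
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
        }
    }
    return 0;
}

/* Solve step (the role sgetrs plays): apply the row swaps to b, then
 * forward-substitute with unit-lower L and back-substitute with U. */
void lu_solve(const float *lu, size_t n, const size_t *piv, float *b)
{
    size_t i, j, k;
    for (k = 0; k < n; ++k) {
        float t = b[k];
        b[k] = b[piv[k]];
        b[piv[k]] = t;
    }
    for (i = 1; i < n; ++i)
        for (j = 0; j < i; ++j)
            b[i] -= lu[i * n + j] * b[j];
    for (i = n; i-- > 0;) {
        for (j = i + 1; j < n; ++j)
            b[i] -= lu[i * n + j] * b[j];
        b[i] /= lu[i * n + i];
    }
}

/* Driver (the role sgesv plays): factor, then solve. On success b holds x. */
int simple_gesv(float *a, size_t n, size_t *piv, float *b)
{
    if (lu_factor(a, n, piv) != 0)
        return -1;
    lu_solve(a, n, piv, b);
    return 0;
}
```

Because the driver only chains the two steps, any speedup in the factorization carries over to the combined solve without the driver itself changing.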

Thanks

jpeinado

- jpeinado
**Posts:** 37 **Joined:** Mon Sep 14, 2009 10:48 am

### Re:sgesv in 1.2 is also slow...

Yes, the upgraded GETRF will also result in a speedup in GESV (we actually debated whether to note both routines in the patch notes, but in the end only mentioned GETRF because GESV itself received no changes.)

In your case, the Matlab slowdown is probably making it difficult to see the other improvements since the Matlab times are dramatically longer.

We are still examining this one, but it's been strangely elusive to nail down so far. Some versions of Matlab aren't showing the slowdown, but only on some of the machines we have tested. It's very frustrating!

- john
- Administrator
**Posts:** 587 **Joined:** Thu Jul 23, 2009 2:31 pm

### Re:sgesv in 1.2 is also slow...

To get a speedup from GESV, A in Ax = b must be at least 1500x1500. I've tested my mex function in a for loop to measure the data overhead cost, and I still get a non-trivial performance gain with matrices bigger than 1500.

- cjest
**Posts:** 12 **Joined:** Wed Feb 10, 2010 3:01 pm

### Re:sgesv in 1.2 is also slow...

john wrote:

> Yes, the upgraded GETRF will also result in a speedup in GESV (we actually debated whether to note both routines in the patch notes, but in the end only mentioned GETRF because GESV itself received no changes.)

OK

john wrote:

> In your case, the Matlab slowdown is probably making it difficult to see the other improvements since the Matlab times are dramatically longer.

No. I am almost sure that this is not the problem. I tested other packages for solving linear systems (like CULAPACK, which is based entirely on CUBLAS, from UJI University, Spain), and they work OK with MATLAB.

In fact, I ran the following tests:

- CULAPACK (sgetrf) + CUBLAS (triangular systems) = OK
- CULAPACK (sgetrf) + CULA (triangular systems) = BAD
- CULA (sgesv) = BAD
- CULA (sgetrf) + CUBLAS (triangular systems) = BAD

It seems that the CULA library has some problem with MATLAB. If you use CUBLAS routines, everything works OK. I don't know how the CULA routines are implemented, but there is a problem between CULA and MATLAB.

On the other hand, there are other algorithms, called hybrid, but they are impossible to run with MATLAB (that is a MATLAB problem). Anyway, the CULA routines (sgetrf and sgetrs) are not hybrid.

john wrote:

> We are still examining this one, but it's been strangely elusive to nail down so far. Some versions of Matlab aren't showing the slowdown, but only on some of the machines we have tested. It's very frustrating!

Could you be more specific about the versions and machines? Anyway, thank you for testing all of this.

cjest wrote:

> To get a speedup from GESV, A in Ax = b must be at least 1500x1500. I've tested my mex function in a for loop to measure the data overhead cost, and I still get a non-trivial performance gain with matrices bigger than 1500.

Yes, I get the same results as you, but not with CULA. By the way, could you please publish your mex file?

Thanks

- jpeinado
**Posts:** 37 **Joined:** Mon Sep 14, 2009 10:48 am

### Re:sgesv in 1.1 is slow...

I'd like to add some data points to the mix.

Windows Vista 64, CULA 1.2:

- Matlab 2008a (7.6): Slow
- Matlab 2009a (7.8): Slow
- Matlab 2009b (7.9): Fast

Ubuntu 9.10 32-bit, CULA 1.2:

- Matlab 2009b (7.9): Fast

From these results above, you can see that for at least 2 systems we've seen no slowdown in Matlab 2009b (7.9). Our analysis has shown that the CUDA runtime's initialization time appears to be extreme in versions earlier than 2009b. There is a small initialization time in 2009b (about 0.4 seconds) but this is to be expected and matches results we've found outside of the Matlab environment.
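
One way to check whether a one-time initialization cost is inflating a measurement is to time the first call separately from later calls. The sketch below is a generic illustration in plain C, not the instrumentation used in our lab: `time_calls`, `call_timing`, `work_fn`, and `fake_work` are hypothetical names. It also assumes `clock()` is good enough for illustration; `clock()` reports processor time rather than wall-clock time, so a real measurement of GPU code would need a wall-clock timer instead.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical workload signature: returns 0 on success. */
typedef int (*work_fn)(void *ctx);

typedef struct {
    double first_call_s;  /* first call, including any one-time setup */
    double later_mean_s;  /* mean time of the remaining calls */
} call_timing;

/* Time a single call using clock(), which reports processor time. */
static double time_one(work_fn work, void *ctx)
{
    clock_t start = clock();
    int rc = work(ctx);
    clock_t stop = clock();
    assert(rc == 0);
    (void)rc;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Run the workload `repeats` times (repeats >= 2) and report the first
 * call separately from the average of the rest. */
call_timing time_calls(work_fn work, void *ctx, size_t repeats)
{
    call_timing t;
    double sum = 0.0;
    size_t i;

    assert(work != NULL && repeats >= 2);

    t.first_call_s = time_one(work, ctx);
    for (i = 1; i < repeats; ++i)
        sum += time_one(work, ctx);
    t.later_mean_s = sum / (double)(repeats - 1);
    return t;
}

/* Stand-in workload for illustration: counts how often it was called. */
int fake_work(void *ctx)
{
    int *calls = ctx;
    ++*calls;
    return 0;
}
```

If the first call is far slower than the mean of the later ones, the difference is setup cost rather than solver time.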

Unfortunately, what we've found indicates that we can't support any version earlier than 2009b, as it appears that The MathWorks has resolved (in some instances, at least) the problems that led to slow execution. With this in mind, we're very interested in hearing from those of you who are using 2009b (7.9) and still see the slowdown. If the reports are consistent, we'll try to match the environment and reproduce them. Obviously, installing many different versions of Matlab on many systems is cumbersome, so the more help we can get from our users, the easier it will be for us to solve this problem.

When reporting your results, make 100% sure that you're using the version of Matlab that you're reporting results on here (symbolic links might point to older versions and you may not realize it). Also, please only report your results on CULA 1.2 as this will best allow us to debug this problem.

Thank you to everyone who has put work into this already, we appreciate your contribution very much.

Dan

- dan
- Administrator
**Posts:** 61 **Joined:** Thu Jul 23, 2009 2:29 pm
