sgesv in 1.1 is slow...

General CULA Dense (LAPACK & BLAS) support and troubleshooting. Use this forum if you are having a general problem or have encountered a bug.

Re:sgesv in 1.1 is slow...

Postby john » Sat Feb 13, 2010 12:27 pm

Kyle mostly covered the high points, but I wanted to add that it's good to see your GPU is functional and getting reasonable numbers out of our gesv. I also want to point out that internally Matlab uses the same gesv we compare against in our benchmark (Intel MKL), so this is a fairly close apples-to-apples comparison. Next, we'll try to identify incompatibilities between our library and Matlab that could cause the slowdown. I saw a suggestion somewhere that allocating memory inside a mex routine without using the mex allocator can cause trouble, and we do allocate memory inside CULA. That might be the cause.
john
Administrator
 
Posts: 587
Joined: Thu Jul 23, 2009 2:31 pm

Re:sgesv in 1.1 is slow...

Postby zatak » Mon Feb 15, 2010 7:48 am

I had a similar issue with GELS: it was slower than Matlab. On my GTX 295, using only one of its two GPUs, I now generally get a 10x speedup over Matlab for full, non-square matrices. Since I do this a lot, the program as a whole could be twice as fast again if I used both GPUs.

My code runs a mex file that takes the matrices from Matlab. The mex file then calls a completely decoupled function written in ANSI C and passes it pointers to the Matlab data. In that function, I allocate (with calloc; malloc would be fine, I'm sure, but calloc helps me debug) a buffer the same size as the Matlab data and memmove the data from the Matlab buffer into the calloc'd buffer. Then I call GELS on that calloc'd buffer, free the memory for the left-hand matrix, and return a pointer to the result buffer. So GELS never accesses memory allocated with mxCalloc or by Matlab itself.
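A minimal C sketch of the copy-in pattern described above, assuming column-major float data with m >= n (so B's m-by-nrhs buffer can also hold the n-by-nrhs solution, as a LAPACK-style GELS expects). The names solve_on_private_copy, inplace_solver_fn and diagonal_solver are hypothetical; diagonal_solver is only a stand-in for the real CULA call so the sketch runs without a GPU.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical signature for the library call (e.g. a wrapper around a CULA
   GELS call). It works in place on buffers this module owns; 0 means success. */
typedef int (*inplace_solver_fn)(int m, int n, int nrhs, float *a, float *b);

/* Stand-in solver for illustration only: solves A X = B when A is diagonal. */
int diagonal_solver(int m, int n, int nrhs, float *a, float *b)
{
    if (m != n)
        return -1;
    for (int j = 0; j < nrhs; ++j)
        for (int i = 0; i < n; ++i) {
            float d = a[(size_t)i * (size_t)m + (size_t)i]; /* A(i, i) */
            if (d == 0.0f)
                return 1;
            b[(size_t)j * (size_t)m + (size_t)i] /= d;      /* B(i, j) */
        }
    return 0;
}

/* Copies caller-owned (e.g. Matlab) data into private calloc'd buffers, runs
   the solver on the copies, frees the copy of A, and returns the result
   buffer, which the caller must free(). Returns NULL on any failure. */
float *solve_on_private_copy(const float *a_src, const float *b_src,
                             int m, int n, int nrhs, inplace_solver_fn solve)
{
    assert(m >= n && n > 0 && nrhs > 0 && solve != NULL);
    size_t a_len = (size_t)m * (size_t)n;
    size_t b_len = (size_t)m * (size_t)nrhs;
    float *a = calloc(a_len, sizeof *a);
    float *b = calloc(b_len, sizeof *b);
    if (a == NULL || b == NULL) {
        free(a);
        free(b);
        return NULL;
    }
    memmove(a, a_src, a_len * sizeof *a); /* the solver never sees a_src */
    memmove(b, b_src, b_len * sizeof *b);
    int status = solve(m, n, nrhs, a, b);
    free(a);                              /* left-hand matrix no longer needed */
    if (status != 0) {
        free(b);
        return NULL;
    }
    return b;
}
```

In a mex file, a_src and b_src would be the pointers obtained with mxGetData from the input mxArrays, and the returned buffer would then be copied into an mxArray before returning to Matlab.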

I don't know enough about Matlab to know whether this procedure moves all the data outside the memory space of the Matlab process. Probably not; I guess I could check. In any case, CULA never operates on data allocated by Matlab or by mex functions.

Subsequently, I have a function, "mexify_data" or something similar, that takes a block of memory and row/column dimensions, uses mxCalloc to allocate memory of the same size, and memmoves my results into it. Then, to keep Matlab completely happy, it calls mxCreateNumericArray and mxSetData to put the mxCalloc'd data into an mxArray of the appropriate dimensions, and frees the calloc'd temporary that held the results.

It sounds convoluted, and potentially problematic since so much is allocated, freed and moved, but it is actually fast, it works, and it satisfies one of my design goals: loosely coupling my C code to Matlab, because I want to be able to turn it into a shared library unrelated to Matlab.

Perhaps this approach of keeping separate memory spaces will help you as well.
zatak
 
Posts: 1
Joined: Sat Jan 16, 2010 11:29 am

Re:sgesv in 1.1 is slow...

Postby Boxed Cylon » Tue Feb 16, 2010 12:38 am

The suggestion is interesting; perhaps Matlab's memory management really is the issue. I think I understand the suggestion of separating the code from the mex file as much as possible, but I'm not sure I would know how to do that (or be willing to work through the revisions to my code...).

Here is a graph showing what I am getting. I posted the code and the Matlab script that generate it earlier in this thread. It compares Matlab and culaDeviceSgesv for solving A*X=B with A NxN and B Nx5000; both matrices are filled using "randn". I also tried the host-based culaSgesv, with the same result. I've verified that all the compute time is spent in the single call to "status = culaDeviceSgesv(L,I,ga,L,ipiv,gb,L);" rather than in, e.g., host-device copies. I run the code with maxNumCompThreads(1); at the top of the Matlab script so that Matlab uses a single CPU core (an AMD Phenom II X4 965 in this case), compared against a GTX 260.

This result does not agree with the direct CULA benchmark test, of course.

Boxed Cylon
 
Posts: 48
Joined: Fri Oct 16, 2009 8:57 pm

Re:sgesv in 1.1 is slow...

Postby cjest » Tue Feb 16, 2010 1:15 am

Dear Zatak

CULA's GELS is not faster than MKL when I run CULA's example benchmark, and I think it would be much slower if the data came from Matlab. But a 10x speedup from a mex file sounds great. What matrix sizes are you using?

I've tried the same approach with GESV, but Matlab crashes every time. Do you allocate memory twice for each array, once with mxMalloc and again with calloc, and then copy the first into the second?

Perhaps I didn't understand the approach well.
cjest
 
Posts: 12
Joined: Wed Feb 10, 2010 3:01 pm

Re:sgesv in 1.1 is slow...

Postby john » Tue Feb 16, 2010 8:54 am

Hello folks, I wanted to drop an update here to the community and hit a few different points.

First off, thanks everyone for pitching in with data and stories. We are trying to trace this down in our lab, and we are indeed starting to find some odd behaviors when integrating our DLL into Matlab (either directly or via mex). We are very late in our 1.2 release cycle, but if we find anything curable, we will delay for a day or three and try to integrate it into the 1.2 version.

We have found what appears to be an external factor affecting our times: oddly, when we instrumented a CULA DLL with timing functions, the CULA routines themselves reported the expected (i.e. fast) times, but tic/toc in Matlab then reported significantly longer times. Anyway, that is where we are, and we'll see what we can get done. Clearly this is an important topic to our users, so we are listening to the needs. I'm really hoping to come up with something that avoids the mexifying procedure, but at least we have something that appears to be a potential workaround.

Cjest - what is your GPU? I would normally expect the benchmark to return a speedup unless the GPU is weak or maybe there is a conflict on the system. I ran off a single matrix size and here are my results:

C:\Program Files\CULA\examples\benchmark>benchmark_ sgels 7000 7001 5
Initializing CULA...
Initializing MKL...

Benchmarking the following functions:
-------------------------------------
             SGELS
-------------------------------------


     -- SGELS Benchmark  --

Size   CULA (s)    MKL (s)   Speedup
------ ---------- ---------- ---------
7000       4.50       8.95    1.9880

Re:sgesv in 1.1 is slow...

Postby cjest » Wed Feb 17, 2010 12:49 am

Hi,
GPU: GeForce GTX 285
CPU: Intel Xeon X5450 3.00 GHz

My benchmark result (single precision):

-- SGELS Benchmark --

Size   CULA (s)    MKL (s)   Speedup
------ ---------- ---------- ---------
7168      15.83       6.03    0.3808

Since I work with SGESV:

-- SGESV Benchmark --

Size   CULA (s)    MKL (s)   Speedup
------ ---------- ---------- ---------
7168       1.69       3.31    1.9657

Re:sgesv in 1.1 is slow...

Postby dan » Wed Feb 17, 2010 9:32 am

Hi cjest,

Could you report your benchmarking numbers for a wider range of inputs? Say 4096-8192?

Dan
dan
Administrator
 
Posts: 61
Joined: Thu Jul 23, 2009 2:29 pm

Re:sgesv in 1.1 is slow...

Postby cjest » Thu Feb 18, 2010 2:50 am

Fresh benchmark
-------------------------------------


-- SGEQRF Benchmark --

Size   CULA (s)    MKL (s)   Speedup
------ ---------- ---------- ---------
4096       1.98       1.57    0.7901
5120       3.38       1.80    0.5331
6144       2.14       3.05    1.4222
7168       2.91       4.69    1.6142
8192       3.80       6.85    1.8025

-- SGETRF Benchmark --

Size   CULA (s)    MKL (s)   Speedup
------ ---------- ---------- ---------
4096       0.35       0.90    2.5364
5120       0.68       1.20    1.7602
6144       0.90       2.03    2.2639
7168       1.33       2.92    2.1864
8192       1.84       4.19    2.2754

-- SGELS Benchmark --

Size   CULA (s)    MKL (s)   Speedup
------ ---------- ---------- ---------
4096       1.66       1.28    0.7712
5120       1.87       1.93    1.0302
6144       2.80       3.12    1.1146
7168       4.03       4.92    1.2208
8192       5.06       7.06    1.3962

-- SGGLSE Benchmark --

Size   CULA (s)    MKL (s)   Speedup
------ ---------- ---------- ---------
4096       1.38       6.44    4.6689
5120       2.68      10.41    3.8888
6144       3.30      15.22    4.6191
7168       4.78      21.25    4.4422
8192       5.95      28.45    4.7815

-- SGESV Benchmark --

Size   CULA (s)    MKL (s)   Speedup
------ ---------- ---------- ---------
4096       0.48       1.02    2.1041
5120       0.80       1.26    1.5758
6144       1.19       2.10    1.7742
7168       1.69       3.05    1.8020
8192       2.29       4.31    1.8807

-- SGESVD Benchmark --

Size   CULA (s)    MKL (s)   Speedup
------ ---------- ---------- ---------
4096      37.30     144.21    3.8658
5120      60.66     270.37    4.4569

Re:sgesv in 1.1 is slow...

Postby jpmig313 » Thu Feb 18, 2010 6:30 am

@ Boxed Cylon
Can you please explain how to run the CUDA profiler?
I'm using Kubuntu 9.03, CUDA 2.3 (with driver, toolkit and SDK), and CULA RHEL 1.1b.

I figured out how to start it (by double-clicking it).
By default, the session settings expect the program to be executed to be an .exe file, but Linux doesn't have the concept of .exe.

I'm able to run my CUDA sample programs using ./<program name> from Konsole.
But how do I make this work in the CUDA profiler? In other words, how can I run my Linux executable in the CUDA profiler?

Please help.

Thank you.
jpmig313
 
Posts: 7
Joined: Sat Dec 26, 2009 6:04 am

Re:sgesv in 1.1 is slow...

Postby Boxed Cylon » Thu Feb 18, 2010 3:21 pm

jpmig313 wrote:@ Boxed Cylon
can u please explain me how to run the CUDA profiler...
[...] how can i run my linux executable in CUDA profiler.


The Matlab tutorial discusses how to run the profiler in the context of Matlab:
http://forums.nvidia.com/index.php?showtopic=70731
It's straightforward, but a little tricky...

Re:sgesv in 1.2 is also slow...

Postby jpeinado » Fri Feb 26, 2010 2:25 am

Hi:

I have just upgraded my CULA Premium to 1.2. When used with MATLAB, sgesv is also slow (the same as 1.1).

I read that the new version 1.2 has a new, faster sgetrf routine.

My question is whether sgesv uses this new routine. I ask because I assume sgesv is based on sgetrf; in fact, sgesv = sgetrf + sgetrs.
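To illustrate the decomposition described above, here is a minimal, unblocked C sketch of GESV as GETRF followed by GETRS, in column-major storage like LAPACK. It is a textbook version for illustration only, not CULA's or MKL's implementation (those are blocked and heavily tuned); the function names are hypothetical, the pivots are 0-based (LAPACK's ipiv is 1-based), and unlike LAPACK's getrf it stops at the first exactly-zero pivot.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Offset of element (i, j) in a column-major matrix with leading dimension ld. */
#define IDX(i, j, ld) ((size_t)(j) * (size_t)(ld) + (size_t)(i))

/* GETRF-like step: LU with partial pivoting. On return, a holds L (unit
   diagonal, stored below) and U (on and above the diagonal); ipiv[k] is the
   row swapped with row k. Returns 0, or k + 1 if U(k, k) is exactly zero. */
int lu_factor(int n, float *a, int lda, int *ipiv)
{
    assert(n >= 0 && lda >= n);
    for (int k = 0; k < n; ++k) {
        int p = k; /* pivot = largest |A(i, k)| for i >= k */
        for (int i = k + 1; i < n; ++i)
            if (fabsf(a[IDX(i, k, lda)]) > fabsf(a[IDX(p, k, lda)]))
                p = i;
        ipiv[k] = p;
        if (a[IDX(p, k, lda)] == 0.0f)
            return k + 1;
        if (p != k) /* swap full rows k and p, as LAPACK does */
            for (int j = 0; j < n; ++j) {
                float t = a[IDX(k, j, lda)];
                a[IDX(k, j, lda)] = a[IDX(p, j, lda)];
                a[IDX(p, j, lda)] = t;
            }
        for (int i = k + 1; i < n; ++i) {
            a[IDX(i, k, lda)] /= a[IDX(k, k, lda)]; /* multiplier L(i, k) */
            for (int j = k + 1; j < n; ++j)
                a[IDX(i, j, lda)] -= a[IDX(i, k, lda)] * a[IDX(k, j, lda)];
        }
    }
    return 0;
}

/* GETRS-like step: reuse the factors for each right-hand side column of b. */
void lu_solve(int n, int nrhs, const float *a, int lda, const int *ipiv,
              float *b, int ldb)
{
    for (int c = 0; c < nrhs; ++c) {
        float *x = b + (size_t)c * (size_t)ldb;
        for (int k = 0; k < n; ++k) /* apply the row interchanges in order */
            if (ipiv[k] != k) {
                float t = x[k];
                x[k] = x[ipiv[k]];
                x[ipiv[k]] = t;
            }
        for (int i = 1; i < n; ++i) /* forward substitution with unit L */
            for (int j = 0; j < i; ++j)
                x[i] -= a[IDX(i, j, lda)] * x[j];
        for (int i = n - 1; i >= 0; --i) { /* back substitution with U */
            for (int j = i + 1; j < n; ++j)
                x[i] -= a[IDX(i, j, lda)] * x[j];
            x[i] /= a[IDX(i, i, lda)];
        }
    }
}

/* GESV = GETRF followed by GETRS; b is overwritten with the solution X. */
int lu_gesv(int n, int nrhs, float *a, int lda, int *ipiv, float *b, int ldb)
{
    int info = lu_factor(n, a, lda, ipiv);
    if (info == 0)
        lu_solve(n, nrhs, a, lda, ipiv, b, ldb);
    return info;
}
```

With few right-hand sides, the factorization's roughly (2/3)n^3 operations dominate the cost, which is why a faster GETRF alone can speed up GESV even when GESV's own code is unchanged.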



Thanks

jpeinado
jpeinado
 
Posts: 37
Joined: Mon Sep 14, 2009 10:48 am

Re:sgesv in 1.2 is also slow...

Postby john » Fri Feb 26, 2010 2:34 pm

Yes, the upgraded GETRF will also result in a speedup in GESV (we actually debated whether to note both routines in the patch notes, but in the end only mentioned GETRF because GESV itself received no changes.)

In your case, the Matlab slowdown is probably making it difficult to see the other improvements since the Matlab times are dramatically longer.

We are still examining this one, but it's been strangely elusive to nail down so far. Some versions of Matlab aren't showing the slowdown, but only on some of the machines we have tested. It's very frustrating!

Re:sgesv in 1.2 is also slow...

Postby cjest » Sun Feb 28, 2010 1:36 pm

To get a speedup from GESV, A in Ax = b must be at least 1500x1500. I've tested my mex function in a for loop to measure the data overhead cost, and I still get a non-trivial speedup with matrices larger than 1500.
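A rough operation count suggests why a size threshold like this exists. This is a back-of-the-envelope sketch using the standard leading-order LAPACK flop counts, assuming single-precision (4-byte) data, r right-hand sides, and that A and B are copied to the GPU and X is copied back (copying the factors back is ignored); it does not predict the 1500 figure, which is empirical.

```latex
\begin{aligned}
\text{flops}_{\text{gesv}}(n, r) &\approx \tfrac{2}{3}n^{3} + 2n^{2}r,\\
\text{bytes moved}(n, r) &\approx 4\,(n^{2} + 2nr),\\
\left.\frac{\text{flops}}{\text{byte}}\right|_{r=1} &\approx \frac{\tfrac{2}{3}n^{3}}{4n^{2}} = \frac{n}{6}.
\end{aligned}
```

Arithmetic grows like n^3 while the copies grow like n^2, so fixed per-call costs such as transfers and kernel launches are only amortized once n is large enough.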

Re:sgesv in 1.2 is also slow...

Postby jpeinado » Mon Mar 01, 2010 7:24 am

john wrote:Yes, the upgraded GETRF will also result in a speedup in GESV (we actually debated whether to note both routines in the patch notes, but in the end only mentioned GETRF because GESV itself received no changes.)


OK

john wrote:In your case, the Matlab slowdown is probably making it difficult to see the other improvements since the Matlab times are dramatically longer.


No. I am almost sure this is not the problem. I tested other packages for solving linear systems (such as CULAPACK, which is built entirely on CUBLAS, from UJI University, Spain) and they work fine with MATLAB.

In fact, I ran the following tests:

CULAPACK (sgetrf) + CUBLAS (triangular systems) = OK
CULAPACK (sgetrf) + CULA (triangular systems) = BAD


CULA (sgesv) = BAD

CULA (sgetrf) + CUBLAS (triangular systems) = BAD


It seems that the CULA library has some problem with MATLAB. If you use CUBLAS routines, everything works. I don't know how the CULA routines are implemented, but there is a problem between CULA and MATLAB.

On the other hand, there are other algorithms, called hybrid algorithms, that are impossible to run from MATLAB (this is a MATLAB problem). In any case, the CULA routines (sgetrf and sgetrs) are not hybrid.


john wrote: We are still examining this one, but it's been strangely elusive to nail down so far. Some versions of Matlab aren't showing the slowdown, but only on some of the machines we have tested. It's very frustrating!


Could you be more specific about the versions and machines? In any case, thank you for testing this problem.


cjest wrote:To get speedup from Gesv the size of A in Ax = b, must be at least 1500x1500. I've tested my mex function in a for loop, to test the data overhead cost, still get non-trivial performance usning matrices bigger than 1500.


Yes, I get the same results as you, but not with CULA. By the way, could you please post your mex file?

Thanks

Re:sgesv in 1.1 is slow...

Postby dan » Tue Mar 02, 2010 10:21 am

I'd like to add some data points to the mix.

Windows Vista 64
CULA 1.2
Matlab 2008a (7.6): Slow
Matlab 2009a (7.8): Slow
Matlab 2009b (7.9): Fast

Ubuntu 9.10 32-bit
CULA 1.2
Matlab 2009b (7.9): Fast

From the results above, you can see that on at least two systems we've seen no slowdown in Matlab 2009b (7.9). Our analysis has shown that the CUDA runtime's initialization time appears to be extremely long in versions earlier than 2009b. There is a small initialization time in 2009b (about 0.4 seconds), but this is expected and matches what we've measured outside of the Matlab environment.
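To check whether a slowdown is one-time initialization rather than the solve itself, one can time the first call separately from later calls. A minimal sketch in C, assuming a POSIX system (clock_gettime with CLOCK_MONOTONIC); time_first_vs_steady and count_calls are hypothetical names, and count_calls merely stands in for a real CULA call such as culaDeviceSgesv.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

typedef void (*workload_fn)(void *ctx);

/* Stand-in workload for illustration: increments an int counter. */
void count_calls(void *ctx) { ++*(int *)ctx; }

/* Monotonic wall-clock time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Times the first call on its own, then the mean of the next `reps` calls.
   A first call much slower than the steady state points at one-time setup
   (for example, lazy CUDA runtime/context initialization) rather than the
   computation itself. */
void time_first_vs_steady(workload_fn work, void *ctx, int reps,
                          double *first_s, double *steady_s)
{
    assert(work != NULL && reps > 0);
    double t0 = now_seconds();
    work(ctx);
    *first_s = now_seconds() - t0;

    t0 = now_seconds();
    for (int r = 0; r < reps; ++r)
        work(ctx);
    *steady_s = (now_seconds() - t0) / reps;
}
```

Comparing the two numbers inside and outside Matlab is one way to separate initialization overhead from the cost of the solve.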

Unfortunately, the information we've found indicates that we can't support any version earlier than 2009b, as it appears that The MathWorks has resolved (in some instances, at least) the problems that led to slow execution. With this in mind, we're very interested in hearing from those of you who are using 2009b (7.9) and are still seeing the slowdown. If the reports are consistent, we'll try to match the environment and see if we can reproduce what users are reporting. Obviously, installing many different versions of Matlab on many systems is cumbersome, so the more help we get from our users, the easier it will be for us to solve this problem.

When reporting your results, make 100% sure that you're using the version of Matlab that you're reporting results on here (symbolic links might point to older versions and you may not realize it). Also, please only report your results on CULA 1.2 as this will best allow us to debug this problem.

Thank you to everyone who has put work into this already, we appreciate your contribution very much.

Dan
