
compiling cula benchmark

Posted: Tue Nov 30, 2010 10:33 pm
by marconioz
Hi guys, a couple of questions. Question one is the really crucial one...

1) I've been searching the internet for this compiler message for a couple of days, to no avail:

...compiler/lib/intel64/ undefined reference to `pthread_atfork'

Although it seems half the world has this problem, their solutions didn't work for me.

I get this on both the benchmark and bridge examples. I lost my temper and compiled a version of the benchmark with sequential MKL just for the hell of it.

2) So, yesterday I ran the benchmark_ binary and didn't get much speedup (0.5 to 1.2) in most routines, apart from SVD. That one gave me a 140x speedup, and then the computer ran out of memory at the next size up. Is this speedup expected? One detail: I ran this on an 8-core machine that was heavily in use (6 out of 8 cores busy). So benchmark_ (which I assume uses threaded MKL) most likely lost out... I figure.

3) Now I am running the sequential MKL build, and the benchmark gives speedups in the range 0.5 to 1.9, with SVD at about 8 to 9.5 times. What would be the expected speedup for this card?

Using a GeForce GTX 460, 1 GB, 256-bit.
The computer is a 2 x 4-core Xeon X5472 @ 3.00 GHz, with only 4 GB of RAM.

4) Does using OpenGL (such as Compiz) affect the performance of the card? (I know I can answer that easily myself, but since I am here already I am asking in case someone has tried it... :) )

Re: compiling cula benchmark

Posted: Wed Dec 01, 2010 6:07 am
by kyle
1) Linking against MKL can certainly be a pain. That's why we've included the pre-compiled benchmark :) (And yes, it is the multi-threaded MKL.)

2 & 3) Well, to put some of your data into perspective, you are comparing a very high-end CPU system to an entry-level CUDA system. I'd imagine the maximum speedup you'd see there would only be about 2x before you use up all the extra memory on your GPU, which is also driving the main display of your computer. This might jump to about 3x in the next CULA release, which has a bunch of Fermi optimizations.

I'm not really sure why the SVD speedup is so high, to be honest. Under the hood it's using the same core components as all the other tests. Also, CULA does offload some computation to the CPU. If the CPU is heavily loaded, this might become a bottleneck.
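
For a rough sense of how quickly a 1 GB card fills up on the SVD test, here is a back-of-the-envelope sketch in Python. It assumes single precision and that the input matrix plus full U and V^T are all held as dense n x n arrays; what CULA actually allocates internally, including workspace, is not known here. The size 7168 is also an assumption about what "next size up" means, based on the sizes stepping by 1024.

```python
BYTES_PER_FLOAT32 = 4  # SGESVD works in single precision


def svd_matrix_bytes(n):
    """Bytes for A, U and V^T as dense n x n float32 matrices.

    Assumption: all three are stored in full. Workspace and anything
    the display is using on the card are NOT included.
    """
    return 3 * n * n * BYTES_PER_FLOAT32


# 7168 is a guessed "next size up" (hypothetical, sizes step by 1024).
for n in (4096, 5120, 6144, 7168):
    mib = svd_matrix_bytes(n) / 2**20
    print(f"n = {n}: about {mib:.0f} MiB before workspace")
```

Under those assumptions the three matrices alone go from 192 MiB at n = 4096 to 588 MiB at n = 7168, which leaves little headroom on a 1 GB card that is also driving a display.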

4) We've seen up to 25% performance gains from running on machines with no video output. If your video card is doing graphics it's certainly robbing some computation power and memory from CUDA programs.

Hope these help!

Re: compiling cula benchmark

Posted: Wed Dec 01, 2010 3:11 pm
by marconioz
Hi Kyle,

thanks for your reply.

I repeated the tests and the pattern remains. I still can't compile the threaded benchmark, so the only comparison I can make is between the sequential benchmark and the benchmark_ binary. Not sure if there is something weird with the binaries though, as I again see a huge speedup in the SGESVD test:

-- SGESVD Benchmark --

  Size    CULA (s)     MKL (s)     Speedup
------  ----------  ----------  ----------
  4096       52.95     4954.23     93.5680
  5120       82.72     9691.47    117.1533
  6144      118.10    15860.12    134.2883

CULA Error: Insufficient memory to complete this operation

and the speedup with the serial benchmark is about 8x to 9.5x (still a lot compared with the 0.5 to 2.0 of the others). So without compiling a threaded version locally, I can't rely on this CULA benchmark for my future applications... :(

The point of buying an entry-level card is to make sure things work first, before investing in a CUDA cluster with thousands of Tesla 2070s :) perhaps in the year 2070...
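
As a sanity check on the SGESVD numbers, the speedup column should simply be MKL time divided by CULA time. A minimal Python sketch that recomputes it from the timings posted above (the variable and function names are made up for illustration):

```python
# (size, CULA seconds, MKL seconds, reported speedup), copied from the table
rows = [
    (4096, 52.95, 4954.23, 93.5680),
    (5120, 82.72, 9691.47, 117.1533),
    (6144, 118.10, 15860.12, 134.2883),
]


def speedup(cula_s, mkl_s):
    """Speedup of CULA over MKL: how many times longer MKL took."""
    return mkl_s / cula_s


recomputed = [(size, speedup(c, m), rep) for size, c, m, rep in rows]
for size, ratio, reported in recomputed:
    print(f"{size:6d}  recomputed {ratio:9.4f}  reported {reported:9.4f}")
```

The recomputed ratios agree with the reported column to within about 0.01, which is what you would expect if the benchmark divides the unrounded timings. So the arithmetic is consistent, and the oddity is in the MKL timings themselves rather than in how the speedup is derived.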

When the other processes finish I will try this benchmark again, and see if things change...

Re: compiling cula benchmark

Posted: Fri Dec 10, 2010 10:33 am
by john
Hi - were you able to run that second benchmark?

One note on the entry-level card purchase: I would recommend purchasing one copy of the card you intend to build the cluster around and testing on that. The balance of CPU power to GPU power is pretty vital when benchmarking a single node.