compiling cula benchmark

Post by marconioz » Tue Nov 30, 2010 10:33 pm

Hi guys, a couple of questions, question one is the really crucial one...

1)
I've been searching the internet for this error message for a couple of days, to no avail:

...compiler/lib/intel64/libiomp5.so: undefined reference to `pthread_atfork'

Although it seems half the world has this trouble, their solutions didn't work for me.


I get this on both the benchmark and bridge examples. I lost my temper and compiled a version of the benchmark with sequential MKL just for the hell of it.

2) Yesterday I ran the benchmark_ binary and got little speedup (0.5x to 1.2x) in most routines, apart from SVD. That one gave me a 140x speedup, and then the computer ran out of memory at the next size up. Is this speedup expected? One detail: I ran this on an 8-core machine that was heavily in use (6 of 8 cores busy), so benchmark_ (which I assume uses threaded MKL) most likely lost out, I figure.

3) Now I am running the sequential MKL build, and the benchmark is giving speedups in the range 0.5x to 1.9x, with the SVD at about 8x to 9.5x. What would be the expected speedup for this card?

Using a GeForce GTX 460 (1GB, 256-bit).
The computer is 2x 4-core Xeon X5472 @ 3.00GHz, with only 4GB of RAM.

4) Does using OpenGL (such as Compiz) affect the performance of the card? (I know I could answer that easily myself, but as I am here already I am asking in case someone has tried it... :) )
marconioz
 

Re: compiling cula benchmark

Post by kyle » Wed Dec 01, 2010 6:07 am

1) Linking against MKL can certainly be a pain. That's why we've included the pre-compiled benchmark :) (And yes, it is the multi-threaded MKL.)

2 & 3) Well, to put some of your data into perspective, you are comparing a very high-end CPU system to an entry-level CUDA system. I'd imagine the maximum speedup you'd see there would only be about 2x before you use up all the extra memory on your GPU, which is also driving the main display of your computer. This might jump to about 3x in the next CULA release, which has a bunch of Fermi optimizations.

I'm not really sure why the SVD speedup is so high, to be honest. Under the hood it's using the same core components as all the other tests. Also, CULA does offload some computation to the CPU. If the CPU is heavily loaded, this might become a bottleneck.

4) We've seen up to 25% performance gains from running on machines with no video output. If your video card is doing graphics it's certainly robbing some computation power and memory from CUDA programs.

Hope these help!
kyle
Administrator
 

Re: compiling cula benchmark

Post by marconioz » Wed Dec 01, 2010 3:11 pm

Hi Kyle,

thanks for your reply.

I repeated the tests and the pattern remains. I still can't compile the threaded benchmark, so the only comparison I can make is between the sequential benchmark and the benchmark_ binary. I'm not sure whether there is something weird with the binaries, though, as I again see a huge speedup in the SGESVD test:

-- SGESVD Benchmark --

Size       CULA (s)      MKL (s)     Speedup
------   ----------   ----------   ---------
4096          52.95      4954.23     93.5680
5120          82.72      9691.47    117.1533
6144         118.10     15860.12    134.2883
7168

CULA Error: Insufficient memory to complete this operation
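
For a rough sense of how much device memory these sizes might need, here is a back-of-the-envelope sketch. It assumes (this is not confirmed anywhere in the thread) that the input matrix plus full U and V^T are all held on the GPU as dense single-precision n x n matrices. Workspace buffers and whatever the display is using are not counted, so treat the result as a lower bound under that assumption, not as CULA's actual footprint.

```python
# Rough device-memory estimate for a single-precision SVD of an n x n matrix.
# ASSUMPTION: A, U and V^T are each stored on the GPU as full n x n float32
# matrices. Workspace and display memory are NOT included.

BYTES_PER_FLOAT32 = 4
MIB = 1024 ** 2


def matrix_mib(n: int) -> float:
    """Size in MiB of one dense n x n float32 matrix."""
    return n * n * BYTES_PER_FLOAT32 / MIB


def svd_footprint_mib(n: int, full_matrices: int = 3) -> float:
    """Lower bound in MiB: `full_matrices` dense n x n float32 matrices."""
    return full_matrices * matrix_mib(n)


if __name__ == "__main__":
    # Sizes taken from the SGESVD benchmark output above.
    for n in (4096, 5120, 6144, 7168):
        print(f"{n:5d}  one matrix: {matrix_mib(n):6.1f} MiB  "
              f"A + U + VT: {svd_footprint_mib(n):6.1f} MiB")
```

Under that assumption the 7168 case already needs close to 600 MiB before any workspace, which is a large share of a 1GB card that is also driving a desktop.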

The speedup using the serial benchmark is about 8x to 9.5x (still a lot compared with the 0.5x to 2.0x of the others). So without compiling a threaded version locally, I can't rely on this CULA benchmark for my future applications... :(
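
One way to see why the two sets of numbers look inconsistent is to compare them directly. The sketch below assumes the CULA times are roughly the same in both runs and that the serial figures cover the same matrix sizes; neither is stated in the thread. Under those assumptions, the ratio of the two speedups equals the ratio of the threaded MKL time to the serial MKL time.

```python
# Speedups quoted in this thread for the SGESVD test.
threaded_speedups = [93.5680, 117.1533, 134.2883]  # pre-built benchmark_ binary
serial_speedup_range = (8.0, 9.5)                   # locally built, sequential MKL


def implied_slowdown_range(threaded, serial_range):
    """Bounds on (threaded MKL time) / (serial MKL time).

    ASSUMPTION: the CULA time is identical in both runs, so
    speedup_threaded / speedup_serial == mkl_threaded_time / mkl_serial_time.
    The per-size pairing is unknown, so only outer bounds are returned.
    """
    lo = min(threaded) / max(serial_range)
    hi = max(threaded) / min(serial_range)
    return lo, hi


if __name__ == "__main__":
    lo, hi = implied_slowdown_range(threaded_speedups, serial_speedup_range)
    print(f"threaded MKL appears {lo:.1f}x to {hi:.1f}x slower than serial MKL")
```

If the assumptions hold, the threaded MKL run would be roughly an order of magnitude slower than the serial one on this machine, which fits the suspicion that the threaded result was distorted by the load on the other cores.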

The point of buying an entry-level card is to make sure things work first, before you invest in a CUDA cluster with thousands of Tesla 2070s :) perhaps in the year 2070...

When the other processes finish I will try this benchmark again, and see if things change...
Cheers,
marconioz
 

Re: compiling cula benchmark

Post by john » Fri Dec 10, 2010 10:33 am

Hi - were you able to run that second benchmark?

One note on the entry-level card purchase: I would recommend purchasing one copy of the card you intend to build the cluster around and testing on that. The balance of CPU power to GPU power is pretty vital when benchmarking a single node.
john
Administrator
 

