Notes on Cula and Nvidia GTX titan


Postby Boxed Cylon » Sun May 05, 2013 8:32 am

I have upgraded my computer to use an Intel Core i7-3820 cpu (quadcore, 3.6 GHz, socket 2011) and one of the new Nvidia titans; I have just a basic EVGA titan. The motherboard is EVGA X79 SLI and I have 32 GB of fast RAM (I think set to 1600 9-9-9-24; I can never get memory to the speed they advertise). Here are a few notes - I give the results of Cula's "benchmark_64" below. Any comments?

Is there anything I need to know about Cula and this monster? I was somewhat surprised the benchmark numbers weren't a little better, but I suppose the issue is problem size vs. speed - perhaps these benchmarks are not using large enough problems to make the most of the Titan.

I use Suse 12.3, which is a problem in that its gcc version is too new for Cuda - hence I had to compile gcc 4.6.2 and use that older compiler. That was some work, since there were bugs.
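For anyone else stuck doing this: the older gcc does not have to become the system compiler. nvcc's "-ccbin" option points it at a different host compiler directory, so something like the following works (the install prefix and program names here are just examples):

Code: Select all
# gcc 4.6.2 built into its own prefix, left out of $PATH
nvcc -ccbin /opt/gcc-4.6.2/bin -O2 -o myprog myprog.cu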

I am using the titan as a compute-only device - my graphics card for xorg is a GT 440.

It took a bit of tweaking of motherboard BIOS settings to get the card to PCIe 3.0. I am not sure what did it in the end, but: enable the Gen 3.0 workarounds; set the PCIe devices to 8X, 16X, and 16X (not 4X/4X, 16X, 8X/8X); set them all to Gen 3; set MMIOH to 16 GB; and in general turn chipset options on.

The nvidia kernel module needs the option "NVreg_EnablePCIeGen3=1" apparently.
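The usual place to set that permanently is a modprobe configuration file; the path below is the conventional one, but it may vary by distro:

Code: Select all
# /etc/modprobe.d/nvidia.conf
options nvidia NVreg_EnablePCIeGen3=1

After reloading the module, something like "grep EnablePCIeGen3 /proc/driver/nvidia/params" should confirm the setting took effect.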

The device bandwidth was initially 2.5-3.0 GB/s, as measured by Cuda's "bandwidthTest", but tweaking the BIOS settings first got this up to 5.0-6.0 GB/s (an honest PCIe 2.0), and then eventually 11.2 GB/s (an honest PCIe 3.0, although this was supposed to top out at 8 GB/s...). There is a lot of confusion about whether this is possible or not, but it works for me - things seem stable, though I have not tested stability extensively yet. Nvidia is nervous about stability, which is why PCIe 3.0 is not enabled by default.
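For anyone wanting to reproduce the measurement, the sample takes a device index and a memory mode on the command line; pinned memory is the flattering number (device 0 here is just an example):

Code: Select all
./bandwidthTest --device=0 --memory=pinned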

My own Cula Sgesv/matlab benchmark tops out at about 15X a single cpu core, which means the Titan is about twice as fast as the gtx 480 it replaced.
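For anyone who wants to time their own sizes outside of matlab, here is a minimal sketch of the host-interface Sgesv call (I believe the header is cula_lapack.h in recent releases; the size is an example, and matrix fill plus most error handling are omitted):

Code: Select all
/* minimal CULA Sgesv sketch - solves A*x = b on the GPU */
#include <cula_lapack.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 8192, nrhs = 1;
    float*   a    = malloc((size_t)n * n * sizeof(float));  /* column-major, as in LAPACK */
    float*   b    = malloc((size_t)n * nrhs * sizeof(float));
    culaInt* ipiv = malloc((size_t)n * sizeof(culaInt));

    /* ... fill a and b with the system to be solved ... */

    if (culaInitialize() != culaNoError) {
        printf("CULA initialization failed\n");
        return 1;
    }

    /* overwrites b with the solution x */
    culaStatus s = culaSgesv(n, nrhs, a, n, ipiv, b, n);
    if (s != culaNoError)
        printf("culaSgesv: %s\n", culaGetStatusString(s));

    culaShutdown();
    free(a); free(b); free(ipiv);
    return 0;
}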

The nvidia-settings utility has an option for turning on double precision - I gather this puts the device into a double precision mode. The single precision Cula benchmarks were about 20% faster with this option unchecked. I imagine the double precision benchmarks are quite a bit faster with it checked.
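If you would rather flip that from a script than from the GUI, nvidia-settings can query and assign attributes; I don't know the exact attribute name offhand, but it should turn up with a query like:

Code: Select all
nvidia-settings -q all | grep -i double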

I think these were the main points - this information was all scattered around the web, so I thought a summary in a single place would be useful.

B.C.

Code: Select all
./benchmark_64
Initializing CULA...
Initializing MKL...

Benchmarking the following functions:
-------------------------------------
             SGEQRF
             SGETRF
             SGELS
             SGGLSE
             SGESV
             SGESVD
-------------------------------------


     -- SGEQRF Benchmark  --

Size   CULA (s)    MKL (s)   Speedup
------ ---------- ---------- ---------
4096       0.14       0.56    4.0029
5120       0.23       1.03    4.5728
6144       0.33       1.76    5.4286
7168       0.49       2.73    5.5869
8192       0.63       4.22    6.7527

     -- SGETRF Benchmark  --

Size   CULA (s)    MKL (s)   Speedup
------ ---------- ---------- ---------
4096       0.08       0.28    3.5232
5120       0.12       0.52    4.2053
6144       0.18       0.88    4.8790
7168       0.25       1.42    5.5865
8192       0.35       2.07    5.9793

     -- SGELS Benchmark  --

Size   CULA (s)    MKL (s)   Speedup
------ ---------- ---------- ---------
4096       0.21       0.61    2.9153
5120       0.31       1.12    3.5725
6144       0.47       1.86    3.9493
7168       0.65       2.87    4.4439
8192       0.89       4.21    4.7571

     -- SGGLSE Benchmark  --

Size   CULA (s)    MKL (s)   Speedup
------ ---------- ---------- ---------
4096       0.24       1.56    6.4505
5120       0.39       2.60    6.6030
6144       0.55       4.01    7.2359
7168       0.77       5.77    7.5099
8192       0.99       8.22    8.2910

     -- SGESV Benchmark  --

Size   CULA (s)    MKL (s)   Speedup
------ ---------- ---------- ---------
4096       0.11       0.29    2.6720
5120       0.15       0.53    3.4292
6144       0.23       0.89    3.8067
7168       0.32       1.43    4.4763
8192       0.43       2.09    4.8877

     -- SGESVD Benchmark  --

Size   CULA (s)    MKL (s)   Speedup
------ ---------- ---------- ---------
4096      14.70      34.65    2.3572
5120      23.58      55.44    2.3510
6144      34.99      95.61    2.7327
7168      49.19     145.55    2.9589
8192      65.83     208.89    3.1732

Re: Notes on Cula and Nvidia GTX titan

Postby coruun » Mon May 06, 2013 1:04 am

Going by the raw numbers for the GTX Titan and the GTX 480, I would guess that your numbers are only slightly off.

The SP performance of these cards is 4.5 TFLOPS and 1.345 TFLOPS, respectively, so you can only get a maximum speedup of ~3 (4.5 / 1.345 ≈ 3.3).

The nice thing with the GTX Titan is the unlocked DP performance. While the GTX 480 is limited to 0.168 TFLOPS, the GTX Titan can still reach 1.3 TFLOPS, which is a speedup of ~8 (1.3 / 0.168 ≈ 7.7).

Maybe you should compare the results of the DP examples and try to increase the size of the problems.

[Numbers are taken from the English and German Wikipedia ;)]

Re: Notes on Cula and Nvidia GTX titan

Postby john » Tue May 07, 2013 7:08 am

Note how your speedup factors are all still growing as you go from 4k -> 8k. As NVIDIA keeps adding cores, we need larger and larger problems to saturate all those cores. You can change the sizes that are run on the command line, e.g.:

Code: Select all
benchmark 1024 16384 512

Re: Notes on Cula and Nvidia GTX titan

Postby Boxed Cylon » Tue May 07, 2013 7:30 pm

Following your advice, I get the figure below comparing the Titan to the i7-3820:

[Figure: cula_bench.jpg - benchmark speedups of the Titan over the i7-3820 across matrix sizes]


I guess the jump up at the end for several of the benchmarks is caused by a good match between the Titan hardware and the 16384 matrix size. The benchmark numbers start to flatten out at matrix sizes around 10,000 - one gets the most out of one's Titan with large arrays!

