## Matlab link and Quadro 4000

### Matlab link and Quadro 4000

I just installed CULA R16a on an OpenSUSE 12.2 Linux x64 machine with an NVIDIA Quadro 4000 GPU. The CPU is an Intel Xeon E5-1620 (quad core, 3.60 GHz Turbo, 10 MB cache). I installed the NVIDIA driver provided with OpenSUSE. Issuing nvidia-smi at the command line gives me:


```
Thu Jan 24 16:33:02 2013
+------------------------------------------------------+
| NVIDIA-SMI 4.304.64   Driver Version: 304.64         |
|-------------------------------+----------------------+----------------------+
| GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro 4000              | 0000:03:00.0  On     |                  N/A |
| 40%   53C   P12   N/A /  N/A  |   1%  29MB / 2047MB  |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+
```

I set up the environment variables as specified in this forum:

```bash
export CULA_ROOT="/usr/local/cula"
export CULA_INC_PATH="$CULA_ROOT/include"
export CULA_LIB_PATH_32="$CULA_ROOT/lib"
export CULA_LIB_PATH_64="$CULA_ROOT/lib64"

if [ -z "${LD_LIBRARY_PATH}" ]
then
    LD_LIBRARY_PATH="${CULA_LIB_PATH_64}"; export LD_LIBRARY_PATH
else
    LD_LIBRARY_PATH="${CULA_LIB_PATH_64}:${LD_LIBRARY_PATH}"; export LD_LIBRARY_PATH
fi

export CULA_ILP64=1
export LAPACK_VERBOSITY=1
export CULA_DEBUG_LOG=~/debug.log
export LAPACK_VERSION=$CULA_ROOT/lib64/libcula_lapack_link.so
export BLAS_VERSION=$CULA_ROOT/lib64/libcula_lapack_link.so
```
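A quick way to confirm that these overrides actually reached the environment Matlab is launched from is to list them from the same shell. A minimal Python sketch (any language would do; this only inspects the process environment):

```python
import os

# Print the CULA-related overrides; "<unset>" flags a variable that
# never made it into the environment Matlab was started from.
for var in ("BLAS_VERSION", "LAPACK_VERSION", "CULA_DEBUG_LOG", "LD_LIBRARY_PATH"):
    print(var, "=", os.environ.get(var, "<unset>"))
```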

Using Matlab 2012b, I generated a random matrix A

```matlab
A = rand(1000,6000);
save A.mat A
```

and, then, this is what I got using the Quadro 4000 GPU:

```matlab
>> load A.mat
>> tic; [U,S,V] = svd(A); toc
cpu_id: x86 Family 6 Model 45 Stepping 7, GenuineIntel
libmwlapack: trying environment...
libmwlapack: loading /usr/local/cula/lib64/libcula_lapack_link.so
libmwlapack: loaded /usr/local/cula/lib64/libcula_lapack_link.so@0x7f56dc508d70
libmwlapack: /usr/local/cula/lib64/libcula_lapack_link.so is not a compatibility layer.
Elapsed time is 5.503383 seconds.
```

with the ~/debug.log file containing:

```
cula info: dgesdd (A, 1000, 6000, 0x7f9feca27020, 1000, 0x7fa001239020, 0x7fa000a97020, 1000, 0x7f9fef7ee020, 6000)
cula info: issuing to CPU (work query)
cula info: CPU library is lapackcpu.so
cula info: work query returned 4007000
cula info: work query returned 0
cula info: done
cula info: dgesdd (A, 1000, 6000, 0x7f9feca27020, 1000, 0x7fa001239020, 0x7fa000a97020, 1000, 0x7f9fef7ee020, 6000)
cula info: issuing to GPU (over threshold)
cula info: done
```

However, when I use the BLAS and LAPACK libraries provided with Matlab (and thus the Xeon processor), I get:

```matlab
>> load A.mat
>> tic; [U,S,V] = svd(A); toc
Elapsed time is 2.709539 seconds.
```
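Incidentally, calling svd(A) with three outputs computes the full decomposition, so for a 1000-by-6000 A the V factor alone is 6000-by-6000; if only the singular values or the economy-size factors are needed, the problem shrinks considerably on either device. A small NumPy sketch (an analogue of the Matlab call, on a smaller stand-in matrix) of the difference in factor sizes:

```python
import numpy as np

A = np.random.rand(200, 1200)  # smaller stand-in for the 1000x6000 case

# Full SVD, as in Matlab's svd(A): the right factor is 1200x1200 here.
U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(U.shape, s.shape, Vt.shape)    # (200, 200) (200,) (1200, 1200)

# Economy SVD, as in svd(A, 'econ'): the right factor shrinks to 200x1200.
U2, s2, Vt2 = np.linalg.svd(A, full_matrices=False)
print(U2.shape, Vt2.shape)           # (200, 200) (200, 1200)
```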

Therefore, the Quadro 4000 seems only half as fast as the processor, which is a big disappointment.

My question is: did I make some mistake in the configuration that limited the GPU's performance, or is the Quadro 4000 simply slower than the processor (and than I expected)?

On which kinds of problems, if any, would the Quadro outperform the multicore Xeon?

Thanks!

- robertosassi
**Posts:** 3 · **Joined:** Thu Jan 24, 2013 2:16 am

### Re: Matlab link and Quadro 4000

The Quadro 4000 is only half of the largest Fermi-generation GPU (256 cores vs. 512). As it is both a previous-generation GPU and one of the smaller parts of that generation, I probably wouldn't expect much from it.

You could try single precision, and slightly larger problems. You might find minor gains from this card in those areas.


- john
- Administrator
**Posts:** 587 · **Joined:** Thu Jul 23, 2009 2:31 pm

### Re: Matlab link and Quadro 4000

To soothe my disappointment at the Quadro 4000's performance, I followed john's suggestion and compared the GPU and the CPU over a larger range of matrix dimensions.


I adapted the GPU benchmarking code proposed on the MathWorks web page:

http://www.mathworks.it/it/help/distcomp/examples/benchmarking-a-b-on-the-gpu.html

```matlab
%% Benchmarking A\b on Matlab
% This code was adapted from what is proposed on the MathWorks web page:
% www.mathworks.it/it/help/distcomp/examples/benchmarking-a-b-on-the-gpu.html
function results = backslashBenchMatlab(maxMemory)
% maxMemory is the number of GB of memory of the card
if nargin == 0
    FreeMemory = 2048*2^20;        % This is a 2 GB GPU card
    maxMemory = FreeMemory/1024^3;
end

% Reduce it just a bit to be sure the matrix will fit.
maxMemory = 0.99*maxMemory;

    function [A, b] = getData(n, clz)
        fprintf('Creating a matrix of size %d-by-%d.\n', n, n);
        A = rand(n, n, clz) + 100*eye(n, n, clz);
        b = rand(n, 1, clz);
    end

    function time = timeSolve(A, b)
        tic;
        x = A\b; %#ok<NASGU> We don't need the value of x.
        time = toc;
    end

% Declare the matrix sizes to be a multiple of 1024.
maxSizeSingle = floor(sqrt(maxMemory*1024^3/4));
maxSizeDouble = floor(sqrt(maxMemory*1024^3/8));
step = 1024;
if maxSizeDouble/step >= 10
    step = step*floor(maxSizeDouble/(10*step));
end
sizeSingle = 1024:step:maxSizeSingle;
sizeDouble = 1024:step:maxSizeDouble;

    function gflops = benchFcn(A, b)
        numReps = 9;
        time = inf;
        % Solve the linear system a few times and calculate the
        % gigaflops based on the best time.
        for itr = 1:numReps
            tcurr = timeSolve(A, b);
            time = min(tcurr, time);
        end
        n = size(A, 1);
        flop = 2/3*n^3 + 3/2*n^2;
        gflops = flop/time/1e9;
    end

    function gflops = executeBenchmarks(clz, sizes)
        fprintf(['Starting benchmarks with %d different %s-precision ' ...
                 'matrices of sizes\nranging from %d-by-%d to %d-by-%d.\n'], ...
                length(sizes), clz, sizes(1), sizes(1), sizes(end), ...
                sizes(end));
        gflops = zeros(size(sizes));
        for i = 1:length(sizes)
            n = sizes(i);
            [A, b] = getData(n, clz);
            gflops(i) = benchFcn(A, b);
            fprintf('Gigaflops: %f\n', gflops(i));
        end
    end

gigaflops = executeBenchmarks('single', sizeSingle);
results.sizeSingle = sizeSingle;
results.gflopsSingle = gigaflops;
gigaflops = executeBenchmarks('double', sizeDouble);
results.sizeDouble = sizeDouble;
results.gflopsDouble = gigaflops;
end
```
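The flop count used above, 2/3·n³ + 3/2·n², is the usual estimate for an LU factorization plus triangular solves, so the same measurement can be reproduced outside Matlab for a CPU-side comparison. A minimal NumPy sketch (function name and defaults are mine, not from the original script):

```python
import time
import numpy as np

def bench_backslash(n, dtype=np.float64, reps=3):
    """Rough CPU analogue of the Matlab benchmark: time x = A\\b
    and convert the best time to GFlops with the same flop count."""
    rng = np.random.default_rng(0)
    # Diagonally dominant matrix, as in the Matlab getData helper.
    A = rng.random((n, n), dtype=dtype) + 100 * np.eye(n, dtype=dtype)
    b = rng.random((n, 1), dtype=dtype)
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        np.linalg.solve(A, b)          # LU solve, like A\b for square A
        best = min(best, time.perf_counter() - t0)
    flops = 2 / 3 * n**3 + 3 / 2 * n**2  # same count as the Matlab code
    return flops / best / 1e9

print(f"{bench_backslash(1024):.1f} GFlops")
```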

The resulting plot (GFlops versus matrix size, for both precisions) shows it all:

The GPU is slightly faster once the matrix is larger than about 9216 x 9216 (single precision) or 5120 x 5120 (double precision), but the GFlops gain is marginal: at best about 14.1% (or 27.1 GFlops) in single precision and about 20.4% (or 18.9 GFlops) in double precision.
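For reference, the largest sizes the script above will attempt on a 2 GB card follow directly from its memory logic, and both crossover points sit well inside that range. Re-deriving them (a quick Python check of the same arithmetic):

```python
import math

# 2 GB card, shaved by 1% as in the Matlab script.
card_bytes = 0.99 * 2048 * 2**20

# Largest n-by-n matrix that fits: 4 bytes/element single, 8 double.
max_single = math.floor(math.sqrt(card_bytes / 4))
max_double = math.floor(math.sqrt(card_bytes / 8))
print(max_single, max_double)   # 23054 16301
```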

I understand that these numbers are only rough approximations, but they nevertheless confirm that:

- the Quadro 4000's performance is poor, and it will rarely offer a (computationally efficient) alternative to the Xeon E5-1620 on this machine (I should have saved my money ...);
- the double-precision performance is indeed about half of the single-precision one (111.1 vs. 219.4 GFlops).

In your opinion, how much faster would a GTX 680 have been?

- robertosassi
**Posts:** 3 · **Joined:** Thu Jan 24, 2013 2:16 am

### Re: Matlab link and Quadro 4000

My 690 is just under 5x my 2600K CPU for the system solve. The 680 is slightly faster than that.

- john
- Administrator
**Posts:** 587 · **Joined:** Thu Jul 23, 2009 2:31 pm

### Re: Matlab link and Quadro 4000

Thanks for the feedback. In single or double precision? If you have Matlab at your disposal, would you be so kind as to run the code above and post the results here for your GTX 690? It would be very instructive for me (and help me decide whether to spend money on a new card ...), and it would definitively prove that I did not make a mistake in benchmarking the Quadro 4000.

Thanks again!


- robertosassi
**Posts:**3**Joined:**Thu Jan 24, 2013 2:16 am
