Matlab link and Quadro 4000

Post by robertosassi » Thu Jan 24, 2013 8:36 am

I just installed CULA R16a on an OpenSuse 12.2 Linux x64 machine with an NVIDIA Quadro 4000 GPU. The CPU is an Intel Xeon Processor E5-1620 (Quad Core, 3.60GHz Turbo, 10MB Cache). I installed the NVIDIA driver provided with OpenSuse. Running nvidia-smi at the command line gives:
Code:
Thu Jan 24 16:33:02 2013
+------------------------------------------------------+
| NVIDIA-SMI 4.304.64   Driver Version: 304.64         |
|-------------------------------+----------------------+----------------------+
| GPU  Name                     | Bus-Id        Disp.  | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap| Memory-Usage         | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro 4000              | 0000:03:00.0      On |                  N/A |
| 40%   53C   P12    N/A /  N/A |   1%   29MB / 2047MB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+


I set up the environment variables as described elsewhere in this forum:
Code:
export CULA_ROOT="/usr/local/cula"
export CULA_INC_PATH="$CULA_ROOT/include"
export CULA_LIB_PATH_32="$CULA_ROOT/lib"
export CULA_LIB_PATH_64="$CULA_ROOT/lib64"
if [ -z "${LD_LIBRARY_PATH}" ]
then
    LD_LIBRARY_PATH="${CULA_LIB_PATH_64}"; export LD_LIBRARY_PATH
else
    LD_LIBRARY_PATH="${CULA_LIB_PATH_64}:${LD_LIBRARY_PATH}"; export LD_LIBRARY_PATH
fi
export CULA_ILP64=1
export LAPACK_VERBOSITY=1
export CULA_DEBUG_LOG=~/debug.log
export LAPACK_VERSION=$CULA_ROOT/lib64/libcula_lapack_link.so
export BLAS_VERSION=$CULA_ROOT/lib64/libcula_lapack_link.so
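
Before timing anything, it may be worth confirming that the library these variables point to is actually there. A minimal sketch, assuming the /usr/local/cula layout used above; the helper name check_cula_link is made up for illustration and is not part of CULA:

```shell
#!/bin/sh
# Hypothetical helper (not part of CULA): report whether the link library
# that LAPACK_VERSION / BLAS_VERSION point to exists and is readable.
check_cula_link() {
    lib="$1"
    if [ -r "$lib" ]; then
        echo "found: $lib"
    else
        echo "missing: $lib"
        return 1
    fi
}

# Usage, assuming the variables from the block above are exported:
#   check_cula_link "$LAPACK_VERSION" && check_cula_link "$BLAS_VERSION"
```

This only checks that the file is present; it says nothing about whether Matlab will route calls through it, which is what the LAPACK_VERBOSITY output is for.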


Using Matlab 2012b, I generated a random matrix A:
Code:
A=rand(1000,6000);
save A.mat A

and this is what I got using the Quadro 4000 GPU:
Code:
>> load A.mat
>> tic; [U,S,V]=svd(A); toc
cpu_id: x86 Family 6 Model 45 Stepping 7, GenuineIntel
libmwlapack: trying environment...
libmwlapack: loading /usr/local/cula/lib64/libcula_lapack_link.so
libmwlapack: loaded /usr/local/cula/lib64/libcula_lapack_link.so@0x7f56dc508d70
libmwlapack: /usr/local/cula/lib64/libcula_lapack_link.so is not a compatibility layer.
Elapsed time is 5.503383 seconds.

with the ~/debug.log file containing:
Code:
cula info:  dgesdd (A, 1000, 6000, 0x7f9feca27020, 1000, 0x7fa001239020, 0x7fa000a97020, 1000, 0x7f9fef7ee020, 6000)
cula info:  issuing to CPU (work query)
cula info:  CPU library is lapackcpu.so
cula info:  work query returned 4007000
cula info:  work query returned 0
cula info:  done
cula info:  dgesdd (A, 1000, 6000, 0x7f9feca27020, 1000, 0x7fa001239020, 0x7fa000a97020, 1000, 0x7f9fef7ee020, 6000)
cula info:  issuing to GPU (over threshold)
cula info:  done
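
For longer sessions the log grows quickly, so a quick tally of where calls were sent can help. A rough sketch, assuming the log keeps the "issuing to GPU" / "issuing to CPU" wording shown above; count_dispatch is an illustrative name, not a CULA tool:

```shell
#!/bin/sh
# Count how many logged CULA calls were sent to the GPU versus the CPU.
# The match strings are taken from the log lines shown above; pass the
# path that CULA_DEBUG_LOG was set to.
count_dispatch() {
    log="$1"
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true"
    gpu=$(grep -c 'issuing to GPU' "$log" || true)
    cpu=$(grep -c 'issuing to CPU' "$log" || true)
    echo "GPU: $gpu CPU: $cpu"
}

# Usage: count_dispatch ~/debug.log
```

For the excerpt above this would report one GPU call (the actual dgesdd) and one CPU call (the work query).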

However, when I use the BLAS and LAPACK libraries provided with Matlab (and thus the Xeon processor), I get:
Code:
>> load A.mat
>> tic; [U,S,V]=svd(A); toc
Elapsed time is 2.709539 seconds.

Therefore, the Quadro 4000 seems to be about half as fast as the processor (5.50 s vs 2.71 s), which is a big disappointment.
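
To put a number on it, the ratio of the two elapsed times can be computed directly. A small sketch using awk for the floating-point division; the function name slowdown is just illustrative:

```shell
#!/bin/sh
# Ratio of two elapsed times (first argument divided by second).
slowdown() {
    awk -v gpu="$1" -v cpu="$2" 'BEGIN { printf "%.2f\n", gpu / cpu }'
}

# Elapsed times from the two svd runs above (GPU path, then CPU path).
slowdown 5.503383 2.709539   # prints 2.03
```

So the GPU path took roughly twice as long as the CPU path for this single 1000-by-6000 SVD.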

My question is: did I make a mistake in the configuration that could have limited the performance of the GPU or, more simply, is the Quadro 4000 slower than the processor (and than I expected)?
On which kinds of problems, if any, would the Quadro outperform the multicore Xeon?

Thanks!
robertosassi
 
Posts: 3
Joined: Thu Jan 24, 2013 2:16 am

Re: Matlab link and Quadro 4000

Post by john » Thu Jan 24, 2013 9:25 am

The Quadro 4000 has only half the cores of the largest Fermi-generation GPU (256 vs 512). As this is both the previous generation of GPUs and one of the smaller GPUs of that generation, I wouldn't expect much from it.

You could try single precision, and slightly larger problems. You might find minor gains from this card in those areas.
john
Administrator
 
Posts: 587
Joined: Thu Jul 23, 2009 2:31 pm

Re: Matlab link and Quadro 4000

Post by robertosassi » Fri Jan 25, 2013 7:23 am

To ease my disappointment with the performance of the Quadro 4000, I followed john's suggestion and compared the GPU and the CPU over a wider range of matrix sizes.

I adapted the GPU benchmarking code from the MathWorks web page:
http://www.mathworks.it/it/help/distcomp/examples/benchmarking-a-b-on-the-gpu.html
Code:
%% Benchmarking A\b in Matlab
% This code was adapted from the example on the MathWorks web page:
% www.mathworks.it/it/help/distcomp/examples/benchmarking-a-b-on-the-gpu.html

function results = backslashBenchMatlab(maxMemory)
% maxMemory is the amount of memory on the card, in GB

if nargin == 0
    FreeMemory = 2048*2^20; % This is a 2Gbyte GPU card
    maxMemory = FreeMemory/1024^3;
end

% reducing it just a bit to be sure the matrix will fit
maxMemory=0.99*maxMemory;

function [A, b] = getData(n, clz)
    fprintf('Creating a matrix of size %d-by-%d.\n', n, n);
    A = rand(n, n, clz) + 100*eye(n, n, clz);
    b = rand(n, 1, clz);
end

function time = timeSolve(A, b)
    tic;
    x = A\b; %#ok<NASGU> We don't need the value of x.
    time = toc;
end

% Declare the matrix sizes to be a multiple of 1024.
maxSizeSingle = floor(sqrt(maxMemory*1024^3/4));
maxSizeDouble = floor(sqrt(maxMemory*1024^3/8));
step = 1024;
if maxSizeDouble/step >= 10
    step = step*floor(maxSizeDouble/(10*step));
end
sizeSingle = 1024:step:maxSizeSingle;
sizeDouble = 1024:step:maxSizeDouble;

function gflops = benchFcn(A, b)
    numReps = 9;
    time = inf;
    % We solve the linear system a few times and calculate the Gigaflops
    % based on the best time.
    for itr = 1:numReps
        tcurr = timeSolve(A, b);
        time = min(tcurr, time);
    end
    n = size(A, 1);
    flop = 2/3*n^3 + 3/2*n^2;
    gflops = flop/time/1e9;
end

function gflops = executeBenchmarks(clz, sizes)
    fprintf(['Starting benchmarks with %d different %s-precision ' ...
         'matrices of sizes\nranging from %d-by-%d to %d-by-%d.\n'], ...
            length(sizes), clz, sizes(1), sizes(1), sizes(end), ...
            sizes(end));
    gflops = zeros(size(sizes));
    for i = 1:length(sizes)
        n = sizes(i);
        [A, b] = getData(n, clz);
        gflops(i) = benchFcn(A, b);
        fprintf('Gigaflops: %f\n', gflops(i));
    end
end

gigaflops = executeBenchmarks('single', sizeSingle);
results.sizeSingle = sizeSingle;
results.gflopsSingle = gigaflops;
gigaflops = executeBenchmarks('double', sizeDouble);
results.sizeDouble = sizeDouble;
results.gflopsDouble = gigaflops;

end
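
For anyone who wants to check a single figure by hand, the operation count used in benchFcn (2/3*n^3 + 3/2*n^2 flops for one solve of size n) can be turned into GFlops outside Matlab as well. A minimal sketch; the function name and the timing in the usage line are made up for illustration and are not measured values:

```shell
#!/bin/sh
# GFlops for one A\b solve of size n that took t seconds, using the same
# operation count as benchFcn above: 2/3*n^3 + 3/2*n^2.
gflops() {
    awk -v n="$1" -v t="$2" 'BEGIN {
        flop = 2/3 * n^3 + 3/2 * n^2
        printf "%.1f\n", flop / t / 1e9
    }'
}

# Usage with a hypothetical timing: gflops 1024 0.01
```

Because the benchmark keeps the best of nine repetitions, t here corresponds to the minimum time, so the resulting rate is an optimistic one.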


This picture shows it all:
Image

The GPU is slightly faster when the matrix is larger than about 9216 x 9216 elements (single precision) or 5120 x 5120 (double precision). But the gain in GFlops is marginal: at most about 14.1% (27.1 GFlops) in single precision and about 20.4% (18.9 GFlops) in double precision.

I understand that these numbers are only rough approximations, but they nevertheless confirm that:
  • the Quadro 4000 performs poorly and will rarely be a (computationally efficient) alternative to the Xeon E5-1620 on this machine (I should have saved my money ...).
  • double precision performance is indeed about half of single precision performance (111.1 vs 219.4 GFlops)
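
The percentages can be cross-checked from the absolute numbers. The sketch below assumes that 219.4 and 111.1 GFlops are the GPU peak rates and that 27.1 and 18.9 GFlops are the absolute gains over the CPU at the same size; that reading is an assumption, and relative_gain is just an illustrative name:

```shell
#!/bin/sh
# Back out the CPU baseline from a GPU rate and an absolute gain,
# then express the gain relative to that baseline.
relative_gain() {
    awk -v gpu="$1" -v gain="$2" 'BEGIN {
        cpu = gpu - gain
        printf "%.1f GFlops baseline, %.1f%% gain\n", cpu, 100 * gain / cpu
    }'
}

relative_gain 219.4 27.1   # single precision: 192.3 GFlops baseline, 14.1% gain
relative_gain 111.1 18.9   # double precision: 92.2 GFlops baseline, 20.5% gain
```

Under that assumption the single precision figure matches, and the double precision one lands within rounding of the value quoted above.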

In your opinion, how much faster would a GTX 680 have been?

Re: Matlab link and Quadro 4000

Post by john » Fri Jan 25, 2013 8:52 am

My 690 is just under 5x my 2600K CPU for the system solve. The 680 is slightly faster than that.

Re: Matlab link and Quadro 4000

Post by robertosassi » Fri Jan 25, 2013 9:03 am

Thanks for the feedback. Was that in single or double precision? If you have Matlab available, would you be so kind as to run the code above and post the results for your GTX 690 here? It would be very instructive for me (and would help me decide whether to spend money on a new card ...), and it would definitively show that I did not make a mistake in benchmarking the Quadro 4000.

In any case, thanks again!