sgesv in 1.1 is slow...

Postby Boxed Cylon » Tue Dec 01, 2009 2:45 pm

In the new 1.1 (non beta) I am finding that sgesv is painfully slow - slower than the CPU. Any reason why that might be? I have tried the linux and RHEL versions, both 64-bit. And I am using culaDeviceSgesv. Alas, I have deleted my 1.1 beta... I'm pretty sure that with 1.1 beta I was getting speed ups of 6-7X.
Boxed Cylon
 
Posts: 48
Joined: Fri Oct 16, 2009 8:57 pm

Re: sgesv in 1.1 is slow...

Postby dan » Tue Dec 01, 2009 3:33 pm

Hi Boxed Cylon,

To help debug the problem, can you provide us with some information?
- What problem sizes are you running?
- What software package are you testing against?
- What are your performance numbers?

Also, please run the sysinfo.sh script (in examples) and post the output here.
dan
Administrator
 
Posts: 61
Joined: Thu Jul 23, 2009 2:29 pm

Re: sgesv in 1.1 is slow...

Postby Boxed Cylon » Tue Dec 01, 2009 5:35 pm

I am attaching (I hope) the testing routine I am using. This routine is a MATLAB mex file. The test script I use loops over ever-increasing array dimensions, up to 5000x5000:

Code: Select all
clear all

N=[10 50 100 200 500 750 1000 1250 1500 2000 2500 3000 4000 5000];
ETR=[];

for K=N,
  A=randn(K,K,'double');
  B=randn(K,5000,'double');

  A=single(A);
  B=single(B);

  disp('CPU: 1st call')
  tic
  Lp=A\B;
  toc

  disp(' ')
  disp('CPU: 2nd call')
  tic
  Lp=A\B;
  T1=toc

  Lp(1,1:3)
  Lp(end,(end-2):end)

if 1==2,
  [Q,R]=qr(A); X = R\(Q'*B);
  X13=X(1,1:3)
  Xend=X(end,(end-2):end)
end

if 1==1,
  disp('GPU: 1st call')
  tic
  [X]= gpu_sgesv(A,B);
  toc

  disp(' ')
  disp('GPU: 2nd call')
  tic
  [X]= gpu_sgesv(A,B);
  T2=toc
end

  X(1,1:3)
  X(end,(end-2):end)

  ETR=[ETR T1/T2]

end

plot(N,ETR)


With the new 1.1, I find T1/T2 decreases to 0.6-0.7, that is the GPU calculation is taking longer than the CPU. As I recall with 1.1 beta this ratio increased to 6-7, that is the GPU calculation was 6-7 times faster with the larger array sizes.

It is very strange - perhaps I am doing something wrong, but that would have to be in the compile stage. I am now using CUDA 3.0 beta, since the new 1.1 seems to require it with linux.

Attachment: gpu_sgesv-20091201.txt (size 3457): http://www.culatools.com/images/fbfiles/files/gpu_sgesv-20091201.txt
Boxed Cylon
 
Posts: 48
Joined: Fri Oct 16, 2009 8:57 pm

Re: sgesv in 1.1 is slow...

Postby kyle » Tue Dec 01, 2009 5:49 pm

Quick question, is GEQRF (or any other non-LU based function) suffering this same performance drop?

Also, this statement is incorrect:
Boxed Cylon wrote:It is very strange - perhaps I am doing something wrong, but that would have to be in the compile stage. I am now using CUDA 3.0 beta, since the new 1.1 seems to require it with linux.

See this post for some more details there.
kyle
Administrator
 
Posts: 301
Joined: Fri Jun 12, 2009 7:47 pm

Re:sgesv in 1.1 is slow...

Postby Boxed Cylon » Tue Dec 01, 2009 8:35 pm

Most curious - I ran the benchmark example:

Code: Select all
make build64
sh ../checkenvironment.sh
gcc -m64 -o benchmark benchmark.c -DNDEBUG -O3 -I/usr/local/cula/include -I/opt/intel/mkl/10.1.1.019/include/ -L/usr/local/cula/lib64 -L/opt/intel/mkl/10.1.1.019/lib/em64t/ -lmkl_lapack -lmkl_core -lmkl_intel_lp64 -lmkl_intel_thread -liomp5 -lcula -lcublas -lcudart -lpthread
ls
benchmark  benchmark.c  Makefile
./benchmark
Initializing CULA...

     -- SGEQRF Benchmark  --

Size   CULA (s)   MKL (s)   Speedup
------ ---------- --------- ---------
4096       0.69      1.90    2.7654
5120       1.19      2.78    2.3336
6144       1.96      4.85    2.4775
7168       3.01      7.43    2.4698
8192       3.34     15.80    4.7306

     -- SGETRF Benchmark  --

Size   CULA (s)   MKL (s)   Speedup
------ ---------- --------- ---------
4096       0.41      1.33    3.2752
5120       0.69      1.63    2.3678
6144       1.09      2.82    2.5829
7168       1.58      4.08    2.5810
8192       2.23     11.48    5.1368

     -- SGELS Benchmark  --

Size   CULA (s)   MKL (s)   Speedup
------ ---------- --------- ---------
4096       0.90      1.65    1.8366
5120       1.56      2.98    1.9179
6144       2.47      5.01    2.0279
7168       3.69      7.75    2.0992
8192       4.09     11.36    2.7754

     -- SGGLSE Benchmark  --

Size   CULA (s)   MKL (s)   Speedup
------ ---------- --------- ---------
4096       1.01      6.31    6.2335
5120       1.71      8.39    4.9028
6144       2.69     12.85    4.7818
7168       3.96     18.25    4.6065
8192       4.88     42.71    8.7554

     -- SGESVD Benchmark  --

Size   CULA (s)   MKL (s)   Speedup
------ ---------- --------- ---------
4096      37.87    117.42    3.1005
5120      66.72    167.24    2.5066



At this point the test ran ad infinitum...completing no other cases...

I think I will try a reboot, just in case...
Boxed Cylon
 
Posts: 48
Joined: Fri Oct 16, 2009 8:57 pm

Re:sgesv in 1.1 is slow...

Postby dan » Tue Dec 01, 2009 9:26 pm

Hi Boxed Cylon,

Do you recall if you ran the benchmark example previously and saw similar speedups or has the performance been reduced for these as well? Depending on your system, the speedups you show here can be very typical.

Also, can you please run the sysinfo.sh script (in $CULA_ROOT/examples) and post the output here? It would help a tremendous amount when debugging to know info like your OS, GPU, GPU driver, etc., all of which the sysinfo script collects easily for us.

Dan
dan
Administrator
 
Posts: 61
Joined: Thu Jul 23, 2009 2:29 pm

Re:sgesv in 1.1 is slow...

Postby Boxed Cylon » Wed Dec 02, 2009 5:27 am

The output of sysinfo.sh is below. I don't think I ran the benchmark routine before, but my test routine showed the gpu quite a bit faster (I'm not sure if it was 6-7X now; perhaps only 3X).

Code: Select all
sh sysinfo.sh

System Information Utility
Copyright EM Photonics, Inc.


--------------------------------------------------------------------------------
                               Operating System
--------------------------------------------------------------------------------

    Linux skipjack 2.6.27.39-0.2-default #1 SMP 2009-11-23 12:57:38 +0100 x86_64 x86_64 x86_64 GNU/Linux

    Welcome to openSUSE 11.1 - Kernel \r (\l).



--------------------------------------------------------------------------------
                                   CPU Info
--------------------------------------------------------------------------------

    processor   : 0
    vendor_id   : AuthenticAMD
    cpu family  : 16
    model               : 4
    model name  : AMD Phenom(tm) II X4 940 Processor
    stepping    : 2
    cpu MHz             : 800.000
    cache size  : 512 KB
    physical id : 0
    siblings    : 4
    core id             : 0
    cpu cores   : 4
    apicid              : 0
    initial apicid      : 0
    fpu         : yes
    fpu_exception       : yes
    cpuid level : 5
    wp          : yes
    flags               : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
    bogomips    : 6027.42
    TLB size    : 1024 4K pages
    clflush size        : 64
    cache_alignment     : 64
    address sizes       : 48 bits physical, 48 bits virtual
    power management: ts ttp tm stc 100mhzsteps hwpstate

    processor   : 1
    vendor_id   : AuthenticAMD
    cpu family  : 16
    model               : 4
    model name  : AMD Phenom(tm) II X4 940 Processor
    stepping    : 2
    cpu MHz             : 800.000
    cache size  : 512 KB
    physical id : 0
    siblings    : 4
    core id             : 1
    cpu cores   : 4
    apicid              : 1
    initial apicid      : 1
    fpu         : yes
    fpu_exception       : yes
    cpuid level : 5
    wp          : yes
    flags               : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
    bogomips    : 6027.53
    TLB size    : 1024 4K pages
    clflush size        : 64
    cache_alignment     : 64
    address sizes       : 48 bits physical, 48 bits virtual
    power management: ts ttp tm stc 100mhzsteps hwpstate

    processor   : 2
    vendor_id   : AuthenticAMD
    cpu family  : 16
    model               : 4
    model name  : AMD Phenom(tm) II X4 940 Processor
    stepping    : 2
    cpu MHz             : 800.000
    cache size  : 512 KB
    physical id : 0
    siblings    : 4
    core id             : 2
    cpu cores   : 4
    apicid              : 2
    initial apicid      : 2
    fpu         : yes
    fpu_exception       : yes
    cpuid level : 5
    wp          : yes
    flags               : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
    bogomips    : 6027.57
    TLB size    : 1024 4K pages
    clflush size        : 64
    cache_alignment     : 64
    address sizes       : 48 bits physical, 48 bits virtual
    power management: ts ttp tm stc 100mhzsteps hwpstate

    processor   : 3
    vendor_id   : AuthenticAMD
    cpu family  : 16
    model               : 4
    model name  : AMD Phenom(tm) II X4 940 Processor
    stepping    : 2
    cpu MHz             : 800.000
    cache size  : 512 KB
    physical id : 0
    siblings    : 4
    core id             : 3
    cpu cores   : 4
    apicid              : 3
    initial apicid      : 3
    fpu         : yes
    fpu_exception       : yes
    cpuid level : 5
    wp          : yes
    flags               : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
    bogomips    : 6027.62
    TLB size    : 1024 4K pages
    clflush size        : 64
    cache_alignment     : 64
    address sizes       : 48 bits physical, 48 bits virtual
    power management: ts ttp tm stc 100mhzsteps hwpstate


--------------------------------------------------------------------------------
                                  Memory Info
--------------------------------------------------------------------------------

    MemTotal:      8180468 kB
    MemFree:       7048136 kB
    Buffers:        229416 kB
    Cached:         556644 kB
    SwapCached:          0 kB
    Active:         448448 kB
    Inactive:       514380 kB
    SwapTotal:     2104472 kB
    SwapFree:      2104472 kB
    Dirty:             124 kB
    Writeback:           0 kB
    AnonPages:      176756 kB
    Mapped:          93416 kB
    Slab:           107084 kB
    SReclaimable:    89144 kB
    SUnreclaim:      17940 kB
    PageTables:      10076 kB
    NFS_Unstable:        0 kB
    Bounce:              0 kB
    WritebackTmp:        0 kB
    CommitLimit:   6194704 kB
    Committed_AS:   456416 kB
    VmallocTotal: 34359738367 kB
    VmallocUsed:    326452 kB
    VmallocChunk: 34359410171 kB
    HugePages_Total:     0
    HugePages_Free:      0
    HugePages_Rsvd:      0
    HugePages_Surp:      0
    Hugepagesize:     2048 kB
    DirectMap4k:     45952 kB
    DirectMap2M:   2050048 kB
    DirectMap1G:   6291456 kB

--------------------------------------------------------------------------------
                                NVIDIA GPU Info
--------------------------------------------------------------------------------


    NVRM version: NVIDIA UNIX x86_64 Kernel Module  195.17  Mon Oct 26 06:19:11 PST 2009
    GCC version:  gcc version 4.3.2 [gcc-4_3-branch revision 141291] (SUSE Linux)

--------------------------------------------------------------------------------
                                  Environment
--------------------------------------------------------------------------------

    LESSKEY=/etc/lesskey.bin
    NNTPSERVER=news
    MANPATH=/usr/lib64/mpi/gcc/openmpi/share/man:/usr/local/man:/usr/local/share/man:/usr/share/man:/opt/sun/sunstudio12.1/man:/usr/local/GMT/man:/opt/sun/sunstudio12.1/man:/usr/local/GMT/man
    INFODIR=/usr/local/info:/usr/share/info:/usr/info
    SSH_AGENT_PID=4279
    DM_CONTROL=/var/run/xdmctl
    HOSTNAME=skipjack
    XKEYSYMDB=/usr/share/X11/XKeysymDB
    CULA_BIN_PATH_32=/usr/local/cula/bin
    GPG_AGENT_INFO=/tmp/gpg-e0zxSN/S.gpg-agent:4263:1
    TERM=xterm
    SHELL=/bin/bash
    HOST=skipjack
    XDG_SESSION_COOKIE=b700cd53d78bb51b813c3a004769d463-1259725067.625389-1478094789
    XDM_MANAGED=/var/run/xdmctl/xdmctl-:0,maysd,mayfn,sched,rsvd,method=classic,auto
    HISTSIZE=1000
    PROFILEREAD=true
    GTK2_RC_FILES=/etc/gtk-2.0/gtkrc:/usr/share/themes//QtCurve/gtk-2.0/gtkrc:/home/xxxxxx/.gtkrc-2.0-qtengine:/home/xxxxxx/.gtkrc-2.0
    TMPDIR=/tmp
    GS_LIB=/home/xxxxxx/.fonts
    MORE=-sl
    WINDOWID=29360136
    XSESSION_IS_UP=yes
    KDE_FULL_SESSION=true
    USER=xxxxxx
    JRE_HOME=/usr/lib64/jvm/jre
    GMTHOME=/usr/local/GMT
    DESKTOP_LAUNCH=kde-open
    LD_LIBRARY_PATH=/usr/local/cula/lib64:/usr/local/lib:/opt/intel/Compiler/11.0/074/lib/intel64/:/opt/intel/mkl/10.1.1.019/lib/em64t:/usr/local/cuda/lib64:/opt/sun/sunstudio12.1/lib/amd64/:/opt/sun/sunstudio12.1/rtlibs/amd64/:/opt/intel/mkl/10.1.1.019/tools/builder/
    LS_COLORS=no=00:fi=00:di=01;34:ln=00;36:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=41;33;01:ex=00;32:*.cmd=00;32:*.exe=01;32:*.com=01;32:*.bat=01;32:*.btm=01;32:*.dll=01;32:*.tar=00;31:*.tbz=00;31:*.tgz=00;31:*.rpm=00;31:*.deb=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.lzma=00;31:*.zip=00;31:*.zoo=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.tb2=00;31:*.tz2=00;31:*.tbz2=00;31:*.xz=00;31:*.avi=01;35:*.bmp=01;35:*.fli=01;35:*.gif=01;35:*.jpg=01;35:*.jpeg=01;35:*.mng=01;35:*.mov=01;35:*.mpg=01;35:*.pcx=01;35:*.pbm=01;35:*.pgm=01;35:*.png=01;35:*.ppm=01;35:*.tga=01;35:*.tif=01;35:*.xbm=01;35:*.xpm=01;35:*.dl=01;35:*.gl=01;35:*.wmv=01;35:*.aiff=00;32:*.au=00;32:*.mid=00;32:*.mp3=00;32:*.ogg=00;32:*.voc=00;32:*.wav=00;32:
    XNLSPATH=/usr/share/X11/nls
    ENV=/etc/bash.bashrc
    MKL_INC_PATH=/opt/intel/mkl/10.1.1.019/include/
    HOSTTYPE=x86_64
    SSH_AUTH_SOCK=/tmp/ssh-XtZHS3723/agent.3723
    FROM_HEADER=
    SESSION_MANAGER=local/skipjack:@/tmp/.ICE-unix/4674,unix/skipjack:/tmp/.ICE-unix/4674
    PAGER=less
    CSHEDIT=emacs
    XDG_CONFIG_DIRS=/etc/xdg
    MINICOM=-c on
    KONSOLE_DCOP=DCOPRef(konsole-4735,konsole)
    DESKTOP_SESSION=default
    PATH=/usr/local/mpich2/bin/:.:/opt/kde3/bin:/usr/local/mpich2/bin/:.:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/usr/lib64/jvm/jre/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/opt/intel/Compiler/11.0/074/bin/intel64/:/usr/local/GMT/bin:/opt/sun/sunstudio12.1/bin:/usr/local/cuda/bin:/opt/intel/Compiler/11.0/074/bin/intel64/:/usr/local/GMT/bin:/opt/sun/sunstudio12.1/bin:/usr/local/cuda/bin
    MAIL=/var/spool/mail/xxxxxx
    CPU=x86_64
    QT_IM_MODULE=xim
    KDM_AUTOLOGIN=xxxxxxx
    JAVA_BINDIR=/usr/lib64/jvm/jre/bin
    BC_ENV_ARGS=/home/xxxxxxx/.bcrc
    CULA_LIB_PATH_32=/usr/local/cula/lib
    PWD=/usr/local/cula/examples
    INPUTRC=/home/xxxxxx/.inputrc
    KONSOLE_DCOP_SESSION=DCOPRef(konsole-4735,session-1)
    XMODIFIERS=@im=local
    JAVA_HOME=/usr/lib64/jvm/jre
    KDE_SESSION_UID=1000
    LANG=en_US.UTF-8
    PYTHONSTARTUP=/etc/pythonstart
    SSH_ASKPASS=/usr/lib64/ssh/x11-ssh-askpass
    CULA_BIN_PATH_64=/usr/local/cula/bin64
    SHLVL=3
    HOME=/home/xxxxxxx
    QT_SYSTEM_DIR=/usr/share/desktop-data
    OSTYPE=linux
    LESS_ADVANCED_PREPROCESSOR=no
    XCURSOR_THEME=DMZ
    LS_OPTIONS=-N --color=tty -T 0
    WINDOWMANAGER=/usr/bin/kde
    LOGNAME=xxxxxxx
    MACHTYPE=x86_64-suse-linux
    LESS=-M -I
    G_FILENAME_ENCODING=@locale,UTF-8,ISO-8859-15,CP1252
    CVS_RSH=ssh
    CULA_ROOT=/usr/local/cula
    XDG_DATA_DIRS=/usr/local/share:/usr/share:/etc/opt/kde3/share:/opt/kde3/share
    LESSOPEN=lessopen.sh %s
    MKL_LIB_PATH_64=/opt/intel/mkl/10.1.1.019/lib/em64t/
    USE_FAM=
    INFOPATH=/usr/local/info:/usr/share/info:/usr/info
    DISPLAY=:0
    CULA_INC_PATH=/usr/local/cula/include
    CULA_LIB_PATH_64=/usr/local/cula/lib64
    GTK_IM_MODULE=cedilla
    XAUTHLOCALHOSTNAME=skipjack
    LESSCLOSE=lessclose.sh %s %s
    QT_IM_SWITCHER=imsw-multi
    G_BROKEN_FILENAMES=1
    COLORTERM=
    JAVA_ROOT=/usr/lib64/jvm/jre
    _=/usr/bin/env

--------------------------------------------------------------------------------
                                     Tools
--------------------------------------------------------------------------------

    gcc (SUSE Linux) 4.3.2 [gcc-4_3-branch revision 141291]
    Copyright (C) 2008 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.


    GNU ld (GNU Binutils; openSUSE 11.1) 2.19

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2009 NVIDIA Corporation
    Built on Mon_Oct_26_09:40:14_PDT_2009
    Cuda compilation tools, release 3.0, V0.2.1221
Boxed Cylon
 
Posts: 48
Joined: Fri Oct 16, 2009 8:57 pm

Re:sgesv in 1.1 is slow...

Postby john » Wed Dec 02, 2009 7:49 am

Boxed Cylon wrote:
Code: Select all
     -- SGESVD Benchmark  --

Size   CULA (s)   MKL (s)   Speedup
------ ---------- --------- ---------
4096      37.87    117.42    3.1005
5120      66.72    167.24    2.5066



At this point the test ran ad infinitum...completing no other cases...

I think I will try a reboot, just in case...

Unfortunately the SVD test likes to take a rather long time, I'm afraid. The next few lines will each take as long as 5-10 minutes to generate. It's a good time to grab a new cup of coffee; SVD is a rather intense routine. Notably, when you call sgesv (your routine of interest), the sequence is sgetrf and then one additional routine. The sgetrf is the bottleneck of that, and you're seeing good speedups on that part of the benchmark suite, so we're getting closer.
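To make the decomposition John describes concrete: in reference LAPACK, sgesv factors A with sgetrf and then calls sgetrs for the triangular solves. The sketch below mirrors that two-step structure in plain C on a column-major matrix. It is not CULA's or LAPACK's code; the names `lu_factor`, `lu_solve`, and `gesv_sketch` are made up for illustration, and the loops are unblocked and unoptimized.

```c
#include <assert.h>
#include <stddef.h>

static float absf(float v) { return v < 0.0f ? -v : v; }

/* Step 1 (the sgetrf role): factor the n x n column-major matrix a in place
   as P*A = L*U with partial pivoting. Returns 0 on success, or k+1 if the
   pivot found for column k is exactly zero. */
static int lu_factor(int n, float *a, int lda, int *ipiv)
{
    assert(n >= 0 && lda >= n);
    for (int k = 0; k < n; ++k) {
        int p = k;                               /* find the pivot row */
        for (int i = k + 1; i < n; ++i)
            if (absf(a[i + k * lda]) > absf(a[p + k * lda]))
                p = i;
        ipiv[k] = p;
        if (a[p + k * lda] == 0.0f)
            return k + 1;
        if (p != k)                              /* swap full rows k and p */
            for (int j = 0; j < n; ++j) {
                float t = a[k + j * lda];
                a[k + j * lda] = a[p + j * lda];
                a[p + j * lda] = t;
            }
        for (int i = k + 1; i < n; ++i) {        /* eliminate below the pivot */
            a[i + k * lda] /= a[k + k * lda];
            for (int j = k + 1; j < n; ++j)
                a[i + j * lda] -= a[i + k * lda] * a[k + j * lda];
        }
    }
    return 0;
}

/* Step 2 (the sgetrs role): solve for each right-hand side using the factors. */
static void lu_solve(int n, int nrhs, const float *a, int lda,
                     const int *ipiv, float *b, int ldb)
{
    for (int c = 0; c < nrhs; ++c) {
        float *x = b + (size_t)c * ldb;
        for (int k = 0; k < n; ++k) {            /* apply the row swaps to b */
            float t = x[k];
            x[k] = x[ipiv[k]];
            x[ipiv[k]] = t;
        }
        for (int i = 1; i < n; ++i)              /* forward solve, unit lower L */
            for (int j = 0; j < i; ++j)
                x[i] -= a[i + j * lda] * x[j];
        for (int i = n - 1; i >= 0; --i) {       /* back solve with U */
            for (int j = i + 1; j < n; ++j)
                x[i] -= a[i + j * lda] * x[j];
            x[i] /= a[i + i * lda];
        }
    }
}

/* The combined driver: factor once, then solve. b is overwritten with X. */
static int gesv_sketch(int n, int nrhs, float *a, int lda,
                       int *ipiv, float *b, int ldb)
{
    int info = lu_factor(n, a, lda, ipiv);
    if (info == 0)
        lu_solve(n, nrhs, a, lda, ipiv, b, ldb);
    return info;
}
```

The factorization costs on the order of N^3 operations while each right-hand side costs on the order of N^2, which is why the factorization is usually the bottleneck unless the number of right-hand sides is large compared with N.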

I saw that Kyle has pointed you towards the thread that says we don't require CUDA 3.0, which is indeed the case. Our only requirement is that your driver be capable of CUDA 2.3 execution (roughly 190 or higher). I see that you have the 195 driver from the CUDA 3.0 release - we might have a conflict between the 2.3 runtime we use and the 3.0 driver. We have tested the 3.0 driver on Windows with no ill effects but haven't thoroughly investigated it on Linux, since we won't be officially supporting it until the CUDA 3.0 Beta is complete.

What is your GPU? Our script didn't report it on your OpenSUSE system (but it will next version.)
john
Administrator
 
Posts: 587
Joined: Thu Jul 23, 2009 2:31 pm

Re:sgesv in 1.1 is slow...

Postby Boxed Cylon » Wed Dec 02, 2009 8:35 am

I've installed the 1.1 beta and still had the slowdown. I then downgraded to cuda 2.3 with the 1.1 beta, and still have the slowdown. I'm fairly certain that at one point there was a nice speed up with this routine... It looks like the problem is likely to be on my end somewhere, but I'll be damned if I know what it might be...

Just to recap, when I run the test matlab script above for ever increasing array sizes using the attached *.cu file above, I am finding that the GPU is significantly slower than the CPU.

I'll have to continue to poke around and see if I can sort it out.

My GPU is a GTX 260:

Code: Select all
./deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA

Device 0: "GT200"
  CUDA Driver Version:                           2.30
  CUDA Runtime Version:                          2.30
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         3
  Total amount of global memory:                 939196416 bytes
  Number of multiprocessors:                     24
  Number of cores:                               192
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.30 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     Yes
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

Test PASSED

Boxed Cylon
 
Posts: 48
Joined: Fri Oct 16, 2009 8:57 pm

Re:sgesv in 1.1 is slow...

Postby john » Wed Dec 02, 2009 11:42 am

Can we try out the CULA host interface quickly?

The replacement would be from this:
Code: Select all
      // Make modulo 32 dimensions  - speeds up the sgemm calculations significantly
      // Just used as an example here.
       Ic=I+(32-I%32);
       Lc=L+(32-L%32);

      cublasInit();
      culaInitialize();

      cublasAlloc (Lc*Lc, sizeof(float), (void**)&ga);
      cublasAlloc (Lc*Ic, sizeof(float), (void**)&gb);
      cudaMemset(ga,0,Lc*Lc*4);  /* zero these since we've padded them */
      cudaMemset(gb,0,Lc*Ic*4);

      cublasSetMatrix (L, L, sizeof(float), A, L, (void*)ga, Lc);
      cublasSetMatrix (L, I, sizeof(float), B, L, (void*)gb, Lc);

    // Allocate for ipiv - a working matrix used by sgesv, and ignored here.
      cublasAlloc (L, sizeof(int), (void**)&ipiv);

    // Ready to go...
    // First numbers L, I pertain only to the non-padded sections of the arrays.
      status = culaDeviceSgesv(L,I,ga,Lc,ipiv,gb,Lc);
    // printf("Runtime error (%d)\n", culaGetErrorInfo());
      checkStatus(status);

    // Get the solution off the GPU
      cublasGetMatrix (L, I, sizeof(float), gb, Lc, X, L);
    // X has the solution we need; now back to matlab after a bit of clean up.

    // Print the first three elements of the first row (debugging)
      printf("X-top = %e %e %e\n",X[0],X[L],X[L+L]); 
    // Print the  last three elements of the  last row (debugging)
      printf("X-bottom = %e %e %e\n",X[L*(I-2)-1],X[L*(I-1)-1],X[L*I-1]); 

    // Clear the variables to avoid GPU memory leak (and GPU crash!)
      cublasFree (ga);
      cublasFree (gb);
      cublasFree (ipiv);
      culaFreeBuffers();
      cublasShutdown();
      culaShutdown(); 


To:
Code: Select all
culaInitialize();
int *ipiv = malloc(L*sizeof(int)); // I'm not 100% sure on whether this is okay in Mex or not
status = culaSgesv(L,I,A,L,ipiv,B,L);
checkStatus(status);
free(ipiv);
culaShutdown();


It will be quite a bit less code and will likely perform better because we have more freedom to make different decisions regarding allocations, transfers, etc.

Also, did you roll back the driver as well as the CUDA toolkit? We haven't had a chance to test how CULA 1.1 does when the system is equipped with a CUDA 3.0 driver yet.
john
Administrator
 
Posts: 587
Joined: Thu Jul 23, 2009 2:31 pm

Re:sgesv in 1.1 is slow...

Postby Boxed Cylon » Wed Dec 02, 2009 12:54 pm

(I did roll back both driver and tools. I've since rolled forward to cuda 3.0/cula 1.1.)

The code to just call culaSgesv is
Code: Select all
#include "mex.h"
#include "cublas.h"
#include "cula.h"
#include "culapackdevice.h"
#include "cuda.h"
#include "sys/time.h"

void checkStatus(culaStatus status)
{
    if(!status)
        return;

    if(status == culaArgumentError)
        printf("Invalid value for parameter %d\n", culaGetErrorInfo());
    else if(status == culaRuntimeError)
        printf("Runtime error (%d)\n", culaGetErrorInfo());
    else
        printf("%s\n", culaGetStatusString(status));

    culaShutdown();
    exit(EXIT_FAILURE);
}

void mexFunction( int nlhs, mxArray *plhs[],
        int nrhs, const mxArray *prhs[])

{
      int I,L;
      int dims0[2];

      // INPUT VARIABLES   %%%%%%%%%%%%%%%%%%%%%%%%%
      // A is dimensioned LXL
      // B is dimensioned LXI
      float *A,*B;
 
      // OUTPUT VARIABLE, X=A\B   %%%%%%%%%%%%%%%%%%
      int i;
      float *X;

      // CUDA/GPU VARIABLES %%%%%%%%%%%%%%%%%%%%%%%%
      int* ipiv = 0;

      culaStatus status;

      if (nrhs != 2) {
         mexErrMsgTxt("gpu_sgesv requires 2 input arguments");
      } else if (nlhs != 1) {
         mexErrMsgTxt("gpu_sgesv requires 1 output argument");
      }

      if ( !mxIsSingle(prhs[0]) || !mxIsSingle(prhs[1]) ) {
           mexErrMsgTxt("Input arrays must be single precision.");
      }


// %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
// Single-precision input arrays
// Dimensions, and then array data
      L = mxGetN(prhs[0]);
      I = mxGetN(prhs[1]);
     // printf("L = %i\n",L);
     // printf("I = %i\n",I);
      A =   (float*) mxGetData(prhs[0]);
      B =   (float*) mxGetData(prhs[1]);

// Left hand side matrix set up    (the solution) 
      dims0[0]=L;
      dims0[1]=I;
      plhs[0] = mxCreateNumericArray(2,dims0,mxSINGLE_CLASS,mxREAL);
      X = (float*) mxGetData(plhs[0]);

      culaInitialize();

    // Allocate for ipiv - a working matrix used by sgesv, and ignored here.
      ipiv = (int *) mxCalloc(L,sizeof(int));

       for(i=0; i<L*I; i++)
        { X[i]=B[i];
        }
    // Ready to go...
    // First numbers L, I pertain only to the non-padded sections of the arrays.
      status = culaSgesv(L,I,A,L,ipiv,X,L);
    // printf("Runtime error (%d)\n", culaGetErrorInfo());
      checkStatus(status);

    // X has the solution we need; now back to matlab after a bit of clean up.

    // Print the first three elements of the first row (debugging)
      printf("X-top = %e %e %e\n",X[0],X[L],X[L+L]); 
    // Print the  last three elements of the  last row (debugging)
      printf("X-bottom = %e %e %e\n",X[L*(I-2)-1],X[L*(I-1)-1],X[L*I-1]); 

    // Clear the variables to avoid GPU memory leak (and GPU crash!)
      mxFree (ipiv);
      culaFreeBuffers();
      culaShutdown(); 

}

(culaSgesv called in this way overwrites the arrays A, X, so the array A is changed in matlab...)

This makes no difference to what I was getting. To wit, if T1 is the CPU time and T2 is the GPU time then I'm getting:
Code: Select all
           N            T1/T2
         10.00          0.34
         50.00          1.13
        100.00          0.42
        200.00          0.33
        500.00          0.26
        750.00          0.23
       1000.00          0.20
       1250.00          0.20
       1500.00          0.20
       2000.00          0.20
       2500.00          0.20
       3000.00          0.22
       4000.00          0.20
       5000.00          0.22

for A=NXN, and B=NX5000. That is, the quad-core CPU is about 5X as fast as the GTX260... For the N=5000 case, T1=6.6s, T2=30.0s.

I had a thought that somehow I had got stuck in GPU emulation mode (how???) I'm not sure how to check that (I've never used emulation mode).
Boxed Cylon
 
Posts: 48
Joined: Fri Oct 16, 2009 8:57 pm

Re:sgesv in 1.1 is slow...

Postby Boxed Cylon » Thu Dec 03, 2009 11:35 pm

Just to note that I am still stuck on this puzzling problem - I am out of ideas of where the fix may be...

I did insert some timing code into the mex file and verified that the call to sgesv was specifically where the slowness was. I gather the benchmarks were o.k., which suggests the problem is specific to this mex file/matlab.
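The timing code itself was not posted. As a minimal sketch of how one call can be bracketed with wall-clock timestamps, the following uses `gettimeofday` from `sys/time.h` (which the mex listing already includes). The helper names `wall_seconds`, `time_call`, and `busy_work` are hypothetical, and `busy_work` only stands in for the solver call.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in seconds, with microsecond resolution. */
static double wall_seconds(void)
{
    struct timeval tv;
    int rc = gettimeofday(&tv, NULL);
    assert(rc == 0);
    (void)rc;                      /* keeps builds with NDEBUG warning-free */
    return (double)tv.tv_sec + 1e-6 * (double)tv.tv_usec;
}

/* Time a single call of fn(arg) and return the elapsed seconds. */
static double time_call(void (*fn)(void *), void *arg)
{
    double t0 = wall_seconds();
    fn(arg);
    return wall_seconds() - t0;
}

/* Stand-in workload: adds the first million terms of the harmonic series
   to the double that arg points to. */
static void busy_work(void *arg)
{
    double *sum = (double *)arg;
    for (int i = 1; i <= 1000000; ++i)
        *sum += 1.0 / (double)i;
}
```

If the timed function only launches asynchronous GPU work, the device has to be synchronized before the second timestamp is taken, otherwise the measurement covers just the launch.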

It was all working perfectly a week ago or so! :(

I did upgrade matlab recently; perhaps I should try downgrading (but that's complicated...) Maybe the recent matlab did something to graphics?
Boxed Cylon
 
Posts: 48
Joined: Fri Oct 16, 2009 8:57 pm

Re:sgesv in 1.1 is slow...

Postby jpeinado » Mon Jan 04, 2010 2:44 am

Hi:

I have the same problem as Boxed Cylon. I am working with MATLAB on an algorithm that makes many calls to the linear solver culaDeviceSgesv and to cublasSgemm, and it is very slow.

My configuration is different from Boxed Cylon's.

It is an Nvidia Quadro FX 5800 and an Intel Xeon quad-core E5430.

About the software:

- Linux, 64-bit
- MATLAB 7.9
- CUDA 2.3
- Several versions of CULA:
  - 1.1 (premium), using an account borrowed from a friend's research group
  - 1.1a (basic)

Both CULA versions give the same bad results.


I think the problem is CULA (not sure) because CUDA works OK from MATLAB.

My algorithm is a little complex to use as a test, so I took a simple example instead. I used this file to check the CULA performance in MATLAB:

http://www.culatools.com/images/fbfiles ... _sgesv.txt


I also used the sgemm routine to find where the problem is. I wrote a MATLAB script, and these are the results:

Code: Select all

                Matrix Product (sgemm) Results CUBLAS

Size      SpeedUp    GflopsCPU    GflopsGPU
-----------------------------------------------------------
  128       0.56       10.62         5.94
  256       1.10       20.83        23.00
  512       1.46       35.97        52.40
 1024       2.28       47.49       108.19
 2048       2.61       54.88       143.39
 4096       2.95       65.47       192.93
 8192       3.08       73.09       225.25



                sgesv with CULA -> culaDeviceSgesv A\B

Size      SpeedUp    GflopsCPU    GflopsGPU
-----------------------------------------------------------
  128       0.37        2.73         1.01
  256       0.32        8.74         2.81
  512       0.43       13.19         5.63
 1024       0.48       16.98         8.07
 2048       0.42       21.84         9.18
 4096       0.35       27.69         9.65
 8192       0.27       36.65         9.73




I also report the results in Gflops.


You can see that all the tests are OK in CUBLAS but not in CULA
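The post does not say which operation counts were used to turn timings into Gflops. A minimal sketch, assuming the usual nominal counts (2N^3 for an N x N matrix product, and (2/3)N^3 for the LU factorization plus 2N^2 per right-hand side for the solve), could look like this. The function names are hypothetical.

```c
#include <assert.h>

/* Nominal flop count for C = A*B with N x N matrices. */
static double gemm_flops(double n)
{
    return 2.0 * n * n * n;
}

/* Nominal flop count for solving A*X = B by LU factorization:
   (2/3) N^3 for the factorization plus 2 N^2 per right-hand side. */
static double gesv_flops(double n, double nrhs)
{
    return (2.0 / 3.0) * n * n * n + 2.0 * n * n * nrhs;
}

/* Convert an operation count and a measured time into Gflops. */
static double gflops(double flops, double seconds)
{
    assert(seconds > 0.0);
    return flops / seconds * 1e-9;
}
```

Under that assumption, the 225.25 Gflops reported for sgemm at size 8192 would correspond to a run time of about 4.9 s, since 2 * 8192^3 is roughly 1.1e12 operations.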



With many thanks in advance


jpeinado
jpeinado
 
Posts: 37
Joined: Mon Sep 14, 2009 10:48 am

Re:sgesv in 1.1 is slow...

Postby john » Tue Jan 05, 2010 12:00 pm

I'm looking into this currently. On my machine I am seeing well over 10x the performance you are noting below (134 Gflops at 8192). I have seen a few reports where mex can be very touchy about performance, but we haven't yet tracked down any reason why this is. We will let you know if we find something, but we don't offer any explicit mex support. In the meantime, it may be worth looking into the Accelereyes Jacket product, which has incorporated CULA into Matlab without problems.

I would like to note that the choice of leading dimension (Lc) isn't well suited to strong performance from either CUBLAS or CULA (I think I mailed Boxed Cylon offline about this topic.) You can observe this in the GEMM results - on my machine with a different LD I see more like a 5-6x speedup at large sizes whereas you are observing a 3x.
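John does not say which leading dimension he used instead. One detail of the padding in the earlier device-interface listing (`Lc=L+(32-L%32)`) is that it always adds between 1 and 32 elements, so a size that is already a multiple of 32 still grows by a full 32. The sketch below compares it with a common round-up idiom that leaves aligned sizes alone. Both function names are hypothetical, and whether this accounts for the slowdown is not established in the thread.

```c
#include <assert.h>

/* Padding as written in the earlier listing: n + (32 - n % 32).
   An already aligned n (for example 4096) grows by 32. */
static int pad_as_posted(int n)
{
    assert(n >= 0);
    return n + (32 - n % 32);
}

/* Round n up to the next multiple of `multiple`;
   sizes that are already aligned are returned unchanged. */
static int round_up(int n, int multiple)
{
    assert(n >= 0 && multiple > 0);
    return (n + multiple - 1) / multiple * multiple;
}
```

For N = 4096 the posted formula gives a leading dimension of 4128, while `round_up(4096, 32)` keeps 4096. For N = 5000 both give 5024.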
john
Administrator
 
Posts: 587
Joined: Thu Jul 23, 2009 2:31 pm

Re:sgesv in 1.1 is slow...

Postby jpeinado » Tue Jan 05, 2010 3:48 pm

Hi John:

Firstly, thank you very much for your help.



john wrote:I'm looking into this currently. On my machine I am seeing well over 10x the performance you are noting below (134 Gflops at 8192). I have seen a few reports where mex can be very touchy about performance, but we haven't yet tracked down any reason why this is. We will let you know if we find something, but we don't offer any explicit mex support. In the meantime, it may be worth looking into the Accelereyes Jacket product, which has incorporated CULA into Matlab without problems.

It is very strange because, as you can see, CUBLAS works OK with mex, but CULA does not. I know that mex files are very sensitive to memory.


john wrote:I would like to note that the choice of leading dimension (Lc) isn't well suited to strong performance from either CUBLAS or CULA (I think I mailed Boxed Cylon offline about this topic.) You can observe this in the GEMM results - on my machine with a different LD I see more like a 5-6x speedup at large sizes whereas you are observing a 3x.


I will try to study this on Thursday the 7th. Today, Jan 6, is the Magician Kings' day in Spain (sorry, I don't know if this is the correct name in English). It is like your Santa Claus day.
I will study this by looking at the code and trying to contact Boxed Cylon.

Anyway, note that the 3x is relative to the CPU. I have a Xeon E5430 (I think it is very powerful). A 5-6x speedup would be almost 400 Gflops on the Quadro FX 5800, and I think this is not possible. Anyway, I suppose that I will have problems when using other problem sizes.
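The estimate can be checked with one multiplication. The sketch below takes the CPU figure of 73.09 Gflops from the sgemm table at size 8192 and scales it by the quoted speedups. The function name is hypothetical.

```c
#include <assert.h>

/* GPU throughput implied by a CPU throughput and a claimed speedup factor. */
static double implied_gpu_gflops(double cpu_gflops, double speedup)
{
    assert(cpu_gflops >= 0.0 && speedup >= 0.0);
    return cpu_gflops * speedup;
}
```

With 73.09 Gflops on the CPU, a 5x speedup implies about 365 Gflops and a 6x speedup about 439 Gflops, which brackets the "almost 400 Gflops" figure.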


Thank you very much

jpeinado
jpeinado
 
Posts: 37
Joined: Mon Sep 14, 2009 10:48 am
