sgesv in 1.1 is slow...
In the new 1.1 (non beta) I am finding that sgesv is painfully slow - slower than the CPU. Any reason why that might be? I have tried the linux and RHEL versions, both 64-bit. And I am using culaDeviceSgesv. Alas, I have deleted my 1.1 beta... I'm pretty sure that with 1.1 beta I was getting speed ups of 6-7X.
- Boxed Cylon
- Posts: 48
- Joined: Fri Oct 16, 2009 8:57 pm
Re: sgesv in 1.1 is slow...
Hi Boxed Cylon,
To help debug the problem, can you provide us with some information?
- What problem sizes are you running?
- What software package are you testing against?
- What are your performance numbers?
Also, please run the sysinfo.sh script (in examples) and post the output here.
- dan
- Administrator
- Posts: 61
- Joined: Thu Jul 23, 2009 2:29 pm
Re: sgesv in 1.1 is slow...
I am attaching (I hope) the testing routine I am using. This routine is a MATLAB MEX file. The test script I use is a loop over ever-increasing array dimensions up to 5000x5000:
- Code: Select all
clear all
N=[10 50 100 200 500 750 1000 1250 1500 2000 2500 3000 4000 5000];
ETR=[];
for K=N,
A=randn(K,K,'double');
B=randn(K,5000,'double');
A=single(A);
B=single(B);
disp('CPU: 1st call')
tic
Lp=A\B;
toc
disp(' ')
disp('CPU: 2nd call')
tic
Lp=A\B;
T1=toc
Lp(1,1:3)
Lp(end,(end-2):end)
if 1==2,
[Q,R]=qr(A); X = R\(Q'*B);
X13=X(1,1:3)
Xend=X(end,(end-2):end)
end
if 1==1,
disp('GPU: 1st call')
tic
[X]= gpu_sgesv(A,B);
toc
disp(' ')
disp('GPU: 2nd call')
tic
[X]= gpu_sgesv(A,B);
T2=toc
end
X(1,1:3)
X(end,(end-2):end)
ETR=[ETR T1/T2]
end
plot(N,ETR)
With the new 1.1, I find T1/T2 decreases to 0.6-0.7, that is the GPU calculation is taking longer than the CPU. As I recall with 1.1 beta this ratio increased to 6-7, that is the GPU calculation was 6-7 times faster with the larger array sizes.
It is very strange - perhaps I am doing something wrong, but that would have to be in the compile stage. I am now using CUDA 3.0 beta, since the new 1.1 seems to require it with linux. Attachment: gpu_sgesv-20091201.txt (http://www.culatools.com/images/fbfiles/files/gpu_sgesv-20091201.txt)
- Boxed Cylon
- Posts: 48
- Joined: Fri Oct 16, 2009 8:57 pm
Re: sgesv in 1.1 is slow...
Quick question, is GEQRF (or any other non-LU based function) suffering this same performance drop?
Also, this statement is incorrect:
Boxed Cylon wrote: It is very strange - perhaps I am doing something wrong, but that would have to be in the compile stage. I am now using CUDA 3.0 beta, since the new 1.1 seems to require it with linux.
See this post for some more details there.
- kyle
- Administrator
- Posts: 301
- Joined: Fri Jun 12, 2009 7:47 pm
Re:sgesv in 1.1 is slow...
Most curious - I ran the benchmark example:
- Code: Select all
make build64
sh ../checkenvironment.sh
gcc -m64 -o benchmark benchmark.c -DNDEBUG -O3 -I/usr/local/cula/include -I/opt/intel/mkl/10.1.1.019/include/ -L/usr/local/cula/lib64 -L/opt/intel/mkl/10.1.1.019/lib/em64t/ -lmkl_lapack -lmkl_core -lmkl_intel_lp64 -lmkl_intel_thread -liomp5 -lcula -lcublas -lcudart -lpthread
ls
benchmark benchmark.c Makefile
./benchmark
Initializing CULA...
-- SGEQRF Benchmark --
Size CULA (s) MKL (s) Speedup
------ ---------- --------- ---------
4096 0.69 1.90 2.7654
5120 1.19 2.78 2.3336
6144 1.96 4.85 2.4775
7168 3.01 7.43 2.4698
8192 3.34 15.80 4.7306
-- SGETRF Benchmark --
Size CULA (s) MKL (s) Speedup
------ ---------- --------- ---------
4096 0.41 1.33 3.2752
5120 0.69 1.63 2.3678
6144 1.09 2.82 2.5829
7168 1.58 4.08 2.5810
8192 2.23 11.48 5.1368
-- SGELS Benchmark --
Size CULA (s) MKL (s) Speedup
------ ---------- --------- ---------
4096 0.90 1.65 1.8366
5120 1.56 2.98 1.9179
6144 2.47 5.01 2.0279
7168 3.69 7.75 2.0992
8192 4.09 11.36 2.7754
-- SGGLSE Benchmark --
Size CULA (s) MKL (s) Speedup
------ ---------- --------- ---------
4096 1.01 6.31 6.2335
5120 1.71 8.39 4.9028
6144 2.69 12.85 4.7818
7168 3.96 18.25 4.6065
8192 4.88 42.71 8.7554
-- SGESVD Benchmark --
Size CULA (s) MKL (s) Speedup
------ ---------- --------- ---------
4096 37.87 117.42 3.1005
5120 66.72 167.24 2.5066
At this point the test ran ad infinitum...completing no other cases...
I think I will try a reboot, just in case...
- Boxed Cylon
- Posts: 48
- Joined: Fri Oct 16, 2009 8:57 pm
Re:sgesv in 1.1 is slow...
Hi Boxed Cylon,
Do you recall if you ran the benchmark example previously and saw similar speedups or has the performance been reduced for these as well? Depending on your system, the speedups you show here can be very typical.
Also, can you please run the sysinfo.sh script (in $CULA_ROOT/examples) and post the output here? It would help a tremendous amount when debugging to know info like your OS, GPU, GPU driver, etc., all of which the sysinfo script collects easily for us.
Dan
- dan
- Administrator
- Posts: 61
- Joined: Thu Jul 23, 2009 2:29 pm
Re:sgesv in 1.1 is slow...
The output of sysinfo.sh is below. I don't think I ran the benchmark routine before, but my test routine showed the gpu quite a bit faster (I'm not sure if it was 6-7X now; perhaps only 3X).
- Code: Select all
sh sysinfo.sh
System Information Utility
Copyright EM Photonics, Inc.
--------------------------------------------------------------------------------
Operating System
--------------------------------------------------------------------------------
Linux skipjack 2.6.27.39-0.2-default #1 SMP 2009-11-23 12:57:38 +0100 x86_64 x86_64 x86_64 GNU/Linux
Welcome to openSUSE 11.1 - Kernel \r (\l).
--------------------------------------------------------------------------------
CPU Info
--------------------------------------------------------------------------------
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : AMD Phenom(tm) II X4 940 Processor
stepping : 2
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
bogomips : 6027.42
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
processor : 1
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : AMD Phenom(tm) II X4 940 Processor
stepping : 2
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 4
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
bogomips : 6027.53
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
processor : 2
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : AMD Phenom(tm) II X4 940 Processor
stepping : 2
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 4
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
bogomips : 6027.57
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
processor : 3
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : AMD Phenom(tm) II X4 940 Processor
stepping : 2
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 4
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
bogomips : 6027.62
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
--------------------------------------------------------------------------------
Memory Info
--------------------------------------------------------------------------------
MemTotal: 8180468 kB
MemFree: 7048136 kB
Buffers: 229416 kB
Cached: 556644 kB
SwapCached: 0 kB
Active: 448448 kB
Inactive: 514380 kB
SwapTotal: 2104472 kB
SwapFree: 2104472 kB
Dirty: 124 kB
Writeback: 0 kB
AnonPages: 176756 kB
Mapped: 93416 kB
Slab: 107084 kB
SReclaimable: 89144 kB
SUnreclaim: 17940 kB
PageTables: 10076 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 6194704 kB
Committed_AS: 456416 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 326452 kB
VmallocChunk: 34359410171 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 45952 kB
DirectMap2M: 2050048 kB
DirectMap1G: 6291456 kB
--------------------------------------------------------------------------------
NVIDIA GPU Info
--------------------------------------------------------------------------------
NVRM version: NVIDIA UNIX x86_64 Kernel Module 195.17 Mon Oct 26 06:19:11 PST 2009
GCC version: gcc version 4.3.2 [gcc-4_3-branch revision 141291] (SUSE Linux)
--------------------------------------------------------------------------------
Environment
--------------------------------------------------------------------------------
LESSKEY=/etc/lesskey.bin
NNTPSERVER=news
MANPATH=/usr/lib64/mpi/gcc/openmpi/share/man:/usr/local/man:/usr/local/share/man:/usr/share/man:/opt/sun/sunstudio12.1/man:/usr/local/GMT/man:/opt/sun/sunstudio12.1/man:/usr/local/GMT/man
INFODIR=/usr/local/info:/usr/share/info:/usr/info
SSH_AGENT_PID=4279
DM_CONTROL=/var/run/xdmctl
HOSTNAME=skipjack
XKEYSYMDB=/usr/share/X11/XKeysymDB
CULA_BIN_PATH_32=/usr/local/cula/bin
GPG_AGENT_INFO=/tmp/gpg-e0zxSN/S.gpg-agent:4263:1
TERM=xterm
SHELL=/bin/bash
HOST=skipjack
XDG_SESSION_COOKIE=b700cd53d78bb51b813c3a004769d463-1259725067.625389-1478094789
XDM_MANAGED=/var/run/xdmctl/xdmctl-:0,maysd,mayfn,sched,rsvd,method=classic,auto
HISTSIZE=1000
PROFILEREAD=true
GTK2_RC_FILES=/etc/gtk-2.0/gtkrc:/usr/share/themes//QtCurve/gtk-2.0/gtkrc:/home/xxxxxx/.gtkrc-2.0-qtengine:/home/xxxxxx/.gtkrc-2.0
TMPDIR=/tmp
GS_LIB=/home/xxxxxx/.fonts
MORE=-sl
WINDOWID=29360136
XSESSION_IS_UP=yes
KDE_FULL_SESSION=true
USER=xxxxxx
JRE_HOME=/usr/lib64/jvm/jre
GMTHOME=/usr/local/GMT
DESKTOP_LAUNCH=kde-open
LD_LIBRARY_PATH=/usr/local/cula/lib64:/usr/local/lib:/opt/intel/Compiler/11.0/074/lib/intel64/:/opt/intel/mkl/10.1.1.019/lib/em64t:/usr/local/cuda/lib64:/opt/sun/sunstudio12.1/lib/amd64/:/opt/sun/sunstudio12.1/rtlibs/amd64/:/opt/intel/mkl/10.1.1.019/tools/builder/
LS_COLORS=no=00:fi=00:di=01;34:ln=00;36:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=41;33;01:ex=00;32:*.cmd=00;32:*.exe=01;32:*.com=01;32:*.bat=01;32:*.btm=01;32:*.dll=01;32:*.tar=00;31:*.tbz=00;31:*.tgz=00;31:*.rpm=00;31:*.deb=00;31:*.arj=00;31:*.taz=00;31:*.lzh=00;31:*.lzma=00;31:*.zip=00;31:*.zoo=00;31:*.z=00;31:*.Z=00;31:*.gz=00;31:*.bz2=00;31:*.tb2=00;31:*.tz2=00;31:*.tbz2=00;31:*.xz=00;31:*.avi=01;35:*.bmp=01;35:*.fli=01;35:*.gif=01;35:*.jpg=01;35:*.jpeg=01;35:*.mng=01;35:*.mov=01;35:*.mpg=01;35:*.pcx=01;35:*.pbm=01;35:*.pgm=01;35:*.png=01;35:*.ppm=01;35:*.tga=01;35:*.tif=01;35:*.xbm=01;35:*.xpm=01;35:*.dl=01;35:*.gl=01;35:*.wmv=01;35:*.aiff=00;32:*.au=00;32:*.mid=00;32:*.mp3=00;32:*.ogg=00;32:*.voc=00;32:*.wav=00;32:
XNLSPATH=/usr/share/X11/nls
ENV=/etc/bash.bashrc
MKL_INC_PATH=/opt/intel/mkl/10.1.1.019/include/
HOSTTYPE=x86_64
SSH_AUTH_SOCK=/tmp/ssh-XtZHS3723/agent.3723
FROM_HEADER=
SESSION_MANAGER=local/skipjack:@/tmp/.ICE-unix/4674,unix/skipjack:/tmp/.ICE-unix/4674
PAGER=less
CSHEDIT=emacs
XDG_CONFIG_DIRS=/etc/xdg
MINICOM=-c on
KONSOLE_DCOP=DCOPRef(konsole-4735,konsole)
DESKTOP_SESSION=default
PATH=/usr/local/mpich2/bin/:.:/opt/kde3/bin:/usr/local/mpich2/bin/:.:/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/X11R6/bin:/usr/games:/usr/lib64/jvm/jre/bin:/usr/lib/mit/bin:/usr/lib/mit/sbin:/opt/intel/Compiler/11.0/074/bin/intel64/:/usr/local/GMT/bin:/opt/sun/sunstudio12.1/bin:/usr/local/cuda/bin:/opt/intel/Compiler/11.0/074/bin/intel64/:/usr/local/GMT/bin:/opt/sun/sunstudio12.1/bin:/usr/local/cuda/bin
MAIL=/var/spool/mail/xxxxxx
CPU=x86_64
QT_IM_MODULE=xim
KDM_AUTOLOGIN=xxxxxxx
JAVA_BINDIR=/usr/lib64/jvm/jre/bin
BC_ENV_ARGS=/home/xxxxxxx/.bcrc
CULA_LIB_PATH_32=/usr/local/cula/lib
PWD=/usr/local/cula/examples
INPUTRC=/home/xxxxxx/.inputrc
KONSOLE_DCOP_SESSION=DCOPRef(konsole-4735,session-1)
XMODIFIERS=@im=local
JAVA_HOME=/usr/lib64/jvm/jre
KDE_SESSION_UID=1000
LANG=en_US.UTF-8
PYTHONSTARTUP=/etc/pythonstart
SSH_ASKPASS=/usr/lib64/ssh/x11-ssh-askpass
CULA_BIN_PATH_64=/usr/local/cula/bin64
SHLVL=3
HOME=/home/xxxxxxx
QT_SYSTEM_DIR=/usr/share/desktop-data
OSTYPE=linux
LESS_ADVANCED_PREPROCESSOR=no
XCURSOR_THEME=DMZ
LS_OPTIONS=-N --color=tty -T 0
WINDOWMANAGER=/usr/bin/kde
LOGNAME=xxxxxxx
MACHTYPE=x86_64-suse-linux
LESS=-M -I
G_FILENAME_ENCODING=@locale,UTF-8,ISO-8859-15,CP1252
CVS_RSH=ssh
CULA_ROOT=/usr/local/cula
XDG_DATA_DIRS=/usr/local/share:/usr/share:/etc/opt/kde3/share:/opt/kde3/share
LESSOPEN=lessopen.sh %s
MKL_LIB_PATH_64=/opt/intel/mkl/10.1.1.019/lib/em64t/
USE_FAM=
INFOPATH=/usr/local/info:/usr/share/info:/usr/info
DISPLAY=:0
CULA_INC_PATH=/usr/local/cula/include
CULA_LIB_PATH_64=/usr/local/cula/lib64
GTK_IM_MODULE=cedilla
XAUTHLOCALHOSTNAME=skipjack
LESSCLOSE=lessclose.sh %s %s
QT_IM_SWITCHER=imsw-multi
G_BROKEN_FILENAMES=1
COLORTERM=
JAVA_ROOT=/usr/lib64/jvm/jre
_=/usr/bin/env
--------------------------------------------------------------------------------
Tools
--------------------------------------------------------------------------------
gcc (SUSE Linux) 4.3.2 [gcc-4_3-branch revision 141291]
Copyright (C) 2008 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
GNU ld (GNU Binutils; openSUSE 11.1) 2.19
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2009 NVIDIA Corporation
Built on Mon_Oct_26_09:40:14_PDT_2009
Cuda compilation tools, release 3.0, V0.2.1221
- Boxed Cylon
- Posts: 48
- Joined: Fri Oct 16, 2009 8:57 pm
Re:sgesv in 1.1 is slow...
Boxed Cylon wrote:
- Code: Select all
-- SGESVD Benchmark --
Size CULA (s) MKL (s) Speedup
------ ---------- --------- ---------
4096 37.87 117.42 3.1005
5120 66.72 167.24 2.5066
At this point the test ran ad infinitum...completing no other cases...
I think I will try a reboot, just in case...
Unfortunately the SVD test likes to take a rather long time, I'm afraid. The next few lines will each take as long as 5-10 minutes to generate. It's a good time to grab a new cup of coffee; SVD is a rather intense routine. Notably, when you call sgesv (your routine of interest), the sequence is sgetrf and then one additional routine. The sgetrf is the bottleneck of that, and you're seeing good speedups on that part of the benchmark suite, so we're getting closer.
I saw that Kyle has pointed you towards the thread that says we don't require CUDA 3.0, which is indeed the case. Our only requirement is that your driver be capable of CUDA 2.3 execution (roughly 190 or higher). I see that you have the 195 driver from the CUDA 3.0 release - we might have a conflict between the 2.3 runtime we use and the 3.0 driver. We have tested the 3.0 driver on Windows with no ill effects but haven't thoroughly investigated it on Linux since we won't be officially supporting it until the CUDA 3.0 Beta is complete.
What is your GPU? Our script didn't report it on your OpenSUSE system (but it will next version.)
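As a rough check on where the work goes for the test shape used earlier in the thread (A is N x N, B has 5000 columns), here is a minimal sketch using the usual leading-order LAPACK operation counts: about (2/3)n^3 for the factorization and about 2*n^2*nrhs for the solve that follows it. The function names are made up for illustration and are not part of CULA. These are operation counts only and say nothing about how fast either step actually runs on a given GPU.

```c
/* Leading-order flop count for the LU factorization (sgetrf) of an n x n matrix. */
double getrf_flops(double n) {
    return (2.0 / 3.0) * n * n * n;
}

/* Leading-order flop count for the triangular solves (sgetrs) with nrhs right-hand sides. */
double getrs_flops(double n, double nrhs) {
    return 2.0 * n * n * nrhs;
}

/* Fraction of the total sgesv operation count that belongs to the factorization. */
double factor_fraction(double n, double nrhs) {
    double f = getrf_flops(n);
    double s = getrs_flops(n, nrhs);
    return f / (f + s);
}
```

With a single right-hand side the factorization accounts for nearly all of the count. For N = 5000 with 5000 right-hand sides, the solve count is three times the factorization count, so factor_fraction(5000, 5000) comes out at 0.25.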
- john
- Administrator
- Posts: 587
- Joined: Thu Jul 23, 2009 2:31 pm
Re:sgesv in 1.1 is slow...
I've installed the 1.1 beta and still had the slowdown. I then downgraded to cuda 2.3 with the 1.1 beta, and still have the slowdown. I'm fairly certain that at one point there was a nice speed up with this routine... It looks like the problem is likely to be on my end somewhere, but I'll be damned if I know what it might be...
Just to recap, when I run the test matlab script above for ever increasing array sizes using the attached *.cu file above, I am finding that the GPU is significantly slower than the CPU.
I'll have to continue to poke around and see if I can sort it out.
My GPU is a GTX 260:
- Code: Select all
./deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
There is 1 device supporting CUDA
Device 0: "GT200"
CUDA Driver Version: 2.30
CUDA Runtime Version: 2.30
CUDA Capability Major revision number: 1
CUDA Capability Minor revision number: 3
Total amount of global memory: 939196416 bytes
Number of multiprocessors: 24
Number of cores: 192
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 16384
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 262144 bytes
Texture alignment: 256 bytes
Clock rate: 1.30 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Test PASSED
- Boxed Cylon
- Posts: 48
- Joined: Fri Oct 16, 2009 8:57 pm
Re:sgesv in 1.1 is slow...
Can we try out the CULA host interface quickly?
The replacement would be from this:
- Code: Select all
// Make modulo 32 dimensions - speeds up the sgemm calculations significantly
// Just used as an example here.
Ic=I+(32-I%32);
Lc=L+(32-L%32);
cublasInit();
culaInitialize();
cublasAlloc (Lc*Lc, sizeof(float), (void**)&ga);
cublasAlloc (Lc*Ic, sizeof(float), (void**)&gb);
cudaMemset(ga,0,Lc*Lc*4); /* zero these since we've padded them */
cudaMemset(gb,0,Lc*Ic*4);
cublasSetMatrix (L, L, sizeof(float), A, L, (void*)ga, Lc);
cublasSetMatrix (L, I, sizeof(float), B, L, (void*)gb, Lc);
// Allocate for ipiv - a working matrix used by sgesv, and ignored here.
cublasAlloc (L, sizeof(int), (void**)&ipiv);
// Ready to go...
// First numbers L, I pertain only to the non-padded sections of the arrays.
status = culaDeviceSgesv(L,I,ga,Lc,ipiv,gb,Lc);
// printf("Runtime error (%d)\n", culaGetErrorInfo());
checkStatus(status);
// Get the solution off the GPU
cublasGetMatrix (L, I, sizeof(float), gb, Lc, X, L);
// X has the solution we need; now back to matlab after a bit of clean up.
// Print the first three elements of the first row (debugging)
printf("X-top = %e %e %e\n",X[0],X[L],X[L+L]);
// Print the last three elements of the last row (debugging)
printf("X-bottom = %e %e %e\n",X[L*(I-2)-1],X[L*(I-1)-1],X[L*I-1]);
// Clear the variables to avoid GPU memory leak (and GPU crash!)
cublasFree (ga);
cublasFree (gb);
cublasFree (ipiv);
culaFreeBuffers();
cublasShutdown();
culaShutdown();
To:
- Code: Select all
culaInitialize();
int *ipiv = (int *)malloc(L*sizeof(int)); // I'm not 100% sure on whether this is okay in Mex or not; the cast is needed if the file is compiled as C++ (e.g. a .cu file)
status = culaSgesv(L,I,A,L,ipiv,B,L);
checkStatus(status);
free(ipiv);
culaShutdown();
It will be quite a bit less code and will likely perform better because we have more freedom to make different decisions regarding allocations, transfers, etc.
Also, did you roll back the driver as well as the CUDA toolkit? We haven't had a chance to test how CULA 1.1 does when the system is equipped with a CUDA 3.0 driver yet.
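A side note on the padding lines in the device-interface code quoted above: the expression I+(32-I%32) adds a full extra 32 when I is already a multiple of 32. A minimal sketch of a round-up helper that leaves exact multiples alone follows. The names round_up and pad_as_posted are hypothetical, and this says nothing about whether padding to 32 is the right leading dimension for CULA in the first place.

```c
/* Round n up to the next multiple of m (m > 0).
   n is returned unchanged if it is already a multiple of m. */
int round_up(int n, int m) {
    return ((n + m - 1) / m) * m;
}

/* The padding expression from the posted code, kept here for comparison. */
int pad_as_posted(int n) {
    return n + (32 - n % 32);
}
```

For 5000 both give 5024. For 4096 the helper returns 4096, while the posted expression returns 4128.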
- john
- Administrator
- Posts: 587
- Joined: Thu Jul 23, 2009 2:31 pm
Re:sgesv in 1.1 is slow...
(I did roll back both driver and tools. I've since rolled forward to cuda 3.0/cula 1.1.)
The code to just call culaSgesv is
- Code: Select all
#include "mex.h"
#include "cublas.h"
#include "cula.h"
#include "culapackdevice.h"
#include "cuda.h"
#include "sys/time.h"
void checkStatus(culaStatus status)
{
if(!status)
return;
if(status == culaArgumentError)
printf("Invalid value for parameter %d\n", culaGetErrorInfo());
else if(status == culaRuntimeError)
printf("Runtime error (%d)\n", culaGetErrorInfo());
else
printf("%s\n", culaGetStatusString(status));
culaShutdown();
exit(EXIT_FAILURE);
}
void mexFunction( int nlhs, mxArray *plhs[],
int nrhs, const mxArray *prhs[])
{
int I,L;
int dims0[2];
// INPUT VARIABLES %%%%%%%%%%%%%%%%%%%%%%%%%
// A is dimensioned LXL
// B is dimensioned LXI
float *A,*B;
// OUTPUT VARIABLE, X=A\B %%%%%%%%%%%%%%%%%%
int i;
float *X;
// CUDA/GPU VARIABLES %%%%%%%%%%%%%%%%%%%%%%%%
int* ipiv = 0;
culaStatus status;
if (nrhs != 2) {
mexErrMsgTxt("gpu_sgesv requires 2 input arguments");
} else if (nlhs != 1) {
mexErrMsgTxt("gpu_sgesv requires 1 output argument");
}
if ( !mxIsSingle(prhs[0]) || !mxIsSingle(prhs[1]) ) {
mexErrMsgTxt("Input arrays must be single precision.");
}
// %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
// Single-precision input arrays */
// Dimensions, and then array data
L = mxGetN(prhs[0]);
I = mxGetN(prhs[1]);
// printf("L = %i\n",L);
// printf("I = %i\n",I);
A = (float*) mxGetData(prhs[0]);
B = (float*) mxGetData(prhs[1]);
// Left hand side matrix set up (the solution)
dims0[0]=L;
dims0[1]=I;
plhs[0] = mxCreateNumericArray(2,dims0,mxSINGLE_CLASS,mxREAL);
X = (float*) mxGetData(plhs[0]);
culaInitialize();
// Allocate for ipiv - a working matrix used by sgesv, and ignored here.
ipiv = (int *) mxCalloc(L,sizeof(int));
for(i=0; i<L*I; i++)
{ X[i]=B[i];
}
// Ready to go...
// First numbers L, I pertain only to the non-padded sections of the arrays.
status = culaSgesv(L,I,A,L,ipiv,X,L);
// printf("Runtime error (%d)\n", culaGetErrorInfo());
checkStatus(status);
// X has the solution we need; now back to matlab after a bit of clean up.
// Print the first three elements of the first row (debugging)
printf("X-top = %e %e %e\n",X[0],X[L],X[L+L]);
// Print the last three elements of the last row (debugging)
printf("X-bottom = %e %e %e\n",X[L*(I-2)-1],X[L*(I-1)-1],X[L*I-1]);
// Clear the variables to avoid GPU memory leak (and GPU crash!)
mxFree (ipiv);
culaFreeBuffers();
culaShutdown();
}
(culaSgesv called in this way overwrites the arrays A, X, so the array A is changed in matlab...)
This makes no difference to what I was getting. To wit, if T1 is the CPU time and T2 is the GPU time then I'm getting:
- Code: Select all
N T1/T2
10.00 0.34
50.00 1.13
100.00 0.42
200.00 0.33
500.00 0.26
750.00 0.23
1000.00 0.20
1250.00 0.20
1500.00 0.20
2000.00 0.20
2500.00 0.20
3000.00 0.22
4000.00 0.20
5000.00 0.22
for A=NXN, and B=NX5000. That is, the quad-core CPU is about 5X as fast as the GTX260... For the N=5000 case, T1=6.6s, T2=30.0s.
I had a thought that somehow I had got stuck in GPU emulation mode (how???) I'm not sure how to check that (I've never used emulation mode).
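On the overwriting of A noted above: the MEX code already copies B into X before the call, and the same can be done for A so that the MATLAB array is left untouched. A minimal sketch in plain C is below. copy_floats is a hypothetical helper; inside a MEX file one would presumably allocate with mxMalloc and release with mxFree rather than malloc and free.

```c
#include <stdlib.h>
#include <string.h>

/* Return a heap copy of an n-element float array, or NULL if allocation fails.
   The caller owns the copy and must free() it. */
float *copy_floats(const float *src, size_t n) {
    float *dst = (float *)malloc(n * sizeof(float));
    if (dst != NULL) {
        memcpy(dst, src, n * sizeof(float));
    }
    return dst;
}
```

The solver would then be handed the copy (L*L elements for A) instead of the pointer returned by mxGetData, at the cost of one extra host-side copy per call.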
- Boxed Cylon
- Posts: 48
- Joined: Fri Oct 16, 2009 8:57 pm
Re:sgesv in 1.1 is slow...
Just to note that I am still stuck on this puzzling problem - I am out of ideas of where the fix may be...
I did insert some timing code into the mex file and verified that the call to sgesv was specifically where the slowness was. I gather the benchmarks were o.k., which suggests the problem is specific to this mex file/matlab.
It was all working perfectly a week ago or so! :(
I did upgrade matlab recently; perhaps I should try downgrading (but that's complicated...) Maybe the recent matlab did something to graphics?
- Boxed Cylon
- Posts: 48
- Joined: Fri Oct 16, 2009 8:57 pm
Re:sgesv in 1.1 is slow...
Hi:
I have the same problem as Boxed Cylon. I am working in MATLAB on an algorithm that calls the linear solver culaDeviceSgesv and cublasSgemm many times, and it is very slow.
My config is different from Boxed Cylon's.
It is an Nvidia Quadro FX 5800 and an Intel Xeon quad-core E5430.
About the software:
- Linux 64 bits
- MATLAB 7.9
- CUDA 2.3
- Several versions of CULA:
- 1.1 (premium), using an account borrowed from a friend's research group
- 1.1a (basic)
Both give the same bad results.
I think the problem is CULA (not sure) because CUDA works OK from MATLAB.
My own algorithm is a little complex, so I used a simple example to test this. I took this file to measure CULA performance in MATLAB:
http://www.culatools.com/images/fbfiles ... _sgesv.txt
I also used the sgemm routine to find where the problem is. I wrote a MATLAB script, and these are the results:
- Code: Select all
Matrix Product (sgemm) Results CUBLAS
Size SpeedUp GflopsCPU GflopsGPU
-----------------------------------------------------------
128 0.56 10.62 5.94
256 1.10 20.83 23.00
512 1.46 35.97 52.40
1024 2.28 47.49 108.19
2048 2.61 54.88 143.39
4096 2.95 65.47 192.93
8192 3.08 73.09 225.25
sgesv with CULA -> culaDeviceSgesv A\B
Size SpeedUp GflopsCPU GflopsGPU
-----------------------------------------------------------
128 0.37 2.73 1.01
256 0.32 8.74 2.81
512 0.43 13.19 5.63
1024 0.48 16.98 8.07
2048 0.42 21.84 9.18
4096 0.35 27.69 9.65
8192 0.27 36.65 9.73
I also report the rates in Gflops.
You can see that all the tests are OK in CUBLAS but not in CULA.
With many thanks in advance
jpeinado
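The post does not say how the Gflops columns were computed. A minimal sketch of the conventional calculation for a square matrix product, assuming the usual 2*n^3 operation count, is below; the function names are hypothetical. It also shows that the SpeedUp column is just the ratio of the two rate columns.

```c
/* Gflop/s for an n x n by n x n matrix product that took `seconds`,
   assuming the conventional 2*n^3 operation count. */
double gemm_gflops(double n, double seconds) {
    return 2.0 * n * n * n / seconds / 1.0e9;
}

/* Speedup of the GPU over the CPU, given the rate each achieved on the same problem. */
double speedup_from_rates(double gflops_cpu, double gflops_gpu) {
    return gflops_gpu / gflops_cpu;
}
```

For the 8192 row of the sgemm table, 225.25 / 73.09 is about 3.08, and for the 8192 row of the sgesv table, 9.73 / 36.65 is about 0.27, both matching the SpeedUp column.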
- jpeinado
- Posts: 37
- Joined: Mon Sep 14, 2009 10:48 am
Re:sgesv in 1.1 is slow...
I'm looking into this currently. On my machine I am seeing well over 10x the performance you are noting below (134 Gflops at 8192). I have seen a few reports where mex can be very touchy about performance, but we haven't yet tracked down any reason why this is. We will let you know if we find something, but we don't offer any explicit mex support. In the meantime, it may be worth looking into the Accelereyes Jacket product, which has incorporated CULA into Matlab without problems.
I would like to note that the choice of leading dimension (Lc) isn't well suited to strong performance from either CUBLAS or CULA (I think I mailed Boxed Cylon offline about this topic.) You can observe this in the GEMM results - on my machine with a different LD I see more like a 5-6x speedup at large sizes whereas you are observing a 3x.
- john
- Administrator
- Posts: 587
- Joined: Thu Jul 23, 2009 2:31 pm
Re:sgesv in 1.1 is slow...
Hi John:
Firstly, thank you very much for your help.
john wrote: I'm looking into this currently. On my machine I am seeing well over 10x the performance you are noting below (134 Gflops at 8192). I have seen a few reports where mex can be very touchy about performance, but we haven't yet tracked down any reason why this is. We will let you know if we find something, but we don't offer any explicit mex support. In the meantime, it may be worth looking into the Accelereyes Jacket product, which has incorporated CULA into Matlab without problems.
It is very strange because, as you can see, CUBLAS works OK with mex, but CULA does not... I know that mex files are very sensitive to memory.
john wrote: I would like to note that the choice of leading dimension (Lc) isn't well suited to strong performance from either CUBLAS or CULA (I think I mailed Boxed Cylon offline about this topic.) You can observe this in the GEMM results - on my machine with a different LD I see more like a 5-6x speedup at large sizes whereas you are observing a 3x.
I will try to look into this on Thursday the 7th... Today, Jan 6, is the Magician Kings day in Spain (sorry, I don't know if this is the correct name in English). It is like your Santa Claus day...
I will try to study this by looking at the code and trying to contact Boxed Cylon.
Anyway, please note that the 3x is relative to the CPU. I have a Xeon E5430 (which I think is very powerful). A 5-6x speedup would be almost 400 Gflops on the Quadro FX 5800, and I think this is not possible... Anyway, I suppose that I will have problems when using other problem sizes...
Thank you very much
jpeinado
- jpeinado
- Posts: 37
- Joined: Mon Sep 14, 2009 10:48 am