Poor CULA dtrsm performance

PostPosted: Fri Dec 14, 2012 8:22 am
by jhogg
Hi,

I'm just benchmarking CULA dtrsm with a single right-hand-side against CUBLAS, MAGMA, Host MKL and my own code to solve Lx=b. CULA seems to be coming off really badly, as evidenced by the numbers below.

I'm using the following code:

culaInitialize();
cudaThreadSynchronize();   /* make sure no earlier work skews the timing */
clock_gettime(CLOCK_REALTIME, &tp1);
/* solve L*x = b: left side, lower triangular, no transpose, unit diagonal */
culaDeviceDtrsm('L', 'L', 'N', 'U', n, 1, double(1.0), a_gpu, lda, x_gpu, n);
cudaThreadSynchronize();   /* ensure the solve has finished before stopping the clock */
clock_gettime(CLOCK_REALTIME, &tp2);
culaShutdown();

Am I doing something wrong?

Thanks,

Jonathan.

========================
Results:

n=100
CPU BLAS took 0.000045
CUBLAS BLAS took 0.000114
CULA Dense BLAS took 0.000267
My BLAS took 0.000052

n=1000
CPU BLAS took 0.000596
CUBLAS BLAS took 0.001731
CULA Dense BLAS took 0.001925
My BLAS took 0.000854

n=10000
CPU BLAS took 0.027538
CUBLAS BLAS took 0.028302
CULA Dense BLAS took 0.079307
My BLAS took 0.004455

n=16000
CPU BLAS took 0.067040
CUBLAS BLAS took 0.049763
CULA Dense BLAS took 0.183970
My BLAS took 0.009597

Re: Poor CULA dtrsm performance

PostPosted: Fri Dec 14, 2012 3:44 pm
by john
For starters, you don't need a cudaThreadSync after CULA routines; we synchronize inside our routines. We'll continue to look into the rest. As of R17, our routines will all fall through to CUBLAS, so there will no longer be custom CULA code there.

Re: Poor CULA dtrsm performance

PostPosted: Mon Dec 17, 2012 3:02 am
by jhogg
The cudaThreadSync was just there to keep the timing fair with the other routines, which do need it: they don't automatically cause a host-GPU sync, so the timings would otherwise be inaccurate. Noted re R17. NVIDIA are looking at adopting my code for CUBLAS, but I need to compare against other implementations where they are available.

Regards,

Jonathan.