### Poor CULA dtrsm performance

Posted: **Fri Dec 14, 2012 8:22 am**

Hi,

I'm benchmarking CULA dtrsm with a single right-hand side against CUBLAS, MAGMA, host MKL, and my own code for solving Lx=b. CULA comes off really badly, as the numbers below show.
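For reference, here is a minimal sketch of the operation being compared. It assumes that culaDeviceDtrsm follows the standard BLAS dtrsm argument order, so that 'L', 'L', 'N', 'U' with alpha = 1 and one right-hand side means: solve Lx=b in place, with L lower triangular, not transposed, with an implicit unit diagonal, stored column-major. The function name `solve_unit_lower` is hypothetical and is not the custom kernel timed in this post; it is only a plain CPU forward substitution that can be used to check results.

```c
#include <stddef.h>

/* Solve L x = b in place for a unit lower-triangular L.
 * L is stored column-major with leading dimension lda; its diagonal is
 * assumed to be 1 and is never read (the 'U' flag in BLAS terms).
 * On entry x holds b; on exit x holds the solution. */
void solve_unit_lower(size_t n, const double *a, size_t lda, double *x)
{
    for (size_t j = 0; j < n; ++j) {
        const double xj = x[j];           /* x[j] is final at this point */
        const double *col = a + j * lda;  /* column j of L */
        for (size_t i = j + 1; i < n; ++i)
            x[i] -= col[i] * xj;          /* remove x[j]'s contribution from row i */
    }
}
```

The column-oriented loop order is chosen so that the inner loop walks contiguous memory in column-major storage. Comparing its output with the GPU result is a quick way to confirm that each library is solving the same system before comparing timings.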

I'm using the following code:

culaInitialize();

cudaThreadSynchronize();

clock_gettime(CLOCK_REALTIME, &tp1);

culaDeviceDtrsm('L', 'L', 'N', 'U', n, 1, double(1.0), a_gpu, lda, x_gpu, n);

cudaThreadSynchronize();

clock_gettime(CLOCK_REALTIME, &tp2);

culaShutdown();
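The snippet above times a single call. As a hedged sketch of a slightly more robust measurement, the helpers below convert two `clock_gettime` samples to seconds and average several calls after one untimed warm-up call. The names `elapsed_seconds` and `time_average` are hypothetical, and the use of `CLOCK_MONOTONIC` instead of `CLOCK_REALTIME` is a substitution made here, not something taken from the original code. Whether first-call overhead accounts for any of the gap in the results is not established by this post.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime under strict C modes */
#endif
#include <time.h>

/* Seconds elapsed between two clock_gettime() samples. */
double elapsed_seconds(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) * 1e-9;
}

/* Average wall-clock time of `reps` calls to fn(arg).
 * One warm-up call is made first and is not timed.
 * For GPU work, fn is expected to synchronize with the device before
 * returning, otherwise only the launch cost is measured. */
double time_average(void (*fn)(void *), void *arg, int reps)
{
    struct timespec t0, t1;

    fn(arg);                               /* untimed warm-up */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; ++r)
        fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return elapsed_seconds(t0, t1) / reps;
}
```

In this setting, `fn` would wrap the dtrsm call followed by the device synchronization. Because a triangular solve overwrites its right-hand side, repeated calls operate on the previous result; that changes the values but not the amount of work performed.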

Am I doing something wrong?

Thanks,

Jonathan.

========================

Results (time taken):

| n | CPU BLAS | CUBLAS BLAS | CULA Dense BLAS | My BLAS |
|---|---|---|---|---|
| 100 | 0.000045 | 0.000114 | 0.000267 | 0.000052 |
| 1000 | 0.000596 | 0.001731 | 0.001925 | 0.000854 |
| 10000 | 0.027538 | 0.028302 | 0.079307 | 0.004455 |
| 16000 | 0.067040 | 0.049763 | 0.183970 | 0.009597 |

