
Is it a terrible bug of CULA?

PostPosted: Sun Dec 22, 2013 12:47 am
by xhsh
I am calling the "zgesv" routine in CULA, namely cula_zgesv, with MPI. If I use one MPI process with one K20c card, it takes 0.7435620 seconds for a 3000*3000 matrix. However, if I use two MPI processes with two K20c cards, it takes 7.932089 seconds, about ten times longer. For a 2000*2000 matrix, it takes 0.28 seconds and 5.6 seconds respectively. Why is there such a big difference? Is it a terrible bug, or have I done something wrong? I paste my code below (it is very simple):

Code:

PROGRAM cula_test

use cudafor
use cula_status
use cula_lapack
use cula_lapack_device_pgfortran

implicit none
include 'mpif.h'

complex*16, allocatable :: A(:,:), U(:,:)
integer, allocatable :: ipiv(:)
integer :: n, I, J, info, MPIerror, node, Nnodes
real*8 :: c, d
real*4 :: t1, t2

call MPI_Init( MPIerror )
call MPI_Comm_Rank( MPI_Comm_World, node, MPIerror )
call MPI_Comm_Size( MPI_Comm_World, Nnodes, MPIerror )

! one GPU per MPI rank
if (node.eq.0) info = cudasetdevice(0)
if (node.eq.1) info = cudasetdevice(1)

info = cula_initialize()
n = 3000

allocate(A(n,n), U(n,n), ipiv(n))

! fill A and U with random complex entries
do I = 1, n
   do J = 1, n
      call random_number(c)
      call random_number(d)
      A(I,J) = dcmplx(c, d)
      call random_number(c)
      call random_number(d)
      U(I,J) = dcmplx(c, d)
   end do
end do

call cpu_time(t1)
info = cula_zgesv(n, n, A, n, ipiv, U, n)
call cpu_time(t2)
print *, 'GPU: ', U(1,1), t2 - t1

call cula_shutdown()

deallocate(A, U, ipiv)
call MPI_Finalize( MPIerror )

END PROGRAM cula_test


Since there is no communication between the different MPI processes, I would expect the time to be approximately the same whether ONE or TWO processes are used. In fact, the time is the same when I call the "zgemm" routine in CUBLAS with one or two MPI processes.
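One thing worth double-checking on the measurement side: Fortran's cpu_time intrinsic reports per-process CPU time, not elapsed wall-clock time, and a runtime that busy-waits while the GPU works can inflate it. A minimal sketch (my own suggestion, not from the original post) of the same measurement using the standard MPI wall clock, with a barrier so both ranks start the solve together; it assumes the same declarations as the program above, with t1 and t2 changed to double precision:

```fortran
! Sketch: time the solve with wall-clock MPI_Wtime instead of cpu_time.
! Assumes the surrounding program above; t1, t2 are now real*8.
call MPI_Barrier( MPI_Comm_World, MPIerror )  ! align both ranks before timing
t1 = MPI_Wtime()
info = cula_zgesv(n, n, A, n, ipiv, U, n)
t2 = MPI_Wtime()
print *, 'rank ', node, ' wall time: ', t2 - t1
```

If the wall times agree between the one-process and two-process runs but the cpu_time numbers do not, the difference is spin-waiting in the library rather than a genuine slowdown.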

So, could anybody tell me why I see this problem in CULA but not in CUBLAS, and how to deal with it? I have been confused about it for several months.

Re: Is it a terrible bug of CULA?

PostPosted: Sun Dec 22, 2013 12:53 am
by xhsh
Following up on the previous post: for one MPI process and one K20c card, the output is:

Code:
GPU:   (-8.6050432536450713E-002,-0.1393513431401034)   0.7435620

For two MPI processes and two K20c cards, the output is:

Code:
GPU:   (-8.6050432536450713E-002,-0.1393513431401034)    7.932089
GPU:   (-8.6050432536450713E-002,-0.1393513431401034)    8.396360