Is it a terrible bug of CULA?
I am calling the "zgesv" subroutine in CULA, namely cula_zgesv, with MPI. If I use one MPI process with one K20c card, it takes 0.7435620 seconds for a 3000*3000 matrix. However, if I use two MPI processes with two K20c cards, it takes 7.932089 seconds, roughly ten times longer. For a 2000*2000 matrix, it takes 0.28 seconds and 5.6 seconds respectively. Why is there such a big difference? Is it a terrible bug, or have I done something wrong? My code is pasted below (it is a very simple code):
Code:
PROGRAM cula_test
  use cudafor
  use cula_status
  use cula_lapack
  use cula_lapack_device_pgfortran
  IMPLICIT NONE
  include 'mpif.h'

  INTEGER :: n
  complex*16, allocatable :: A(:,:), U(:,:)
  integer, allocatable :: ipiv(:)
  integer :: I, J, info, MPIerror, node, Nnodes
  real*8 :: c, d
  real*4 :: t1, t2
  external cula_initialize
  external cula_shutdown
  external cudasetdevice

  call MPI_Init( MPIerror )
  call MPI_Comm_Rank( MPI_Comm_World, node, MPIerror )
  call MPI_Comm_Size( MPI_Comm_World, Nnodes, MPIerror )

  ! Bind each MPI rank to its own K20c card, then initialize CULA
  if (node.eq.0) info = cudasetdevice(0)
  if (node.eq.1) info = cudasetdevice(1)
  info = cula_initialize()

  n = 3000
  ALLOCATE(A(n,n), U(n,n), ipiv(n))

  ! Fill A with random complex entries and set U to the identity matrix
  do I = 1, N
    do J = 1, N
      call random_number(c)
      call random_number(d)
      A(I,J) = dcmplx(c,d)
    enddo
  enddo
  U(:,:) = (0.d0,0.d0)
  do I = 1, N
    U(I,I) = (1.d0,0.d0)
  enddo

  ! Time the LU solve A*X = U on the GPU; info holds the CULA status
  call cpu_time(t1)
  info = cula_zgesv(n, n, A, n, ipiv, U, n)
  call cpu_time(t2)
  print *, 'GPU: ', U(1,1), t2-t1

  deallocate(A, U, ipiv)
  call cula_shutdown()
  call MPI_FINALIZE(MPIerror)
END
Since there is no communication between the different MPI processes, I think the time should be approximately the same no matter whether ONE or TWO processes are used. In fact, the time is the same when I call the "zgemm" subroutine in CUBLAS with one or two MPI processes.
So, could anybody tell me why I see this problem in CULA but not in CUBLAS, and how to deal with it? It has confused me for several months.
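One thing I am not sure about is whether cpu_time is the right timer here, since it reports CPU time rather than elapsed wall-clock time, and the two can differ if the driver busy-waits while the other process is using its GPU. Below is a minimal sketch of how I could time the solve with MPI_Wtime instead (assuming cula_zgesv only returns after the solve has finished); tw1 and tw2 are new variables that would be declared alongside t1 and t2 in the program above:
Code:
! Sketch: wall-clock timing of the solve with MPI_Wtime (standard MPI),
! as an alternative to the cpu_time calls in the program above.
! tw1, tw2 are hypothetical real*8 variables declared with the other locals.
tw1 = MPI_Wtime()
info = cula_zgesv(n, n, A, n, ipiv, U, n)
tw2 = MPI_Wtime()
print *, 'node ', node, ' wall-clock zgesv time (s): ', tw2 - tw1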
Last edited by xhsh on Sun Dec 22, 2013 12:55 am, edited 2 times in total.
- xhsh
Re: Is it a terrible bug of CULA?
Following up on the previous post: for one MPI process and one K20c card, the output is:
Code:
GPU: (-8.6050432536450713E-002,-0.1393513431401034) 0.7435620
For two MPI processes and two K20c cards, the output is:
Code:
GPU: (-8.6050432536450713E-002,-0.1393513431401034) 7.932089
GPU: (-8.6050432536450713E-002,-0.1393513431401034) 8.396360
- xhsh