performance of cula drops when runs on multiCPU+multiGPU



Postby cding » Wed Aug 31, 2011 3:29 pm

I found a problem in the CULA test results, both in those of Portland Group engineer mkcolg and in my own. It might be a bug in CULA or a device setup issue.

The performance of cula_sgesv and cula_device_sgesv on one CPU with one GPU should be around 100 GFLOPS and 350 GFLOPS, respectively.

One would expect that when the code runs on multiple cores with multiple GPUs, the performance of the CULA host and device routines would stay the same as when they are executed on one core with one GPU.

However, from mkcolg's test results, I see that the performance of the host CULA routine on processes 0, 1, 2, and 3 is 43.9, 42.8, 137.2, and 99.9 GFLOPS. That is far below the performance when running on one GPU. Why does the performance drop so much? It is supposed to stay the same.

mkcolg's thread on the PGI forum: http://www.pgroup.com/userforum/viewtopic.php?t=2703&postdays=0&postorder=asc&start=5

mkcolg's CULA test result on 4 CPUs with 4 GPUs
Code:
% mpirun -machinefile machines.LINUX -np 4 test_cula.out
Process             0  of             4  took GPU:             0
cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1
 
starting cpu test...
   runtime:    33464.90     ms
   gflops:    19.72215   
   error:   4.8381262E-03
 
starting cula (host interface) test...
   runtime:    15007.96     ms
   gflops:    43.97665   
   error:   4.5987219E-03
 
starting cula (device interface) test...
   runtime:    1733.560     ms
   gflops:    380.7195   
   error:   4.5987219E-03
 
Process             3  of             4  took GPU:             1
cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1
 
starting cpu test...
   runtime:    33225.06     ms
   gflops:    19.86452   
   error:   4.8381262E-03
 
starting cula (host interface) test...
   runtime:    6606.524     ms
   gflops:    99.90125   
   error:   4.5987219E-03
 
starting cula (device interface) test...
   runtime:    8241.342     ms
   gflops:    80.08405   
   error:   4.5987219E-03
 
Process             1  of             4  took GPU:             1
cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1
 
starting cpu test...
   runtime:    33354.10     ms
   gflops:    19.78767   
   error:   4.8381262E-03
 
starting cula (host interface) test...
   runtime:    15385.24     ms
   gflops:    42.89825   
   error:   4.5987219E-03
 
starting cula (device interface) test...
   runtime:    1710.793     ms
   gflops:    385.7860   
   error:   4.5987219E-03
 
Process             2  of             4  took GPU:             0
cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1
 
starting cpu test...
   runtime:    33443.48     ms
   gflops:    19.73479   
   error:   4.8381262E-03
 
starting cula (host interface) test...
   runtime:    4807.774     ms
   gflops:    137.2777   
   error:   4.5987219E-03
 
starting cula (device interface) test...
   runtime:    8166.234     ms
   gflops:    80.82061   
   error:   4.5987219E-03


Do you think this is a problem with the CULA routines or with the device setup?
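One thing worth noting in mkcolg's log above: processes 0 and 2 both took GPU 0, and processes 1 and 3 both took GPU 1. The test code assigns devices round-robin with mod(cpuid, numdev), so if only two devices are visible on a node, four ranks double up two-per-GPU. A minimal sketch of that mapping (Python, illustrative only, not CULA API):

```python
def assign_gpu(rank: int, num_devices: int) -> int:
    """Round-robin rank-to-GPU mapping, mirroring mod(cpuid, numdev) in the test code."""
    return rank % num_devices

# With 4 MPI ranks but only 2 visible devices, ranks share GPUs:
print([assign_gpu(r, 2) for r in range(4)])  # [0, 1, 0, 1]
```

Two ranks driving one GPU serialize its work, so per-rank throughput can roughly halve even before any CPU-side contention enters the picture.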

I also ran some tests on CULA with code similar to mkcolg's and found the same problem, with an even worse performance drop.

I also tested similar routines in MAGMA; there is no performance drop when they run on multiple GPUs.

Below are my test results on 4 GPUs, compared with a run on 1 CPU with 1 GPU.

My test code is slightly different from mkcolg's: I put the MPI initialization module in a separate file. I compile the MPI files with mpif90 (pgf90 with the OpenMPI implementation) and the .cuf code with pgfortran.

1 CPU + 1 GPU:
Code:
cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

starting cpu test...
   runtime:    36437.05     ms
   gflops:    18.11343
   error:   5.0552888E-03

starting cula (host interface) test...
   runtime:    4183.666     ms
   gflops:    157.7564
   error:   1.2658446E-03

starting cula (device interface) test...
   runtime:    3822.783     ms
   gflops:    172.6491
   error:   1.2658446E-03



4 CPUs + 4 GPUs:
Code:
Process             0  of             4  took GPU:             0
Process             3  of             4  took GPU:             3
Process             2  of             4  took GPU:             2
Process             1  of             4  took GPU:             1
cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

cula + pgfortran test (matrix solve)
   array size:         10000  by         10000
   right hand sides:             1

starting cpu test...
starting cpu test...
starting cpu test...
starting cpu test...
   runtime:    35684.69     ms
   gflops:    18.49533
   error:   5.0552888E-03

   runtime:    35932.07     ms
   gflops:    18.36799
   runtime:    35987.90     ms
   gflops:    18.33949
   error:   5.0552888E-03

   error:   5.0552888E-03

   runtime:    37404.25     ms
   gflops:    17.64506
   error:   5.0552888E-03

starting cula (host interface) test...
starting cula (host interface) test...
starting cula (host interface) test...
starting cula (host interface) test...
   runtime:    7521.451     ms
   gflops:    87.74902
   error:   1.2658446E-03

   runtime:    170592.8     ms
   gflops:    3.868863
   error:   1.2658446E-03

   runtime:    178027.4     ms
   runtime:    178026.6     ms

   gflops:    3.707310
   gflops:    3.707294
   error:   1.2658446E-03

   error:   1.2658446E-03

starting cula (device interface) test...
starting cula (device interface) test...
starting cula (device interface) test...
starting cula (device interface) test...
   runtime:    241970.9     ms
   gflops:    2.727601
   error:   1.2658446E-03

   runtime:    242817.9     ms
   gflops:    2.718086
   error:   1.2658446E-03

   runtime:    245411.0     ms
   gflops:    2.689365
   error:   1.2658446E-03

   runtime:    245853.4     ms
   gflops:    2.684527
   error:   1.2658446E-03


Compared with the 157.75 and 172.64 GFLOPS achieved on a single CPU with a single GPU, when the code runs on 4 GPUs the host and device cula_sgesv routines drop to roughly 3 GFLOPS and 2 GFLOPS.
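For reference, the GFLOPS figures in these logs come from the roughly (2/3)·n³ flop count of an LU solve (the test code uses the factor 0.66) divided by wall time. A quick sanity check against the single-GPU host-interface run above:

```python
n = 10_000
runtime_s = 4.183666                  # host-interface runtime from the 1 CPU + 1 GPU run
gflops = 0.66 * n**3 / runtime_s / 1e9
print(f"{gflops:.2f} GFLOPS")         # ~157.76, matching the reported 157.7564
```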

I'm working on a project that will use these CULA routines with MPI and multiple GPUs. Given this issue, there is definitely no advantage to using the CULA routines instead of the same routines in MKL (the MKL routine I tested runs at 18 GFLOPS).

Do you have any experience with this? I would really appreciate any advice you can give.

My test code:
test_cula.cuf
Code:
module cula_test
     
            use cudafor
            contains
                         
            ! gpu error reporting routine
            subroutine check_status(status)
           
                integer status
                integer info
                integer cula_geterrorinfo
                info = cula_geterrorinfo()
                if (status .ne. 0) then
                    if (status .eq. 7) then
                        write(*,*) 'invalid value for parameter ', info
                    else if (status .eq. 8) then
                        write(*,*) 'data error (', info ,')'
                    else if (status .eq. 9) then
                        write(*,*) 'blas error (', info ,')'
                    else if (status .eq. 10) then
                        write(*,*) 'runtime error (', info ,')'
                    else
                        call cula_getstatusstring(status)
                    endif
                    stop 1
                end if
               
            end subroutine
           
            ! cpu test (baseline)
            subroutine do_cpu_test(n,nrhs,ain,bin)
               
                ! input
                real,dimension(:,:) :: ain,bin
               
                ! allocations
                real,dimension(:,:),allocatable :: a,b,ans
                integer,dimension(:),allocatable :: ipiv
                integer n,nrhs,info
                integer c1,c2,cr,cm
                real norm, work(1)
                real, external :: slange
               
                ! back up input for reconstruction test
                allocate( a(n,n), b(n,nrhs), ipiv(n), ans(n,nrhs) )
                a = ain
                b = bin               
               
                ! start test
                call system_clock( c1, cr, cm )
                print *, 'starting cpu test...'
                               
                ! call lapack solver
                call sgesv(n,nrhs,a,n,ipiv,b,n,info)
               
                ! stop test
                call system_clock( count=c2 )
                print *, '  runtime:', 1.e3*real(c2-c1) / real(cr), 'ms'
                print *, '  gflops:', (0.66*n**3.) / (real(c2-c1) / real(cr)) / (1.e9)
               
                ! check answer
                ans = bin;
                call sgemm('n','n',n,nrhs,n,1.0,ain,n,b,n,-1.0,ans,n)
                norm = slange('1',n,nrhs,ans,n,work) / real(n)
                print *, '  error:', norm
                print *, ''
               
                ! cleanup
                deallocate(a,b,ipiv,ans)
               
            end subroutine do_cpu_test
           
            ! cula test (host interface)
            subroutine do_cula_host_test(n,nrhs,ain,bin)
               
                ! input
                real,dimension(:,:) :: ain,bin
               
                ! allocations (all on host)
                real,dimension(:,:),allocatable :: a,b,ans
                integer,dimension(:),allocatable :: ipiv
                integer n,nrhs,status
                integer c1,c2,cr,cm
                integer cula_sgesv
                real norm, work(1)
                real, external :: slange
               
                ! back up input for reconstruction test
                allocate( a(n,n), b(n,nrhs), ipiv(n), ans(n,nrhs) )
                a = ain
                b = bin               
               
                ! start test
                call system_clock( c1,cr,cm )
                print *, 'starting cula (host interface) test...'
                               
                ! call cula solver (host interface)
                status = cula_sgesv(n,nrhs,a,n,ipiv,b,n)
                call check_status(status)
               
                ! stop test
                call system_clock( count=c2 )
                print *, '  runtime:', 1.e3*real(c2-c1) / real(cr), 'ms'
                print *, '  gflops:', (0.66*n**3.) / (real(c2-c1) / real(cr)) / (1.e9)
               
                ! check answer
                ans = bin;
                call sgemm('n','n',n,nrhs,n,1.0,ain,n,b,n,-1.0,ans,n)
                norm = slange('1',n,nrhs,ans,n,work) / real(n)
                print *, '  error:', norm
                print *, ''
               
                ! cleanup
                deallocate(a,b,ipiv,ans)
               
            end subroutine do_cula_host_test
                       
            ! cula test (device interface)
            subroutine do_cula_device_test(n,nrhs,ain,bin)
           
                ! input
                real,dimension(:,:) :: ain,bin
               
                ! allocations (all on host)
                real,dimension(:,:),allocatable :: a,b,ans
                integer n,nrhs,status
                integer c1,c2,cr,cm
                integer cula_device_sgesv
                real norm, work(1)
                real, external :: slange
               
                ! gpu memory
                real,device,dimension(:,:),allocatable :: a_dev,b_dev
                integer,device,dimension(:),allocatable :: ipiv_dev
               
                ! back up input for reconstruction test
                allocate( a(n,n), b(n,nrhs), ans(n,nrhs) )
                a(1:n,1:n) = ain
                b(1:n,1:nrhs) = bin               
               
                ! allocate gpu memory
                allocate( a_dev(n,n), b_dev(n,nrhs), ipiv_dev(n) )
               
                ! start test
                call system_clock( c1,cr,cm )
                print *, 'starting cula (device interface) test...'
               
                ! copy memory to gpu
                a_dev = a
                b_dev = b
               
                ! call cula solver (device interface)
                status = cula_device_sgesv(n,nrhs,a_dev,n,ipiv_dev,b_dev,n)
                call check_status(status)
               
                ! copy answer to host
                b = b_dev
               
                ! stop test
                call system_clock( count=c2 )
                print *, '  runtime:', 1.e3*real(c2-c1) / real(cr), 'ms'
                print *, '  gflops:', (0.66*n**3.) / (real(c2-c1) / real(cr)) / (1.e9)
               
                ! check answer
                ans(1:n,1:nrhs) = bin;
                call sgemm('n','n',n,nrhs,n,1.,ain,n,b,n,-1.,ans,n)
                norm = slange('1',n,nrhs,ans,n,work) / real(n)
                print *, '  error:', norm
                print *, ''
               
                ! cleanup
                deallocate(a,b,ans)
                deallocate(a_dev,b_dev,ipiv_dev)
               
            end subroutine do_cula_device_test
           
        end module cula_test
       
        ! main program

        program cula

            use cula_test
            use more_mpi

           
            ! Host memory
            real,dimension(:,:),allocatable :: a, b
            integer n, nrhs, info, i, j, status
            integer :: rc, mydev, numdev
            integer :: cula_initialize, cula_selectdevice

           
            call init
   
            ierr = cudaGetDeviceCount(numdev)
            mydev =  mod(cpuid,numdev)
   
            print *, "Process ", cpuid, " of ", numprocs, " took GPU: ", mydev
            !ierr = cudaSetDevice(mydev)
            ierr = cula_selectdevice(mydev)
            call check_status(ierr)

            n = 10000
            nrhs = 1

            print *,'cula + pgfortran test (matrix solve)'
            print *,'  array size: ', n, ' by ', n
            print *,'  right hand sides: ', nrhs
            print *,''
            allocate( a(n,n), b(n,nrhs) )
                                   
            ! initialize a and b
            call random_number(a)
            call random_number(b)
           
            ! Make sure a() isn't singular
            do i=1,n
                a(i,i) = 10. * a(i,i) + 10.
            enddo
           
            call MPI_Barrier(mpi_comm_world,ierr)
           
            ! initialize cula
            status = cula_initialize()
            call check_status(status)
           
            ! do cpu test (baseline)
            call do_cpu_test(n,nrhs,a,b)
            call MPI_Barrier(mpi_comm_world,ierr)
                               
            ! do gpu test (host interface)
            call do_cula_host_test(n,nrhs,a,b)
            call MPI_Barrier(mpi_comm_world,ierr)
           
            ! do gpu test (device interface)
            call do_cula_device_test(n,nrhs,a,b)
            call MPI_Barrier(mpi_comm_world,ierr)

            call MPI_FINALIZE(rc)
             
        end program cula


init.f

Code:
subroutine init
   use more_mpi
   call mpi_init(ierr)
   call mpi_comm_rank(mpi_comm_world,cpuid,ierr)
   call mpi_comm_size(mpi_comm_world,numprocs,ierr)
   call mpi_get_processor_name(processor_name,namelen,ierr)
end subroutine init


more_mpi.f
Code:
module more_mpi
   include 'mpif.h'
   integer :: ierr,cpuid,numprocs,namelen !mpi
   character(len=100) processor_name
end module


makefile
Code:
.SUFFIXES: .cuf .o

L1= test_cula.o more_mpi.o init.o
L2=

CULAINCLUDES= -I${CULA_INC_PATH}
CULALIBPATH64= -L${CULA_LIB_PATH_64}


CUDAINCLUDES= -I${CUDA_INC_PATH}
CUDALIBPATH64= -L${CUDA_LIB_PATH_64}

CUDALIBS= -lcudart -lcuda


GPULIBS= -lcula_pgfortran #-lcula -lcublas -lcudart

PGFLAGS= -Mfree -O3

#CUDA= -ta=nvidia -Mcuda
CUDA=


SOPT=
LINK1=  /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_scalapack_lp64.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_intel_lp64.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_blacs_openmpi_lp64.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_core.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_sequential.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_core.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_sequential.a \
       /opt/intel/Compiler/11.1/069/mkl/lib/em64t/libmkl_core.a \
       -lpthread

#LINK_CU=  /opt/pgi/linux86-64/10.6/lib/libcudafor.a
LINK_CU=  /opt/pgi/linux86-64/11.5/lib/libcudafor.a



PF90= mpif90

PGFOR= pgfortran


PGA_EX= cula_test_mgpus

darwin: $(L1) $(L2)
     $(PF90) $(SOPT) $(PGFLAGS) $(L1) $(L2) $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) $(CULAINCLUDES) $(CULALIBPATH64) $(GPULIBS) $(LINK1) $(LINK_CU) -o $(PGA_EX)

.f.o:
     $(PF90) $(SOPT) $(PGFLAGS) $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) $(CULAINCLUDES) $(CULALIBPATH64) $(GPULIBS) -c $<

.cuf.o:
     $(PGFOR) $(SOPT) $(PGFLAGS) $(CUDAINCLUDES) $(CUDALIBPATH64) $(CUDALIBS) $(CULAINCLUDES) $(CULALIBPATH64) $(GPULIBS) -c $<

test_cula.o: test_cula.cuf init.o more_mpi.o

more_mpi.o: more_mpi.f

init.o: init.f more_mpi.o


clean:
     /bin/rm -f *o *mod $(L1) $(L2) $(PGA_EX)
   
del:
     rm -f *.mio.mines.edu *.000 *.001 *.002 *.003

move:
     mv *.edu ./result


Thank you in advance.
cding
 
Posts: 15
Joined: Tue Sep 14, 2010 8:25 pm

Re: performance of cula drops when runs on multiCPU+multiGPU

Postby john » Thu Sep 01, 2011 7:09 am

There are a couple of things here which I think are competing for your CPU resources. I see that your checking routines are in the same barrier periods as the CULA solvers. A GEMM+LANGE call at 10k is not trivial, and MKL will use at least 4 threads when calculating those. So what I see is that your first CULA call finishes in 7s and then the rest take quite a bit longer because that first thread has started its GEMM+LANGE sequence. The other CULA calls are then competing with that for CPU time.

I would advise against counting transfer times in your device interface timers. It then becomes impossible to tell whether an increase in duration is due to the transfers or to CULA. Besides, the whole point of calling the CULA device interface is to minimize transfers, so replicating the host interface's transfer pattern is redundant. If you do want to count transfer times, I would advise calling the host interface, or amortizing the transfer time over several CULA calculation invocations.

Lastly, you might see better results with an OpenMP or threaded approach rather than MPI. In the MPI approach, each of your processes is spawning several worker threads (MKL is similarly optimized to use all the available CPU resources) which are causing contention. If you use threads instead, then the number of worker threads will be capped and so less contention will occur. An alternative is to cap the number of threads manually yourself, but again keep in mind that CULA is optimized assuming that it has full access to your CPU resources.
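Capping the thread count manually, as suggested above, usually means setting the standard MKL/OpenMP thread-count environment variables in each rank before any threaded library spins up its pool. A sketch (Python, assuming a hypothetical 16-core node shared by 4 MPI ranks; MKL_NUM_THREADS and OMP_NUM_THREADS are the standard MKL/OpenMP controls):

```python
import os

# Hypothetical layout: 4 MPI ranks sharing one 16-core node.
cores_per_node = 16
ranks_per_node = 4
threads_per_rank = cores_per_node // ranks_per_node

# Must be set before the threaded library initializes its thread pool.
os.environ["MKL_NUM_THREADS"] = str(threads_per_rank)
os.environ["OMP_NUM_THREADS"] = str(threads_per_rank)
print(threads_per_rank)  # 4
```

In practice these can also be exported in the mpirun environment; the point is that each rank's library then stops assuming it owns the whole CPU.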
john
Administrator
 
Posts: 587
Joined: Thu Jul 23, 2009 2:31 pm

Re: performance of cula drops when runs on multiCPU+multiGPU

Postby cding » Tue Sep 06, 2011 9:58 am

Dear John,

Thank you for your reply. But I need to clarify something about the code.

john wrote: There are a couple of things here which I think are competing for your CPU resources. I see that your checking routines are in the same barrier periods as the CULA solvers.


I don't think the MPI barriers placed after the test subroutines matter much, because they are called only after the three test subroutines return, i.e., after the solution-check part inside each subroutine. Only the performance of cula_sgesv and cula_device_sgesv is measured, and both execute well before those barriers. Also, I had already tried running without the barriers, and the issue was there from the beginning. I added the barriers only so the output would be formatted and readable; otherwise the output is so interleaved that I cannot tell which number is the performance of which process.

A GEMM+LANGE call at 10k is not trivial, and MKL will use at least 4 threads when calculating those. So what I see is that your first CULA call finishes in 7s and then the rest take quite a bit longer because that first thread has started its GEMM+LANGE sequence. The other CULA calls are then competing with that for CPU time.


I commented out do_cpu_test and do_cula_host_test and ran only do_cula_device_test, to prevent the GEMM+LANGE sequence started by MKL from competing with cula_device_sgesv for CPU time. But the issue is still there.

And even if we assume it is correct that "the rest take quite a bit longer because that first thread has started its GEMM+LANGE sequence", why is the issue absent when I test the same code on 1 CPU + 1 GPU? There the CULA performance looks excellent.

Even when I run it on multiple CPUs with multiple GPUs, each CPU has its own GPU on which to execute the CULA routine, and each CPU executes the same code as in the single-core run, with no communication time between cores. So I would think each CPU has the same resources available as when running on a single core.

So, no offense, but I don't think "the rest take quite a bit longer because that first thread has started its GEMM+LANGE sequence" is the correct explanation for my issue.

I would advise against counting transfer times in your device interface timers.


I agree with you, so I moved the data transfers to before the timing starts. But the performance still drops when running on multiple cores.

Lastly, you might see better results with an OpenMP or threaded approach rather than MPI.


I will try OpenMP and post the results here later. Thank you for this advice.


An alternative is to cap the number of threads manually yourself, but again keep in mind that CULA is optimized assuming that it has full access to your CPU resources.


Why does CULA have full access to CPU resources when I run the MPI code on a single CPU with a single GPU, but fail to get full access to CPU resources when it runs on multiple cores?

And I'm a little confused: the CULA device routines execute on the GPU, and once the data is ready on the device and a device routine is running, it shouldn't depend on the CPU side.

Thank you so much.

With Regards,

Chong
cding
 

Re: performance of cula drops when runs on multiCPU+multiGPU

Postby john » Thu Sep 08, 2011 1:23 pm

cding wrote: So, no offense, but I don't think "the rest take quite a bit longer because that first thread has started its GEMM+LANGE sequence" is the correct explanation for my issue.


This is the code that is being performed:
Code:
            subroutine do_cula_device_test(n,nrhs,ain,bin)
...
                status = cula_device_sgesv(n,nrhs,a_dev,n,ipiv_dev,b_dev,n)
...
                call sgemm('n','n',n,nrhs,n,1.,ain,n,b,n,-1.,ans,n)
                norm = slange('1',n,nrhs,ans,n,work) / real(n)
...
            end subroutine do_cula_device_test

============

            call MPI_Barrier(mpi_comm_world,ierr)
            call do_cula_device_test(n,nrhs,a,b)
            call MPI_Barrier(mpi_comm_world,ierr)

This is what I mean by GEMM+LANGE being in the same barrier period, so GEMM+LANGE from the process that finishes GESV first is stealing resources. Disable all these calls and I think you'll see a somewhat different result.

Remember that CULA is a hybrid library. GESV uses CPU resources during its processing, as does GEMM (GEMM uses quite a lot). I think they are competing heavily. CULA is performance-balanced for 1 multicore CPU + 1 GPU. I really think we're overwhelming your CPU here.
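The restructuring described here, keeping each rank's CPU-heavy verification out of the barrier period that brackets the timed solve, can be sketched generically (Python pseudocode; comm, solve, and verify are illustrative stand-ins, not CULA or MPI API):

```python
import time

def run_benchmark(comm, solve, verify, a, b):
    """Time only the solver; defer the GEMM+LANGE-style check until all ranks are done."""
    comm.Barrier()
    t0 = time.perf_counter()
    x = solve(a, b)                # only the GESV-equivalent call is timed
    elapsed = time.perf_counter() - t0
    comm.Barrier()                 # every rank finishes solving before anyone verifies
    err = verify(a, b, x)          # CPU-heavy check no longer competes with other ranks' solves
    return elapsed, err
```

With this structure, rank 0 finishing its solve early can no longer steal CPU threads from ranks still inside their solver call.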
john

