Weirdness or BUG of CULA Function culaDeviceDsyev


Weirdness or BUG of CULA Function culaDeviceDsyev

Postby john_silver » Thu Jun 06, 2013 7:13 am

Hello,
I've just found a problem, or perhaps a bug, in the function culaDeviceDsyev.

For matrices of the same dimension, the first call to this function takes much less time than the following calls, which is very unusual. For a matrix of dimension 16384, the function takes about 240 s the first time, more than 800 s the second time, about 1000 s the third time, and then always about 1000 s after that.

I'm doing some analysis comparing the performance of a Tesla M2050 GPU against a Xeon 2550 CPU, so I measure the execution time of culaDeviceDsyev() on the GPU.

Here is my code:
Code:
#include <cstdlib>
#include <cstdio>
#include <unistd.h>
#include <sys/times.h>
#include "mkl_lapack.h"
#include <cula_lapack_device.h>
#include <cuda_runtime.h>
#include <cula.h>
#include <cuda.h>

#define NB_BOUCLE 5

void checkStatus(culaStatus status)
{
    char buf[256];

    if(!status)
        return;

    culaGetErrorInfoString(status, culaGetErrorInfo(), buf, sizeof(buf));
    printf("%s\n", buf);

    culaShutdown();
    exit(EXIT_FAILURE);
}


void checkCudaError(cudaError_t err)
{
    if(!err)
        return;

    printf("%s\n", cudaGetErrorString(err));

    culaShutdown();
    exit(EXIT_FAILURE);
}

// Simple GPU timer based on CUDA events (elapsed time is reported in ms)
struct event_pair
{
    cudaEvent_t start;
    cudaEvent_t end;
};

inline void start_timer(event_pair *p)
{
    cudaEventCreate(&p->start);
    cudaEventCreate(&p->end);
    cudaEventRecord(p->start,0);
}

inline void stop_timer(event_pair* p, float* elapsed_time)
{
    cudaEventRecord(p->end,0);
    cudaEventSynchronize(p->end);

    cudaEventElapsedTime(elapsed_time,p->start,p->end);
    cudaEventDestroy(p->start);
    cudaEventDestroy(p->end);
}

int main(int argc, const char* argv[] ) {

    int max = 16384;                 // maximum matrix size

    clock_t clock_start;
    clock_t clock_end;
    struct tms cpu_start;
    struct tms cpu_end;

    double delayreal=0;
    double delayuser=0;
    double delaysyst=0;

    event_pair timer;

    float temps_diago;   // diagonalization time on the GPU (ms)
    float temps_g2c;     // device-to-host copy time (ms)
    float temps_c2g;     // host-to-device copy time (ms)

    // seeds for the LAPACK random number generator dlarnv_
    int iseed[4];
    iseed[0] = 2001;
    iseed[1] = 2003;
    iseed[2] = 2005;
    iseed[3] = 2007;
 
    int size=64;
    int nb = NB_BOUCLE;
    printf("number of loops:  %d\n",nb);
    printf("  Matrix size | Real time (sec) | User time (sec) | Syst  time (sec)|  Temps_c2g(ms)  | Temps_g2c(ms)   | Temps_diago(ms)\n");

    cudaError_t err;

    // pointer to host memory
    double* A = NULL;
    double* vals =NULL;

    // pointer to device memory
    double* Ad = NULL;
    double* valsd = NULL;
   
//Initialize the GPU
   culaStatus s;
   s = culaInitialize();
   checkStatus(s);

    while(size<=max) {         
   
   A = new double[size*size];
   vals = new double[size];
//   err = cudaMallocHost((void**)&A,size*size*sizeof(double));
//   checkCudaError(err);
//   err = cudaMallocHost((void**)&vals,size*sizeof(double));
//   checkCudaError(err);
   err = cudaMalloc((void**)&Ad,size*size*sizeof(double));
   checkCudaError(err);
   err = cudaMalloc((void**)&valsd,size*sizeof(double));
   checkCudaError(err);

for(int count=0;count<NB_BOUCLE;count++){

    // Fill the symmetric matrix A with random values.
    // The loop indices must restart for every row; the upper triangle is
    // generated and mirrored so that A stays symmetric. Note that a new
    // random matrix is generated for every repetition of the test.
    for(int i = 0; i < size; i++) {
      for(int j = i; j < size; j++) {
        double random;
        int idist = 1;   // uniform (0,1) distribution
        int n = 1;       // one value per call
        dlarnv_(&idist, iseed, &n, &random);

        A[i*size+j] = random;
        A[j*size+i] = random;
      }
    }
      
   clock_start = times(&cpu_start);   

   start_timer(&timer);
   err = cudaMemcpy(Ad,A,size*size*sizeof(double),cudaMemcpyHostToDevice);
   checkCudaError(err);
   stop_timer(&timer,&temps_c2g);

   start_timer(&timer);
   culaDeviceDsyev('V','U',size,Ad,size,valsd);    // device memory: eigenvalues in valsd, eigenvectors overwrite Ad
   stop_timer(&timer,&temps_diago);

   start_timer(&timer);
   err = cudaMemcpy(A,Ad,size*size*sizeof(double),cudaMemcpyDeviceToHost);
   checkCudaError(err);
   err = cudaMemcpy(vals,valsd,size*sizeof(double),cudaMemcpyDeviceToHost);
   checkCudaError(err);
   stop_timer(&timer,&temps_g2c);   
   
   clock_end = times(&cpu_end);
   double clockstosec = (double)sysconf(_SC_CLK_TCK);
   
   delayreal = (clock_end-clock_start)/clockstosec;
   delayuser = (cpu_end.tms_utime-cpu_start.tms_utime)/clockstosec;
   delaysyst = (cpu_end.tms_stime-cpu_start.tms_stime)/clockstosec;

   printf("%13d | %15.3f | %15.3f | %15.3f | %15.3f | %15.3f | %15.3f\n",size,delayreal,delayuser,delaysyst,temps_c2g,temps_g2c,temps_diago);
      }

   cudaFree(Ad);
   cudaFree(valsd);
   delete[] A;
   delete[] vals;   
   size *= 2;   
    }

    culaShutdown();
    return EXIT_SUCCESS;
}


As you can see in this code, I measure the execution time of culaDeviceDsyev. For every dimension from 64 to 16384, the function is called 5 times, and each time I print the execution time.

Here is the result:
Matrix size | Real time (sec) | User time (sec) | Syst time (sec) | T_c2g (ms) | T_g2c (ms) | T_diago (ms)
64 | 0.010 | 0.000 | 0.000 | 0.029 | 0.047 | 7.868
64 | 0.010 | 0.010 | 0.000 | 0.030 | 0.046 | 9.375
64 | 0.010 | 0.010 | 0.000 | 0.030 | 0.046 | 10.009
64 | 0.010 | 0.010 | 0.000 | 0.029 | 0.046 | 10.185
64 | 0.010 | 0.010 | 0.010 | 0.029 | 0.046 | 10.183
128 | 0.020 | 0.010 | 0.000 | 0.081 | 0.079 | 20.496
128 | 0.030 | 0.030 | 0.000 | 0.076 | 0.090 | 24.777
128 | 0.020 | 0.020 | 0.000 | 0.075 | 0.085 | 24.257
128 | 0.030 | 0.030 | 0.000 | 0.075 | 0.096 | 24.082
128 | 0.020 | 0.020 | 0.000 | 0.069 | 0.095 | 24.617
256 | 0.050 | 0.050 | 0.000 | 0.225 | 0.249 | 49.629
256 | 0.070 | 0.060 | 0.000 | 0.248 | 0.254 | 64.530
256 | 0.060 | 0.070 | 0.000 | 0.214 | 0.268 | 65.439
256 | 0.070 | 0.070 | 0.000 | 0.211 | 0.252 | 66.838
256 | 0.070 | 0.060 | 0.010 | 0.216 | 0.249 | 66.678
512 | 0.060 | 0.060 | 0.000 | 0.649 | 0.664 | 57.305
512 | 0.220 | 0.210 | 0.000 | 0.614 | 0.667 | 218.714
512 | 0.240 | 0.240 | 0.010 | 0.615 | 0.662 | 240.389
512 | 0.250 | 0.260 | 0.000 | 0.660 | 0.673 | 252.154
512 | 0.250 | 0.250 | 0.000 | 0.616 | 0.673 | 251.102
1024 | 0.180 | 0.180 | 0.000 | 1.847 | 2.546 | 175.330
1024 | 0.800 | 0.790 | 0.000 | 1.958 | 2.081 | 794.697
1024 | 0.900 | 0.900 | 0.010 | 1.939 | 2.098 | 895.780
1024 | 0.920 | 0.930 | 0.000 | 1.974 | 2.081 | 925.136
1024 | 0.940 | 0.930 | 0.000 | 1.912 | 2.088 | 929.584
2048 | 1.290 | 1.290 | 0.000 | 7.546 | 8.385 | 1274.798
2048 | 3.720 | 3.710 | 0.000 | 7.207 | 11.734 | 3698.643
2048 | 4.210 | 4.210 | 0.000 | 11.031 | 11.759 | 4181.293
2048 | 4.030 | 4.030 | 0.010 | 11.096 | 7.804 | 4016.865
2048 | 4.290 | 4.280 | 0.000 | 7.346 | 7.742 | 4271.743
4096 | 6.520 | 6.520 | 0.000 | 28.207 | 30.059 | 6461.695
4096 | 17.860 | 17.860 | 0.010 | 27.247 | 59.392 | 17780.424
4096 | 20.340 | 20.340 | 0.000 | 54.974 | 60.237 | 20229.109
4096 | 20.900 | 20.900 | 0.000 | 55.384 | 62.078 | 20793.055
4096 | 20.900 | 20.900 | 0.010 | 57.699 | 61.001 | 20784.197
8192 | 34.500 | 34.500 | 0.010 | 221.095 | 167.371 | 34122.828
8192 | 127.290 | 127.310 | 0.000 | 224.520 | 164.142 | 126928.305
8192 | 150.460 | 150.470 | 0.010 | 222.225 | 165.905 | 150095.641
8192 | 153.340 | 153.340 | 0.020 | 225.617 | 164.135 | 152968.891
8192 | 153.800 | 153.820 | 0.010 | 224.054 | 158.536 | 153450.719
16384 | 236.930 | 235.970 | 0.970 | 884.820 | 751.582 | 235330.203
16384 | 849.930 | 843.380 | 6.560 | 883.704 | 731.846 | 848448.438
16384 | 1018.780 | 1016.570 | 2.350 | 871.646 | 579.166 | 1017499.812
16384 | 1030.250 | 1030.320 | 0.100 | 659.949 | 578.773 | 1029187.062
16384 | 1028.290 | 1028.390 | 0.090 | 775.764 | 628.317 | 1027066.375

The columns we care about are T_diago (the execution time of culaDeviceDsyev, measured with CUDA events on the GPU) and Real time (measured on the CPU). As you can clearly see, for the large matrices this function takes much less time the first time we call it.

So my question is: does the first call execute correctly, and if so, why do the following calls take much more time for a matrix of the same dimension?
john_silver
 
Posts: 2
Joined: Thu Jun 06, 2013 5:46 am

Re: Weirdness or BUG of CULA Function culaDeviceDsyev

Postby john » Thu Jun 06, 2013 9:02 am

Two quick suggestions, off the top of my head:
1) Make sure to check the output of the CULA routine - there's a chance that some of these cases are erroring and therefore exiting early.
2) Try reusing the same A matrix each time you repeat the same test size. Eigenvalue convergence is highly data dependent, so it's hard to make guesses when the data varies each call.
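
For point 1, something along these lines would do it (an untested sketch that reuses the checkStatus helper and the variables from your code above):

Code:
/* Sketch only: keep the status returned by culaDeviceDsyev and verify it,
   so a call that fails and returns early is not mistaken for a fast run. */
start_timer(&timer);
culaStatus st = culaDeviceDsyev('V', 'U', size, Ad, size, valsd);
stop_timer(&timer, &temps_diago);
checkStatus(st);   // prints the CULA error string and exits on failure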
john
Administrator
 
Posts: 587
Joined: Thu Jul 23, 2009 2:31 pm

Re: Weirdness or BUG of CULA Function culaDeviceDsyev

Postby john_silver » Mon Jun 10, 2013 8:15 am

john wrote: Two quick suggestions, off the top of my head:
1) Make sure to check the output of the CULA routine - there's a chance that some of these cases are erroring and therefore exiting early.
2) Try reusing the same A matrix each time you repeat the same test size. Eigenvalue convergence is highly data dependent, so it's hard to make guesses when the data varies each call.


I've followed your advice and reused the same A matrix. With the same A matrix, the computation time is much more stable.
So the problem came from the matrix A: it should be the same matrix every time when analysing the computation time (roughly as in the sketch at the end of this post).
Thanks very much. :D :D
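
For anyone who finds this thread later, the change amounts to roughly the following (a sketch using the variables and helpers from my first post, not the exact code I ran): the random fill is moved out of the repetition loop, and because dsyev with jobz='V' overwrites Ad with the eigenvectors, A is re-uploaded before every timed call so each run sees identical input data.

Code:
/* Generate the test matrix once per size, then repeat only the
   upload + eigensolve so every call works on the same input. */
for(int i = 0; i < size; i++) {
    for(int j = i; j < size; j++) {
        double random;
        int idist = 1, n = 1;                 // uniform (0,1), one value per call
        dlarnv_(&idist, iseed, &n, &random);
        A[i*size+j] = random;                 // mirror so A is symmetric
        A[j*size+i] = random;
    }
}

for(int count = 0; count < NB_BOUCLE; count++) {
    // every repetition starts from the same host data
    err = cudaMemcpy(Ad, A, size*size*sizeof(double), cudaMemcpyHostToDevice);
    checkCudaError(err);

    start_timer(&timer);
    culaStatus st = culaDeviceDsyev('V', 'U', size, Ad, size, valsd);
    stop_timer(&timer, &temps_diago);
    checkStatus(st);

    printf("%6d | run %d | %12.3f ms\n", size, count, temps_diago);
}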
john_silver
 
Posts: 2
Joined: Thu Jun 06, 2013 5:46 am

