CULA Device

Posted: Wed Jul 17, 2013 8:32 am
by jezz0r
Hi,
I'm getting much slower results when launching culaDeviceSgesv than culaSgesv on very similar matrices/RHSs. Why is this?

Re: CULA Device

Posted: Wed Jul 17, 2013 8:42 am
by jezz0r
Actually, I just checked on identical systems and the problem is the same: the device version is about three times slower.

Re: CULA Device

Posted: Wed Jul 17, 2013 8:53 am
by john
The host interface takes certain liberties with how the data sits on the card (since we own that data, not the user). Sometimes that works out to a decent little speed boost, but not usually 3x. It's hard to say without more detail from you.

Re: CULA Device

Posted: Thu Jul 18, 2013 6:07 am
by jezz0r
Ok, so here's my (test version of) solution function:
Code:
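template<int N, int M>
bool matrix<N,M>::CULASolve(float *b)
{
#ifdef USECUDA
   // Pivot indices for the host-interface solve
   int IPIV[N];

   // Host copy of the matrix for culaSgesv, device copy for culaDeviceSgesv
   // (gesv overwrites its matrix argument, so each solve gets its own copy)
   float *dTemp;
   float *temp = new float[N*N];
   memcpy(temp,values,sizeof(float)*N*N);
   cudaMalloc((void**)&dTemp,N*N*sizeof(float));
   cudaMemcpy(dTemp, values, N*N*sizeof(float), cudaMemcpyHostToDevice);

   // Pivot indices on the device
   int *dIPIV;
   cudaMalloc((void**)&dIPIV, (N)*sizeof(int));

   // Right-hand side on the device
   float *dB;
   cudaMalloc((void**)&dB, N*sizeof(float));
   cudaMemcpy(dB,b,N*sizeof(float), cudaMemcpyHostToDevice);

   // time() only has one-second resolution
   time_t newtime,oldtime;
   time(&oldtime);

   // Device interface: data is already on the card, so transfers are not timed
   checkStatus(culaDeviceSgesv(N,1,dTemp,N,dIPIV,dB,N));

   time(&newtime);
   printf("Device time: %d \n",(int)difftime(newtime, oldtime));
   oldtime=newtime;

   // Host interface: CULA moves the data to and from the card itself
   checkStatus(culaSgesv(N,1,temp,N,IPIV,b,N));

   time(&newtime);
   printf("Host time: %d \n", (int)difftime(newtime, oldtime));

   delete [] temp;
   cudaFree(dIPIV);
   cudaFree(dB);
   cudaFree(dTemp);
   return false;
#else
   return LUPsolve(b);
#endif
}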

Not the prettiest code in the world, but I think you get the idea.

My output is like this:

Host time: 5
Device time: 18
Host time: 5
Device time: 18
Host time: 5
Device time: 18
Host time: 5
Device time: 17
Host time: 6
Device time: 17

The system is 5145x5145, with a RHS of small numbers and zeroes, and a matrix that is mostly low numbers, except for a largish block of zeroes in the bottom right-hand corner, and one or two numbers several orders of magnitude larger. I can give you an example of it for a 30x30 version if you really want.

Re: CULA Device

Posted: Thu Jul 18, 2013 9:56 am
by john
Be sure to include a "warmup" run in your testing. The first hit to the GPU incurs one-time costs, such as kernels being loaded down to the card.
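
For illustration, here is a minimal sketch of a timing helper that does untimed warmup runs before measuring. The name meanSecondsAfterWarmup and its parameters are made up for this example and are not part of CULA or CUDA. It uses std::chrono::steady_clock rather than time(), since time() only resolves whole seconds. The sketch is plain C++; for GPU work the callable would presumably need to end with a synchronizing call such as cudaDeviceSynchronize() so the clock covers the whole solve (whether culaDeviceSgesv already blocks until completion is not stated in this thread).

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: run `work` `warmups` times without timing it, then
// return the mean wall-clock seconds per call over `reps` timed calls.
template <typename F>
double meanSecondsAfterWarmup(F&& work, int warmups, int reps) {
    assert(reps > 0);
    // Untimed runs absorb one-time costs (e.g. kernels being loaded).
    for (int i = 0; i < warmups; ++i) work();

    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) work();
    const auto stop = std::chrono::steady_clock::now();

    return std::chrono::duration<double>(stop - start).count() / reps;
}
```

A call might look like meanSecondsAfterWarmup([&]{ /* restore matrix and RHS, then solve */ }, 1, 5). Because gesv overwrites both the matrix and the right-hand side, each timed call has to start from a fresh copy of the inputs, and that copy is then included in the measurement unless it is timed separately.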

Re: CULA Device

Posted: Fri Jul 19, 2013 1:44 am
by jezz0r
To be clear, that is the output from one program, and does not even include the first run. It solves this thing repeatedly, and updates certain values from the result.

Re: CULA Device

Posted: Fri Jul 19, 2013 5:39 am
by john
It's impossible to really be helpful without a complete test program with data, but I can keep giving one-off suggestions. You should try padding your matrix to a multiple of 16, 32, 64, etc. (try a few different ones to see what your GPU likes). Just remember to make the padded portion into the identity matrix rather than just zeroes (all zeroes would make the matrix singular). You could also just pad LDA, but I find it easier to do N=LDA.
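
The padding suggestion can be sketched roughly as follows, assuming a column-major N x N matrix as LAPACK-style routines expect. The helper names roundUp, padWithIdentity and padRhs are hypothetical and only for illustration. The padded system has the block form [A 0; 0 I], so with a zero-padded right-hand side the first N entries of the solution are the solution of the original system and the extra entries come out as zero.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Round n up to the next multiple of `multiple`.
inline int roundUp(int n, int multiple) {
    assert(n >= 0 && multiple > 0);
    return ((n + multiple - 1) / multiple) * multiple;
}

// Copy an n x n column-major matrix into an np x np column-major buffer.
// The added rows and columns are zero except for ones on the diagonal, so
// the padded matrix is nonsingular whenever the original is.
inline std::vector<float> padWithIdentity(const float* a, int n, int np) {
    assert(np >= n);
    std::vector<float> out(static_cast<std::size_t>(np) * np, 0.0f);
    for (int col = 0; col < n; ++col)
        for (int row = 0; row < n; ++row)
            out[static_cast<std::size_t>(col) * np + row] =
                a[static_cast<std::size_t>(col) * n + row];
    for (int i = n; i < np; ++i)
        out[static_cast<std::size_t>(i) * np + i] = 1.0f;
    return out;
}

// Pad a length-n right-hand side with zeroes up to length np.
inline std::vector<float> padRhs(const float* b, int n, int np) {
    assert(np >= n);
    std::vector<float> out(static_cast<std::size_t>(np), 0.0f);
    for (int i = 0; i < n; ++i) out[i] = b[i];
    return out;
}
```

For the 5145x5145 system in this thread, roundUp(5145, 16) gives 5152, so the padded buffers would be passed to the solver with N = LDA = 5152 and only the first 5145 entries of the result read back.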