CULA Device

General CULA Dense (LAPACK & BLAS) support and troubleshooting. Use this forum if you are having a general problem or have encountered a bug.

CULA Device

Postby jezz0r » Wed Jul 17, 2013 8:32 am

Hi,
I'm getting much slower results from culaDeviceSgesv than from culaSgesv on very similar matrices/RHSs. Why is this?
jezz0r
 
Posts: 5
Joined: Tue Jul 02, 2013 6:39 am

Re: CULA Device

Postby jezz0r » Wed Jul 17, 2013 8:42 am

Actually, I just checked on identical systems and the problem is the same: the device version is about three times as slow.
jezz0r
 
Posts: 5
Joined: Tue Jul 02, 2013 6:39 am

Re: CULA Device

Postby john » Wed Jul 17, 2013 8:53 am

The host interface takes certain liberties with how the data sits on the card (since we own that data, not the user). Sometimes it can work out to a decent little speed boost, but not usually 3x. It's hard to say without more detail from you.
john
Administrator
 
Posts: 587
Joined: Thu Jul 23, 2009 2:31 pm

Re: CULA Device

Postby jezz0r » Thu Jul 18, 2013 6:07 am

OK, so here's my solution function (test version):
Code:
template<int N, int M>
bool matrix<N,M>::CULASolve(float *b)
{
#ifdef USECUDA
   int IPIV[N];

   // Host copy of the matrix so the host-interface solve below gets its own data.
   float *dTemp;
   float *temp = new float[N*N];
   memcpy(temp, values, sizeof(float)*N*N);

   // Device copies of the matrix, pivot array, and right-hand side.
   cudaMalloc((void**)&dTemp, N*N*sizeof(float));
   cudaMemcpy(dTemp, values, N*N*sizeof(float), cudaMemcpyHostToDevice);

   int *dIPIV;
   cudaMalloc((void**)&dIPIV, N*sizeof(int));

   float *dB;
   cudaMalloc((void**)&dB, N*sizeof(float));
   cudaMemcpy(dB, b, N*sizeof(float), cudaMemcpyHostToDevice);

   time_t newtime, oldtime;
   time(&oldtime);

   // Device interface: operands are already resident on the GPU.
   checkStatus(culaDeviceSgesv(N, 1, dTemp, N, dIPIV, dB, N));

   time(&newtime);
   printf("Device time: %d \n", (int)difftime(newtime, oldtime));
   oldtime = newtime;

   // Host interface: CULA handles the transfers and data layout itself.
   checkStatus(culaSgesv(N, 1, temp, N, IPIV, b, N));

   time(&newtime);
   printf("Host time: %d \n", (int)difftime(newtime, oldtime));

   delete [] temp;
   cudaFree(dIPIV);
   cudaFree(dB);
   cudaFree(dTemp);
   return false;
#else
   return LUPsolve(b);
#endif
}

Not the prettiest code in the world, but I think you get the idea.

My output is like this:

Host time: 5
Device time: 18
Host time: 5
Device time: 18
Host time: 5
Device time: 18
Host time: 5
Device time: 17
Host time: 6
Device time: 17

The system is 5145x5145, with a RHS of small numbers and zeroes, and a matrix that is mostly low numbers, except for a largish block of zeroes in the bottom right-hand corner and one or two numbers several orders of magnitude larger. I can give you an example of a 30x30 version if you really want.
jezz0r
 
Posts: 5
Joined: Tue Jul 02, 2013 6:39 am

Re: CULA Device

Postby john » Thu Jul 18, 2013 9:56 am

Be sure to include a "warmup" run in your testing. The first hit to the GPU incurs one-time costs, like kernels being loaded down to the card.
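
Untested sketch of what I mean; solveOnce() and runs are just placeholders for your device-interface solve above and however many repeats you want to average over:

Code:
// One untimed call first, then time the repeats.
// solveOnce() stands in for the culaDeviceSgesv path in your function above.
solveOnce();                          // warmup: absorbs kernel load / context setup

time_t t0, t1;
time(&t0);
for (int i = 0; i < runs; ++i)
    solveOnce();                      // only these calls are timed
time(&t1);
printf("Average device time: %f s\n", difftime(t1, t0) / runs);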
john
Administrator
 
Posts: 587
Joined: Thu Jul 23, 2009 2:31 pm

Re: CULA Device

Postby jezz0r » Fri Jul 19, 2013 1:44 am

To be clear, that is the output from one program, and does not even include the first run. It solves this thing repeatedly, and updates certain values from the result.
jezz0r
 
Posts: 5
Joined: Tue Jul 02, 2013 6:39 am

Re: CULA Device

Postby john » Fri Jul 19, 2013 5:39 am

It's impossible to really be helpful without a complete test program with data, but I can keep giving one-off suggestions. You should try padding your matrix to a multiple of 16, 32, 64, etc. (try a few different ones to see what your GPU likes). Just remember to make the padded portion into the identity matrix rather than just zeroes (all zeroes would make the system singular). You could also just pad LDA, but I find it easier to set N = LDA.
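
Untested sketch of the padding idea, assuming your values array is stored column-major (which is what CULA expects); pad, NP, aPad, and bPad are just illustrative names:

Code:
// Pad an N x N system up to NP, the next multiple of `pad`, putting 1s on the
// diagonal of the padded block so the system stays non-singular.
int pad = 32;                                    // try 16 / 32 / 64
int NP  = ((N + pad - 1) / pad) * pad;

std::vector<float> aPad(NP * NP, 0.0f);
std::vector<float> bPad(NP, 0.0f);

for (int j = 0; j < N; ++j)                      // copy each original column
    memcpy(&aPad[j * NP], &values[j * N], N * sizeof(float));
for (int k = N; k < NP; ++k)                     // identity in the padded corner
    aPad[k * NP + k] = 1.0f;
memcpy(&bPad[0], b, N * sizeof(float));

// Solve the NP x NP system with N = LDA = NP, then read back the first N entries of bPad.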
john
Administrator
 
Posts: 587
Joined: Thu Jul 23, 2009 2:31 pm

