Perform parallel culaZgemm in resource constrained setup ...

General CULA Dense (LAPACK & BLAS) support and troubleshooting. Use this forum if you are having a general problem or have encountered a bug.


Postby samarawickrama » Wed Apr 03, 2013 5:03 pm

Hi,

I have implemented culaZgemm, and now I need to perform this operation 256 times in parallel.

1) Is there a way to specify how many blocks this function should use?
2) What is the best kernel template for performing 256 culaZgemm operations in parallel? (My matrices are around 100x100.)
3) Please see the following template:

++++++++++++++++++++++++++++++++++++++++++++++++

//Device
template <int BLOCK_SIZE> __global__ void
matrixMulCUDA(float *C, float *A, float *B, ...)
{
//Perform culaZgemm(...);
}

//Host
matrixMulCUDA<...><<< grid, threads >>>(c, a, b, ...);

+++++++++++++++++++++++++++++++++++++++++++++++++

Can't I use a template like the one above to constrain the number of blocks used by culaZgemm?

4) Is there any sample program/reference design of this kind?

Thank you.
samarawickrama
 
Posts: 1
Joined: Mon Mar 25, 2013 6:58 pm

Re: Perform parallel culaZgemm in resource constrained setup ...

Postby john » Thu Apr 04, 2013 6:07 am

Hello, you should consider checking out the "batched" calls in NVIDIA CUBLAS. I think you'll find what you're looking for there.
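For reference, a minimal host-side sketch of the batched approach suggested above, using CUBLAS's cublasZgemmBatched. The matrix size (100x100) and batch count (256) are taken from the original post; all variable names are illustrative, and error checking is abbreviated. Note that culaZgemm is a host-callable library routine, so it cannot be invoked from inside a __global__ kernel as in the template above; the batched CUBLAS call replaces the hand-written kernel entirely.

```cpp
// Sketch: 256 independent 100x100 double-complex GEMMs in one batched call.
// Assumes a CUDA toolkit with cuBLAS; error handling abbreviated for clarity.
#include <cublas_v2.h>
#include <cuComplex.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int n = 100;      // matrix dimension (from the post: ~100x100)
    const int batch = 256;  // number of independent GEMMs (from the post)
    const size_t matBytes = sizeof(cuDoubleComplex) * n * n;

    // Allocate one contiguous slab per operand; matrix i lives at offset i*n*n.
    cuDoubleComplex *dA, *dB, *dC;
    cudaMalloc(&dA, matBytes * batch);
    cudaMalloc(&dB, matBytes * batch);
    cudaMalloc(&dC, matBytes * batch);
    // (Fill dA and dB with your input matrices here, e.g. via cudaMemcpy.)

    // Build per-matrix pointer arrays on the host, then copy them to the device,
    // since cublasZgemmBatched takes device arrays of matrix pointers.
    std::vector<const cuDoubleComplex*> hA(batch), hB(batch);
    std::vector<cuDoubleComplex*> hC(batch);
    for (int i = 0; i < batch; ++i) {
        hA[i] = dA + (size_t)i * n * n;
        hB[i] = dB + (size_t)i * n * n;
        hC[i] = dC + (size_t)i * n * n;
    }
    const cuDoubleComplex **dAarr, **dBarr;
    cuDoubleComplex **dCarr;
    cudaMalloc(&dAarr, sizeof(cuDoubleComplex*) * batch);
    cudaMalloc(&dBarr, sizeof(cuDoubleComplex*) * batch);
    cudaMalloc(&dCarr, sizeof(cuDoubleComplex*) * batch);
    cudaMemcpy(dAarr, hA.data(), sizeof(cuDoubleComplex*) * batch, cudaMemcpyHostToDevice);
    cudaMemcpy(dBarr, hB.data(), sizeof(cuDoubleComplex*) * batch, cudaMemcpyHostToDevice);
    cudaMemcpy(dCarr, hC.data(), sizeof(cuDoubleComplex*) * batch, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const cuDoubleComplex alpha = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex beta  = make_cuDoubleComplex(0.0, 0.0);

    // One call launches all 256 GEMMs: C[i] = alpha * A[i] * B[i] + beta * C[i].
    // cuBLAS schedules the blocks internally, so there is no need (or way)
    // to constrain the block count through a kernel template.
    cublasZgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       n, n, n, &alpha,
                       dAarr, n, dBarr, n, &beta,
                       dCarr, n, batch);

    cublasDestroy(handle);
    cudaFree(dAarr); cudaFree(dBarr); cudaFree(dCarr);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Because all matrices are the same size, the pointer arrays could also be avoided entirely with the strided variant (cublasZgemmStridedBatched) in later CUDA toolkits.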
john
Administrator
 
Posts: 587
Joined: Thu Jul 23, 2009 2:31 pm

