
Perform parallel culaZgemm in resource constrained setup ...

PostPosted: Wed Apr 03, 2013 5:03 pm
by samarawickrama
Hi,

I have implemented culaZgemm, and now I need to perform this operation (i.e., culaZgemm) as 256 parallel computations.

1) Is there a way to specify how many blocks this function should utilize?
2) What is the best kernel template for performing 256 culaZgemm operations in parallel? (My matrices are around 100x100.)
3) Please see the following template:

++++++++++++++++++++++++++++++++++++++++++++++++

// Device
template <int BLOCK_SIZE> __global__ void
matrixMulCUDA(float *C, float *A, float *B, ...)
{
    // Perform culaZgemm(...);
}

// Host
matrixMulCUDA<...><<< grid, threads >>>(c, a, b, ...);

+++++++++++++++++++++++++++++++++++++++++++++++++

Can't I use a template like the one above to constrain the number of blocks utilized by culaZgemm?

4) Is there any sample program or reference design of this kind?

Thank you.

Re: Perform parallel culaZgemm in resource constrained setup ...

PostPosted: Thu Apr 04, 2013 6:07 am
by john
Hello, you should consider checking out the "batched" calls in NVIDIA CUBLAS. I think you'll find what you're looking for there.
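For 256 independent 100x100 products, the relevant CUBLAS call is cublasZgemmBatched, which performs all the multiplies in a single launch and lets the library choose the block/grid configuration internally (you don't constrain blocks yourself). A minimal host-side sketch, assuming square n x n matrices packed back-to-back in one allocation per operand, with error checking omitted for brevity:

```cuda
// Sketch: 256 independent C_i = A_i * B_i double-complex products via
// cuBLAS batched GEMM. Assumes CUDA toolkit + cuBLAS v2 API; the data
// fill and cleanup are left as comments.
#include <cublas_v2.h>
#include <cuComplex.h>
#include <cuda_runtime.h>

int main(void)
{
    const int n = 100;      // matrices are n x n (from the question)
    const int batch = 256;  // number of parallel zgemm operations

    cublasHandle_t handle;
    cublasCreate(&handle);

    // One big allocation per operand; matrix i starts at offset i*n*n.
    cuDoubleComplex *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(cuDoubleComplex) * n * n * batch);
    cudaMalloc(&dB, sizeof(cuDoubleComplex) * n * n * batch);
    cudaMalloc(&dC, sizeof(cuDoubleComplex) * n * n * batch);
    // ... copy your input data into dA and dB here ...

    // cublasZgemmBatched takes device-resident arrays of pointers,
    // one pointer per matrix in the batch.
    cuDoubleComplex *hA[256], *hB[256], *hC[256];
    for (int i = 0; i < batch; ++i) {
        hA[i] = dA + i * n * n;
        hB[i] = dB + i * n * n;
        hC[i] = dC + i * n * n;
    }
    const cuDoubleComplex **dAarr, **dBarr;
    cuDoubleComplex **dCarr;
    cudaMalloc(&dAarr, sizeof(hA));
    cudaMalloc(&dBarr, sizeof(hB));
    cudaMalloc(&dCarr, sizeof(hC));
    cudaMemcpy(dAarr, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dBarr, hB, sizeof(hB), cudaMemcpyHostToDevice);
    cudaMemcpy(dCarr, hC, sizeof(hC), cudaMemcpyHostToDevice);

    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);

    // All 256 products run concurrently in one call; the library picks
    // the kernel configuration, so no <<< grid, threads >>> is needed.
    cublasZgemmBatched(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       n, n, n,
                       &one,  dAarr, n,
                       dBarr, n,
                       &zero, dCarr, n,
                       batch);

    // ... copy dC back to the host, then cudaFree all allocations
    //     and cublasDestroy(handle) ...
    return 0;
}
```

Note that a host library call like culaZgemm (or cublasZgemm) cannot be invoked from inside a __global__ kernel as in the template above; the batched interface is the intended way to run many small GEMMs in parallel from host code.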