I have culaZgemm working, and now I need to run 256 independent culaZgemm operations in parallel.

1) Is there a way to specify how many blocks this function should use?

2) What is the best kernel template for performing 256 culaZgemm operations in parallel? (My matrices are around 100x100.)

3) Please see the following template:

++++++++++++++++++++++++++++++++++++++++++++++++

// Device
template <int BLOCK_SIZE> __global__ void
matrixMulCUDA(float *C, float *A, float *B, ...)
{
    // Perform culaZgemm(...);
}

// Host
matrixMulCUDA<...><<< grid, threads >>>(c, a, b, ...);

+++++++++++++++++++++++++++++++++++++++++++++++++

Can I use a template like the one above to constrain the number of blocks that culaZgemm uses?

4) Is there a sample program or reference design of this kind?
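To be precise about what I mean by "256 operations in parallel", here is a minimal CPU reference sketch of the result I am after: a batch of independent double-precision complex products C[b] = A[b] * B[b]. It is only meant to pin down the computation, not to suggest how it should be done on the GPU. The names zgemm_reference and batched_zgemm_reference are hypothetical helpers of mine, not CULA functions, and I am assuming square, column-major matrices stored back to back in one buffer.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Z = std::complex<double>;

// Hypothetical helper (not part of CULA): C = A * B for one n x n
// column-major matrix of double-precision complex values.
void zgemm_reference(std::size_t n, const Z* A, const Z* B, Z* C)
{
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < n; ++i) {
            Z sum(0.0, 0.0);
            for (std::size_t k = 0; k < n; ++k) {
                sum += A[i + k * n] * B[k + j * n];
            }
            C[i + j * n] = sum;
        }
    }
}

// Hypothetical helper: `batch` independent products. Matrix b of each
// operand is assumed to start at offset b * n * n in its buffer.
void batched_zgemm_reference(std::size_t batch, std::size_t n,
                             const std::vector<Z>& A,
                             const std::vector<Z>& B,
                             std::vector<Z>& C)
{
    assert(A.size() == batch * n * n && B.size() == batch * n * n);
    C.assign(batch * n * n, Z(0.0, 0.0));
    // Each iteration is independent of the others; this loop is the part
    // I would like to run concurrently on the GPU.
    for (std::size_t b = 0; b < batch; ++b) {
        const std::size_t off = b * n * n;
        zgemm_reference(n, A.data() + off, B.data() + off, C.data() + off);
    }
}
```

In my case batch would be 256 and n would be around 100.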

Thank you.

