
culaDeviceMalloc v. cudaMalloc

PostPosted: Wed Jul 04, 2012 5:27 am
by modelsciences
Since the majority of my work involves dense linear algebra, I use culaDeviceMalloc as the default allocator for arrays. I have noticed that the largest array I can allocate with it falls well below the available memory on the device as reported by cudaMemGetInfo. Before I investigate this more thoroughly, I wonder what the advantages and disadvantages of each are, since array organisation has nothing to do with allocation (except if you care about pitch, but that is a higher level of abstraction and should be handled via the leading dimensions in the BLAS/LAPACK calls).

1. Would the configuration of a matrix (m, n, and element size) constrain the total size of array it is possible to allocate, beyond the obvious m*n*elsize calculation?

2. Could I substitute cudaMalloc with impunity?
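
For concreteness, this is the kind of substitution I have in mind; plain CUDA runtime calls only, and the lda handling is my assumption rather than anything I have verified against CULA's padding:

Code:
/* Sketch of the substitution: allocate an m-by-n column-major matrix
 * with plain cudaMalloc and pass lda = m to the BLAS/LAPACK call.
 * No CULA-chosen pitch here, so the leading dimension is just m. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int m = 4096, n = 4096;
    size_t elsize = sizeof(double);
    double* dA = NULL;

    cudaError_t err = cudaMalloc((void**)&dA, (size_t)m * n * elsize);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int lda = m;  /* unpadded: leading dimension equals row count */
    /* ... hand dA and lda to the CULA/BLAS routine here ... */

    cudaFree(dA);
    return 0;
}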

-- Simon

Re: culaDeviceMalloc v. cudaMalloc

PostPosted: Thu Jul 05, 2012 8:14 am
by modelsciences
Well, it's not as bad as I thought: maybe a 2-3 MB difference between cudaMalloc and culaDeviceMalloc in terms of the largest array that can be allocated. Not a big deal.

On a more pressing note, however, I notice that after a failed allocation there will be a number of spurious "errors" from subsequent calls.

Could this be due to some asynchronous behaviour behind the scenes? Is there a way to ensure that the error status can be reset? (I guess cudaGetLastError does this.) The idea of global state in the library, plus asynchronous (and, I guess, threaded and streamed) kernels behind the scenes, keeps me awake at night.
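
To make the question concrete, this is the pattern I would hope works, assuming the failure leaves the ordinary non-sticky kind of error that cudaGetLastError returns and clears:

Code:
/* A failed cudaMalloc leaves cudaErrorMemoryAllocation as the last
 * error; if it is the non-sticky kind, one cudaGetLastError call
 * should both report it and reset the error state. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    void* p = NULL;
    /* Deliberately oversized request to force an allocation failure. */
    cudaError_t err = cudaMalloc(&p, (size_t)1 << 62);
    printf("cudaMalloc:       %s\n", cudaGetErrorString(err));

    /* First call returns the pending error AND resets it... */
    printf("cudaGetLastError: %s\n", cudaGetErrorString(cudaGetLastError()));
    /* ...so a second call should report success again. */
    printf("cudaGetLastError: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}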

I also get "spurious" errors from culaSelectDevice after a memory allocation failure.

-- Simon

Re: culaDeviceMalloc v. cudaMalloc

PostPosted: Thu Jul 05, 2012 8:21 am
by modelsciences
I guess all this would be simply solved if cudaMemGetInfo returned the largest contiguous block of free memory as a third parameter, but that's one for NVIDIA. A free-memory stat without that information is useless.
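
Until then, the only thing I can think of is probing for it myself. A rough sketch; the helper name is mine:

Code:
/* Find the largest allocatable contiguous block by binary-searching
 * the size with trial cudaMallocs. Hypothetical helper, not an API. */
#include <cuda_runtime.h>

static size_t largest_contiguous_block(void)
{
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);

    size_t lo = 0, hi = free_bytes;
    while (lo < hi) {
        size_t mid = lo + (hi - lo + 1) / 2;
        void* p = NULL;
        if (cudaMalloc(&p, mid) == cudaSuccess) {
            cudaFree(p);
            lo = mid;            /* mid bytes fit; try bigger */
        } else {
            cudaGetLastError();  /* clear the OOM error */
            hi = mid - 1;        /* mid bytes don't fit; try smaller */
        }
    }
    return lo;
}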

Any ideas?

Re: culaDeviceMalloc v. cudaMalloc

PostPosted: Thu Jul 05, 2012 8:29 am
by modelsciences
Well, to complete the thread with a workaround: if I do a device reset after an allocation failure, I can get my optimistic allocator to work and return the biggest chunk of contiguous memory. I use this to calibrate my sharding of jobs across multiple GPUs. I bet I am not the only one doing this... and this solution seems very inelegant.
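
For anyone following along, a sketch of the workaround; probe_largest_chunk is my own helper (the binary-search probe from my previous post), not a CULA or CUDA call:

Code:
/* Per device: optimistically probe for the biggest chunk and fall
 * back to cudaDeviceReset when the context is left in a bad state. */
#include <cuda_runtime.h>
#include <stdio.h>

/* Stand-in for the binary-search probe sketched earlier. */
static size_t probe_largest_chunk(void)
{
    size_t free_b = 0, total_b = 0, lo = 0, hi;
    cudaMemGetInfo(&free_b, &total_b);
    hi = free_b;
    while (lo < hi) {
        size_t mid = lo + (hi - lo + 1) / 2;
        void* p = NULL;
        if (cudaMalloc(&p, mid) == cudaSuccess) { cudaFree(p); lo = mid; }
        else { cudaGetLastError(); hi = mid - 1; }
    }
    return lo;
}

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);
        size_t chunk = probe_largest_chunk();
        if (chunk == 0) {
            cudaDeviceReset();   /* inelegant, but it clears the slate */
            chunk = probe_largest_chunk();
        }
        printf("device %d: ~%zu MB contiguous\n", d, chunk >> 20);
        /* ...size this device's share of the job from 'chunk'... */
    }
    return 0;
}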

-- Simon

Re: culaDeviceMalloc v. cudaMalloc

PostPosted: Tue Jul 24, 2012 11:49 am
by john
culaDeviceMalloc is very much a passthrough to cudaMalloc, with the exception that the padding (if any) is chosen to give higher performance. The padding amounts tend to be fairly small, so the overhead is low.
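
As an analogy only, cudaMallocPitch does something similar in spirit for 2D data: it pads the stride out to a value the hardware likes. To be clear, this illustrates the padding idea and is not what culaDeviceMalloc calls internally:

Code:
/* Analogy: cudaMallocPitch pads each column's stride, much as a
 * padded leading dimension does for a dense matrix. */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int m = 1000, n = 1000;          /* column-major m-by-n matrix */
    size_t pitch = 0;                /* padded column stride, bytes */
    double* dA = NULL;

    /* width = one column of doubles, height = number of columns */
    cudaMallocPitch((void**)&dA, &pitch, m * sizeof(double), n);

    int lda = (int)(pitch / sizeof(double));  /* padded leading dim */
    printf("m = %d, lda = %d (padding = %d rows)\n", m, lda, lda - m);

    cudaFree(dA);
    return 0;
}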

But I personally have run into the situation you described, where fragmentation leads to being unable to allocate large pieces of memory. The best advice I can offer is to perform big allocations before small, and to reuse them if possible.
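
A toy illustration of that ordering; the sizes and names are made up:

Code:
/* Grab the big, long-lived workspace first, while the heap is
 * unfragmented; let small short-lived buffers come and go around it. */
#include <cuda_runtime.h>

int main(void)
{
    void* workspace = NULL;                      /* big, long-lived */
    void* scratch   = NULL;                      /* small, short-lived */

    /* Big allocation first... */
    cudaMalloc(&workspace, (size_t)512 << 20);   /* 512 MB */

    for (int iter = 0; iter < 100; ++iter) {
        /* ...then the small buffers. Reusing the big workspace
         * instead of reallocating it each iteration avoids punching
         * holes in the middle of the heap. */
        cudaMalloc(&scratch, (size_t)1 << 20);   /* 1 MB */
        /* ... do work using workspace + scratch ... */
        cudaFree(scratch);
    }

    cudaFree(workspace);
    return 0;
}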