
Multi-GPU CULA

PostPosted: Thu Jul 08, 2010 11:52 am
by wgomez
We've been trying to get a multi-GPU program working but have been running into inconsistent errors with CULA. The program runs into problems when asked to create 6 or 10 threads: it consistently completes with 5 threads, but crashes on the fifth or sixth thread when asked for 6. For some reason it also completes fine with 9 threads, yet crashes with 10. The error is a culaRuntimeError with error code 17.

We are using CULA 1.2 and calling syev on the device. We've tried moving data around, switching to the host interface, and moving our calls to culaInitialize() and culaShutdown(). Our machine is using 4 Tesla C1060s. We also checked that the input data is correct.

Since we got the most stability when calling the initialization and shutdown immediately before and after the culaDeviceDsyev function call, we were wondering what the best practices for using the initialization and shutdown functions are. When and where should they be used? We found that if a thread called the shutdown function during another thread's run, the thread would return a culaNotInitialized error. Shouldn't the calls be thread specific?

Any help would be appreciated. Really frustrated at this point.

Re: Multi-GPU CULA

PostPosted: Thu Jul 08, 2010 1:13 pm
by kyle
Have you tried the multi-GPU example provided with CULA? By default it's configured to launch one thread per device. You can easily configure it to run multiple threads per device, though; just change line 99 to scale the thread count by a factor like 1.5 or 2 threads per GPU.

Here are the results of running 5 threads on 2 devices.

Code:
Found 2 devices, will launch 5 threads

Thread 0 - Launched
Thread 0 - Binding to device 0
Thread 1 - Launched
Thread 1 - Binding to device 1
Thread 2 - Launched
Thread 2 - Binding to device 0
Thread 4 - Launched
Thread 4 - Binding to device 0
Thread 3 - Launched
Thread 3 - Binding to device 1
Thread 0 - Allocating matrices
Thread 0 - Initializing CULA
Thread 2 - Allocating matrices
Thread 2 - Initializing CULA
Thread 1 - Allocating matrices
Thread 1 - Initializing CULA
Thread 3 - Allocating matrices
Thread 4 - Allocating matrices
Thread 3 - Initializing CULA
Thread 4 - Initializing CULA
Thread 4 - Calling culaSgeqrf
Thread 2 - Calling culaSgeqrf
Thread 1 - Calling culaSgeqrf
Thread 3 - Calling culaSgeqrf
Thread 0 - Calling culaSgeqrf
Thread 3 - Shutting down CULA
Thread 1 - Shutting down CULA
Thread 2 - Shutting down CULA
Thread 4 - Shutting down CULA
Thread 0 - Shutting down CULA


Your assumption about the calls being thread-specific is correct; every thread should follow the sequence "initialize --> bind --> call --> shutdown".

We are trying to replicate your "culaNotInitialized" error at this point. I'll let you know if we find out any more information.

Re: Multi-GPU CULA

PostPosted: Thu Jul 08, 2010 3:09 pm
by kyle
We have narrowed the error down to a context-management bug in culaShutdown(). A fix is in the works, but in the meantime you can most likely omit culaShutdown() without problems.

Re: Multi-GPU CULA

PostPosted: Fri Jul 09, 2010 9:04 am
by john
kyle wrote:We have narrowed the error down to a context-management bug in culaShutdown(). A fix is in the works, but in the meantime you can most likely omit culaShutdown() without problems.

Indeed, you'll leak a small amount of memory this way, but unless the program runs continuously (constantly launching and killing CULA threads) this shouldn't be an issue for now. The next release will correct culaShutdown().

Re: Multi-GPU CULA

PostPosted: Fri Jul 09, 2010 12:23 pm
by wgomez
Thanks for the replies.

We got our code to work without errors across multiple runs. We ended up adding a mutex that ensures only one thread uses CULA at a time, and it hasn't crashed since. Every thread that is about to call a CULA function locks the mutex, initializes CULA, makes the call, shuts down CULA, and then unlocks the mutex. Not a big fan of the solution, but for now it works.

We tried removing the culaShutdown() call at one point, but that didn't fix our particular problem. Unfortunately, we plan to continually create and shut down threads that use CULA, so simply skipping the shutdown call won't work for us.

One more question: is there a reason a call to culaDeviceDsyev() would spawn several extra threads? It appears to spawn 4 threads the first time it is called (by whichever thread gets there first), and 3 more the first time a subsequent thread calls it. I don't think it's causing a problem, but I was wondering why those threads appear.

Re: Multi-GPU CULA

PostPosted: Mon Jul 12, 2010 8:46 am
by john
Thank you for the reply and for the input; it has been quite valuable. We believe we have fixed the issue sufficiently, and in the next CULA version you will not need such workarounds. In future releases, each thread should call culaInitialize/culaShutdown as appropriate and you will not see CUBLAS errors. Please keep in mind that CULA uses CUBLAS internally, so cublasShutdown should not be called if you have any upcoming CULA calls.

For the DSYEV question, please keep in mind that CULA is a hybrid CPU/GPU library, so we use both the CPU and the GPU for certain portions of the code. On a multicore CPU, we will also attempt to use as many cores as necessary. The extra threads you have observed are likely for one-time bookkeeping and allocation.

Re: Multi-GPU CULA

PostPosted: Mon Jul 12, 2010 10:15 am
by wgomez
We're glad we could help improve CULA.

Just to be complete, our solution still throws a cudaError at some point during each culaInitialize() call. The call itself returns culaNoError and our program functions correctly, though, so we are moving forward.

Thanks for the reply about our DSYEV question. We suspected it had to do with having a multicore CPU, but we weren't sure whether the threads were meant to stick around until the entire program completes.

Re: Multi-GPU CULA

PostPosted: Tue Jul 13, 2010 7:40 am
by dan
Hi wgomez,

With regard to the exception being thrown, this is perfectly fine because no exception will ever propagate beyond the API boundary. Microsoft Visual Studio lists every exception it sees (even if it's not in your code), and this is likely what you're observing. This is expected behavior -- it's actually the CUDA runtime throwing the exception, not us.

Thanks for your input,

Dan

Re: Multi-GPU CULA

PostPosted: Tue Jul 13, 2010 8:31 am
by john
Just a small comment to add to Dan's post: we once looked into this exception ourselves because we noticed it just as you did. It seems it's thrown and handled several times by CUDA during normal operation. I'd only worry if one went unhandled, but that won't happen with CULA.