Initialisation problems

Postby d_hastie » Tue Mar 15, 2011 9:30 am

Hi

I have been using CULA on a desktop machine with no problems for a while.

We have recently bought licences for the department HPC. Each node on the HPC is set up with two Tesla GPUs, each configured in compute-exclusive mode.

Unfortunately, when trying to run things on the HPC we run into the following problem. Suppose there are no jobs running on the HPC. If I submit a single job, it runs with no problem. However, if I submit 2 jobs, one job runs and the second job fails.

Here is the relevant code segment:
Code:
culaStatus culaStat;

//Initialise CULA
culaStat=culaInitialize();
cout << culaGetStatusString(culaStat) << endl;
cout << culaGetErrorInfo() << endl;

if(culaStat!=culaNoError){
     cout << "Unable to initialise GPU" << endl;
     cublasShutdown();
     culaShutdown();
     exit(1);
}


The results from the first job are:
Code:
No errors
0

and everything proceeds nicely.

On the other hand the results from the second job (if submitted at almost the same time as the first job) are:
Code:
Blas error, see culaGetErrorInfo for error code
1
Unable to initialise GPU

and the code exits as instructed.

From what I can tell, once the job starts running, culaInitialize always seems to be trying to select device 0. If there is nothing running this is fine, but if 2 jobs are submitted to the same node at the same time (or soon after each other) and they both try to grab device 0, the second job will fail, because the GPUs are in compute-exclusive mode. If two jobs are placed on the same node, I need one job to grab device 0 and the second to grab device 1.
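The contention described above can be reproduced without a GPU. The following is a minimal sketch, not CULA or CUDA code: `ExclusiveNode`, `jobFixedToDeviceZero` and `jobWithFallback` are hypothetical names that only model a node whose devices each accept one owner at a time, which is the assumption made here about compute-exclusive mode.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical model of one HPC node whose GPUs are in compute-exclusive mode:
// each device accepts at most one owner at a time. No CUDA or CULA calls here.
class ExclusiveNode {
public:
    explicit ExclusiveNode(int nDevices) : busy_(nDevices, false) {}

    int deviceCount() const { return static_cast<int>(busy_.size()); }

    // Returns true if the caller obtained the device, false if it is already owned.
    bool tryAcquire(int device) {
        assert(device >= 0 && device < deviceCount());
        std::lock_guard<std::mutex> lock(m_);
        if (busy_[device]) return false;
        busy_[device] = true;
        return true;
    }

private:
    std::mutex m_;
    std::vector<bool> busy_;
};

// A job that only ever asks for device 0. Returns the device obtained, or -1.
int jobFixedToDeviceZero(ExclusiveNode& node) {
    return node.tryAcquire(0) ? 0 : -1;
}

// A job that walks the device list until one is free. Returns the device obtained, or -1.
int jobWithFallback(ExclusiveNode& node) {
    for (int d = 0; d < node.deviceCount(); ++d) {
        if (node.tryAcquire(d)) return d;
    }
    return -1;
}
```

In this model, two jobs that both insist on device 0 give one success and one failure, while two jobs that fall back end up on devices 0 and 1. Whether the real library can fall back this way from a single thread is exactly the open question in this post.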

Initially I thought that the solution was to explicitly use culaSelectDevice to set which device to use. I thought I could try to set device 0 and then, if that failed, try to set device 1. But from what I have been reading, culaSelectDevice can only be called once from the program (assuming the program is single-threaded, which my code is). Wondering how to proceed, I gathered from what I read that the expected behaviour of culaInitialize, when culaSelectDevice has not been called, is to detect that device 0 is busy and automatically use device 1 instead. This does not appear to be happening, despite the fact that culaGetDeviceCount recognises that there are 2 devices.

As I said above, each job works on its own, so I am at a bit of a loss. I am not sure if I am doing something wrong (I am guessing this is the most likely explanation), or if there is possibly a bug in culaInitialize. Without seeing the source code it is very hard to determine exactly what that function does. Any suggestions would be welcome.

Many thanks
Dave
d_hastie
 
Posts: 3
Joined: Tue Nov 30, 2010 10:14 am

Re: Initialisation problems

Postby d_hastie » Wed Mar 16, 2011 5:00 am

Just an update on this. Despite everything I tried, I couldn't get round this using culaInitialize alone. In the end, however, by using a combination of culaSelectDevice, cudaThreadExit, and error checking, I was able to amend my code to manually make sure that the job was assigned to a free GPU.

Code:
int nDevices;
culaStatus culaStat;
culaStat = culaGetDeviceCount(&nDevices);
if(culaStat!=culaNoError){
    cout << "Error detecting how many devices" << endl;
    exit(-1);
}else{
    cout << "Detected " << nDevices << " gpu devices" << endl;
}

cudaThreadExit();
for(int i=0;i<nDevices;i++){
    culaStat=culaSelectDevice(i);
    if(culaStat!=culaNoError){
        cout << culaGetStatusString(culaStat) << endl;
        cout << culaGetErrorInfo() << endl;
        cudaThreadExit();
        continue;
    }
    culaStat=culaInitialize();
    if(culaStat!=culaNoError){
       cout << culaGetStatusString(culaStat) << endl;
       cout << culaGetErrorInfo() << endl;
       cublasShutdown();
       culaShutdown();
       cudaThreadExit();
       continue;
    }

    break;
}

if(culaStat!=culaNoError){
    cout << "Unable to initialise GPU" << endl;
    cublasShutdown();
    culaShutdown();
    exit(1);
}else{
    int device;
    culaGetExecutingDevice(&device);
    cout << "Successfully initialised GPU" << endl;
    cout << "Using device " << device << endl;
}
d_hastie
 
Posts: 3
Joined: Tue Nov 30, 2010 10:14 am

Re: Initialisation problems

Postby john » Wed Mar 16, 2011 8:44 am

Hi Dave,
We're going to take a day to try to reproduce and we'll get back to this thread with our thoughts. I think we have enough information here, but we will reply with questions if we don't.

John
john
Administrator
 
Posts: 587
Joined: Thu Jul 23, 2009 2:31 pm

Re: Initialisation problems

Postby d_hastie » Wed Mar 16, 2011 9:23 am

Hi John

Much appreciated.

Cheers
Dave
d_hastie
 
Posts: 3
Joined: Tue Nov 30, 2010 10:14 am

Re: Initialisation problems

Postby dan » Mon Mar 21, 2011 7:48 am

Hi Dave,

I think you're running into these errors because of how CUDA 3.2 and earlier bind host threads to devices. CUDA imposes a restriction: a single thread can be bound to only one device, and once bound, it cannot be rebound to another device unless you call cudaThreadExit (note that in CUDA 4 this restriction will be lifted completely). This leads to the requirement that multi-GPU programs be multi-threaded.

I don't think that the cudaThreadExit solution that you outlined will work robustly, because once you call cudaThreadExit you've destroyed the context that you had previously created for the currently bound device, which in turn leads to zero concurrency between the two devices.

The best way to solve this problem is to create a separate thread for each job that you want to issue. The multigpu example in our SDK shows how to call culaSelectDevice and culaInitialize in each thread. You could also wait until CUDA 4 (and the corresponding CULA version) is released, but the multi-threaded approach will solve your problem right now.
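As a rough illustration of the one-thread-per-device pattern, here is a minimal sketch using std::thread. It is not the SDK's multigpu example. `runOneThreadPerDevice` and the `bindAndRun` callback are hypothetical names, and the callback is only assumed to wrap the per-thread sequence of culaSelectDevice, culaInitialize, the actual work, and culaShutdown.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Starts one host thread per device index and waits for all of them.
// `bindAndRun` is a hypothetical stand-in: in a real program it would call
// culaSelectDevice(device) and culaInitialize() from inside the new thread,
// do the work, then call culaShutdown(), returning true on success.
// Returns one flag per device: 1 if that thread's callback succeeded, else 0.
std::vector<int> runOneThreadPerDevice(int nDevices,
                                       const std::function<bool(int)>& bindAndRun) {
    assert(nDevices >= 0);
    std::vector<int> ok(nDevices, 0);  // each thread writes only its own slot
    std::vector<std::thread> workers;
    workers.reserve(nDevices);
    for (int d = 0; d < nDevices; ++d) {
        workers.emplace_back([&ok, &bindAndRun, d] { ok[d] = bindAndRun(d) ? 1 : 0; });
    }
    for (std::thread& w : workers) w.join();  // wait before reading the results
    return ok;
}
```

Because each thread binds to a device only once, no thread ever needs to rebind, which is the point of the restriction described above. A device that is already owned by another process would simply show up as a 0 in the returned vector.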

Dan
dan
Administrator
 
Posts: 61
Joined: Thu Jul 23, 2009 2:29 pm
