Does CULA allow only one thread per GPU at any instant?

This is my first foray into GPU programming via CULA, so I apologise in advance if the answer to this is obvious...
I have a multi-threaded application that uses all 8 logical cores (total CPU usage is typically 95-100%) of a hyper-threaded quad-core i7. Most of the application's many threads call Math Kernel Library (MKL) BLAS/LAPACK functions, so there can be (and frequently are) many threads executing BLAS/LAPACK routines in parallel. I would like to use the GPU to accelerate this application, but my first attempt with CULA has failed: there is a bottleneck whenever multiple application threads try to use the GPU-accelerated BLAS/LAPACK routines concurrently. So, I have two questions:
1) With CULA, is it true that only one application thread can be active on a given GPU at any one time?
2) If the answer to (1) is 'yes', is this a restriction imposed by CULA, or is it a basic architectural restriction associated with CUDA?