The big news from NVIDIA last week was the release of the first Kepler card, the GeForce GTX 680. This card features a radical expansion of cores, from 512 to 1536 (3x!), although each core is clocked slower than previous generations. Since there is only a GeForce part available, this isn't a compute-oriented release, but CUDA still runs nicely on gaming parts in single precision. We have our 680 in-house and have started working with it, and hope to post some performance results in the near future.
The natural question from our users is "so when will CULA support Kepler?" As per our normal release cadence, we will release a Kepler-enabled CULA as soon as possible after the supporting version of CUDA goes final. Note that this doesn't include any RC versions of CUDA which traditionally become available prior to the final release.
The only downside we see to the new chip here is that the double precision performance is quite low (as is traditional for gaming chips), but the single precision numbers are exciting, and many of our users do their work primarily in single precision. It's been some time since we got a new chip, so we're diving in, tuning up our solves, and seeing what kind of results we can get! We look forward to future blog posts where we detail the performance of this new generation.
This week, I'd like to share a CUDA debugging story that some might find helpful or insightful.
Recently, while testing the latest nightly CULA build on our buildfarm, we noticed we were running into an "CUDA out of GPU memory" error after a large number of tests on certain GPUs. This was a new occurrence, and was happening on test sizes that should have easily fit within the memory of the GPU. Our first reaction was to believe that we had erroneously introduced a memory leak into the system. However, when running the cudaMemGetInfo() routine we saw that we still had well over 98% of our GPU memory available. Certainly more investigation needed to be done...
After some more testing, we quickly discovered that NVIDIA's CUBLAS helper routine, cublasDestroy_v2() was leaking a few KB of memory with every call. We promptly submitted a bug report to NVIDIA which was confirmed as an issue and is slated for a fix in the next CUDA release - pweh. However, that doesn't explain why we were still getting an "out of memory" error when it would take literally millions of calls to the leaky NVIDIA routine before we exhausted the memory on a Tesla C2070 with 6 GB.
After some more debugging, we determined we were running into the somewhat rare problem know as memory fragmentation where the GPU could not find a contiguous portion of memory to store the requested block of memory despite having a large amount of free memory overall.
Due to the way one of our tests was structured, we'd create a context, allocate a large chunk of memory, create another context, and then allocate another large chunk of memory. Upon freeing these resources, because of the NVIDIA leak, we had the contexts still floating around - one of which was now located in the middle of our memory space. After repeating this process a number of times, we ended up with a handful of small zombie contexts scattered about memory space. Because of this, it eventually became impossible to allocate a large chunk of contiguous memory.
As a work around, we were able to alleviate this problem by allocating the contexts when there was nothing else in the memory space. This caused the contexts to leak on the edge of memory rather than scattered in the middle.
This just goes to show that memory leaks of any size can be incredibly dangerous! I'd also like to note that CULA users should unaffected by this NVIDIA bug if you are using the interface properly. In the rare event you are running for multiple days, initializing the library millions of times, a call to cudaDeviceReset() will free the memory caused by the bug.
Let's start by saying that CUDA and Fortran aren't the best of friends out of the box. CUDA is a C library without true Fortran support, and Fortran isn't naturally attuned to C's value semantics. Since our users want to use our CULA Device interface routines to avoid transfers between the host and the GPU, those users also need to be able to allocate device memory. The best and easiest way, in our findings, is to use the Portland Groups's Fortran compiler, with the CUDA-Fortran language extensions. This makes CUDA a first-class citizen and so running CULA's Device interface is quite simple. Keep an eye on the upcoming issues of the Portland Group's newsletter, because we will be revising our old article about CULA + PGI integration there.
Now for those without the PGI compiler, the answer is the ISO_C_BINDING method for module writing, which allows Fortran to call into C code using the C types for pointers and with value semantics. Most newer Fortran compilers support this, and as of CULA R15 there will be available a cula_lapack_device module that takes advantage of this mode. That said, CUDA does not publish a formal module for ISO_C_BINDING integration, so you will need to write your own. Here are some sample definitions which can be pretty easily copied to produce the definitions for the CUDA routines you need.
MODULE CUDA_CONSTANTS USE ISO_C_BINDING ENUM, BIND(C) ENUMERATOR :: cudaMemcpyHostToHost=0, & cudaMemcpyHostToDevice, & cudaMemcpyDeviceToHost, & cudaMemcpyDeviceToDevice, & cudaNotUsedInFortran END ENUM END MODULE MODULE CUDA_MEMORY_MANAGEMENT IMPLICIT NONE INTERFACE INTEGER(C_INT) FUNCTION CUDA_MALLOC(BUFFER, SZ) & BIND(C,NAME="cudaMalloc") USE ISO_C_BINDING TYPE (C_PTR) :: BUFFER INTEGER (C_SIZE_T), VALUE :: SZ END FUNCTION END INTERFACE INTERFACE FUNCTION CUDA_MEMCPY(DST,SRC,CO,KI) RESULT(R) & BIND(C,NAME="cudaMemcpy") USE CUDA_CONSTANTS INTEGER (C_INT) :: R TYPE (C_PTR), VALUE :: DST TYPE (C_PTR), VALUE :: SRC INTEGER (C_SIZE_T), VALUE :: CO INTEGER (C_INT), VALUE :: KI END FUNCTION END INTERFACE END MODULE
Using the module to allocate GPU memory, transfer data to that memory, and then to run a CULA routine is as simple as;
USE CULA_LAPACK_DEVICE USE CUDA_MEMORY_MANAGEMENT TYPE(C_PTR) :: A_DEVICE REAL, ALLOCATABLE, DIMENSION(:,:), TARGET :: A SIZE_A = M*N*SIZEOF(A(1,1)) STATUS = CUDA_MALLOC(A_DEVICE,SIZE_A) STATUS = CUDA_MEMCPY(TAU_DEVICE,C_LOC(TAU),& SIZE_TAU,cudaMemcpyHostToDevice) STATUS = CULA_DEVICE_SGEQRF(M, N, A_DEVICE, M, TAU_DEVICE)
With these examples, you can start integrating your CUDA and Fortran codes much more easily. PGI is still our preferred method, but this one works well enough for Intel and GNU Fortran compilers. The upcoming CULA R15 release will feature the publication of the modules that will allow you to integrate the CULA Device interface with this programming style.