Debugging a CUDA Memory Leak

by Kyle

This week, I'd like to share a CUDA debugging story that some might find helpful or insightful.

Recently, while testing the latest nightly CULA build on our buildfarm, we noticed we were running into an "CUDA out of GPU memory" error after a large number of tests on certain GPUs. This was a new occurrence, and was happening on test sizes that should have easily fit within the memory of the GPU. Our first reaction was to believe that we had erroneously introduced a memory leak into the system. However, when running the cudaMemGetInfo() routine we saw that we still had well over 98% of our GPU memory available. Certainly more investigation needed to be done...

After some more testing, we quickly discovered that NVIDIA's CUBLAS helper routine, cublasDestroy_v2() was leaking a few KB of memory with every call. We promptly submitted a bug report to NVIDIA which was confirmed as an issue and is slated for a fix in the next CUDA release - pweh. However, that doesn't explain why we were still getting an "out of memory" error when it would take literally millions of calls to the leaky NVIDIA routine before we exhausted the memory on a Tesla C2070 with 6 GB.

After some more debugging, we determined we were running into the somewhat rare problem know as memory fragmentation where the GPU could not find a contiguous portion of memory to store the requested block of memory despite having a large amount of free memory overall.

Due to the way one of our tests was structured, we'd create a context, allocate a large chunk of memory, create another context, and then allocate another large chunk of memory. Upon freeing these resources, because of the NVIDIA leak, we had the contexts still floating around - one of which was now located in the middle of our memory space. After repeating this process a number of times, we ended up with a handful of small zombie contexts scattered about memory space. Because of this, it eventually became impossible to allocate a large chunk of contiguous memory.

As a work around, we were able to alleviate this problem by allocating the contexts when there was nothing else in the memory space. This caused the contexts to leak on the edge of memory rather than scattered in the middle.

This just goes to show that memory leaks of any size can be incredibly dangerous!  I'd also like to note that CULA users should unaffected by this NVIDIA bug if you are using the interface properly. In the rare event you are running for multiple days, initializing the library millions of times, a call to cudaDeviceReset() will free the memory caused by the bug.


CUDA and Fortran

by John

Let's start by saying that CUDA and Fortran aren't the best of friends out of the box. CUDA is a C library without true Fortran support, and Fortran isn't naturally attuned to C's value semantics. Since our users want to use our CULA Device interface routines to avoid transfers between the host and the GPU, those users also need to be able to allocate device memory. The best and easiest way, in our findings, is to use the Portland Groups's Fortran compiler, with the CUDA-Fortran language extensions. This makes CUDA a first-class citizen and so running CULA's Device interface is quite simple. Keep an eye on the upcoming issues of the Portland Group's newsletter, because we will be revising our old article about CULA + PGI integration there.

Now for those without the PGI compiler, the answer is the ISO_C_BINDING method for module writing, which allows Fortran to call into C code using the C types for pointers and with value semantics. Most newer Fortran compilers support this, and as of CULA R15 there will be available a cula_lapack_device module that takes advantage of this mode. That said, CUDA does not publish a formal module for ISO_C_BINDING integration, so you will need to write your own. Here are some sample definitions which can be pretty easily copied to produce the definitions for the CUDA routines you need.

          ENUM, BIND(C)
              ENUMERATOR :: cudaMemcpyHostToHost=0, &
              cudaMemcpyHostToDevice, &
              cudaMemcpyDeviceToHost, &
              cudaMemcpyDeviceToDevice, &
          END ENUM
                  USE ISO_C_BINDING
                  TYPE (C_PTR) :: BUFFER
                  INTEGER (C_SIZE_T), VALUE :: SZ
              END FUNCTION
                  USE CUDA_CONSTANTS
                  INTEGER (C_INT) :: R
                  TYPE (C_PTR), VALUE :: DST
                  TYPE (C_PTR), VALUE :: SRC
                  INTEGER (C_SIZE_T), VALUE :: CO
                  INTEGER (C_INT), VALUE :: KI
              END FUNCTION

Using the module to allocate GPU memory, transfer data to that memory, and then to run a CULA routine is as simple as;

        TYPE(C_PTR) :: A_DEVICE
        SIZE_A = M*N*SIZEOF(A(1,1))

With these examples, you can start integrating your CUDA and Fortran codes much more easily. PGI is still our preferred method, but this one works well enough for Intel and GNU Fortran compilers. The upcoming CULA R15 release will feature the publication of the modules that will allow you to integrate the CULA Device interface with this programming style.


Debugging with CULA Sparse

by Dan

CULA Sparse offers a unique debugging feature. When enabled, this feature allows you to perform extra checks on your matrix. Our recommended use case is to use debugging mode when getting started running the library or if you run into a problem. Once you have fixed any any issues you might encounter (if you encounter none, good for you!), you can switch off debugging mode to make sure you are running at full performance.

Currently, one of the most important things that debugging mode enables is a check to ensure that your matrix is well-formed. In a previous post, I discussed sparse matrix formats. CULA Sparse, being flexible, provides an indexing parameter for you to specify whether your data is one- or zero-based. It is a very common error, however, that users do not specify their index or matrix data correctly when they use the library. Debugging mode helps here because it can identify when there is a mismatch between the actual matrix data and the specified indexing.

In future revisions of CULA Sparse, there is an opportunity to introduce even more options, such as introducing a check that helps to steer you towards a good solver. For example, BiCG is intended only for symmetric matrices; if you use a non-symmetric matrix with it, you are likely to get poor performance. In a future release, we may check for this case and report to you if you are using a solver incorrectly.

We think that providing developer-oriented features and ease-of-use features are just as important as performance, although of course we provide that in spades. If you haven’t tried CULA Sparse yet, try out the demo and see how our combination or performance and ease-of-use work for you!