CULA 1.1

by Dan

Today we're very excited to announce the release of CULA 1.1.  This new version of CULA includes improvements from every angle.  Here is a subset of the improvements that have made it into this release:

  • Exciting new functions including general Eigensolver (Premium Feature)
  • Bridge interface for migrating existing LAPACK/MKL code
  • Better documentation including a full API reference
  • New examples constructed from user feedback
  • More performance!
  • Mac OS X support (Preview)

In this new version of CULA, we've put a lot of effort into responding to users' questions, comments, and suggestions.  One of the most common suggestions has been to provide eigenvalue calculations, as that was one of the major subject areas of linear algebra that our 1.0 release didn't cover.  We made this functionality our top priority for this release.

Another common suggestion has been to provide a migration path from other LAPACK systems to CULA.  With CULA 1.1, we're introducing one of our most exciting new features: the Bridge interface.  This interface is targeted at users who are porting existing linear algebra codes to a GPU-accelerated environment.  It makes this job easier by matching the function names and signatures of several popular linear algebra packages, including MKL, ACML, and cLAPACK.  It additionally provides a fallback to CPU execution when a user does not have a GPU or when the problem size is too small to take advantage of GPU execution.  By providing a migration path from three major LAPACK packages, we hope this feature shows how committed we are to making CULA easy to use.
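
To make the fallback idea concrete, here is a minimal sketch in C of that kind of dispatch.  It is not CULA's actual implementation or API: every name in it (choose_backend, bridge_sgesv, cpu_solver, gpu_solver, gpu_min_dim) is hypothetical, the 512 threshold is an arbitrary placeholder, and the solver signature is a simplified, by-value variant of LAPACK's SGESV that returns the info code.

```c
#include <stddef.h>

/* Simplified solver signature (assumption): like LAPACK's SGESV,
 * but with by-value integers and the info code as the return value. */
typedef int (*solve_fn)(int n, int nrhs, float *a, int lda,
                        int *ipiv, float *b, int ldb);

typedef enum { USE_CPU, USE_GPU } backend_t;

/* Hypothetical hooks: the existing CPU routine and a GPU routine.
 * A NULL gpu_solver stands for "no GPU present". */
solve_fn cpu_solver = NULL;
solve_fn gpu_solver = NULL;

/* Hypothetical tuning knob: below this dimension we assume that
 * transfer overhead outweighs any GPU speedup. */
int gpu_min_dim = 512;

/* Pick a backend from GPU availability and problem size. */
backend_t choose_backend(int gpu_available, int n)
{
    if (!gpu_available)
        return USE_CPU;   /* no GPU: always fall back */
    if (n < gpu_min_dim)
        return USE_CPU;   /* problem too small to benefit */
    return USE_GPU;
}

/* The caller sees one signature; the backend is chosen per call. */
int bridge_sgesv(int n, int nrhs, float *a, int lda,
                 int *ipiv, float *b, int ldb)
{
    backend_t which = choose_backend(gpu_solver != NULL, n);
    solve_fn f = (which == USE_GPU) ? gpu_solver : cpu_solver;
    if (f == NULL)
        return -1;        /* no backend registered at all */
    return f(n, nrhs, a, lda, ipiv, b, ldb);
}
```

The point of the design is that calling code keeps the same call whether or not a GPU is present; only the routine behind the function pointer changes.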

One of the things that users have asked for is a detailed function reference.  Although CULA closely mirrors other LAPACK packages, some of our users (especially those without LAPACK experience) were confused by some functions and semantics.  With this release, we've included a reference manual that is a companion to the programmer's guide.  We've put a lot of effort into making sure that these documents cover all of the questions our users have posted, and we hope it shows in this new release.

With CULA 1.1, we've created three new examples.  One of these examples showcases the Bridge interface I mentioned previously.  Another demonstrates using CULA in a multi-threaded, multi-GPU environment, a common question our users have asked on our forums.  The last example uses system solves to demonstrate using CULA with data types other than single-precision real.  If you have any suggestions for new examples, post them in the forums and we may be able to get these into the final release of 1.1.

Last but not least, we're introducing a Preview build for Mac OS X 10.5 Leopard.  Why do we say it's a Preview?  Currently, our Mac toolchain is limiting us to supporting only single-precision real across the board, but we'll be expanding this support very shortly.  Please keep an eye on this blog for future updates; we have a lot to say about Mac!

I hope this post gave you a good impression of some of the things we've been working on for the past two months.  Over the next few posts we'll be describing some of these features in more detail.  Until then, download the latest version of CULA and post your results on the forums.  We hope you're as excited about this new release as we are!



by Dan

At GTC, NVIDIA released details on their next-generation GPU architecture, called Fermi.  Among the changes this architectural generation brings are:

  • Massively increased double-precision performance
  • Single-precision performance gains, and now with full IEEE accuracy
  • Fused multiply-add (IEEE754-2008 standard compliant)
  • Per-multiprocessor caching
  • The ability to run several CUDA kernels concurrently
  • ECC memory and addressability beyond 4GB
  • Faster atomics

Wow, that is quite a list!  What does it all mean for CULA?  The short answer is: you can expect to see more performance!

Perhaps the most exciting and enabling of these new features is the massively increased double-precision performance.  Current-generation GPUs are unbalanced in terms of their performance.  While a GPU has almost 10x the single-precision performance of a modern CPU, it tends to be only 2-3x more powerful in double precision (if it has double-precision capabilities at all).  What this means for CULA is that we typically exhibit only about a 2-3x speedup over vendor-tuned CPU solvers for double-precision operations.  With the massive increase coming with Fermi, however, we're expecting to extend this lead to at least 10x on the new architecture!  Another of Fermi's features, per-multiprocessor caching, will lead to further improvements in performance for both single- and double-precision arithmetic, because the cache will improve memory performance across the board.

While it is easy to see how an increase in raw computational throughput and caching will lead to increased performance from CULA, Fermi also brings some capabilities that will enable us as developers to increase CULA's performance even further.  One of these features is the ability to run several CUDA kernels concurrently.  For those of you unfamiliar with CUDA, a kernel is a section of code that is executed on the GPU.  With the current CUDA model, only one of these kernels can run on a GPU at a given time.  For kernels that work on a large amount of data, this doesn't have much of an impact, because the GPU is kept fully occupied by this work.  For smaller workloads, however, the GPU can be under-utilized.  With the ability to run several kernels concurrently, we can make better use of the GPU by either packing several of these smaller kernels together or packing a smaller kernel alongside one that does more work.  The GPU stays busier, which will lead to even greater performance.
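
A back-of-the-envelope model shows why packing helps.  The C sketch below is a deliberately crude illustration, not a description of how Fermi actually schedules work: it assumes a GPU with sm_count multiprocessors, that kernel i keeps exactly sms[i] of them busy for time[i] units, and that all kernels fit on the GPU side by side.  The function names and every number used with them are made up for the example.

```c
#include <stddef.h>

/* Utilization when kernels run one after another: the total time is
 * the sum of the individual times. */
double serial_utilization(int sm_count, const int *sms,
                          const int *time, size_t n)
{
    long busy = 0, elapsed = 0;
    for (size_t i = 0; i < n; ++i) {
        busy += (long)sms[i] * time[i];   /* multiprocessor-time used */
        elapsed += time[i];
    }
    if (elapsed == 0)
        return 0.0;
    return (double)busy / ((double)sm_count * (double)elapsed);
}

/* Utilization when all kernels start together: the total time is that
 * of the longest kernel.  Returns -1.0 if they do not fit side by side. */
double concurrent_utilization(int sm_count, const int *sms,
                              const int *time, size_t n)
{
    long busy = 0, needed = 0, longest = 0;
    for (size_t i = 0; i < n; ++i) {
        busy += (long)sms[i] * time[i];
        needed += sms[i];
        if (time[i] > longest)
            longest = time[i];
    }
    if (needed > sm_count)
        return -1.0;                      /* outside this toy model */
    if (longest == 0)
        return 0.0;
    return (double)busy / ((double)sm_count * (double)longest);
}
```

Take a made-up GPU with 30 multiprocessors and three kernels that occupy 8, 10, and 12 of them for 4 time units each.  Run back to back, they keep the GPU one-third busy over 12 time units; run side by side, they fill all 30 multiprocessors and finish in 4.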

Last but certainly not least, Fermi is bringing full ECC memory to the GPU.  While this won't affect the performance of CULA, nor change the way GPUs are programmed, it will lead to increased adoption of GPUs by those who need ECC memory.  This capability really underscores NVIDIA's commitment to general-purpose computing with GPUs, something that everyone with long runtimes can appreciate.

I think it's safe to say that there is a lot to look forward to in Fermi.  We can't wait to get our hands on these cards and to expose the full computing power of the GPU!