by Dan

At GTC, NVIDIA released details on their next-generation GPU architecture, called Fermi.  Among the changes this architectural generation brings are:

  • Massively increased double-precision performance
  • Single-precision performance gains, now with full IEEE accuracy
  • Fused multiply-add (IEEE754-2008 standard compliant)
  • Per-multiprocessor caching
  • The ability to run several CUDA kernels concurrently
  • ECC memory and addressability beyond 4GB
  • Faster atomics

Wow, that is quite a list!  What does it all mean for CULA?  The short answer is: you can expect to see more performance!

Perhaps the most exciting and enabling of these new features is the massively increased double-precision performance.  Current-generation GPUs are unbalanced in this respect: while a GPU offers almost 10x the single-precision performance of a modern CPU, it tends to be only 2-3x more powerful in double precision (if it has double-precision capabilities at all).  As a result, CULA typically shows only about that same 2-3x speedup over vendor-tuned CPU solvers for double-precision operations.  With the massive increase coming in Fermi, however, we expect to extend this lead to at least 10x on the new architecture!  Another of Fermi's features, per-multiprocessor caching, will bring further gains in both single- and double-precision arithmetic, because the cache will improve the performance of Fermi's memory system across the board.

While it is easy to see how an increase in raw computational throughput and caching will boost CULA's performance, Fermi also brings capabilities that will let us as developers push performance even further.  One of these is the ability to run several CUDA kernels concurrently.  For those of you unfamiliar with CUDA, a kernel is a section of code that is executed on the GPU.  Under the current CUDA model, only one kernel can run on a GPU at a given time.  For kernels that work on a large amount of data this has little impact, because the work keeps the GPU fully occupied.  For smaller workloads, however, the GPU can be under-utilized.  With concurrent kernels, we can make better use of the GPU by packing several small kernels together or by running a small kernel alongside a larger one, leading to even greater performance.
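In CUDA terms, independent work is expressed by launching kernels into separate streams; on hardware that supports concurrent kernel execution, such launches may overlap on the device. The sketch below shows the pattern under stated assumptions: the kernel names and workloads are hypothetical, and it must be built with nvcc and run on a CUDA-capable GPU.

```cuda
#include <cuda_runtime.h>

// Two illustrative kernels; names and workloads are hypothetical.
__global__ void small_update(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

__global__ void large_scale(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] *= 2.0f;
}

int main(void) {
    const int small_n = 1024, large_n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, small_n * sizeof(float));
    cudaMalloc(&y, large_n * sizeof(float));

    // Independent streams: on hardware with concurrent kernel
    // execution, these two launches may overlap on the GPU.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    small_update<<<(small_n + 255) / 256, 256, 0, s1>>>(x, small_n);
    large_scale<<<(large_n + 255) / 256, 256, 0, s2>>>(y, large_n);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

On pre-Fermi hardware the two launches still run correctly, just serialized; the streams simply give the scheduler the opportunity to overlap them when the architecture allows it.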

Last but certainly not least, Fermi brings full ECC memory to the GPU.  While this won't affect the performance of CULA, nor will it change the way GPUs are programmed, it will lead to increased adoption of GPUs among those who need ECC memory.  This capability really underscores NVIDIA's commitment to general-purpose computing on GPUs, something anyone with long-running computations can appreciate.

I think it's safe to say that there is a lot to look forward to in Fermi.  We can't wait to get our hands on these cards and expose the full computing power of the GPU!
