Initial Fermi Performance
Hot on the heels of a 1.3a service release, we've got some brand new information on the future directions of CULA. Today we'll be talking about Fermi, NVIDIA's next-generation GPU architecture that was announced in September at the GPU Technology Conference. At that time, we shared our thoughts on the new and exciting performance we hoped Fermi would bring. After six months of anticipation, we're very proud today to debut the first performance results for CULA running on Fermi. To our knowledge, these are the first published double-precision performance results for Fermi running real-world code.
As NVIDIA discussed at Fermi's unveiling, their next-generation part brings an increase in double-precision performance. When we received our Fermi-based Tesla C2050, we didn't hesitate to port CULA to the new platform. All that was required to get CULA up and running on Fermi was to set a few compiler flags for the SM 2.0 model, upgrade our graphics driver, and make a few small code changes for the new architecture (more on this in a later post). Once that was done, we ran through our publicly available benchmark suite to bring you the numbers you see below:
As you can see, Fermi is no slouch! We're reporting performance gains for doubles up to 3x over the previous generation of Tesla GPUs. It's also very important to note that these gains are achieved with no Fermi-specific optimizations added -- these are practically plug-and-play performance enhancements. We have every expectation that with a little time and effort we can improve significantly upon these already impressive numbers.
Well, there you have it. Fermi is here and NVIDIA has delivered considerable double-precision gains. We'll be releasing a Fermi-enabled version of CULA very soon, so check back often for the latest and greatest in GPU computation. Until then, enjoy these graphs and get your systems prepared for CULA 2.0 and this must-have hardware upgrade. As an aside, for those of you wondering why we haven't released a Fermi-supporting version of CULA just yet: there is much more to a release than code and compiler flags, including upgrading all of our builders to CUDA 3.0, updating packaging scripts, and testing across all operating systems.
Fermi
At GTC, NVIDIA released details on their next-generation GPU architecture, called Fermi. Among the changes this architectural generation brings are:
- Massively increased double-precision performance
- Single-precision performance gains, and now with full IEEE accuracy
- Fused multiply-add (IEEE754-2008 standard compliant)
- Per-multiprocessor caching
- The ability to run several CUDA kernels concurrently
- ECC memory and addressability beyond 4GB
- Faster atomics
Wow, that is quite a list! What does it all mean for CULA? The short answer is: you can expect to see more performance!
Perhaps the most exciting and enabling of these new features is the massively increased double-precision performance. Current-generation GPUs are unbalanced in terms of their performance. While a GPU has almost 10x the single-precision performance of a modern CPU, it tends to be only 2-3x more powerful in double precision (if it has double-precision capabilities at all). What this means for CULA is that we typically see only about a 2-3x speedup over vendor-tuned CPU solvers for double-precision operations. With the massive increase coming with Fermi, however, we're expecting to extend this lead to at least 10x! Another of Fermi's features, per-multiprocessor caching, will lead to further improvements for both single- and double-precision arithmetic, because the cache will improve the performance of Fermi's memory across the board.
While it is easy to see how an increase in raw computational throughput and caching will lead to increased performance from CULA, Fermi also brings some capabilities that will enable us as developers to increase CULA's performance even further. One of these features is the ability to run several CUDA kernels concurrently. For those of you unfamiliar with CUDA, a kernel is a section of code that is executed on the GPU. With the current CUDA model, only one of these kernels can run on a GPU at a given time. For kernels that work on a large amount of data, this doesn't have much of an impact, because the GPU is kept fully occupied by this work. For smaller workloads, however, the GPU can be under-utilized. With the ability to run several kernels concurrently, we can better utilize the GPU by either packing several of these smaller kernels together or packing a smaller kernel alongside one that does more work. The result is a better-utilized GPU, which will lead to even greater performance.
Last but certainly not least, Fermi is bringing full ECC memory to the GPU. While this won't affect the performance of CULA, nor change the way GPUs are programmed, it will lead to increased adoption of GPUs by those who need ECC memory. This capability really underscores NVIDIA's commitment to general-purpose computing with GPUs -- something that everyone with long runtimes can appreciate.
I think it's safe to say that there is a lot to look forward to in Fermi. We can't wait to get our hands on these cards and to expose the full computing power of the GPU!
GTC 2009 Retrospective
Hi, everyone!
Today I wanted to discuss NVIDIA's GPU Technology Conference (GTC), which was held back in September. I know that a few weeks have passed since the event, but we've been so busy working on CULA that we haven't quite had the time to write about it! At GTC, the CULA team officially released CULA 1.0. For those of you who were there, I think we can agree that it was a great show. Thank you to everyone who stopped by our booth; we're always excited to meet people using CULA.
While at the show, the CULA team also had the chance to meet with many of our collaborators at NVIDIA. It was great to finally put a face to many of the people we have worked with over the past year. For those of you who weren't there, you missed out on some great CULA Cakes!
We got a lot of great feedback from the show, and we're excited to be incorporating that feedback into CULA as we look beyond 1.0. Over the next few posts we'll be talking about some of that feedback and how it influenced many of the features that you'll find in CULA 1.1. We'll also be talking about the next-generation GPU architecture NVIDIA announced at GTC and how it will benefit CULA. Until then, thanks for reading and check back soon!




