Today we're very excited to announce the release of CULA 1.1.  This new version of CULA includes improvements from every angle.  Here is a subset of the improvements that have made it into this release:

  • Exciting new functions, including a general eigensolver (Premium Feature)
  • Bridge interface for migrating existing LAPACK/MKL code
  • Better documentation including a full API reference
  • New examples constructed from user feedback
  • More performance!
  • Mac OS X support (Preview)

In this new version of CULA, we've put a lot of effort into responding to users' questions, comments, and suggestions.  One of the most common suggestions has been to provide eigenvalue calculations, one of the major areas of linear algebra that our 1.0 release did not cover, so we made this functionality our top priority for this release.

Another common suggestion has been to provide a migration path from other LAPACK systems to CULA.  With CULA 1.1, we're introducing one of our most exciting new features: the Bridge interface.  This interface is targeted at users who are porting existing linear algebra codes to a GPU-accelerated environment.  It makes this job easier by matching the function names and signatures of several popular linear algebra packages, including MKL, ACML, and cLAPACK.  It additionally provides a fallback to CPU execution when a user does not have a GPU or when the problem size is too small to take advantage of GPU execution.  By providing a migration path from three major LAPACK packages, we hope this feature shows how committed we are to making CULA easy to use.
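To make the fallback idea concrete, here is a minimal sketch of the kind of decision a bridge layer has to make on every call.  It is written in Python purely for illustration and is not the Bridge interface itself: the names `choose_backend`, `bridged_call`, `run_on_cpu`, and `run_on_gpu`, and the crossover size of 512, are all assumptions made up for this example.

```python
# Hypothetical sketch of Bridge-style dispatch. None of these names are CULA's API.

MIN_GPU_SIZE = 512  # assumed crossover size, for illustration only


def choose_backend(n, gpu_available, min_gpu_size=MIN_GPU_SIZE):
    """Return "gpu" only when a GPU exists and the problem is large enough."""
    if not gpu_available or n < min_gpu_size:
        return "cpu"
    return "gpu"


def bridged_call(n, gpu_available, cpu_routine, gpu_routine):
    """Route one call to the CPU or GPU routine behind a single entry point."""
    routine = gpu_routine if choose_backend(n, gpu_available) == "gpu" else cpu_routine
    return routine(n)


# Stand-in routines that only report where the work would have run.
def run_on_cpu(n):
    return f"cpu solve, n={n}"


def run_on_gpu(n):
    return f"gpu solve, n={n}"


print(bridged_call(64, True, run_on_cpu, run_on_gpu))     # small problem
print(bridged_call(4096, True, run_on_cpu, run_on_gpu))   # large problem, GPU present
print(bridged_call(4096, False, run_on_cpu, run_on_gpu))  # large problem, no GPU
```

The point of the pattern is that the calling code keeps a single entry point with a familiar signature, while the choice of where to run stays hidden behind it.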

One of the things that users have asked for is a detailed function reference.  Although CULA closely mirrors other LAPACK packages, some of our users (especially those without LAPACK experience) were confused by some functions and semantics.  With this release, we've included a reference manual that is a companion to the programmer's guide.  We've put a lot of effort into making sure that these documents cover all of the questions our users have posted, and we hope it shows in this new release.

With CULA 1.1, we've created three new examples.  One of these examples showcases the Bridge interface mentioned previously.  Another demonstrates using CULA in a multi-threaded, multi-GPU environment, a common question our users have asked on our forums.  The last example uses system solves to demonstrate using CULA with data types other than single-precision real.  If you have any suggestions for new examples, post them in the forums and we may be able to get these into the final release of 1.1.

Last but not least, we're introducing a Preview build for Mac OS X 10.5 Leopard.  Why do we say it's a Preview?  Currently, our Mac toolchain is limiting us to supporting only single-precision real across the board, but we'll be expanding this support very shortly.  Please keep an eye on this blog for future updates; we have a lot to say about Mac!

I hope this post gave you a good impression of some of the things we've been working on for the past two months.  Over the next few posts we'll be describing some of these features in more detail.  Until then, download the latest version of CULA and post your results on the forums.  We hope you're as excited about this new release as we are!

-- The CULA Team

At GTC, NVIDIA released details on their next-generation GPU architecture, called Fermi.  Among the changes this architectural generation brings are:

  • Massively increased double-precision performance
  • Single-precision performance gains, and now with full IEEE accuracy
  • Fused multiply-add (IEEE754-2008 standard compliant)
  • Per-multiprocessor caching
  • The ability to run several CUDA kernels concurrently
  • ECC memory and addressability beyond 4GB
  • Faster atomics

Wow, that is quite a list!  What does it all mean for CULA?  The short answer is: you can expect to see more performance!

Perhaps the most exciting and enabling of these new features is the massively increased double-precision performance.  Current-generation GPUs are unbalanced in terms of their performance.  While a GPU has almost 10x the single-precision performance of a modern CPU, it tends to be only 2-3x more powerful in double precision (if it has double-precision capabilities at all).  What this means for CULA is that we typically see only about a 2-3x speedup over vendor-tuned CPU solvers for double-precision operations.  With the massive increase coming with Fermi, however, we're expecting to extend this lead to at least 10x on the new architecture!  Another of Fermi's features, per-multiprocessor caching, will lead to further improvements in performance for both single- and double-precision arithmetic, because the cache will improve the performance of Fermi's memory across the board.

While it is easy to see how an increase in raw computational throughput and caching will lead to increased performance from CULA, Fermi also brings some capabilities that will enable us as developers to increase CULA's performance even further.  One of these features is the ability to run several CUDA kernels concurrently.  For those of you unfamiliar with CUDA, a kernel is a section of code that is executed on the GPU.  With the current CUDA model, only one of these kernels can run on a GPU at a given time.  For kernels that work on a large amount of data, this doesn't have much of an impact, because the GPU is kept fully occupied by this work.  For smaller workloads, however, the GPU can be under-utilized.  With the ability to run several kernels concurrently, we can make better use of the GPU by either packing several of these smaller kernels together or packing a smaller kernel alongside one that does more work.  This will result in the GPU being better utilized, which will lead to even greater performance.
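For a feel of why packing kernels helps, here is a toy model written in Python.  It is only a sketch under simplifying assumptions: each kernel is described by a made-up pair (multiprocessors occupied, run time), kernels launch in order as soon as enough multiprocessors are free, and nothing here reflects how the real hardware scheduler works.  The function names and the numbers are hypothetical.

```python
import heapq

# Toy model only: a kernel is (multiprocessors_needed, run_time), both made up.


def serial_makespan(kernels):
    """One kernel at a time: total time is the sum of the run times."""
    return sum(t for _, t in kernels)


def concurrent_makespan(kernels, total_sms):
    """Launch kernels in order, each as soon as enough multiprocessors are free."""
    running = []  # min-heap of (finish_time, multiprocessors_held)
    free = total_sms
    now = 0.0
    for sms, t in kernels:
        if sms > total_sms:
            raise ValueError("kernel needs more multiprocessors than exist")
        # Wait for running kernels to finish until this one fits.
        while free < sms:
            finish, released = heapq.heappop(running)
            now = max(now, finish)
            free += released
        heapq.heappush(running, (now + t, sms))
        free -= sms
    return max((finish for finish, _ in running), default=now)


def utilization(kernels, total_sms, makespan):
    """Fraction of multiprocessor-time that did useful work."""
    return sum(sms * t for sms, t in kernels) / (total_sms * makespan)


TOTAL_SMS = 16
small = [(4, 1.0)] * 4                      # four small kernels
mixed = [(12, 2.0), (4, 1.0), (4, 1.0)]     # one larger kernel plus two small ones

for name, ks in (("small", small), ("mixed", mixed)):
    s = serial_makespan(ks)
    c = concurrent_makespan(ks, TOTAL_SMS)
    print(name, "serial:", s, utilization(ks, TOTAL_SMS, s),
          "concurrent:", c, utilization(ks, TOTAL_SMS, c))
```

In this model the small kernels leave most of the multiprocessors idle when run one at a time, and the same work finishes sooner once they are allowed to share the device.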

Last but certainly not least, Fermi is bringing full ECC memory to the GPU.  While this won't affect the performance of CULA or change the way GPUs are programmed, it will lead to increased adoption of GPUs by those who need ECC memory.  This capability really underscores NVIDIA's commitment to general-purpose computing with GPUs -- something that everyone with long runtimes can appreciate.

I think it's safe to say that there is a lot to look forward to in Fermi.  We can't wait to get our hands on these cards and to expose the full computing power of the GPU!

-- Dan

Hi, everyone!

Today I wanted to discuss NVIDIA's GPU Technology Conference (GTC) that was held back in September. I know that a few weeks have passed since the event, but we've been so busy working on CULA that we haven't quite had the time to write about it! At GTC, the CULA team officially released CULA 1.0. For those of you who were there, I think we can agree that it was a great show. Thank you to everyone who stopped by our booth; we're always excited to meet people using CULA.

While at the show, the CULA team also had the chance to meet with many of our collaborators at NVIDIA. It was great to finally put a face to many of the people we have worked with over the past year. For those of you who weren't there, you missed out on some great CULA Cakes!

Eric and Liana with Jen-Hsun Huang

Delicious CULA Cakes

We got a lot of great feedback from the show, and we're excited to be incorporating that feedback into CULA as we look beyond 1.0. Over the next few posts we'll be talking about some of that feedback and how it influenced many of the features that you'll find in CULA 1.1. We'll also be talking about the next-generation GPU architecture NVIDIA announced at GTC and how it will benefit CULA. Until then, thanks for reading and check back soon!

-- The CULA Team

With the release of CULA 1.0, we (the CULA team) thought it would be a great idea to interact more closely with our users. Before today, we've interacted with many of you on our forums and via email, but we'd like to extend our communication in ways for which the forum isn't quite the appropriate outlet. To meet this goal, we've decided to start up our very own blog!

The CULA blog will be the resource for what's going on in CULA and related areas. We want this blog to be a mix of technical information, marketing magic, and other things we (and hopefully you) find cool. We hope that it contains information that can help you in your work, will entertain you, and will keep you coming back!

So let's start with some basic introductions: what is CULA and who developed it? CULA is a dense linear algebra library implemented with CUDA. CULA was started because there was an obvious need for a set of linear algebra functions for the GPU, judging by the large number of requests in academic papers and on many internet forums, including the official CUDA forums. With the help of NASA and NVIDIA, we (EM Photonics) brought CULA to the market. CULA is the result of many years of effort and GPU programming experience. An interesting aside: EM Photonics was one of the first companies to release a fully capable, free program that demonstrated the power of GPUs for general-purpose computing, based on the old style of OpenGL GPGPU programming. Since then we’ve happily transitioned our work to the CUDA world.

What makes GPU computing (especially CUDA) so exciting? Whether it be modeling, visualization, image processing, physics, or mathematics, your computations can run significantly faster on the GPU! With the release of CULA, which is modeled after the LAPACK interface, we've brought the power of GPU computing to each of these fields as well as many more beyond this small list. We are extremely proud of CULA and are excited to continue building the next generation of GPU computing.

As we add more posts to this blog, we'd like you to stay up to date. Visit our site often or subscribe to our RSS feed so you don't miss anything. You'll be hearing from us again soon!

-- The CULA Team