The big news from NVIDIA last week was the release of the first Kepler card, the GeForce GTX 680. This card features a radical expansion in core count, from 512 to 1536 (3x!), although each core is clocked lower than in previous generations. Since only a GeForce part is available so far, this isn't a compute-oriented release, but CUDA still runs nicely on gaming parts in single precision. We have our GTX 680 in-house and have started working with it, and we hope to post some performance results in the near future.
The natural question from our users is "so when will CULA support Kepler?" As per our normal release cadence, we will release a Kepler-enabled CULA as soon as possible after the supporting version of CUDA goes final. Note that this doesn't include the RC versions of CUDA, which traditionally become available prior to the final release.
The only downside we see to the new chip is that the double precision performance is quite low (as is traditional for gaming chips), but the single precision numbers are exciting, and many of our users do their work primarily in single precision. It's been some time since we got a new chip, so we're diving in, tuning up our solvers, and seeing what kind of results we can get! We look forward to future blog posts where we detail the performance of this new generation.
Engineers with top-notch parallel programming experience are in high demand in the U.S. This fact was recently pointed out in stories published by the Daily Beast and HPC Wire. A quote from Stan Ahalt, director of a supercomputing center at the University of North Carolina at Chapel Hill, in the Daily Beast story caught my attention: “It’s not enough to keep building powerful supercomputers unless we have the brains. Think of a supercomputer as a very fast racing engine. We need more drivers to use those engines.”
Programming supercomputers is hard work. Those involved in programming large HPC systems go through in-depth training and spend months (sometimes years) fine-tuning their algorithms until they fully leverage the massive computing power these machines offer. There is a growing number of tools and libraries for HPC programmers, but they are not necessarily suitable for engineers at every level of experience. For non-experts in HPC, programming small- to mid-scale systems can be a challenging and time-consuming task, something we hear quite often from our customers and partners.
Where EM Photonics Can Make a Difference
Companies with recently installed small- to mid-scale supercomputing systems often need help porting their applications to their new machines. This is where we bring tremendous value. We are easy to engage with and offer in-depth understanding of parallel architectures. On top of parallel programming expertise, we bring knowledge and experience in physics-based modeling and simulation, image processing, life sciences, finance, and military and defense applications. (Typically, the bigger the problem, the greater the fun!)
We encourage you to take a peek at our EM Photonics site to learn more about our consulting services, as well as current research projects and published papers. We have a team of talented engineers looking forward to tackling new challenges. Just let us know how we can help!
Dr. Vincent Natoli of Stone Ridge Technology recently published a very good article in HPCwire evaluating ten common objections to GPU computing. In it, he brings up ten reasons why people have been hesitant to get involved with GPU computing and provides a counter-argument to each.
The GPU team at EM Photonics agrees with a number of Dr. Natoli's points, as can be seen in our CULA design principles. For example, his #1 fallacy is:
"I don’t want to rewrite my code or learn a new language."
In CULA, we designed a system that completely abstracts all GPU details from the user: there is no need to learn CUDA to accelerate an existing LAPACK function. You can simply compile your code using CULA functions (or alternatively link against the new link interface), and all details, including initialization, memory transfers, and computation, are handled within a single function call. This approach allows scientists and developers unfamiliar with GPU programming to quickly accelerate codes with LAPACK functionality.
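The drop-in pattern is easy to sketch. The toy Python function below is illustrative only: the function name is hypothetical and plain numpy (which itself calls LAPACK's gesv) stands in for the GPU backend, but it shows the idea of hiding initialization, transfer, and compute behind one LAPACK-style call:

```python
import numpy as np

def accelerated_sgesv(a, b):
    """Toy stand-in for an accelerated LAPACK-style solve.

    Illustrative only: in a CULA-style library, the steps marked below
    would be device initialization, host-to-device transfer, GPU compute,
    and copy-back -- all hidden behind this single call. Here numpy's
    LAPACK-backed solver stands in for the accelerated backend.
    """
    # 1. (device initialization would happen here)
    # 2. (host-to-device memory transfer would happen here)
    x = np.linalg.solve(a, b)  # 3. the actual gesv-style solve
    # 4. (device-to-host copy-back would happen here)
    return x

# Usage: solve A x = b exactly as you would with a CPU LAPACK wrapper.
a = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])
x = accelerated_sgesv(a, b)
print(x)  # -> [2. 3.]
```

The point of the pattern is that the caller's code is unchanged from the CPU version; only the implementation behind the call moves to the GPU.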
Another fallacy examined by Dr. Natoli is:
"The PCIe bandwidth will kill my performance."
In CULA, for large enough problems, the PCIe transfer time accounts for less than 1% of the total runtime! For many of the LAPACK functions, the memory requirements are of order O(N²) while the computation is of order O(N³). This discrepancy means that the amount of computation needed grows at a much faster rate than the memory required. While this might not always be true for other domains, it is certainly the case for the majority of CULA. Additionally, through creative implementations it is possible to overlap GPU computation with PCIe transfers and CPU computation. This technique is used heavily in CULA to achieve even higher speeds.
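A back-of-the-envelope model makes the scaling argument concrete. The figures below (PCIe bandwidth, GPU throughput, and the ~2/3·N³ flop count of an LU factorization) are assumed round numbers for illustration, not CULA measurements:

```python
# Assumed round figures for illustration -- not measured CULA numbers.
PCIE_BYTES_PER_S = 6.0e9  # ~6 GB/s effective PCIe bandwidth
GPU_FLOPS = 1.0e12        # ~1 TFLOP/s single-precision throughput

def pcie_fraction(n):
    """Fraction of total runtime spent on PCIe transfer for an N x N solve.

    Transfer moves O(N^2) data (N*N floats, 4 bytes each), while compute
    is O(N^3) (roughly 2/3 * N^3 flops for an LU factorization).
    """
    transfer_s = (n * n * 4) / PCIE_BYTES_PER_S
    compute_s = (2.0 / 3.0) * n ** 3 / GPU_FLOPS
    return transfer_s / (transfer_s + compute_s)

for n in (1_000, 4_000, 16_000):
    print(n, f"{pcie_fraction(n):.3f}")  # fraction shrinks as N grows
```

Because the O(N²) transfer term is swamped by the O(N³) compute term, the PCIe share of runtime falls toward zero as the problem grows, and overlapping transfers with computation shrinks it further still.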
Overall, the article answers some of the most common misconceptions about GPU computing and is a good read for both novices and experts in the area.