<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>CULA</title>
	<atom:link href="http://www.culatools.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.culatools.com</link>
	<description>GPU Accelerated Linear Algebra</description>
	<lastBuildDate>Tue, 31 Jan 2012 20:58:55 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>CULA Dense R14 and Sparse S2 &#8211; Now Supporting CUDA 4.1</title>
		<link>http://www.culatools.com/blog/2012/01/31/cula-dense-r14-and-sparse-s2-now-supporting-cuda-4-1/</link>
		<comments>http://www.culatools.com/blog/2012/01/31/cula-dense-r14-and-sparse-s2-now-supporting-cuda-4-1/#comments</comments>
		<pubDate>Tue, 31 Jan 2012 20:58:55 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.culatools.com/?p=3052</guid>
		<description><![CDATA[We're pleased to announce the release of our latest CULA Dense and Sparse versions, with full compatibility for CUDA 4.1. A major highlight of R14 is the inclusion of a preview of multi-GPU LAPACK routines, hereafter called the pCULA branch of CULA Dense. Again, this is a preview designed to show potential performance as well [...]]]></description>
			<content:encoded><![CDATA[<p>We're pleased to announce the release of our latest CULA Dense and Sparse versions, with full compatibility for CUDA 4.1. A major highlight of R14 is the inclusion of a preview of multi-GPU LAPACK routines, hereafter called the pCULA branch of CULA Dense. Again, this is a preview designed to show potential performance, with an interface that will likely continue to evolve over time. The new multi-GPU routines are:<br />
<code><br />
pculaGetrf (LU decomposition)<br />
pculaGetrs (LU solve)<br />
pculaGesv (general system solve via LU)<br />
pculaPotrf (Cholesky decomposition)<br />
pculaPotrs (Cholesky solve)<br />
pculaPosv (Hermitian/symmetric positive-definite system solve)<br />
pculaTrsm (BLAS triangular system solve)<br />
pculaGemm (BLAS general matrix multiply)<br />
</code></p>
<p>An upcoming blog post will contain more on the usage and expectations of these routines, but a simple example is quite easy to create:<br />
<code><br />
culaInitialize();</p>
<p>pculaConfig config;<br />
pculaConfigInit(&#038;config);<br />
// some users may wish to tweak the default options here<br />
// the default is to use all CUDA devices and to allow the routine<br />
// to select the parameters it feels are best</p>
<p>culaStatus status = pculaPotrf(&#038;config, m, n, A, lda);<br />
</code></p>
<p>As always, the release also brings bug fixes and speed/stability improvements in addition to the new features. The full release notes for R14 and S2 are available at the <a href="http://www.culatools.com/downloads/dense/">dense downloads page</a> and the <a href="http://www.culatools.com/downloads/sparse/">sparse downloads page</a>, respectively.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.culatools.com/blog/2012/01/31/cula-dense-r14-and-sparse-s2-now-supporting-cuda-4-1/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Debugging with CULA Sparse</title>
		<link>http://www.culatools.com/blog/2012/01/13/debugging-with-cula-sparse/</link>
		<comments>http://www.culatools.com/blog/2012/01/13/debugging-with-cula-sparse/#comments</comments>
		<pubDate>Fri, 13 Jan 2012 20:48:00 +0000</pubDate>
		<dc:creator>Dan</dc:creator>
				<category><![CDATA[CULA Design Principals]]></category>
		<category><![CDATA[CULA Routine Feature]]></category>

		<guid isPermaLink="false">http://www.culatools.com/?p=3038</guid>
		<description><![CDATA[CULA Sparse offers a unique debugging feature. When enabled, this feature allows you to perform extra checks on your matrix. Our recommended use case is to use debugging mode when getting started running the library or if you run into a problem. Once you have fixed any issues you might encounter (if you encounter [...]]]></description>
			<content:encoded><![CDATA[<p><img class="wp-image-3040 alignright" title="debugging" src="http://www.culatools.com/wp-content/uploads/2012/01/debugging.png" alt="" width="280" height="180" /><br />
CULA Sparse offers a unique debugging feature. When enabled, this feature allows you to perform extra checks on your matrix. We recommend using debugging mode when you are getting started with the library or if you run into a problem. Once you have fixed any issues you might encounter (if you encounter none, good for you!), you can switch off debugging mode to make sure you are running at full performance.</p>
<p>Currently, one of the most important things that debugging mode enables is a check to ensure that your matrix is well-formed. In a <a href="http://www.culatools.com/blog/2011/09/08/sparse-101-matrix-formats/" title="Sparse 101: Matrix Formats">previous post</a>, I discussed sparse matrix formats. CULA Sparse, being flexible, provides an indexing parameter for you to specify whether your data is one- or zero-based. It is a very common error, however, that users do not specify their index or matrix data correctly when they use the library. Debugging mode helps here because it can identify when there is a mismatch between the actual matrix data and the specified indexing.</p>
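<p>To make the idea concrete, here is a minimal sketch in C of the kind of index check described above. It is not the CULA Sparse implementation; the function name <code>coo_indices_match_base</code> and the coordinate (row/column list) layout are assumptions chosen for illustration.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the CULA Sparse API.
 * Returns 1 if every (row, col) pair of an n-by-n matrix stored in
 * coordinate format is in range for the declared index base (0 or 1),
 * and 0 as soon as one index falls outside that range. */
int coo_indices_match_base(int n, size_t nnz,
                           const int *row, const int *col, int base)
{
    assert(base == 0 || base == 1);
    for (size_t k = 0; k < nnz; ++k) {
        if (row[k] < base || row[k] >= n + base) return 0;
        if (col[k] < base || col[k] >= n + base) return 0;
    }
    return 1;
}
```

<p>A range check like this can only flag a mismatch when some index actually falls outside the valid range for the declared base, so it is a sanity check rather than a proof that the indexing is right.</p>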
<p>Future revisions of CULA Sparse could add even more checks, such as one that helps to steer you towards a good solver. For example, CG is intended only for symmetric positive-definite matrices; if you use a non-symmetric matrix with it, you are likely to get poor convergence. In a future release, we may check for this case and report to you if you are using a solver incorrectly.</p>
<p>We think that providing developer-oriented and ease-of-use features is just as important as performance, although of course we provide that in spades. If you haven’t tried CULA Sparse yet, <a href="http://www.culatools.com/blog/2011/11/29/introducing-the-cula-sparse-demo/" title="Introducing the CULA Sparse Demo">try out the demo</a> and see how our combination of performance and ease of use works for you!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.culatools.com/blog/2012/01/13/debugging-with-cula-sparse/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Not enough HPC programmers. How to fill the gap?</title>
		<link>http://www.culatools.com/blog/2012/01/10/not-enough-hpc-programmers-how-to-fill-the-gap/</link>
		<comments>http://www.culatools.com/blog/2012/01/10/not-enough-hpc-programmers-how-to-fill-the-gap/#comments</comments>
		<pubDate>Tue, 10 Jan 2012 17:42:57 +0000</pubDate>
		<dc:creator>Liana</dc:creator>
				<category><![CDATA[GPGPU Industry]]></category>

		<guid isPermaLink="false">http://www.culatools.com/?p=2968</guid>
		<description><![CDATA[Engineers with top-notch parallel programming experience are in high demand in the U.S.  This fact was recently pointed out in stories published by the mainstream Daily Beast, as well as HPC Wire. A quote from Stan Ahalt in the Daily Beast story caught my attention: “It’s not enough to keep building powerful supercomputers unless we [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.culatools.com/wp-content/uploads/2012/01/business-intelligence2.jpg"><img class="alignright  wp-image-2975" title="HPC Brain" src="http://www.culatools.com/wp-content/uploads/2012/01/business-intelligence2.jpg" alt="" width="204" height="169" /></a>Engineers with top-notch parallel programming experience are in high demand in the U.S.  This fact was recently pointed out in stories published by the mainstream <a href="http://www.thedailybeast.com/articles/2011/12/28/the-u-s-is-busy-building-supercomputers-but-needs-someone-to-run-them.html">Daily Beast</a>, as well as <a href="http://www.hpcwire.com/hpcwire/2012-01-03/wanted:_supercomputer_programmers.html" target="_blank">HPC Wire</a>. A quote from Stan Ahalt in the Daily Beast story caught my attention: “It’s not enough to keep building powerful supercomputers unless we have the brains. Think of a supercomputer as a very fast racing engine. We need more drivers to use those engines.” Stan is the director of a supercomputing center at the University of North Carolina at Chapel Hill.</p>
<p>Programming supercomputers is hard work. Those involved in programming large HPC systems go through in-depth training and spend months (sometimes years) fine-tuning their algorithms until they are fully leveraging the massive computing power these machines offer. There is a growing number of tools and libraries for HPC programmers, but they are not necessarily suitable for computer engineers at all levels. For those who are not HPC experts, programming small to mid-scale systems can be a challenging and time-consuming task, something we hear quite often from our customers and partners.</p>
<p><strong>Where EM Photonics Can Make a Difference</strong></p>
<p>Companies with recently installed small- to mid-scale supercomputing systems often need help porting their applications to their new machines. This is where we bring tremendous value. We are easy to engage with and offer in-depth understanding of parallel architectures. On top of parallel programming expertise, we bring knowledge and experience in <a href="http://www.emphotonics.com/research.html" target="_blank">physics-based modeling and simulation</a>, <a href="http://www.emphotonics.com/atcom.html" target="_blank">image processing</a>, life sciences, finance, and <a href="http://www.emphotonics.com/cfd.html" target="_blank">military and defense applications</a>. (Typically, the bigger the problem, the greater the fun!)</p>
<p>We encourage you to take a peek at our <a href="http://www.emphotonics.com/" target="_blank">EM Photonics</a> site to learn more about our consulting services, as well as current research projects and <a href="http://www.emphotonics.com/news.html" target="_blank">published papers</a>. We have a team of talented engineers looking forward to tackling new challenges. Just <a href="http://www.emphotonics.com/contact-us.php" target="_blank">let us know</a> how we can help!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.culatools.com/blog/2012/01/10/not-enough-hpc-programmers-how-to-fill-the-gap/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Reordering to increase parallelism</title>
		<link>http://www.culatools.com/blog/2011/12/19/2957/</link>
		<comments>http://www.culatools.com/blog/2011/12/19/2957/#comments</comments>
		<pubDate>Mon, 19 Dec 2011 17:03:52 +0000</pubDate>
		<dc:creator>Kyle</dc:creator>
				<category><![CDATA[CULA Design Principals]]></category>
		<category><![CDATA[Sparse]]></category>

		<guid isPermaLink="false">http://www.culatools.com/?p=2957</guid>
		<description><![CDATA[Solving a dense triangular matrix (such as those produced by an LU factorization) is a serial process – each row must be solved before moving on to the next row. However, when solving a sparse triangular matrix, it’s possible to group these operations into levels where an entire level of unknowns can be solved in [...]]]></description>
			<content:encoded><![CDATA[<p>Solving a dense triangular matrix (such as those produced by an LU factorization) is a serial process – each row must be solved before moving on to the next row. However, when solving a sparse triangular matrix, it’s possible to group these operations into <em>levels</em> where an entire level of unknowns can be solved in parallel before moving on to the next level. In the worst case, there are <em>n</em> levels and the entire process must be serialized. In the best case, there is one level and the entire process can be parallelized (this would only happen in a diagonal matrix).</p>
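<p>The notion of levels can be sketched in a few lines of C. This is an illustrative sketch only, not how CULA Sparse does it internally; the function name <code>count_levels_lower_csr</code> and the zero-based CSR storage are assumptions. A row's level is one more than the highest level among the earlier rows it depends on, so rows that share a level can be solved at the same time.</p>

```c
#include <assert.h>

/* Hypothetical helper, not part of the CULA Sparse API.
 * Counts the levels of an n-by-n sparse lower-triangular matrix stored
 * in zero-based CSR (row_ptr has n + 1 entries). On return, level[i]
 * holds the zero-based level of row i; level must have room for n ints. */
int count_levels_lower_csr(int n, const int *row_ptr, const int *col_idx,
                           int *level)
{
    int num_levels = 0;
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        int lvl = 0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) {
            int j = col_idx[k];
            /* An off-diagonal entry means row i must wait for row j. */
            if (j < i && level[j] + 1 > lvl)
                lvl = level[j] + 1;
        }
        level[i] = lvl;
        if (lvl + 1 > num_levels)
            num_levels = lvl + 1;
    }
    return num_levels;
}
```

<p>With this definition a diagonal matrix yields a single level, and a dense lower-triangular matrix yields <em>n</em> levels, matching the best and worst cases above.</p>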
<p>Naturally, one might seek to decrease the number of levels in a sparse matrix in an effort to increase parallelism and (potentially) decrease the time it takes to solve the sparse triangular system. In many cases this can be achieved by reordering the sparse matrix. A matrix reordering simply involves swapping a number of rows and/or columns to alter the structure of the matrix while preserving the values. Reordering is often performed to increase accuracy or reduce the fill-in of direct solve methods. However, in the case of CULA Sparse, we reorder the matrix to increase <em>parallelism</em>. In the sparse linear algebra domain, there exist a number of different reordering schemes (such as minimum degree and approximate minimum degree) but they are beyond the scope of this blog post.</p>
<div id="attachment_2958" class="wp-caption aligncenter" style="width: 440px"><a href="http://www.culatools.com/wp-content/uploads/2011/12/reorder.png"><img class="size-full wp-image-2958" title="reorder" src="http://www.culatools.com/wp-content/uploads/2011/12/reorder.png" alt="" width="430" height="209" /></a><p class="wp-caption-text">This image illustrates how matrix reordering is applied to a sparse matrix. In this case, a large circuit simulation problem (AMD/G3_circuit) is reordered using the symmetric minimum degree reordering method.</p></div>
<p>One of the options for CULA Sparse’s ILU0 preconditioner is <em>reordering</em>. This will (potentially) reduce the levels in the lower and upper triangular factors produced by the factorization in an effort to increase the parallelism. Since applying the ILU0 preconditioner through two triangular solves is typically a massive bottleneck, any speed-up here will directly decrease the total time needed to converge on a solution.</p>
<p>When applying matrix reordering to a real-world matrix such as the circuit simulation problem introduced above, we can decrease the number of levels from 2594 to 15. This decreases the time to solve the triangular matrices from 24.2 ms to 8.5 ms, a 2.8x speedup! When using the reordered ILU0 preconditioner with the conjugate gradient solver, we see the total time per iteration drop from 53.4 ms to 22.1 ms, a 2.4x speedup!</p>
<p>In conclusion, CULA Sparse’s ILU0 reordering option (<a href="http://www.culatools.com/cula_sparse_programmers_guide/#culailu0options">more info</a>) can be used to drastically reduce the time it takes to apply the triangular factors produced by the LU factorization. However, one must also consider that the reordering step has a steep calculation overhead. Additionally, since the structure of the matrix is changing, some of the conjugate-based methods will now take a different number of iterations to converge on a solution.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.culatools.com/blog/2011/12/19/2957/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Batched Operations</title>
		<link>http://www.culatools.com/blog/2011/12/09/batched-operations/</link>
		<comments>http://www.culatools.com/blog/2011/12/09/batched-operations/#comments</comments>
		<pubDate>Fri, 09 Dec 2011 18:41:01 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.culatools.com/?p=2946</guid>
		<description><![CDATA[Readers of our forum will know that we often receive questions about the batch case for dense linear algebra operators. These are cases where the user has many very small matrices which can be solved simultaneously. Some areas where we have seen this are in image processing (inverse or SVD either per-pixel or in a [...]]]></description>
			<content:encoded><![CDATA[<p>Readers of our forum will know that we often receive <a href="http://www.culatools.com/forums/viewtopic.php?f=14&#038;t=774">questions </a>about the batch case for dense linear algebra operators. These are cases where the user has many very small matrices which can be solved simultaneously. Some areas where we have seen this are in image processing (inverse or SVD either per-pixel or in a stencil region) and fluid dynamics (solve a 5x5 matrix at each node).</p>
<p>It's worth starting with a statement regarding CPU performance for these problems. If you consider that each of the problems is O(N<span style="position: relative; top: -0.5em; font-size: 80%;">3</span>), you find that a 5x5 inverse requires only a hundred or so FLOPS, which puts the solution time for a single matrix into the microseconds regime. The process of even initiating GPU work is an order of magnitude or two larger than this, which suggests that there must be a significant number of small operations before the GPU can make a noticeable difference in overall performance.</p>
<p>In a similar vein, the GPU prefers having tens of thousands of simultaneous threads in order to get peak performance. If you consider the 5x5 matrix, the theoretical maximum number of usable threads would be one per matrix element, which is 25 in this case. This would lead to needing over 400 simultaneous matrices for performance. In reality, the number of practical threads to use per matrix is more like 1 or 5, meaning many thousands of simultaneous matrices are needed. </p>
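<p>The arithmetic behind these figures is a simple rounded-up division, sketched here in C. The function name <code>matrices_needed</code> is made up for illustration, and taking 10,000 threads as a round lower bound for "tens of thousands" is our assumption.</p>

```c
#include <assert.h>

/* Hypothetical helper for illustration only.
 * Number of simultaneous matrices needed to reach target_threads when
 * each matrix keeps threads_per_matrix threads busy (rounded up). */
long matrices_needed(long target_threads, long threads_per_matrix)
{
    assert(target_threads >= 0 && threads_per_matrix > 0);
    return (target_threads + threads_per_matrix - 1) / threads_per_matrix;
}
```

<p>With 10,000 threads as the target, 25 threads per matrix gives 400 matrices, while the more practical 5 or 1 threads per matrix give 2,000 and 10,000 matrices respectively.</p>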
<p>At this point, we can illustrate why CUDA streams aren't the proper solution to this particular problem. For background, CUDA streams allow the programmer to specify that two launched kernels are independent (in terms of data required) and thus the card is free to execute them simultaneously. One common use for this is to eliminate the so-called "tail effect", where the first kernel is finishing and so some of the hardware is idle and waiting for the rest to finish - streaming allows the next kernel to begin using that idle hardware before the first kernel has fully completed. A second use is to allow two or more kernels to occupy the card simultaneously, which is good if neither is using all of the hardware. The batching case we are describing certainly falls into the latter category.</p>
<p>One could, in theory, use a CUDA stream per problem and launch one problem at a time. This would perform poorly for two reasons. First, the number of threads per block would be far too low; we typically want no fewer than 32, and we have already explained why that is not practical for one matrix. Second, the overhead incurred by launching thousands of operations in this manner would be unacceptable, because the launch code is as expensive as (if not more expensive than) just performing the matrix operation on the CPU. The realistic approach here would be to collect elements and then group them into thread blocks, but again the time to form the batches from the streams would be more expensive than just performing the operation.</p>
<p>In response to this problem, NVIDIA's CUBLAS has introduced a feature called Batch Mode in the <a href="http://developer.nvidia.com/cuda-toolkit-41">newest developer versions</a>. At present this is available for the matrix multiply routine. Batch Mode is intended to solve this problem effectively by allowing the user to request the solution of a group of identical problems at once, albeit on different data. This is a SIMD operation, just at a slightly higher level than we normally associate with that term. Our experience with this interface is that it is competitive with the CPU for a specific set of circumstances (indeed, these are the circumstances for which the CUBLAS Batch Mode was intended): very small problems (matrix dimension below 32) and very large batch sizes (more than 1000 matrices). </p>
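<p>To show what "identical problems on different data" means in practice, here is a CPU reference sketch in C of a batched multiply. It is not the CUBLAS or CULA interface; the name <code>gemm_batched_ref</code> and the layout (square column-major matrices stored back to back in one buffer) are assumptions made for illustration.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU reference, not the CUBLAS or CULA API.
 * Computes C[b] = A[b] * B[b] for batch_count independent n-by-n
 * column-major matrices; matrix b starts at offset b * n * n. */
void gemm_batched_ref(int n, int batch_count,
                      const float *A, const float *B, float *C)
{
    assert(n > 0 && batch_count >= 0);
    const size_t stride = (size_t)n * (size_t)n;
    for (int b = 0; b < batch_count; ++b) {
        const float *a = A + (size_t)b * stride;
        const float *bm = B + (size_t)b * stride;
        float *c = C + (size_t)b * stride;
        /* Every batch entry runs exactly the same loop nest. */
        for (int j = 0; j < n; ++j) {
            for (int i = 0; i < n; ++i) {
                float sum = 0.0f;
                for (int k = 0; k < n; ++k)
                    sum += a[i + k * n] * bm[k + j * n];
                c[i + j * n] = sum;
            }
        }
    }
}
```

<p>Because each batch entry performs the same operations on its own data, the outer loop is the part a GPU implementation can spread across thread blocks.</p>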
<p>As for CULA, we have considered this approach but found that batch sizes on the order of thousands are required in order to gain competitive performance versus the CPU, which is why such solvers are not generally available in the package. It is our hope that some day we can find a solution that gets good performance on the GPU for a wide range of matrix sizes as well as varying batch sizes, but for now we are pursuing this work only on a case-by-case basis, tuned for a user's exact needs. For more information, please see our <a href="http://www.culatools.com/contact/">contact page</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.culatools.com/blog/2011/12/09/batched-operations/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Introducing the CULA Sparse Demo</title>
		<link>http://www.culatools.com/blog/2011/11/29/introducing-the-cula-sparse-demo/</link>
		<comments>http://www.culatools.com/blog/2011/11/29/introducing-the-cula-sparse-demo/#comments</comments>
		<pubDate>Tue, 29 Nov 2011 18:42:53 +0000</pubDate>
		<dc:creator>Dan</dc:creator>
				<category><![CDATA[Release Notes & News]]></category>

		<guid isPermaLink="false">http://www.culatools.com/?p=2921</guid>
		<description><![CDATA[We are very pleased to announce that we have recently released a free demo for CULA Sparse. The demo is a standalone, command-line program in which you choose your options and see the performance of a particular routine. All solvers and most features that are provided by CULA Sparse are [...]]]></description>
			<content:encoded><![CDATA[<p>We are very pleased to announce that we have recently released a free demo for CULA Sparse. The demo is a standalone, command-line program in which you choose your options and see the performance of a particular routine. All solvers and most features that are provided by CULA Sparse are supported. </p>
<p>For example, to run the demo with a cg solver and jacobi preconditioner, you can use the command below. The demo accepts matrices that are in the matrix market format (.mtx). For information on this format, see the resources provided by <a title="Matrix Market Forms" href="http://math.nist.gov/MatrixMarket/formats.html" target="_blank">this NIST site</a>.</p>
<pre>iterativeBenchmark solver=cg preconditioner=jacobi A=myfile.mtx b=ones tolerance=1e-5</pre>
<p><br /><br />
The CULA Sparse demo is powerful because it allows you to easily try out several different solvers, preconditioners, and other features without coding or building any software. And once you’ve found the combination of inputs that is ideal for you, you can easily carry this knowledge over into your CULA Sparse implementation.</p>
<p><a title="Download CULA Sparse Demo" href="http://www.culatools.com/downloads/sparse/">Download the CULA Sparse demo</a> today and see how our GPU-accelerated solvers can work for you.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.culatools.com/blog/2011/11/29/introducing-the-cula-sparse-demo/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Directly from SC11</title>
		<link>http://www.culatools.com/blog/2011/11/15/directly-from-sc11/</link>
		<comments>http://www.culatools.com/blog/2011/11/15/directly-from-sc11/#comments</comments>
		<pubDate>Tue, 15 Nov 2011 17:50:57 +0000</pubDate>
		<dc:creator>Liana</dc:creator>
				<category><![CDATA[Events and Conferences]]></category>

		<guid isPermaLink="false">http://www.culatools.com/?p=2889</guid>
		<description><![CDATA[The entire CULA team is here in Seattle and everyone is pumped up for the first big day of action. Last night, at the opening gala, we were pleased to see familiar faces all around us. It's not an easy showroom to navigate, but we hope our users will find us at booth # 244.  A [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.culatools.com/wp-content/uploads/2011/11/logo_sc11.jpg"><img class="alignright size-full wp-image-2897" title="SC11" src="http://www.culatools.com/wp-content/uploads/2011/11/logo_sc11.jpg" alt="" width="192" height="138" /></a>The entire CULA team is here in Seattle and everyone is pumped up for the first big day of action. Last night, at the opening gala, we were pleased to see familiar faces all around us. It's not an easy showroom to navigate, but we hope our users will find us at booth # 244.  A number of people came by our booth to ask about CULA Sparse, as well as a few scavenger hunters (fun!), and we hope this will be another great show for everyone. Today we will be catching up with our partners to find out what their vision of the SC market is and how we can work together and contribute to their strategies.</p>
<p>By the way, <strong>it is TODAY that John Humphrey will be giving his presentation on CULA</strong> Sparse and all of the great features added to the CULA Dense library!  We hope you can make it!</p>
<p><strong>What:</strong> Exhibitor Forums: <a href="http://sc11.supercomputing.org/schedule/event_detail.php?evid=exforum121" target="_blank">Advances in the CULA Linear Algebra Library </a><strong><br />
</strong><strong>Where:</strong> <span style="color: #ff6600;"><strong>613/614</strong></span></p>
<p>Enjoy the show!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.culatools.com/blog/2011/11/15/directly-from-sc11/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>CULA Sparse &#8211; Real World Results</title>
		<link>http://www.culatools.com/blog/2011/11/04/cula-sparse-real-world-results/</link>
		<comments>http://www.culatools.com/blog/2011/11/04/cula-sparse-real-world-results/#comments</comments>
		<pubDate>Fri, 04 Nov 2011 21:54:08 +0000</pubDate>
		<dc:creator>Kyle</dc:creator>
				<category><![CDATA[CULA Applications]]></category>
		<category><![CDATA[CULA Routine Feature]]></category>

		<guid isPermaLink="false">http://www.culatools.com/?p=2840</guid>
		<description><![CDATA[We've received a number of questions regarding the performance of our latest CULA Sparse release. Unlike the dense domain, the performance of sparse problems can change drastically depending on the structure and size of the matrix. In this blog post, we'll analyze the performance of a large real-world problem that was a perfect candidate for [...]]]></description>
			<content:encoded><![CDATA[<p>We've received a number of questions regarding the performance of our latest CULA Sparse release. Unlike the dense domain, the performance of sparse problems can change drastically depending on the structure and size of the matrix. In this blog post, we'll analyze the performance of a large real-world problem that was a perfect candidate for GPU acceleration.</p>
<p>Obtained from <a href="http://www.cise.ufl.edu/research/sparse/matrices/">The University of Florida Sparse Matrix Collection</a>, the matrix <a href="http://www.cise.ufl.edu/research/sparse/matrices/Schmid/thermal2">Schmid/thermal2</a> is a steady-state thermal problem (FEM) on an unstructured grid. This is a fairly large matrix with 1.2 million rows and 8.5 million non-zero elements. It's worth noting that this problem only needs about 100 MB of storage, so it can easily fit on even entry-level GPU offerings.</p>
<p>Like many FEM problems, the resulting matrix representation is positive definite, so the conjugate gradient (CG) solver was chosen. Using this solver, we tried all of the preconditioners available in CULA Sparse.</p>
<table style="margin-left: auto; margin-right: auto;" width="400px">
<tbody>
<tr>
<td style="background-color: white;"></td>
<th colspan="2">Time</th>
<th colspan="2">Iterations</th>
</tr>
<tr>
<th>Method</th>
<th>CPU</th>
<th>GPU</th>
<th>CPU</th>
<th>GPU</th>
</tr>
<tr>
<td>None</td>
<td>246.6</td>
<td>24.57</td>
<td>4589</td>
<td>4589</td>
</tr>
<tr>
<td>ILU</td>
<td>208.5</td>
<td>74.61</td>
<td>1946</td>
<td>1947</td>
</tr>
<tr>
<td>ILU + Reorder</td>
<td>211.2</td>
<td>54.04</td>
<td>1789</td>
<td>1789</td>
</tr>
<tr>
<td>Jacobi</td>
<td>250.0</td>
<td>29.49</td>
<td>4558</td>
<td>4555</td>
</tr>
<tr>
<td>Block Jacobi</td>
<td>271.9</td>
<td>31.99</td>
<td>4694</td>
<td>4694</td>
</tr>
</tbody>
</table>
<p>As demonstrated above, the GPU showed an appreciable speedup for all of the preconditioner methods. In the best case, with no preconditioner selected, the GPU was <strong>over 10x faster</strong> than the CPU! However, on the more serial CPU, the best time was achieved using the ILU0 preconditioner (listed as ILU in the table). Interestingly enough, the ILU0 preconditioner was not the best choice on the GPU. While this preconditioner more than halved the number of iterations, the overhead it introduced became a bottleneck, and the un-preconditioned version had the lowest wall-clock time. Comparing the best GPU algorithm to the best CPU algorithm, we still see an <strong>8.5x speedup</strong>!</p>
<p>All timing benchmarks obtained in this example were performed using an NVIDIA C2050 and an Intel X5660. The CPU results were calculated using fully optimized MKL libraries while the GPU results were obtained with CULA Sparse S1. All transfer overheads are included.</p>
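<p>The speedup figures quoted above follow directly from the Time columns of the table. The short C sketch below shows the calculation; the helper names <code>best_time</code> and <code>speedup</code> are ours and are not part of CULA Sparse.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest value in times[0..count-1]. */
double best_time(const double *times, size_t count)
{
    assert(count > 0);
    double best = times[0];
    for (size_t i = 1; i < count; ++i)
        if (times[i] < best)
            best = times[i];
    return best;
}

/* Hypothetical helper: how many times faster the GPU run was. */
double speedup(double cpu_time, double gpu_time)
{
    assert(gpu_time > 0.0);
    return cpu_time / gpu_time;
}
```

<p>For the un-preconditioned row, 246.6 / 24.57 is just over 10; comparing the best CPU time (208.5, ILU) with the best GPU time (24.57, no preconditioner) gives roughly 8.5.</p>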
]]></content:encoded>
			<wfw:commentRss>http://www.culatools.com/blog/2011/11/04/cula-sparse-real-world-results/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>CULA Sparse Available!</title>
		<link>http://www.culatools.com/blog/2011/11/03/cula-sparse-available/</link>
		<comments>http://www.culatools.com/blog/2011/11/03/cula-sparse-available/#comments</comments>
		<pubDate>Thu, 03 Nov 2011 19:36:49 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[Release Notes & News]]></category>

		<guid isPermaLink="false">http://www.culatools.com/?p=2835</guid>
		<description><![CDATA[After several months of valuable Beta testing, we are pleased to announce the release and immediate availability of CULA Sparse. Our first release contains 6 solvers, 3 preconditioners, and supports double-precision and double-precision complex in a variety of matrix formats. Performance of 10x or more versus a fully threaded CPU solution is now available in [...]]]></description>
			<content:encoded><![CDATA[<p>After several months of valuable Beta testing, we are pleased to announce the release and immediate availability of <a href="http://www.culatools.com/sparse/" title="Sparse">CULA Sparse</a>. Our first release contains 6 solvers, 3 preconditioners, and supports double-precision and double-precision complex in a variety of matrix formats. Performance of 10x or more versus a fully threaded CPU solution is now available in an easy-to-use package!</p>
<p>CULA Dense R13 is a simultaneous release, also available now, and features three new routines (potri, gesdd, and geqrfp) as well as explicit compatibility with CULA Sparse.</p>
<p>For current users, we have changed the name of CULA Premium to CULA Dense, and CULA Basic is now CULA Dense Free Edition.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.culatools.com/blog/2011/11/03/cula-sparse-available/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Interpreting CULA Sparse Results</title>
		<link>http://www.culatools.com/blog/2011/10/23/interpreting-cula-sparse-results/</link>
		<comments>http://www.culatools.com/blog/2011/10/23/interpreting-cula-sparse-results/#comments</comments>
		<pubDate>Sun, 23 Oct 2011 21:04:05 +0000</pubDate>
		<dc:creator>John</dc:creator>
				<category><![CDATA[CULA Design Principals]]></category>
		<category><![CDATA[CULA Routine Feature]]></category>

		<guid isPermaLink="false">http://www.culatools.com/?p=2420</guid>
		<description><![CDATA[One design goal for CULA Sparse was to give the user informative output so as to avoid the user having to write verbose checking routines. The routine culaIterativeResultString() is key here. This routine accepts a culaIterativeResult structure which is an output from each CULA Sparse solver (it is the last parameter). The output produced is shown [...]]]></description>
			<content:encoded><![CDATA[<p>One design goal for CULA Sparse was to give the user informative output so as to avoid the user having to write verbose checking routines. The routine <span style="font-family: 'Courier New', Courier, mono;">culaIterativeResultString()</span> is key here. This routine accepts a <span style="font-family: 'Courier New', Courier, mono;">culaIterativeResult</span> structure, which is an output from each CULA Sparse solver (it is the last parameter). The output produced is shown below:</p>
<pre>Solver:      Cg
Precond:     Block Jacobi (block size 16)
Flag:        Converged successfully in 27 iterations
Residual:    8.424304e-07

Total Time:  0.02827s (overhead + precond + solve)
   Overhead: 0.000569s
   Precond:  2.8e-05s
   Solve:    0.02767s</pre>
<p>You will notice that basic stats are produced, such as the solver and preconditioner used. The Flag field helps to interpret the mathematical status of the solve process. The example here shows a successful convergence in 27 iterations, but the Flag can also indicate conditions such as solver stagnation (failing to make progress for several consecutive iterations) or numerical breakdown. The Residual field indicates the quality of the final answer.</p>
<p>There is then a timing output block, which shows a total execution time plus a breakdown of where the time was spent. The Overhead field shows time spent for GPU-specific operations such as device memory allocation and transfer. The Precond field shows the total time required to <em>generate</em> the preconditioner, because the time required to generate a given preconditioner can vary wildly among different matrices and different preconditioners. The final field, Solve, shows the time taken for the actual system solution.</p>
<p>In addition to the <span style="font-family: 'Courier New', Courier, mono;">culaIterativeResult</span> field, each solver <em>returns</em> a <span style="font-family: 'Courier New', Courier, mono;">culaStatus</span> that is used to indicate important runtime information, such as incorrect parameters (specifying a matrix size less than zero, for example) or not having the proper version of the CUDA driver installed. Users of CULA Dense will already be familiar with this parameter. In all cases, it is recommended to first check the returned status, followed then by obtaining the iterative result string. The examples in your CULA Sparse installation clearly show how to integrate this into your code.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.culatools.com/blog/2011/10/23/interpreting-cula-sparse-results/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

