Research Papers

Listed here are a few papers published by users who have put CULA to the test during their research projects. You will also find papers published by our team of CULA engineers.

If you have published a research paper in which your work with CULA is referenced, we encourage you to share it with the rest of the CULA user community.

Contact us if you would like to see your paper listed here or if you would like to be featured in one of the case studies we will be writing in the near future.


CULA: Hybrid GPU Accelerated Linear Algebra Routines (Proceedings Paper)

Author(s): John Humphrey, Daniel Price, Kyle Spagnoli, Aaron Paolini, Eric Kelmelis

Summary: The modern graphics processing unit (GPU) found in many standard personal computers is a highly parallel math processor capable of nearly 1 TFLOPS peak throughput at a cost similar to a high-end CPU and an excellent FLOPS/watt ratio. High-level linear algebra operations are computationally intense, often requiring O(N³) operations, and would seem a natural fit for the processing power of the GPU. Our work is on CULA, a GPU-accelerated implementation of linear algebra routines. We present results from factorizations such as LU decomposition, singular value decomposition, and QR decomposition, along with applications like system solution and least squares.
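
The routines named above follow the standard LAPACK factorizations. As a rough sketch of the math they compute, here are CPU-side NumPy (LAPACK-backed) equivalents; this illustrates the operations CULA accelerates, not CULA's own API, and the matrix sizes are illustrative:

```python
import numpy as np

# The dense factorizations CULA accelerates, shown with NumPy's
# LAPACK-backed equivalents (a CPU sketch of the same math).
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 100))
b = rng.standard_normal(100)

# LU decomposition -> system solution (LAPACK gesv family)
x = np.linalg.solve(A, b)
assert np.allclose(A @ x, b)

# QR decomposition -> least squares (LAPACK geqrf/gels family)
M = rng.standard_normal((100, 20))
y = rng.standard_normal(100)
coef, *_ = np.linalg.lstsq(M, y, rcond=None)

# Singular value decomposition (LAPACK gesvd family)
U, s, Vt = np.linalg.svd(A)
assert np.allclose(U @ np.diag(s) @ Vt, A)
```

Each of these steps costs O(N³) operations on an N×N matrix, which is why offloading them to the GPU pays off at large sizes.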

View it online or download a PDF version now.

"Copyright 2010 Society of Photo-Optical Instrumentation Engineers. One print or electronic copy may be made for personal use only. Systematic electronic or print reproduction and distribution, duplication of any material in this paper for a fee or for commercial purposes, or modification of the content of the paper are prohibited."

To cite CULA in a paper, please use this citation:
J. R. Humphrey, D. K. Price, K. E. Spagnoli, A. L. Paolini, and E. J. Kelmelis, "CULA: Hybrid GPU Accelerated Linear Algebra Routines," SPIE Defense and Security Symposium (DSS), April 2010.

Papers published by CULA users

Using GPU with CUDA to Accelerate MoM-Based Electromagnetic Simulation of Wire-Grid Models

Author(s): Tomasz Topa, Andrzej Karwowski, and Artur Noga

Summary: A CUDA-enabled graphics processing unit (GPU) accelerated implementation of the method of moments (MoM) for electromagnetic simulation of wire-grid models of arbitrary configurations of conducting surfaces and wires is presented. The solution, based on the frequency-domain electric field integral equation (EFIE) discretized using piecewise-linear (triangular) functions for expansion and testing, is considered. Some issues pertinent to porting a single-CPU sequential code to an inherently parallel GPU platform are addressed. The GPU numerical results for a user-created benchmark structure are backed up by comparison with CPU results. A noticeable speedup (about 6×) of the overall MoM simulation is achieved by employing the GPU.
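
In MoM, discretizing the integral equation reduces the problem to a dense linear system Z I = V for the unknown current coefficients, and solving that system is the O(N³) step offloaded to the GPU. A minimal CPU-side stand-in with random data (the matrix values and sizes are illustrative, not from the paper):

```python
import numpy as np

# MoM reduces the EFIE to a dense complex linear system Z @ I = V,
# where Z is the impedance matrix, V the excitation vector, and I the
# unknown basis-function current coefficients.  Toy stand-in data:
rng = np.random.default_rng(1)
n = 200  # hypothetical number of basis functions
Z = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Z += n * np.eye(n)            # keep the toy matrix well conditioned
V = rng.standard_normal(n) + 1j * rng.standard_normal(n)

# The dense O(N^3) solve: this is the step the papers run on the GPU.
I = np.linalg.solve(Z, V)
```

In a real solver, filling Z (evaluating interaction integrals between basis functions) is also expensive and parallelizes well, which is why both papers port matrix fill and solve together.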

Available at:

Adapting MoM with RWG Basis Functions to GPU Technology Using CUDA

Author(s): Tomasz Topa, Artur Noga, and Andrzej Karwowski

Summary: A CUDA-enabled graphics processing unit (GPU)-accelerated implementation of the method of moments (MoM) for solving three-dimensional conducting body–wire problems is presented. The solution is based on the mixed potential integral equation (MPIE) discretized using Rao–Wilton–Glisson (RWG) basis functions. The CUDA environment is employed to port a single-CPU sequential code to the parallel GPU platform, and some relevant issues are discussed. Numerical results are given for a helical antenna with a cylindrical cup reflector. A measured speedup of about eight times over the CPU implementation is demonstrated.

Available at:

The CUBLAS and CULA Based GPU Acceleration of Adaptive Finite Element Framework for Bioluminescence Tomography

Author(s): Bo Zhang, Xiang Yang, Fei Yang, Xin Yang, Chenghu Qin, Dong Han, Xibo Ma, Kai Liu, and Jie Tian

Summary: In this paper, we introduce for the first time a new kind of acceleration technology, the graphics processing unit (GPU), to accelerate the adaptive finite element (AFE) framework for bioluminescence tomography (BLT). Besides raw processing speed, GPU technology offers a good balance between cost and performance. CUBLAS and CULA are two important and powerful libraries for programming on NVIDIA GPUs. With their help, it is easy to write code for an NVIDIA GPU without worrying about the details of the hardware environment of a specific device.

Access this paper at

Accelerating Frequency-Domain Diffuse Optical Tomographic Image Reconstruction Using Graphics Processing Units

Author(s): Jaya Prakash, Venkittarayan Chandrasekharan, Vishwajith Upendra, and Phaneendra K. Yalavarthy

Summary: Diffuse optical tomographic image reconstruction uses advanced numerical models that are too computationally costly to implement in real time. Graphics processing units (GPUs) offer massive parallelization on the desktop that can accelerate these computations. An open-source GPU-accelerated linear algebra library is used to compute the most intensive matrix-matrix calculations and matrix decompositions involved in solving the system of linear equations.
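
As a hedged sketch of the kind of dense linear algebra such a reconstruction spends its time in, the following forms the Jacobian normal equations and solves a Tikhonov-regularized update. This is a standard reconstruction step assumed here for illustration; the paper's exact formulation, sizes, and regularization may differ:

```python
import numpy as np

# Sketch of a typical tomographic update: build the normal-equation
# matrix from the Jacobian J and solve for a regularized parameter
# update (a standard Tikhonov / Levenberg-Marquardt style step).
rng = np.random.default_rng(2)
n_meas, n_vox = 300, 500      # hypothetical measurement / voxel counts
J = rng.standard_normal((n_meas, n_vox))
residual = rng.standard_normal(n_meas)
lam = 0.1                     # regularization parameter (assumed value)

# J.T @ J is the intensive matrix-matrix product; the solve is the
# decomposition step, both of which the GPU library accelerates.
H = J.T @ J + lam * np.eye(n_vox)
delta = np.linalg.solve(H, J.T @ residual)
```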

Access this paper at

Option Pricing with the SABR Model on the GPU

Author(s): Yu Tian, Zili Zhu, Fima C. Klebaner, and Kais Hamza

Summary: In this paper, we present our research on accelerating option pricing with Monte Carlo techniques on the GPU. We first introduce some basic ideas of GPU programming and then the stochastic volatility SABR model. Under the SABR model, we discuss option pricing with Monte Carlo techniques. (...) Finally, we implement our GPU programs and compare their performance with their CPU counterparts. From our numerical results, around 100× speedup in European option pricing and 10× speedup in American option pricing can be achieved by GPU computing while maintaining satisfactory pricing accuracy.
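
For orientation, here is a minimal CPU-side sketch of European call pricing under SABR with an Euler-discretized Monte Carlo scheme. All parameter values are illustrative, and none of the paper's GPU implementation details are reproduced:

```python
import numpy as np

# SABR dynamics: dF = sigma * F^beta dW1, dsigma = alpha * sigma dW2,
# with corr(dW1, dW2) = rho.  Euler time stepping over vectorized paths.
rng = np.random.default_rng(3)
F0, sigma0 = 100.0, 0.3           # initial forward and volatility (assumed)
alpha, beta, rho = 0.4, 0.7, -0.3  # SABR parameters (assumed)
K, T = 100.0, 1.0                  # strike and maturity (assumed)
n_paths, n_steps = 50_000, 100
dt = T / n_steps

F = np.full(n_paths, F0)
sig = np.full(n_paths, sigma0)
for _ in range(n_steps):
    z1 = rng.standard_normal(n_paths)
    z2 = rho * z1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(n_paths)
    # Forward update, absorbed at zero; volatility updated exactly
    # (geometric Brownian motion) for stability.
    F = np.maximum(F + sig * np.abs(F) ** beta * np.sqrt(dt) * z1, 0.0)
    sig = sig * np.exp(alpha * np.sqrt(dt) * z2 - 0.5 * alpha**2 * dt)

price = np.mean(np.maximum(F - K, 0.0))   # undiscounted forward call value
```

Each path is independent, which is what makes this workload map so naturally onto thousands of GPU threads.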

Access this paper at

GPU-accelerated and parallelized ELM ensembles for large-scale regression

Author(s): Mark van Heeswijk, Yoan Miche, Erkki Oja, and Amaury Lendasse

Summary: The paper presents an approach for performing regression on large data sets in reasonable time, using an ensemble of extreme learning machines (ELMs). The main purpose and contribution of this paper are to explore how the evaluation of this ensemble of ELMs can be accelerated in three distinct ways: (1) training and model structure selection of the individual ELMs are accelerated by performing these steps on the graphics processing unit (GPU), instead of the processor (CPU); (2) the training of ELM is performed in such a way that computed results can be reused in the model structure selection, making training plus model structure selection more efficient; (3) the modularity of the ensemble model is exploited and the process of model training and model structure selection is parallelized across multiple GPU and CPU cores, such that multiple models can be built at the same time. The experiments show that competitive performance is obtained on the regression tasks, and that the GPU-accelerated and parallelized ELM ensemble achieves attractive speedups over using a single CPU. Furthermore, the proposed approach is not limited to a specific type of ELM and can be employed for a large variety of ELMs.
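
As a point of reference, a single ELM for regression reduces to a random hidden layer followed by a least-squares fit of the output weights. The sketch below, with illustrative data and hyperparameters not taken from the paper, shows that core training step, which the ensemble replicates across many models and devices:

```python
import numpy as np

# Minimal extreme learning machine (ELM) for regression: the input
# weights are random and never trained; only the output layer is fit,
# by linear least squares on the hidden-layer activations.
rng = np.random.default_rng(4)
X = rng.uniform(-1.0, 1.0, size=(1000, 5))          # toy inputs
y = np.sin(X.sum(axis=1)) + 0.05 * rng.standard_normal(1000)

n_hidden = 100                                       # assumed model size
W = rng.standard_normal((5, n_hidden))               # random input weights
b = rng.standard_normal(n_hidden)                    # random biases
H = np.tanh(X @ W + b)                               # hidden activations

beta, *_ = np.linalg.lstsq(H, y, rcond=None)         # fit output weights
y_hat = H @ beta
```

The least-squares solve over H is the dense linear algebra that the paper moves to the GPU; because each ensemble member is independent, many such fits can run in parallel across GPU and CPU cores.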

Access this paper at: