## sgels on "narrow" A matrix


### sgels on "narrow" A matrix

Hi guys,

I'm working on a problem where I need to repeatedly (thousands of times) solve a least-squares problem A*X = y, where A has dimensions 28,000,000 x 32 and y is a column vector of length 28,000,000. I've been solving this in chunks, where each chunk of A is 500,000 x 32. Each chunk takes about 3-4 seconds using numpy's least-squares routine.
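For reference, the per-chunk numpy solve described above looks roughly like this (a minimal sketch with synthetic data; the chunk is shrunk from 500,000 rows so it runs quickly):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50_000, 32   # stand-in for one 500,000 x 32 chunk
A = rng.standard_normal((m, n)).astype(np.float32)
y = rng.standard_normal(m).astype(np.float32)

# Solve the chunk's least-squares problem A x ~= y
x, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
```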

I wasn't expecting much speedup from CULA's sgels on a matrix of this odd shape, and indeed it runs 4-10x slower than numpy. I was thinking I could write a custom solver in CUDA that solves many chunks of A in parallel and then reduces to a final answer, but before embarking on that, I'm wondering if you have any ideas for how I might transform the problem to best take advantage of how CULA accelerates the computation.
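One standard way to "reduce the chunks to a final answer" without a custom kernel is to accumulate the normal equations: the 32x32 Gram matrix AᵀA and the 32-vector Aᵀy add up across chunks, leaving only a tiny system to solve at the end. A hedged numpy sketch of the idea (not CULA code; the chunk sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32
G = np.zeros((n, n))   # running sum of A_chunk^T A_chunk
b = np.zeros(n)        # running sum of A_chunk^T y_chunk

for _ in range(4):     # stand-in for the many 500,000-row chunks
    A = rng.standard_normal((10_000, n))
    y = rng.standard_normal(10_000)
    G += A.T @ A       # each chunk contributes a 32x32 update
    b += A.T @ y

x = np.linalg.solve(G, b)   # final 32-vector least-squares solution
```

One caveat: the normal equations square the condition number of A, so this is only safe when A is reasonably well conditioned; otherwise a QR-based reduction is preferable.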

Also, each A chunk is around 128MB, so transferring it back and forth might be a significant contributor to the slowdown. However, the A matrix is static across iterations, so if there were a way to avoid overwriting the matrices in place, I'd potentially only need to transfer the column vector to the GPU on each iteration.
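Since A is static across iterations, one way to exploit that (sketched here in numpy rather than CULA, as an assumption about the workflow) is to factor A once and reuse the factorization, so each iteration only touches the new right-hand side:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50_000, 32))

# Factor once; A never changes between iterations.
Q, R = np.linalg.qr(A, mode='reduced')

for _ in range(3):                   # each iteration only y changes
    y = rng.standard_normal(50_000)
    x = np.linalg.solve(R, Q.T @ y)  # least-squares solution via cached QR
```

The same idea carries over to a GPU: keep the factorization resident on the device and upload only the new y each iteration.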

Any ideas would be greatly appreciated, thanks for your time.
eyew

Posts: 1
Joined: Wed Jul 27, 2011 10:31 am

### Re: sgels on "narrow" A matrix

You are correct that the skinny matrix will not see any acceleration with CULA's current algorithm. However, we are working on alternate algorithms for tall-and-skinny matrices using communication-avoiding techniques that will certainly show a speedup.
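The communication-avoiding idea referred to above can be illustrated with a TSQR-style reduction (a numpy sketch of the general technique, not CULA's implementation): QR-factor each chunk independently, then QR the small stacked R factors once, so only tiny n x n blocks ever need to be combined.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 32
chunks = [rng.standard_normal((5_000, n)) for _ in range(4)]
ys = [rng.standard_normal(5_000) for _ in range(4)]

# Local QR on each chunk; these are independent and parallelizable.
Rs, Qty = [], []
for A, y in zip(chunks, ys):
    Q, R = np.linalg.qr(A, mode='reduced')
    Rs.append(R)            # n x n
    Qty.append(Q.T @ y)     # length-n

# Reduce: one small QR on the stacked (k*n) x n matrix of R factors.
Q2, R2 = np.linalg.qr(np.vstack(Rs), mode='reduced')
x = np.linalg.solve(R2, Q2.T @ np.concatenate(Qty))
```

The chunk-local factorizations dominate the work and never communicate with each other, which is what makes the tall-skinny case parallelize well.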
kyle

Posts: 301
Joined: Fri Jun 12, 2009 7:47 pm

### Re: sgels on "narrow" A matrix

Has this been improved already, perhaps?
ikku100

Posts: 4
Joined: Mon Jun 18, 2012 6:45 am