## sgels on "narrow" A matrix


Hi guys,

I'm working on a problem where it's necessary to repeatedly (thousands of times) solve a least squares problem A*X=y, where A has dimensions 28,000,000 x 32 and y is a column vector of length 28,000,000. I've been solving this in chunks, where each chunk of A is 500,000 x 32. Each of these chunks takes about 3-4 seconds using the numpy least squares routine.

I wasn't expecting much speedup from the CULA sgels on a matrix of this weird shape, and it indeed performs 4-10x slower than numpy. I was thinking I could write a custom solver in CUDA that solves many "chunks" of A in parallel and then reduces them to a final answer. Before embarking on that, I'm wondering if you have any ideas for how I might transform the problem to best take advantage of how CULA accelerates the computation.
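
One way the "solve chunks, then reduce" idea can be sketched is with the normal equations: each chunk contributes a 32 x 32 Gram matrix and a length-32 right-hand side, and those small pieces are summed before a single small solve. This is only an illustrative NumPy sketch, not CULA's method; the function name `solve_chunked_normal_equations` and the test sizes are assumptions, and the approach squares the condition number of A, so it is only reasonable when A is well conditioned.

```python
import numpy as np

def solve_chunked_normal_equations(chunks):
    """Least squares over row blocks (A_i, y_i) via summed Gram matrices.

    Hypothetical helper: accumulates A_i^T A_i and A_i^T y_i in float64,
    then solves the small n x n system once.
    """
    gram = None
    rhs = None
    for A_i, y_i in chunks:
        A64 = np.asarray(A_i, dtype=np.float64)
        y64 = np.asarray(y_i, dtype=np.float64)
        g = A64.T @ A64          # n x n contribution of this chunk
        r = A64.T @ y64          # length-n contribution of this chunk
        gram = g if gram is None else gram + g
        rhs = r if rhs is None else rhs + r
    # The reduction step: one small dense solve on the summed pieces.
    return np.linalg.solve(gram, rhs)

# Small stand-in problem (the real one is 28,000,000 x 32).
rng = np.random.default_rng(0)
A = rng.standard_normal((20_000, 32)).astype(np.float32)
y = rng.standard_normal(20_000).astype(np.float32)
chunks = [(A[i:i + 5_000], y[i:i + 5_000]) for i in range(0, 20_000, 5_000)]

x_chunked = solve_chunked_normal_equations(chunks)
x_ref = np.linalg.lstsq(A.astype(np.float64), y.astype(np.float64), rcond=None)[0]
```

Because the per-chunk products are independent, they could in principle be computed in parallel and summed afterwards, which is the reduction structure described above.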

Also, the A matrix chunks are around 128MB, so transferring them back and forth might be a significant contributor to the slowdown. However, the A matrix is the same on every iteration. If there were a way to avoid overwriting the matrices in place, potentially I'd only need to transfer the column vector to the GPU on each iteration.
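
Since A does not change between iterations, one possible way to exploit that is to factor once and reuse the factor, so that each iteration only touches the new y. The sketch below assumes a well-conditioned A and uses a Cholesky factor of the Gram matrix; the class name `CachedLeastSquares` is hypothetical, and this is plain NumPy on the CPU rather than anything CULA provides.

```python
import numpy as np

class CachedLeastSquares:
    """Hypothetical helper: factor A^T A once, then solve for many y vectors."""

    def __init__(self, A):
        # A is kept read-only; nothing here overwrites it.
        self.A = np.asarray(A, dtype=np.float64)
        gram = self.A.T @ self.A
        # Lower-triangular L with gram = L @ L.T (requires full column rank).
        self.L = np.linalg.cholesky(gram)

    def solve(self, y):
        # Per-iteration work: one matrix-vector product and two small solves.
        rhs = self.A.T @ np.asarray(y, dtype=np.float64)
        z = np.linalg.solve(self.L, rhs)
        return np.linalg.solve(self.L.T, z)

# Small stand-in problem to check the result against numpy's lstsq.
rng = np.random.default_rng(1)
A = rng.standard_normal((10_000, 32))
solver = CachedLeastSquares(A)

y = rng.standard_normal(10_000)
x_cached = solver.solve(y)
x_ref = np.linalg.lstsq(A, y, rcond=None)[0]
```

In a GPU setting the same structure would mean A stays resident on the device and only y moves each iteration, though whether that removes the bottleneck would need to be measured.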

Any ideas would be greatly appreciated, thanks for your time.

- eyew
**Posts:** 1 **Joined:** Wed Jul 27, 2011 10:31 am

### Re: sgels on "narrow" A matrix

You are correct that the skinny matrix will not see any acceleration with CULA's current algorithm. However, we are working on alternative algorithms for tall and skinny matrices using communication-avoiding techniques that will certainly show a speedup.
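
For readers unfamiliar with the term, a common communication-avoiding approach for tall and skinny matrices is TSQR: QR-factor each row block independently, stack the small R factors, and QR-factor that stack. The following is a minimal NumPy sketch of that general idea applied to least squares. It is an assumption about the kind of technique meant here, not a description of CULA's implementation, and the function name `tsqr_lstsq` is hypothetical.

```python
import numpy as np

def tsqr_lstsq(A_chunks, y_chunks):
    """Least squares via a one-level TSQR over row blocks.

    Each block must have at least as many rows as A has columns.
    """
    R_blocks, z_blocks = [], []
    for A_i, y_i in zip(A_chunks, y_chunks):
        # Independent local QR: A_i = Q_i @ R_i, with R_i of size n x n.
        Q_i, R_i = np.linalg.qr(A_i, mode="reduced")
        R_blocks.append(R_i)
        z_blocks.append(Q_i.T @ y_i)   # local projection of y_i
    # Second stage: QR of the stacked R factors (k*n x n, which is small).
    Q2, R = np.linalg.qr(np.vstack(R_blocks), mode="reduced")
    rhs = Q2.T @ np.concatenate(z_blocks)
    # R is upper triangular; a general solve is used here for simplicity.
    return np.linalg.solve(R, rhs)

# Small stand-in problem to compare against numpy's lstsq.
rng = np.random.default_rng(2)
A = rng.standard_normal((8_000, 32))
y = rng.standard_normal(8_000)
A_chunks = np.array_split(A, 4)
y_chunks = np.array_split(y, 4)

x_tsqr = tsqr_lstsq(A_chunks, y_chunks)
x_ref = np.linalg.lstsq(A, y, rcond=None)[0]
```

Unlike the normal equations, this route does not square the condition number of A, at the cost of a QR factorization per block.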

- kyle
- Administrator
**Posts:** 301 **Joined:** Fri Jun 12, 2009 7:47 pm

### Re: sgels on "narrow" A matrix

Has this been improved already, perhaps?

- ikku100
**Posts:** 4 **Joined:** Mon Jun 18, 2012 6:45 am
