## support of dsyevr function

### support of dsyevr function

Hi,

I was wondering if you were planning to support the dsyevr function in the near future.

We were previously using this function from the Intel MKL in our application and would now like to switch to CULA in order to exploit the possibilities offered by our new hardware.

We have been running some benchmarks comparing the Intel dsyevr function with the CULA dsyevx function, which is the closest function supported by CULA. However, the test results show that the Intel dsyevr function is about 10 times faster than the CULA dsyevx one, independently of the size of the matrix used (we have tested from 10x10 up to 1000x1000). This is not what we expected, as we were hoping for a speedup using CULA. Of course, the algorithm used for dsyevr is supposed to be faster than the one used for dsyevx, but we were hoping that the parallelization offered by the CULA function would compensate and even outperform the Intel one.

Our test machine is a Fujitsu Celsius M470 with an Intel Xeon W3550 CPU and an NVIDIA Tesla C2050 GPU under Windows 7 Professional.

- vdelage
**Posts:** 1 **Joined:** Tue Apr 26, 2011 4:10 am

### Re: support of dsyevr function

Which usage scenario are you trying to benchmark? Currently we have accelerated the paths where:

1) ALL eigenvalues and vectors are requested

2) ALL eigenvalues are requested

3) Some eigenvalues are requested

We have not accelerated the paths where:

1) Some eigenvalues and vectors are requested

For the accelerated paths, we show approximately a 3-4x speedup for larger problems (over 2k). At the small end, the GPU suffers due to the extremely high memory bandwidth requirements, whereas the CPU can fit the entire problem in fast cache memory.

- kyle
- Administrator
**Posts:** 301 **Joined:** Fri Jun 12, 2009 7:47 pm

### Re: support of dsyevr function

kyle,

I'm working with Vivien. We are using the routine to obtain ALL eigenvalues and eigenvectors (path 1 in your reply). However, on the CPU we were using the MRRR algorithm (dsyevr) rather than bisection and inverse iteration (dsyevx).

We did an extensive evaluation on CPUs, and MRRR always came out as the favorite (with the exception of matrices of dimension < 600, where dsyevd was the fastest on one platform). Hence the question of whether there is any plan on your side to add such an implementation to CULA.

However, if I understand you correctly, your experience suggests that the test dimension we used (around 1000) is too small to offset the memory bandwidth issue and gain an advantage over the CPU?

- cbr322
- CULA Premium
**Posts:** 1 **Joined:** Wed Apr 20, 2011 3:17 am

### Re: support of dsyevr function

You are correct in your assessment of the sizes. The symmetric eigenvalue problem has a high memory-to-arithmetic ratio, which allows for very good performance on the CPU when the entire problem fits in fast cache memory. For example, at N=600, in double precision, the matrix is only about 2.9 MB in size (600 x 600 x 8 bytes). This easily fits in the CPU's cache, whereas the GPU, due to its architecture, must constantly read and write the data to and from its "cache" memory (also known as shared memory). This happens because shared memory banks cannot talk to each other and must go through the relatively slow main GPU memory to communicate.

I hope this explains why you aren't seeing a speedup for these sizes.
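For readers who want to check the footprint arithmetic for their own sizes, here is a minimal Python sketch. The 8 MiB cache size is an assumption chosen for illustration only and should be replaced with the figure for the actual CPU; the helper names `matrix_bytes` and `fits_in_cache` are hypothetical.

```python
def matrix_bytes(n, bytes_per_element=8):
    """Storage for a dense n x n matrix (8 bytes per double-precision value)."""
    return n * n * bytes_per_element


# ASSUMPTION: an 8 MiB last-level cache, used purely for illustration.
ASSUMED_CACHE_BYTES = 8 * 1024 * 1024


def fits_in_cache(n, cache_bytes=ASSUMED_CACHE_BYTES):
    """True if the dense n x n double-precision matrix fits in the assumed cache."""
    return matrix_bytes(n) <= cache_bytes


if __name__ == "__main__":
    for n in (10, 600, 1000, 2000):
        megabytes = matrix_bytes(n) / 1e6
        print(f"N={n:5d}: {megabytes:8.2f} MB, fits in assumed cache: {fits_in_cache(n)}")
```

Under this assumption, everything up to roughly N=1000 stays inside the cache, while N=2000 (32 MB) does not. This only counts the input matrix; workspace and eigenvector storage add to the real footprint.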

It's also worth noting that the first steps of any symmetric eigenvalue problem are to reduce the matrix to tridiagonal form and to create an orthogonal basis. These steps will dominate the calculation time, and the actual eigenvalue step is minimal (more so at large sizes due to N^3 scaling). For the later eigenvalue step, we first chose the bisection and QR methods because they have more regular parallelization patterns that better fit the GPU, but perhaps we'll examine the MRRR method at a later date.
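To illustrate why bisection parallelizes so regularly, here is a small pure-Python sketch of textbook bisection on an already tridiagonal symmetric matrix, using a Sturm-sequence count. It is not CULA's or LAPACK's implementation; the function names `count_below` and `kth_eigenvalue` are hypothetical, and the code ignores the scaling and safeguards a production routine would need.

```python
import math


def count_below(d, e, x):
    """Count eigenvalues strictly below x for the symmetric tridiagonal
    matrix with diagonal d and off-diagonal e (Sturm sequence count)."""
    count = 0
    q = 1.0
    for i in range(len(d)):
        # Recurrence: q_i = d_i - x - e_{i-1}^2 / q_{i-1}
        q = d[i] - x - (e[i - 1] ** 2 / q if i > 0 else 0.0)
        if q == 0.0:
            q = 1e-300  # nudge off zero to avoid dividing by zero next step
        if q < 0.0:
            count += 1
    return count


def kth_eigenvalue(d, e, k, tol=1e-12):
    """k-th smallest eigenvalue (k = 0, 1, ...) by bisection,
    starting from a Gershgorin interval that contains the whole spectrum."""
    n = len(d)
    radius = [
        (abs(e[i - 1]) if i > 0 else 0.0) + (abs(e[i]) if i < n - 1 else 0.0)
        for i in range(n)
    ]
    lo = min(d[i] - radius[i] for i in range(n))
    hi = max(d[i] + radius[i] for i in range(n))
    while hi - lo > tol * max(1.0, abs(lo), abs(hi)):
        mid = 0.5 * (lo + hi)
        if count_below(d, e, mid) > k:
            hi = mid  # more than k eigenvalues lie below mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # Test matrix: 2 on the diagonal, -1 off the diagonal.
    # Its eigenvalues are known in closed form: 2 - 2*cos(j*pi/(n+1)).
    n = 5
    d = [2.0] * n
    e = [-1.0] * (n - 1)
    for k in range(n):
        exact = 2.0 - 2.0 * math.cos((k + 1) * math.pi / (n + 1))
        print(k, kth_eigenvalue(d, e, k), exact)
```

Each call to `kth_eigenvalue` depends only on `d`, `e` and `k`, so the eigenvalues can be searched for independently of one another, which is the kind of regular pattern that maps naturally onto many GPU threads.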

- kyle
- Administrator
**Posts:** 301 **Joined:** Fri Jun 12, 2009 7:47 pm
