## Rpeak of NVIDIA GPUs

9 posts • Page 1 of 1

### Rpeak of NVIDIA GPUs

Hello:

Can someone point me to a paper, document, or web page where the Rpeak of NVIDIA GPUs is listed? Or is there a way to compute the Rpeak?

I have a GeForce GTX 550 Ti and I obtain an Rmax of about 40 GFLOPS using culaDgetrf() (DGEMM from CUBLAS gives a similar value). I would like to compare this value against the Rpeak of the hardware.

Cheers

- jgpallero
**Posts:** 10 **Joined:** Wed May 08, 2013 3:01 pm

### Re: Rpeak of NVIDIA GPUs

Thank you for the link, John. I can see that my GeForce GTX 550 Ti has a theoretical single-precision peak of 691.2 GFLOPS. To convert this value to double precision, I found a page (http://en.wikipedia.org/wiki/Nvidia_tesla) stating that for the GeForce 500 series the double-precision peak should be computed as the single-precision peak divided by 8, so the double-precision Rpeak for my card is 691.2/8 = 86.4 GFLOPS.

As I stated in my first post, with culaDgeqrf (matrices in the range N=3000 to 6000) I obtain an Rmax of about 40 GFLOPS, i.e., an Rmax/Rpeak ratio of about 50%. With Intel MKL on the CPU, for example, a ratio of about 85%-90% can be achieved.
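As a quick check of the arithmetic in this post, here is a minimal Python sketch. The numbers are the ones quoted above; the divisor of 8 is the value taken from the Wikipedia page, not something verified here, and the function names are only illustrative.

```python
# Figures quoted in the post (GFLOPS). The divisor is the poster's
# assumption, taken from the Wikipedia page mentioned above.
SP_PEAK_GFLOPS = 691.2
SP_TO_DP_DIVISOR = 8
RMAX_GFLOPS = 40.0


def dp_peak(sp_peak_gflops, divisor):
    """Double-precision peak derived from the single-precision peak."""
    return sp_peak_gflops / divisor


def efficiency(rmax_gflops, rpeak_gflops):
    """Fraction of the theoretical peak that was actually measured."""
    return rmax_gflops / rpeak_gflops


rpeak_dp = dp_peak(SP_PEAK_GFLOPS, SP_TO_DP_DIVISOR)
ratio = efficiency(RMAX_GFLOPS, rpeak_dp)

print(f"DP Rpeak:   {rpeak_dp:.1f} GFLOPS")
print(f"Rmax/Rpeak: {ratio:.1%}")
```

With these inputs the ratio comes out slightly under one half, which is where the "about 50%" figure comes from.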

I'm using the CULA functions that operate on data in main (host) memory, so the copy to the GPU is done internally and, of course, takes time, which worsens and distorts the performance figures.

Is there any document, test, or study on the Rmax/Rpeak performance of CULA, using the functions with data in host RAM and in GPU memory, for different GPUs?

Cheers

- jgpallero
**Posts:** 10 **Joined:** Wed May 08, 2013 3:01 pm

### Re: Rpeak of NVIDIA GPUs

That's very interesting! Thank you for the link.

I suppose I obtain an Rmax of only around 40 GFLOPS because I use the CULA interface that manages the GPU resources internally, so some of the measured time is actually spent allocating and deallocating data.

But why the divisor 12?

Cheers

- jgpallero
**Posts:** 10 **Joined:** Wed May 08, 2013 3:01 pm

### Re: Rpeak of NVIDIA GPUs

These utilization numbers (~50%) are typical for GPUs. If you're not familiar with LAPACK algorithms, they work down the diagonal of the matrix, and at each step the problem size decreases. Eventually the remaining problem becomes too small for good performance on the GPU, i.e., there is too little work to keep all the cores busy. The CPU has far fewer compute elements, so it can sustain near-peak performance even on the small subproblems. The CPU also doesn't need to deal with data transfers.
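To illustrate the shrinking problem size, here is a rough Python sketch of how the trailing submatrix evolves in a generic right-looking blocked LU. This is a textbook-style model, not CULA's actual implementation; the block size of 500 and the function name are hypothetical choices made only for the example.

```python
def trailing_update_sizes(n, nb):
    """List (step, trailing_order, update_flops) for a blocked LU of order n.

    At each step a panel of width nb is factored, then the remaining
    trailing submatrix is updated with a matrix-matrix product.
    """
    steps = []
    for k in range(0, n, nb):
        b = min(nb, n - k)      # width of the current panel
        m = n - k - b           # order of the trailing submatrix
        flops = 2 * m * m * b   # (m x b) times (b x m) update
        steps.append((k // nb, m, flops))
    return steps


steps = trailing_update_sizes(n=3000, nb=500)
for step, m, flops in steps:
    print(f"step {step}: trailing {m} x {m}, update ~{flops / 1e9:.2f} GFLOP")
```

The work per step falls off quadratically with the trailing order, so the later steps offer far less parallel work than the first ones.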

As for your question on the impact of transfers, we tend to find that it's about a 10% cost, though it will depend on your particular hardware and the problem's execution duration.
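For a feel of how such a transfer cost can be estimated, here is a back-of-the-envelope Python sketch. The host-device bandwidth of 5 GB/s is purely an assumed placeholder (measure your own system), as is the assumption that the matrix is copied once to the GPU and once back; the matrix size and rate are the ones reported in this thread.

```python
def transfer_share(n, measured_gflops, bandwidth_gb_s, copies=2):
    """Estimate the share of total run time spent copying an n x n
    double-precision matrix, given the measured overall rate."""
    bytes_moved = copies * n * n * 8                    # 8 bytes per double
    t_transfer = bytes_moved / (bandwidth_gb_s * 1e9)   # seconds
    total_flops = (2.0 / 3.0) * n**3                    # LU operation count
    t_total = total_flops / (measured_gflops * 1e9)     # seconds, copies included
    return t_transfer / t_total


# ASSUMPTION: 5 GB/s effective bandwidth, one copy in and one copy out.
share = transfer_share(n=3000, measured_gflops=40.0, bandwidth_gb_s=5.0)
print(f"estimated transfer share: {share:.1%}")
```

Because the copy grows as N^2 while the factorization grows as N^3, the share shrinks for larger matrices and grows for smaller ones.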

I should also note that CULA is performance-tuned primarily for the current GPU generation (your 500 is previous-generation) and for the largest GPUs (the 550 has relatively few cores).

- john
- Administrator
**Posts:** 587 **Joined:** Thu Jul 23, 2009 2:31 pm

### Re: Rpeak of NVIDIA GPUs

Hello,

After several months, I'm back to GPU computing. I again have some questions about computing the theoretical peak. At http://en.wikipedia.org/wiki/Comparison ... 500_Series one can read that the 691.2 GFLOPS theoretical peak for my GTX 550 Ti is an FMA peak, i.e., it is based on the FMA (c=a*b+c) operation count. Since a FLOP is counted as either one product or one addition, should this 691.2 be multiplied by two to obtain the Rpeak in FLOPS? The linked page says that each unit is capable of 2 FMAs per cycle, so that is 4 FLOPs per cycle. Then, as the GTX 550 Ti has 192 CUDA cores, the theoretical peak in single precision should be 192*4*1.8 GHz = 1382.4 GFLOPS, and 16*4*1.8 = 115.2 GFLOPS in double precision, as the DP/SP ratio for this GPU is 1/12. Am I right?
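The peak formula used in this post (cores times FLOPs per cycle times clock) can be tabulated with a short Python sketch. The core count, clock, and 1/12 ratio are the figures quoted above; which FLOPs-per-cycle value is correct is exactly the open question, so both readings are computed and neither is asserted as the right one.

```python
def peak_gflops(units, flops_per_cycle, clock_ghz):
    """Theoretical peak = units * FLOPs per unit per cycle * clock (GHz)."""
    return units * flops_per_cycle * clock_ghz


CORES = 192          # CUDA cores quoted for the GTX 550 Ti
CLOCK_GHZ = 1.8      # clock quoted in the post
DP_UNITS = CORES // 12   # 1/12 DP/SP ratio quoted in the post

for flops_per_cycle in (2, 4):
    sp = peak_gflops(CORES, flops_per_cycle, CLOCK_GHZ)
    dp = peak_gflops(DP_UNITS, flops_per_cycle, CLOCK_GHZ)
    print(f"{flops_per_cycle} FLOPs/cycle: SP {sp:.1f} GFLOPS, DP {dp:.1f} GFLOPS")
```

Note that 192 × 2 × 1.8 reproduces the 691.2 figure, which would be consistent with that number already counting each FMA as two FLOPs; the thread itself does not settle which reading is correct.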

As I previously posted, I obtain a double-precision performance of about 40 GFLOPS (counting 1 FLOP as 1 product or 1 addition) using culaDgetrf with matrices of dimension 3000x3000 and NRHS=1. For the performance computation I took the operation count of dgetrf as (2/3)N^3. So the actual performance is 40/115.2*100 = 34.7% of the peak. The timer was started before the call to culaDgetrf(), so the internal copy of data to the GPU is included in the total time used to compute the performance. I find this 34% a bit low, but I suppose that is because the GTX 550 Ti is a low-end GPU.
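The conversion from a wall-clock time to a GFLOPS figure described here can be sketched in Python as follows. The elapsed time of 0.45 s is a hypothetical value chosen so that the result matches the roughly 40 GFLOPS reported; it is not a measurement from the thread, and the helper names are illustrative.

```python
def lu_flops(n):
    """Leading-order operation count of an LU factorization: (2/3) n^3."""
    return (2.0 / 3.0) * n**3


def achieved_gflops(n, seconds):
    """Rate implied by factoring an n x n matrix in the given time."""
    return lu_flops(n) / seconds / 1e9


N = 3000
ELAPSED_S = 0.45         # HYPOTHETICAL wall-clock time, copies included
RPEAK_DP_GFLOPS = 115.2  # double-precision peak as computed in the post

rmax = achieved_gflops(N, ELAPSED_S)
percent_of_peak = 100.0 * rmax / RPEAK_DP_GFLOPS

print(f"Rmax: {rmax:.1f} GFLOPS ({percent_of_peak:.1f}% of peak)")
```

Since the timer wraps the whole call, the rate obtained this way folds allocation and host-device copies into the result.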

In summary, I'm in a terrible mess about the peak computation on the GPU. Can anyone help me, please?

- jgpallero
**Posts:** 10 **Joined:** Wed May 08, 2013 3:01 pm

### Re: Rpeak of NVIDIA GPUs

There are many good information and details so I hope you will read carefully and take more information by take a look here.

- raima55
**Posts:** 1 **Joined:** Fri Oct 24, 2014 12:07 am

### Re: Rpeak of NVIDIA GPUs

raima55 wrote: There are many good information and details so I hope you will read carefully and take more information by take a look here.

Hello:

Thank you for your answer. But where is "here"? If you are referring to a link, you forgot to paste it.

- jgpallero
**Posts:** 10 **Joined:** Wed May 08, 2013 3:01 pm
