Rpeak of NVIDIA GPUs

General discussion for CULA. Use this forum for questions, examples, feedback, and feature requests.


Postby jgpallero » Fri May 10, 2013 11:30 am

Hello:

Can someone point me to a paper, document, or web page where the Rpeak of NVIDIA GPUs is listed? Or perhaps to a way of computing Rpeak?

I have a GeForce GTX 550 Ti and I obtain an Rmax of about 40 GFLOPS using culaDgetrf() (a similar value is obtained using DGEMM from CUBLAS). I would like to compare this value against the theoretical peak (Rpeak) of the hardware.
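For reference, Rmax for an LU factorization is usually obtained by dividing the nominal operation count of dgetrf, about (2/3)N^3 FLOPs, by the measured wall-clock time. A minimal Python sketch, assuming the time is measured around the whole culaDgetrf() call; the function name `lu_gflops` and the 0.45 s timing in the example are hypothetical.

```python
def lu_gflops(n: int, seconds: float) -> float:
    """Achieved GFLOPS for an n x n LU factorization (dgetrf) that took `seconds`.

    Uses the usual leading-order operation count of (2/3) * n**3 FLOPs,
    counting one multiplication or one addition as one FLOP.
    """
    flops = (2.0 / 3.0) * n ** 3
    return flops / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical timing: a 3000 x 3000 factorization finishing in 0.45 s.
    print(f"{lu_gflops(3000, 0.45):.1f} GFLOPS")  # prints "40.0 GFLOPS"
```

Lower-order terms of the operation count are ignored, which is the common convention for LU benchmarks at these sizes.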

Cheers
jgpallero
 
Posts: 10
Joined: Wed May 08, 2013 3:01 pm

Re: Rpeak of NVIDIA GPUs

Postby john » Fri May 10, 2013 12:46 pm

john
Administrator
 
Posts: 587
Joined: Thu Jul 23, 2009 2:31 pm

Re: Rpeak of NVIDIA GPUs

Postby jgpallero » Sat May 11, 2013 10:50 am

Thank you for the link, John. I can see that my GeForce GTX 550 Ti has a theoretical single-precision peak of 691.2 GFLOPS. To convert this value to double precision I found a link (http://en.wikipedia.org/wiki/Nvidia_tesla) which states that for the GeForce 500 series the double-precision peak should be computed as the single-precision peak divided by 8, so the double-precision Rpeak for my card is 691.2/8 = 86.4 GFLOPS.
As I stated in my first post, I obtain with culaDgetrf (with matrices in the range N=3000 to 6000) an Rmax of about 40 GFLOPS, i.e., an Rmax/Rpeak ratio of about 50%. Using Intel MKL on the CPU, for example, a ratio of about 85%-90% can be achieved.
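The conversion and the efficiency ratio in this post are two simple divisions. A minimal sketch, assuming the SP-to-DP divisor is taken from a table such as the Wikipedia page cited above (8 here; a divisor of 12 is suggested for this card later in the thread); the function names are illustrative only.

```python
def dp_peak(sp_peak_gflops: float, sp_to_dp_divisor: float) -> float:
    """Double-precision Rpeak derived from the single-precision Rpeak."""
    return sp_peak_gflops / sp_to_dp_divisor


def efficiency(rmax_gflops: float, rpeak_gflops: float) -> float:
    """Fraction of the theoretical peak actually achieved (Rmax / Rpeak)."""
    return rmax_gflops / rpeak_gflops


if __name__ == "__main__":
    rpeak = dp_peak(691.2, 8)  # 86.4 GFLOPS with the divisor of 8
    print(f"Rpeak(DP) = {rpeak:.1f} GFLOPS, Rmax/Rpeak = {efficiency(40.0, rpeak):.0%}")
```

With 40 GFLOPS this gives roughly 46%, in line with the "about 50%" above; the result is only as good as the divisor assumed.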
I'm using the CULA functions that work with the original data in main (host) memory, so the copy to the GPU is done internally and, of course, takes time, which lowers and distorts the performance figures.
Is there any document, benchmark, or study of CULA's Rmax/Rpeak performance on different GPUs, using both the functions that take data in RAM and those that take data in GPU memory?
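A rough way to gauge how much the implicit host-to-device copy can weigh is to compare the time needed to move the matrix over PCI Express with the factorization time. A back-of-the-envelope sketch, assuming the matrix is copied to the GPU once and the factors copied back once, and assuming an effective bandwidth of 5 GB/s and an on-GPU rate of 40 GFLOPS; both numbers are placeholders, not measurements for this card.

```python
def transfer_fraction(n: int, compute_gflops: float, bandwidth_gbs: float) -> float:
    """Share of total time spent copying an n x n double matrix to and from the GPU.

    Assumes one host-to-device copy of A and one device-to-host copy of the LU factors,
    8 bytes per element, and a sustained bandwidth of `bandwidth_gbs` GB/s (1 GB = 1e9 bytes).
    """
    bytes_moved = 2 * 8 * n * n
    t_transfer = bytes_moved / (bandwidth_gbs * 1e9)
    t_compute = (2.0 / 3.0) * n ** 3 / (compute_gflops * 1e9)
    return t_transfer / (t_transfer + t_compute)


if __name__ == "__main__":
    for n in (3000, 6000):
        # 40 GFLOPS and 5 GB/s are illustrative assumptions.
        print(n, f"{transfer_fraction(n, 40.0, 5.0):.1%}")
```

Because the copy grows as N^2 while the work grows as N^3, the transfer share shrinks as N increases; real overheads also include allocation and depend on the hardware.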

Cheers
jgpallero
 

Re: Rpeak of NVIDIA GPUs

Postby coruun » Mon May 13, 2013 12:21 am

The German Wikipedia gives a value of 57.6 GFLOPS (double precision) for the GTX 550 Ti.

This is the single-precision GFLOPS divided by 12.
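One plausible origin of the divisor 12 is the per-multiprocessor arithmetic throughput NVIDIA lists for compute capability 2.1 devices such as the GF116 chip in the GTX 550 Ti: 48 single-precision versus 4 double-precision fused multiply-adds per clock per multiprocessor. The sketch below assumes those figures, 4 multiprocessors, a 1.8 GHz shader clock, and counts each FMA as two FLOPs; please verify them against the CUDA C Programming Guide for your card.

```python
# Assumed figures for a GTX 550 Ti (GF116, compute capability 2.1); verify for your card.
SM_COUNT = 4                  # multiprocessors (4 x 48 = 192 CUDA cores)
SHADER_CLOCK_GHZ = 1.8        # Fermi shader ("hot") clock
SP_FMA_PER_CLOCK_PER_SM = 48  # assumed single-precision FMA throughput
DP_FMA_PER_CLOCK_PER_SM = 4   # assumed double-precision FMA throughput
FLOPS_PER_FMA = 2             # one multiply plus one add


def peak_gflops(fma_per_clock_per_sm: int) -> float:
    """Theoretical peak in GFLOPS from the per-SM FMA throughput."""
    return SM_COUNT * fma_per_clock_per_sm * FLOPS_PER_FMA * SHADER_CLOCK_GHZ


if __name__ == "__main__":
    sp = peak_gflops(SP_FMA_PER_CLOCK_PER_SM)
    dp = peak_gflops(DP_FMA_PER_CLOCK_PER_SM)
    print(f"SP {sp:.1f} GFLOPS, DP {dp:.1f} GFLOPS, SP/DP = {sp / dp:.0f}")
```

Under these assumptions both results match the figures quoted in the thread (691.2 and 57.6 GFLOPS), and the ratio 48/4 gives the divisor of 12.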
coruun
 
Posts: 5
Joined: Wed Mar 27, 2013 8:17 am

Re: Rpeak of NVIDIA GPUs

Postby jgpallero » Mon May 13, 2013 3:33 am

coruun wrote: German wikipedia gives a value of 57.6 GFLOPS for GTX 550 Ti.

This is GFLOPS(SP)/12.


Very interesting! Thanks for the link :)

Hmm, I suppose I obtain an Rmax of only about 40 GFLOPS because I use the CULA interface that handles GPU resource allocation internally, so part of the measured time is actually spent allocating and deallocating memory.

But why a divisor of 12?

Cheers
jgpallero
 

Re: Rpeak of NVIDIA GPUs

Postby john » Mon May 13, 2013 6:28 am

These utilization numbers (~50%) are typical for GPUs. In case you're not familiar with LAPACK algorithms: they work down the diagonal of the matrix, and at each step the problem size decreases. Eventually the remaining problem becomes too small for good performance on the GPU, i.e., there is too little work to keep all the cores busy. The CPU has far fewer compute elements, so it can sustain near-peak performance even on the small subproblems. The CPU also doesn't have to deal with data transfers.
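The shrinking-subproblem effect can be illustrated by profiling a right-looking blocked LU, where each panel step ends with a matrix-multiply update of the remaining trailing submatrix. A minimal sketch, assuming a block size of 64 and treating trailing matrices smaller than 1000 as "small"; both values are illustrative, not CULA's actual tuning parameters, and the panel factorization itself is ignored.

```python
def trailing_update_profile(n: int, nb: int):
    """List (trailing_dim, gemm_flops) for each panel step of a right-looking blocked LU.

    After factoring the panel starting at column k (width b), the trailing (m x m) block
    is updated with an (m x b) by (b x m) matrix product costing 2*m*m*b FLOPs.
    """
    steps = []
    for k in range(0, n, nb):
        b = min(nb, n - k)  # width of this panel (the last one may be narrower)
        m = n - k - b       # order of the trailing submatrix still to be updated
        steps.append((m, 2 * m * m * b))
    return steps


if __name__ == "__main__":
    n, nb, small = 3000, 64, 1000
    steps = trailing_update_profile(n, nb)
    total = sum(f for _, f in steps)
    small_steps = [(m, f) for m, f in steps if m < small]
    print(f"{len(small_steps)} of {len(steps)} steps have a trailing matrix below {small}, "
          f"but they hold only {sum(f for _, f in small_steps) / total:.1%} of the update FLOPs")
```

For N=3000 about a third of the steps work on small trailing matrices while carrying under 4% of the update FLOPs; on a GPU those late steps tend to run far below peak, which pulls the average rate down.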

As for your question about the impact of transfers, we typically find that they cost about 10%, though this depends on your particular hardware and on how long the problem takes to run.

I should also note that CULA is performance-tuned primarily for the current GPU generation (your 500 series is the previous generation) and for the largest GPUs (the 550 Ti has relatively few cores).
john

Re: Rpeak of NVIDIA GPUs

Postby jgpallero » Mon Jan 06, 2014 11:04 am

Hello,

After several months, I'm back to GPU computing, and I again have some questions about computing the theoretical peak. At http://en.wikipedia.org/wiki/Comparison ... 500_Series it says that the 691.2 GFLOPS theoretical peak for my GTX 550 Ti is an FMA peak, i.e., it counts FMA (c = a*b + c) operations. Since a FLOP is either a multiplication or an addition, should this 691.2 be multiplied by two to obtain Rpeak in FLOPS? The linked page says that each unit can perform 2 FMA per cycle, i.e., 4 FLOPs per cycle. Then, as the GTX 550 Ti has 192 CUDA cores, the theoretical single-precision peak should be 192*4*1.8 GHz = 1382.4 GFLOPS, and the double-precision peak 16*4*1.8 GHz = 115.2 GFLOPS, since the DP/SP ratio for this GPU is 1/12 (192/12 = 16). Am I right?
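The question in this post can be checked by working backward from the published figure: dividing 691.2 GFLOPS by the core count and the clock gives the number of FLOPs per core per cycle that the figure already assumes. A minimal sketch, using only the 192 cores and 1.8 GHz quoted in the post.

```python
CUDA_CORES = 192
SHADER_CLOCK_GHZ = 1.8
PUBLISHED_SP_PEAK_GFLOPS = 691.2

# FLOPs per core per cycle implied by the published single-precision peak.
implied = PUBLISHED_SP_PEAK_GFLOPS / (CUDA_CORES * SHADER_CLOCK_GHZ)
print(f"implied FLOPs per core per cycle: {implied:.1f}")  # prints 2.0

# Peak if each core completes one FMA per cycle, counted as 2 FLOPs (multiply + add).
print(f"1 FMA/cycle (2 FLOPs): {CUDA_CORES * 2 * SHADER_CLOCK_GHZ:.1f} GFLOPS")
# Peak under the post's assumption of 4 FLOPs per cycle.
print(f"4 FLOPs/cycle: {CUDA_CORES * 4 * SHADER_CLOCK_GHZ:.1f} GFLOPS")
```

An implied value of 2 suggests that the published 691.2 GFLOPS already counts each FMA as two FLOPs (one FMA per core per cycle), so doubling it again would count the multiply-add twice.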

As I posted previously, I obtain a double-precision performance of about 40 GFLOPS (counting 1 FLOP as 1 multiplication or 1 addition) using culaDgetrf with 3000x3000 matrices and NRHS=1. To compute the performance I used an operation count for dgetrf of (2/3)N^3. So the efficiency is 40/115.2*100 = 34.7%. The timer was started before the call to culaDgetrf(), so the internal copy of the data to the GPU is included in the total time used to compute the performance. I find this 34.7% a bit low, but I suppose that is because the GTX 550 Ti is a low-end GPU.

In summary, I'm terribly confused about how to compute the peak on a GPU. Can anyone help me, please?
jgpallero
 

Re: Rpeak of NVIDIA GPUs

Postby raima55 » Fri Oct 24, 2014 12:11 am

There are many good information and details so I hope you will read carefully and take more information by take a look here.
raima55
 
Posts: 1
Joined: Fri Oct 24, 2014 12:07 am

Re: Rpeak of NVIDIA GPUs

Postby jgpallero » Fri Oct 24, 2014 2:30 am

raima55 wrote: There are many good information and details so I hope you will read carefully and take more information by take a look here.


Hello:

Thank you for your answer. But where is "here"? If you meant a link, you forgot to paste it :)
jgpallero
 

