
Batch-level parallelism

Postby plaplaud » Tue Dec 14, 2010 6:52 am

This is a point that is discussed more and more on the NVIDIA forums: GPUs run great on large matrix problems, but not on many small matrices. Yesterday, while looking for information on this topic, I read a post on the NVIDIA forums from a CULA team member saying that this was something you were looking into.

For real-time applications, the argument that "if you have A LOT of small matrices, transfer time will make you lose the GPU speedup" doesn't hold. If the GPU computes while data is being transferred, the transfers only add a delay (a pipelined approach).
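
Here is a minimal sketch of that pipelined approach in plain CUDA (this is not CULA API; process_chunk is a hypothetical stand-in for the per-chunk math). Two streams double-buffer the chunks so uploads, compute, and downloads overlap; the host buffers are assumed to be pinned (cudaMallocHost), since the async copies cannot overlap otherwise:

[code]
#include <cuda_runtime.h>

// Hypothetical stand-in for the real per-chunk computation.
__global__ void process_chunk(const double* in, double* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0 * in[i];
}

// host_in/host_out must be pinned (cudaMallocHost) so the async
// copies can overlap with kernel execution.
void pipeline(const double* host_in, double* host_out,
              int num_chunks, int chunk_elems)
{
    cudaStream_t stream[2];
    double *d_in[2], *d_out[2];
    for (int i = 0; i < 2; ++i) {
        cudaStreamCreate(&stream[i]);
        cudaMalloc(&d_in[i],  chunk_elems * sizeof(double));
        cudaMalloc(&d_out[i], chunk_elems * sizeof(double));
    }

    for (int c = 0; c < num_chunks; ++c) {
        int s = c & 1;  // alternate buffer/stream pairs
        // Upload chunk c in one stream while the other stream is
        // still computing chunk c-1: the copy costs only latency.
        cudaMemcpyAsync(d_in[s], host_in + (size_t)c * chunk_elems,
                        chunk_elems * sizeof(double),
                        cudaMemcpyHostToDevice, stream[s]);
        int threads = 256;
        int blocks  = (chunk_elems + threads - 1) / threads;
        process_chunk<<<blocks, threads, 0, stream[s]>>>(d_in[s], d_out[s],
                                                         chunk_elems);
        cudaMemcpyAsync(host_out + (size_t)c * chunk_elems, d_out[s],
                        chunk_elems * sizeof(double),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();  // drain the pipeline

    for (int i = 0; i < 2; ++i) {
        cudaStreamDestroy(stream[i]);
        cudaFree(d_in[i]);
        cudaFree(d_out[i]);
    }
}
[/code]

Each chunk's result arrives one pipeline stage after its upload, which is exactly the added delay described above.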

For the problem I face, if CULA had such an option, I wouldn't hesitate over whether to use a CPU or a GPU for this real-time application.

So my question is very simple: are you working on something like that? If so, do you have an idea of when it would be available?

Cheers.
plaplaud
CULA Premium
 
Posts: 6
Joined: Mon Apr 05, 2010 11:22 pm

Re: Batch-level parallelism

Postby john » Wed Dec 15, 2010 2:29 pm

We completely understand the problem you are describing - I spent quite a few minutes during my talk at GTC 2010 discussing this very issue. We are actively working on a solution, and I would expect it to start rolling into CULA in the coming year.

What we found regarding transfer time was a little more nuanced. Basically, a host-based interface has a tremendously hard time competing, even with many thousands of matrices, but as you said that is primarily due to latency, and streaming can alleviate it to some degree. We are finding that there is a much stronger argument for a device-interface batch mode for CULA, and that is where our work is focused for the time being.
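
As a sketch of what such a device-interface batch call looks like, here is a batched inversion using the batched LU routines that CUBLAS itself later added (cublasZgetrfBatched / cublasZgetriBatched; these postdate this thread and are not CULA API). Everything, including the arrays of matrix pointers, stays on the device, so the host pays one launch per phase instead of one per matrix:

[code]
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cuComplex.h>

// d_Aptrs, d_Cptrs: device arrays of `batch` device pointers, each
// pointing to an n x n column-major cuDoubleComplex matrix on the GPU.
// After the call, the matrices behind d_Cptrs hold the inverses.
void invert_batch(cublasHandle_t handle,
                  cuDoubleComplex** d_Aptrs,
                  cuDoubleComplex** d_Cptrs,
                  int n, int batch)
{
    int *d_pivots, *d_info;
    cudaMalloc(&d_pivots, (size_t)n * batch * sizeof(int));
    cudaMalloc(&d_info,   (size_t)batch * sizeof(int));

    // One call LU-factors every matrix in the batch in place...
    cublasZgetrfBatched(handle, n, d_Aptrs, n, d_pivots, d_info, batch);
    // ...and one call computes all the inverses from the factors.
    cublasZgetriBatched(handle, n, d_Aptrs, n, d_pivots,
                        d_Cptrs, n, d_info, batch);

    cudaFree(d_pivots);
    cudaFree(d_info);
}
[/code]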

Can you tell me a little more about your needs? Matrix sizes? Which routine(s)?
john
Administrator
 
Posts: 587
Joined: Thu Jul 23, 2009 2:31 pm

Re: Batch-level parallelism

Postby plaplaud » Fri Dec 17, 2010 4:19 am

That's great news!

It's a real-time embedded app, so I can't throw something like 10 CPUs at it. What costs the most flops are the ~1000 matrix inversions or eigenvalue decompositions of size ~70x70. These matrices hold double-precision complex data. I have 5 to 10 milliseconds to do it, but adding latency via a pipelined architecture is not a problem. On input, I have ~12 MB of data every 10 milliseconds (= 1.2 GB/s, which shouldn't be a problem).

With the development of GPUs, I'm surprised that almost all GPU computing applications are huge-scale simulations rather than embedded real-time apps, given the flops per mm² that GPUs offer.

Cheers.
plaplaud
CULA Premium
 
Posts: 6
Joined: Mon Apr 05, 2010 11:22 pm

Re: Batch-level parallelism

Postby john » Mon Dec 20, 2010 1:13 pm

My guess is that performing 1000 70x70 matrix inversions should be possible within your stated time requirements.
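
As a rough sanity check of that guess, assuming LU-based inversion (getrf then getri, about $2n^3$ flops for a real matrix and roughly four times that in complex arithmetic):

$$1000 \times 8n^3 \,\Big|_{n=70} \approx 2.7 \times 10^9 \ \text{flops},$$

which over the 10 ms budget is roughly 270 Gflop/s sustained. That is a large but attainable fraction of a Fermi-era Tesla's double-precision peak, provided the 1000 matrices are processed in a handful of batched launches rather than 1000 separate calls.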
john
Administrator
 
Posts: 587
Joined: Thu Jul 23, 2009 2:31 pm

Re: Batch-level parallelism

Postby leejc » Wed May 18, 2011 11:22 am

I face a similar problem. I have an algorithm that needs many (N ~ a million) independent matrix inversions and eigensolves (of size MxM = 50x50) done concurrently. The algorithm, however, allows all intermediate variables to reside on the GPU (in fact, all the variables needed to generate these matrices are stored on the GPU), and only the final results, in the form of N 50-vectors, need to move in and out of the GPU per time step.

How should I call CULA's ZGESV and ZGEEV to do this effectively, and what is the expected speedup for a situation like this?
leejc
 
Posts: 3
Joined: Tue Jun 29, 2010 12:05 pm

Re: Batch-level parallelism

Postby kyle » Fri May 20, 2011 2:32 pm

The current incarnation of CULA doesn't support the batch-mode parallelism you are interested in. We are working on implementing these methods on a case-by-case basis, and eigenvalue routines are certainly among the more requested (and higher-priority) ones.

Do you have any more information about the matrices themselves? Are they symmetric? All the same size? What size is typical?
kyle
Administrator
 
Posts: 301
Joined: Fri Jun 12, 2009 7:47 pm

Re: Batch-level parallelism

Postby plaplaud » Tue Sep 13, 2011 1:19 am

While reading the CUBLAS guide, I noticed that it is possible to batch kernels using streams. I also saw a topic here asking whether it would be possible to use those streams in CULA, but no one answered.

What are CULA's current objectives for many small-size problems?
plaplaud
CULA Premium
 
Posts: 6
Joined: Mon Apr 05, 2010 11:22 pm

Re: Batch-level parallelism

Postby john » Tue Sep 13, 2011 7:29 am

Streams aren't a sensible sharing/batching interface for CULA, because CULA is higher level than CUBLAS. CULA functions consist of many kernels, CPU operations, memory transfers, allocations, etc. We can't readily accept user streams, because they would interfere with our internal streaming (there is no concept of nested streams, for instance).

That said, streams are only moderately successful at handling batch operations. Current hardware can only schedule 4 simultaneous streams, so if you issued 8 one-block kernels in 8 different streams, you would be disappointed in the performance. Streams seem better suited to filling in the tail end of kernels for a modest boost, and to hiding memory copies.
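
To make that concrete, here is a sketch in plain CUDA (not CULA API; tiny_solve is a hypothetical stand-in for one small per-matrix computation) of the one-small-kernel-per-stream pattern described above:

[code]
#include <cuda_runtime.h>

// Hypothetical stand-in for one small per-matrix computation.
__global__ void tiny_solve(double* data)
{
    data[threadIdx.x] *= 2.0;
}

// Round-robins one 1-block kernel per problem across a few streams.
// The hardware scheduler, not the stream count, decides how many of
// these tiny kernels actually run concurrently.
void launch_in_streams(double* d_work, int num_problems, int elems_per_problem)
{
    const int kStreams = 8;
    cudaStream_t streams[kStreams];
    for (int i = 0; i < kStreams; ++i)
        cudaStreamCreate(&streams[i]);

    for (int p = 0; p < num_problems; ++p)
        // Assumes elems_per_problem fits in one thread block (<= 1024).
        tiny_solve<<<1, elems_per_problem, 0, streams[p % kStreams]>>>(
            d_work + (size_t)p * elems_per_problem);

    cudaDeviceSynchronize();
    for (int i = 0; i < kStreams; ++i)
        cudaStreamDestroy(streams[i]);
}
[/code]

A single batched kernel that assigns, say, one block per matrix does the same work in one launch, avoiding both the per-launch overhead and the stream-scheduling limit; that is the device-interface batch mode discussed earlier in the thread.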

As Kyle noted above, our batching work is being pursued strictly on a case-by-case basis. What we need are details from users: number of simultaneous matrices, routines needed, data types, parameters, sizes of matrices, etc. It only takes the smallest variation in that list to require a whole new parallelization, though, so we pursue many of these through custom code for the customer. Get in touch with me via PM and we can discuss further.
john
Administrator
 
Posts: 587
Joined: Thu Jul 23, 2009 2:31 pm

