Batch-level parallelism
Batch-level parallelism
This is a point that comes up more and more on the NVIDIA forums: GPUs run great on large matrix problems, but not on many small matrices. Yesterday I was looking for information on this topic and read a post on the NVIDIA forums from a CULA team member saying that this was something you were looking into.
For real-time applications, the argument that says "if you have A LOT of small matrices, transfer time will make you lose the GPU speedup" doesn't stand. If the GPU computes while transferring data, the transfers only add a fixed delay (pipeline method).
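To make the pipeline point concrete, here is a minimal CUDA sketch of what I mean (process_chunk is a hypothetical stand-in for the real per-chunk work, not a CULA routine): the copy for chunk c overlaps the compute for chunk c-1, so transfer time shows up as a fixed delay instead of lost throughput.

```cuda
/* Minimal pipeline sketch; process_chunk is a hypothetical stand-in. */
#include <cuda_runtime.h>

#define NCHUNKS 8
#define CHUNK_N (1 << 20)                 /* elements per chunk */

__global__ void process_chunk(float* d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;              /* placeholder compute */
}

int main(void)
{
    float *h, *d;
    cudaStream_t s[2];

    cudaMallocHost(&h, NCHUNKS * CHUNK_N * sizeof(float)); /* pinned: required for async copies */
    cudaMalloc(&d, NCHUNKS * CHUNK_N * sizeof(float));
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < NCHUNKS; ++c) {
        cudaStream_t str = s[c % 2];
        float* hp = h + (size_t)c * CHUNK_N;
        float* dp = d + (size_t)c * CHUNK_N;

        /* chunk c's transfer overlaps chunk c-1's compute, so the
           transfer adds one chunk of latency, not lost throughput */
        cudaMemcpyAsync(dp, hp, CHUNK_N * sizeof(float), cudaMemcpyHostToDevice, str);
        process_chunk<<<(CHUNK_N + 255) / 256, 256, 0, str>>>(dp, CHUNK_N);
        cudaMemcpyAsync(hp, dp, CHUNK_N * sizeof(float), cudaMemcpyDeviceToHost, str);
    }

    cudaDeviceSynchronize();
    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```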
For the problem I face, if CULA had such an option, I wouldn't hesitate over whether to use a CPU or a GPU for this real-time application.
So my question is very simple: are you working on something like that? If yes, do you have an idea of when it would be available?
Cheers.
- plaplaud
- CULA Premium
- Posts: 6
- Joined: Mon Apr 05, 2010 11:22 pm
Re: Batch-level parallelism
We completely understand the problem you are describing - I actually spent quite a few minutes during my talk at GTC2010 discussing this very issue. We are actively working on a solution to this problem presently, and I would expect to start seeing it rolled into CULA in the coming year.
What we found regarding transfer time was a little more nuanced. A host-based interface has a tremendously hard time competing, even with many thousands of matrices, and as you said that is primarily due to latency; streaming can alleviate it to some degree. We are finding that there is a much stronger argument to be made for a device-interface batch mode in CULA, and that is where our work is focused for the time being.
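To sketch what I mean by a device interface (an illustration of the data layout only, not an actual CULA API): the matrices stay packed contiguously in device memory and a single launch covers the whole batch, for example one thread block per matrix.

```cuda
/* Layout illustration only, not a CULA interface: N small n x n
   matrices packed contiguously in device memory, handled by one
   launch with one thread block per matrix. */
__global__ void batched_op(double2* mats, int n)
{
    double2* A = mats + (size_t)blockIdx.x * n * n;  /* this block's matrix */
    for (int i = threadIdx.x; i < n; i += blockDim.x) {
        A[i * n + i].x *= 2.0;                       /* trivial stand-in work */
        A[i * n + i].y *= 2.0;
    }
}

/* one launch processes the whole batch, with no per-matrix host
   round trips: batched_op<<<num_matrices, 64>>>(d_matrices, 70); */
```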
Can you tell me a little more about your needs? Matrix sizes? Which routine(s)?
- john
- Administrator
- Posts: 587
- Joined: Thu Jul 23, 2009 2:31 pm
Re: Batch-level parallelism
That's great news!
It's a real-time embedded app, so I can't just throw 10 CPUs at it. What costs the most FLOPs are the ~1000 matrix inversions or eigenvalue decompositions of size ~70x70. These matrices are double-precision complex data. I have 5 to 10 milliseconds to do it, but adding latency through a pipeline architecture is not a problem. On entry, I have ~12 MB of data every 10 milliseconds (= 1.2 GB/s, which shouldn't be a problem).
Given the development of GPUs, I'm surprised that almost all GPU computing applications are huge-scale simulations rather than embedded real-time apps, considering the FLOPS per mm² that GPUs offer.
Cheers.
- plaplaud
- CULA Premium
- Posts: 6
- Joined: Mon Apr 05, 2010 11:22 pm
Re: Batch-level parallelism
My guess is that solving 1000 70x70 matrix inversions should be possible within your stated time requirements.
- john
- Administrator
- Posts: 587
- Joined: Thu Jul 23, 2009 2:31 pm
Re: Batch-level parallelism
I face a similar problem. I have an algorithm that needs many (N ~ a million) independent matrix inversions and eigensolves (of size MxM = 50x50) done concurrently. The algorithm, however, allows all intermediate variables to reside on the GPU (in fact, all variables needed to generate these matrices are stored on the GPU), and only the final results, in the form of N 50-element vectors, need to go in and out of the GPU per time step.
How should I call the CULA ZGESV and ZGEEV routines to do this effectively, and what is the expected speedup for a situation like this?
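The only approach I can see right now is a plain loop over the device interface, something like the sketch below (just an illustration of the pattern; exact header names, types, and signatures per the CULA Reference Manual), which I assume badly underutilizes the GPU at 50x50:

```cuda
/* Naive sketch: one culaDeviceZgesv call per 50x50 system. Check the
   CULA Reference Manual for the exact header, types, and signature. */
#include <cula.h>

culaStatus solve_all(culaDeviceDoubleComplex* d_mats, /* N matrices, M*M each */
                     culaDeviceDoubleComplex* d_rhs,  /* N right-hand sides   */
                     culaInt* d_ipiv,                 /* N*M pivot workspace  */
                     int N, int M)
{
    for (int k = 0; k < N; ++k) {
        culaStatus st = culaDeviceZgesv(M, 1,
                                        d_mats + (size_t)k * M * M, M,
                                        d_ipiv + (size_t)k * M,
                                        d_rhs  + (size_t)k * M, M);
        if (st != culaNoError)
            return st;   /* every iteration is a separate GPU launch */
    }
    return culaNoError;
}
```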
- leejc
- Posts: 3
- Joined: Tue Jun 29, 2010 12:05 pm
Re: Batch-level parallelism
The current incarnation of CULA doesn't support the batch-mode parallelism you are interested in. We are working on implementing these methods on a case-by-case basis, and eigenvalue routines are certainly among the more requested (and higher priority) ones.
Do you have any more information about the matrices themselves? Are they symmetric? All the same size? What size is typical?
- kyle
- Administrator
- Posts: 301
- Joined: Fri Jun 12, 2009 7:47 pm
Re: Batch-level parallelism
While reading the CUBLAS guide, I noticed that it is possible to batch kernels using streams. I also saw a topic here asking whether it would be possible to use those streams in CULA, but no one answered.
What are CULA's current objectives for many small-size problems?
- plaplaud
- CULA Premium
- Posts: 6
- Joined: Mon Apr 05, 2010 11:22 pm
Re: Batch-level parallelism
Streams aren't a sensible sharing/batching interface for CULA, because CULA is higher level than CUBLAS. CULA functions consist of many kernels, CPU operations, memory transfers, allocations, etc. We can't readily accept user streams, because it would interfere with our internal streaming (there is no concept of nested streams, for instance).
That said, streams are only moderately successful in handling batch operations. Current hardware can only schedule 4 simultaneous streams, so if you issued 8x 1-block kernels in 8 different streams, you would find yourself disappointed in the perf. Streams seem better suited to filling in the tail end of kernels for a modest boost, and for hiding memory copies.
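For concreteness, this is the pattern I mean (tiny_kernel is a hypothetical stand-in):

```cuda
/* Eight 1-block kernels issued across eight streams. Hardware of this
   generation runs only a few streams concurrently, so this falls well
   short of an 8x speedup. */
#include <cuda_runtime.h>

__global__ void tiny_kernel(float* d, int n)
{
    int i = threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

void launch_eight(float* d_data, int n)   /* d_data holds 8*n floats */
{
    cudaStream_t s[8];
    for (int i = 0; i < 8; ++i) cudaStreamCreate(&s[i]);
    for (int i = 0; i < 8; ++i)
        tiny_kernel<<<1, 256, 0, s[i]>>>(d_data + i * n, n);  /* 1 block each */
    cudaDeviceSynchronize();
    for (int i = 0; i < 8; ++i) cudaStreamDestroy(s[i]);
}
```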
As Kyle noted above, our batching work is being pursued strictly on a case-by-case basis. What we need are details from users: number of simultaneous matrices, routines needed, data types, parameters, sizes of matrices, etc. It only takes the smallest variation in that list to require a whole new parallelization, though, so we pursue many of these through custom code for the customer. Get in touch with me via PM and we can discuss further.
- john
- Administrator
- Posts: 587
- Joined: Thu Jul 23, 2009 2:31 pm