I think this reply will only apply to Boxed Cylon.

Those two kernels you cited are used solely for the backsolve portion (i.e., after the LU factorization is complete), and each should be called a number of times roughly equal to the N parameter. Some of the other kernels on your list are ours as well (some of the gemms and some of the other trsm kernels). Since your N is small (100-400) and your number of TRSM calls is large, is this being run in a loop of some kind? Can you list the exact call parameters for the run shown in that screenshot?

If you are indeed calling in a loop, note that sgesv factors the matrix every time it is called. If the matrix A doesn't change between calls, you can factor it once via getrf before you enter the loop and then reuse the factored matrix to solve each system via getrs inside the loop.
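To make the factor-once, solve-many pattern concrete, here is a minimal sketch in plain Python. It is not CULA code: `lu_factor` and `lu_solve` are hypothetical stand-ins written for illustration, playing the roles of getrf and getrs respectively, and the 2x2 matrix and right-hand sides are made-up values.

```python
def lu_factor(a):
    """LU-factor a square matrix with partial pivoting (plays the role of getrf).

    Returns (lu, piv): lu stores L (unit diagonal, below the diagonal) and
    U (on and above the diagonal); piv[k] is the row swapped with row k.
    """
    n = len(a)
    lu = [row[:] for row in a]  # work on a copy, leave the input untouched
    piv = list(range(n))
    for k in range(n):
        # Choose the largest entry in column k at or below the diagonal.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        piv[k] = p
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]  # multiplier, stored in the L part
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, piv


def lu_solve(lu, piv, b):
    """Solve A x = b from an existing factorization (plays the role of getrs)."""
    n = len(lu)
    x = list(b)
    for k in range(n):  # apply the recorded row swaps to the right-hand side
        x[k], x[piv[k]] = x[piv[k]], x[k]
    for i in range(n):  # forward substitution with the unit-lower factor L
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):  # back substitution with U
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


# Illustrative data only (assumed values, not from the original post).
a = [[4.0, 3.0], [6.0, 3.0]]
rhs_list = [[10.0, 12.0], [7.0, 9.0]]

lu, piv = lu_factor(a)  # O(n^3) work, done once outside the loop
solutions = [lu_solve(lu, piv, b) for b in rhs_list]  # O(n^2) work per solve
```

The point of the split is that the cubic-cost factorization leaves the loop, so each iteration pays only for the two triangular solves.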

I will admit that we have not spent much time considering the case where NRHS >> N. I will look into that.

All that said, I don't find the numbers in your profiler results troubling. Those two TRSM kernels add up to only 20% of your runtime, and they represent the entire backsolve process after the LU factorization.
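One way to put that 20% in perspective is Amdahl's law, which bounds the overall gain from speeding up only one part of a run. The sketch below is illustrative arithmetic, not a measurement; the function name `max_overall_speedup` is made up for this example, and the only input taken from this thread is the 20% figure.

```python
def max_overall_speedup(fraction, part_speedup=float("inf")):
    """Amdahl's law: overall speedup when `fraction` of the runtime
    is made `part_speedup` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / part_speedup)


# Backsolve (20% of runtime) made infinitely fast: the upper bound.
print(max_overall_speedup(0.20))
# Backsolve made twice as fast: a more realistic case.
print(max_overall_speedup(0.20, 2.0))
```

Even if the backsolve cost nothing at all, the whole run could get at most about 1.25x faster, so the remaining 80% is where the larger gains would have to come from.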

Here are my benchmarks for sgesv, from our precompiled benchmark example:

`C:\Program Files\CULA\examples\benchmark>benchmark_.exe sgesv`

Initializing CULA...

Initializing MKL...

Benchmarking the following functions:

-------------------------------------

SGESV

-------------------------------------

-- SGESV Benchmark --

| Size | CULA (s) | MKL (s) | Speedup |
|------|----------|---------|---------|
| 4096 | 0.59 | 1.08 | 1.8226 |
| 5120 | 0.81 | 1.98 | 2.4532 |
| 6144 | 1.20 | 3.42 | 2.8499 |
| 7168 | 1.70 | 5.14 | 3.0336 |
| 8192 | 2.33 | 7.88 | 3.3864 |

C:\Program Files\CULA\examples\benchmark>
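If you want to compare your own run against these figures, a small sketch like the following recomputes the speedup as MKL time divided by CULA time. The timings are copied from the output above; the `speedup` helper is a hypothetical name for this example. Because the printed times are rounded to two decimals, the recomputed ratios will differ slightly from the Speedup column, which I assume was computed from the unrounded timings.

```python
# (size, CULA seconds, MKL seconds), copied from the benchmark output above.
rows = [
    (4096, 0.59, 1.08),
    (5120, 0.81, 1.98),
    (6144, 1.20, 3.42),
    (7168, 1.70, 5.14),
    (8192, 2.33, 7.88),
]


def speedup(baseline_s, accelerated_s):
    """Ratio of baseline time to accelerated time (greater than 1 means faster)."""
    return baseline_s / accelerated_s


for size, cula_s, mkl_s in rows:
    print(f"{size:6d}  {speedup(mkl_s, cula_s):6.2f}x")
```

Swap in the times from your own machine to see whether the trend (speedup growing with matrix size) holds there as well.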

(Do note that this is using CULA 1.2, which has a faster factorize and will be available shortly.)

Let's start there. Do these numbers mesh with what you get from the benchmark example on your machine? If they do, then we know the answer lies in the call parameters and/or the MATLAB integration. If they do not, then the answer could very well lie in your hardware or software setup.