21Jul/10Off

## Using CULA in MATLAB, Part 2

by Kyle

In part one of this three part series, we introduced a method using C++ templates to support all four major MATLAB data types. In today's part, we'll detail the specifics of how to integrate CULA's SVD algorithm into MATLAB. Finally, part three will give some tips on error checking, compilation, linking, usage, and benchmarking.

To recap, we left off from part one with a working code base that will enter a templated function dependent on the MATLAB data type.  From here, we'd like to write one templated function that can support single, double, complex single, and complex double datatypes. However, because CULA and MATLAB store complex data differently, we are going to have to create a number of helper functions to convert between the two types.

Internally, MATLAB stores it's complex data using a structure of arrays (SoA) format where the real data and complex data are stored in separate memory locations. In CULA, or any other LAPACK implementation, complex data is stored using an array of structures (AoS) format where the real and complex data are interleaved within a single large array. Therefore, in order to support complex data, we'll need to create two helper functions that convert from AoS to SoA and vice-versa. This unfortunately introduces the need for twice as much allocation when dealing with complex data.  However, MATLAB itself uses LAPACK under the hood and natively suffers from the same plight.

The following code sample illustrates, using C++ templates, how to perform this conversion. I'd recommend putting this function in a header such as 'culamex.hpp' as it will be needed for every function where you wish to support complex data.

```// culamex.hpp
// Common helper functions for integrating CULA into MATLAB

#ifndef __CULAMEX_HPP__
#define __CULAMEX_HPP__

// Used to find real type associated with complex type
template<class T> struct ToReal { typedef T type; };
template<>        struct ToReal<culaFloatComplex> { typedef float type; };
template<>        struct ToReal<culaDoubleComplex> { typedef double type; };

// Convert from MATLAB complex format to CULA complex format
template <typename FloatType>
void MatToCula(FloatType* buf, const mxArray* src)
{
typedef typename ToReal<FloatType>::type RealType;

const RealType* r0 = (const RealType*) mxGetPr(src);
const RealType* i0 = (const RealType*) mxGetPi(src);

int M = (int)mxGetM(src);
int N = (int)mxGetN(src);

for(int j = 0; j < N; ++j)
{
for(int i = 0; i < M; ++i)
{
int p  = j*M+i;
buf[p].x = r0[p];
buf[p].y = i0[p];
}
}
}

// Convert from CULA complex format to MATLAB complex format
template <typename FloatType>
void CulaToMat(mxArray* src, FloatType* buf)
{
typedef typename ToReal<FloatType>::type RealType;

RealType* r0 = (RealType*) mxGetPr(src);
RealType* i0 = (RealType*) mxGetPi(src);

int M = mxGetM(src);
int N = mxGetN(src);

for(int j = 0; j < N; ++j)
{
for(int i = 0; i < M; ++i)
{
int p = j*M+i;
r0[p] = buf[p].x;
i0[p] = buf[p].y;
}
}
}

// Do nothing for non-complex cases
template <> void MatToCula(float* buf, const mxArray* src) {}
template <> void MatToCula(double* buf, const mxArray* src) {}
template <> void CulaToMat(mxArray* src, float* buf) {}
template <> void CulaToMat(mxArray* src, double* buf) {}

#endif //__CULAMEX_HPP__
```

Now, that we have our generic helper functions, we can start writing some code that acts just like MATLAB's svd function.  Upon examining the documentation for MATLAB's svd and CULA's xGESVD function, you'll notice two major differences.

1. MATLAB returns the singular values in diagonal array whereas CULA returns a vector
2. MATLAB return V whereas CULA return V' (transposed)

In order to preserve MATLAB's interface, we'll have to do some extra house keeping to convert from the singular value vector into to a diagonal matrix and also perform a transpose (or for complex data, conjugate transpose) on VT.

```// culaSvd.cpp
// Implements CULA accelerated SVD to be called within MATLAB

#include <algorithm>

#include "mex.h"
#include "culapack.hpp"
#include "culamex.hpp"

using std::max;
using std::min;

// Complex conjugation of complex data
template<class T> void Conjugate(T* a) { /* Do nothing */ };
template<> void Conjugate(culaFloatComplex* a) { a->y = -(a->y); }
template<> void Conjugate(culaDoubleComplex* a) { a->y = -(a->y); }

template <typename T>
void mexCulaGesvd(int nlhs,           /* number of expected outputs */
mxArray* plhs[],        /* output pointer array */
int nrhs,               /* number of inputs */
const mxArray* prhs[],  /* input pointer array */
mxClassID id,
mxComplexity complexity)
{
// Initialize flags and types
typedef typename ToReal<T>::type RealType;
bool isReal = (complexity == mxREAL);
bool isComplex = (complexity == mxCOMPLEX);

// Initialize sizes
int M = (int) mxGetM(prhs[0]);
int N = (int) mxGetN(prhs[0]);
int K = min(M,N);
int L = max(M,N);

// Allocate a temporary to not destroy input data
T* A = (T*) mxMalloc( M * N * sizeof(T) );

if (isReal)
{
// Copy input data directly into temporary
memcpy( A, mxGetPr( prhs[0] ), M * N * sizeof(T) );
}
else if (isComplex)
{
// If complex, convert from MATLAB format into CULA format
MatToCula( A, prhs[0] );
}

// Create MATLAB output matrices
plhs[0] = mxCreateNumericMatrix(M, M, id, complexity);  // U (M x M)
plhs[1] = mxCreateNumericMatrix(M, N, id, 0);           // S (M x N, Real)
plhs[2] = mxCreateNumericMatrix(N, N, id, complexity);  // V (N x N)

// Allocate CULA intermediate
RealType* SVEC = (RealType*) mxMalloc( K * sizeof(RealType) );

// CULA Memory Pointers
T* U;
T* VT;

if (isReal)
{
// Get CULA memory pointers from allocated MATLAB matrices
U = (T*) mxGetPr( plhs[0] );
VT = (T*) mxGetPr( plhs[2] );
}
else if (isComplex)
{
// If complex, allocate an AoS complex buffer for CULA
U = (T*) mxMalloc( M * M * sizeof(T) );
VT = (T*) mxMalloc( N * N * sizeof(T) );
}

// Initialize CULA
culaInitialize();

// CULA SVD Factorization
culaGesvd('A', 'A', M, N, A, M, SVEC, U, M, VT, N);

// Shutdown CULA
culaShutdown();

// Get pointer to output matrix, S
RealType* S = (RealType*) mxGetPr( plhs[1] );

// Copy SVEC to diagonal of S
for (int i=0; i<K; i++)
S[i*M+i] = SVEC[i];

// Inplace transpose of VT
for (int i=0; i<N; i++)
{
for (int j=i; j<N; j++)
{
T temp = VT[j+i*N];
VT[j+i*N] = VT[i+j*N];
VT[i+j*N] = temp;
}
}

// If complex, conjugate VT
if (isComplex)
for (int i=0; i<N; i++)
for (int j=0; j<N; j++)
Conjugate( &VT[j+i*N] );

if (isComplex)
{
// If complex, convert from CULA format into MATLAB format
CulaToMat( plhs[0], U );
CulaToMat( plhs[2], VT );

// Free MATLAB buffers
mxFree(U);
mxFree(VT);
}

// Free allocate data
mxFree(A);
mxFree(SVEC);
}

// MATLAB Gateway Function
void mexFunction(int nlhs,              /* number of expected outputs */
mxArray* plhs[],       /* output pointer array */
int nrhs,              /* number of inputs */
const mxArray* prhs[]  /* input pointer array */ )
{
// See Part 1
// mexCulaGesvd(...)
}
```

When combined with the gateway function introduced in part one, you should now have enough code to compile a working CULA accelerated, MATLAB callable SVD routine that support all four major datatypes. In the next part of this series, we'll discuss how to error check, compile, and use this routine in MATLAB.

While I've tried to keep the comments in the code quite verbose, if you have any questions regarding this code please ask in the comments section or in our user forums.