================================================================================
CULA R11 (CUDA 3.2) Release Notes
EM Photonics, Inc.
================================================================================

--------------------------------------------------------------------------------
Installation Instructions
--------------------------------------------------------------------------------

For installation instructions, please consult the CULAProgrammersGuide.pdf file
included in the 'doc/' folder of your CULA distribution.

--------------------------------------------------------------------------------
System Requirements
--------------------------------------------------------------------------------

CULA requires that your system be equipped with an NVIDIA CUDA-compatible
device in order to run CULA-enabled programs. The NVIDIA drivers must be
version 263.06 (or greater) for Windows systems and 260.19.26 (or greater) for
Linux systems. Mac OS X systems must have 3.2.17 or newer.

If you wish to use the CULA "Device" interface, you should install the CUDA 3.2
toolkit.

--------------------------------------------------------------------------------
Supported Operating Systems
--------------------------------------------------------------------------------

All systems feature 32-bit and 64-bit support.
  * Windows XP / Vista / 7
  * Ubuntu Linux 9.04 (and newer)
  * Red Hat Enterprise Linux 4.8 / 5.3
  * Fedora 11
  * Mac OS X 10.5 Leopard / 10.6 Snow Leopard

--------------------------------------------------------------------------------
Revision History
--------------------------------------------------------------------------------

  CULA R11 (March 31, 2011)
  CULA R10 (December 10, 2010)
  CULA 2.1 (August 31, 2010)
  CULA 2.0 (June 28, 2010)
  CULA 2.0 Preview (May 21, 2010)
  CULA 1.3a (April 19, 2010)
  CULA 1.3 (April 8, 2010)
  CULA 1.2 (February 17, 2010)
  CULA 1.1b (January 6, 2010)
  CULA 1.1a (December 21, 2009)
  CULA 1.1 (November 25, 2009)
  CULA 1.1 Beta (November 13, 2009)
  CULA 1.0 (September 30, 2009)
  CULA 1.0 Beta 3 (September 15, 2009)
  CULA 1.0 Beta 2 (August 27, 2009)
  CULA 1.0 Beta 1 (August 12, 2009)

--------------------------------------------------------------------------------
Changelog
--------------------------------------------------------------------------------

Release R11 CUDA 3.2 (March 31, 2011)
-------------------------------------
Premium
  * Feature: Implemented symv (symmetric matrix vector product)
  * Feature: Implemented hemv (hermitian matrix vector product)
  * Feature: Implemented geConjugate (conjugate general matrix)
  * Feature: Implemented trConjugate (conjugate triangular matrix)
  * Feature: Implemented geNancheck (check for NaNs in matrix)
  * Feature: Implemented geTranspose (transpose general matrix out of place)
  * Feature: Implemented geTransposeConjugate (transpose and conjugate out of
    place)
  * Feature: Implemented geTransposeInplace (transpose square matrix inplace)
  * Feature: Implemented geTransposeConjugateInplace (transpose and conjugate
    square matrix)
  * Feature: Implemented lacpy (copy matrix)
  * Feature: Implemented lag2 (convert precision)
  * Feature: Implemented lar2v (apply rotations)
  * Feature: Implemented larfb (apply reflector)
  * Feature: Implemented larfg (generate reflector)
  * Feature: Implemented largv (generate rotations)
  * Feature: Implemented lartv (apply rotations)
  * Feature: Implemented lascl (scale matrix)
  * Feature: Implemented laset (set matrix)
  * Feature: Implemented lasr (apply rotation)
  * Feature: Implemented lat2z (convert precision, triangular matrix)
  * Improved: Increased speed of syev/syevx significantly when finding
    eigenvectors
  * Improved: Increased speed of gebrd
  * Improved: Increased speed of gesvd when calculating singular values
All Versions
  * Fixed: Failure to initialize for Quadro X000 cards

Release R10 CUDA 3.2 (December 10, 2010)
----------------------------------------
Premium
  * Feature: Implemented pbtrf (positive definite banded matrix factorization)
  * Feature: Implemented gbtrf (banded matrix factorization)
All Versions
  * Feature: CUDA runtime upgraded to 3.2
  * Feature: Explicit support and tuning added for 500-series GPUs
  * Feature: Added BLAS interfaces
  * Feature: Implemented culaGetErrorInfoString to aid in error diagnosis
  * Feature: culatypes.h defines either CULA_BASIC or CULA_PREMIUM
  * Improved: All routines retuned for Fermi. Gains of up to 100% are available.
  * Improved: Multi-thread performance and stability
  * Improved: All examples have more descriptive error output

Release 2.1 Final (August 31, 2010)
-----------------------------------
Premium
  * Feature: Implemented orgrq
All Versions
  * Feature: Support for PGI CUDA Fortran Compiler (link -lcula_pgfortran)
  * Feature: OS X supports 64-bit
  * Fixed: More reliably detect and produce culaInsufficientRuntime condition
  * Fixed: culaShutdown is now safe in a multithreaded context

Release 2.0 Final (June 28, 2010)
---------------------------------
Premium
  * Fixed: Improved accuracy of geev for some specific matrices
All Versions
  * Feature: CUDA Runtime upgraded to CUDA 3.1
  * Fixed: Fortran interface properly accepts floating point constant arguments
  * Fixed: Properly detect if an insufficient runtime or driver is installed

Release 2.0 Preview (May 21, 2010)
----------------------------------
Premium
  * Feature: Implemented dsposv and zcposv (iteratively refined solvers)
  * Feature: Implemented complex versions of syev (symmetric eigenvalues)
  * Feature: Implemented complex versions of syevx (expert version of syev)
  * Feature: Implemented complex versions of syrdb (symmetric reduction to
    tridiagonal form)
  * Feature: Implemented complex versions of steqr (eigenvalues and vectors of
    a symmetric tridiagonal matrix)
  * Improved: Improved performance of syrdb by up to 30%
  * Improved: Improved performance of syev by up to 20%
  * Improved: Improved performance of potrf by up to 55%
  * Improved: Improved performance of most routines in D/Z precision by up to
    30%
All Versions
  * Feature: CUDA runtime upgraded to 3.1 Beta
  * Feature: Support for Fermi-class GPUs
  * Feature: Mac OS X version supports all complex and double precision
    functionality

Release 1.3a Final (April 19, 2010)
-----------------------------------
Premium
  * Fixed: Increased stability of syev, syevx, and stebz for certain types of
    matrices
  * Improved: Improved accuracy of stebz

Release 1.3 Final (April 8, 2010)
---------------------------------
Basic
  * Improved: Removed performance degradation for large NRHS in gesv
  * Improved: Increased performance of gesv by up to 45%
  * Fixed: gglse properly handles P=0 case
Premium
  * Feature: Benchmark example now supports double precision
  * Improved: Increased performance of getrs by up to 50%
  * Improved: Increased performance of posv by up to 45%
  * Improved: Increased performance of potrf by up to 10%
  * Improved: Increased performance of trtrs by up to 60%
All Versions
  * Improved: Mac OS X builds have install_name rpath set

Release 1.2 Final (February 17, 2010)
-------------------------------------
Premium
  * Feature: Implemented syev (symmetric eigenvalues)
  * Feature: Implemented syevx (expert version of syev)
  * Feature: Implemented syrdb (symmetric reduction to tridiagonal form)
  * Feature: Implemented stebz (calculate eigenvalues of a symmetric
    tridiagonal matrix)
  * Feature: Implemented steqr (eigenvalues and vectors of a symmetric
    tridiagonal matrix)
  * Feature: Implemented geqrs (system solve from QR data)
  * Feature: Implemented geqlf (QL factorize)
  * Feature: Implemented orgql/ungql (generate Q matrix from QL data)
  * Feature: Implemented ormql/unmql (multiply by Q matrix from QL data)
  * Feature: Implemented ds/zc gesv routine (iteratively refined gesv)
  * Feature: Implemented ggrqf (generalized RQ factorization of two matrices)
  * Improved: Increased performance of bdsqr by 10%
  * Improved: Increased performance of gebrd by up to 100%
All Versions
  * Improved: Increased performance of getrf by up to 50% for square matrices
  * Improved: Increased performance of getrf by up to 30% for non-square
    matrices
  * Improved: Increased performance of geqrf by up to 10%
  * Improved: gesvd produces significantly more accurate unitary matrices
  * Improved: gesvd/bdsqr memory requirements reduced significantly
  * Fixed: gesvd produces correct unitary matrices for all data inputs
  * Fixed: Fortran device interface is now functional
  * Fixed: getrf now continues to factorize after encountering a singularity

Release 1.1b Final (January 6, 2010)
------------------------------------
Premium
  * Fixed: Interface of getrs properly interprets the 'N' job code

Release 1.1a Final (December 21, 2009)
--------------------------------------
All Versions
  * Improved: Benchmark example easier to use and provides more user control
  * Improved: System info script (sysinfo.sh) now properly reports GPU on
    195-series driver
  * Improved: All error codes are thoroughly described in the Programmer's
    Guide
  * Fixed: OS X builds no longer reference unnecessary external libraries
  * Fixed: All routines properly accept job codes in both upper- and lower-case
  * Fixed: Potential infinite loop when allocating mixed-precision data
  * Fixed: Now reporting host out-of-memory condition as culaInsufficientMemory
  * Fixed: RHEL 4.7 builds include proper dependent libraries
Premium
  * Fixed: culaDeviceMalloc underallocates for non-float data types

Release 1.1 Final (November 25, 2009)
-------------------------------------
All Versions
  * Improved: Removed culaInitialize() delay
  * Improved: GEQRF performance improved by up to 20% for users with older CPUs
  * Fixed: Correction to Fortran example Makefile to specify "arch" parameter
Basic
  * Improved: SVD significant memory usage reduction
  * Improved: GELS stability for N > M case
Premium
  * Improved: GELQF performance increased 2-3x
  * Improved: GEHRD/GERQF/ORGLQ/ORGQR performance increased by 10-20%
  * Improved: GEHRD routine accurate for size N==1

Release 1.1 Beta (November 13, 2009)
------------------------------------
All Versions
  * Feature: Mac OS X 10.5 Leopard "preview" release - single precision only
  * Feature: New "Bridge" interface provides for easy and seamless porting of
    existing LAPACK/MKL/ACML applications (see doc/bridge_interface.txt)
  * Feature: New document describing full CULA API
  * Feature: New function culaSelectDevice to set executing device
  * Feature: New "gesv" example shows operation of all S/C/D/Z data types
  * Feature: New "multigpu" example showing multi-GPU CULA operation
  * Feature: New "bridge" example showing usage of the Bridge interface
  * Improved: SVD optimized for non-square cases
  * Improved: Documentation clarified on error conditions and codes
  * Improved: Stronger error reporting from example projects
  * Improved: culaInitialize detects and reports if driver/runtime versions are
    inadequate
  * Improved: Documentation clearer on thread safety issues
  * Fixed: CULA can now handle extremely non-square matrices (e.g. 500000x16)
  * Fixed: An error in the "benchmark" example causing it to ignore user
    arguments
  * Fixed: Properly reporting cudaErrorMemoryValueTooLarge as
    culaInsufficientMemory
Basic
  * Improved: GESV performance increased by up to 30%
  * Improved: Stability of GELS in certain cases
  * Improved: Stability of SVD in certain cases
Premium
  * Feature: Implemented geev (general Eigensolver) in S/D/C/Z precisions
  * Feature: Implemented gehrd (general Hessenberg reduction) in S/D/C/Z
    precisions
  * Feature: Implemented orghr
  * Feature: .hpp headers have name overloads of ORG/UNG functions
  * Fixed: Host interface "ORG" functions produced different results from
    device interface

Release 1.0 Final (September 30, 2009)
--------------------------------------
Basic
  * Feature: All functions feature complex variants
  * Fixed: Crash related to getrs pivot array
Premium
  * Feature: All functions implemented in all supported data types

Release 1.0 Beta 3 (September 15, 2009)
---------------------------------------
All Versions
  * Feature: New documentation section on specific routine conventions
  * Improved: Updated sysinfo script with more descriptive output
  * Improved: Added example that demonstrates the device interface
  * Fixed: Various corrections for small-matrix inputs, especially M=N=1
  * Fixed: culaInitialize now sets environment variable KMP_DUPLICATE_LIB_OK
Basic
  * Feature: Complex geqrf included
  * Feature: Added culaGetDeviceCount to report the number of available devices
  * Feature: Added culaGetDeviceInfo to report information about a device
  * Feature: Added culaGetExecutingDevice to report the executing device
  * Fixed: Further corrections for unitary output in gesvd for all job codes
Premium
  * Feature: New functions culaDeviceMalloc/culaDeviceFree in culadevice.h
  * Fixed: orglq and orgqr should behave more reliably

Release 1.0 Beta 2 (August 27, 2009)
------------------------------------
All Versions
  * Feature: Including both 32- and 64-bit libraries on 64-bit Linux release
  * Feature: Now shipping precompiled Benchmark example on Linux builds
  * Feature: Troubleshooting section added to Programmer's Guide
  * Feature: Added scripts that report system information to `examples` folder
  * Improved: Error output for examples is now more descriptive
  * Improved: Documentation is more specific about configuring system runtime
  * Fixed: Incompatibilities with gcc 4.2 and earlier; gcc 4.1 is now
    compatible
Basic
  * Improved: gesvd was optimized for up to a 60% speedup over Beta 1
  * Fixed: Error in geqrf for matrices of M << N
  * Fixed: Error in gesvd where some matrices would yield non-unitary U and Vt
Premium
  * Feature: Implemented getri
  * Feature: Implemented potrf
  * Feature: Implemented potrs
  * Feature: Implemented posv
  * Feature: Implemented trtrs
  * Improved: orglq was optimized for up to a 700% speedup

Release 1.0 Beta 1 (August 13, 2009)
------------------------------------
All Versions
  * Feature: Support Windows XP 32/64
  * Feature: Support Linux 32/64
Basic
  * Feature: Implemented gels
  * Feature: Implemented geqrf
  * Feature: Implemented gesv
  * Feature: Implemented gesvd
  * Feature: Implemented getrf
  * Feature: Implemented gglse
Premium
  * Feature: Implemented gebrd
  * Feature: Implemented getrs
  * Feature: Implemented trtrs
  * Feature: Implemented gelqf
  * Feature: Implemented gerqf
  * Feature: Implemented orgqr
  * Feature: Implemented orglq
  * Feature: Implemented orgbr
  * Feature: Implemented ormqr
  * Feature: Implemented ormlq
  * Feature: Implemented ormrq
  * Feature: Implemented bdsqr

--------------------------------------------------------------------------------
More Information
--------------------------------------------------------------------------------

For more information on the CULAtools family of products, please visit our
webpage at http://www.culatools.com

To provide feedback, please visit http://www.culatools.com/forums and post in
the appropriate forum topic.