Question: LAPACK-LITE consequences for Numpy speed (win32 platfor m)

Apologies for asking an overly vague question like this but: on an Intel/win32 platform (I only have a windows version of Gauss), I am comparing Numpy matrix inversion with that of Gauss (very much the same type of software as Matlab which at least some of you used to work with). As the size of the matrix to be inverted increases, the speed of Numpy appears to asymptote (on my machine) to about half that of Gauss. For small matrices, it is much worse than that because of the various overheads that are dealt with by Numpy. Would this speed differential be largely eliminated if I was not using LAPACK-LITE? If so I will try to figure my way through hooking into Intel's MKL - anyone got hints on doing this - I saw mention of it in the mailing list archives. Would I be better off, speed-wise, eschewing win32 altogether and using native LAPACK and BLAS libraries on my Linux box? This is relevant to me in the context of a multivariate Kalman filtering module that I am working up to replace one I have been using on the Gauss platform for years. The Numpy version of my filter has a very similar logic structure to that of Gauss but is drastically slower. I have only been working with Numpy for a month or so which may mean that my code is relatively innefficient. I have been assuming that Numpy - as an interpreted language is mainly slowed by looping structures. Thanks in advance, Geoff Shuetrim ____________________________________________________________________________ ________ A simple version of the filter is given below: (Note that I have modified Matrix.py in my installation to include a transpose method for the Matrix class, T()). # *************************************************************** # kalman.py module by Geoff Shuetrim # # Please note - this code is thoroughly untested at this stage # # You may copy and use this module as you see fit with no # guarantee implied provided you keep this notice in all copies. # *************************************************************** # Minimization routines """kalman.py Routines to implement Kalman filtering and smoothing for multivariate linear state space representations of time-series models. Notation follows that contained in Harvey, A.C. (1989) "Forecasting Structural Time Series Models and the Kalman Filter". Filter --- Filter - condition on data to date """ # Import the necessary modules to use NumPy import math from Numeric import * from LinearAlgebra import * from RandomArray import * import MLab from Matrix import * # Initialise module constants Num = Numeric max = MLab.max min = MLab.min abs = Num.absolute __version__="0.0.1" # filtration constants _obs = 100 _k = 1 _n = 1 _s = 1 # Filtration global arrays _y = Matrix(cumsum(standard_normal((_obs,1,_n)))) _v = Matrix(zeros((_obs,1,_n),Float64)) _Z = Matrix(ones((_obs,_k,_n),Float64)) + 1.0 _d = Matrix(zeros((_obs,1,_n),Float64)) _H = Matrix(zeros((_obs,_n,_n),Float64)) + 1.0 _T = Matrix(zeros((_obs,_k,_k),Float64)) + 1.0 _c = Matrix(zeros((_obs,1,_k),Float64)) _R = Matrix(zeros((_obs,_k,_s),Float64)) + 1.0 _Q = Matrix(zeros((_obs,_s,_s),Float64)) + 1.0 _a = Matrix(zeros((_obs,1,_k),Float64)) _a0 = Matrix(zeros((_k,1),Float64)) _ap = _a _as = _a _P = Matrix(zeros((_obs,_k,_k),Float64)) _P0 = Matrix(zeros((_k,_k),Float64)) _Pp = _P _Ps = _P _LL = Matrix(zeros((_obs,1,1),Float64)) def Filter(): # Kalman filtering routine _ap[0] = _T[0] * _a0 + _c[0] _Pp[0] = _T[0] * _P0 * _T[0].T() + _R[0] * _Q[0] * _R[0].T() for t in range(1,_obs-1): _ap[t] = _T[t] * _a[t-1] + _c[t] _Pp[t] = _T[t] * _P0 * _T[t].T() + _R[t] * _Q[t] * _R[t].T() Ft = _Z[t] * _Pp[t] * _Z[t].T() + _H[t] Ft_inverse = inverse(Ft) _v[t] = _y[t] - _Z[t] * _ap[t] - _d[t] _a[t] = _ap[t] + _Pp[t] * _Z[t].T() * Ft_inverse * _v[t].T() _P[t] = _Pp[t] - _Pp[t].T() * _Z[t].T() * Ft_inverse * _Z[t] * _Pp[t] _LL[t] = -0.5 * (log(2*pi) + log(determinant(Ft)) + _v[t] * Ft_inverse * _v[t].T()) Filter() ____________________________________________________________________________ ________ ********************************************************************** " This email is intended only for the use of the individual or entity named above and may contain information that is confidential and privileged. If you are not the intended recipient, you are hereby notified that any dissemination, distribution or copying of this Email is strictly prohibited. When addressed to our clients, any opinions or advice contained in this Email are subject to the terms and conditions expressed in the governing KPMG client engagement letter. If you have received this Email in error, please notify us immediately by return email or telephone +61 2 93357000 and destroy the original message. Thank You. " **********************************************************************...

First, have you tested the execution time of inverse() alone in NumPy and Gauss, or only with the appended function. There are some ways to speed up the function without using a specialized library. The function also has a possible error. (see comments in the function). In general the penality for array creation (memory allocation) is rather big in Numpy, especially if you use long expressions, as the result of every term is stored in a newly created array. To speed up these kinds of operations use the inplace operations with some workspace arrays, like:
# Translating _ap[0] = _T[0] * _a0 + _c[0] _ap[0]=add(multiply(_T[0],_a0,_ap[0]),_c[0],_ap[0])
Then it should be possible to code this without a loop at all. This would also speed up the function. HTH, __Janko Shuetrim, Geoff writes:

I use Windows NT/2000 (at gunpoint), and I've gotten much better performance than NumPy's Lite code by using the Intel MKL. I keep meaning to post these results to an external HP Labs web page, but we're currently engaged in an internal snit about what those pages are allowed to look like. Essentially, with NumPy 17 and Python 1.5, I've been able to: (1) Replace NumPy BLAS with Intel's MKL BLAS, after a number of routine renaming edits in the NumPy code. (Apparently "cblas" is an irresistible prefix.) This turns out to be pretty easy. (2) Replace portions of LAPACK Lite with the corresponding routines from Intel's MKL. I've been hacking away at this on a piecemeal basis, because the MKL uses Fortran calling conventions and array-ordering. In most of the MKL routines, the usual LAPACK flags for transposition are available to compensate. (3) Add new LAPACK functionality using LAPACK routines from the MKL that are not included in LAPACK Lite. I've been meaning to pick this project up again, since I want to use the Intel FFT routines in some new code that I'm writing. The chief benefit of the MKL libraries is the ability to easily use multiple processors and threads simply by setting a couple of environment variables. Until recently (when MATLAB finally started using LAPACK), I was able to gain significantly better performance than MATLAB using the Intel BLAS/LAPACK, even without multiple processors. Now the Intel libraries are only about 10-25% faster, if my watch can be believed. I have _no idea_ how the Intel MKL performs relative to the native Linux BLAS/LAPACK distributions, but I know that (for example) the MKL is about 50% faster than the linear algebra routines shipped by NAG. Although I haven't figured out the standard NumPy distribution process yet (I still use that pesky Visual Studio IDE), I'll elevate the development of a general-purpose integration of the Intel MKL into NumPy to an Official New Year's Resolution. I made my last ONYR about ten years ago, and was able to give up channel-surfing using the TV remote control. So, I have demonstrated an ability to deliver in this department; the realization that we've just started a new millennium only adds to the pressure. Presumably, when I get this finished (I'll target the end of the month), I'll be able to find a place to post it. ============================ Ray Beausoleil Hewlett-Packard Laboratories mailto:beausol@hpl.hp.com 425-883-6648 Office 425-957-4951 Telnet 425-941-2566 Mobile ============================ At 05:56 PM 1/2/2001 +1100, Shuetrim, Geoff wrote:
============================ Ray Beausoleil Hewlett-Packard Laboratories mailto:beausol@hpl.hp.com 425-883-6648 Office 425-957-4951 Telnet 425-941-2566 Mobile ============================

First, have you tested the execution time of inverse() alone in NumPy and Gauss, or only with the appended function. There are some ways to speed up the function without using a specialized library. The function also has a possible error. (see comments in the function). In general the penality for array creation (memory allocation) is rather big in Numpy, especially if you use long expressions, as the result of every term is stored in a newly created array. To speed up these kinds of operations use the inplace operations with some workspace arrays, like:
# Translating _ap[0] = _T[0] * _a0 + _c[0] _ap[0]=add(multiply(_T[0],_a0,_ap[0]),_c[0],_ap[0])
Then it should be possible to code this without a loop at all. This would also speed up the function. HTH, __Janko Shuetrim, Geoff writes:

I use Windows NT/2000 (at gunpoint), and I've gotten much better performance than NumPy's Lite code by using the Intel MKL. I keep meaning to post these results to an external HP Labs web page, but we're currently engaged in an internal snit about what those pages are allowed to look like. Essentially, with NumPy 17 and Python 1.5, I've been able to: (1) Replace NumPy BLAS with Intel's MKL BLAS, after a number of routine renaming edits in the NumPy code. (Apparently "cblas" is an irresistible prefix.) This turns out to be pretty easy. (2) Replace portions of LAPACK Lite with the corresponding routines from Intel's MKL. I've been hacking away at this on a piecemeal basis, because the MKL uses Fortran calling conventions and array-ordering. In most of the MKL routines, the usual LAPACK flags for transposition are available to compensate. (3) Add new LAPACK functionality using LAPACK routines from the MKL that are not included in LAPACK Lite. I've been meaning to pick this project up again, since I want to use the Intel FFT routines in some new code that I'm writing. The chief benefit of the MKL libraries is the ability to easily use multiple processors and threads simply by setting a couple of environment variables. Until recently (when MATLAB finally started using LAPACK), I was able to gain significantly better performance than MATLAB using the Intel BLAS/LAPACK, even without multiple processors. Now the Intel libraries are only about 10-25% faster, if my watch can be believed. I have _no idea_ how the Intel MKL performs relative to the native Linux BLAS/LAPACK distributions, but I know that (for example) the MKL is about 50% faster than the linear algebra routines shipped by NAG. Although I haven't figured out the standard NumPy distribution process yet (I still use that pesky Visual Studio IDE), I'll elevate the development of a general-purpose integration of the Intel MKL into NumPy to an Official New Year's Resolution. I made my last ONYR about ten years ago, and was able to give up channel-surfing using the TV remote control. So, I have demonstrated an ability to deliver in this department; the realization that we've just started a new millennium only adds to the pressure. Presumably, when I get this finished (I'll target the end of the month), I'll be able to find a place to post it. ============================ Ray Beausoleil Hewlett-Packard Laboratories mailto:beausol@hpl.hp.com 425-883-6648 Office 425-957-4951 Telnet 425-941-2566 Mobile ============================ At 05:56 PM 1/2/2001 +1100, Shuetrim, Geoff wrote:
============================ Ray Beausoleil Hewlett-Packard Laboratories mailto:beausol@hpl.hp.com 425-883-6648 Office 425-957-4951 Telnet 425-941-2566 Mobile ============================
participants (3)
-
Janko Hauser
-
Raymond Beausoleil
-
Shuetrim, Geoff