Hello,

I am new to Python (coming from R and Matlab/Octave). I was preparing to write my usual code for computing the pdf of a really high-dimensional (e.g. 10000 dimensions) Gaussian in Python, but I noticed that numpy has a function for computing the log determinant in these situations. Is there a function for performing the inverse, or even the pdf of a multinomial normal, in these situations as well?

Thank you. Numpy and scipy look pretty awesome.

- dan
On 2010-08-30, at 11:28 AM, Daniel Elliott wrote:
Hello,
I am new to Python (coming from R and Matlab/Octave). I was preparing to write my usual code for computing the pdf of a really high-dimensional (e.g. 10000 dimensions) Gaussian in Python, but I noticed that numpy has a function for computing the log determinant in these situations.
Yep. Keep in mind this is a fairly recent addition, in 1.5 I think, so if you ship code make sure to list this dependency.
Is there a function for performing the inverse or even the pdf of a multinomial normal in these situations as well?
There's a function for the inverse, but you almost never want to use it, especially if your goal is the multivariate normal density. A basic explanation of why is available here: http://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/

In the case of the multivariate normal density the covariance is assumed to be positive definite, and thus a Cholesky decomposition is appropriate. scipy.linalg.solve() (NOT numpy.linalg.solve()) with the sym_pos=True argument will do this for you.

What do you mean by a "multinomial normal"? Do you mean multivariate normal? Offhand I can't remember if it exists in scipy.stats, but I know there's code for it in PyMC.

David
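(For concreteness, a minimal sketch of the computation David describes, assuming a full-rank p x p covariance S; the function name here is illustrative, not something from numpy or scipy:)

import numpy as np
from scipy import linalg

def mvn_logpdf(x, mean, S):
    # Evaluate log N(x; mean, S) without ever forming inv(S).
    d = x - mean
    # Cholesky-based solve of S z = d; newer SciPy spells this assume_a='pos'.
    z = linalg.solve(S, d, sym_pos=True)
    # log|S| from a Cholesky factor (np.linalg.slogdet would also work here).
    L = linalg.cholesky(S, lower=True)
    logdet = 2.0 * np.log(np.diag(L)).sum()
    p = d.shape[0]
    return -0.5 * (p * np.log(2.0 * np.pi) + logdet + np.dot(d, z))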
Hi,

I've been lurking for a while here but never really introduced myself. I'm a mathematician in Brazil working with optimization and numerical analysis and I'm looking into scipy/numpy basically because I want to ditch matlab.

I'm just curious as to why you say "scipy.linalg.solve(), NOT numpy.linalg.solve()". Can you explain the reason for this? I find myself looking for information such as this on the internet but I rarely find real documentation for these things, and I seem to have so many performance issues with python... I'm curious to see what I'm missing here.

Thanks, and sorry if I hijacked the thread,

- Melissa

On Mon, Aug 30, 2010 at 3:59 PM, David Warde-Farley <dwf@cs.toronto.edu> wrote:
On 2010-08-30, at 11:28 AM, Daniel Elliott wrote:
Hello,
I am new to Python (coming from R and Matlab/Octave). I was preparing to write my usual code for computing the pdf of a really high-dimensional (e.g. 10000 dimensions) Gaussian in Python, but I noticed that numpy has a function for computing the log determinant in these situations.
Yep. Keep in mind this is a fairly recent addition, in 1.5 I think, so if you ship code make sure to list this dependency.
Is there a function for performing the inverse or even the pdf of a multinomial normal in these situations as well?
There's a function for the inverse, but you almost never want to use it, especially if your goal is the multivariate normal density. A basic explanation of why is available here: http://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/
In the case of the multivariate normal density the covariance is assumed to be positive definite, and thus a Cholesky decomposition is appropriate. scipy.linalg.solve() (NOT numpy.linalg.solve()) with the sym_pos=True argument will do this for you.
What do you mean by a "multinomial normal"? Do you mean multivariate normal? Offhand I can't remember if it exists in scipy.stats, but I know there's code for it in PyMC.
David
--
Melissa Weber Mendonça
"Knowledge is knowing a tomato is a fruit; wisdom is knowing you don't put tomato in fruit salad."
On 2010-08-30, at 5:42 PM, Melissa Mendonça wrote:
I'm just curious as to why you say "scipy.linalg.solve(), NOT numpy.linalg.solve()". Can you explain the reason for this?
Oh, the performance will be similar, provided you've linked against a good BLAS. It's just that the NumPy version doesn't have the sym_pos keyword. As for why that is: http://ask.scipy.org/en/topic/10-what-s-the-difference-between-numpy-linalg-...

David
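(A small illustration of the keyword difference, on a matrix that is positive definite by construction; note that later SciPy releases spell sym_pos=True as assume_a='pos':)

import numpy as np
from scipy import linalg

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
S = A @ A.T + 500.0 * np.eye(500)          # symmetric positive definite by construction
b = rng.standard_normal(500)

x_np = np.linalg.solve(S, b)               # general LU-based solve
x_sp = linalg.solve(S, b, sym_pos=True)    # Cholesky-based solve; exploits symmetry
print(np.allclose(x_np, x_sp))             # True (up to rounding)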
On Mon, Aug 30, 2010 at 3:42 PM, Melissa Mendonça <melissawm@gmail.com>wrote:
Hi,
I've been lurking for a while here but never really introduced myself. I'm a mathematician in Brazil working with optimization and numerical analysis and I'm looking into scipy/numpy basically because I want to ditch matlab.
I'm just curious as to why you say "scipy.linalg.solve(), NOT numpy.linalg.solve()". Can you explain the reason for this? I find myself looking for information such as this on the internet but I rarely find real documentation for these things, and I seem to have so many performance issues with python... I'm curious to see what I'm missing here.
What sort of performance issues are you having?

Chuck
Hi Melissa,

On 30 August 2010 17:42, Melissa Mendonça <melissawm@gmail.com> wrote:
I've been lurking for a while here but never really introduced myself. I'm a mathematician in Brazil working with optimization and numerical analysis and I'm looking into scipy/numpy basically because I want to ditch matlab.
Welcome to the list! I hope you will find the switch to numpy and scipy rewarding above and beyond not-being-matlab. Please feel free to ask questions on the list; as you've probably seen, we get lots of questions with simple but non-obvious answers, and a few really tough questions.
I'm just curious as to why you say "scipy.linalg.solve(), NOT numpy.linalg.solve()". Can you explain the reason for this? I find myself looking for information such as this on the internet but I rarely find real documentation for these things, and I seem to have so many performance issues with python... I'm curious to see what I'm missing here.
I agree that the documentation is a little hard to find one's way around sometimes. The numpy documentation project has done a wonderful job of providing detailed documentation for all the functions in numpy, but there's not nearly as much documentation giving a general orientation (which functions are better, what the relationship is with scipy and matplotlib). The scipy documentation project is unfortunately still getting started.

What sorts of performance issues have you had? Python has some important limitations, but there are often ways to work around them. Sometimes there isn't an efficient way to do things, though, and in those cases we'd like to think about whether numpy/scipy can be improved. (Bulk linear algebra - solving large numbers of small problems - stands out as an example.) Numerical optimization is another - we know the optimizers built into scipy have serious limitations, and welcome ideas on improving them.
Thanks, and sorry if I hijacked the thread,
No problem, and welcome again to the list.

Anne
Thanks for the reply.

David Warde-Farley <dwf <at> cs.toronto.edu> writes:

On 2010-08-30, at 11:28 AM, Daniel Elliott wrote:

Large matrices (e.g. 10K x 10K)

Is there a function for performing the inverse or even the pdf of a multinomial normal in these situations as well?

There's a function for the inverse, but you almost never want to use it, especially if your goal is the multivariate normal density. A basic explanation of why is available here: http://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/

In the case of the multivariate normal density the covariance is assumed to be positive definite, and thus a Cholesky decomposition is appropriate. scipy.linalg.solve() (NOT numpy.linalg.solve()) with the sym_pos=True argument will do this for you.
You don't think this will choke on a large (e.g. 10K x 10K) covariance matrix?

Given what you know about how it computes the log determinant and how the Cholesky decomposition works, do you suppose I would be better off using eigen-decomposition to do this, since I will also get the determinant from the sum of the logs of the eigenvalues?
What do you mean by a "multinomial normal"? Do you mean multivariate normal? Offhand I can't remember if it exists in scipy.stats, but I know there's code for it in PyMC.
Oops, I meant multivariate normal. I found code in the pymix library and so far I like what I see there. I am sure it works well for most cases. However, with the exception of one R library, I have never found a packaged implementation that works on covariance matrices as large as what I am talking about above.
On 08/31/2010 11:19 AM, Dan Elliott wrote:
Thanks for the reply.

David Warde-Farley <dwf <at> cs.toronto.edu> writes:

On 2010-08-30, at 11:28 AM, Daniel Elliott wrote:

Large matrices (e.g. 10K x 10K)

Is there a function for performing the inverse or even the pdf of a multinomial normal in these situations as well?

There's a function for the inverse, but you almost never want to use it, especially if your goal is the multivariate normal density. A basic explanation of why is available here: http://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/

In the case of the multivariate normal density the covariance is assumed to be positive definite, and thus a Cholesky decomposition is appropriate. scipy.linalg.solve() (NOT numpy.linalg.solve()) with the sym_pos=True argument will do this for you.
You don't think this will choke on a large (e.g. 10K x 10K) covariance matrix?
It will work if you have enough memory. I have worked with slightly bigger matrices, but I have 12 GB on my machine. You need some patience, though :)

cheers,

David
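(As a rough check on the memory requirement: one dense float64 matrix of that size is under a gigabyte, but factorizations and temporaries need additional copies:)

# Back-of-envelope memory for a dense 10K x 10K float64 matrix
n = 10_000
print(n * n * 8 / 1e9, "GB per copy")   # 0.8 GB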
On Mon, Aug 30, 2010 at 8:19 PM, Dan Elliott <danelliottster@gmail.com>wrote:
Thanks for the reply.

David Warde-Farley <dwf <at> cs.toronto.edu> writes:

On 2010-08-30, at 11:28 AM, Daniel Elliott wrote:

Large matrices (e.g. 10K x 10K)

Is there a function for performing the inverse or even the pdf of a multinomial normal in these situations as well?

There's a function for the inverse, but you almost never want to use it, especially if your goal is the multivariate normal density. A basic explanation of why is available here: http://www.johndcook.com/blog/2010/01/19/dont-invert-that-matrix/

In the case of the multivariate normal density the covariance is assumed to be positive definite, and thus a Cholesky decomposition is appropriate. scipy.linalg.solve() (NOT numpy.linalg.solve()) with the sym_pos=True argument will do this for you.
You don't think this will choke on a large (e.g. 10K x 10K) covariance matrix?
Should work, give it a shot. It's an O(n^3) problem, so it might take a bit.
Given what you know about how it computes the log determinant and how the Cholesky decomposition works, do you suppose I would be better off using eigen-decomposition to do this, since I will also get the determinant from the sum of the logs of the eigenvalues?
I don't see what the connection with the determinant is. The log determinant will be calculated using the ordinary LU decomposition, as that works for more general matrices.

Chuck
On 2010-08-30, at 10:36 PM, Charles R Harris wrote:
I don't see what the connection with the determinant is. The log determinant will be calculated using the ordinary LU decomposition as that works for more general matrices.
I think he means that if he needs both the determinant and to solve the system, it might be more efficient to do the SVD, obtain the determinant from the diagonal values, and obtain the solution by multiplying by V D^-1 U^T?

David
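(A sketch of that route, assuming a symmetric positive definite S, so the singular values coincide with the eigenvalues:)

import numpy as np

def svd_logdet_and_solve(S, b):
    # One SVD yields both the log-determinant and the solution of S x = b.
    U, s, Vt = np.linalg.svd(S)
    logdet = np.log(s).sum()          # log|S| = sum of log singular values (SPD case)
    x = Vt.T @ ((U.T @ b) / s)        # x = V diag(1/s) U^T b = inv(S) @ b
    return logdet, x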
David Warde-Farley <dwf <at> cs.toronto.edu> writes:
On 2010-08-30, at 10:36 PM, Charles R Harris wrote:

I think he means that if he needs both the determinant and to solve the system, it might be more efficient to do the SVD, obtain the determinant from the diagonal values, and obtain the solution by multiplying by V D^-1 U^T?
Thank you, that is what I meant. Poorly worded on my part.

In particular, I am writing code to invert a very large covariance matrix. I think David has some good information in another post in this thread.

- dan
On Tue, Aug 31, 2010 at 4:52 PM, Dan Elliott <danelliottster@gmail.com>wrote:
David Warde-Farley <dwf <at> cs.toronto.edu> writes:
On 2010-08-30, at 10:36 PM, Charles R Harris wrote:

I think he means that if he needs both the determinant and to solve the system, it might be more efficient to do the SVD, obtain the determinant from the diagonal values, and obtain the solution by multiplying by V D^-1 U^T?
Thank you, that is what I meant. Poorly worded on my part.
In particular, I am writing code to invert a very large covariance matrix. I think David has some good information in another post in this thread.
Where did the covariance array come from? It may be the case that you can use a much smaller one, for instance in PCA of images.

Chuck
Is it really the covariance matrix you want to invert? Or do you want to compute something like x^T C^{-1} x, where x is an array of size N and C an array of size (N,N)?

It would also be interesting to know how the covariance matrix gets computed and what its condition number is, at least approximately.

On Wed, Sep 1, 2010 at 1:58 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Tue, Aug 31, 2010 at 4:52 PM, Dan Elliott <danelliottster@gmail.com> wrote:
David Warde-Farley <dwf <at> cs.toronto.edu> writes:
On 2010-08-30, at 10:36 PM, Charles R Harris wrote:

I think he means that if he needs both the determinant and to solve the system, it might be more efficient to do the SVD, obtain the determinant from the diagonal values, and obtain the solution by multiplying by V D^-1 U^T?
Thank you, that is what I meant. Poorly worded on my part.
In particular, I am writing code to invert a very large covariance matrix. I think David has some good information in another post in this thread.
Where did the covariance array come from? It may be the case that you can use a much smaller one, for instance in PCA of images.
Chuck
Wow, this is great! Thanks for all the great questions.

Sebastian Walter <sebastian.walter <at> gmail.com> writes:
is it really the covariance matrix you want to invert? Or do you want to compute something like x^T C^{-1} x, where x is an array of size N and C an array of size (N,N)?
Yes, this is what I am computing. I am computing the pdf of a very high-dimensional multivariate normal. Is there a specialized method to compute this?
It would also be interesting to know how the covariance matrix gets computed and what its condition number is, at least approximately.
Chuck and Sebastian are on to me. I should look into computing this using the small covariance matrix. I suppose I could even postpone evaluation of this covariance matrix by passing in the data matrix instead of the covariance matrix. What do you guys think?

As for the condition number, the matrix will, initially, be ill-conditioned because the dimension of the data will far outnumber the number of data points. However, I am writing this code on the way to writing code that fits a mixture of Gaussians but can use any number of covariance regularization techniques to "fix" the covariance matrices. Basically, I am writing code that does this: http://www.cs.colostate.edu/~dane/papers/ANNIE2010.pdf. I am guessing you math and stats guys might shake your head at my clumsy, computer science ways... :)

However, in addition to the two simple regularization techniques described in that paper, I also want to be able to do things like mixture of factor analyzers and mixture of principal component analyzers and so on by simply supplying different regularization techniques.

FYI, I am currently adding this to PyMix. Does this seem like a good package to extend?
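(A sketch of the "small covariance" idea for the case n << p, assuming a centered n x p data matrix X: the eigenvectors of the p x p covariance can be recovered from the n x n Gram matrix. The function name is illustrative:)

import numpy as np

def small_covariance_eig(X):
    # X: centered data, shape (n, p) with n << p.
    # Instead of eigendecomposing the p x p covariance C = X.T @ X / n,
    # eigendecompose the n x n Gram matrix G = X @ X.T / n: the nonzero
    # eigenvalues agree, and eigenvectors map via v = X.T @ u / sqrt(n * lam).
    n = X.shape[0]
    G = X @ X.T / n
    lam, U = np.linalg.eigh(G)               # ascending eigenvalues
    keep = lam > 1e-12                       # discard numerically-zero modes
    lam, U = lam[keep], U[:, keep]
    V = X.T @ U / np.sqrt(n * lam)           # columns: eigenvectors of C
    return lam, V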
On Wed, Sep 1, 2010 at 1:58 AM, Charles R Harris <charlesr.harris <at> gmail.com> wrote:
On Tue, Aug 31, 2010 at 4:52 PM, Dan Elliott <danelliottster <at> gmail.com> wrote:
David Warde-Farley <dwf <at> cs.toronto.edu> writes:
On 2010-08-30, at 10:36 PM, Charles R Harris wrote:

I think he means that if he needs both the determinant and to solve the system, it might be more efficient to do the SVD, obtain the determinant from the diagonal values, and obtain the solution by multiplying by V D^-1 U^T?

Thank you, that is what I meant. Poorly worded on my part.
In particular, I am writing code to invert a very large covariance matrix. I think David has some good information in another post in this thread.
Where did the covariance array come from? It may be the case that you can use a much smaller one, for instance in PCA of images.
Chuck
Yes, this is what I am computing. I am computing the pdf of a very high-dimensional multivariate normal. Is there a specialized method to compute this?
If you use cho_solve and cho_factor from scipy.linalg, you can proceed like this:

cx = X - m
sqmahal = (cx*cho_solve(cho_factor(S),cx.T).T).sum(axis=1)

where X is the data (n x p), m is the mean (1 x p) and S is the covariance (p x p).

We could do this twice as fast using a hypothetical method tri_solve that calls the LAPACK subroutine DTRTRS:

cx = X - m
sqmahal = (tri_solve(cho_factor(S),cx.T).T**2).sum(axis=1)

It would be twice as fast because cho_solve does 50% redundant work in the calculation of Mahalanobis distances, as we can skip the backsubstitution. You can e.g. get DTRTRS from Intel MKL, ACML (free download from AMD) or GotoBLAS. Just downloading ACML and using ctypes (or f2py or Cython) to get DTRTRS is easy.

If you do this you probably also want to expose LAPACK's Cholesky factorization subroutine DPOTRF from ACML or Intel MKL, to make sure you get everything as fast as possible (that would also avoid the dependency on SciPy). I.e. call DPOTRF on S, compute CX, call DTRTRS, then take the sum of squares along the first axis (of the CX that DTRTRS modified in place). That is the fastest way to compute Mahalanobis distances I can think of...

Make sure you use an optimized LAPACK library: ACML, Intel MKL and GotoBLAS are the only ones worth looking at. ATLAS is much slower, and forget about Netlib's reference implementations of BLAS and LAPACK.

I hope the SciPy dev team can be persuaded to include a wrapper for DTRTRS in the future. It is after all extremely useful for Mahalanobis distances, and thus for any use of linear models in statistics.

Sturla Molden
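(A self-contained sketch of both variants; it assumes scipy.linalg.solve_triangular, which wraps DTRTRS in later SciPy releases and plays the role of the hypothetical tri_solve above:)

import numpy as np
from scipy.linalg import cho_factor, cho_solve, solve_triangular

def sq_mahalanobis(X, m, S):
    # Squared Mahalanobis distances of the rows of X from mean m,
    # via one Cholesky factorization of the covariance S.
    cx = X - m                              # (n, p)
    c, lower = cho_factor(S, lower=True)    # S = L @ L.T
    # Full solve: S z = cx.T, then row-wise dot products.
    full = (cx * cho_solve((c, lower), cx.T).T).sum(axis=1)
    # Half-cost variant: one triangular solve L y = cx.T, then sum of squares,
    # skipping the backsubstitution entirely.
    y = solve_triangular(c, cx.T, lower=True)
    half = (y ** 2).sum(axis=0)
    assert np.allclose(full, half)
    return half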
Thu, 09 Sep 2010 18:18:29 +0200, Sturla Molden wrote: [clip]
I hope the SciPy dev team can be persuaded to include a wrapper for DTRTRS in the future. It is after all extremely useful for Mahalanobis distances, and thus for any use of linear models in statistics.
I don't see reasons why not to include such a wrapper, so little persuasion is needed. The only thing is that someone should spend some time implementing this suggestion (and I probably won't -- I don't really need that feature myself, and there are many other things that need to be done).

--
Pauli Virtanen
Sturla Molden <sturla <at> molden.no> writes:
Yes, this is what I am computing. I am computing the pdf of a very high-dimensional multivariate normal. Is there a specialized method to compute this?
If you use cho_solve and cho_factor from scipy.linalg, you can proceed like this:
cx = X - m
sqmahal = (cx*cho_solve(cho_factor(S),cx.T).T).sum(axis=1)
where X is the data (n x p), m is the mean (1 x p) and S is the covariance (p x p).
We could do this twice as fast using a hypothetical method tri_solve that calls LAPACK subroutine DTRTRS.
Sorry for the laziness of this question... As a reminder, I am working with high-dimensional data (10K dimensions). I was computing the log of the MVN pdf because the probabilities would almost all go to zero. Do you suppose the method you have shown me will be numerically stable (the probabilities will be small but they stay above zero)?

By the way, I would be happy to implement the method you desire in a couple months if you are willing to do a little hand-holding.

Thanks for the excellent suggestion.

- dan
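(On the stability question: the usual trick is to stay in log space throughout, so nothing is ever exponentiated. A sketch along the lines of Sturla's suggestion, assuming rows of X are the observations; the function name is illustrative:)

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def mvn_logpdf_rows(X, m, S):
    # Log-density of each row of X under N(m, S); the log-determinant and
    # the squared Mahalanobis distances are both well-scaled even in 10K
    # dimensions, so the result stays finite where the raw pdf underflows.
    p = S.shape[0]
    c, lower = cho_factor(S, lower=True)
    logdet = 2.0 * np.log(np.diag(c)).sum()
    cx = X - m
    sqmahal = (cx * cho_solve((c, lower), cx.T).T).sum(axis=1)
    return -0.5 * (p * np.log(2.0 * np.pi) + logdet + sqmahal)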
On 2010-08-30, at 10:19 PM, Dan Elliott wrote:
You don't think this will choke on a large (e.g. 10K x 10K) covariance matrix?
That depends. Is it very close to being rank deficient? That would be my main concern. NumPy/LAPACK will have no trouble Cholesky-decomposing a matrix this big, presuming the matrix is well-behaved/full rank. Depending on your hardware it will be slow, but the Cholesky decomposition will be faster (I think usually by about a factor of 2) than other methods that do not assume positive-definiteness. At any rate, inverting the matrix explicitly will take longer and be quite a bit worse in terms of the eventual accuracy of your result.
Given what you know about how it computes the log determinant and how the Cholesky decomposition works, do you suppose I would be better off using eigen-decomposition to do this, since I will also get the determinant from the sum of the logs of the eigenvalues?
You could do it this way, though I am not certain of the stability concerns (and I'm in the middle of a move so my copy of Golub and Van Loan is packed up somewhere). I know that the SVD is used in numpy.linalg.lstsq, so it can't be that bad. I've never used covariance matrices quite so big that computing the determinant was a significant cost.

One thing to note is that you should definitely not be solving the system for every single RHS vector separately. scipy.linalg.solve supports matrix RHSes, and will only do the matrix factorization (be it LU or Cholesky) once, requiring only the O(n^2) forward/backsubstitution to be done repeatedly. This will result in significant savings:

In [5]: X = randn(10000,20000)

In [6]: Y = dot(X, X.T)

In [7]: timeit scipy.linalg.solve(Y, randn(10000), sym_pos=True)
10 loops, best of 3: 78.6 s per loop

In [8]: timeit scipy.linalg.solve(Y, randn(10000, 50), sym_pos=True)
10 loops, best of 3: 81.6 s per loop

David
Participants (10):

- Anne Archibald
- Charles R Harris
- Dan Elliott
- Daniel Elliott
- David
- David Warde-Farley
- Melissa Mendonça
- Pauli Virtanen
- Sebastian Walter
- Sturla Molden