Zero row in SVD's unitary matrix on some Macs
I am working to make many of NumPy's matrix decomposition routines available in Sage. As part of testing a new routine, we have found some odd behavior with the singular value decomposition. On certain Macs, the numpy built into Sage will return the second of the unitary matrices with a row of all zeros, so of course it cannot be unitary. For the configurations in question, this erroneous output happens only for certain matrix sizes, and for those sizes it always occurs. The smallest such sizes are 3 x 4, 4 x 5, 5 x 6, 5 x 7, 6 x 7, 6 x 8, 6 x 9, 7 x 8, 7 x 9, and 8 x 9.

The fault is not in Sage code per se, as it can be reproduced by running Sage's Python and using numpy directly. It is possible Sage is not building numpy correctly; we have not tested a standalone version of numpy, since this problem seems to be limited to very few configurations. The initial report, and a confirmation, are both on Macs where Sage is built using gcc 4.0.1 and gcc 4.2.1. The test that uncovered this situation was introduced two alpha releases back and has not failed for testers on Linux or newer Macs. The svd routine itself has been in Sage for about three years without exhibiting any problems, but maybe the cases above were not tested. I do not own a Mac, so testing out scenarios involves sending suggestions to the two folks who have reported failures.

Many more details and complete transcripts are at: http://trac.sagemath.org/sage_trac/ticket/11248

Any thoughts or advice to help us understand this would be greatly appreciated. Thanks in advance.

Rob

--
Robert A. Beezer
Professor
Department of Mathematics and Computer Science
University of Puget Sound
1500 N Warner
Tacoma, WA 98416-1043
beezer@ups.edu
http://buzzard.ups.edu
Voice: 253.879.3564
Fax: 253.879.3522
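[For readers following along: the failure mode Rob describes can be checked with a short numpy sketch. This is a reconstruction, not code from the ticket; the random test matrices and the `v_is_unitary` helper are illustrative only.]

```python
import numpy as np

def v_is_unitary(m, n, tol=1e-10, seed=0):
    """SVD of a random complex m x n matrix; check that V**H is unitary."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
    # full_matrices=True (the default) makes vh the full n x n matrix
    _, _, vh = np.linalg.svd(a)
    return float(np.abs(vh @ vh.conj().T - np.eye(n)).max()) < tol

# the smallest shapes reported to fail on the affected Macs
results = {(m, n): v_is_unitary(m, n) for m, n in [(3, 4), (4, 5), (5, 6)]}
print(results)
```

On an unaffected platform every entry comes back True; on the broken OSX 10.5 configurations the last row of `vh` is all zeros, so the check fails.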
On Mon, 25 Apr 2011 10:16:13 -0700, Rob Beezer wrote: [clip]
Many more details and complete transcripts are at: http://trac.sagemath.org/sage_trac/ticket/11248
Any thoughts or advice to help us understand this would be greatly appreciated.
The Numpy routine is a very thin wrapper of LAPACK's ZGESDD and probably cannot have any bugs of this kind, so the problem is most likely with the LAPACK and BLAS libraries you use. You will probably be able to reproduce the problem with an equivalent Fortran/C snippet calling LAPACK directly.

Problems like this in BLAS/LAPACK are somewhat difficult to track down. You could try switching to another BLAS library (or, if you use ATLAS, compiling it differently) and checking whether the problem disappears.

-- Pauli Virtanen
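[Pauli's suggestion of calling LAPACK directly can also be approximated from Python via scipy's low-level f2py wrappers, which bypass numpy's `linalg.svd` logic entirely. A sketch under the assumption that a modern `scipy.linalg.lapack` is available; the thread itself uses a C program for this.]

```python
import numpy as np
from scipy.linalg import lapack  # raw LAPACK wrappers, no numpy.linalg logic

rng = np.random.default_rng(1)
a = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))

# compute_uv=1, full_matrices=1 corresponds to the JOBZ='A' call numpy makes
u, s, vt, info = lapack.zgesdd(a, compute_uv=1, full_matrices=1)
assert info == 0

# vt should be the full 4x4 V**H; a zero last row here reproduces the bug
err = float(np.abs(vt @ vt.conj().T - np.eye(4)).max())
print("max deviation from unitarity:", err)
```

If this direct call shows the same zero row as `numpy.linalg.svd`, the fault sits below numpy, in the LAPACK/BLAS actually linked in.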
On 25. apr. 2011, at 19.57, Pauli Virtanen wrote:
On Mon, 25 Apr 2011 10:16:13 -0700, Rob Beezer wrote: [clip]
Many more details and complete transcripts are at: http://trac.sagemath.org/sage_trac/ticket/11248
Any thoughts or advice to help us understand this would be greatly appreciated.
The Numpy routine is a very thin wrapper of LAPACK's ZGESDD, and probably cannot have any bugs of this kind, so the problem is most likely with the LAPACK and BLAS libraries you use. You will probably be able to reproduce the problem also with an equivalent Fortran/C snippet calling LAPACK directly.
Problems like this in BLAS/LAPACK are somewhat difficult to track. You could try switching to another BLAS library (or, if you use ATLAS, compile it differently) and checking if the problem disappears.
-- Pauli Virtanen
I cannot claim anything concrete, but I did notice something seemingly odd about the BLAS+LAPACK that ships with Mac OS X. In our light-scattering simulation code we use energy conservation as a test of consistency. When running small test runs on my MacBook Pro, the energy would be significantly less well conserved than when the exact same simulation was run on a Linux box. No harm done; we don't care about the exact results on the laptop, but it did seem odd. We do not rely on SVD, however; we only use LU plus back-substitution (cgesv). I have a vague feeling that there is something odd about LAPACK+BLAS on Mac OS X, but I do not have any hard evidence.

Paul
Thanks for the reply, Pauli. I suspected this might be the case, but was hoping maybe this was something that had been seen before. I've included your suggestions on the bug report for Sage. Rob On 04/25/2011 10:57 AM, Pauli Virtanen wrote:
The Numpy routine is a very thin wrapper of LAPACK's ZGESDD, and probably cannot have any bugs of this kind, so the problem is most likely with the LAPACK and BLAS libraries you use. You will probably be able to reproduce the problem also with an equivalent Fortran/C snippet calling LAPACK directly.
Problems like this in BLAS/LAPACK are somewhat difficult to track. You could try switching to another BLAS library (or, if you use ATLAS, compile it differently) and checking if the problem disappears.
On 4/25/11 12:57 PM, Pauli Virtanen wrote:
On Mon, 25 Apr 2011 10:16:13 -0700, Rob Beezer wrote: [clip]
Many more details and complete transcripts are at: http://trac.sagemath.org/sage_trac/ticket/11248
Any thoughts or advice to help us understand this would be greatly appreciated.
The Numpy routine is a very thin wrapper of LAPACK's ZGESDD, and probably cannot have any bugs of this kind, so the problem is most likely with the LAPACK and BLAS libraries you use. You will probably be able to reproduce the problem also with an equivalent Fortran/C snippet calling LAPACK directly.
Problems like this in BLAS/LAPACK are somewhat difficult to track. You could try switching to another BLAS library (or, if you use ATLAS, compile it differently) and checking if the problem disappears.
I was just looking up the documentation for ZGESDD and noticed that the value we have for rwork in the numpy call [1] does not match the LAPACK docs. This was changed in LAPACK 3.2.2 [2]. I've submitted a pull request:

https://github.com/numpy/numpy/pull/78

but I have not tested the change. I doubt this fix will fix the problem on this thread, but it makes sense to make the change anyway.

Thanks,

Jason

[1] https://github.com/numpy/numpy/blob/master/numpy/linalg/linalg.py#L1300
[2] http://icl.cs.utk.edu/lapack-forum/viewtopic.php?f=2&t=1779, or http://www.netlib.org/lapack/lapack-3.2.2.html (bug 0046)
On Tue, 26 Apr 2011 10:33:07 -0500, Jason Grout wrote: [clip]
I was just looking up the documentation for ZGESDD and noticed that the value we have for rwork in the numpy call [1] does not match the Lapack docs. This was changed in Lapack 3.2.2 [2]. I've submitted a pull request:
https://github.com/numpy/numpy/pull/78
but I have not tested the change.
I doubt this fix will fix the problem on this thread, but it makes sense to make the change anyway.
I think it is worth testing if this change fixes it. Note that current LAPACK docs have the number 7 there instead of 5, so there may really be a bug here. Pauli
On 04/26/2011 11:25 AM, Pauli Virtanen wrote:
I think it is worth testing if this change fixes it. Note that current LAPACK docs have the number 7 there instead of 5, so there may really be a bug here.
It appears to me the problem is more subtle than just a strict comparison between the number of rows and the number of columns. http://trac.sagemath.org/sage_trac/attachment/ticket/11248/broken.png shows a plot of matrix sizes (number of rows on the horizontal axis, number of columns on the vertical). This is on an OSX 10.5 machine; wherever there is a dot, the SVD seems to *always* produce a non-unitary V, and the absence of a dot indicates no failures were found. So maybe a workspace size could be the culprit?
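[The size scan behind that plot can be sketched in a few lines of numpy. This is a reconstruction of the idea, not the Sage test script; thresholds and trial counts are arbitrary choices.]

```python
import numpy as np

def worst_unitarity_error(m, n, trials=3, seed=0):
    """Largest deviation of V**H from unitarity over a few random m x n trials."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        a = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
        _, _, vh = np.linalg.svd(a)
        worst = max(worst, float(np.abs(vh @ vh.conj().T - np.eye(n)).max()))
    return worst

# shapes that would get a dot in the plot: the error is O(1), not roundoff
bad = [(m, n) for m in range(2, 9) for n in range(2, 10)
       if worst_unitarity_error(m, n) > 1e-8]
print(bad)
```

On an unaffected platform `bad` comes back empty; on the broken configurations it should reproduce the dotted region of the plot.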
On 4/25/11 12:57 PM, Pauli Virtanen wrote:
The Numpy routine is a very thin wrapper of LAPACK's ZGESDD, and probably cannot have any bugs of this kind,
As noted in my other message, I've been digging through the ZGESDD docs to understand it better. Here is the doc string for what becomes the V^T matrix:

*  VT      (output) COMPLEX*16 array, dimension (LDVT,N)
*          If JOBZ = 'A' or JOBZ = 'O' and M >= N, VT contains the
*          N-by-N unitary matrix V**H;
*          if JOBZ = 'S', VT contains the first min(M,N) rows of
*          V**H (the right singular vectors, stored rowwise);
*          if JOBZ = 'O' and M < N, or JOBZ = 'N', VT is not referenced.

Notice that VT is supposed to contain the N-by-N V**H if M >= N (i.e., at least as many rows as columns). In our problem cases, we have M < N. I looked at the numpy linalg code and it doesn't seem to check whether M < N before returning the V matrix. Is this a problem?

When I run an example C program that links to the OSX Accelerate framework, I get a V matrix that has a zero row at the bottom. I got these results by:

1. Download the example found at http://software.intel.com/sites/products/documentation/hpc/mkl/lapack/mkl_la...
2. Change the zgesdd call to have first argument "A" (so it mimics the default numpy call)
3. Change
   print_matrix( "Right singular vectors (stored rowwise)", m, n, vt, ldvt );
   to
   print_matrix( "Right singular vectors (stored rowwise)", n, n, vt, ldvt );
   (this is to print out the full n-by-n matrix, instead of just the first m rows)
4. Compile the program with "gcc zgesdd_ex.c -framework Accelerate"
5. Run the program: "./a.out"

And indeed, I get a 0 row as the last row of the V**H matrix.

However, when I do the SVD of this same matrix using numpy (which I *think* uses the Accelerate framework the way we compile it in Sage), I get a full V matrix:

In [13]: I=1j

In [14]: a=numpy.asarray([(-5.40+I* 7.40), ( 6.00+I* 6.38), ( 9.91+I* 0.16), (-5.28+I*-4.16),
                          ( 1.09+I* 1.55), ( 2.60+I* 0.07), ( 3.98+I*-5.26), ( 2.03+I* 1.11),
                          ( 9.88+I* 1.91), ( 4.92+I* 6.31), (-2.11+I* 7.39), (-9.81+I*-8.98)],
                         dtype=complex).reshape(3,4)

In [15]: numpy.linalg.svd(a)
Out[15]:
(array([[ 0.54742764+0.j        ,  0.76302168+0.j        , -0.34368721+0.j        ],
        [-0.03507684-0.15148438j,  0.27097680-0.22637514j,  0.54572628-0.74386208j],
        [ 0.81299016+0.12325614j, -0.52311095-0.13956616j,  0.13357577-0.1135282j ]]),
 array([ 21.75519279,  16.59545017,   3.97327576]),
 array([[ 0.23160531+0.20669796j,  0.36590896+0.3864613j ,  0.24259328+0.32833854j, -0.56133932-0.37233547j],
        [-0.57911906+0.40329699j,  0.10921398+0.17242422j,  0.59673801-0.27492812j,  0.15998810+0.05510835j],
        [ 0.60420072+0.12337134j, -0.18988750+0.29722068j,  0.39210635+0.19697635j,  0.45451433+0.31015037j],
        [-0.08335334+0.13557583j,  0.69622499-0.25685378j, -0.15350399+0.43075518j,  0.46295826+0.02290113j]]))

That seems odd. I must be missing something. It also seems odd that numpy returns V as a full 4x4 matrix even when M < N.

Thanks,

Jason
On 4/26/11 11:07 AM, Jason Grout wrote:
And indeed, I get a 0 row as the last row of the V**H matrix
I just double-checked things one last time and saw that I actually hadn't changed the first argument of zgesdd to "A" in the program that I actually ran. So with this change, I get a nonzero last row of the V**H matrix from the C call to zgesdd. So everything is consistent between the C call to zgesdd and the numpy svd call. So now my remaining question is: if the Lapack docs only say that V**H is the full n-by-n matrix if M>=N, why is numpy returning it even if M<N? Thanks, Jason
On 4/26/11 11:12 AM, Jason Grout wrote:
On 4/26/11 11:07 AM, Jason Grout wrote:
And indeed, I get a 0 row as the last row of the V**H matrix
I just double-checked things one last time and saw that I actually hadn't changed the first argument of zgesdd to "A" in the program that I actually ran. So with this change, I get a nonzero last row of the V**H matrix from the C call to zgesdd. So everything is consistent between the C call to zgesdd and the numpy svd call.
So now my remaining question is: if the Lapack docs only say that V**H is the full n-by-n matrix if M>=N, why is numpy returning it even if M<N?
One more post talking to myself... I notice that the zgesvd routine docs guarantee that the V returned is unitary, regardless of the size of A. So this might be another argument for calling zgesvd instead of zgesdd. Thanks, Jason
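[Jason's zgesvd-vs-zgesdd comparison can be sketched from Python through scipy's low-level wrappers. A reconstruction under the assumption of a modern `scipy.linalg.lapack`; the thread itself discusses switching drivers inside numpy, not this snippet.]

```python
import numpy as np
from scipy.linalg import lapack

rng = np.random.default_rng(2)
a = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))

# zgesdd: divide-and-conquer (what numpy.linalg.svd wraps)
# zgesvd: QR-based, whose docs guarantee a unitary V regardless of shape
errs = {}
for name, func in [("zgesdd", lapack.zgesdd), ("zgesvd", lapack.zgesvd)]:
    u, s, vt, info = func(a, compute_uv=1, full_matrices=1)
    assert info == 0
    errs[name] = float(np.abs(vt @ vt.conj().T - np.eye(4)).max())
print(errs)
```

On the affected machines the zgesdd entry would be O(1) while zgesvd stays at roundoff, which is the argument for falling back to zgesvd.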
On 4/26/11 11:22 AM, Jason Grout wrote:
On 4/26/11 11:12 AM, Jason Grout wrote:
On 4/26/11 11:07 AM, Jason Grout wrote:
And indeed, I get a 0 row as the last row of the V**H matrix
I just double-checked things one last time and saw that I actually hadn't changed the first argument of zgesdd to "A" in the program that I actually ran. So with this change, I get a nonzero last row of the V**H matrix from the C call to zgesdd. So everything is consistent between the C call to zgesdd and the numpy svd call.
So now my remaining question is: if the Lapack docs only say that V**H is the full n-by-n matrix if M>=N, why is numpy returning it even if M<N?
One more post talking to myself...
I notice that the zgesvd routine docs guarantee that the V returned is unitary, regardless of the size of A. So this might be another argument for calling zgesvd instead of zgesdd.
Okay, just one more data point. Our people that are seeing the problem with numpy returning a non-unitary V also see a non-unitary V being returned by the test C call to zgesdd. In other words, it really appears that zgesdd follows the Lapack docs, and if rows<columns, the returned V is not necessarily unitary, but may contain a zero row. This makes numpy's assumptions in using zgesdd false. You can see this report at http://trac.sagemath.org/sage_trac/ticket/11248#comment:25 Thanks, Jason
Tue, 26 Apr 2011 11:36:19 -0500, Jason Grout wrote: [clip]
Okay, just one more data point. Our people that are seeing the problem with numpy returning a non-unitary V also see a non-unitary V being returned by the test C call to zgesdd. In other words, it really appears that zgesdd follows the Lapack docs, and if rows<columns, the returned V is not necessarily unitary, but may contain a zero row. This makes numpy's assumptions in using zgesdd false.
You can see this report at http://trac.sagemath.org/sage_trac/ticket/11248#comment:25
What LAPACK promises is not very clear. Earlier in the man page:

JOBZ    (input) CHARACTER*1
        Specifies options for computing all or part of the matrix U:
        = 'A': all M columns of U and all N rows of V**H are
               returned in the arrays U and VT;
        = 'S': the first min(M,N) columns of U and the first
               min(M,N) rows of V**H are returned in the arrays U
               and VT;
        = 'O': If M >= N, the first N columns of U are overwritten
               in the array A and all rows of V**H are returned in
               the array VT; otherwise, all columns of U are
               returned in the array U and the first M rows of
               V**H are overwritten in the array A;
        = 'N': no columns of U or rows of V**H are computed.

Looks to me like the statement should be parsed as:

return_n_rows = (jobz == 'A') or (jobz == 'O' and m >= n)

So the current usage should be OK. (At least, as long as jobz == 'A'; in the other cases, I don't think the wrapper does the right thing.) But apparently either there's a bug, or the LAPACK man page needs to be understood as you say.

Pauli
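[The two competing readings of the VT doc string can be written out side by side; the helper names here are illustrative, not from the thread.]

```python
def returns_full_vt(jobz, m, n):
    """Pauli's reading: the 'M >= N' condition binds only to JOBZ = 'O'."""
    return jobz == 'A' or (jobz == 'O' and m >= n)

def returns_full_vt_strict(jobz, m, n):
    """The other reading: M >= N is required for both 'A' and 'O'."""
    return jobz in ('A', 'O') and m >= n

# The readings disagree exactly for jobz='A' with m < n, which is the
# call numpy makes (full_matrices=True) for the failing 3x4 inputs:
print(returns_full_vt('A', 3, 4), returns_full_vt_strict('A', 3, 4))  # True False
```

Under the first reading the zero row is a library bug; under the second, numpy is reading memory LAPACK never promised to fill.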
On 4/26/11 11:49 AM, Pauli Virtanen wrote:
Tue, 26 Apr 2011 11:36:19 -0500, Jason Grout wrote: [clip]
Okay, just one more data point. Our people that are seeing the problem with numpy returning a non-unitary V also see a non-unitary V being returned by the test C call to zgesdd. In other words, it really appears that zgesdd follows the Lapack docs, and if rows<columns, the returned V is not necessarily unitary, but may contain a zero row. This makes numpy's assumptions in using zgesdd false.
You can see this report at http://trac.sagemath.org/sage_trac/ticket/11248#comment:25
What LAPACK promises is not very clear. Earlier in the man page:
JOBZ    (input) CHARACTER*1
        Specifies options for computing all or part of the matrix U:
        = 'A': all M columns of U and all N rows of V**H are
               returned in the arrays U and VT;
        = 'S': the first min(M,N) columns of U and the first
               min(M,N) rows of V**H are returned in the arrays U
               and VT;
        = 'O': If M >= N, the first N columns of U are overwritten
               in the array A and all rows of V**H are returned in
               the array VT; otherwise, all columns of U are
               returned in the array U and the first M rows of
               V**H are overwritten in the array A;
        = 'N': no columns of U or rows of V**H are computed.
Looks to me the statement should be parsed as:
return_n_rows = (jobz == 'A') or (jobz == 'O' and m >= n)
So the current usage should be OK. (At least, as long as jobz == 'A', in the other cases, I don't think the wrapper does the right thing.)
But apparently either there's a bug, or the LAPACK man page needs to be understood as you say.
Ah, you're right that it makes sense to parse their statement that way too, so I'm not so sure what Lapack really is saying anymore either. If it's parsed the way you propose (which makes sense given the JOBZ description), I think it points to a bug in the Accelerate Lapack on the affected platforms, as we get the same zero row when we call the function directly from C, without numpy, python, or Sage in the middle. The updated rwork calculation makes no difference with a 3x4 matrix (both the old calculation and the new calculation give 66 in the 3x4 case), so I don't think that is affecting anything. Jason
On Tue, 26 Apr 2011 14:52:52 -0500, Jason Grout wrote: [clip]
The updated rwork calculation makes no difference with a 3x4 matrix (both the old calculation and the new calculation give 66 in the 3x4 case), so I don't think that is affecting anything.
Actually, there *is* a difference for the 3x4 matrix:

old = 3*3*5 + 5*3 = 60
new = 3*3*5 + 7*3 = 66

The old calculation had 5 instead of 7 in the formula --- I don't know if it was written according to an older version of the documentation, or if it was simply a bug.

Pauli
On 4/26/11 3:29 PM, Pauli Virtanen wrote:
On Tue, 26 Apr 2011 14:52:52 -0500, Jason Grout wrote: [clip]
The updated rwork calculation makes no difference with a 3x4 matrix (both the old calculation and the new calculation give 66 in the 3x4 case), so I don't think that is affecting anything.
Actually, there *is* a difference for the 3*4 matrix:
old = 3*3*5 + 5*3 = 60 new = 3*3*5 + 7*3 = 66
The old calculation had 5 instead of 7 in the formula --- I don't know if it was written according to an older version of the documentation, or if it was simply a bug.
I was talking about our C example program, based on the example from Intel [1]. Intel already had the 3*3*5 + 7*3 calculation; that's what I called the "old" calculation. The "new" calculation I was referring to was the min(M,N)*max(5*min(M,N)+7, 2*max(M,N)+2*min(M,N)+1) formula from LAPACK 3.2.2.

Thanks,

Jason

[1] http://software.intel.com/sites/products/documentation/hpc/mkl/lapack/mkl_la...
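[The three RWORK formulas being compared in this exchange are easy to mix up; a small sketch makes the 3x4 case concrete. The function names are illustrative only.]

```python
def rwork_numpy_old(m, n):
    mn = min(m, n)
    return 5 * mn * mn + 5 * mn          # formula numpy had before the pull request

def rwork_intel(m, n):
    mn = min(m, n)
    return 5 * mn * mn + 7 * mn          # formula in the Intel example program

def rwork_lapack_322(m, n):
    mn, mx = min(m, n), max(m, n)
    # corrected formula from the LAPACK 3.2.2 docs (bug 0046)
    return mn * max(5 * mn + 7, 2 * mx + 2 * mn + 1)

print(rwork_numpy_old(3, 4), rwork_intel(3, 4), rwork_lapack_322(3, 4))  # 60 66 66
```

So for a 3x4 matrix the Intel example and the corrected LAPACK formula agree at 66, while numpy's old formula undersized the workspace at 60.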
On 4/26/11 11:49 AM, Pauli Virtanen wrote:
But apparently either there's a bug, or the LAPACK man page needs to be understood as you say.
I've posted a question to the Lapack forum: http://icl.cs.utk.edu/lapack-forum/viewtopic.php?f=2&t=2402 Thanks, Jason
On 4/26/11 3:18 PM, Jason Grout wrote:
On 4/26/11 11:49 AM, Pauli Virtanen wrote:
But apparently either there's a bug, or the LAPACK man page needs to be understood as you say.
I've posted a question to the Lapack forum:
We've got a reply now: http://icl.cs.utk.edu/lapack-forum/viewtopic.php?f=2&t=2402 Pauli, you're right that the statement should be parsed as: "If (JOBZ = 'A') or (JOBZ = 'O' and M >= N), VT contains the N-by-N unitary matrix V**H;" so you're right that the function is written correctly in the case we are bringing up (but still has an error in assuming too much when JOBZ is not 'A'). We'll work on providing an example with the Accelerate framework and the zero row. Thanks, Jason
On Tue, Apr 26, 2011 at 6:36 PM, Jason Grout <jason-sage@creativetrax.com> wrote:
On 4/26/11 11:22 AM, Jason Grout wrote:
On 4/26/11 11:12 AM, Jason Grout wrote:
On 4/26/11 11:07 AM, Jason Grout wrote:
And indeed, I get a 0 row as the last row of the V**H matrix
I just double-checked things one last time and saw that I actually hadn't changed the first argument of zgesdd to "A" in the program that I actually ran. So with this change, I get a nonzero last row of the V**H matrix from the C call to zgesdd. So everything is consistent between the C call to zgesdd and the numpy svd call.
So now my remaining question is: if the Lapack docs only say that V**H is the full n-by-n matrix if M>=N, why is numpy returning it even if M<N?
One more post talking to myself...
I notice that the zgesvd routine docs guarantee that the V returned is unitary, regardless of the size of A. So this might be another argument for calling zgesvd instead of zgesdd.
Okay, just one more data point. Our people that are seeing the problem with numpy returning a non-unitary V also see a non-unitary V being returned by the test C call to zgesdd. In other words, it really appears that zgesdd follows the Lapack docs, and if rows<columns, the returned V is not necessarily unitary, but may contain a zero row. This makes numpy's assumptions in using zgesdd false.
You can see this report at http://trac.sagemath.org/sage_trac/ticket/11248#comment:25
If it is an Accelerate bug it got fixed on OS X 10.6. Below is the output with both current master and 1.5.1. It may be worth checking this on 10.5 with the binary Numpy installer from Sourceforge to make sure it's not the Sage build process causing this somehow. Ralf
>>> import numpy
>>> import scipy.linalg
>>> A = numpy.array([[ 1 - 1j,     -3j, -2 + 1j,      1, -2 + 3j],
...                  [ 1 - 1j, -2 + 1j,  1 + 4j,      0,  2 + 1j],
...                  [     -1, -5 + 1j, -2 + 1j, 1 + 1j, -5 - 4j],
...                  [-2 + 4j,  2 - 1j,  8 - 4j, 1 - 8j,  3 - 2j]])
>>> U, S, VH = scipy.linalg.svd(A)
>>> VH
array([[-0.10378056+0.25534799j,  0.22371070-0.07765616j,  0.55736639-0.40276461j, -0.04874824-0.50177439j,  0.32025785-0.19584279j],
       [-0.28067375-0.0116824j , -0.49586800+0.1305711j ,  0.14432786+0.08803575j,  0.21076793+0.04575091j, -0.19220732-0.73899332j],
       [ 0.10136055-0.1681669j ,  0.13465050+0.69687595j,  0.19709231-0.03497225j, -0.42056091+0.37188099j,  0.28760069-0.14046187j],
       [-0.27289262+0.66077235j,  0.06177902+0.25464336j, -0.52050549+0.06307174j,  0.11847581-0.00883692j,  0.35771384-0.05719981j],
       [ 0.49800436+0.21785287j, -0.32014745+0.07789755j,  0.15924098-0.39775924j,  0.50925102+0.33273359j,  0.14247809+0.14849585j]])
>>> numpy.dot(VH, VH.conj().T)
array([[  1.00000000e+00 +1.38777878e-17j,  -8.32667268e-17 +2.84494650e-16j,   1.38777878e-16 -1.45716772e-16j,  -1.94289029e-16 +1.38777878e-17j,  -3.46944695e-17 +2.08166817e-17j],
       [ -8.32667268e-17 -2.77555756e-16j,   1.00000000e+00 +0.00000000e+00j,  -7.56339436e-16 +1.38777878e-16j,   5.55111512e-17 -1.66533454e-16j,   2.11636264e-16 +1.11022302e-16j],
       [  1.38777878e-16 +1.17961196e-16j,  -7.56339436e-16 -1.28369537e-16j,   1.00000000e+00 +6.93889390e-18j,  -3.88578059e-16 +4.16333634e-17j,  -1.04083409e-16 +3.12250226e-17j],
       [ -1.94289029e-16 -1.04083409e-17j,   5.55111512e-17 +1.99493200e-16j,  -3.88578059e-16 -2.08166817e-17j,   1.00000000e+00 -6.93889390e-18j,  -9.71445147e-17 -8.15320034e-17j],
       [ -3.46944695e-17 +2.08166817e-17j,   2.11636264e-16 -1.00613962e-16j,  -1.04083409e-16 -4.16333634e-17j,  -9.71445147e-17 +6.93889390e-17j,   1.00000000e+00 +0.00000000e+00j]])
On 04/26/2011 10:19 AM, Ralf Gommers wrote:
If it is an Accelerate bug it got fixed on OS X 10.6. Below is the output with both current master and 1.5.1. It may be worth checking this on 10.5 with the binary Numpy installer from Sourceforge to make sure it's not the Sage build process causing this somehow.
Thanks, Ralf. Indeed, we have no reports of problems on OSX 10.6, and it is likely there have been many tests on that platform. So a test of numpy on 10.5, independent of Sage's build process, would be useful. I don't have access to a Mac, but I posted your suggestion to the Sage ticket for this, where hopefully somebody will take it up.
participants (5):
- Jason Grout
- Paul Anton Letnes
- Pauli Virtanen
- Ralf Gommers
- Rob Beezer