Matrix rank default tolerance - is it too low?

Hi,

I noticed that numpy.linalg.matrix_rank sometimes gives full rank for matrices that are numerically rank deficient.

If I repeatedly make random matrices, then set the first column to be equal to the sum of the second and third columns:

def make_deficient():
    X = np.random.normal(size=(40, 10))
    deficient_X = X.copy()
    deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
    return deficient_X

then the current numpy.linalg.matrix_rank algorithm returns full rank (10) in about 8 percent of cases (see appended script).

I think this is a tolerance problem. The ``matrix_rank`` algorithm does this by default:

S = spl.svd(M, compute_uv=False)
tol = S.max() * np.finfo(S.dtype).eps
return np.sum(S > tol)

I guess we'd want the lowest tolerance that nearly always or always identifies numerically rank deficient matrices. I suppose one way of looking at whether the tolerance is in the right range is to compare the calculated tolerance (``tol``) to the minimum singular value (``S.min()``), because S.min() in our case should be very small and indicate the rank deficiency. The mean value of tol / S.min() for the current algorithm, across many iterations, is about 2.8. We might hope this value would be higher than 1, but not much higher, otherwise we might be rejecting too many columns.

Our current algorithm for tolerance is the same as the 2-norm of M * eps. We're citing Golub and Van Loan for this, but now I look at our copy (p 261, last para) - they seem to be suggesting using u * |M| where u = (p 61, section 2.4.2) eps / 2 (see [1]). I think the Golub and Van Loan suggestion corresponds to:

tol = np.linalg.norm(M, np.inf) * np.finfo(M.dtype).eps / 2

This tolerance gives full rank for these rank-deficient matrices in about 39 percent of cases (tol / S.min() ratio of 1.7).

We see on p 56 (section 2.3.2) that:

m, n = M.shape
1 / sqrt(n) * |M|_{inf} <= |M|_2

So we can get an upper bound on |M|_{inf} with |M|_2 * sqrt(n). Setting:

tol = S.max() * np.finfo(M.dtype).eps / 2 * np.sqrt(n)

gives about 0.5 percent error (tol / S.min() of 4.4).

Using the Mathworks threshold [2]:

tol = S.max() * np.finfo(M.dtype).eps * max((m, n))

There are no false negatives (0 percent rank 10), but tol / S.min() is around 110 - so conservative, in this case.

So - summary - I'm worrying our current threshold is too small, letting through many rank-deficient matrices without detection. I may have misread Golub and Van Loan, but maybe we aren't doing what they suggest. Maybe what we could use is either the MATLAB threshold or something like:

tol = S.max() * np.finfo(M.dtype).eps * np.sqrt(n)

- so 2 * the upper bound for the inf norm = 2 * |M|_2 * sqrt(n). This gives 0 percent misses and tol / S.min() of 8.7.

What do y'all think?

Best,

Matthew

[1] http://matthew-brett.github.com/pydagogue/floating_error.html#machine-epsilo...
[2] http://www.mathworks.com/help/techdoc/ref/rank.html

Output from script:

Percent undetected current: 9.8, tol / S.min(): 2.762
Percent undetected inf norm: 39.1, tol / S.min(): 1.667
Percent undetected upper bound inf norm: 0.5, tol / S.min(): 4.367
Percent undetected upper bound inf norm * 2: 0.0, tol / S.min(): 8.734
Percent undetected MATLAB: 0.0, tol / S.min(): 110.477

<script>
import numpy as np
import scipy.linalg as npl

M = 40
N = 10

def make_deficient():
    X = np.random.normal(size=(M, N))
    deficient_X = X.copy()
    if M > N:  # Make a column deficient
        deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
    else:  # Make a row deficient
        deficient_X[0] = deficient_X[1] + deficient_X[2]
    return deficient_X

matrices = []
ranks = []
ranks_inf = []
ranks_ub_inf = []
ranks_ub2_inf = []
ranks_mlab = []
tols = np.zeros((1000, 6))
for i in range(1000):
    m = make_deficient()
    matrices.append(m)
    # The SVD tolerances
    S = npl.svd(m, compute_uv=False)
    S0 = S.max()
    # u in Golub and Van Loan == numpy eps / 2
    eps = np.finfo(m.dtype).eps
    u = eps / 2
    # Current numpy matrix_rank algorithm
    ranks.append(np.linalg.matrix_rank(m))
    # Which is the same as:
    tol_s0 = S0 * eps
    # ranks.append(np.linalg.matrix_rank(m, tol=tol_s0))
    # Golub and Van Loan suggestion
    tol_inf = npl.norm(m, np.inf) * u
    ranks_inf.append(np.linalg.matrix_rank(m, tol=tol_inf))
    # Upper bound of |X|_{inf}
    tol_ub_inf = tol_s0 * np.sqrt(N) / 2
    ranks_ub_inf.append(np.linalg.matrix_rank(m, tol=tol_ub_inf))
    # Times 2 fudge
    tol_ub2_inf = tol_s0 * np.sqrt(N)
    ranks_ub2_inf.append(np.linalg.matrix_rank(m, tol=tol_ub2_inf))
    # MATLAB algorithm
    tol_mlab = tol_s0 * max(m.shape)
    ranks_mlab.append(np.linalg.matrix_rank(m, tol=tol_mlab))
    # Collect tols
    tols[i] = tol_s0, tol_inf, tol_ub_inf, tol_ub2_inf, tol_mlab, S.min()

rel_tols = tols / tols[:, -1][:, None]

fmt = 'Percent undetected %s: %3.1f, tol / S.min(): %2.3f'
max_rank = min(M, N)
for name, ranks, mrt in zip(
        ('current', 'inf norm', 'upper bound inf norm',
         'upper bound inf norm * 2', 'MATLAB'),
        (ranks, ranks_inf, ranks_ub_inf, ranks_ub2_inf, ranks_mlab),
        rel_tols.mean(axis=0)[:5]):
    pcnt = np.sum(np.array(ranks) == max_rank) / 1000. * 100
    print fmt % (name, pcnt, mrt)
</script>
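As a minimal illustration of the comparison above (a sketch alongside the appended script, with illustrative names), one deficient matrix and the candidate thresholds side by side:

import numpy as np

# Minimal sketch: one rank-deficient 40 x 10 matrix and the candidate
# tolerances discussed above (names are illustrative only).
X = np.random.normal(size=(40, 10))
X[:, 0] = X[:, 1] + X[:, 2]             # column 0 = column 1 + column 2
m, n = X.shape
S = np.linalg.svd(X, compute_uv=False)  # singular values, largest first
eps = np.finfo(X.dtype).eps
candidates = [
    ('current', S.max() * eps),
    ('Golub and Van Loan', np.linalg.norm(X, np.inf) * eps / 2),
    ('upper bound inf norm', S.max() * eps / 2 * np.sqrt(n)),
    ('upper bound inf norm * 2', S.max() * eps * np.sqrt(n)),
    ('MATLAB', S.max() * eps * max(m, n)),
]
for name, tol in candidates:
    print('%-25s rank %2d  tol / S.min() %8.3f'
          % (name, np.sum(S > tol), tol / S.min()))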

On Thu, Jun 14, 2012 at 8:06 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
So - summary - I'm worrying our current threshold is too small, letting through many rank-deficient matrices without detection. I may have misread Golub and Van Loan, but maybe we aren't doing what they suggest. Maybe what we could use is either the MATLAB threshold or something like:
tol = S.max() * np.finfo(M.dtype).eps * np.sqrt(n)
- so 2 * the upper bound for the inf norm = 2 * |M|_2 * sqrt(n) . This gives 0 percent misses and tol / S.min() of 8.7.
The polynomial fitting uses eps times the largest array dimension for the relative condition number. IIRC, that choice traces back to numerical recipes.

Chuck
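As an illustrative sketch (not NumPy source) of where that choice surfaces: np.polyfit's documented default for its relative condition number rcond is len(x) * eps, and the underlying least-squares solve drops singular values below rcond times the largest singular value:

import numpy as np

# Illustrative sketch (not NumPy source): polyfit's rcond default is
# len(x) * eps, a relative condition number; the underlying least-squares
# solve treats singular values below rcond * S.max() as zero.
x = np.linspace(0., 1., 50)
y = 1.0 + 2.0 * x - 3.0 * x ** 2
eps = np.finfo(float).eps
rcond = len(x) * eps                  # the documented default
coeffs = np.polyfit(x, y, 2, rcond=rcond)
# Equivalent absolute cutoff for a Vandermonde design matrix of the same fit:
V = np.vander(x, 3)
S = np.linalg.svd(V, compute_uv=False)
print('coefficients: %s' % coeffs)
print('absolute cutoff rcond * S.max(): %g' % (rcond * S.max()))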

Hi,

On Thu, Jun 14, 2012 at 8:10 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
The polynomial fitting uses eps times the largest array dimension for the relative condition number. IIRC, that choice traces back to numerical recipes.
The problem with that as a general metric would be that larger values in the data would tend to lead to larger numerical error, but this won't be reflected in the tolerance. For example, running my script with your metric finds 0 errors for the normal distribution with variance 1, but 100 percent error for a variance of 1000:

Percent undetected current: 0.0, tol / S.min(): 3.545
Percent undetected inf norm: 0.0, tol / S.min(): 2.139
Percent undetected upper bound inf norm: 0.0, tol / S.min(): 5.605
Percent undetected upper bound inf norm * 2: 0.0, tol / S.min(): 11.209
Percent undetected MATLAB: 0.0, tol / S.min(): 141.785
Percent undetected Chuck: 100.0, tol / S.min(): 0.013

My worry is that I haven't sampled the space of all possible matrix sizes and scalings. A first pass suggests that the values will be different on different platforms (at least, between a 32-bit PPC and a 64-bit Intel). I think the tolerance is wrong at the moment, and it looks like the Golub and Van Loan suggestion will not work as written. The MATLAB algorithm is some kind of standard and has been battle tested. If we are going to change, it seems tempting to use that.

What do you think?

Matthew
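A minimal sketch of the scale effect described above, treating eps * max(m, n) as a bare absolute cutoff (the reading used in the script; the next message clarifies the intended relative use):

import numpy as np

# Sketch: an absolute cutoff of eps * max(m, n) ignores the overall scale of
# the data, so rescaling the same deficient matrix changes the detected rank.
X = np.random.normal(size=(40, 10))
X[:, 0] = X[:, 1] + X[:, 2]              # numerically rank deficient
eps = np.finfo(float).eps
abs_tol = eps * max(X.shape)             # fixed absolute threshold
for scale in (1.0, 1000.0):
    S = np.linalg.svd(scale * X, compute_uv=False)
    # typically rank 9 at scale 1 but full rank 10 at scale 1000
    print('scale %6g: rank %d' % (scale, np.sum(S > abs_tol)))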

On Fri, Jun 15, 2012 at 6:39 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
The problem for that as a general metric would be that larger values in the data would tend to lead to larger numerical error, but this won't be reflected in the tolerance. For example, running my script with your metric finds 0 errors for the normal distribution var = 1, but 100 percent error for a variance of 1000.
It's a *relative* condition number. You need to multiply it by the largest singular value to find the cutoff.
Chuck
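A minimal sketch of that point, assuming the relative condition number eps * max(m, n) is multiplied by S.max() to get the absolute cutoff (the same as the MATLAB threshold above, and invariant to rescaling the data):

import numpy as np

# Sketch: relative condition number -> absolute cutoff via S.max(), so the
# detected rank no longer depends on the overall scale of the matrix.
X = np.random.normal(size=(40, 10))
X[:, 0] = X[:, 1] + X[:, 2]              # numerically rank deficient
eps = np.finfo(float).eps
rcond = eps * max(X.shape)               # relative condition number
for scale in (1.0, 1000.0):
    S = np.linalg.svd(scale * X, compute_uv=False)
    tol = rcond * S.max()                # cutoff scales with the data
    print('scale %6g: rank %d' % (scale, np.sum(S > tol)))  # 9 at both scales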

On Fri, Jun 15, 2012 at 4:10 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
The polynomial fitting uses eps times the largest array dimension for the relative condition number. IIRC, that choice traces back to numerical recipes.
This is the same as Matlab, right? If the Matlab condition is the most conservative, then it seems like a reasonable choice -- conservative is good so long as your false positive rate doesn't become too high, and presumably Matlab has enough user experience to know whether the false positive rate is too high.

-N

Hi,

On Sat, Jun 16, 2012 at 10:40 AM, Nathaniel Smith <njs@pobox.com> wrote:
The polynomial fitting uses eps times the largest array dimension for the relative condition number. IIRC, that choice traces back to numerical recipes.
Chuck - sorry - I didn't understand what you were saying, and now I think you were proposing the MATLAB algorithm. I can't find that in Numerical Recipes - can you? It would be helpful as a reference.
This is the same as Matlab, right?
Yes, I believe so, i.e.:

tol = S.max() * np.finfo(M.dtype).eps * max((m, n))

from my original email.
If the Matlab condition is the most conservative, then it seems like a reasonable choice -- conservative is good so long as your false positive rate doesn't become too high, and presumably Matlab has enough user experience to know whether the false positive rate is too high.
Are we agreeing to go for the Matlab algorithm? If so, how should this be managed? Just changing it may change the output of code using numpy >= 1.5.0, but then again, the threshold is probably incorrect. Fix and break, or FutureWarning with something like:

def matrix_rank(M, tol=None):

where ``tol`` can be a string like ``maxdim``?

Best,

Matthew
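A rough, purely hypothetical sketch of what such a transition could look like; the string value 'maxdim', the warning text, and the structure are placeholders rather than an agreed API:

import warnings
import numpy as np

def matrix_rank(M, tol=None):
    # Hypothetical transition wrapper, not NumPy's implementation.
    M = np.asarray(M)
    S = np.linalg.svd(M, compute_uv=False)
    if tol is None:
        warnings.warn("The default tolerance of matrix_rank may change to "
                      "S.max() * max(M.shape) * eps; pass tol='maxdim' to opt "
                      "in, or an explicit number to keep current behaviour.",
                      FutureWarning)
        tol = S.max() * np.finfo(S.dtype).eps                  # current default
    elif tol == 'maxdim':
        tol = S.max() * np.finfo(S.dtype).eps * max(M.shape)   # MATLAB-style
    return np.sum(S > tol)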

Hi,

On Sat, Jun 16, 2012 at 8:03 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
Are we agreeing to go for the Matlab algorithm?
As extra data, current Numerical Recipes (2007, p 67) appears to prefer:

tol = S.max() * np.finfo(M.dtype).eps / 2. * np.sqrt(m + n + 1.)

There's a discussion of algorithms in:

@article{konstantinides1988statistical,
  title={Statistical analysis of effective singular values in matrix rank determination},
  author={Konstantinides, K. and Yao, K.},
  journal={Acoustics, Speech and Signal Processing, IEEE Transactions on},
  volume={36},
  number={5},
  pages={757--763},
  year={1988},
  publisher={IEEE}
}

Yes, restricted access: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1585&tag=1

Cheers,

Matthew
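For the 40 x 10 shape used in the scripts, a quick sketch comparing the Numerical Recipes factor with the MATLAB factor (both get multiplied by S.max(), so only the factors matter):

import numpy as np

# Sketch: compare the two threshold factors for a 40 x 10 matrix; both are
# multiplied by S.max(), so their ratio depends only on the shape.
m, n = 40, 10
eps = np.finfo(float).eps
nr_factor = eps / 2. * np.sqrt(m + n + 1.)   # Numerical Recipes (2007)
matlab_factor = eps * max(m, n)              # MATLAB / Mathworks
print('NR / MATLAB factor ratio: %.3f' % (nr_factor / matlab_factor))  # ~0.089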

Hi,

On Sat, Jun 16, 2012 at 1:33 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
As extra data, current Numerical Recipes (2007, p 67) appears to prefer:
tol = S.max() * np.finfo(M.dtype).eps / 2. * np.sqrt(m + n + 1.)
To add extra confusing flames to this fire, a survey of random matrices of sizes M = (3, 5, 10, 50, 100, 500), N = (3, 5, 10, 50, 100, 500) in all combinations suggests that this last Numerical Recipes algorithm does not give false negatives, and the threshold is generally considerably lower than the MATLAB algorithm. At least for this platform (64-bit Linux), and with only one rank deficient column / row.

My feeling is still that the risk of using the matlab version is less, and the risk of too many false positives is relatively small. If anyone disagrees, it might be worth running the test rig for other parameters and platforms.

Best,

Matthew

import numpy as np
import numpy.linalg as npl

algs = (('current', lambda M, S, eps2: S.max() * eps2),
        ('inf norm', lambda M, S, eps2: npl.norm(M, np.inf) * eps2 / 2),
        ('ub inf norm', lambda M, S, eps2:
            S.max() * eps2 / 2 * np.sqrt(M.shape[1])),
        ('ub inf norm * 2', lambda M, S, eps2:
            S.max() * eps2 * np.sqrt(M.shape[1])),
        ('NR', lambda M, S, eps2:
            S.max() * eps2 / 2 * np.sqrt(sum(M.shape + (1,)))),
        ('MATLAB', lambda M, S, eps2: S.max() * eps2 * max(M.shape)),
       )

def make_deficient(M, N, loc=0, scale=1):
    X = np.random.normal(loc=loc, scale=scale, size=(M, N))
    deficient_X = X.copy()
    if M > N:  # Make a column deficient
        deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
    else:  # Make a row deficient
        deficient_X[0] = deficient_X[1] + deficient_X[2]
    return deficient_X

def doit(algs=algs):
    results = {}
    n_iters = 1000
    n_algs = len(algs)
    pcnt_div = n_iters / 100.
    tols = np.zeros((n_iters, n_algs))
    ranks = np.zeros((n_iters, n_algs))
    eps2 = np.finfo(float).eps
    for M in (3, 5, 10, 50, 100, 500):
        for N in (3, 5, 10, 50, 100, 500):
            max_rank = min(M, N)
            svs = np.zeros((n_iters, max_rank))
            for loc in (0, 100, 1000):
                for scale in (1, 100, 1000, 10000):
                    for i in range(n_iters):
                        m = make_deficient(M, N, loc, scale)
                        # The SVD tolerances
                        S = npl.svd(m, compute_uv=False)
                        svs[i] = np.sort(S)
                        for j, alg in enumerate(algs):
                            name, func = alg
                            tols[i, j] = func(m, S, eps2)
                            ranks[i, j] = np.sum(S > tols[i, j])
                        del m, S
                    rel_tols = tols / svs[:, 0][:, None]
                    key = (M, N, loc, scale)
                    print key
                    pcnts = np.sum(ranks == max_rank, axis=0) / pcnt_div
                    mrtols = np.mean(rel_tols, axis=0)
                    results[key] = (pcnts, mrtols)
    return results
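A small sketch of the shape dependence behind that comparison, checking only the threshold factors (both are multiplied by S.max()) over the surveyed sizes:

import numpy as np

# Sketch: over all surveyed shapes, the Numerical Recipes factor stays below
# the MATLAB factor, consistent with the "considerably lower" observation.
eps = np.finfo(float).eps
for M in (3, 5, 10, 50, 100, 500):
    for N in (3, 5, 10, 50, 100, 500):
        nr = eps / 2. * np.sqrt(M + N + 1.)
        mlab = eps * max(M, N)
        assert nr < mlab, (M, N)
print('NR factor < MATLAB factor for all surveyed shapes')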

On Sat, Jun 16, 2012 at 2:33 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
Hi,
Hi,
On Sat, Jun 16, 2012 at 10:40 AM, Nathaniel Smith <njs@pobox.com> wrote:
On Fri, Jun 15, 2012 at 4:10 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Thu, Jun 14, 2012 at 8:06 PM, Matthew Brett <
matthew.brett@gmail.com>
wrote:
Hi,
I noticed that numpy.linalg.matrix_rank sometimes gives full rank for matrices that are numerically rank deficient:
If I repeatedly make random matrices, then set the first column to be equal to the sum of the second and third columns:
def make_deficient(): X = np.random.normal(size=(40, 10)) deficient_X = X.copy() deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2] return deficient_X
then the current numpy.linalg.matrix_rank algorithm returns full rank (10) in about 8 percent of cases (see appended script).
I think this is a tolerance problem. The ``matrix_rank`` algorithm does this by default:
S = spl.svd(M, compute_uv=False) tol = S.max() * np.finfo(S.dtype).eps return np.sum(S > tol)
I guess we'd we want the lowest tolerance that nearly always or always identifies numerically rank deficient matrices. I suppose one way of looking at whether the tolerance is in the right range is to compare the calculated tolerance (``tol``) to the minimum singular value (``S.min()``) because S.min() in our case should be very small and indicate the rank deficiency. The mean value of tol / S.min() for the current algorithm, across many iterations, is about 2.8. We might hope this value would be higher than 1, but not much higher, otherwise we might be rejecting too many columns.
Our current algorithm for tolerance is the same as the 2-norm of M * eps. We're citing Golub and Van Loan for this, but now I look at our copy (p 261, last para) - they seem to be suggesting using u * |M| where u = (p 61, section 2.4.2) eps / 2. (see [1]). I think the Golub and Van Loan suggestion corresponds to:
tol = np.linalg.norm(M, np.inf) * np.finfo(M.dtype).eps / 2
This tolerance gives full rank for these rank-deficient matrices in about 39 percent of cases (tol / S.min() ratio of 1.7)
We see on p 56 (section 2.3.2) that:
m, n = M.shape 1 / sqrt(n) . |M|_{inf} <= |M|_2
So we can get an upper bound on |M|_{inf} with |M|_2 * sqrt(n).
Setting:
tol = S.max() * np.finfo(M.dtype).eps / 2 * np.sqrt(n)
gives about 0.5 percent error (tol / S.min() of 4.4)
Using the Mathworks threshold [2]:
tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
There are no false negatives (0 percent rank 10), but tol / S.min() is around 110 - so conservative, in this case.
So - summary - I'm worrying our current threshold is too small, letting through many rank-deficient matrices without detection. I may have misread Golub and Van Loan, but maybe we aren't doing what they suggest. Maybe what we could use is either the MATLAB threshold or something like:
tol = S.max() * np.finfo(M.dtype).eps * np.sqrt(n)
- so 2 * the upper bound for the inf norm = 2 * |M|_2 * sqrt(n) . This gives 0 percent misses and tol / S.min() of 8.7.
What do y'all think?
Best,
Matthew
[1]
http://matthew-brett.github.com/pydagogue/floating_error.html#machine-epsilo...
[2] http://www.mathworks.com/help/techdoc/ref/rank.html
Output from script:
Percent undetected current: 9.8, tol / S.min(): 2.762 Percent undetected inf norm: 39.1, tol / S.min(): 1.667 Percent undetected upper bound inf norm: 0.5, tol / S.min(): 4.367 Percent undetected upper bound inf norm * 2: 0.0, tol / S.min(): 8.734 Percent undetected MATLAB: 0.0, tol / S.min(): 110.477
<script> import numpy as np import scipy.linalg as npl
M = 40 N = 10
def make_deficient(): X = np.random.normal(size=(M, N)) deficient_X = X.copy() if M > N: # Make a column deficient deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2] else: # Make a row deficient deficient_X[0] = deficient_X[1] + deficient_X[2] return deficient_X
matrices = [] ranks = [] ranks_inf = [] ranks_ub_inf = [] ranks_ub2_inf = [] ranks_mlab = [] tols = np.zeros((1000, 6)) for i in range(1000): m = make_deficient() matrices.append(m) # The SVD tolerances S = npl.svd(m, compute_uv=False) S0 = S.max() # u in Golub and Van Loan == numpy eps / 2 eps = np.finfo(m.dtype).eps u = eps / 2 # Current numpy matrix_rank algorithm ranks.append(np.linalg.matrix_rank(m)) # Which is the same as: tol_s0 = S0 * eps # ranks.append(np.linalg.matrix_rank(m, tol=tol_s0)) # Golub and Van Loan suggestion tol_inf = npl.norm(m, np.inf) * u ranks_inf.append(np.linalg.matrix_rank(m, tol=tol_inf)) # Upper bound of |X|_{inf} tol_ub_inf = tol_s0 * np.sqrt(N) / 2 ranks_ub_inf.append(np.linalg.matrix_rank(m, tol=tol_ub_inf)) # Times 2 fudge tol_ub2_inf = tol_s0 * np.sqrt(N) ranks_ub2_inf.append(np.linalg.matrix_rank(m, tol=tol_ub2_inf)) # MATLAB algorithm tol_mlab = tol_s0 * max(m.shape) ranks_mlab.append(np.linalg.matrix_rank(m, tol=tol_mlab)) # Collect tols tols[i] = tol_s0, tol_inf, tol_ub_inf, tol_ub2_inf, tol_mlab, S.min()
rel_tols = tols / tols[:, -1][:, None]
fmt = 'Percent undetected %s: %3.1f, tol / S.min(): %2.3f' max_rank = min(M, N) for name, ranks, mrt in zip( ('current', 'inf norm', 'upper bound inf norm', 'upper bound inf norm * 2', 'MATLAB'), (ranks, ranks_inf, ranks_ub_inf, ranks_ub2_inf, ranks_mlab), rel_tols.mean(axis=0)[:5]): pcnt = np.sum(np.array(ranks) == max_rank) / 1000. * 100 print fmt % (name, pcnt, mrt) </script>
The polynomial fitting uses eps times the largest array dimension for
On Sat, Jun 16, 2012 at 8:03 PM, Matthew Brett <matthew.brett@gmail.com> wrote: the
relative condition number. IIRC, that choice traces back to numerical recipes.
Chuck - sorry - I didn't understand what you were saying, and now I think you were proposing the MATLAB algorithm. I can't find that in Numerical Recipes - can you? It would be helpful as a reference.
This is the same as Matlab, right?
Yes, I believe so, i.e.:
tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
from my original email.
If the Matlab condition is the most conservative, then it seems like a reasonable choice -- conservative is good so long as your false positive rate doesn't become too high, and presumably Matlab has enough user experience to know whether the false positive rate is too high.
Are we agreeing to go for the Matlab algorithm?
As extra data, current Numerical Recipes (2007, p 67) appears to prefer:
tol = S.max() * np.finfo(M.dtype).eps / 2. * np.sqrt(m + n + 1.)
That's interesting, as something like that with a square root was my first choice for the least squares, but then someone mentioned the NR choice. That was all on the mailing list several years back when I was fixing up the polynomial fitting routine. The NR reference is on page 517 of the 1986 edition (FORTRAN), which might be hard to come by these days ;)
There's a discussion of algorithms in:
@article{konstantinides1988statistical,
  title={Statistical analysis of effective singular values in matrix rank determination},
  author={Konstantinides, K. and Yao, K.},
  journal={Acoustics, Speech and Signal Processing, IEEE Transactions on},
  volume={36},
  number={5},
  pages={757--763},
  year={1988},
  publisher={IEEE}
}
Yes, restricted access: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1585&tag=1
Cheers,
Chuck
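To make the comparison concrete, here is a rough sketch (not part of numpy, just along the lines of the appended script) that computes the current default, the MATLAB-style threshold, and the NR-2007 threshold for one rank-deficient test matrix; make_deficient repeats the construction from the script above.

import numpy as np

def make_deficient(m=40, n=10):
    # First column is the sum of the next two, so the true rank is min(m, n) - 1
    X = np.random.normal(size=(m, n))
    X[:, 0] = X[:, 1] + X[:, 2]
    return X

A = make_deficient()
m, n = A.shape
S = np.linalg.svd(A, compute_uv=False)
eps = np.finfo(A.dtype).eps
thresholds = [
    ('current', S.max() * eps),                              # numpy default at the time
    ('MATLAB / old NR', S.max() * eps * max(m, n)),
    ('NR 2007', S.max() * eps / 2. * np.sqrt(m + n + 1.)),
]
for name, tol in thresholds:
    print('%s: rank %d' % (name, np.sum(S > tol)))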

Hi,

On Sun, Jun 17, 2012 at 7:22 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
Thanks for tracking that down, it's very helpful.

For those of you not near a huge University library or your own private copy, p517 says:

"A plausible answer to the question "how small is small", is to edit in this fashion all singular values whose ratio to the largest singular value is less than N times the machine precision \epsilon. (You might argue for root N, or a constant, instead of N as the multiple; that starts getting into hardware-dependent questions.)"

Earlier (p510) we see the (General Linear Least Squares) problem being set up as A = (N x M) where N >= M.

The 2007 edition replaces the "(You might argue... )" text with (p 795):

"(This is a more conservative recommendation than the default in section 2.6 which scales as N^{1/2})"

and this in turn refers to the threshold:

tol = S.max() * np.finfo(M.dtype).eps / 2. * np.sqrt(m + n + 1.)

(p67) - which is justified as being (p71) "... a default value based on expected roundoff error". I could not at first glance see any other justification for this threshold in the text.

So, how about something like:

def matrix_rank(M, tol='maxdim'):
    ...

    tol : {'maxdim', 'nr-roundoff'} or float
        If str, gives the threshold strategy for the tolerance relative to the maximum singular value, explained below. If float, gives the absolute tolerance below which singular values are assumed zero. For the threshold strategies, we will call the maximum singular value ``S.max()`` and the floating point epsilon for the working precision data type ``eps``. The default strategy is 'maxdim', corresponding to ``tol = S.max() * eps * max(M.shape)``. This is the MATLAB default; see also Numerical Recipes 2007. The other option is 'nr-roundoff' (also from Numerical Recipes 2007), corresponding to ``tol = S.max() * eps / 2 * np.sqrt(M.shape[0] + M.shape[1] + 1)``.

?

Best,

Matthew
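For what it's worth, a minimal sketch of how that proposal could look; the 'maxdim' and 'nr-roundoff' strategy names are just the ones floated above, not anything that exists in numpy.

import numpy as np

def matrix_rank(M, tol='maxdim'):
    # Sketch of the proposed interface: tol is either a float (absolute
    # threshold) or the name of one of the two strategies discussed above.
    M = np.asarray(M)
    S = np.linalg.svd(M, compute_uv=False)
    eps = np.finfo(S.dtype).eps
    if tol == 'maxdim':          # MATLAB default; also old Numerical Recipes
        tol = S.max() * eps * max(M.shape)
    elif tol == 'nr-roundoff':   # Numerical Recipes 2007, p 67
        tol = S.max() * eps / 2. * np.sqrt(M.shape[0] + M.shape[1] + 1.)
    return np.sum(S > tol)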

Hi,

On Mon, Jun 18, 2012 at 3:50 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
Thinking about this more, I'm tempted to just go for the matlab / old NR solution, that is:

tol = S.max() * np.finfo(M.dtype).eps * max((m, n))

and not offer any other options. My reasoning is that the matrix_rank code is basically trivial; it is a convenience function. If someone wants a better matrix rank it's reasonable to copy these few lines into a new function. So providing multiple options for tolerance seems like overkill.

Then the question is whether to use NR-2007:

tol = S.max() * np.finfo(M.dtype).eps / 2. * np.sqrt(m + n + 1.)

or the matlab default above. I'm leaning to the matlab version because we can be more or less sure that it does not miss actual numerical rank deficiency, and it might be what people are expecting if they shift from matlab or compare to matlab. The NR-2007 threshold might well be closer to an accurate threshold for numerical rank deficiency, but I haven't tested it with all variants and all platforms, and the reduced false positives seem a minor gain compared to the slight increase in risk of false negatives and the difference from matlab.

What do y'all think?

Matthew
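So the no-extra-options version would amount to something like this sketch (an illustration of the idea, not the actual patch):

import numpy as np

def matrix_rank(M, tol=None):
    # Keep the existing float ``tol`` argument; only the default changes
    # to the MATLAB-style S.max() * eps * max(m, n) threshold.
    M = np.asarray(M)
    S = np.linalg.svd(M, compute_uv=False)
    if tol is None:
        tol = S.max() * np.finfo(S.dtype).eps * max(M.shape)
    return np.sum(S > tol)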

On Tue, Jun 26, 2012 at 3:42 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
I'm fine with that, and agree that it is likely to lead to fewer folks wondering why Matlab and numpy are different. A good explanation in the function documentation would be useful.

Chuck

On Tuesday, June 26, 2012, Charles R Harris wrote:
One potential problem is that it implies that it will always be the same as any version of matlab's tolerance. What if they change it in a future release? How likely are we to even notice?

Cheers!
Ben Root

Hi,

On Tue, Jun 26, 2012 at 4:39 PM, Benjamin Root <ben.root@ou.edu> wrote:
I guess that matlab is unlikely to change for the same reason that we would be reluctant to change, once we've found an acceptable value.

I was thinking that we would say something like:

"""
The default tolerance is:

tol = S.max() * np.finfo(M.dtype).eps * max((m, n))

This corresponds to the tolerance suggested in NR page X, and to the tolerance used by MATLAB at the time of writing (June 2012; see http://www.mathworks.com/help/techdoc/ref/rank.html).
"""

I don't know whether we would want to track changes made by matlab - maybe we could have that discussion if they do change?

Best,

Matthew
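As a quick sanity check on that wording, a sketch that reruns the deficient-matrix experiment from the original script with the proposed default; make_deficient is the same construction as before.

import numpy as np

def make_deficient(m=40, n=10):
    X = np.random.normal(size=(m, n))
    X[:, 0] = X[:, 1] + X[:, 2]  # true rank is n - 1
    return X

undetected = 0
for _ in range(1000):
    A = make_deficient()
    S = np.linalg.svd(A, compute_uv=False)
    tol = S.max() * np.finfo(A.dtype).eps * max(A.shape)  # proposed default
    if np.sum(S > tol) == min(A.shape):  # reported full rank despite deficiency
        undetected += 1
print('Percent undetected: %.1f' % (undetected / 1000. * 100))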

On Tue, Jun 26, 2012 at 5:46 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
I wouldn't bother tracking Matlab, but I think the alternative threshold could be mentioned in the notes. Something like:

A less conservative threshold is ...

Maybe mention that because of numerical uncertainty there will always be a chance that the computed rank could be wrong, but that with the conservative threshold the rank is very unlikely to be less than the computed rank.

Chuck
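For illustration only, the kind of Notes text being suggested might read roughly like this (sketch wording, not the eventual numpy docstring):

def matrix_rank(M, tol=None):
    """
    ...

    Notes
    -----
    The default tolerance is ``S.max() * eps * max(M.shape)``, which matches
    MATLAB and the 1986 Numerical Recipes recommendation.  A less
    conservative threshold is the Numerical Recipes (2007) choice,
    ``S.max() * eps / 2 * sqrt(M.shape[0] + M.shape[1] + 1)``.  Because of
    numerical uncertainty the computed rank can always be wrong, but with
    the conservative default it is very unlikely that the matrix is in fact
    of lower rank than the value returned.
    """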

Hi,

On Tue, Jun 26, 2012 at 5:04 PM, Charles R Harris <charlesr.harris@gmail.com> wrote:
Sounds good to me. Would anyone object to a pull request with these changes (matlab tolerance default, description in docstring)?

Cheers,

Matthew

Hi,

On Tue, Jun 26, 2012 at 7:29 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
Pull request here: https://github.com/numpy/numpy/pull/357

Cheers,

Matthew

On Sat, Jun 16, 2012 at 9:03 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
Hi,
On Sat, Jun 16, 2012 at 10:40 AM, Nathaniel Smith <njs@pobox.com> wrote:
On Fri, Jun 15, 2012 at 4:10 AM, Charles R Harris <charlesr.harris@gmail.com> wrote:
On Thu, Jun 14, 2012 at 8:06 PM, Matthew Brett <matthew.brett@gmail.com> wrote:
Hi,
I noticed that numpy.linalg.matrix_rank sometimes gives full rank for matrices that are numerically rank deficient:
If I repeatedly make random matrices, then set the first column to be equal to the sum of the second and third columns:
def make_deficient(): X = np.random.normal(size=(40, 10)) deficient_X = X.copy() deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2] return deficient_X
then the current numpy.linalg.matrix_rank algorithm returns full rank (10) in about 8 percent of cases (see appended script).
I think this is a tolerance problem. The ``matrix_rank`` algorithm does this by default:
S = spl.svd(M, compute_uv=False) tol = S.max() * np.finfo(S.dtype).eps return np.sum(S > tol)
I guess we'd we want the lowest tolerance that nearly always or always identifies numerically rank deficient matrices. I suppose one way of looking at whether the tolerance is in the right range is to compare the calculated tolerance (``tol``) to the minimum singular value (``S.min()``) because S.min() in our case should be very small and indicate the rank deficiency. The mean value of tol / S.min() for the current algorithm, across many iterations, is about 2.8. We might hope this value would be higher than 1, but not much higher, otherwise we might be rejecting too many columns.
Our current algorithm for tolerance is the same as the 2-norm of M * eps. We're citing Golub and Van Loan for this, but now I look at our copy (p 261, last para) - they seem to be suggesting using u * |M| where u = (p 61, section 2.4.2) eps / 2. (see [1]). I think the Golub and Van Loan suggestion corresponds to:
tol = np.linalg.norm(M, np.inf) * np.finfo(M.dtype).eps / 2
This tolerance gives full rank for these rank-deficient matrices in about 39 percent of cases (tol / S.min() ratio of 1.7)
We see on p 56 (section 2.3.2) that:
m, n = M.shape 1 / sqrt(n) . |M|_{inf} <= |M|_2
So we can get an upper bound on |M|_{inf} with |M|_2 * sqrt(n). Setting:
tol = S.max() * np.finfo(M.dtype).eps / 2 * np.sqrt(n)
gives about 0.5 percent error (tol / S.min() of 4.4)
Using the Mathworks threshold [2]:
tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
There are no false negatives (0 percent rank 10), but tol / S.min() is around 110 - so conservative, in this case.
So - summary - I'm worrying our current threshold is too small, letting through many rank-deficient matrices without detection. I may have misread Golub and Van Loan, but maybe we aren't doing what they suggest. Maybe what we could use is either the MATLAB threshold or something like:
tol = S.max() * np.finfo(M.dtype).eps * np.sqrt(n)
- so 2 * the upper bound for the inf norm = 2 * |M|_2 * sqrt(n) . This gives 0 percent misses and tol / S.min() of 8.7.
What do y'all think?
Best,
Matthew
[1] http://matthew-brett.github.com/pydagogue/floating_error.html#machine-epsilo... [2] http://www.mathworks.com/help/techdoc/ref/rank.html
Output from script:
Percent undetected current: 9.8, tol / S.min(): 2.762 Percent undetected inf norm: 39.1, tol / S.min(): 1.667 Percent undetected upper bound inf norm: 0.5, tol / S.min(): 4.367 Percent undetected upper bound inf norm * 2: 0.0, tol / S.min(): 8.734 Percent undetected MATLAB: 0.0, tol / S.min(): 110.477
<script> import numpy as np import scipy.linalg as npl
M = 40 N = 10
def make_deficient(): X = np.random.normal(size=(M, N)) deficient_X = X.copy() if M > N: # Make a column deficient deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2] else: # Make a row deficient deficient_X[0] = deficient_X[1] + deficient_X[2] return deficient_X
matrices = [] ranks = [] ranks_inf = [] ranks_ub_inf = [] ranks_ub2_inf = [] ranks_mlab = [] tols = np.zeros((1000, 6)) for i in range(1000): m = make_deficient() matrices.append(m) # The SVD tolerances S = npl.svd(m, compute_uv=False) S0 = S.max() # u in Golub and Van Loan == numpy eps / 2 eps = np.finfo(m.dtype).eps u = eps / 2 # Current numpy matrix_rank algorithm ranks.append(np.linalg.matrix_rank(m)) # Which is the same as: tol_s0 = S0 * eps # ranks.append(np.linalg.matrix_rank(m, tol=tol_s0)) # Golub and Van Loan suggestion tol_inf = npl.norm(m, np.inf) * u ranks_inf.append(np.linalg.matrix_rank(m, tol=tol_inf)) # Upper bound of |X|_{inf} tol_ub_inf = tol_s0 * np.sqrt(N) / 2 ranks_ub_inf.append(np.linalg.matrix_rank(m, tol=tol_ub_inf)) # Times 2 fudge tol_ub2_inf = tol_s0 * np.sqrt(N) ranks_ub2_inf.append(np.linalg.matrix_rank(m, tol=tol_ub2_inf)) # MATLAB algorithm tol_mlab = tol_s0 * max(m.shape) ranks_mlab.append(np.linalg.matrix_rank(m, tol=tol_mlab)) # Collect tols tols[i] = tol_s0, tol_inf, tol_ub_inf, tol_ub2_inf, tol_mlab, S.min()
rel_tols = tols / tols[:, -1][:, None]
fmt = 'Percent undetected %s: %3.1f, tol / S.min(): %2.3f' max_rank = min(M, N) for name, ranks, mrt in zip( ('current', 'inf norm', 'upper bound inf norm', 'upper bound inf norm * 2', 'MATLAB'), (ranks, ranks_inf, ranks_ub_inf, ranks_ub2_inf, ranks_mlab), rel_tols.mean(axis=0)[:5]): pcnt = np.sum(np.array(ranks) == max_rank) / 1000. * 100 print fmt % (name, pcnt, mrt) </script>
The polynomial fitting uses eps times the largest array dimension for the relative condition number. IIRC, that choice traces back to numerical recipes.
Chuck - sorry - I didn't understand what you were saying, and now I think you were proposing the MATLAB algorithm. I can't find that in Numerical Recipes - can you? It would be helpful as a reference.
This is the same as Matlab, right?
Yes, I believe so, i.e.:
tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
from my original email.
If the Matlab condition is the most conservative, then it seems like a reasonable choice -- conservative is good so long as your false positive rate doesn't become too high, and presumably Matlab has enough user experience to know whether the false positive rate is too high.
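For concreteness, the Matlab-style cutoff being discussed amounts to something like the following sketch (it implements the formula quoted above, not the actual code in either NumPy or MATLAB):

import numpy as np

def matlab_style_rank(M):
    # Threshold: largest singular value * eps * largest matrix dimension.
    S = np.linalg.svd(M, compute_uv=False)
    tol = S.max() * np.finfo(M.dtype).eps * max(M.shape)
    return int(np.sum(S > tol))

# Quick check on a matrix with two identical columns; the rank should be 9.
X = np.random.normal(size=(40, 10))
X[:, 0] = X[:, 1]
print(matlab_style_rank(X))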
Are we agreeing to go for the Matlab algorithm?
If so, how should this be managed? Just changing it may change the output of code using numpy >= 1.5.0, but then again, the threshold is probably incorrect.
Fix and break, or a FutureWarning with something like:
def matrix_rank(M, tol=None):
where ``tol`` can be a string like ``maxdim``?
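As a sketch of how that transition could look (``'maxdim'`` is only a placeholder name floated here, not an existing numpy option, and the real matrix_rank also handles 0-d and 1-d inputs, which this ignores):

import warnings
import numpy as np

def matrix_rank(M, tol=None):
    M = np.asarray(M)
    S = np.linalg.svd(M, compute_uv=False)
    if tol is None:
        # Caller relies on the old default: warn about the coming change.
        warnings.warn("the default tolerance of matrix_rank will change; "
                      "pass tol='maxdim' or an explicit number to silence "
                      "this warning", FutureWarning)
        tol = S.max() * np.finfo(S.dtype).eps                 # current default
    elif tol == 'maxdim':
        tol = S.max() * np.finfo(S.dtype).eps * max(M.shape)  # proposed default
    return int(np.sum(S > tol))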
I dunno, I don't think we should do a big deprecation dance for every bug fix. Is this a bug fix, so numpy will simply start producing more accurate results on a given problem? I guess there isn't really a right answer here (though claiming that [a, b, a+b] is full-rank is clearly broken, and the matlab algorithm seems reasonable for answering the specific question of whether a matrix is full rank), so we'll have to hope some users speak up... -N

I don't see a problem with changing this as a bugfix. statsmodels still has, I think, the original scipy.stats.models version of rank, whose fixed tolerance (cond=1.0e-12) is still much higher than the current one for any non-huge float array. Josef
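For comparison, a fixed relative tolerance like the one Josef describes would look roughly like this (the actual name and signature of the statsmodels helper are not reproduced here):

import numpy as np

def rank_fixed_cond(X, cond=1.0e-12):
    # Count singular values above a fixed fraction of the largest one; for
    # typical matrix sizes this is far more conservative than eps-based cutoffs.
    S = np.linalg.svd(X, compute_uv=False)
    return int(np.sum(S > cond * S.max()))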
participants (5)
- Benjamin Root
- Charles R Harris
- josef.pktd@gmail.com
- Matthew Brett
- Nathaniel Smith