[Numpy-discussion] Matrix rank default tolerance - is it too low?
Charles R Harris
charlesr.harris at gmail.com
Fri Jun 15 22:51:13 EDT 2012
On Fri, Jun 15, 2012 at 6:39 PM, Matthew Brett <matthew.brett at gmail.com>wrote:
> Hi,
>
> On Thu, Jun 14, 2012 at 8:10 PM, Charles R Harris
> <charlesr.harris at gmail.com> wrote:
> >
> >
> > On Thu, Jun 14, 2012 at 8:06 PM, Matthew Brett <matthew.brett at gmail.com>
> > wrote:
> >>
> >> Hi,
> >>
> >> I noticed that numpy.linalg.matrix_rank sometimes gives full rank for
> >> matrices that are numerically rank deficient:
> >>
> >> If I repeatedly make random matrices, then set the first column to be
> >> equal to the sum of the second and third columns:
> >>
> >> def make_deficient():
> >> X = np.random.normal(size=(40, 10))
> >> deficient_X = X.copy()
> >> deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
> >> return deficient_X
> >>
> >> then the current numpy.linalg.matrix_rank algorithm returns full rank
> >> (10) in about 8 percent of cases (see appended script).
> >>
> >> I think this is a tolerance problem. The ``matrix_rank`` algorithm
> >> does this by default:
> >>
> >> S = spl.svd(M, compute_uv=False)
> >> tol = S.max() * np.finfo(S.dtype).eps
> >> return np.sum(S > tol)
> >>
> >> I guess we'd we want the lowest tolerance that nearly always or always
> >> identifies numerically rank deficient matrices. I suppose one way of
> >> looking at whether the tolerance is in the right range is to compare
> >> the calculated tolerance (``tol``) to the minimum singular value
> >> (``S.min()``) because S.min() in our case should be very small and
> >> indicate the rank deficiency. The mean value of tol / S.min() for the
> >> current algorithm, across many iterations, is about 2.8. We might
> >> hope this value would be higher than 1, but not much higher, otherwise
> >> we might be rejecting too many columns.
> >>
> >> Our current algorithm for tolerance is the same as the 2-norm of M *
> >> eps. We're citing Golub and Van Loan for this, but now I look at our
> >> copy (p 261, last para) - they seem to be suggesting using u * |M|
> >> where u = (p 61, section 2.4.2) eps / 2. (see [1]). I think the Golub
> >> and Van Loan suggestion corresponds to:
> >>
> >> tol = np.linalg.norm(M, np.inf) * np.finfo(M.dtype).eps / 2
> >>
> >> This tolerance gives full rank for these rank-deficient matrices in
> >> about 39 percent of cases (tol / S.min() ratio of 1.7)
> >>
> >> We see on p 56 (section 2.3.2) that:
> >>
> >> m, n = M.shape
> >> 1 / sqrt(n) . |M|_{inf} <= |M|_2
> >>
> >> So we can get an upper bound on |M|_{inf} with |M|_2 * sqrt(n).
> Setting:
> >>
> >> tol = S.max() * np.finfo(M.dtype).eps / 2 * np.sqrt(n)
> >>
> >> gives about 0.5 percent error (tol / S.min() of 4.4)
> >>
> >> Using the Mathworks threshold [2]:
> >>
> >> tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
> >>
> >> There are no false negatives (0 percent rank 10), but tol / S.min() is
> >> around 110 - so conservative, in this case.
> >>
> >> So - summary - I'm worrying our current threshold is too small,
> >> letting through many rank-deficient matrices without detection. I may
> >> have misread Golub and Van Loan, but maybe we aren't doing what they
> >> suggest. Maybe what we could use is either the MATLAB threshold or
> >> something like:
> >>
> >> tol = S.max() * np.finfo(M.dtype).eps * np.sqrt(n)
> >>
> >> - so 2 * the upper bound for the inf norm = 2 * |M|_2 * sqrt(n) . This
> >> gives 0 percent misses and tol / S.min() of 8.7.
> >>
> >> What do y'all think?
> >>
> >> Best,
> >>
> >> Matthew
> >>
> >> [1]
> >>
> http://matthew-brett.github.com/pydagogue/floating_error.html#machine-epsilon
> >> [2] http://www.mathworks.com/help/techdoc/ref/rank.html
> >>
> >> Output from script:
> >>
> >> Percent undetected current: 9.8, tol / S.min(): 2.762
> >> Percent undetected inf norm: 39.1, tol / S.min(): 1.667
> >> Percent undetected upper bound inf norm: 0.5, tol / S.min(): 4.367
> >> Percent undetected upper bound inf norm * 2: 0.0, tol / S.min(): 8.734
> >> Percent undetected MATLAB: 0.0, tol / S.min(): 110.477
> >>
> >> <script>
> >> import numpy as np
> >> import scipy.linalg as npl
> >>
> >> M = 40
> >> N = 10
> >>
> >> def make_deficient():
> >> X = np.random.normal(size=(M, N))
> >> deficient_X = X.copy()
> >> if M > N: # Make a column deficient
> >> deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
> >> else: # Make a row deficient
> >> deficient_X[0] = deficient_X[1] + deficient_X[2]
> >> return deficient_X
> >>
> >> matrices = []
> >> ranks = []
> >> ranks_inf = []
> >> ranks_ub_inf = []
> >> ranks_ub2_inf = []
> >> ranks_mlab = []
> >> tols = np.zeros((1000, 6))
> >> for i in range(1000):
> >> m = make_deficient()
> >> matrices.append(m)
> >> # The SVD tolerances
> >> S = npl.svd(m, compute_uv=False)
> >> S0 = S.max()
> >> # u in Golub and Van Loan == numpy eps / 2
> >> eps = np.finfo(m.dtype).eps
> >> u = eps / 2
> >> # Current numpy matrix_rank algorithm
> >> ranks.append(np.linalg.matrix_rank(m))
> >> # Which is the same as:
> >> tol_s0 = S0 * eps
> >> # ranks.append(np.linalg.matrix_rank(m, tol=tol_s0))
> >> # Golub and Van Loan suggestion
> >> tol_inf = npl.norm(m, np.inf) * u
> >> ranks_inf.append(np.linalg.matrix_rank(m, tol=tol_inf))
> >> # Upper bound of |X|_{inf}
> >> tol_ub_inf = tol_s0 * np.sqrt(N) / 2
> >> ranks_ub_inf.append(np.linalg.matrix_rank(m, tol=tol_ub_inf))
> >> # Times 2 fudge
> >> tol_ub2_inf = tol_s0 * np.sqrt(N)
> >> ranks_ub2_inf.append(np.linalg.matrix_rank(m, tol=tol_ub2_inf))
> >> # MATLAB algorithm
> >> tol_mlab = tol_s0 * max(m.shape)
> >> ranks_mlab.append(np.linalg.matrix_rank(m, tol=tol_mlab))
> >> # Collect tols
> >> tols[i] = tol_s0, tol_inf, tol_ub_inf, tol_ub2_inf, tol_mlab, S.min()
> >>
> >> rel_tols = tols / tols[:, -1][:, None]
> >>
> >> fmt = 'Percent undetected %s: %3.1f, tol / S.min(): %2.3f'
> >> max_rank = min(M, N)
> >> for name, ranks, mrt in zip(
> >> ('current', 'inf norm', 'upper bound inf norm',
> >> 'upper bound inf norm * 2', 'MATLAB'),
> >> (ranks, ranks_inf, ranks_ub_inf, ranks_ub2_inf, ranks_mlab),
> >> rel_tols.mean(axis=0)[:5]):
> >> pcnt = np.sum(np.array(ranks) == max_rank) / 1000. * 100
> >> print fmt % (name, pcnt, mrt)
> >> </script>
> >
> >
> > The polynomial fitting uses eps times the largest array dimension for the
> > relative condition number. IIRC, that choice traces back to numerical
> > recipes.
>
> The problem for that as a general metric would be that larger values
> in the data would tend to lead to larger numerical error, but this
> won't be reflected in the tolerance. For example, running my script
> with your metric finds 0 errors for the normal distribution var = 1,
> but 100 percent error for a variance of 1000.
>
>
It's a *relative* condition number. You need to multiply it times the
largest singular value to find the cutoff.
> Percent undetected current: 0.0, tol / S.min(): 3.545
> Percent undetected inf norm: 0.0, tol / S.min(): 2.139
> Percent undetected upper bound inf norm: 0.0, tol / S.min(): 5.605
> Percent undetected upper bound inf norm * 2: 0.0, tol / S.min(): 11.209
> Percent undetected MATLAB: 0.0, tol / S.min(): 141.785
> Percent undetected Chuck: 100.0, tol / S.min(): 0.013
>
> My worry is that I haven't sampled the space of all possible matrix
> sizes and scalings. First pass suggests that the values will be
> different on different plaforms (at least, between a PPC 32 bit and an
> Intel 64 bit). I think the tolerance is wrong at the moment, and it
> looks like the Golub and Van Loan suggestion will not work as written.
> The MATLAB algorithm is some kind of standard and has been battle
> tested. If we are going to change, it seems tempting to use that.
>
>
Chuck
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20120615/471e172f/attachment.html>
More information about the NumPy-Discussion
mailing list