[Numpy-discussion] Matrix rank default tolerance - is it too low?

Fri Jun 15 20:39:54 EDT 2012

Hi,

On Thu, Jun 14, 2012 at 8:10 PM, Charles R Harris
<charlesr.harris at gmail.com> wrote:
>
>
> On Thu, Jun 14, 2012 at 8:06 PM, Matthew Brett <matthew.brett at gmail.com>
> wrote:
>>
>> Hi,
>>
>> I noticed that numpy.linalg.matrix_rank sometimes gives full rank for
>> matrices that are numerically rank deficient:
>>
>> If I repeatedly make random matrices, then set the first column to be
>> equal to the sum of the second and third columns:
>>
>> def make_deficient():
>>    X = np.random.normal(size=(40, 10))
>>    deficient_X = X.copy()
>>    deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
>>    return deficient_X
>>
>> then the current numpy.linalg.matrix_rank algorithm returns full rank
>> (10) in about 8 percent of cases (see appended script).
>>
>> I think this is a tolerance problem.  The ``matrix_rank`` algorithm
>> does this by default:
>>
>> S = spl.svd(M, compute_uv=False)
>> tol = S.max() * np.finfo(S.dtype).eps
>> return np.sum(S > tol)
>>
>> I guess we'd we want the lowest tolerance that nearly always or always
>> identifies numerically rank deficient matrices.  I suppose one way of
>> looking at whether the tolerance is in the right range is to compare
>> the calculated tolerance (``tol``) to the minimum singular value
>> (``S.min()``) because S.min() in our case should be very small and
>> indicate the rank deficiency. The mean value of tol / S.min() for the
>> current algorithm, across many iterations, is about 2.8.  We might
>> hope this value would be higher than 1, but not much higher, otherwise
>> we might be rejecting too many columns.
>>
>> Our current algorithm for tolerance is the same as the 2-norm of M *
>> eps.  We're citing Golub and Van Loan for this, but now I look at our
>> copy (p 261, last para) - they seem to be suggesting using u * |M|
>> where u = (p 61, section 2.4.2) eps /  2. (see [1]). I think the Golub
>> and Van Loan suggestion corresponds to:
>>
>> tol = np.linalg.norm(M, np.inf) * np.finfo(M.dtype).eps / 2
>>
>> This tolerance gives full rank for these rank-deficient matrices in
>> about 39 percent of cases (tol / S.min() ratio of 1.7)
>>
>> We see on p 56 (section 2.3.2) that:
>>
>> m, n = M.shape
>> 1 / sqrt(n) . |M|_{inf} <= |M|_2
>>
>> So we can get an upper bound on |M|_{inf} with |M|_2 * sqrt(n).  Setting:
>>
>> tol = S.max() * np.finfo(M.dtype).eps / 2 * np.sqrt(n)
>>
>> gives about 0.5 percent error (tol / S.min() of 4.4)
>>
>> Using the Mathworks threshold [2]:
>>
>> tol = S.max() * np.finfo(M.dtype).eps * max((m, n))
>>
>> There are no false negatives (0 percent rank 10), but tol / S.min() is
>> around 110 - so conservative, in this case.
>>
>> So - summary - I'm worrying our current threshold is too small,
>> letting through many rank-deficient matrices without detection.  I may
>> have misread Golub and Van Loan, but maybe we aren't doing what they
>> suggest.  Maybe what we could use is either the MATLAB threshold or
>> something like:
>>
>> tol = S.max() * np.finfo(M.dtype).eps * np.sqrt(n)
>>
>> - so 2 * the upper bound for the inf norm = 2 * |M|_2 * sqrt(n) . This
>> gives 0 percent misses and tol / S.min() of 8.7.
>>
>> What do y'all think?
>>
>> Best,
>>
>> Matthew
>>
>> [1]
>> http://matthew-brett.github.com/pydagogue/floating_error.html#machine-epsilon
>> [2] http://www.mathworks.com/help/techdoc/ref/rank.html
>>
>> Output from script:
>>
>> Percent undetected current: 9.8, tol / S.min(): 2.762
>> Percent undetected inf norm: 39.1, tol / S.min(): 1.667
>> Percent undetected upper bound inf norm: 0.5, tol / S.min(): 4.367
>> Percent undetected upper bound inf norm * 2: 0.0, tol / S.min(): 8.734
>> Percent undetected MATLAB: 0.0, tol / S.min(): 110.477
>>
>> <script>
>> import numpy as np
>> import scipy.linalg as npl
>>
>> M = 40
>> N = 10
>>
>> def make_deficient():
>>    X = np.random.normal(size=(M, N))
>>    deficient_X = X.copy()
>>    if M > N: # Make a column deficient
>>        deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
>>    else: # Make a row deficient
>>        deficient_X[0] = deficient_X[1] + deficient_X[2]
>>    return deficient_X
>>
>> matrices = []
>> ranks = []
>> ranks_inf = []
>> ranks_ub_inf = []
>> ranks_ub2_inf = []
>> ranks_mlab = []
>> tols = np.zeros((1000, 6))
>> for i in range(1000):
>>    m = make_deficient()
>>    matrices.append(m)
>>    # The SVD tolerances
>>    S = npl.svd(m, compute_uv=False)
>>    S0 = S.max()
>>    # u in Golub and Van Loan == numpy eps / 2
>>    eps = np.finfo(m.dtype).eps
>>    u = eps / 2
>>    # Current numpy matrix_rank algorithm
>>    ranks.append(np.linalg.matrix_rank(m))
>>    # Which is the same as:
>>    tol_s0 = S0 * eps
>>    # ranks.append(np.linalg.matrix_rank(m, tol=tol_s0))
>>    # Golub and Van Loan suggestion
>>    tol_inf = npl.norm(m, np.inf) * u
>>    ranks_inf.append(np.linalg.matrix_rank(m, tol=tol_inf))
>>    # Upper bound of |X|_{inf}
>>    tol_ub_inf = tol_s0 * np.sqrt(N) / 2
>>    ranks_ub_inf.append(np.linalg.matrix_rank(m, tol=tol_ub_inf))
>>    # Times 2 fudge
>>    tol_ub2_inf = tol_s0 * np.sqrt(N)
>>    ranks_ub2_inf.append(np.linalg.matrix_rank(m, tol=tol_ub2_inf))
>>    # MATLAB algorithm
>>    tol_mlab = tol_s0 * max(m.shape)
>>    ranks_mlab.append(np.linalg.matrix_rank(m, tol=tol_mlab))
>>    # Collect tols
>>    tols[i] = tol_s0, tol_inf, tol_ub_inf, tol_ub2_inf, tol_mlab, S.min()
>>
>> rel_tols = tols / tols[:, -1][:, None]
>>
>> fmt = 'Percent undetected %s: %3.1f, tol / S.min(): %2.3f'
>> max_rank = min(M, N)
>> for name, ranks, mrt in zip(
>>    ('current', 'inf norm', 'upper bound inf norm',
>>     'upper bound inf norm * 2', 'MATLAB'),
>>    (ranks, ranks_inf, ranks_ub_inf, ranks_ub2_inf, ranks_mlab),
>>    rel_tols.mean(axis=0)[:5]):
>>    pcnt = np.sum(np.array(ranks) == max_rank) / 1000. * 100
>>    print fmt % (name, pcnt, mrt)
>> </script>
>
>
> The polynomial fitting uses eps times the largest array dimension for the
> relative condition number. IIRC, that choice traces back to numerical
> recipes.

The problem for that as a general metric would be that larger values
in the data would tend to lead to larger numerical error, but this
won't be reflected in the tolerance.  For example, running my script
with your metric finds 0 errors for the normal distribution var = 1,
but 100 percent error for a variance of 1000.

Percent undetected current: 0.0, tol / S.min(): 3.545
Percent undetected inf norm: 0.0, tol / S.min(): 2.139
Percent undetected upper bound inf norm: 0.0, tol / S.min(): 5.605
Percent undetected upper bound inf norm * 2: 0.0, tol / S.min(): 11.209
Percent undetected MATLAB: 0.0, tol / S.min(): 141.785
Percent undetected Chuck: 100.0, tol / S.min(): 0.013

My worry is that I haven't sampled the space of all possible matrix
sizes and scalings.  First pass suggests that the values will be
different on different plaforms (at least, between a PPC 32 bit and an
Intel 64 bit).   I think the tolerance is wrong at the moment, and it
looks like the Golub and Van Loan suggestion will not work as written.
  The MATLAB algorithm is some kind of standard and has been battle
tested.  If we are going to change, it seems tempting to use that.

What do you think?,

Matthew