[Numpy-discussion] Matrix rank default tolerance - is it too low?

Matthew Brett matthew.brett at gmail.com
Tue Jun 26 19:46:20 EDT 2012


Hi,

On Tue, Jun 26, 2012 at 4:39 PM, Benjamin Root <ben.root at ou.edu> wrote:
>
>
> On Tuesday, June 26, 2012, Charles R Harris wrote:
>>
>>
>>
>> On Tue, Jun 26, 2012 at 3:42 PM, Matthew Brett <matthew.brett at gmail.com>
>> wrote:
>>
>> Hi,
>>
>> On Mon, Jun 18, 2012 at 3:50 PM, Matthew Brett <matthew.brett at gmail.com>
>> wrote:
>> > Hi,
>> >
>> > On Sun, Jun 17, 2012 at 7:22 PM, Charles R Harris
>> > <charlesr.harris at gmail.com> wrote:
>> >>
>> >>
>> >> On Sat, Jun 16, 2012 at 2:33 PM, Matthew Brett
>> >> <matthew.brett at gmail.com>
>> >> wrote:
>> >>>
>> >>> Hi,
>> >>>
>> >>> On Sat, Jun 16, 2012 at 8:03 PM, Matthew Brett
>> >>> <matthew.brett at gmail.com>
>> >>> wrote:
>> >>> > Hi,
>> >>> >
>> >>> > On Sat, Jun 16, 2012 at 10:40 AM, Nathaniel Smith <njs at pobox.com>
>> >>> > wrote:
>> >>> >> On Fri, Jun 15, 2012 at 4:10 AM, Charles R Harris
>> >>> >> <charlesr.harris at gmail.com> wrote:
>> >>> >>>
>> >>> >>>
>> >>> >>> On Thu, Jun 14, 2012 at 8:06 PM, Matthew Brett
>> >>> >>> <matthew.brett at gmail.com>
>> >>> >>> wrote:
>> >>> >>>>
>> >>> >>>> Hi,
>> >>> >>>>
>> >>> >>>> I noticed that numpy.linalg.matrix_rank sometimes gives full rank
>> >>> >>>> for
>> >>> >>>> matrices that are numerically rank deficient:
>> >>> >>>>
>> >>> >>>> If I repeatedly make random matrices, then set the first column
>> >>> >>>> to be
>> >>> >>>> equal to the sum of the second and third columns:
>> >>> >>>>
>> >>> >>>> def make_deficient():
>> >>> >>>>    X = np.random.normal(size=(40, 10))
>> >>> >>>>    deficient_X = X.copy()
>> >>> >>>>    deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
>> >>> >>>>    return deficient_X
>> >>> >>>>
>> >>> >>>> then the current numpy.linalg.matrix_rank algorithm returns full
>> >>> >>>> rank
>> >>> >>>> (10) in about 8 percent of cases (see appended script).
>> >>> >>>>
>> >>> >>>> I think this is a tolerance problem.  The ``matrix_rank``
>> >>> >>>> algorithm
>> >>> >>>> does this by default:
>> >>> >>>>
>> >>> >>>> S = spl.svd(M, compute_uv=False)
>> >>> >>>> tol = S.max() * np.finfo(S.dtype).eps
>> >>> >>>> return np.sum(S > tol)
>> >>> >>>>
>> >>> >>>> I guess we'd want the lowest tolerance that nearly always or
>> >>> >>>> always
>> >>> >>>> identifies numerically rank deficient matrices.  I suppose one
>> >>> >>>> way of
>> >>> >>>> looking at whether the tolerance is in the right range is to
>> >>> >>>> compare
>> >>> >>>> the calculated tolerance (``tol``) to the minimum singular value
>> >>> >>>> (``S.min()``) because S.min() in our case should be very small
>> >>> >>>> and
>> >>> >>>> indicate the rank deficiency. The mean value of tol / S.min() for
>> >>> >>>> the
>> >>> >>>> current algorithm, across many iterations, is about 2.8.  We
>> >>> >>>> might
>> >>> >>>> hope this value would be higher than 1, but not much higher,
>> >>> >>>> otherwise
>> >>> >>>> we might be rejecting too many columns.
>> >>> >>>>
>> >>> >>>> Our current algorithm for tolerance is the same as the 2-norm of
>> >>> >>>> M *
>> >>> >>>> eps.  We're citing Golub and Van Loan for this, but now I look at
>> >>> >>>> our
>> >>> >>>> copy (p 261, last para) - they seem to be suggesting using
>> >>> >>>> u * |M| where u = eps / 2 (p 61, section 2.4.2; see [1]).
>> >>> >>>> I think the Golub
>>
>>
>> I'm fine with that, and agree that it is likely to lead to fewer folks
>> wondering why Matlab and numpy are different. A good explanation in the
>> function documentation would be useful.
>>
>> Chuck
>>
>
> One potential problem is that it implies that it will always be the same as
> any version of matlab's tolerance.  What if they change it in a future
> release? How likely are we to even notice?

I guess that matlab is unlikely to change for the same reason that we
would be reluctant to change, once we've found an acceptable value.

I was thinking that we would say something like:

"""
The default tolerance is:

 tol = S.max() * np.finfo(M.dtype).eps * max((m, n))

This corresponds to the tolerance suggested in NR page X, and to the
tolerance used by MATLAB at the time of writing (June 2012; see
http://www.mathworks.com/help/techdoc/ref/rank.html).
"""

I don't know whether we would want to track changes made by matlab -
maybe we could have that discussion if they do change?
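
For concreteness, here is a rough sketch of the comparison (using
numpy's svd directly rather than the spl alias in the earlier snippet;
not the final docstring example). With the deficient matrices above,
the proposed tolerance should essentially always give rank 9, whereas
the current one gives full rank in a small proportion of cases:

import numpy as np

def make_deficient():
    # 40x10 random matrix whose first column is the sum of the second
    # and third columns, so the numerical rank should be 9.
    X = np.random.normal(size=(40, 10))
    deficient_X = X.copy()
    deficient_X[:, 0] = deficient_X[:, 1] + deficient_X[:, 2]
    return deficient_X

M = make_deficient()
S = np.linalg.svd(M, compute_uv=False)

# Current matrix_rank default tolerance
tol_current = S.max() * np.finfo(S.dtype).eps

# Proposed MATLAB-style tolerance: also scale by the larger dimension
tol_proposed = S.max() * np.finfo(M.dtype).eps * max(M.shape)

print('rank with current tol: ', np.sum(S > tol_current))
print('rank with proposed tol:', np.sum(S > tol_proposed))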

Best,

Matthew


