[Numpy-discussion] Fwd: Multi-distribution Linux wheels - please test

Julian Taylor jtaylor.debian at googlemail.com
Tue Feb 9 15:08:25 EST 2016


On 09.02.2016 21:01, Nathaniel Smith wrote:
> On Tue, Feb 9, 2016 at 11:37 AM, Julian Taylor
> <jtaylor.debian at googlemail.com> wrote:
>> On 09.02.2016 04:59, Nathaniel Smith wrote:
>>> On Mon, Feb 8, 2016 at 6:07 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>>> On Mon, Feb 8, 2016 at 6:04 PM, Matthew Brett <matthew.brett at gmail.com> wrote:
>>>>> On Mon, Feb 8, 2016 at 5:26 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>>>>> On Mon, Feb 8, 2016 at 4:37 PM, Matthew Brett <matthew.brett at gmail.com> wrote:
>>>>>> [...]
>>>>>>> I can't replicate the segfault with manylinux wheels and scipy.  On
>>>>>>> the other hand, I get a new test error for numpy from manylinux, scipy
>>>>>>> from manylinux, like this:
>>>>>>>
>>>>>>> $ python -c 'import scipy.linalg; scipy.linalg.test()'
>>>>>>>
>>>>>>> ======================================================================
>>>>>>> FAIL: test_decomp.test_eigh('general ', 6, 'F', True, False, False, (2, 4))
>>>>>>> ----------------------------------------------------------------------
>>>>>>> Traceback (most recent call last):
>>>>>>>   File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line
>>>>>>> 197, in runTest
>>>>>>>     self.test(*self.arg)
>>>>>>>   File "/usr/local/lib/python2.7/dist-packages/scipy/linalg/tests/test_decomp.py",
>>>>>>> line 658, in eigenhproblem_general
>>>>>>>     assert_array_almost_equal(diag2_, ones(diag2_.shape[0]), DIGITS[dtype])
>>>>>>>   File "/usr/local/lib/python2.7/dist-packages/numpy/testing/utils.py",
>>>>>>> line 892, in assert_array_almost_equal
>>>>>>>     precision=decimal)
>>>>>>>   File "/usr/local/lib/python2.7/dist-packages/numpy/testing/utils.py",
>>>>>>> line 713, in assert_array_compare
>>>>>>>     raise AssertionError(msg)
>>>>>>> AssertionError:
>>>>>>> Arrays are not almost equal to 4 decimals
>>>>>>>
>>>>>>> (mismatch 100.0%)
>>>>>>>  x: array([ 0.,  0.,  0.], dtype=float32)
>>>>>>>  y: array([ 1.,  1.,  1.])
>>>>>>>
>>>>>>> ----------------------------------------------------------------------
>>>>>>> Ran 1507 tests in 14.928s
>>>>>>>
>>>>>>> FAILED (KNOWNFAIL=4, SKIP=1, failures=1)
>>>>>>>
>>>>>>> This is a very odd error, which we don't get when running against a
>>>>>>> numpy installed from source and linked to ATLAS, and which doesn't
>>>>>>> happen when running the tests via:
>>>>>>>
>>>>>>> nosetests /usr/local/lib/python2.7/dist-packages/scipy/linalg
>>>>>>>
>>>>>>> So, something about the copy of numpy (linked to openblas) is
>>>>>>> affecting the results of scipy (also linked to openblas), and only
>>>>>>> with a particular environment / test order.
>>>>>>>
>>>>>>> If you'd like to try and see whether y'all can do a better job of
>>>>>>> debugging than me:
>>>>>>>
>>>>>>> # Run this script inside a docker container started with this incantation:
>>>>>>> # docker run -ti --rm ubuntu:12.04 /bin/bash
>>>>>>> apt-get update
>>>>>>> apt-get install -y python curl
>>>>>>> apt-get install libpython2.7  # this won't be necessary with next
>>>>>>> iteration of manylinux wheel builds
>>>>>>> curl -LO https://bootstrap.pypa.io/get-pip.py
>>>>>>> python get-pip.py
>>>>>>> pip install -f https://nipy.bic.berkeley.edu/manylinux numpy scipy nose
>>>>>>> python -c 'import scipy.linalg; scipy.linalg.test()'
>>>>>>
>>>>>> I just tried this and on my laptop it completed without error.
>>>>>>
>>>>>> Best guess is that we're dealing with some memory corruption bug
>>>>>> inside openblas, so it's getting perturbed by things like exactly what
>>>>>> other calls to openblas have happened (which is different depending on
>>>>>> whether numpy is linked to openblas), and which core type openblas has
>>>>>> detected.
>>>>>>
>>>>>> On my laptop, which *doesn't* show the problem, running with
>>>>>> OPENBLAS_VERBOSE=2 says "Core: Haswell".
>>>>>>
>>>>>> Guess the next step is checking what core type the failing machines
>>>>>> use, and running valgrind... anyone have a good valgrind suppressions
>>>>>> file?
>>>>>
>>>>> My machine (which does give the failure) gives
>>>>>
>>>>> Core: Core2
>>>>>
>>>>> with OPENBLAS_VERBOSE=2
>>>>
>>>> Yep, that allows me to reproduce it:
>>>>
>>>> root at f7153f0cc841:/# OPENBLAS_VERBOSE=2 OPENBLAS_CORETYPE=Core2 python
>>>> -c 'import scipy.linalg; scipy.linalg.test()'
>>>> Core: Core2
>>>> [...]
>>>> ======================================================================
>>>> FAIL: test_decomp.test_eigh('general ', 6, 'F', True, False, False, (2, 4))
>>>> ----------------------------------------------------------------------
>>>> [...]
>>>>
>>>> So this is indeed sounding like an OpenBLAS issue... next stop
>>>> valgrind, I guess :-/
>>>
>>> Here's the valgrind output:
>>>   https://gist.github.com/njsmith/577d028e79f0a80d2797
>>>
>>> There's a lot of it, but no smoking guns have jumped out at me :-/
>>>
>>> -n
>>>
>>
>> plenty of smoking guns, e.g.:
>>
>> ==3695== Invalid read of size 8
>> ==3695==    at 0x7AAA9C0: daxpy_k_CORE2 (in
>> /usr/local/lib/python2.7/dist-packages/numpy/.libs/libopenblas.so.0)
>> ==3695==    by 0x76BEEFC: ger_kernel (in
>> /usr/local/lib/python2.7/dist-packages/numpy/.libs/libopenblas.so.0)
>> ==3695==    by 0x788F618: exec_blas (in
>> /usr/local/lib/python2.7/dist-packages/numpy/.libs/libopenblas.so.0)
>> ==3695==    by 0x76BF099: dger_thread (in
>> /usr/local/lib/python2.7/dist-packages/numpy/.libs/libopenblas.so.0)
>> ==3695==    by 0x767DC37: dger_ (in
>> /usr/local/lib/python2.7/dist-packages/numpy/.libs/libopenblas.so.0)
>>
>>
>> I think I have already reported that to OpenBLAS; they said they do it
>> intentionally, though last I checked they were missing the code that
>> verifies it is actually allowed (if you're not crossing a page you can
>> read beyond the buffer's end). It's pretty likely a pointless
>> micro-optimization; you normally only use that trick for string
>> functions, where you don't know the length of the string in advance.
> 
> Yeah, I thought that was intentional, and we're not getting a segfault
> so I don't think they're hitting any page boundaries. It's possible
> they're screwing it up and somehow the random data they're reading can
> affect the results, and that's why we get the wrong answer sometimes,
> but that's just a wild guess.

With OpenBLAS everything is possible, especially this exact type of issue.
See e.g.:
https://github.com/xianyi/OpenBLAS/issues/171
There it loaded too much data, partly uninitialized, and when that data
happened to contain NaNs they spread into the actually used data.
That was a lot of fun to debug, and OpenBLAS is riddled with this stuff...

e.g. here is my favourite comment in OpenBLAS (which is probably the
source of https://github.com/scipy/scipy/issues/5528):

  /* make it volatile because some function (ex: dgemv_n.S) */ \
  /* do not restore all register */                            \

https://github.com/xianyi/OpenBLAS/blob/develop/common_stackalloc.h#L51



