[Numpy-discussion] Fwd: Multi-distribution Linux wheels - please test

Matthew Brett matthew.brett at gmail.com
Tue Feb 9 14:52:35 EST 2016


On Mon, Feb 8, 2016 at 7:59 PM, Nathaniel Smith <njs at pobox.com> wrote:
> On Mon, Feb 8, 2016 at 6:07 PM, Nathaniel Smith <njs at pobox.com> wrote:
>> On Mon, Feb 8, 2016 at 6:04 PM, Matthew Brett <matthew.brett at gmail.com> wrote:
>>> On Mon, Feb 8, 2016 at 5:26 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>>> On Mon, Feb 8, 2016 at 4:37 PM, Matthew Brett <matthew.brett at gmail.com> wrote:
>>>> [...]
>>>>> I can't replicate the segfault with manylinux wheels and scipy.  On
>>>>> the other hand, I get a new test error for numpy from manylinux, scipy
>>>>> from manylinux, like this:
>>>>>
>>>>> $ python -c 'import scipy.linalg; scipy.linalg.test()'
>>>>>
>>>>> ======================================================================
>>>>> FAIL: test_decomp.test_eigh('general ', 6, 'F', True, False, False, (2, 4))
>>>>> ----------------------------------------------------------------------
>>>>> Traceback (most recent call last):
>>>>>   File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
>>>>>     self.test(*self.arg)
>>>>>   File "/usr/local/lib/python2.7/dist-packages/scipy/linalg/tests/test_decomp.py", line 658, in eigenhproblem_general
>>>>>     assert_array_almost_equal(diag2_, ones(diag2_.shape[0]), DIGITS[dtype])
>>>>>   File "/usr/local/lib/python2.7/dist-packages/numpy/testing/utils.py", line 892, in assert_array_almost_equal
>>>>>     precision=decimal)
>>>>>   File "/usr/local/lib/python2.7/dist-packages/numpy/testing/utils.py", line 713, in assert_array_compare
>>>>>     raise AssertionError(msg)
>>>>> AssertionError:
>>>>> Arrays are not almost equal to 4 decimals
>>>>>
>>>>> (mismatch 100.0%)
>>>>>  x: array([ 0.,  0.,  0.], dtype=float32)
>>>>>  y: array([ 1.,  1.,  1.])
>>>>>
>>>>> ----------------------------------------------------------------------
>>>>> Ran 1507 tests in 14.928s
>>>>>
>>>>> FAILED (KNOWNFAIL=4, SKIP=1, failures=1)
>>>>>
>>>>> This is a very odd error, which we don't get when running over a numpy
>>>>> installed from source, linked to ATLAS, and doesn't happen when
>>>>> running the tests via:
>>>>>
>>>>> nosetests /usr/local/lib/python2.7/dist-packages/scipy/linalg
>>>>>
>>>>> So, something about the copy of numpy (linked to openblas) is
>>>>> affecting the results of scipy (also linked to openblas), and only
>>>>> with a particular environment / test order.
>>>>>
>>>>> If you'd like to try and see whether y'all can do a better job of
>>>>> debugging than me:
>>>>>
>>>>> # Run this script inside a docker container started with this incantation:
>>>>> # docker run -ti --rm ubuntu:12.04 /bin/bash
>>>>> apt-get update
>>>>> apt-get install -y python curl
>>>>> apt-get install libpython2.7  # this won't be necessary with next iteration of manylinux wheel builds
>>>>> curl -LO https://bootstrap.pypa.io/get-pip.py
>>>>> python get-pip.py
>>>>> pip install -f https://nipy.bic.berkeley.edu/manylinux numpy scipy nose
>>>>> python -c 'import scipy.linalg; scipy.linalg.test()'
>>>>
>>>> I just tried this and on my laptop it completed without error.
>>>>
>>>> Best guess is that we're dealing with some memory corruption bug
>>>> inside openblas, so it's getting perturbed by things like exactly what
>>>> other calls to openblas have happened (which is different depending on
>>>> whether numpy is linked to openblas), and which core type openblas has
>>>> detected.
>>>>
>>>> On my laptop, which *doesn't* show the problem, running with
>>>> OPENBLAS_VERBOSE=2 says "Core: Haswell".
>>>>
>>>> Guess the next step is checking what core type the failing machines
>>>> use, and running valgrind... anyone have a good valgrind suppressions
>>>> file?
>>>
>>> My machine (which does give the failure) gives
>>>
>>> Core: Core2
>>>
>>> with OPENBLAS_VERBOSE=2
>>
>> Yep, that allows me to reproduce it:
>>
>> root@f7153f0cc841:/# OPENBLAS_VERBOSE=2 OPENBLAS_CORETYPE=Core2 python -c 'import scipy.linalg; scipy.linalg.test()'
>> Core: Core2
>> [...]
>> ======================================================================
>> FAIL: test_decomp.test_eigh('general ', 6, 'F', True, False, False, (2, 4))
>> ----------------------------------------------------------------------
>> [...]
>>
>> So this is indeed sounding like an OpenBLAS issue... next stop
>> valgrind, I guess :-/
>
> Here's the valgrind output:
>   https://gist.github.com/njsmith/577d028e79f0a80d2797
>
> There's a lot of it, but no smoking guns have jumped out at me :-/

Could you send me instructions for replicating the valgrind run? I'll
run it on the actual Core2 machine...
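
In case it helps anyone else trying the reproduction above, here is a
minimal sketch of a helper that re-runs the scipy.linalg tests in a fresh
interpreter with the OpenBLAS core type forced. The function names
(openblas_env, run_linalg_tests) are made up for illustration; the only
things taken from this thread are the OPENBLAS_VERBOSE and
OPENBLAS_CORETYPE environment variables and the test one-liner. It has
not been checked against the failing machine.

```python
import os
import subprocess
import sys

# The same one-liner used earlier in this thread.
TEST_SNIPPET = "import scipy.linalg; scipy.linalg.test()"


def openblas_env(coretype, verbose=2, base_env=None):
    """Return a copy of the environment with the OpenBLAS overrides set.

    `coretype` is a core name as printed by OPENBLAS_VERBOSE=2,
    for example "Core2" or "Haswell".
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OPENBLAS_VERBOSE"] = str(verbose)
    env["OPENBLAS_CORETYPE"] = coretype
    return env


def run_linalg_tests(coretype, python=sys.executable):
    """Run the scipy.linalg tests in a subprocess under a forced core type.

    A fresh process is used so that OpenBLAS reads the variables when it
    is first loaded. Returns the subprocess exit status.
    """
    return subprocess.call([python, "-c", TEST_SNIPPET],
                           env=openblas_env(coretype))
```

Calling run_linalg_tests("Core2") should correspond to the command line
quoted above. Forcing a core type newer than the CPU actually supports
(say, "Haswell" on the Core2 box) is likely to die with an illegal
instruction rather than give a useful comparison.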

Matthew


