[Numpy-discussion] Fwd: Multi-distribution Linux wheels - please test

Tue Feb 9 14:37:26 EST 2016

On 09.02.2016 04:59, Nathaniel Smith wrote:
> On Mon, Feb 8, 2016 at 6:07 PM, Nathaniel Smith <njs at pobox.com> wrote:
>> On Mon, Feb 8, 2016 at 6:04 PM, Matthew Brett <matthew.brett at gmail.com> wrote:
>>> On Mon, Feb 8, 2016 at 5:26 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>>> On Mon, Feb 8, 2016 at 4:37 PM, Matthew Brett <matthew.brett at gmail.com> wrote:
>>>> [...]
>>>>> I can't replicate the segfault with manylinux wheels and scipy.  On
>>>>> the other hand, I get a new test error for numpy from manylinux, scipy
>>>>> from manylinux, like this:
>>>>>
>>>>> $ python -c 'import scipy.linalg; scipy.linalg.test()'
>>>>>
>>>>> ======================================================================
>>>>> FAIL: test_decomp.test_eigh('general ', 6, 'F', True, False, False, (2, 4))
>>>>> ----------------------------------------------------------------------
>>>>> Traceback (most recent call last):
>>>>>   File "/usr/local/lib/python2.7/dist-packages/nose/case.py", line
>>>>> 197, in runTest
>>>>>     self.test(*self.arg)
>>>>>   File "/usr/local/lib/python2.7/dist-packages/scipy/linalg/tests/test_decomp.py",
>>>>> line 658, in eigenhproblem_general
>>>>>     assert_array_almost_equal(diag2_, ones(diag2_.shape[0]), DIGITS[dtype])
>>>>>   File "/usr/local/lib/python2.7/dist-packages/numpy/testing/utils.py",
>>>>> line 892, in assert_array_almost_equal
>>>>>     precision=decimal)
>>>>>   File "/usr/local/lib/python2.7/dist-packages/numpy/testing/utils.py",
>>>>> line 713, in assert_array_compare
>>>>>     raise AssertionError(msg)
>>>>> AssertionError:
>>>>> Arrays are not almost equal to 4 decimals
>>>>>
>>>>> (mismatch 100.0%)
>>>>>  x: array([ 0.,  0.,  0.], dtype=float32)
>>>>>  y: array([ 1.,  1.,  1.])
>>>>>
>>>>> ----------------------------------------------------------------------
>>>>> Ran 1507 tests in 14.928s
>>>>>
>>>>> FAILED (KNOWNFAIL=4, SKIP=1, failures=1)
>>>>>
>>>>> This is a very odd error, which we don't get when running over a numpy
>>>>> installed from source, linked to ATLAS, and doesn't happen when
>>>>> running the tests via:
>>>>>
>>>>> nosetests /usr/local/lib/python2.7/dist-packages/scipy/linalg
>>>>>
>>>>> So, something about the copy of numpy (linked to openblas) is
>>>>> affecting the results of scipy (also linked to openblas), and only
>>>>> with a particular environment / test order.
>>>>>
>>>>> If you'd like to try and see whether y'all can do a better job of
>>>>> debugging than me:
>>>>>
>>>>> # Run this script inside a docker container started with this incantation:
>>>>> # docker run -ti --rm ubuntu:12.04 /bin/bash
>>>>> apt-get update
>>>>> apt-get install -y python curl
>>>>> # libpython2.7 won't be necessary with next iteration of manylinux wheel builds
>>>>> apt-get install -y libpython2.7
>>>>> curl -LO https://bootstrap.pypa.io/get-pip.py
>>>>> python get-pip.py
>>>>> pip install -f https://nipy.bic.berkeley.edu/manylinux numpy scipy nose
>>>>> python -c 'import scipy.linalg; scipy.linalg.test()'
>>>>
>>>> I just tried this and on my laptop it completed without error.
>>>>
>>>> Best guess is that we're dealing with some memory corruption bug
>>>> inside openblas, so it's getting perturbed by things like exactly what
>>>> other calls to openblas have happened (which is different depending on
>>>> whether numpy is linked to openblas), and which core type openblas has
>>>> detected.
>>>>
>>>> On my laptop, which *doesn't* show the problem, running with
>>>> OPENBLAS_VERBOSE=2 says "Core: Haswell".
>>>>
>>>> Guess the next step is checking what core type the failing machines
>>>> use, and running valgrind... anyone have a good valgrind suppressions
>>>> file?
>>>
>>> My machine (which does give the failure) gives
>>>
>>> Core: Core2
>>>
>>> with OPENBLAS_VERBOSE=2
>>
>> Yep, that allows me to reproduce it:
>>
>> root at f7153f0cc841:/# OPENBLAS_VERBOSE=2 OPENBLAS_CORETYPE=Core2 python
>> -c 'import scipy.linalg; scipy.linalg.test()'
>> Core: Core2
>> [...]
>> ======================================================================
>> FAIL: test_decomp.test_eigh('general ', 6, 'F', True, False, False, (2, 4))
>> ----------------------------------------------------------------------
>> [...]
>>
>> So this is indeed sounding like an OpenBLAS issue... next stop
>> valgrind, I guess :-/
> 
> Here's the valgrind output:
>   https://gist.github.com/njsmith/577d028e79f0a80d2797
> 
> There's a lot of it, but no smoking guns have jumped out at me :-/
> 
> -n
> 

plenty of smoking guns, e.g.:

.............==3695== Invalid read of size 8
==3695==    at 0x7AAA9C0: daxpy_k_CORE2 (in
/usr/local/lib/python2.7/dist-packages/numpy/.libs/libopenblas.so.0)
==3695==    by 0x76BEEFC: ger_kernel (in
/usr/local/lib/python2.7/dist-packages/numpy/.libs/libopenblas.so.0)
==3695==    by 0x788F618: exec_blas (in
/usr/local/lib/python2.7/dist-packages/numpy/.libs/libopenblas.so.0)
==3695==    by 0x76BF099: dger_thread (in
/usr/local/lib/python2.7/dist-packages/numpy/.libs/libopenblas.so.0)
==3695==    by 0x767DC37: dger_ (in
/usr/local/lib/python2.7/dist-packages/numpy/.libs/libopenblas.so.0)

I think I have already reported that to openblas. They said they do that
intentionally, though last I checked they are missing the code that
verifies this is actually allowed (if you're not crossing a page you can
read beyond the boundaries). It's pretty likely a pointless micro
optimization; you normally only use that trick for string functions
where you don't know the size of the string.
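
To make the page-boundary condition concrete, here is a minimal sketch of
the kind of check meant above. It is Python for illustration only and is
not OpenBLAS code; a real kernel would do this in C or assembly. The
function name and its parameters are hypothetical.

```python
import mmap

# Page size of the machine running this sketch (commonly 4096 bytes).
PAGE_SIZE = mmap.PAGESIZE


def overread_stays_in_page(buffer_end, extra_bytes, page_size=PAGE_SIZE):
    """Return True if reading extra_bytes past buffer_end stays in one page.

    buffer_end is the address one past the last valid byte of the buffer.
    The over-read touches addresses buffer_end .. buffer_end + extra_bytes - 1.
    It cannot fault only if the last byte touched lies in the same page as
    the last valid byte, since that page is known to be mapped.
    """
    if extra_bytes <= 0:
        return True  # nothing is read past the end
    last_valid_page = (buffer_end - 1) // page_size
    last_read_page = (buffer_end + extra_bytes - 1) // page_size
    return last_valid_page == last_read_page
```

Even when such a check passes, valgrind still reports the access as an
invalid read, because it tracks allocation boundaries rather than page
mappings.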

Your log also indicates it ran on Core2, while the issues occur on
Sandybridge. Maybe valgrind interferes with the CPU detection, so it
won't show anything.
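
One way to take CPU detection out of the picture is to force the core
type through the environment, as done earlier in the thread with
OPENBLAS_CORETYPE=Core2. The sketch below is a hedged helper for that:
the function names and the `wrapper` parameter are hypothetical, and the
core name "Sandybridge" in the usage comment is an assumption about the
spelling OpenBLAS expects.

```python
import os
import subprocess
import sys

# Same test invocation used earlier in the thread.
TEST_COMMAND = "import scipy.linalg; scipy.linalg.test()"


def openblas_env(core_type, base_env=None):
    """Copy the environment and force the OpenBLAS core type."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENBLAS_VERBOSE"] = "2"        # makes OpenBLAS print "Core: ..."
    env["OPENBLAS_CORETYPE"] = core_type  # bypasses runtime CPU detection
    return env


def build_command(wrapper=(), python=sys.executable):
    """Build the argv list, optionally prefixed by a wrapper such as valgrind."""
    return list(wrapper) + [python, "-c", TEST_COMMAND]


def run_linalg_tests(core_type, wrapper=()):
    """Run the scipy.linalg tests with a forced core type; return the exit code."""
    return subprocess.call(build_command(wrapper), env=openblas_env(core_type))


# Example usage (assumes scipy and nose are installed; not run here):
# for core in ("Core2", "Sandybridge"):
#     print(core, run_linalg_tests(core, wrapper=("valgrind",)))
```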