[Numpy-discussion] advanced indexing bug with huge arrays?

Robert Kern robert.kern at gmail.com
Thu Jan 26 15:35:58 EST 2012


On Thu, Jan 26, 2012 at 17:39, Sturla Molden <sturla at molden.no> wrote:
> On 24.01.2012 17:19, David Warde-Farley wrote:
>>
>> Hmm. Seeing as the width of a C long is inconsistent, does this imply that
>> the random number generator will produce different results on different
>> platforms?
>
> If it does, it is a C programming mistake. C code should never depend on
> the exact size of a long, only its minimum size. ISO C defines other
> datatypes if an exact integer size is needed (include stdint.h), but
> the ANSI C used for NumPy does not.

I think you're subtly misunderstanding his question. He's not asking
whether the code is written such that it semantically requires a long
to have one specific size or another (and indeed, it is not). However,
it is true that the code may behave differently for the same inputs on
machines with different long sizes: some part of the computation may
overflow with 32-bit longs while giving an accurate answer with 64-bit
longs. The two builds just have different domains of accuracy over
their inputs. It is not necessarily a mistake to take advantage of the
extra room when it is available; that is why Python ints are C longs
and why numpy follows suit.
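
For reference, you can check what your own build uses with standard
ctypes and numpy introspection (just an illustration; the values
printed depend on your platform):

>>> import ctypes
>>> import numpy as np
>>> ctypes.sizeof(ctypes.c_long)   # 4 on 32-bit builds (and 64-bit Windows), 8 on most 64-bit Unix
>>> np.dtype(np.int_).itemsize     # numpy's default integer has the same width as a C long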

But unfortunately, it is true that at least some of the distributions
do have different behavior when using 64-bit longs than when using
32-bit longs. Here is an example of drawing from a binomial
distribution with a large N on a 32-bit process and comparing it with
results from a 64-bit process:

[~]$ $PYBIN/python
Enthought Python Distribution -- www.enthought.com
Version: 7.1-2 (32-bit)

Python 2.7.2 |EPD 7.1-2 (32-bit)| (default, Jul 27 2011, 13:29:32)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "packages", "demo" or "enthought" for more information.
>>> import numpy as np
>>> prng = np.random.RandomState(1234567890)
>>> x32 = prng.binomial(500000000, 0.5, size=10000)
>>> x64 = np.load('x64.npy')
>>> np.save('x32.npy', x32)
>>> bad = (x64 != x32)
>>> bad.sum()
9449
>>> bad[-9449:]
array([ True,  True,  True, ...,  True,  True,  True], dtype=bool)
>>> bad[-9449:].sum()
9449
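
(The x64.npy file was presumably produced beforehand by the
mirror-image session on a 64-bit build, something like:

>>> import numpy as np
>>> prng = np.random.RandomState(1234567890)
>>> x64 = prng.binomial(500000000, 0.5, size=10000)
>>> np.save('x64.npy', x64)

so that the two runs start from the same seed.)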

binomial() is using a rejection algorithm for this set of inputs. For
each draw, it generates random numbers until they meet a certain
acceptance condition. The 32-bit process and the 64-bit process draw
exactly the same numbers until the 552nd draw. Then, I suspect, an
integer overflow in the 32-bit process causes the rejection algorithm
to terminate either earlier or later than it otherwise should. Since
the two processes have then consumed different numbers of random
variates, the underlying uniform PRNG is no longer in the same state,
so all of the numbers thereafter will be different.
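
To make the desynchronization concrete, here is a minimal sketch of
the generic rejection pattern (not the actual BTPE code in
distributions.c):

>>> import numpy as np
>>> def rejection_draw(prng, accept):
...     # Draw uniforms until one passes the acceptance test; the number
...     # of variates consumed varies from call to call.
...     n_consumed = 0
...     while True:
...         u = prng.uniform()
...         n_consumed += 1
...         if accept(u):
...             return u, n_consumed
...
>>> prng = np.random.RandomState(0)
>>> u, n = rejection_draw(prng, lambda u: u < 0.1)

If an overflow flips the result of accept() for even one candidate,
the loop consumes a different number of uniforms than it should, and
the two streams never re-synchronize.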

It's not clear to me how problematic this is. I haven't seen any
difference when using reasonable input values (N=500000000 is a
ridiculous number to be using with a binomial distribution). If I'm
right that there is an overflow when using 32-bit longs, then those
results should not be trusted anyway, so there is no point in
comparing them to the 64-bit results. It's just that the domain of
validity with a 32-bit long is a bit smaller than with a 64-bit long.
The deviation of x32[551] from the mean is larger than the maximum
deviation from the mean anywhere in the 64-bit results, so it is
reasonably likely that that draw is just bogus.

>>> np.max(abs(x64 - 250000000))
44519
>>> x32[551] - 250000000
47368

Often, the acceptance criterion is something of the form (X < bound),
with X expected to be positive. An integer overflow would introduce a
negative value somewhere in the computation, which could easily "pass"
this acceptance test when it would have failed had the intermediate
computations been done without overflow.
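
A toy demonstration with numpy's fixed-width integers (this
illustrates the wraparound mechanism, not the actual BTPE arithmetic):

>>> import numpy as np
>>> x = np.int32(2000000000)
>>> total = x + x            # exceeds 2**31 - 1; numpy warns and wraps around
>>> total
-294967296
>>> total < np.int32(100)    # the overflowed value falsely passes an (X < bound) test
True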

If anyone wants to debug this more thoroughly, this bit of code will
get the PRNG into exactly the right state to see the difference on the
next binomial() draw:

>>> import numpy as np
>>> prng = np.random.RandomState(1234567890)
>>> blah = prng.binomial(500000000, 0.5, size=551)

If you run python under gdb, you can then set a breakpoint in
rk_binomial_btpe() in distributions.c to step through the next call to
prng.binomial(). Sometimes you can fix these issues in a rejection
algorithm by checking for overflow and rejecting those cases.
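
Roughly, such a session might look like this (a sketch only; paths and
details will vary with your platform and numpy build):

[~]$ gdb --args $PYBIN/python
(gdb) run
>>> import numpy as np
>>> prng = np.random.RandomState(1234567890)
>>> blah = prng.binomial(500000000, 0.5, size=551)
>>> # press Ctrl-C here to drop back into gdb
(gdb) break rk_binomial_btpe
(gdb) continue
>>> prng.binomial(500000000, 0.5)   # the breakpoint fires inside this call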

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco


