[Numpy-discussion] Default type for functions that accumulate integers

Nathaniel Smith njs at pobox.com
Mon Jan 2 21:46:08 EST 2017


On Mon, Jan 2, 2017 at 6:27 PM, Charles R Harris
<charlesr.harris at gmail.com> wrote:
> Hi All,
>
> Currently functions like trace use the C long type as the default
> accumulator for integer types of lesser precision:
>
>> dtype : dtype, optional
>>     Determines the data-type of the returned array and of the accumulator
>>     where the elements are summed. If dtype has the value None and `a` is
>>     of integer type of precision less than the default integer
>>     precision, then the default integer precision is used. Otherwise,
>>     the precision is the same as that of `a`.
>
>
> The problem with this is that the precision of long varies with the platform,
> so the result varies as well; see gh-8433 for a complaint about this. There
> are two alternatives that seem reasonable to me:
>
> 1. Use 32-bit accumulators on 32-bit platforms and 64-bit accumulators
>    on 64-bit platforms.
> 2. Always use 64-bit accumulators.

This is a special case of a more general question: right now we use
the default integer precision (i.e., what you get from np.array([1]),
np.arange, or np.dtype(int)), and that default itself varies in
confusing ways, which is a common source of bugs. Specifically: right
now it's 32-bit on 32-bit builds and 64-bit on 64-bit builds, except
on Windows where it's always 32-bit, because it follows the C long
type. This matches the default precision of Python 2's 'int'.
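
To make the current behavior concrete, here's a minimal sketch (the
outputs in the comments are what I'd expect on a typical 64-bit
Linux/macOS build; a 64-bit Windows build should report int32 for all
of these, since the default follows C long there):

    import numpy as np

    # All of these use the "default integer" type, which follows C long
    # and therefore varies across platforms.
    print(np.dtype(int))        # int64 here, int32 on Windows
    print(np.array([1]).dtype)  # likewise
    print(np.arange(3).dtype)   # likewise

    # The accumulator question shows up in functions like trace/sum:
    a = np.ones((3, 3), dtype=np.int8)
    print(np.trace(a).dtype)    # promoted to the platform default integer
    print(a.sum().dtype)        # likewise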

So some options include:
- make the default integer precision 64 bits everywhere
- make the default integer precision 32 bits on 32-bit systems, and
64 bits on 64-bit systems (including Windows)
- leave the default integer precision the same, but make accumulators
64 bits everywhere
- leave the default integer precision the same, but make accumulators
64 bits on 64-bit systems (including Windows)
- ...

Given the prevalence of 64-bit systems these days, and the fact that
the current setup makes it very easy to write code that seems to work
when tested on a 64-bit system but that silently returns incorrect
results on 32-bit systems, it sure would be nice if we could switch to
a 64-bit default everywhere. (You could still get 32-bit integers, of
course; you'd just have to ask for them explicitly.)
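
As a sketch of what asking explicitly looks like today, whichever
default we end up with: the dtype argument to the accumulating
functions and to the array constructors already pins the precision
down in a platform-independent way.

    import numpy as np

    a = np.ones(1000, dtype=np.int8)

    # Explicit accumulator dtype: same result dtype on every platform,
    # regardless of what the default integer precision happens to be.
    total = a.sum(dtype=np.int64)
    tr = np.trace(np.eye(3, dtype=np.int8), dtype=np.int64)

    # Anyone who really wants 32-bit integers can still ask for them:
    small = np.arange(10, dtype=np.int32)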

Things we'd need to know more about before making a decision:
- compatibility: if we flip this switch, how much code breaks? In
general, correct numpy-using code has to be prepared for
np.dtype(int) being 64 bits, and in fact there might be more code that
accidentally assumes np.dtype(int) is always 64 bits than there is
code that assumes it is always 32 bits. But that's theory; to know how
bad this really is we'd need to run some projects' test suites and see
whether they break or not.
- speed: there's probably some cost to using 64-bit integers on 32-bit
systems; how big is the penalty in practice?
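
A rough way to probe that last point would be something like the
following, run on a 32-bit build (the array size and repeat count are
just illustrative):

    import numpy as np
    import timeit

    a = np.arange(10**7, dtype=np.int32)

    # Compare summing with a native-width accumulator vs. a 64-bit one.
    t32 = timeit.timeit(lambda: a.sum(dtype=np.int32), number=100)
    t64 = timeit.timeit(lambda: a.sum(dtype=np.int64), number=100)
    print("int32 accumulator:", t32, "int64 accumulator:", t64)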

-n

-- 
Nathaniel J. Smith -- https://vorpus.org


