[Numpy-discussion] Default type for functions that accumulate integers

Antoine Pitrou solipsis at pitrou.net
Tue Jan 3 14:59:47 EST 2017


On Mon, 2 Jan 2017 18:46:08 -0800
Nathaniel Smith <njs at pobox.com> wrote:
> 
> So some options include:
> - make the default integer precision 64-bits everywhere
> - make the default integer precision 32-bits on 32-bit systems, and
> 64-bits on 64-bit systems (including Windows)

Either of those two would be best, IMO.

Intuitively, I think people would expect 32-bit ints by default in
32-bit processes, and likewise 64-bit ints in 64-bit processes. So I
would slightly favour the latter option.
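
For context, what "default integer precision" means here (a minimal
illustration; NumPy's default int follows the C "long", hence 32-bit
on 32-bit systems and on 64-bit Windows, 64-bit on 64-bit Linux and
macOS):

>>> import numpy as np
>>> np.array([1, 2, 3]).dtype    # the default integer dtype, np.int_
dtype('int64')                   # on a 64-bit Linux build; 'int32' on
                                 # Windows or on a 32-bit build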

> - leave the default integer precision the same, but make accumulators
> 64-bits everywhere
> - leave the default integer precision the same, but make accumulators
> 64-bits on 64-bit systems (including Windows)

Both of these options introduce a confusing discrepancy: reductions
would no longer return the dtype that the array itself defaults to.
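
To make the discrepancy concrete, here is a sketch of what those
options would give on a platform whose default int is 32 bits (this is
hypothetical behaviour under the proposal, not current NumPy):

>>> v = np.array([1, 2, 3])      # default dtype: int32 on such a build
>>> (v + v).dtype                # element-wise ops keep the input dtype
dtype('int32')
>>> v.sum().dtype                # but reductions would silently widen
dtype('int64')

The same array would yield different dtypes depending on whether you
apply arithmetic or a reduction to it.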

> - speed: there's probably some cost to using 64-bit integers on 32-bit
> systems; how big is the penalty in practice?

OK, I have fired up a Windows VM to compare 32-bit and 64-bit builds.
The NumPy version is 1.11.2 and the Python version is 3.5.2.  Keep in
mind those are Anaconda builds of NumPy, with MKL enabled for linear
algebra; YMMV.

For each benchmark, the first number is the result on the 32-bit build,
the second number on the 64-bit build.
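
If you want to reproduce these numbers without IPython's %timeit, a
rough stdlib equivalent (the globals= argument needs Python 3.5+) is:

>>> import timeit
>>> v = np.ones(1024**2, dtype='int32')
>>> t = min(timeit.repeat('v + v', globals={'v': v}, number=100)) / 100
>>> print('%.2f ms per loop' % (t * 1e3))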

Simple arithmetic
-----------------

>>> import numpy as np
>>> v = np.ones(1024**2, dtype='int32')

>>> %timeit v + v            # 1.73 ms per loop | 1.78 ms per loop
>>> %timeit v * v            # 1.77 ms per loop | 1.79 ms per loop
>>> %timeit v // v           # 5.89 ms per loop | 5.39 ms per loop

>>> v = np.ones(1024**2, dtype='int64')

>>> %timeit v + v            # 3.54 ms per loop | 3.54 ms per loop
>>> %timeit v * v            # 5.61 ms per loop | 3.52 ms per loop
>>> %timeit v // v           # 17.1 ms per loop | 13.9 ms per loop

Linear algebra
--------------

>>> m = np.ones((1024,1024), dtype='int32')

>>> %timeit m @ m            # 556 ms per loop  | 569 ms per loop

>>> m = np.ones((1024,1024), dtype='int64')

>>> %timeit m @ m            # 3.81 s per loop  | 1.01 s per loop
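
A likely explanation for the integer matmul gap: BLAS (and hence MKL)
only provides floating-point GEMM, so NumPy's integer @ falls back to
a generic loop.  If your values fit in float64's 53-bit mantissa,
casting through float64 routes the product through MKL and stays exact:

>>> m = np.ones((1024, 1024), dtype='int64')
>>> p = (m.astype('float64') @ m.astype('float64')).astype('int64')
>>> p[0, 0]                      # exact: 1024 fits easily in 53 bits
1024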

Sorting
-------

>>> v = np.random.RandomState(42).randint(1000, size=1024**2).astype('int32')

>>> %timeit np.sort(v)       # 43.4 ms per loop | 44 ms per loop

>>> v = np.random.RandomState(42).randint(1000, size=1024**2).astype('int64')

>>> %timeit np.sort(v)       # 61.5 ms per loop | 45.5 ms per loop

Indexing
--------

>>> v = np.ones(1024**2, dtype='int32')

>>> %timeit v[v[::-1]]       # 2.38 ms per loop | 4.63 ms per loop

>>> v = np.ones(1024**2, dtype='int64')

>>> %timeit v[v[::-1]]       # 6.9 ms per loop  | 3.63 ms per loop
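
The asymmetry here is consistent with fancy indexing converting the
index array to the pointer-sized integer np.intp internally: int32
indices are native on the 32-bit build, int64 indices on the 64-bit
build, and the mismatched combination pays a conversion pass on every
call.  Converting the indices up front makes that cost explicit:

>>> v = np.ones(1024**2, dtype='int32')
>>> idx = v[::-1].astype(np.intp)  # match the build's pointer width
>>> %timeit v[idx]                 # skips the per-call int32 -> intp cast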



Quick summary:
- for very simple operations, the 32-bit and 64-bit builds have the
  same performance at a given bitwidth (though throughput is uniformly
  halved on 64-bit integers when the operation is SIMD-vectorized,
  since a vector register holds half as many 64-bit lanes and twice
  the bytes must move through memory)
- for more sophisticated operations (element-wise multiplication or
  division, quicksort, and especially the matrix product), the 32-bit
  build is competitive with the 64-bit build on 32-bit ints, but lags
  behind on 64-bit ints
- for indexing, it is desirable to use a "native"-width integer,
  regardless of whether that means 32-bit or 64-bit (see the snippet
  under Indexing above)

Of course, the numbers will vary depending on the platform (read:
compiler), but some aspects of this comparison will probably translate
to other platforms.

Regards

Antoine.