[Numpy-discussion] poor performance of sum with sub-machine-word integer types

Keith Goodman kwgoodman at gmail.com
Tue Jun 21 13:17:38 EDT 2011


On Tue, Jun 21, 2011 at 9:46 AM, Zachary Pincus <zachary.pincus at yale.edu> wrote:
> Hello all,
>
> As a result of the "fast greyscale conversion" thread, I noticed an anomaly with numpy.ndarray.sum(): summing along certain axes is much slower with sum() than with doing the additions explicitly, but only for integer dtypes smaller than the machine word. I checked in 32-bit and 64-bit modes, and in both cases the speed difference only disappears once the dtype reaches the machine word size. See below...
>
> Is this something to do with numpy or something inexorable about machine / memory architecture?
>
> Zach
>
> Timings -- 64-bit mode:
> ----------------------
> In [2]: i = numpy.ones((1024,1024,4), numpy.int8)
> In [3]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 131 ms per loop
> In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 2.57 ms per loop
>
> In [5]: i = numpy.ones((1024,1024,4), numpy.int16)
> In [6]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 131 ms per loop
> In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 4.75 ms per loop
>
> In [8]: i = numpy.ones((1024,1024,4), numpy.int32)
> In [9]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 131 ms per loop
> In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 6.37 ms per loop
>
> In [11]: i = numpy.ones((1024,1024,4), numpy.int64)
> In [12]: timeit i.sum(axis=-1)
> 100 loops, best of 3: 16.6 ms per loop
> In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 15.1 ms per loop
>
>
>
> Timings -- 32-bit mode:
> ----------------------
> In [2]: i = numpy.ones((1024,1024,4), numpy.int8)
> In [3]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 138 ms per loop
> In [4]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 3.68 ms per loop
>
> In [5]: i = numpy.ones((1024,1024,4), numpy.int16)
> In [6]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 140 ms per loop
> In [7]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 4.17 ms per loop
>
> In [8]: i = numpy.ones((1024,1024,4), numpy.int32)
> In [9]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 22.4 ms per loop
> In [10]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 100 loops, best of 3: 12.2 ms per loop
>
> In [11]: i = numpy.ones((1024,1024,4), numpy.int64)
> In [12]: timeit i.sum(axis=-1)
> 10 loops, best of 3: 29.2 ms per loop
> In [13]: timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
> 10 loops, best of 3: 23.8 ms per loop

One difference is that i.sum() promotes the output dtype when the
input's integer dtype is smaller than the default integer dtype:

    >> i.dtype
       dtype('int32')
    >> i.sum(axis=-1).dtype
       dtype('int64') #  <-- dtype changed
    >> (i[...,0]+i[...,1]+i[...,2]+i[...,3]).dtype
       dtype('int32')
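One way to keep the reduction in the input dtype is sum()'s dtype argument; a minimal sketch (small array for illustration; whether forcing the accumulator width also restores the fast path on the machines timed above is an assumption worth checking with timeit):

```python
import numpy as np

i = np.ones((4, 4, 4), dtype=np.int32)

# By default, sum() accumulates in the platform default integer type,
# which is wider than int32 on most 64-bit builds.
# Passing dtype= keeps the accumulator (and the output) at int32.
s = i.sum(axis=-1, dtype=i.dtype)

print(s.dtype)   # int32
print(s[0, 0])   # 4 (four ones summed along the last axis)
```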

Here are my timings:

    >> i = numpy.ones((1024,1024,4), numpy.int32)
    >> timeit i.sum(axis=-1)
    1 loops, best of 3: 278 ms per loop
    >> timeit i[...,0]+i[...,1]+i[...,2]+i[...,3]
    100 loops, best of 3: 12.1 ms per loop
    >> import bottleneck as bn
    >> timeit bn.func.nansum_3d_int32_axis2(i)
    100 loops, best of 3: 8.27 ms per loop

Does making an extra copy of the input explain all of the speed
difference (is this what np.sum does internally?):

    >> timeit i.astype(numpy.int64)
    10 loops, best of 3: 29.2 ms per loop

No.
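For comparison, casting up-front and then reducing in the wider dtype is another thing worth timing; a hedged sketch (not necessarily what np.sum does internally, just a workaround when an int64 result is actually wanted):

```python
import numpy as np

i = np.ones((256, 256, 4), dtype=np.int8)

# Cast once to the wide type, then reduce; the reduction itself then
# runs entirely at machine-word width.
s = i.astype(np.int64).sum(axis=-1)

print(s.dtype)   # int64
print(s[0, 0])   # 4
```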

Initializing the output also adds some time:

    >> timeit np.empty((1024,1024,4), dtype=np.int32)
    100000 loops, best of 3: 2.67 us per loop
    >> timeit np.empty((1024,1024,4), dtype=np.int64)
    100000 loops, best of 3: 12.8 us per loop
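To reproduce these allocation timings outside of IPython's timeit magic, a sketch using the stdlib timeit module (sizes taken from the thread; absolute numbers will vary by machine and allocator):

```python
import timeit
import numpy as np

# np.empty only reserves memory without filling it, so this mostly
# measures allocation cost; the int64 buffer is twice the size of the
# int32 one.
t32 = timeit.timeit(lambda: np.empty((1024, 1024, 4), dtype=np.int32),
                    number=1000)
t64 = timeit.timeit(lambda: np.empty((1024, 1024, 4), dtype=np.int64),
                    number=1000)

print(f"int32: {t32 / 1000 * 1e6:.2f} us per allocation")
print(f"int64: {t64 / 1000 * 1e6:.2f} us per allocation")
```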

Moving back and forth between the input and output arrays also costs
more memory bandwidth with int64 arrays than with int32.



More information about the NumPy-Discussion mailing list