[Numpy-discussion] Index Array Performance

Tue Feb 14 04:58:02 EST 2012

On Mon, Feb 13, 2012 at 23:23, Marcel Oliver
<m.oliver at jacobs-university.de> wrote:
> Hi,
>
> I have a short piece of code where the use of an index array "feels
> right", but incurs a severe performance penalty: It's about an order
> of magnitude slower than all other operations with arrays of that
> size.
>
> It comes up in a piece of code which is doing a large number of "on
> the fly" histograms via
>
>  hist[i,j] += 1
>
> where i is an array with the bin index to be incremented and j is
> simply enumerating the histograms.  I attach a full short sample code
> below which shows how it's being used in context, and corresponding
> timeit output from the critical code section.

Other people have explained that yes, applying index arrays is slow. I
would just like to add the tangential point that this code does not
behave the way that you think it does. You cannot make histograms like
this. The statement "hist[i,j] += 1" gets broken down into three
separate statements by the Python compiler:

  tmp = hist.__getitem__((i,j))
  tmp = tmp.__iadd__(1)
  hist.__setitem__((i,j), tmp)

Note that tmp is a new array with copies of the data in hist at the
(i,j) locations, possibly multiple copies if the same (i,j) location
is indexed more than once. Each one of these copies gets incremented
by 1, then the __setitem__() will apply each of those in turn to the
appropriate cell in hist, each one simply overwriting the previous
one. A cell that is indexed several times therefore ends up
incremented by 1, not by the number of times it appears.
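
A minimal sketch of the effect, using a 1-D array and a made-up index
array idx rather than the original poster's code. The two alternatives
shown are assumptions about what is available to you: np.add.at only
exists in NumPy 1.8 and later (newer than this thread), and np.bincount
covers the plain 1-D case only.

```python
import numpy as np

# Hypothetical bin indices; bin 1 is hit three times.
idx = np.array([0, 1, 1, 3, 1])

# Fancy-indexed in-place add: the three-step expansion copies the old
# values, increments the copies, then writes them back, so repeated
# indices just overwrite each other.
hist_buggy = np.zeros(4, dtype=int)
hist_buggy[idx] += 1          # bin 1 ends up at 1, not 3

# Unbuffered ufunc accumulation (NumPy >= 1.8): every occurrence of an
# index contributes its own increment.
hist_at = np.zeros(4, dtype=int)
np.add.at(hist_at, idx, 1)    # bin 1 ends up at 3

# For a single 1-D histogram, bincount counts occurrences directly.
hist_bc = np.bincount(idx, minlength=4)
```

For the two-index form in the question, np.add.at(hist, (i, j), 1)
should accumulate in the same way, though it is not necessarily faster
than the fancy-indexed version.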

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco