[Numpy-discussion] Proposal for making of Numarray a real Numeric 'NG'

Fri Jan 21 08:49:10 EST 2005

On Jan 21, 2005, at 8:13 AM, Francesc Altet wrote:

> Hi List,
>
> I would like to make a formal proposal regarding with the subject of
> previous discussions in that list. This message is a bit long, but I've
> tried my best to expose my thoughts as clearly as possible.
>
[...]
I think Francesc has summarized things very well and offered up some
good ideas for how to proceed in speeding up small array performance.
Particularly key is understanding exactly where the the time is going
in the processing. We (read Todd, really) has some suspicions about what
the bottlenecks are and I'll include his conclusions about these below.
I just caution that getting good benchmark information to determine
this correctly can be more difficult that it would first seem for the
reasons he mentions. But anyway I'm certainly supportive of getting an
effort going to address this issue (again, we can give support as we
described before, but if it is to be done in the near term, others
will have to actually do most of the work). A wiki page sounds like
a good idea, and it probably should be hosted on scipy.org. If we see
any response to this I'll ask to have one set up.

> The real problem for Numarray: Object Creation Time
> ===================================================
>
> On the other hand, the main drawback of Numarray vs Numeric is, in my
> opinion, its poor performance regarding object creation. This might 
> look
> like a banal thing at first glance, but it is not in many cases. One 
> example
> recently reported in this list is:
>
>>>> from timeit import Timer
>>>> setup = 'import Numeric; a = Numeric.arange(2000);a.shape=(1000,2)'
>>>> Timer('for i in range(len(a)): row=a[i]', setup).timeit(number=100)
> 0.12782907485961914
>>>> setup = 'import numarray; a = 
>>>> numarray.arange(2000);a.shape=(1000,2)'
>>>> Timer('for i in range(len(a)): row=a[i]', setup).timeit(number=100)
> 1.2013700008392334
>
> So, numarray performs 10 times slower than Numeric not because its 
> indexing
> access code would be 10 times slower, but mainly due to the fact that 
> object
> creation is between 5 and 10 times slower, and the loop above implies 
> an
> object creation on each iteration.
>
> Other case of use where object creation time is important can be seen 
> in
> [4].
>
It probably is perhaps too narrow to focus on just array creation. It 
likely
is the biggest factor but there may be other issues as well. For the 
above
case it's possible that the indexing mechanism itself can be speeded 
up, and
that is likely part of the ratio of speeds being 5 to 10 times slower.

Todd's comments:

Here's a little input for how someone can continue looking at this.

Here's the semi-solid info I have at the moment on ufunc execution
time;  included within it is a breakdown of some of the costs in the
C-API function NA_NewAllFromBuffer() located in newarray.ch.  I haven't
been working on this;  this is where I left off.

My timing module, numarray.teacup, may be useful to someone else trying
to measure timing;  the accuracy of the measurements is questionable
either due to bugs or the intrusiveness of the inline code disturbing
the processor cache (it does dictionary operations for each timing
measurement).  I tried to design it so that timing measurements can be
nested,  with limited success.  Nevertheless, as a rough guide that
provides microsecond level measurements,  I have found it useful.  It
only works on linux.

Build numarray like this:

% rm -rf build
% python setup.py install --timing --force

Then do this to see the cost of the generated output array in an add():

>>> import numarray as na
>>> a = na.arange(10)
>>> b = a.copy()
>>> for i in range(101):
...     jnk = na.add(a,b)
...
>>> import numarray.teacup as tc
>>> tc.report()
Src/_ufuncmodule.c              _cache_exec2 fast               count:  
  101  avg_usec:    4.73 cycles: 0
Src/_ufuncmodule.c              _cache_lookup2 broadcast        count:  
  101  avg_usec:    4.46 cycles: 0
Src/_ufuncmodule.c              _cache_lookup2 hit or miss      count:  
  101  avg_usec:   27.50 cycles: 6
Src/_ufuncmodule.c              _cache_lookup2 hit output       count:  
  100  avg_usec:   25.22 cycles: 5
Src/_ufuncmodule.c              _cache_lookup2 internal         count:  
  101  avg_usec:    5.20 cycles: 0
Src/_ufuncmodule.c              _cache_lookup2 miss             count:  
    0  avg_usec:     nan cycles: 0
Src/_ufuncmodule.c              cached_dispatch2 exec           count:  
  101  avg_usec:   13.65 cycles: 1
Src/_ufuncmodule.c              cached_dispatch2 lookup         count:  
  101  avg_usec:   37.35 cycles: 9
Src/_ufuncmodule.c              cached_dispatch2 overall        count:  
  101  avg_usec:   53.36 cycles: 12
Src/libnumarraymodule.c         NewArray __new__                count:  
  304  avg_usec:    8.12 cycles: 0
Src/libnumarraymodule.c         NewArray buffer                 count:  
  304  avg_usec:    5.37 cycles: 0
Src/libnumarraymodule.c         NewArray misc                   count:  
  304  avg_usec:    0.25 cycles: 0
Src/libnumarraymodule.c         NewArray type                   count:  
  304  avg_usec:    0.27 cycles: 0
Src/libnumarraymodule.c         NewArray update                 count:  
  304  avg_usec:    1.16 cycles: 0
Src/libteacupmodule.c           calibration nested              
count:999999  avg_usec:   -0.00 cycles: 1
Src/libteacupmodule.c           calibration top                 
count:999999  avg_usec:   -0.00 cycles: 0

I would caution anyone working on this that there are at least three
locations in the code (some of it redundancy inserted for the purpose of
performance optimization, some of it the consequences of having a class
hierarchy) that need to be considered:  _ndarraymodule.c,
_numarraymodule.c, and newarray.ch.

My suspicions:

1.  Having an independent buffer/memory object rather than just
mallocing the array storage.  This is expensive in a lot of ways:  it's
an extra hidden object and also a Python function call.  The ways I've
thought of for Eliminating this add complexity and make numarray even
more modal than it already is.

2. Being implemented as a new style class;  this is an unknown cost and
involves the creation of still other extra objects,  like the object()
dictionary,  but presumably that has been fairly well optimized already.
Calling up the object hierarchy to build the object (__new__) probably
has additional overheads.

Things to try:

1. Retain a free-list/cache of small objects and re-use them rather than
creating/destroying all the time.  Use a constant storage size and fit
any small array into that space.  I think this is the killer technique
that would solve the small array problem without kludging up everything
else.  Do this first,  and only then see if (2) or (3) need to be done.

2. Flatten the class hiearchy more (at least for construction) and
remove any redundancy by refactoring.

3. Build in a malloc/free mode for array storage which bypasses the
memorymodule completely and creates buffer objects when _data is
accessed.  Use the OWN_DATA bit in arrayobject.flags.

> The real problem for Numarray: Object Creation Time
>> ===================================================
>>
>> On the other hand, the main drawback of Numarray vs Numeric is, in my
>> opinion, its poor performance regarding object creation. This might
>> look
>> like a banal thing at first glance, but it is not in many cases. One
>> example
>> recently reported in this list is:
>>
>>>>> from timeit import Timer
>>>>> setup = 'import Numeric; a = Numeric.arange(2000);a.shape=(1000,2)'
>>>>> Timer('for i in range(len(a)): row=a[i]', setup).timeit(number=100)
>> 0.12782907485961914
>>>>> setup = 'import numarray; a =
>>>>> numarray.arange(2000);a.shape=(1000,2)'
>>>>> Timer('for i in range(len(a)): row=a[i]', setup).timeit(number=100)
>> 1.2013700008392334
>>
>> So, numarray performs 10 times slower than Numeric not because its
>> indexing
>> access code would be 10 times slower, but mainly due to the fact that
>> object
>> creation is between 5 and 10 times slower, and the loop above implies
>> an
>> object creation on each iteration.
>>
>> Other case of use where object creation time is important can be seen
>> in
>> [4].

One thing to note here is that NumArray() is really used to create
numarray arrays,  while array() is used to create Numeric arrays.  In
numarray,  array() is a Python function which can be optimized to C in
its own right.  That alone will not fix the problem though.  NumArray
itself must be optimized.

>>
>> Proposal for making of Numarray a real Numeric 'NG' (Next Generation)
>> =====================================================================
>>
>> Provided that the most important reason (IMO) to not consider Numarray
>> to be
>> a good replacement of Numeric is object creation time, I would like to
>> propose a coordinated effort to solve this precise issue.

I think that is one place to optimize,  and the best I'm aware of,  but
there's a lot of Python in numarray,  and a single "." is enough to blow
performance out of the water.  I think this problem is easily solveable
for small arrays with a modest effort.  There are a lot of others though
(moving the NumArray number protocol to C is one that comes to mind.)