On Jan 21, 2005, at 8:13 AM, Francesc Altet wrote:
Hi List,
I would like to make a formal proposal regarding the subject of previous discussions on this list. This message is a bit long, but I've tried my best to lay out my thoughts as clearly as possible.
[...] I think Francesc has summarized things very well and offered up some good ideas for how to proceed in speeding up small-array performance. Particularly key is understanding exactly where the time is going in the processing. We (read Todd, really) have some suspicions about what the bottlenecks are, and I'll include his conclusions about these below. I just caution that getting good benchmark information to determine this correctly can be more difficult than it would first seem, for the reasons he mentions. But anyway, I'm certainly supportive of getting an effort going to address this issue (again, we can give support as we described before, but if it is to be done in the near term, others will have to actually do most of the work). A wiki page sounds like a good idea, and it probably should be hosted on scipy.org. If we see any response to this, I'll ask to have one set up.
The real problem for Numarray: Object Creation Time
===================================================
On the other hand, the main drawback of Numarray vs Numeric is, in my opinion, its poor performance regarding object creation. This might look like a banal thing at first glance, but in many cases it is not. One example recently reported on this list is:
>>> from timeit import Timer
>>> setup = 'import Numeric; a = Numeric.arange(2000); a.shape=(1000,2)'
>>> Timer('for i in range(len(a)): row=a[i]', setup).timeit(number=100)
0.12782907485961914
>>> setup = 'import numarray; a = numarray.arange(2000); a.shape=(1000,2)'
>>> Timer('for i in range(len(a)): row=a[i]', setup).timeit(number=100)
1.2013700008392334
So, numarray performs 10 times slower than Numeric not because its indexing code is 10 times slower, but mainly because object creation is between 5 and 10 times slower, and the loop above implies an object creation on each iteration.
Another use case where object creation time is important can be seen in [4].
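The same effect can be demonstrated with the standard library alone: any loop whose body must build a fresh object per iteration pays a per-object creation tax on top of the indexing itself. A minimal sketch of the pattern (list-based and purely illustrative, not the original numarray benchmark):

```python
from timeit import Timer

setup = 'a = list(range(2000))'

# Indexing that returns an already-existing object: no creation cost.
t_reuse = Timer('for i in range(1000): row = a[i]', setup).timeit(number=100)

# Indexing plus construction of a new wrapper object on each iteration:
# the extra time is pure object-creation overhead, analogous to the row
# objects numarray must build in the loop above.
t_create = Timer('for i in range(1000): row = [a[i]]', setup).timeit(number=100)

print('reuse: %.4fs  create: %.4fs' % (t_reuse, t_create))
```

Both loops perform the identical index operation; the second is measurably slower only because it constructs an object each time around.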
It is probably too narrow to focus on just array creation. It is likely the biggest factor, but there may be other issues as well. For the above case it's possible that the indexing mechanism itself can be sped up, and that is likely part of why the ratio is 5 to 10 times slower.

Todd's comments:

Here's a little input on how someone can continue looking at this. Here's the semi-solid info I have at the moment on ufunc execution time; included within it is a breakdown of some of the costs in the C-API function NA_NewAllFromBuffer(), located in newarray.ch. I haven't been working on this; this is where I left off.

My timing module, numarray.teacup, may be useful to someone else trying to measure timing; the accuracy of the measurements is questionable, either due to bugs or to the intrusiveness of the inline code disturbing the processor cache (it does dictionary operations for each timing measurement). I tried to design it so that timing measurements can be nested, with limited success. Nevertheless, as a rough guide that provides microsecond-level measurements, I have found it useful. It only works on Linux.

Build numarray like this:

% rm -rf build
% python setup.py install --timing --force

Then do this to see the cost of the generated output array in an add():
>>> import numarray as na
>>> a = na.arange(10)
>>> b = a.copy()
>>> for i in range(101):
...     jnk = na.add(a,b)
...
>>> import numarray.teacup as tc
>>> tc.report()
Src/_ufuncmodule.c       _cache_exec2      fast         count:    101  avg_usec:  4.73  cycles:  0
Src/_ufuncmodule.c       _cache_lookup2    broadcast    count:    101  avg_usec:  4.46  cycles:  0
Src/_ufuncmodule.c       _cache_lookup2    hit or miss  count:    101  avg_usec: 27.50  cycles:  6
Src/_ufuncmodule.c       _cache_lookup2    hit output   count:    100  avg_usec: 25.22  cycles:  5
Src/_ufuncmodule.c       _cache_lookup2    internal     count:    101  avg_usec:  5.20  cycles:  0
Src/_ufuncmodule.c       _cache_lookup2    miss         count:      0  avg_usec:   nan  cycles:  0
Src/_ufuncmodule.c       cached_dispatch2  exec         count:    101  avg_usec: 13.65  cycles:  1
Src/_ufuncmodule.c       cached_dispatch2  lookup       count:    101  avg_usec: 37.35  cycles:  9
Src/_ufuncmodule.c       cached_dispatch2  overall      count:    101  avg_usec: 53.36  cycles: 12
Src/libnumarraymodule.c  NewArray          __new__      count:    304  avg_usec:  8.12  cycles:  0
Src/libnumarraymodule.c  NewArray          buffer       count:    304  avg_usec:  5.37  cycles:  0
Src/libnumarraymodule.c  NewArray          misc         count:    304  avg_usec:  0.25  cycles:  0
Src/libnumarraymodule.c  NewArray          type         count:    304  avg_usec:  0.27  cycles:  0
Src/libnumarraymodule.c  NewArray          update       count:    304  avg_usec:  1.16  cycles:  0
Src/libteacupmodule.c    calibration       nested       count: 999999  avg_usec: -0.00  cycles:  1
Src/libteacupmodule.c    calibration       top          count: 999999  avg_usec: -0.00  cycles:  0
I would caution anyone working on this that there are at least three locations in the code (some of it redundancy inserted for the purpose of performance optimization, some of it a consequence of having a class hierarchy) that need to be considered: _ndarraymodule.c, _numarraymodule.c, and newarray.ch.

My suspicions:

1. Having an independent buffer/memory object rather than just mallocing the array storage. This is expensive in a lot of ways: it's an extra hidden object and also a Python function call. The ways I've thought of for eliminating this add complexity and make numarray even more modal than it already is.

2. Being implemented as a new-style class; this is an unknown cost and involves the creation of still more extra objects, like the object's dictionary, but presumably that has been fairly well optimized already. Calling up the object hierarchy to build the object (__new__) probably adds further overhead.

Things to try:

1. Retain a free-list/cache of small objects and re-use them rather than creating/destroying them all the time. Use a constant storage size and fit any small array into that space. I think this is the killer technique that would solve the small-array problem without kludging up everything else. Do this first, and only then see if (2) or (3) need to be done.

2. Flatten the class hierarchy more (at least for construction) and remove any redundancy by refactoring.

3. Build in a malloc/free mode for array storage which bypasses the memory module completely and creates buffer objects only when _data is accessed. Use the OWN_DATA bit in arrayobject.flags.
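Todd's first suggestion (a free-list of constant-size slots) can be sketched in pure Python to show the intended acquire/release discipline. The class name, slot size, and API below are invented for illustration; the real change would live in C, inside newarray.ch:

```python
class SmallArrayPool:
    """Toy free-list: recycle fixed-size buffers instead of reallocating.

    Illustrates the caching idea only; numarray would do this in C inside
    its array constructor rather than at the Python level.
    """
    SLOT_BYTES = 64  # constant storage size; any small array fits one slot

    def __init__(self):
        self._free = []  # stack of recycled buffers

    def acquire(self, nbytes):
        if nbytes > self.SLOT_BYTES:
            return bytearray(nbytes)   # large arrays bypass the pool
        if self._free:
            return self._free.pop()    # reuse a slot instead of creating
        return bytearray(self.SLOT_BYTES)

    def release(self, buf):
        # Only constant-size slots go back on the free-list.
        if len(buf) == self.SLOT_BYTES:
            self._free.append(buf)

pool = SmallArrayPool()
b1 = pool.acquire(16)
pool.release(b1)
b2 = pool.acquire(16)
assert b2 is b1  # the buffer was recycled, not newly created
```

The point of the constant SLOT_BYTES is that every small array fits the same slot, so any released buffer can satisfy any subsequent small request without a size match.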
One thing to note here is that NumArray() is really used to create numarray arrays, while array() is used to create Numeric arrays. In numarray, array() is a Python function which can be optimized to C in its own right. That alone will not fix the problem though. NumArray itself must be optimized.
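The distinction matters because a Python-level factory adds a call frame on top of whatever the class itself costs, so rewriting array() in C removes only that outer layer. A toy stdlib-only illustration (NumArrayLike and its factory are invented stand-ins, not numarray code):

```python
from timeit import Timer

class NumArrayLike:
    """Stand-in for the NumArray class; its construction cost is the floor."""
    def __init__(self, data):
        self.data = data

def array(data):
    """Stand-in for numarray's Python-level array() factory."""
    return NumArrayLike(data)

n = 100000
# Construct the class directly vs. going through the Python factory.
t_direct = Timer('NumArrayLike([1, 2])', globals=globals()).timeit(number=n)
t_factory = Timer('array([1, 2])', globals=globals()).timeit(number=n)
print('direct: %.4fs  via factory: %.4fs' % (t_direct, t_factory))
```

The factory path is strictly slower, but eliminating the factory entirely still leaves t_direct untouched, which is the point above: NumArray itself must be optimized.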
Proposal for making Numarray a real Numeric 'NG' (Next Generation)
==================================================================
Given that (IMO) the most important reason not to consider Numarray a good replacement for Numeric is object creation time, I would like to propose a coordinated effort to solve this specific issue.
I think that is one place to optimize, and the best I'm aware of, but there's a lot of Python in numarray, and a single "." is enough to blow performance out of the water. I think this problem is easily solvable for small arrays with a modest effort. There are a lot of others though (moving the NumArray number protocol to C is one that comes to mind).
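The cost of that single "." can be seen in isolation with timeit: each attribute lookup in a hot loop is a dictionary probe that can be hoisted out. A generic CPython illustration, not numarray-specific:

```python
from timeit import Timer

setup = 'items = []'

# The '.' resolved on every iteration: one attribute lookup per append.
t_dot = Timer('for i in range(1000): items.append(i)',
              setup).timeit(number=1000)

# Lookup hoisted out of the loop; only the bound-method call remains.
t_hoisted = Timer('ap = items.append\nfor i in range(1000): ap(i)',
                  setup).timeit(number=1000)

print('per-iteration lookup: %.3fs  hoisted: %.3fs' % (t_dot, t_hoisted))
```

Moving numarray's number protocol to C is the same idea taken further: removing Python-level lookups and call frames from the hot path entirely.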