
As a followup to my previous post here is a discussion and overview of the current NumPy2 design as I have it in my head and partially implemented in the numpy2 module on the CVS tree at numpy's sourceforge site. The design of NumPy2 is quite simple and tries to balance speed with flexibility and modifiability. An outline of the design follows (the names can change, they are only reference at the moment) Three classes replacing the current C structures: ArrayType --- replaces PyArray_Descr # not implemented yet Ufunc --- replaces PyUfuncObject NDArray --- replaces PyArrayObject The purpose of each class is to encapsulate interfaces to allow code re-use for similar operations. NDArray # This is implemented except for the # operations (ufuncs) ======================== The most concrete class is the NDArray class --- it just needs coding to make it happen. The other classes still need some design work to efficiently handle mixed type operations and additions of new types to the system. The NDArray base class gives an N-Dimensional array interpretation to a Python buffer (a segment of memory, an m-mapped file, a PIL image, etc.). It provides this interpretation with three special attributes: self.rank --- the dimension of the array (hard-coded changeable limit of 10). self._data --- a buffer object pointing to the data self._structure --- a buffer object pointing to an array of INTEGERS which holds the dimensions and strides information (INTEGER) is a platform-dependent type #defined in compiled code self._descr --- Python class describing the type. To interact seamlessly with the C-API and be recognized as "an array" all subclasses must either export an __array__ method which creates a suitable NDArray or not interfere with these provided attributes. Note that the same data segment can be viewed in several different ways. The NDArray will have default implementations for the numeric operations that will resemble the current implementation. But, it will be easy to subclass the array to handle these operations as you'd like without losing the ability to use the data in that array in extension modules which assume array inputs. Two other attributes are worth mentioning: self.CONTIGUOUS # this can be determined from the _structure information # but it is useful to keep a flag around indicating the # status. This tells you whether or not you can # walk through the entire array an element at a # time with a single for loop. self.FORTRANVIEW # This basically indicates how the array will view it's # shape when asked and indexed (This does not change the # _structure information). An array of "shape" # (10,3,5,7) when # FORTRANVIEW is 0 will be an array of "shape" (7,5,3,10) # when FORTRANVIEW is 1 ArrayType: # This is not implemented yet. =============================== This class is to replace the PyArray_Descr structure in current Numerical Python. As a result, it must contain the information: self.name ---- some kind of object to identify it (a string) self.elsize ---- size of an item of this kind. self.cast ---- a dictionary of compiled functions with at least one entry called to cast this type to at least one other type self.getitem ---- (Compiled) function self.setitem ---- (Compiled) function self.zeros ---- needed for the zeros command to include arrays of Python Objects. It points to the representation of zero for this type. Currently, the above is not implemented, yet. What is implemented is a module _arraytypes which exposes to Python the PyArray_Descr structure so that it can be used. The idea of adding new types to Numerical Python without having to change all of the code is appealing to me, however. Ufunc: # This is partly implemented ========================================== There are two ideas here. I've partly implemented the first one which I'll explain. The second was presented to me by Paul Barrett. Ufunc's are encapsulations of the N-D looping construct and the broadcasting rules of Numerical Python. The N-D looping construct is limited to the fixed but arbitrary 10 dimensions as given above for C code but can be arbitrary if a Python function is called at each iteration. I explain Ufuncs in a piece called Ufuncexplain.txt which is on the CVS tree. Here is a quote that explains broadcasting rules: 1) If input arrays do not have the same rank. The array with lower rank will be prepended with ones until ranks agree. 2) If an input array has length one, then "duplicate" the elements along that dimension so that input shapes agree. Example: A is an array of shape (10,) B is an array of shape (3,10) A * B will return an array of shape (3,10): - A is interpreted as shape (1,10) - the columns of A are "broadcast" across the rows of B Thus the output is (3,10): [ A*B[0] A*B[1] A*B[2] ] Note that A is not actually extended to a (3,10) array. It merely behaves as if it had been. The element-wise math operators are implemented using Ufuncs. SpecialFuncs is a Python package at http://oliphant.netpedia.net that impelements a whole range of special functions using the Ufunc formalism. I also include in that package a general arraymap function which can turn any Python function into a broadcasting "ufunc-mimic" This code does not have the rank 10 limit on the number of dimensions on the inputs -- but it might be slower than the current implementation. My current implementation assumes that the Ufunc instantiator will provide two functions: a select function and a compute function (either of these can be in C or Python). The select function and the compute function work together. The select function determines the type of the outputs based on the input types, while the compute function takes the inputs and outputs (and their types) and computes the ouptut. This is done on either an entire block of memory (optimized ufuncs) or one-element-at-a-time (unoptimized ufuncs---Python coded ufuncs for example). This allows for efficient coding and the possibility of mixed type arithmetic with a more complicated creation process. It also may be hard to add new types and have them function as you'd like without modifying others already-defined ufuncs. But, I know this idea will work and I can see my way through it. I've already implemented an "addition" function using this method. Another idea that has been presented is to instantiate a Ufunc with only one function that is entered into a "dispatch table" or dictionary of functions keyed by the ArrayType class much like the Multimethod approach that has been discussed. I like this idea, but I do not see the details (I haven't thought about it too much) and do not know if we can actually make it work --- Frankly, I think we can and it will result in a better system. What hasn't been thought through is exactly what is entered into the "function" dictionary and when is it called. Some have suggested that it is called "immediately" upon ufunc call, but this would eliminate the benefits of the encapsulated broadcasting rules. An alternative would be to call it after the "broadcasting rule encapsulation" as been done. In other words use it as a replacement for the current array of functions in the Ufunc implementation. With the appropriate mix of C-modules and Python code I think this could be done quite elegantly. The other issue that has to be worked out (again) is that obviously this table will not be filled out in every case (for a function with 10 inputs and 10 outputs with 16 types we are talking 16^20 different entries in the table and will be sparse) so what is done when there is not an entry for a particular combination will require some thought. I think we could allow multiple behaviors according to some attribute of the ufunc (casting, exception raising, etc.,) is set. Many people might fear this would inhibit code re-use but I have not seen convincing examples. So, that is a brief overview of the state of things. It doesn't try to cover everything, but it should give you enough of a perspective to understand the code that is on the CVS tree under module numpy2. I have been using C for the compiled code because it is easy to interface to and it has the widest platform support and because Python istelf is written in C. Anybody with specific questions (including offers to help) can feel free to contact me or post to this list. Thanks, Travis Oliphant