[Numpy-discussion] Generator arrays
oliphant at enthought.com
Thu Jan 27 19:01:27 EST 2011
Just to start the conversation, and to find out who is interested, I would like to informally propose generator arrays for NumPy 2.0. One use-case for this concept is the deferred arrays that Mark Wiebe has proposed, but it also allows for "compressed arrays", on-the-fly computed arrays, and streamed or generated arrays.
Basically, the modification I would like to make is to add an array flag (MEMORY) that, when set, means the data attribute of a NumPy array is a pointer to the address in memory where the data begins, with the strides attribute pointing to a C-array of integers (in other words, all current arrays are MEMORY arrays).
But, when the MEMORY flag is not set, the data attribute instead points to a length-2 C-array of function pointers:
[read(N, output_address, self->index_iter, self->extra), write(N, input_address, self->index_iter, self->extra)]
Either of these could then be NULL (i.e. if write is NULL, then the array must be read-only).
When the MEMORY flag is not set, the strides member of the ndarray structure is a pointer to the index_iter object (which could be anything that the particular read and write methods need it to be).
The array structure should also get a member to hold the "extra" argument (which would hold any state that the array needed to hold on to in order to correctly perform the read or write operations --- i.e. it could hold an execution graph for deferred evaluation).
The index_iter structure is anything that the read and write methods need to correctly identify *where* to read or write. Now, clearly, we could combine index_iter and extra into just one "structure" that holds all the state needed for read and write to work correctly. The reason I propose two slots is that, at least in my mental model of the use case where these structures are calculation graphs, one of them is involved in "computing the location to read/write" and the other in "computing what to read/write".
The idea is fairly simple, but with some very interesting potential features:
* lazy evaluation (of indexing, ufuncs, etc.)
* fancy indexing as views instead of copies (really just another example of lazy evaluation)
* compressed arrays
* generated arrays (from computation or streamed data)
* infinite arrays
* computed arrays
* missing-data arrays
* ragged arrays (shape would be the bounding box --- which makes me think of ragged arrays as examples of masked arrays).
* arrays that view PIL data.
One could build an array with a (logically) infinite number of elements (we could use -2 in the shape tuple to indicate that).
We don't need examples of all of these features for NumPy 2.0 to be released. However, to really make this useful, we would need to modify all "calculation" code so that it can produce a non-MEMORY array. What to do here still needs a lot of thought and experimentation.
But, I can imagine a situation where all NumPy calculations that produce arrays provide the option that, when they are done inside of a particular context, a user-supplied behavior overrides the default return. I want to study what Mark is proposing and understand his new iterator at a deeper level before providing more thoughts here.
That's the gist of what I am thinking about. I would love feedback and comments.
The other things I would like to see in NumPy 2.0 that have not been discussed lately (that could affect the ABI) are:
* a geometry member on the data structure (that allows labels for dimensions and axes to be provided -- a la data_array)
* small array performance improvements that Mark Wiebe has suggested (including the addition of an optional low-level loop that is used when you have contiguous data)
* completed datetime implementation
* pointer data-types (i.e. the memory location holds a pointer to another part of an ndarray) --- very useful for "join"-type arrays
If anybody is interested in helping with any of these (and has time to do it), let me know. Some of this I could fund (especially if you are willing to come to Austin and be an intern for Enthought).
P.S. I hope to have more time this year to hang-out here on the numpy-discussion list (but we will see....)