[Numpy-discussion] Generator arrays
Dag Sverre Seljebotn
dagss at student.matnat.uio.no
Fri Jan 28 06:37:33 EST 2011
On 01/28/2011 01:01 AM, Travis Oliphant wrote:
> Just to start the conversation, and to find out who is interested, I would like to informally propose generator arrays for NumPy 2.0. One use case for this concept is the deferred arrays that Mark Wiebe has proposed. But it also allows for "compressed arrays", on-the-fly computed arrays, and streamed or generated arrays.
> Basically, the modification I would like to make is to have an array flag (MEMORY) that, when set, means that the data attribute of a NumPy array is a pointer to the address in memory where the data begins, with the strides attribute pointing to a C-array of integers (in other words, all current arrays are MEMORY arrays).
> But, when the MEMORY flag is not set, the data attribute instead points to a length-2 C-array of pointers to functions:
> [read(N, output_address, self->index_iter, self->extra), write(N, input_address, self->index_iter, self->extra)]
> Either of these could then be NULL (i.e. if write is NULL, then the array must be read-only).
> When the MEMORY flag is not set, the strides member of the ndarray structure is a pointer to the index_iter object (which could be anything that the particular read and write methods need it to be).
> The array structure should also get a member to hold the "extra" argument (which would hold any state that the array needed to hold on to in order to correctly perform the read or write operations --- i.e. it could hold an execution graph for deferred evaluation).
> The index_iter structure is anything that the read and write methods need to correctly identify *where* to write. Now, clearly, we could combine index_iter and extra into just one "structure" that holds all needed state for read and write to work correctly. The reason I propose two slots is that, at least mentally, in the use case where these structures are calculation graphs, one of them is involved in "computing the location to read/write" and the other is involved in "computing what to read/write".
> The idea is fairly simple, but with some very interesting potential features:
> * lazy evaluation (of indexing, ufuncs, etc.)
> * fancy indexing as views instead of copies (really just another example of lazy evaluation)
> * compressed arrays
> * generated arrays (from computation or streamed data)
> * infinite arrays
> * computed arrays
> * missing-data arrays
> * ragged arrays (shape would be the bounding box --- which makes me think of ragged arrays as examples of masked arrays).
> * arrays that view PIL data.
> One could build an array with a (logically) infinite number of elements (we could use -2 in the shape tuple to indicate that).
> We don't need examples of all of these features for NumPy 2.0 to be released, because to really make this useful, we would need to modify all "calculation" code to produce a NON MEMORY array. What to do here still needs a lot of thought and experimentation.
> But, I can think about a situation where all NumPy calculations that produce arrays provide the option that when they are done inside of a particular context, a user-supplied behavior over-rides the default return. I want to study what Mark is proposing and understand his new iterator at a deeper level before providing more thoughts here.
> That's the gist of what I am thinking about. I would love feedback and comments.
I guess my reaction is along the lines of Charles': Why can't "a + b",
where a and b are NumPy arrays, simply return an object of a different
type that is lazily evaluated? Why can't infinite arrays simply be yet
another type?
Of course, much useful functionality should then be refactored into a
new "abstract array" class, and iterators etc. be given an API that
works with more than one type.
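As a rough sketch of what such a refactoring could look like (this is not an
existing NumPy API; the names AbstractArray, MemoryArray, LazySum and
InfiniteConstant are hypothetical and invented here for illustration), an
abstract base class could define the shared interface while each concrete
type decides how its elements are produced. Broadcasting and general slicing
are left out to keep it short:

```python
from abc import ABC, abstractmethod

import numpy as np


class AbstractArray(ABC):
    """Hypothetical common interface; not part of NumPy."""

    @property
    @abstractmethod
    def shape(self):
        """Tuple of dimension lengths (None marks an unbounded axis)."""

    @abstractmethod
    def __getitem__(self, index):
        """Return the element(s) selected by `index`."""

    def __add__(self, other):
        # Only record the operands; no arithmetic happens here.
        return LazySum(self, other)


class MemoryArray(AbstractArray):
    """Wraps an ordinary in-memory ndarray."""

    def __init__(self, data):
        self._data = np.asarray(data)

    @property
    def shape(self):
        return self._data.shape

    def __getitem__(self, index):
        return self._data[index]


class LazySum(AbstractArray):
    """Element-wise sum, evaluated only for the indices requested."""

    def __init__(self, left, right):
        self._left = left
        self._right = right

    @property
    def shape(self):
        # Simplification: reuse the left operand's shape, no broadcasting.
        return self._left.shape

    def __getitem__(self, index):
        return self._left[index] + self._right[index]


class InfiniteConstant(AbstractArray):
    """1-D array of unbounded length whose elements all equal `value`."""

    def __init__(self, value):
        self._value = value

    @property
    def shape(self):
        return (None,)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Simplification: only slices with an explicit stop are handled.
            if index.stop is None:
                raise IndexError("slice of an infinite array needs a stop")
            n = len(range(*index.indices(index.stop)))
            return np.full(n, self._value)
        return self._value


a = MemoryArray([1.0, 2.0, 3.0])
b = MemoryArray([10.0, 20.0, 30.0])
c = a + b                        # a LazySum; nothing has been added yet
first_two = c[:2]                # only these two elements are computed
d = a + InfiniteConstant(5.0)    # mixes a finite and an infinite operand
```

Code written against AbstractArray would not need to know which of the
concrete types it was handed, which is the kind of API generalisation meant
above.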
A special-case flag and function pointers seem a bit like reinventing
OO to me, and OO is already provided by Python.
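To illustrate, the proposed read/write pair maps fairly directly onto
methods. The sketch below is hypothetical (ArrayBackend, SquaresBackend,
ListBackend and their method signatures are made up here, and are much
simpler than the C-level signature quoted above): a NULL write pointer
becomes a default method that raises, and the "extra" state becomes ordinary
instance attributes.

```python
class ArrayBackend:
    """Hypothetical base class standing in for the read/write pointer pair."""

    def read(self, start, count):
        """Return `count` elements beginning at flat index `start`."""
        raise NotImplementedError

    def write(self, start, values):
        """Store `values` beginning at flat index `start`."""
        # Default mirrors a NULL write pointer: the array is read-only.
        raise TypeError("array is read-only")


class SquaresBackend(ArrayBackend):
    """Computed on the fly: element i is i * i. Only read() is overridden."""

    def read(self, start, count):
        return [i * i for i in range(start, start + count)]


class ListBackend(ArrayBackend):
    """Plain in-memory storage supporting both read and write."""

    def __init__(self, values):
        self._values = list(values)  # plays the role of the "extra" state

    def read(self, start, count):
        return self._values[start:start + count]

    def write(self, start, values):
        self._values[start:start + len(values)] = values


squares = SquaresBackend()
chunk = squares.read(3, 4)       # elements 3..6 of the computed sequence

stored = ListBackend([0, 0, 0, 0])
stored.write(1, [7, 8])          # overwrite two elements in place
```

Calling squares.write(...) raises TypeError, which plays the role of the
read-only check that the flag-based design would have to perform by testing
for a NULL pointer.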