[Numpy-discussion] Generator arrays

Fri Jan 28 17:40:33 EST 2011

On Fri, Jan 28, 2011 at 1:18 PM, Travis Oliphant <oliphant at enthought.com>wrote:

>
> Just to start the conversation, and to find out who is interested, I would
> like to informally propose generator arrays for NumPy 2.0.     This concept
> has as one use-case, the deferred arrays that Mark Wiebe has proposed.  But,
> it also allows for "compressed arrays", on-the-fly computed arrays, and
> streamed or generated arrays.
>

I like the idea, it could work very well.  It's also a bit risky, in the
sense that if the design isn't right in the end it could be overly
complicated or perform poorly.  The design will need time to bake.

> Basically, the modification I would like to make is to have an array flag
> (MEMORY) that when set means that the data attribute of a numpy array is a
> pointer to the address in memory where the data begins with the strides
> attribute pointing to a C-array of integers (in other words, all current
> arrays are MEMORY arrays)
>

> But, when the MEMORY flag is not set, the data attribute instead points to
> a length-2 C-array of pointers to functions
>
>        [read(N, output_address, self->index_iter, self->extra),  write(N,
> input_address, self->index_iter, self->extra)]
>
> Either of these could then be NULL (i.e. if write is NULL, then the array
> must be read-only).
>
> When the MEMORY flag is not set, the strides member of the ndarray
> structure is a pointer to the index_iter object (which could be anything
> that the particular read and write methods need it to be).
>

The details will have to be worked out,  but one additional thing the
deferred calculation needs is a way to re-enable writing to arrays that are
referenced by a deferred calculation.  For finite non-MEMORY arrays, a
common operation will probably be to convert them in-place into MEMORY
arrays.

The array structure should also get a member to hold the "extra" argument
> (which would hold any state that the array needed to hold on to in order to
> correctly perform the read or write operations --- i.e. it could hold an
> execution graph for deferred evaluation).
>
> The index_iter structure is anything that the read and write methods need
> to correctly identify *where* to write.   Now, clearly, we could combine
> index_iter and extra into just one "structure" that holds all needed state
> for read and write to work correctly.   The reason I propose two slots is
> because at least mentally in the use case of having these structures be
> calculation graphs, one of these structures is involved in "computing the
> location to read/write" and the other is involved in "computing what to
> read/write"
>

There are many ways one may want values to be gotten or put - dense array
data, values where a boolean mask has true values, a flat array with values
from specified arbitrary coordinates.  Some data sources may only support a
subset of the possibilities, and others may be able to support very fast
arbitrary access.  There probably needs to be a hierarchy of access methods,
similar to C++ STL iterators.

The idea is fairly simple, but with some very interesting potential
> features:
>
>        * lazy evaluation (of indexing, ufuncs, etc.)
>        * fancy indexing as views instead of copies (really just another
> example of lazy evaluation)
>        * compressed arrays
>        * generated arrays (from computation or streamed data)
>        * infinite arrays
>        * computed arrays
>        * missing-data arrays
>        * ragged arrays (shape would be the bounding box --- which makes me
> think of ragged arrays as examples of masked arrays).
>        * arrays that view PIL data.
>
> One could build an array with a (logically) infinite number of elements (we
> could use -2 in the shape tuple to indicate that).
>

The infinite shape reminds me of D's ranges.   It's also getting into a
territory where specifying a bounding box for the data makes more sense than
just a shape.  For just 1D data, you have [lower, upper], (-inf, upper],
[lower, +inf), and (-inf, inf) cases.

<snip> <paste>

One thing I didn't talk about in my previous long email, was the
> re-organization of calculation functions that needs to happen.  I really
> think that the ufunc concept needs to be broadened so that all
> function-pointers that are currently attached to data-type objects can be
> handled under the same calculation super-structure.
>

I like this approach since it's going in a more library-oriented direction.
 The generic style of programming popularized by STL can provide good
guidance to design this.  Even if C doesn't support templates, designing
orthogonal interfaces for calculation and iteration is still possible.

The design will also need layering.  At it's simplest, one should be able to
specify a ufunc with just a "double calc(double, double)" function and one
simple object creation function call.  At the same time, one should be able
to specialize the calculation function to do inner contiguous loops,
accumulation loops, or use SSE to get big speed improvements.

>
> This re-factoring would go a long way into cementing what kind of API is
> needed for different "array objects".    I am persuaded that improving
> framework for vectorized calculations which allow for any array-like objects
> (objects satisfying a certain protocol or API) is a better approach then
> altering the nice map of ndarray to in-memory data.
>

The interface to the iterator already is general enough that, for example, a
"compressed array" iterator could provide the same interface to client code.
 This could be generalized further as needed.  Francesc noticed this when I
modified the numexpr code.

Then, deferred arrays, masked arrays, computed arrays, and other array-like
> objects could provide protocols and APIs (and callbacks) that satisfy this
> general calculation structure.   This kind of generalization is probably
> more useful than changes to the array object itself.
>
> But, it's also hard and I'm not entirely sure what that structure should
> be.   I'm looking forward to thoughts in this direction and looking more
> closely at what Mark has done with writing ufuncs as wrappers around his
> iterators.   I'm concerned that his iterators don't support the generalized
> ufunc interface that I really like and was hoping would provide the
> abstraction needed to allow all the functions currently attached to dtypes
> (searchsorted, etc.) to be incorporated in the generalized calculation
> structure.
>

Think of the iterator as a tool just for accessing some or all of the
elements in an array or arrays.  The ufunc interfaces for generic functions,
reductions, accumulations, and the generalized ufunc simply use the iterator
to access the data, the iterator itself doesn't constrain them.
 Particularly the op_axes parameter turned out to be flexible in a way which
made these multitude of different views of the data for each calculation
possible.

Cheers,
Mark
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/numpy-discussion/attachments/20110128/22b64d2f/attachment.html>