[Numpy-discussion] Generator arrays
oliphant at enthought.com
Fri Jan 28 01:37:16 EST 2011
> What happens to the buffer API/persistence with all those additions?
I understand the desire to keep things simple, which is why I am proposing only a rather small change to the array object, one with *huge* implications, including the very cool deferred arrays that Mark Wiebe is proposing. As Einstein said, "everything should be as simple as possible, *but not simpler*".
Right now arrays have a data pointer that always points to memory, plus an accompanying strides array. All I'm suggesting is that they also allow for "indirect" or "computed" arrays in a fairly simple but general-purpose way. Generators have been such a huge feature in Python that I really think we need to figure out how to have "generated arrays" in NumPy as well. It turns out that they would enable features that are difficult with NumPy right now (including deferred evaluation).
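To make the idea of a "computed" array concrete, here is a toy pure-Python sketch. It is not the proposed change to the array object itself; the `ComputedArray` class and its methods are hypothetical names invented for illustration, and the rule is assumed to be a function of index arrays. The point is only that an array can carry a rule instead of a memory buffer, and that arithmetic on such arrays can build a new rule rather than compute anything.

```python
import numpy as np


class ComputedArray:
    """Hypothetical "generated" array: holds a rule, not a data buffer."""

    def __init__(self, shape, func, dtype=float):
        self.shape = tuple(shape)
        self.func = func              # assumed: maps indices -> values
        self.dtype = np.dtype(dtype)

    def __getitem__(self, index):
        # Compute one element on demand; nothing is stored.
        if not isinstance(index, tuple):
            index = (index,)
        return self.dtype.type(self.func(*index))

    def __add__(self, other):
        # Deferred evaluation: combine the two rules, compute nothing yet.
        return ComputedArray(
            self.shape,
            lambda *idx: self.func(*idx) + other.func(*idx),
            self.dtype,
        )

    def copy(self):
        # Materialize the rule into an ordinary in-memory ndarray.
        grid = np.indices(self.shape)
        return np.asarray(self.func(*grid), dtype=self.dtype)


squares = ComputedArray((5,), lambda i: i * i)
doubled = squares + squares   # still no element has been computed
dense = doubled.copy()        # evaluation happens here
```

In this sketch `copy()` is the only place where memory proportional to the array size is allocated; everything before it is bookkeeping on the rule.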
I guess it's debatable how complex the array object is. I actually see the array object itself as quite simple even with the changes. What is complicated is how calculations are done and scattered in an ad hoc fashion between ufuncs and other array functions. I like the idea of unifying the calculation framework using ideas like Mark's iterators and the generic functions that were added earlier to ufuncs. I don't like the data-types holding on to the "calculation structures". I think all calculations in NumPy should fit under a common rubric. To me this would be an important part of any change.
Obviously the buffer API could only be implemented for MEMORY arrays (other arrays would raise an error). What to do with persistence is a good question, but resolvable I think. Initially, I would also raise an error for trying to pickle arrays that are not MEMORY arrays --- simply calling "copy" on an array gives you something that can be persisted.
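The behaviour described above can be sketched in a few lines of Python. This is only an illustration under assumptions: `SketchArray`, `as_buffer`, and `persist` are hypothetical names (the real change would live in the C array object and the buffer protocol), and the choice of `BufferError` and `pickle.PicklingError` as the exceptions is a guess. The sketch shows non-MEMORY arrays refusing both operations, with `copy` producing a MEMORY array that supports them.

```python
import pickle

import numpy as np


class SketchArray:
    """Hypothetical array that is either MEMORY-backed or computed."""

    def __init__(self, shape, data=None, func=None):
        self.shape = tuple(shape)
        self.data = data    # ndarray for MEMORY arrays, else None
        self.func = func    # rule over index arrays for computed arrays

    @property
    def is_memory(self):
        return self.data is not None

    def as_buffer(self):
        # Stand-in for the buffer API: only MEMORY arrays have bytes to expose.
        if not self.is_memory:
            raise BufferError("only MEMORY arrays expose a buffer")
        return memoryview(self.data)

    def persist(self):
        # Stand-in for pickling: refuse anything that is not in memory.
        if not self.is_memory:
            raise pickle.PicklingError("not a MEMORY array; call copy() first")
        return pickle.dumps(self.data)

    def copy(self):
        # Copying always yields a MEMORY array, evaluating the rule if needed.
        if self.is_memory:
            return SketchArray(self.shape, data=self.data.copy())
        values = np.asarray(self.func(*np.indices(self.shape)), dtype=float)
        return SketchArray(self.shape, data=values)


lazy = SketchArray((4,), func=lambda i: 2 * i)   # computed, no buffer
solid = lazy.copy()                              # MEMORY array
restored = pickle.loads(solid.persist())         # round-trips as an ndarray
```

Calling `lazy.as_buffer()` or `lazy.persist()` in this sketch raises, while the same calls on `solid` succeed.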
Having this kind of functionality on the base NumPy object would be transformational for NumPy use. Yes, you could do similar things with other approaches, but there is a lot of benefit to having a powerful fundamental object that is a shared place to manage the expression of data calculations.
Another approach is to introduce another object, as you suggest: the "generator array". This could work, especially if there were hooks in the calculation engine that allowed it to be produced by array operations (say, in an appropriate context as described before). My main concern is that having a whole slew of different "array objects" (i.e. masked arrays, data arrays, labeled arrays, etc.) tends in practice to make code much bulkier to read, as you end up doing a lot of conversions back and forth to take advantage of APIs that require one array or another.
Having code that is written to a single object is unifying and really assists with code re-use and code readability. One of the things I see happening is a tool like Cython being used to generate the call-graphs or read-write functions that are being proposed.
I could be convinced, though, that leaving array objects alone and creating a better calculation object (i.e. something like an array vector machine) embracing and extending ufuncs is a better way to go. But, I haven't seen that proposal.