PEP: Extending the buffer protocol to share array information.

Attached is my PEP for extending the buffer protocol to allow array data to be shared.
PEP: <unassigned>
Title: Extending the buffer protocol to include the array interface
Version: $Revision: $
Last-Modified: $Date: $
Author: Travis Oliphant oliphant@ee.byu.edu
Status: Draft
Type: Standards Track
Created: 28-Aug-2006
Python-Version: 2.6
Abstract
This PEP proposes extending the tp_as_buffer structure to include function pointers that incorporate information about the intended shape and data-format of the provided buffer. In essence this will place something akin to the array interface directly into Python.
Rationale
Several extensions to Python utilize the buffer protocol to share the location of a data-buffer that is really an N-dimensional array. However, there is no standard way to exchange the additional N-dimensional array information so that the data-buffer is interpreted correctly. The NumPy project introduced an array interface (http://numpy.scipy.org/array_interface.shtml) through a set of attributes on the object itself. While this approach works, it requires attribute lookups which can be expensive when sharing many small arrays.
One of the key reasons that users often request to place something like NumPy into the standard library is so that it can be used as standard for other packages that deal with arrays. This PEP provides a mechanism for extending the buffer protocol (which already allows data sharing) to add the additional information needed to understand the data. This should be of benefit to all third-party modules that want to share memory through the buffer protocol such as GUI toolkits, PIL, PyGame, CVXOPT, PyVoxel, PyMedia, audio libraries, video libraries etc.
Proposal
Add a bf_getarrayinfo function pointer to the buffer protocol to allow objects to share additional information about the returned memory pointer. Add the TP_HAS_EXT_BUFFER flag to types that define the extended buffer protocol.
Specification:
static int
bf_getarrayinfo (PyObject *obj, Py_intptr_t **shape, Py_intptr_t **strides, PyObject **dataformat)
Inputs: obj -- The Python object being questioned.
Outputs:
[function result] -- the number of dimensions (n)
*shape -- A C-array of 'n' integers indicating the shape of the array. Can be NULL if n==0.
*strides -- A C-array of 'n' integers indicating the number of bytes to jump to get to the next element in each dimension. Can be NULL if the array is C-contiguous (or n==0).
*dataformat -- A Python object describing the data-format each element of the array should be interpreted as.
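To make the roles of shape and strides concrete, here is a minimal Python sketch of how a consumer might locate one element using the values bf_getarrayinfo returns. The helper names (c_contiguous_strides, element_offset), the struct-style format character standing in for the data-format object, and the reading of a missing strides array as "C-contiguous" are illustrative assumptions, not part of the proposal.

```python
import struct

def c_contiguous_strides(shape, itemsize):
    """Byte strides for a C-contiguous (row-major) array of the given shape."""
    strides = []
    step = itemsize
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return tuple(reversed(strides))

def element_offset(index, shape, strides=None, itemsize=1):
    """Byte offset of the element at `index`; strides=None means C-contiguous."""
    if len(index) != len(shape):
        raise ValueError("index must have one entry per dimension")
    if strides is None:
        strides = c_contiguous_strides(shape, itemsize)
    for i, extent in zip(index, shape):
        if not 0 <= i < extent:
            raise IndexError("index out of bounds")
    return sum(i * s for i, s in zip(index, strides))

# A 2x3 array of native C ints stored row by row in a flat buffer.
fmt = "i"  # assumed stand-in for the data-format object
itemsize = struct.calcsize(fmt)
buf = struct.pack("6" + fmt, 10, 11, 12, 20, 21, 22)

# Element [1][2] is the last one in the second row.
offset = element_offset((1, 2), shape=(2, 3), itemsize=itemsize)
(value,) = struct.unpack_from(fmt, buf, offset)
```

With n==0 both shape and index are empty, so the offset is simply 0, which matches the rule that shape and strides may be NULL in that case.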
Discussion Questions:
1) How is data-format information supposed to be shared? A companion proposal suggests returning a data-format object which carries the information about the buffer area.
2) Should the single function pointer call be extended into multiple calls or should its arguments be compressed into a structure that is filled?
3) Should one or more C-API functions be created to wrap calls to this function pointer, much like is done now with the buffer protocol? What should the interface of this function (or these functions) be?
4) Should a mask (for missing values) be shared as well?
Reference Implementation
Supplied when the PEP is accepted.
Copyright
This document is placed in the public domain.

Travis E. Oliphant schrieb:
Several extensions to Python utilize the buffer protocol to share the location of a data-buffer that is really an N-dimensional array. However, there is no standard way to exchange the additional N-dimensional array information so that the data-buffer is interpreted correctly. The NumPy project introduced an array interface (http://numpy.scipy.org/array_interface.shtml) through a set of attributes on the object itself. While this approach works, it requires attribute lookups which can be expensive when sharing many small arrays.
Can you please give examples for real-world applications of this interface, preferably examples involving multiple independently-developed libraries? ("this" being the current interface in NumPy - I understand that the PEP's interface isn't implemented, yet)
Paul Moore (IIRC) gave the example of equalising the green values and maximizing the red values in a PIL image by passing it to NumPy: Is that a realistic (even though not-yet real-world) example? If so, what algorithms of NumPy would I use to perform this image manipulation (and why would I use NumPy for it if I could just write a for loop that does that in pure Python, given PIL's getpixel/setdata)?
Regards, Martin

On 10/30/06, Travis E. Oliphant oliphant.travis@ieee.org wrote:
Attached is my PEP for extending the buffer protocol to allow array data to be shared.
You might want to reference this thread ( http://mail.python.org/pipermail/python-3000/2006-August/003309.html), where Guido mentions that extending the buffer protocol to tell more about the data in the buffer "would offer the numarray folks their 'array interface'".
-Brett

Martin v. Löwis wrote:
Travis E. Oliphant schrieb:
Several extensions to Python utilize the buffer protocol to share the location of a data-buffer that is really an N-dimensional array. However, there is no standard way to exchange the additional N-dimensional array information so that the data-buffer is interpreted correctly. The NumPy project introduced an array interface (http://numpy.scipy.org/array_interface.shtml) through a set of attributes on the object itself. While this approach works, it requires attribute lookups which can be expensive when sharing many small arrays.
Can you please give examples for real-world applications of this interface, preferably examples involving multiple independently-developed libraries? ("this" being the current interface in NumPy - I understand that the PEP's interface isn't implemented, yet)
Examples of Need
1) Suppose you have an image in *.jpg format that came from a camera and you want to apply Fourier-based image recovery to try to de-blur the image using modified Wiener filtering. Then you want to save the result in *.png format. The PIL provides an easy way to read *.jpg files into Python and write the result to *.png, and NumPy provides the FFT and the array math needed to implement the algorithm. Rather than having to dig into the details of how NumPy and the PIL interpret chunks of memory in order to write a "converter" between NumPy arrays and PIL arrays, there should be support in the buffer protocol so that one could write something like:
    # Read the image
    a = numpy.frombuffer(Image.open('myimage.jpg'))

    # Process the image.
    A = numpy.fft.fft2(a)
    B = A*inv_filter
    b = numpy.fft.ifft2(B).real

    # Write it out
    Image.frombuffer(b).save('filtered.png')
Currently, without this proposal, you have to worry about the "mode" the image is in and get its shape using a specific method call (this method call is different for every object you might want to interface with).
2) The same argument for a library that reads and writes audio or video formats exists.
3) You want to blit images onto a GUI Image buffer for rapid updates but need to do math processing on the image values themselves or you want to read the images from files supported by the PIL.
If the PIL supported the extended buffer protocol, then you would not need to worry about the "mode" and the "shape" of the Image.
What's more, you would also be able to accept images from any object (like NumPy arrays or ctypes arrays) that supported the extended buffer protocol without having to learn how it shares information like shape and data-format.
I could have also included examples from PyGame, OpenGL, etc. I thought people were more aware of this argument as we've made it several times over the years. It's just taken this long to get to a point to start asking for something to get into Python.
Paul Moore (IIRC) gave the example of equalising the green values and maximizing the red values in a PIL image by passing it to NumPy: Is that a realistic (even though not-yet real-world) example?
I think so, but I've never done something like that.
If so, what algorithms of NumPy would I use to perform this image manipulation (and why would I use NumPy for it if I could just write a for loop that does that in pure Python, given PIL's getpixel/setdata)?
Basically you would use array math operations and reductions (ufuncs and their methods, which are included in NumPy). You would do it this way for speed. It's going to be a lot slower doing those loops in Python. NumPy provides the ability to do them at close-to-C speeds.
-Travis

""Martin v. Löwis"" martin@v.loewis.de wrote in message news:4547BF86.6070806@v.loewis.de...
Paul Moore (IIRC) gave the example of equalising the green values and maximizing the red values in a PIL image by passing it to NumPy: Is that a realistic (even though not-yet real-world) example? If so, what algorithms of NumPy would I use to perform this image manipulation
The use of surfarrays manipulated by Numeric has been an optional but important part of PyGame for years. http://www.pygame.org/docs/ says:

Surfarray Introduction: Pygame uses the Numeric python module to allow efficient per pixel effects on images. Using the surface arrays is an advanced feature that allows custom effects and filters. This also examines some of the simple effects from the Pygame example, arraydemo.py.

The Examples section of the linked page http://www.pygame.org/docs/tut/surfarray/SurfarrayIntro.html has code snippets for generating, resizing, recoloring, filtering, and cross-fading images.
(and why would I use NumPy for it if I could just write a for loop that does that in pure Python, given PIL's getpixel/setdata)?
Why does anyone use Numeric/NumArray/NumPy? Faster, easier coding and much faster execution, which is especially important when straining for an acceptable framerate.
---- I believe that at present PyGame can only work with external images that it is programmed to know how to import. My guess is that if image source program X (such as PIL) described its data layout in a way that NumPy could read and act on, the import/copy step could be eliminated. But perhaps Travis can clarify this.
Terry Jan Reedy

"Travis Oliphant" oliphant.travis@ieee.org wrote in message news:ei8ors$7m4$1@sea.gmane.org...
Examples of Need
[snip]
I could have also included examples from PyGame, OpenGL, etc. I thought people were more aware of this argument as we've made it several times over the years. It's just taken this long to get to a point to start asking for something to get into Python.
The problem of data format definition and sharing of data between applications has been a bugaboo of computer science for decades. But some have butted their heads against it more than others.
Something which made a noticeable dent in the problem, by making sharing 'just work' more easily, would, to me, be a real plus for python.
tjr

Terry Reedy wrote:
I believe that at present PyGame can only work with external images that it is programmed to know how to import. My guess is that if image source program X (such as PIL) described its data layout in a way that NumPy could read and act on, the import/copy step could be eliminated.
I wish you all stopped using PIL as an example in this discussion; for PIL 2, I'm moving towards an entirely opaque data model, with a "data view"-style client API.
</F>

Fredrik Lundh wrote:
Terry Reedy wrote:
I believe that at present PyGame can only work with external images that it is programmed to know how to import. My guess is that if image source program X (such as PIL) described its data layout in a way that NumPy could read and act on, the import/copy step could be eliminated.
I wish you all stopped using PIL as an example in this discussion; for PIL 2, I'm moving towards an entirely opaque data model, with a "data view"-style client API.
That's an unreasonable request. The point of the buffer protocol is to allow people to represent their data in whatever way they like internally but still share it in a standard way. The extended buffer protocol allows sharing of the shape of the data and its format in a standard way as well.
We just want to be able to convert the data in PIL objects to other Python objects without having to write special "converter" functions. It's not important how PIL or PIL 2 stores the data as long as it participates in the buffer protocol.
Of course if the memory layout were compatible with the model of NumPy, then data-copies would not be required, but that is really secondary.
-Travis

Martin v. Löwis <martin <at> v.loewis.de> writes:
Can you please give examples for real-world applications of this interface, preferably examples involving multiple independently-developed libraries?
OK -- here's one I haven't seen in this thread yet:
wxPython has a lot of code to translate between various Python data types and wx data types. An example is PointListHelper. This code examines the input Python data, and translates it to a wxList of wxPoints. It is used in a bunch of the drawing functions, for instance. It has some nifty optimizations so that if a python list of (x,y) tuples is passed in, then the code uses PyList_GetItem() to access the tuples.
If an Nx2 numpy array is passed in, it defaults to PySequence_GetItem() to get the (x,y) pair, and then again to get the values, which are converted to Python numbers, then checked and converted again to C ints.
The result is an awful lot of processing, even though the data in the numpy array already exists in a C array that could be exactly the same as the wxList of wxPoints (in fact, many of the drawing methods take a pointer to a correctly formatted C array of data).
Right now, it is faster to convert your numpy array of points to a python list of tuples first, then pass it in to wx.
However, were there a standard way to describe a buffer (pointer to a C array of data), then the PointListHelper code could look to see if the data is already correctly formatted, and pass the pointer right through. If it was not, it could still do the translation (like from doubles to ints, for instance) far more efficiently.
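The check described here could look roughly like the following Python sketch. Everything in it is hypothetical: points_as_int_pairs stands in for PointListHelper, the shape, strides, and fmt arguments stand in for whatever an extended buffer protocol would report, and the "i" and "d" format codes are struct-module conventions assumed for illustration. The real helper would do this in C, so the loop only shows the logic, not the speed.

```python
import struct

def points_as_int_pairs(buf, shape, strides, fmt):
    """Return (data, copied), where data holds N (x, y) pairs of C ints.

    A C-contiguous Nx2 buffer of ints is passed through untouched;
    anything else is converted in a single pass over the raw memory.
    """
    if len(shape) != 2 or shape[1] != 2:
        raise ValueError("expected an Nx2 array of points")
    itemsize = struct.calcsize(fmt)
    int_size = struct.calcsize("i")
    packed = (2 * itemsize, itemsize)
    contiguous = strides is None or tuple(strides) == packed
    if fmt == "i" and contiguous:
        return buf, False  # already in the right layout: no copy needed
    if strides is None:
        strides = packed
    out = bytearray(shape[0] * 2 * int_size)
    for row in range(shape[0]):
        for col in range(2):
            src = row * strides[0] + col * strides[1]
            (value,) = struct.unpack_from(fmt, buf, src)
            struct.pack_into("i", out, (row * 2 + col) * int_size, int(value))
    return bytes(out), True

# Points that are already C ints are handed through as-is.
ints = struct.pack("4i", 1, 2, 3, 4)
same, copied_ints = points_as_int_pairs(ints, (2, 2), None, "i")

# Points stored as doubles are converted without building Python tuples.
doubles = struct.pack("4d", 1.0, 2.0, 3.0, 4.0)
converted, copied_doubles = points_as_int_pairs(doubles, (2, 2), None, "d")
```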
When I get the chance, I do intend to contribute code to support this in wxPython, using the numpy array interface. However, wouldn't it be better for it to support a generic interface that was in the standard lib, rather than only numpy?
While /F suggested we get off the PIL bandwagon, I do have code that has to pass data around between numpy, PIL and wx.Images ( and matplotlib AGG buffers, and GDAL geo-referenced image buffers, and ...). Most do support the current buffer protocol, so it can be done, but I'd be much happier if there was a little more checking going on, rather than my python code having to make sure the data is all arranged in memory the right way.
Oh, there is also the Python Cartographic Library, which can take a Python list of tuples as coordinates, and do a Projection on them, but which can't take a numpy array holding that same data.
-Chris

Chris Barker wrote:
While /F suggested we get off the PIL bandwagon
I suggest we drop the obsession with pointers to memory areas that are supposed to have a specific format; modern data access API:s don't work that way for good reasons, so I don't see why Python should grow a standard based on that kind of model.
the "right solution" for things like this is an *API* that lets you do things like:
    view = object.acquire_view(region, supported formats)
    ... access data in view ...
    view.release()
and, for advanced users
format = object.query_format(constraints)
</F>

Fredrik Lundh wrote:
Chris Barker wrote:
While /F suggested we get off the PIL bandwagon
I suggest we drop the obsession with pointers to memory areas that are supposed to have a specific format; modern data access API:s don't work that way for good reasons, so I don't see why Python should grow a standard based on that kind of model.
Please give us an example of a modern data-access API (i.e. an application that uses one)?
I presume you are not fundamentally opposed to sharing memory given the example you gave.
the "right solution" for things like this is an *API* that lets you do things like:
    view = object.acquire_view(region, supported formats)
    ... access data in view ...
    view.release()
and, for advanced users
format = object.query_format(constraints)
It sounds like you are concerned about the memory-area-not-current problem. Yeah, it can be a problem (but not an unsolvable one). Objects that share memory through the buffer protocol just have to be careful about resizing themselves or eliminating memory.
Anyway, it's a problem not solved by the buffer protocol. I have no problem with trying to fix that in the buffer protocol, either.
It's all completely separate from what I'm talking about as far as I can tell.
-Travis

Fredrik Lundh wrote:
Chris Barker wrote:
While /F suggested we get off the PIL bandwagon
I suggest we drop the obsession with pointers to memory areas that are supposed to have a specific format; modern data access API:s don't work that way for good reasons, so I don't see why Python should grow a standard based on that kind of model.
the "right solution" for things like this is an *API* that lets you do things like:
    view = object.acquire_view(region, supported formats)
    ... access data in view ...
    view.release()
and, for advanced users
format = object.query_format(constraints)
So, if the extended buffer protocol were enhanced to enforce this kind of viewing and release, then would you support it?
Basically, the extended buffer protocol would, at the same time as providing *more* information about the "view", require the implementer to understand the idea of "holding" and "releasing" the view.
Would this basically require the object supporting the extended buffer protocol to keep some kind of list of who has views (or at least a number indicating how many views there are)?
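As a rough illustration of the bookkeeping in question, here is a minimal Python sketch in which the exporting object keeps only a count of outstanding views and refuses to resize while the count is non-zero. SharedBlock, BlockView, and their methods are hypothetical names invented for this sketch; nothing here is an existing or proposed API.

```python
class SharedBlock:
    """Hypothetical exporter that counts outstanding views of its memory."""

    def __init__(self, data):
        self._data = bytearray(data)
        self._views = 0

    def acquire_view(self):
        self._views += 1
        return BlockView(self)

    def resize(self, new_size):
        # Resizing could move the memory, so it is refused while viewed.
        if self._views:
            raise BufferError("cannot resize while views are outstanding")
        if new_size < len(self._data):
            del self._data[new_size:]
        else:
            self._data.extend(bytes(new_size - len(self._data)))


class BlockView:
    """Handle on a SharedBlock's memory; must be released when done."""

    def __init__(self, owner):
        self._owner = owner
        self.data = owner._data

    def release(self):
        # Releasing twice is harmless: the count drops only once.
        if self._owner is not None:
            self._owner._views -= 1
            self._owner = None
            self.data = None


block = SharedBlock(b"abcd")
view = block.acquire_view()
try:
    block.resize(8)
    resized_while_viewed = True
except BufferError:
    resized_while_viewed = False

view.release()
block.resize(8)  # allowed again once the last view is gone
```

A plain counter is enough to decide whether resizing is safe; a list of view holders would only be needed if the exporter had to notify or inspect them.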
-Travis

Fredrik Lundh wrote:
the "right solution" for things like this is an *API* that lets you do things like:
view = object.acquire_view(region, supported formats)
And how do you describe the "supported formats"?
That's where Travis's proposal comes in, as far as I can see.
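To show why the view-style API still needs a format description, here is a small Python sketch of such a negotiation. It assumes, purely for illustration, that formats are described with struct-module format characters; the free function acquire_view and its arguments are hypothetical, and the "region" argument is left out for brevity.

```python
import struct

def acquire_view(data, native_format, supported_formats):
    """Return (format, bytes) in a format the consumer says it accepts."""
    if native_format in supported_formats:
        return native_format, data  # share the memory as-is
    if not supported_formats:
        raise ValueError("no supported formats given")
    # Otherwise convert every element to the consumer's first choice.
    target = supported_formats[0]
    count = len(data) // struct.calcsize(native_format)
    values = struct.unpack(f"{count}{native_format}", data)
    return target, struct.pack(f"{count}{target}", *values)

pixels = struct.pack("3B", 0, 128, 255)            # producer stores unsigned bytes
fmt, view = acquire_view(pixels, "B", ["f", "d"])  # consumer wants floats
```

Both sides can only agree on a format if there is a shared vocabulary for naming one, which is the piece the data-format part of the proposal is meant to supply.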
-- Greg