Buffer Interface for Python 3.0

An update for those of you who did not get the chance to come to PyCon.

PyCon was very well attended this year, and there were some excellent discussions and presentations. From PyCon I learned that Python 3000 is closer than I had previously thought. What this means for me is that I am now focusing on getting a revamped buffer interface into Python 3.0. This will help us interact with the Python developers more effectively. Once we have the buffer interface hammered out for Python 3.0, we can back-port the result to Python 2.6.

Thus my buffer PEP is being revamped. Anybody who would like to comment on or contribute to the design of the new buffer interface is welcome to voice their opinion.

Basically, what we are going to do now is

1) Return the data-format specification in an extended struct-style string
2) Return the shape information in a tuple of lists: (shape, strides)

There are three questions I'm grappling with right now:

1) Do we propose the inclusion of offsets in the shape information? NumPy does not use offsets internally but simply has a pointer to the start of the array.
2) The buffer interface needs to understand the idea of discontiguous arrays. If the shape/stride information is separate from the pointer-to-data call, then the user needs to know if that pointer-to-data is a "contiguous chunk" or just the beginning of a strided memory area (and so should not be treated as a single segment).
3) If we support strided memory areas, then we should probably also allow some way for PIL-like objects to report their buffer sequence (I'm sure this was the origin of the multi-segment buffer protocol to begin with). Or we could just ignore that possibility. The PIL would then have to copy memory in order to share its images.

Anybody with ideas is welcome to participate. What I have so far is at http://wiki.python.org/moin/ArrayInterface

Thanks,

-Travis
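For readers who want something concrete, here is a minimal sketch of what "a struct-style format string plus (shape, strides)" describes. It uses the `memoryview` type as it exists in current Python 3, which is an assumption on my part and not necessarily the design under discussion in this thread. The helper `element_offset` is a hypothetical name introduced only for illustration.

```python
import struct

# Six C doubles packed into one flat bytes object.
raw = struct.pack("6d", 0.0, 1.0, 2.0, 3.0, 4.0, 5.0)

# View the same memory as a 2x3 array of doubles (no copy is made).
view = memoryview(raw).cast("d", shape=[2, 3])

fmt = view.format        # struct-style element format: 'd'
shape = view.shape       # (2, 3)
strides = view.strides   # bytes to step along each axis: (3 * itemsize, itemsize)


def element_offset(index, strides):
    """Byte offset of an element from the start of the buffer (hypothetical helper)."""
    return sum(i * s for i, s in zip(index, strides))


# Element [1, 2] is found purely from the format string and the strides.
offset = element_offset((1, 2), strides)
value = struct.unpack_from(fmt, raw, offset)[0]
print(fmt, shape, strides, offset, value)
```

The point of the sketch is that a consumer needs nothing beyond the format string, the shape, and the strides to locate and decode any element.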

On 27/02/07, Travis Oliphant <oliphant@ee.byu.edu> wrote:
> Basically, what we are going to do now is
>
> 1) Return the data-format specification in an extended struct-style string
> 2) Return the shape information in a tuple of lists: (shape, strides)
>
> There are three questions I'm grappling with right now:
>
> 1) Do we propose the inclusion of offsets in the shape information? NumPy does not use offsets internally but simply has a pointer to the start of the array.

I'm not quite sure I understand what this means.

Correct me if I'm wrong, but within numpy, an array typically lives inside a hunk of memory allocated with malloc(); the first data element is somewhere inside that, and any data elements are distributed according to strides. Is that about right? The array object needs to know the location of the first element, the strides and sizes, the data type of each element, and it seems to me it also needs the address of the data area, so that that can be free()d when the last array using that hunk of memory is deallocated. In fact it would need a refcounted link to the array...

Or, if this isn't how it works, how does numpy arrange for the array's memory to be deleted at the right time? Do numpy arrays keep a refcounted link to the array that "owns" the memory?

How is memory deallocation managed for the buffer protocol? It seems like what one needs to access the memory is a buffer object plus an offset (plus the usual strides and whatnot).

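To make the "buffer object plus an offset" idea concrete, here is a rough sketch in plain Python. `StridedView` and all of its fields are hypothetical names made up for illustration; this is not how numpy is implemented, only one way the bookkeeping described above could look.

```python
import struct
from dataclasses import dataclass


@dataclass
class StridedView:
    """Hypothetical view: a reference to the owning buffer plus layout information."""
    owner: bytearray   # refcounted link that keeps the memory alive
    offset: int        # byte offset of the first element inside owner
    shape: tuple
    strides: tuple     # byte steps along each axis
    fmt: str = "d"     # struct-style element format

    def __getitem__(self, index):
        pos = self.offset + sum(i * s for i, s in zip(index, self.strides))
        return struct.unpack_from(self.fmt, self.owner, pos)[0]


item = struct.calcsize("d")

# A 3x4 array of doubles holding 0.0 .. 11.0 in row-major order.
owner = bytearray(struct.pack("12d", *[float(i) for i in range(12)]))
full = StridedView(owner, 0, (3, 4), (4 * item, item))

# Rows 1..2 and columns 1 and 3 of the same memory: only the offset,
# shape, and strides change, and nothing is copied.
sub = StridedView(owner, 4 * item + item, (2, 2), (4 * item, 2 * item))

# Dropping the original name does not free the memory, because both
# views still hold a reference to the owning bytearray.
del owner
print(sub[0, 0], sub[1, 1])
```

Here the memory is released only when the last view referring to the owning object goes away, which is the refcounted-link arrangement asked about above.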
> 2) The buffer interface needs to understand the idea of discontiguous arrays. If the shape/stride information is separate from the pointer-to-data call, then the user needs to know if that pointer-to-data is a "contiguous chunk" or just the beginning of a strided memory area (and so should not be treated as a single segment).
> 3) If we support strided memory areas, then we should probably also allow some way for PIL-like objects to report their buffer sequence (I'm sure this was the origin of the multi-segment buffer protocol to begin with). Or we could just ignore that possibility. The PIL would then have to copy memory in order to share its images.

I'm not quite sure I understand what you mean by "contiguous" here.

One interpretation would be that any array that uses every byte between the first and last is contiguous, and any other is discontiguous. Another would be that any array that can be described by strides and an offset is contiguous, as it must live in a contiguous block of malloc()ed (or mmap()ed or whatever) memory; discontiguous arrays would then be things like C's array-of-pointers-to-arrays arrangement, for which each row would be a single malloc()ed chunk but the chunks might be arranged arbitrarily in memory.

If the former, I can't see why we would not support them, since they naturally occur in numpy and are tidily handled by the (shape, strides, offset) information. If the latter, supporting them is going to be a real challenge, involving a great deal of indirection... would the goal be to make them accessible through an interface resembling numpy's indexing?

Anne
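The two readings of "contiguous" can be told apart with a small sketch. The function `is_c_contiguous` below is a hypothetical helper that checks only the row-major (C-order) case of the first reading; the `c_contiguous` attribute used for comparison belongs to `memoryview` in current Python 3 and is not something the thread had available.

```python
def is_c_contiguous(shape, strides, itemsize):
    """True if a row-major walk covers the memory with no gaps (hypothetical helper)."""
    expected = itemsize
    for n, s in zip(reversed(shape), reversed(strides)):
        if n == 1:
            continue          # the stride of a length-1 axis is never used
        if s != expected:
            return False
        expected *= n
    return True


data = bytearray(range(12))

# First reading: every byte between first and last element is used.
dense = memoryview(data)
print(dense.c_contiguous,
      is_c_contiguous(dense.shape, dense.strides, dense.itemsize))

# Strided but still inside one block: fully described by
# (shape, strides, offset), yet it skips every other byte.
every_other = dense[::2]
print(every_other.strides, every_other.c_contiguous,
      is_c_contiguous(every_other.shape, every_other.strides,
                      every_other.itemsize))

# Second reading: rows allocated separately, as with C's
# array-of-pointers-to-arrays. No single base pointer plus strides
# describes this; each row is its own buffer.
rows = [bytearray(4) for _ in range(3)]
row_views = [memoryview(r) for r in rows]
```

Under the first reading `every_other` is discontiguous but easy to support, while the `rows` arrangement is the case that would need an extra level of indirection.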
participants (2)
- Anne Archibald
- Travis Oliphant