Draft PEP for the new buffer interface to be in Python 3000
PEP: <unassigned>
Title: Revising the buffer protocol
Version: $Revision: $
Last-Modified: $Date: $
Author: Travis Oliphant <oliphant@ee.byu.edu>
Status: Draft
Type: Standards Track
Created: 28-Aug-2006
Python-Version: 3000

Abstract

This PEP proposes re-designing the buffer API (PyBufferProcs function pointers) to improve the way Python allows memory sharing in Python 3.0.

In particular, it is proposed that the multiple-segment and character-buffer portions of the buffer API be eliminated and that additional function pointers be provided to allow sharing any multi-dimensional nature of the memory and what data format the memory contains.

Rationale

The buffer protocol allows different Python types to exchange a pointer to a sequence of internal buffers. This functionality is extremely useful for sharing large segments of memory between different high-level objects, but it is too limited and has issues:

1. There is the little (never?) used "sequence-of-segments" option (bf_getsegcount).

2. There is the apparently redundant character-buffer option (bf_getcharbuffer).

3. There is no way for a consumer to tell the buffer-API-exporting object it is "finished" with its view of the memory, and therefore no way for the exporting object to be sure that it is safe to reallocate the pointer to the memory that it owns (the array object reallocating its memory after sharing it with the buffer object which held the original pointer led to the infamous buffer-object problem).

4. Memory is just a pointer with a length. There is no way to describe what is "in" the memory (float, int, C-structure, etc.).

5. There is no shape information provided for the memory. But several array-like Python types could make use of a standard way to describe the shape-interpretation of the memory (wxPython, GTK, PyQt, CVXOPT, PyVox, audio and video libraries, ctypes, NumPy, database interfaces, etc.).

There are two widely used libraries that use the concept of discontiguous memory: PIL and NumPy.
Their view of discontiguous arrays is a bit different, though. NumPy uses the notion of constant striding in each dimension as its basic concept of an array. In this way a simple sub-region of a larger array can be described without copying the data. Strided memory is a common way to describe data to many computing libraries (such as the BLAS and LAPACK).

The PIL uses a more opaque memory representation. Sometimes an image is contained in a contiguous segment of memory, but sometimes it is contained in an array of pointers to the contiguous segments (usually lines) of the image. This allows the image to not be loaded entirely into memory. The PIL is where the idea of multiple buffer segments in the original buffer interface came from, I believe.

The buffer interface should allow discontiguous memory areas to share standard striding information. However, consumers that do not want to deal with strided memory should also be able to request a contiguous segment easily.

Proposal Overview

* Eliminate the char-buffer and multiple-segment sections of the buffer protocol.
* Unify the read/write versions of getting the buffer.
* Add a new function to the protocol that should be called when the consumer object is "done" with the view.
* Add a new function to allow the protocol to describe what is in memory (unifying what is currently done in the struct and array modules).
* Add a new function to allow the protocol to share shape information.
* Fix all objects in the core and the standard library to conform to the new interface.
* Extend the struct module to handle more format specifiers.

Specification

Change the PyBufferProcs structure to

    typedef struct {
        getbufferproc bf_getbuffer;
        releasebufferproc bf_releasebuffer;
        formatbufferproc bf_getbufferformat;
        shapebufferproc bf_getbuffershape;
    } PyBufferProcs;

    typedef PyObject *(*getbufferproc)(PyObject *obj, void **buf, Py_ssize_t *len, int requires)

Return a pointer to memory in buf and the length of that memory buffer in len.
Requirements for the memory are provided in requires (PYBUFFER_WRITE, PYBUFFER_ONESEGMENT). NULL is returned and an error raised if the object cannot return a view with those requirements. Otherwise, an object-specific "view" object is returned (which can just be a borrowed reference to obj). This view object should be used in the other API calls and does not need to be decref'd. It should be "released" if the interface exporter provides the bf_releasebuffer function.

    typedef int (*releasebufferproc)(PyObject *view)

This function is called when a view of memory previously acquired from the object is no longer needed. It is up to the exporter of the API to make sure all views have been released before eliminating a reference to a previously returned pointer. It is up to consumers of the API to call this function on the object whose view is obtained when it is no longer needed. A -1 is returned on error and 0 on success.

    typedef char *(*formatbufferproc)(PyObject *view, int *itemsize)

Get the format string of the memory using the struct-module string syntax (see below for proposed additions to that syntax). Also, there is never an alignment assumption in this string; the full byte layout is always required. If the implied size of this string is smaller than the length of the buffer, then it is assumed that the string is repeated.

If itemsize is not NULL, then return the size implied by the format string. This could be the entire length of the buffer or just the length of each element. It is equivalent to *itemsize = PyObject_SizeFromFormat(ret) if ret is the returned string. However, very often objects already know the itemsize without having to compute it separately.

    typedef PyObject *(*shapebufferproc)(PyObject *view)

Return a 2-tuple of lists containing shape information: (shape, strides). The strides object can be None if the memory is C-style contiguous; otherwise it provides the striding in each dimension.
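As an illustration of the (shape, strides) pair described above, here is a minimal Python sketch of how a consumer might turn an index into a byte offset. The helper names c_contiguous_strides and byte_offset are hypothetical and are not part of the proposal; the sketch assumes that strides are expressed in bytes and that a strides value of None means C-style contiguous layout.

```python
from itertools import product


def c_contiguous_strides(shape, itemsize):
    """Byte strides for C-style contiguous memory (last dimension varies fastest)."""
    strides = []
    step = itemsize
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return list(reversed(strides))


def byte_offset(index, strides):
    """Offset in bytes of the element at `index`, relative to the buffer pointer."""
    return sum(i * s for i, s in zip(index, strides))


# A 2 x 3 array of 8-byte items whose exporter reported strides as None.
shape = [2, 3]
strides = c_contiguous_strides(shape, itemsize=8)

# Visit every element in index order and record where it lives in memory.
offsets = [byte_offset(idx, strides) for idx in product(*map(range, shape))]
```

For contiguous memory the offsets simply count up by the item size; an exporter that hands out explicit strides (for example, a sub-region of a larger array) would produce offsets with gaps, and the same byte_offset arithmetic applies.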
All of these routines are optional for a type object (but the last three make no sense unless the first one is implemented).

New C-API calls are proposed:

    int PyObject_CheckBuffer(PyObject *obj)

Return 1 if the getbuffer function is available, otherwise 0.

    PyObject *PyObject_GetBuffer(PyObject *obj, void **buf, Py_ssize_t *len, int requires)

Return a borrowed reference to a "view" object of memory for the object. Requirements for the memory should be given in requires (PYBUFFER_WRITE, PYBUFFER_ONESEGMENT). The memory pointer is in *buf and its length in *len. Note that the memory is not considered a single segment of memory unless PYBUFFER_ONESEGMENT is used in requires. Get possible striding using PyObject_GetBufferShape on the view object.

    int PyObject_ReleaseBuffer(PyObject *view)

Call this function to tell obj that you are done with your "view". This is a no-op if the object doesn't implement a release function. Only call this after a previous PyObject_GetBuffer has succeeded. Return -1 on error.

    char *PyObject_GetBufferFormat(PyObject *view, int *itemsize)

Return a NULL-terminated string indicating the data format of the memory buffer. The string is in struct-module syntax with the exception that there is never an alignment assumption (all bytes must be accounted for). If the length of the buffer indicated by this string is smaller than the total length of the buffer, then a repeat of the string is implied to fill the length of the buffer.

If itemsize is not NULL, then return the implied size of each item (this could be calculated from the format string but it is often known by the view object anyway).

    PyObject *PyObject_GetBufferShape(PyObject *view)

Return a 2-tuple of lists (shape, stride) providing the multi-dimensional shape of the memory area. The stride shows how many bytes to skip in each dimension to move in that dimension from the start of the array.
Memory that is not a single contiguous buffer can be represented with the pointer returned from GetBuffer and the shape and strides returned from GetBufferShape.

    int PyObject_SizeFromFormat(char *)

Return the implied size of the data-format area from a struct-style description.

Additions to the struct string-syntax

The struct string-syntax is missing some characters to fully implement data-format descriptions already available elsewhere (in ctypes and NumPy for example). Here are the proposed additions:

Character         Description
================  ==============================================================
'1'               bit (number before states how many bits)
'?'               platform _Bool type
'g'               long double
'F'               complex float
'D'               complex double
'G'               complex long double
'c'               ucs-1 (latin-1) encoding
'u'               ucs-2
'w'               ucs-4
'O'               pointer to Python Object
'T{}'             structure (detailed layout inside {})
'(k1,k2,...,kn)'  multi-dimensional array of whatever follows
':name:'          optional name of the preceding element
'&'               specific pointer (prefix before another character)
'X{}'             pointer to a function (optional function signature inside {})
================  ==============================================================

The struct module will be changed to understand these as well and return appropriate Python objects on unpacking. Unpacking a long double will return a ctypes long_double. Unpacking 'u' or 'w' will return Python unicode. Unpacking a multi-dimensional array will return a list of lists. Unpacking a pointer will return a ctypes pointer object. Unpacking a bit will return a Python Bool.

Endian specification ('=', '>', '<') is also allowed inside the string so that it can change if needed. The previously specified endian string is enforced at all times. The default endian is '='.

According to the struct module, a number can precede a character code to specify how many of that type there are. The (k1,k2,...,kn) extension also allows specifying if the data is supposed to be viewed as a (C-style contiguous, last dimension varies the fastest) multi-dimensional array of a particular format.
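The parts of the format syntax that the struct module already supports (byte-order prefixes, repeat counts, and sizes with no alignment padding) can be tried out directly. The sketch below uses only existing struct calls and none of the proposed new characters; struct.calcsize is used here as a rough Python-level analogue of the proposed PyObject_SizeFromFormat, which is an assumption made for illustration rather than something the draft states.

```python
import struct

# '=' means native byte order with standard sizes and no alignment padding,
# so every byte of the layout is accounted for: 1 (signed char) + 4 (int).
packed_size = struct.calcsize("=bi")

# '@' (native alignment) may insert padding between the same two fields.
native_size = struct.calcsize("@bi")

# A number before a character code repeats it: three 8-byte doubles.
repeat_size = struct.calcsize("<3d")

# Round-trip a small little-endian record: one short followed by one double.
record = struct.pack("<hd", 7, 2.5)
fields = struct.unpack("<hd", record)
```

The difference between packed_size and native_size is the kind of alignment assumption that the draft says a buffer format string must never rely on.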
Functions should be added to ctypes to create a ctypes object from a struct description, and long double and ucs-2 should be added to ctypes.

Code to be affected

All objects and modules in Python that export or consume the old buffer interface will be modified. Here is a partial list:

* buffer object
* bytes object
* string object
* array module
* struct module
* mmap module
* ctypes module
* anything else using the buffer API

Issues and Details

The proposed locking mechanism relies entirely on the objects implementing the buffer interface to do their own thing. Ideally an object that implements the buffer interface should keep at least a count of how many views have been handed out and not yet released.

The handling of discontiguous memory is new and can be seen as a modification of the multiple-segment interface. It is motivated by NumPy (formerly Numeric). NumPy objects should be able to share their strided memory with code that understands how to manage strided memory.

Code should also be able to request contiguous memory if needed, and objects exporting the buffer interface should be able to handle that either by raising an error or by constructing a read-only contiguous object and returning that as the view.

Currently the struct module does not allow specification of nested structures. It seems that specifying a nested structure should be possible, since several ways of viewing memory areas (ctypes and NumPy) already allow this.

Copyright

This PEP is placed in the public domain.
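The export-counting idea from "Issues and Details" can be modelled in a few lines of Python. This is only a toy sketch under stated assumptions: the class CountingExporter and its methods get_buffer, release_buffer and resize are hypothetical names standing in for an exporter's bf_getbuffer and bf_releasebuffer slots, and raising BufferError on resize is one possible policy, not something the draft mandates.

```python
class CountingExporter:
    """Toy exporter that refuses to reallocate while views are outstanding."""

    def __init__(self, size):
        self._data = bytearray(size)
        self._outstanding = 0  # views handed out and not yet released

    def get_buffer(self):
        """Hand out a view of the memory and remember that it is in use."""
        self._outstanding += 1
        return self._data

    def release_buffer(self):
        """Called by a consumer that is done with its view."""
        if self._outstanding == 0:
            raise RuntimeError("release without a matching get_buffer")
        self._outstanding -= 1

    def resize(self, new_size):
        """Reallocate the memory, but only when no consumer still holds a view."""
        if self._outstanding:
            raise BufferError(
                "cannot resize: %d view(s) still outstanding" % self._outstanding
            )
        # Truncate or zero-pad into a fresh allocation.
        self._data = self._data[:new_size].ljust(new_size, b"\x00")


exporter = CountingExporter(4)
view = exporter.get_buffer()
try:
    exporter.resize(8)
    resized_while_shared = True
except BufferError:
    resized_while_shared = False

exporter.release_buffer()
exporter.resize(8)  # succeeds now that the view has been released
```

Without the counter, the resize in the try block would silently replace memory that a consumer still points at, which is the situation item 3 of the Rationale describes.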
On 2/27/07, Travis Oliphant <oliphant@ee.byu.edu> wrote:
> <snip>
> Additions to the struct string-syntax
> The struct string-syntax is missing some characters to fully implement data-format descriptions already available elsewhere (in ctypes and NumPy for example). Here are the proposed additions:
> <snip>
I think it might be good to have something for the quad- and half-precision floats that will be coming along in the next IEEE 754 specification. Quad precision isn't used that much, but when you need it, it is useful. Half precision (16 bits) is used in some GPUs, and I have seen it used for such things as recording side-looking radar returns.

Chuck
Charles R Harris wrote:
> I think it might be good to have something for the quad- and half-precision floats that will be coming along in the next IEEE 754 specification. Quad precision isn't used that much, but when you need it, it is useful. Half precision (16 bits) is used in some GPUs, and I have seen it used for such things as recording side-looking radar returns.
The problem is that we aren't really specifying floating-point standards; we are specifying float, double and long double as whatever the compiler understands.

There are some platforms which don't follow the IEEE 754 standard. This format specification will not be able to describe platform-independent floating-point descriptions.

It would be nice to have such a description, but that is not what struct-style syntax does. Perhaps we could add it to the specification, but I'm not sure the added complexity is worth holding it up over.

-Travis
On 27/02/07, Travis Oliphant <oliphant@ee.byu.edu> wrote:
> The problem is that we aren't really specifying floating-point standards; we are specifying float, double and long double as whatever the compiler understands.
>
> There are some platforms which don't follow the IEEE 754 standard. This format specification will not be able to describe platform-independent floating-point descriptions.
>
> It would be nice to have such a description, but that is not what struct-style syntax does. Perhaps we could add it to the specification, but I'm not sure the added complexity is worth holding it up over.
Hmm. If this is to be used to describe, say, binary data in files on disk, having machine-independent formats would be very handy. The endianness specifiers are there to provide this for integers, because it's so useful.

I realize that if a machine doesn't implement IEEE floats it will be pretty much impossible to implement Python functions to work with them, or even just decode them, but it would be nice to be able to at least *specify* them.

How much more complicated would it be to allow their specification? One letter for each IEEE type, in addition to the existing letters for platform-specific floats?

Anne M. Archibald
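For reference, this is what the existing endianness specifiers do for integers, sketched with the struct module as it exists in present-day Python. Two observations here go beyond what the thread states and should be read as such: when an explicit byte-order prefix is used, the struct documentation specifies IEEE 754 binary32 and binary64 for 'f' and 'd', and Python 3.6 and later also accept 'e' for a 16-bit half-precision float.

```python
import struct

# The same integer, serialized with an explicit byte order each time.
little_int = struct.pack("<i", 1)
big_int = struct.pack(">i", 1)

# With an explicit byte order, 'd' is an IEEE 754 binary64 value.
big_double = struct.pack(">d", 1.0)

# 'e' (Python 3.6+) is IEEE 754 binary16, i.e. half precision.
little_half = struct.pack("<e", 1.0)

# Reading the bytes back with the matching format recovers the values.
round_trip = struct.unpack(">i", big_int)[0], struct.unpack("<e", little_half)[0]
```

Because the byte order and size are fixed by the format string rather than by the machine, the packed bytes can be written to disk and decoded elsewhere, which is the property being asked for here.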
On 2/27/07, Travis Oliphant <oliphant@ee.byu.edu> wrote:
> The problem is that we aren't really specifying floating-point standards; we are specifying float, double and long double as whatever the compiler understands.
>
> There are some platforms which don't follow the IEEE 754 standard. This format specification will not be able to describe platform-independent floating-point descriptions.
>
> It would be nice to have such a description, but that is not what struct-style syntax does. Perhaps we could add it to the specification, but I'm not sure the added complexity is worth holding it up over.
True enough, and it may not make that much sense until it is in the C standard. But it might be nice to reserve something for the future and maybe give some thought to how to deal with new data types as they come along. I can't think of any really flexible methods that don't require some sort of verbose table that goes along with the data, and the single-letter codes are starting to get out of hand. Hmmm. It would actually be nice to redo things so that there was a prefix, say z for complex, f for float, then something for precision. The designation wouldn't be much use without some arithmetic to go with it, and it doesn't make sense to write code for things that don't exist. I wonder how much of the arithmetic can be abstracted from the data type?

Chuck
Charles R Harris wrote:
> True enough, and it may not make that much sense until it is in the C standard. But it might be nice to reserve something for the future and maybe give some thought to how to deal with new data types as they come along. I can't think of any really flexible methods that don't require some sort of verbose table that goes along with the data, and the single-letter codes are starting to get out of hand. Hmmm. It would actually be nice to redo things so that there was a prefix, say z for complex, f for float, then something for precision. The designation wouldn't be much use without some arithmetic to go with it, and it doesn't make sense to write code for things that don't exist. I wonder how much of the arithmetic can be abstracted from the data type?
I suspect we may have to do this separately in the NumPy world. Perhaps we could get such a specification into Python itself, but I'm not hopeful. Notice, though, that we could use the struct syntax to specify a floating-point structure using the bit-field and naming.

In other words, an IEEE 754 32-bit float would be represented in struct-style syntax as

    '>1t:sign: 8t:exp: 23t:mantissa:'

-Travis
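The three named bit fields in that description (1 sign bit, 8 exponent bits, 23 mantissa bits, most significant first) can be pulled out of a real value with the struct module as it exists today. This is a minimal sketch; the helper name float32_fields is hypothetical, and the proposed 't' bit-field code itself is not used because the struct module does not implement it.

```python
import struct


def float32_fields(value):
    """Split an IEEE 754 binary32 value into (sign, exponent, mantissa) bit fields."""
    # Reinterpret the 4 big-endian bytes of the float as one unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # top 1 bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits, biased by 127
    mantissa = bits & 0x7FFFFF       # low 23 bits, implicit leading 1 not stored
    return sign, exponent, mantissa


one = float32_fields(1.0)
minus_two_and_a_half = float32_fields(-2.5)
```

For 1.0 the exponent field holds the bias (127) and the mantissa is zero; for -2.5 (that is, -1.25 times 2) the sign bit is set, the exponent field is 128, and the mantissa holds the fraction 0.25 scaled by 2**23.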
On 2/27/07, Travis Oliphant <oliphant@ee.byu.edu> wrote:
> I suspect we may have to do this separately in the NumPy world. Perhaps we could get such a specification into Python itself, but I'm not hopeful. Notice, though, that we could use the struct syntax to specify a floating-point structure using the bit-field and naming.
>
> In other words, an IEEE 754 32-bit float would be represented in struct-style syntax as
>
>     '>1t:sign: 8t:exp: 23t:mantissa:'
That would probably do nicely. There are potential ambiguities but nothing worth worrying about. Is there a way to assign names to such a type? I suppose that it is just another string constant, so one could write something like

    float32 = '>1t:sign: 8t:exp: 23t:mantissa:'

and use that. Can those bit fields be of arbitrary length?

Now for something completely different ;) In some things, like the socket module, it is possible to ask for a file-like interface which buffers the input and has the usual read, readline, etc. function interface, but fromfile doesn't work with it. This isn't a biggie and I suppose fromfile is looking for a 'real' file, but I wonder if this would be a difficult thing to implement? I could look at the code but I thought I would ask you first.

Chuck
Charles R Harris wrote:
> That would probably do nicely. There are potential ambiguities but nothing worth worrying about. Is there a way to assign names to such a type? I suppose that it is just another string constant, so one could write something like
>
>     float32 = '>1t:sign: 8t:exp: 23t:mantissa:'
>
> and use that. Can those bit fields be of arbitrary length?
>
> Now for something completely different ;) In some things, like the socket module, it is possible to ask for a file-like interface which buffers the input and has the usual read, readline, etc. function interface, but fromfile doesn't work with it. This isn't a biggie and I suppose fromfile is looking for a 'real' file, but I wonder if this would be a difficult thing to implement? I could look at the code but I thought I would ask you first.
The problem here is that fromfile is using the raw stdio fscanf commands, which require an actual file id. It is not using the Python-level fread. It's pretty low-level. On the other hand, there is the fromstring approach, which works with any stream. I suspect a function that uses one or the other could be implemented.

The relevant functions are XXXX_scan and XXX_fromstr in arraytypes.c.src. These are used for each data-type.

Notice that PyArray_FromFile actually requires a FILE *fp pointer. You might be able to use PyArray_FromString, which allows a char * to read data from.

-Travis
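The general idea of reading bytes through a file-like object and then decoding them from memory can be sketched with the standard library alone. This is a stand-in for illustration, not NumPy's implementation: read_doubles is a hypothetical helper, the data is assumed to be binary little-endian doubles, and io.BytesIO plays the role of the buffered socket stream.

```python
import io
import struct


def read_doubles(stream, count):
    """Read up to `count` little-endian doubles from any object with read()."""
    item = struct.Struct("<d")
    data = stream.read(count * item.size)
    # Ignore a trailing partial item if the stream ended early.
    usable = len(data) - len(data) % item.size
    return [value for (value,) in item.iter_unpack(data[:usable])]


# Any file-like object works; no FILE * is needed because decoding
# happens on the bytes already read into memory.
stream = io.BytesIO(struct.pack("<3d", 1.0, 2.0, 3.0))
first_two = read_doubles(stream, 2)
remainder = read_doubles(stream, 5)
```

The design choice mirrors the two code paths described above: one path needs a real file handle, the other only needs a block of bytes, so a wrapper can read from the stream first and hand the bytes to the in-memory parser.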
------------------------------------------------------------------------
_______________________________________________ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion