[Python-3000] Pre-PEP: Altering buffer protocol (tp_as_buffer)

Mon Feb 26 00:26:36 CET 2007

On 2/25/07, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> Travis Oliphant wrote:
>
> >    2. There is no way for a consumer to tell the protocol-exporting
> > object it is "finished" with its view of the memory and therefore no way
> > for the object to be sure that it can reallocate the pointer to the
> > memory that it owns (the array object reallocating its memory after
> > sharing it with the buffer object led to the infamous buffer-object
> > problem).
>
> I'm not sure I'd categorise this problem that way -- it was
> more the buffer object's fault for assuming that it could
> hold on to a C pointer to the memory long-term.
>
> I'm a bit worried about having a get/release kind of thing
> in the protocol, because it risks forcing all objects which
> implement the protocol to provide some kind of refcounting
> and locking mechanism for their data. Some objects may not
> be able to do that easily or efficiently, especially if
> they're wrapping some external library that has no such
> notion.

Only if their buffer can actually move; if the buffer can't be moved
or resized once the object is created, the acquire and release can be
no-ops.

Another problem that would be solved by this is the current unsafety
of blocking I/O operations like file.readinto() and
socket.recv_into(). These operations do roughly the following:

(1) get the pointer and length from the buffer API
(2) release the GIL
(3) call the blocking read() or recv() system call with the pointer and length
(4) reacquire the GIL

The problem is that while the GIL is released, another thread with
access to the object whose buffer is being read into could modify it,
causing the buffer to be moved in memory; the read() or recv()
operation would then be overwriting freed memory (or worse, memory
allocated for a different purpose).

I realized this while thinking about the 3.0 bytes object, but the 2.x
array object has the same problem, and so probably does every other
object that uses the buffer API and has a mutable size (if there are any).
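
To make the acquire/release idea concrete, here is a minimal pure-Python
sketch of the intended behaviour. Everything in it is hypothetical: the
names ResizableExporter, FixedExporter, BufferLockedError, acquire(),
release() and read_into() are invented for illustration and are not part of
any existing or proposed API. A bytearray stands in for the C pointer and
length, and a plain copy stands in for the blocking read() call.

```python
class BufferLockedError(Exception):
    """Raised when a resize is attempted while a view is outstanding."""


class ResizableExporter:
    """Hypothetical exporter whose memory may move when it is resized."""

    def __init__(self, size):
        self._storage = bytearray(size)
        self._exports = 0  # number of acquire() calls not yet released

    def acquire(self):
        # Hand out the storage and remember that a consumer is using it.
        self._exports += 1
        return self._storage

    def release(self):
        if self._exports == 0:
            raise RuntimeError("release() without a matching acquire()")
        self._exports -= 1

    def resize(self, new_size):
        # Allocating new storage models realloc() moving the memory,
        # so it is refused while any consumer still holds the old block.
        if self._exports:
            raise BufferLockedError("buffer is in use and cannot be moved")
        old = self._storage
        self._storage = bytearray(new_size)
        n = min(new_size, len(old))
        self._storage[:n] = old[:n]


class FixedExporter:
    """Hypothetical exporter whose memory never moves after creation."""

    def __init__(self, size):
        self._storage = bytearray(size)

    def acquire(self):
        return self._storage

    def release(self):
        pass  # nothing to track: the buffer cannot be moved or resized


def read_into(exporter, payload):
    """Stand-in for readinto(): hold the buffer for the whole 'read'."""
    target = exporter.acquire()
    try:
        # In the real case the GIL would be released around this step.
        n = min(len(target), len(payload))
        target[:n] = payload[:n]
        return n
    finally:
        exporter.release()
```

With this shape, a second thread calling resize() during the copy gets an
error instead of silently invalidating the memory being written to, and an
object with a fixed buffer pays nothing for the protocol.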

> > All that is needed is to create a Python "memory_view" object that can
> > contain all the information needed and be returned when the buffer
> > protocol is called --- when it is garbage-collected, the
> > "bp_release_view" function is called on the exporting object.
>
> That sounds too heavyweight. Getting a memory view through
> this protocol should be a very lightweight operation -- ideally
> it shouldn't require allocating any memory at all, and it
> certainly shouldn't require creating a Python object.

I agree that getting the pointer and length should be separated from
finding out how the bytes should be interpreted. I'd like to propose a
simple stack or hierarchy of classes to address (what I think are)
Travis's needs:

- At the bottom is a redesigned buffer API: add locking, remove
segcount and char buffers.

- This API is implemented by things like mmap, and also by a "raw
bytes" object which allocates a buffer from the heap; other libraries
may have their own objects that implement this (e.g. numpy, PIL).

- There is a mixin class (at least conceptually it's a mixin) which
takes anything implementing the redesigned buffer API and adds the
bytes API (see recently updated PEP 358); operations like .strip() or
slicing should return copies (of the same or a different type) or
views at the discretion of the underlying object. (Maybe there should
be a read-only and read-write version of this; note that read-only is
not the same as immutable, since the underlying buffer may be modified
by other APIs, if it allows this.)

- *Another* API built on top of the redesigned buffer API would be
something more aligned with numpy's needs, adding (a) a shape
descriptor indicating the size, offset and stride of each dimension,
and (b) a record descriptor indicating the interpretation of one
element of the array. For (a), a list of 3-tuples of ints would
probably be sufficient (constrained so that no valid combination of
indexes points outside the buffer); for (b), I propose (with Jim
Hugunin who first suggested this at PyCon) to use the same concise but
expressive format-string-like notation used by the struct module. (The
bytes API is not quite a special case of this, since it provides more
string-like operations.)
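
As a rough illustration of the last bullet, the sketch below pairs a shape
descriptor (a list of (size, offset, stride) 3-tuples) with a struct format
string. It rests on assumptions the text does not spell out: offsets and
strides are measured in bytes, and the position of an element is the sum of
offset + index * stride over all dimensions. The function names
element_offset, check_shape and get_item are hypothetical.

```python
import struct


def element_offset(shape, index):
    """Byte position of the element at `index` (one int per dimension)."""
    if len(index) != len(shape):
        raise IndexError("wrong number of indexes")
    pos = 0
    for (size, offset, stride), i in zip(shape, index):
        if not 0 <= i < size:
            raise IndexError("index out of range")
        pos += offset + i * stride
    return pos


def check_shape(shape, fmt, buffer_len):
    """Reject descriptors where a valid index could leave the buffer."""
    itemsize = struct.calcsize(fmt)
    lowest = highest = 0
    for size, offset, stride in shape:
        if size <= 0:
            raise ValueError("each dimension needs a positive size")
        first = offset
        last = offset + (size - 1) * stride
        lowest += min(first, last)
        highest += max(first, last)
    if lowest < 0 or highest + itemsize > buffer_len:
        raise ValueError("shape reaches outside the buffer")


def get_item(buf, shape, fmt, index):
    """Decode one element; returns a tuple, as struct.unpack_from does."""
    return struct.unpack_from(fmt, buf, element_offset(shape, index))


# A 2x3 row-major array of little-endian 16-bit ints (2 bytes each).
data = struct.pack("<6h", 0, 1, 2, 3, 4, 5)
rows_by_cols = [(2, 0, 6), (3, 0, 2)]
check_shape(rows_by_cols, "<h", len(data))
last_element = get_item(data, rows_by_cols, "<h", (1, 2))

# Swapping the tuples gives a transposed view of the same bytes.
cols_by_rows = [(3, 0, 2), (2, 0, 6)]
same_element = get_item(data, cols_by_rows, "<h", (2, 1))
```

Both lookups land on byte 10 of the buffer, so no data is copied to obtain
the transposed view. A multi-field record works the same way: only the
format string changes, for example "<hB" for a 16-bit int followed by one
unsigned byte.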

The crucial idea here (like so often :-) is not to use inheritance but
composition. This means that we can separate management of the buffer
(e.g. malloc, mmap, whatever) from providing APIs on top of this
(either the bytes API or the multi-dimensional array API).
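
The composition idea can be sketched in a few lines of Python. This is only
an illustration under stated assumptions: HeapBuffer, BytesAPI and ArrayAPI
are hypothetical names, acquire()/release() stand in for the redesigned
buffer API, and the array wrapper is one-dimensional to keep the example
short.

```python
import struct


class HeapBuffer:
    """Hypothetical buffer manager: owns the memory and nothing else."""

    def __init__(self, data):
        self._storage = bytearray(data)

    def acquire(self):
        return self._storage

    def release(self):
        pass  # fixed-size storage, so there is nothing to unlock


class BytesAPI:
    """String-like operations layered on any acquire/release object."""

    def __init__(self, raw):
        self._raw = raw

    def strip(self):
        mem = self._raw.acquire()
        try:
            return bytes(mem).strip()  # returns a copy, not a view
        finally:
            self._raw.release()


class ArrayAPI:
    """1-D typed element access layered on the same kind of object."""

    def __init__(self, raw, fmt):
        self._raw = raw
        self._fmt = fmt
        self._itemsize = struct.calcsize(fmt)

    def __getitem__(self, i):
        mem = self._raw.acquire()
        try:
            return struct.unpack_from(self._fmt, mem, i * self._itemsize)[0]
        finally:
            self._raw.release()

    def __setitem__(self, i, value):
        mem = self._raw.acquire()
        try:
            struct.pack_into(self._fmt, mem, i * self._itemsize, value)
        finally:
            self._raw.release()


raw = HeapBuffer(b"  ab  ")
as_bytes = BytesAPI(raw)
as_array = ArrayAPI(raw, "<H")  # unsigned 16-bit little-endian items
```

Neither wrapper inherits from the buffer manager, so swapping HeapBuffer
for an mmap-backed object would not require touching BytesAPI or ArrayAPI,
and a write through one wrapper is visible through the other because they
share the same underlying memory.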

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)