[Python-Dev] The buffer interface

Mon, 16 Oct 2000 13:08:07 -0500

The buffer interface is one of the most misunderstood parts of
Python.  I believe that if it were PEPped today, it would have a hard
time getting accepted in its current form.

Two different things are commonly referred to by this name: the
"buffer API", which is a C-only API, and the "buffer object", which
has both a C API and a Python API.

Both were largely proposed, implemented and extended by others, and I
have to admit that I'm still uneasy with defending them, especially
the buffer object.  Both are extremely implementation-dependent (in
JPython, neither makes much sense).

The Buffer API
--------------

The C-only buffer API was originally intended to allow efficient
binary I/O from and (in some cases) to large objects that have a
relatively well-understood underlying memory representation.  Examples
of such objects include strings, array module arrays, memory-mapped
files, NumPy arrays, and PIL objects.

It was created with the desire to avoid an expensive memory-copy
operation when reading or writing large arrays.  For example, if you
have an array object containing several million double-precision
floating-point numbers, and you want to dump it to a file, you might
prefer to do the I/O directly from the array's memory buffer rather
than first copying it to a string.  (You lose portability of the data,
but that's often not a problem the user cares about in these cases.)
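
As a rough illustration of the copy being avoided, here is a minimal
sketch in present-day Python rather than the C-level API discussed in
this message. It assumes that write() on a binary stream accepts any
object that exposes its memory, and it uses io.BytesIO as a stand-in
for a real file; the variable names are made up for the example.

```python
import io
from array import array

# A handful of doubles; the scenario in the text has millions.
values = array("d", [0.5, 1.5, 2.5])

# Copying route: materialize an intermediate bytes object first.
copied = values.tobytes()

# Direct route: hand the array itself to write(), which reads the
# array's memory without the caller building an intermediate string.
sink = io.BytesIO()  # stand-in for a real binary file
written = sink.write(values)

print(written == len(values) * values.itemsize)
print(sink.getvalue() == copied)
```

Both routes produce the same bytes, in the machine's native layout,
which is the portability cost mentioned above.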

An alternative solution for this particular problem was considered:
object types in need of this kind of efficient I/O could define their
own I/O methods, thereby allowing them to hide their internal
representation.  This was implemented in some cases (e.g. the array
module has read() and write() methods) but rejected as a general
solution, because a simple-minded implementation of this approach
would not work with "file-like" objects (e.g. StringIO files).  It
was deemed important
that file-like objects would not place restrictions on the kind of
objects that could interact with them (compared to real file objects).

A possible solution would have been to require that each object
implementing its own read and write methods should support both
efficient I/O to/from "real" file objects and fall-back I/O to/from
"file-like" objects.  The fall-back I/O would have to convert the
object's data to a string object which would then be passed to the
write() method of the file-like object.  This approach was rejected
because it would make it impossible to implement an alternative file
object that would be as efficient as the real file object, since large
object I/O would be using the inefficient fallback interface.

To address these issues, we decided to define an interface that would
let I/O operations ask the objects where their data bytes are in
memory, so that the I/O can go directly to/from the memory allocated
by the object.  This is the classic buffer API.  It has a read-only
and a writable variant -- the writable variant is for mutable objects
that will allow I/O directly into them.  Because we expected that some
objects might have an internal representation distributed over a
(small) number of separately allocated pieces of memory, we also added
the getsegcount() API.  All objects that I know of that support the
buffer API return a segment count of 1, and most places that use the
buffer API give up if the segment count is larger; so this may be
considered an unnecessary generalization (and source of complexity).
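
The read-only versus writable distinction can be pictured with a
small sketch in present-day Python. It uses memoryview and readinto()
as stand-ins, on the assumption that they are a fair analogue; they
are not the C functions described in this message, and the segment
count has no counterpart in this sketch.

```python
import io

# An immutable object can only offer a read-only view of its bytes.
readonly_view = memoryview(b"spam")

# A mutable object can also offer a writable view.
writable_view = memoryview(bytearray(4))

print(readonly_view.readonly, writable_view.readonly)

# I/O directly into a mutable object: readinto() fills memory the
# caller already owns instead of returning a newly allocated object.
target = bytearray(4)
source = io.BytesIO(b"eggs and more")
count = source.readinto(target)

print(count, target)
```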

The buffer API has found significant use in a way that wasn't
originally intended: as a sort of informal common base class for
string-like objects in situations where a char[] or char* type must be
passed (in a read-only fashion) to C code.  This is in fact the most
common use of the buffer API now, and appears to be the reason why the
segment count must typically be 1.

In connection with this, the buffer API has grown a distinction
between character and binary buffers (on the read-only end only).
This may have been a mistake; it was intended to help with Unicode but
it ended up not being used.

The Buffer Object
-----------------

The buffer object has a much less clear reason for its existence.
When Greg Stein first proposed it, he wrote:

    The intent of this type is to expose a string-like interface from
    an object that supports the buffer interface (without making a
    copy). In addition, it is intended to support slices of the target
    object.

    My eventual goal here is to tweak the file object to support
    memory mapping and the buffer interface. The buffer object can
    then return slices of the file without making a new copy. Next
    step: change marshal.c, ceval.c, and compile.c to support a buffer
    for the co_code attribute. Net result is that copies of code
    streams don't need to be copied onto the heap, but can be left in
    an mmap'd file or a frozen file. I'm hoping there will be some
    perf gains (time and memory).

    Even without some of the co_code work, enabling mmap'd files and
    buffers onto them should be very useful. I can probably rattle off
    a good number of other uses for the buffer type.

I don't think that any of these benefits have been realized yet, and
altogether I think that the buffer object causes a lot of confusion.
The buffer *API* doesn't guarantee enough about the lifetime of the
pointers for the buffer *object* to be able to safely preserve those
pointers, even if the buffer object holds on to the base object.  (The
C-level buffer API informally guarantees that the data remains valid
only until you do anything to the base object; this is usually fine as
long as you don't release the global interpreter lock.)

The buffer object's approach to implementing the various sequence
operations is strange: sometimes it behaves like a string, sometimes
it doesn't.  E.g. a slice returns a new string object unless it
happens to address the whole buffer, in which case it returns a
reference to the existing buffer object.  It would seem more logical
that a subslice would return a new buffer object.  Concatenation and
repetition of buffer objects are likewise implemented inconsistently;
it would have been more consistent with the intended purpose if these
weren't supported at all (i.e. if none of the buffer object operations
would allocate new memory except for buffer object headers).
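
For comparison, here is a minimal sketch of the "more logical"
slicing behaviour described above, where a subslice is another view
and no data is copied. It uses the memoryview type of present-day
Python, which is an assumption made for illustration; it is not the
buffer object discussed in this message.

```python
data = bytearray(b"hello world")
view = memoryview(data)

# Slicing yields another view onto the same memory, not a new string.
sub = view[6:]

# Writing through the parent view is visible through the subslice,
# which shows that the two share memory.
view[6:11] = b"WORLD"

print(type(sub).__name__, bytes(sub), sub.obj is data)
```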

I would have concluded that the buffer object is entirely useless, if
it weren't for some very light use that is being made of it by the
Unicode machinery.  I can't quite tell whether that was done just
because it was convenient, or whether that shows there is a real
need.

What Now?
---------

I'm not convinced that we need the buffer object at all.  For example,
the mmap module defines a sequence object, so it doesn't seem to need
the buffer object to help it support slices.
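
A short sketch of the mmap point, written for present-day Python: the
mapping object handles indexing and slicing itself, with no wrapper
object involved. The anonymous mapping (file descriptor -1) is chosen
only so the example needs no file on disk.

```python
import mmap

# Anonymous 16-byte mapping, so no file on disk is assumed.
mapping = mmap.mmap(-1, 16)

# Slice assignment and slice reads are implemented by the mmap type.
mapping[0:5] = b"hello"
chunk = mapping[0:5]
first = mapping[0]

print(chunk, first)
mapping.close()
```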

Regarding the buffer API, it's clearly useful, although I'm not
convinced that it needs the multiple segment count option or the char
vs. binary buffer distinction, given that we're not using this for
Unicode objects as we originally planned.

I also feel that it would be helpful if there were an explicit way to
lock and unlock the data, so that a file object can release the global
interpreter lock while it is doing the I/O.  But that's not a high
priority (and there are no *actual* problems caused by the lack of
such an API -- just *theoretical*).
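
The idea of an explicit lock/unlock pair can be pictured with the
following sketch. It is only an analogy built on memoryview in
present-day Python, not an API that existed when this was written,
and it says nothing about the global interpreter lock itself: while a
view is held, the underlying object refuses to move its memory, and
releasing the view lifts that restriction.

```python
data = bytearray(b"abcd")

# "Lock": while the view exists, data's memory must stay in place.
view = memoryview(data)

try:
    data.extend(b"efgh")  # would need to reallocate the memory
    resized_while_locked = True
except BufferError:
    resized_while_locked = False

# "Unlock": after release() the object may be resized again.
view.release()
data.extend(b"efgh")

print(resized_while_locked, data)
```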

For Python 3000, I think I'd like to rethink this whole mess.  Perhaps
byte buffers and character strings should be different beasts, and
maybe character strings could have Unicode and 8-bit subclasses (and
maybe other subclasses that explicitly know about their encoding).
And maybe we'd have a real file base class.  And so on.

What to do in the short run?  I'm still for severely simplifying the
buffer object (ripping out the unused operations) and deprecating it.

--Guido van Rossum (home page: http://www.python.org/~guido/)