[Python-Dev] The buffer interface

M.-A. Lemburg mal@lemburg.com
Mon, 16 Oct 2000 20:03:15 +0200

Guido van Rossum wrote:
> ...
> I would have concluded that the buffer object is entirely useless, if
> it weren't for some very light use that is being made of it by the
> Unicode machinery.  I can't quite tell whether that was done just
> because it was convenient, or whether that shows there is a real
> need.

I used the buffer object since I thought that buffer() objects
were meant to replace strings as the container for binary data. The
buffer object wraps a memory buffer into a Python object for the
purpose of decoding it into Unicode. 8-bit string objects would have
worked just as well...

> What Now?
> ---------
> I'm not convinced that we need the buffer object at all.  For example,
> the mmap module defines a sequence object so doesn't seem to need the
> buffer object to help it support slices.

It would be nice to have an object for "copy by reference" rather
than "malloc + copy". This would be useful for strings (e.g. to
access substrings of a large string), Unicode and binary
data. The buffer object almost does this... it would only have
to stick to always returning buffer objects in coercion, slicing
etc. I also think that the name "buffer" is misleading, since it
really means "reference" in the context published by the Python
interface (the C API also has a way of defining new malloc areas
and referencing them through the buffer interface, but that is
not published in Python).
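
To make the "copy by reference" idea concrete, here is a minimal sketch.
It assumes a modern Python and uses memoryview purely as a stand-in for
a buffer object that always returns references when sliced; memoryview
is not the buffer object discussed in this thread, and the variable
names are made up for illustration.

```python
# Illustrative sketch only: memoryview stands in for the "reference"
# object described above. It is not the buffer() object of this thread.
data = bytearray(b"header:payload-bytes")

view = memoryview(data)   # wraps the existing memory, no copy
sub = view[7:14]          # slicing a view yields another view, still no copy

copy = bytes(data[7:14])  # ordinary slicing: allocate new memory and copy

# Change one byte of the underlying object. The view sees the change
# because it refers to the same memory; the earlier copy does not.
data[7] = ord("P")

seen_through_view = bytes(sub)
```

After the write, the slice taken through the view reflects the new byte
while the copy keeps the old contents, which is the difference between
referencing and "malloc + copy".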

The other missing data type in Python is one for binary data.
Currently, string objects are in common use for this kind of
data. The problems with this are obvious: in some contexts strings
are expected to contain text data, in others binary data. When the
two meet there's great confusion. I'd suggest either making
arrays the Python standard type for holding binary data, or
creating a completely new type (this should then be called
something like "buffer").
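
As a rough sketch of the first option, the array module can already
hold raw bytes in a type that is distinct from text. The example
assumes a modern Python (tobytes() is a later addition) and the sample
values are arbitrary; it only illustrates the separation, not a
proposed design.

```python
from array import array

# Illustrative sketch: an array of unsigned bytes used as a binary container.
raw = array("B", [0x89, 0x50, 0x4E, 0x47])
raw.append(0x0D)              # elements are small integers, not characters

values = raw.tolist()         # plain list of ints
packed = raw.tobytes()        # the same data as a contiguous byte string

# Text does not silently mix with the binary container.
try:
    raw.extend("PNG")
    mixed_ok = True
except TypeError:
    mixed_ok = False
```

Because the container only accepts integers in the byte range, the
text/binary confusion shows up as an immediate error instead of
corrupt data later on.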

> Regarding the buffer API, it's clearly useful, although I'm not
> convinced that it needs the multiple segment count option or the char
> vs. binary buffer distinction, given that we're not using this for
> Unicode objects as we originally planned.

> I also feel that it would be helpful if there was an explicit way to
> lock and unlock the data, so that a file object can release the global
> interpreter lock while it is doing the I/O.  But that's not a high
> priority (and there are no *actual* problems caused by the lack of
> such an API -- just *theoretical*).

How about adding a generic low-level lock type for this kind
of task? The interpreter could be made aware of these types
to allow a much more fine-grained lock mechanism, e.g. to check
for acquired locks of certain objects only.

> For Python 3000, I think I'd like to rethink this whole mess.  Perhaps
> byte buffers and character strings should be different beasts, and
> maybe character strings could have Unicode and 8-bit subclasses (and
> maybe other subclasses that explicitly know about their encoding).
> And maybe we'd have a real file base class.  And so on.

Great... but 3000 is a long way ahead :-(

> What to do in the short run?  I'm still for severely simplifying the
> buffer object (ripping out the unused operations) and deprecating it.

Since it isn't all that well known anyway, how about streamlining
the buffer object's implementations of the various protocols and
removing the distinction between "s" and "t" ?!

Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/