[Python-Dev] buffer object (was: Unicode debate)

M.-A. Lemburg mal@lemburg.com
Mon, 08 May 2000 10:33:01 +0200


Greg Stein wrote:
> 
> [ damn, I wish people would pay more attention to changing the subject
>   line to reflect the contents of the email ... I could not figure out if
>   there were any further responses to this without opening most of those
>   dang "Unicode debate" emails. sheesh... ]
> 
> On Tue, 2 May 2000, M.-A. Lemburg wrote:
> > Guido van Rossum wrote:
> > >
> > > [MAL]
> > > > Let's not make the same mistake again: Unicode objects should *not*
> > > > be used to hold binary data. Please use buffers instead.
> > >
> > > Easier said than done -- Python doesn't really have a buffer data
> > > type.
> 
> The buffer object. We *do* have the type.
> 
> > > Or do you mean the array module?  It's not trivial to read a
> > > file into an array (although it's possible, there are even two ways).
> > > Fact is, most of Python's standard library and built-in objects use
> > > (8-bit) strings as buffers.
> 
> For historical reasons only. It would be very easy to change these to use
> buffer objects, except for the simple fact that callers might expect a
> *string* rather than something with string-like behavior.

Would this be too drastic a change, then? I think that we should
at least make use of buffers in the standard lib.

>
> >...
> > > > BTW, I think that this behaviour should be changed:
> > > >
> > > > >>> buffer('binary') + 'data'
> > > > 'binarydata'
> 
> In several places, bufferobject.c uses PyString_FromStringAndSize(). It
> wouldn't be hard at all to use PyBuffer_New() to allocate the memory, then
> copy the data in. A new API could also help out here:
> 
>   PyBuffer_CopyMemory(void *ptr, int size)
> 
> > > > while:
> > > >
> > > > >>> 'data' + buffer('binary')
> > > > Traceback (most recent call last):
> > > >   File "<stdin>", line 1, in ?
> > > > TypeError: illegal argument type for built-in operation
> 
> The string object can't handle the buffer on the right side. Buffer
> objects use the buffer interface, so they can deal with strings on the
> right. Therefore: asymmetry :-(
> 
> > > > IMHO, buffer objects should never coerce to strings, but instead
> > > > return a buffer object holding the combined contents. The
> > > > same applies to slicing buffer objects:
> > > >
> > > > >>> buffer('binary')[2:5]
> > > > 'nar'
> > > >
> > > > should preferably be buffer('nar').
> 
> Sure. Wouldn't be a problem. The FromStringAndSize() thing.

Right.
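
To make the intended semantics concrete, here is a minimal sketch
written against current Python 3, not against the existing buffer
type. DataBuffer and its helper _as_bytes are hypothetical names used
only for illustration, and the class copies its data instead of
referencing it, so it shows nothing more than the operator behaviour:
concatenation works from either side and, like slicing, returns a
buffer rather than a string.

```python
class DataBuffer:
    """Hypothetical binary-data type; illustrates operator semantics only."""

    def __init__(self, data=b""):
        self._data = _as_bytes(data)

    def __bytes__(self):
        return self._data

    def __len__(self):
        return len(self._data)

    def __eq__(self, other):
        return isinstance(other, DataBuffer) and self._data == other._data

    def __repr__(self):
        return "DataBuffer(%r)" % (self._data,)

    def __add__(self, other):
        # buffer + bytes-like -> buffer
        return DataBuffer(self._data + _as_bytes(other))

    def __radd__(self, other):
        # bytes-like + buffer -> buffer (removes the asymmetry)
        return DataBuffer(_as_bytes(other) + self._data)

    def __getitem__(self, index):
        # Slices stay buffers; a single index yields an integer byte value.
        if isinstance(index, slice):
            return DataBuffer(self._data[index])
        return self._data[index]


def _as_bytes(obj):
    """Accept only binary data; text is rejected rather than coerced."""
    if isinstance(obj, DataBuffer):
        return obj._data
    if isinstance(obj, (bytes, bytearray, memoryview)):
        return bytes(obj)
    raise TypeError("expected binary data, got %s" % type(obj).__name__)


left = DataBuffer(b"binary") + b"data"
right = b"data" + DataBuffer(b"binary")
part = DataBuffer(b"binary")[2:5]
```

Both orders of concatenation give a DataBuffer here, and the slice
is DataBuffer(b'nar') instead of a plain string.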
 
Before digging deeper into this, I think we should hear
Guido's opinion on this again: he said that he wanted
something like Java's byte arrays for binary data... perhaps we
need to tweak the array type and make it more directly
accessible (from C and Python) instead.

> > > Note that a buffer object doesn't hold data!  It's only a pointer to
> > > data.  I can't off-hand explain the asymmetry though.
> >
> > Dang, you're right...
> 
> Untrue. There is an API call which will construct a buffer object with its
> own memory:
> 
>   PyObject * PyBuffer_New(int size)
> 
> The resulting buffer object will be read/write, and you can stuff values
> into it using the slice notation.

Yes, but that API is not reachable from within Python,
AFAIK.
 
> > > > Hmm, perhaps we need something like a data string object
> > > > to get this 100% right ?!
> 
> Nope. The buffer object is intended to be exactly this.
> 
> >...
> > > Not clear.  I'd rather do the equivalent of byte arrays in Java, for
> > > which no "string literal" notations exist.
> >
> > Anyway, one way or another I think we should make it clear
> > to users that they should start using some other type for
> > storing binary data.
> 
> Buffer objects. There are a couple changes to make this a bit easier for
> people:
> 
> 1) buffer(ob [,offset [,size]]) should be changed to allow buffer(size) to
>    create a read/write buffer of a particular size. buffer() should create
>    a zero-length read/write buffer.

This looks a lot like function overloading... I don't think we
should get into this: how about having the buffer() API take
keywords instead ?!

buffer(size=1024,mode='rw') - 1K of owned read write memory
buffer(obj) - read-only referenced memory from obj
buffer(obj,mode='rw') - read-write referenced memory in obj

etc.

Or we could allow passing None as object to obtain an owned
read-write memory block (much like passing NULL to the
C functions).
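
A rough sketch of how such a keyword-based constructor could behave,
written for current Python 3 (3.8 or later, for toreadonly()) with
bytearray and memoryview standing in for owned and referenced
memory. The name make_buffer and the exact error handling are
assumptions for illustration, not an existing or agreed API.

```python
def make_buffer(obj=None, size=None, mode="r"):
    """Hypothetical keyword-style buffer constructor (illustration only)."""
    if mode not in ("r", "rw"):
        raise ValueError("mode must be 'r' or 'rw'")
    if obj is None:
        # No object given: allocate an owned, zero-filled block.
        view = memoryview(bytearray(size or 0))
    else:
        # Object given: reference its memory without copying.
        view = memoryview(obj)
        if size is not None:
            view = view[:size]
    if mode == "r":
        return view.toreadonly()
    if view.readonly:
        raise TypeError("underlying object does not provide writable memory")
    return view


owned = make_buffer(size=1024, mode="rw")     # 1K of owned read-write memory
readonly = make_buffer(b"binary")             # read-only referenced memory
target = bytearray(b"binary")
writable = make_buffer(target, mode="rw")     # read-write view into target
writable[0:1] = b"B"                          # writes through to target
```

Passing None (or nothing) as the object selects the owned case, so
the same function covers both spellings suggested above.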

> 2) if slice assignment is updated to allow changes to the length (for
>    example: buf[1:2] = 'abcdefgh'), then the buffer object definition must
>    change. Specifically: when the buffer object owns the memory, it does
>    this by appending the memory after the PyObject_HEAD and setting its
>    internal pointer to it; when the dealloc() occurs, the target memory
>    goes with the object. A flag would need to be added to tell the buffer
>    object to do a second free() for the case where a realloc has returned
>    a new pointer.
>    [ I'm not sure that I would agree with this change, however; but it
>      does make them a bit easier to work with; on the other hand, people
>      have been working with immutable strings for a long time, so they're
>      okay with concatenation, so I'm okay with saying length-altering
>      operations must simply be done thru concatenation. ]

I don't think I like this either: what happens when the buffer
doesn't own the memory ?
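
The concern can be illustrated with current Python 3 types, used
here only as stand-ins for the buffer object under discussion: a
bytearray owns its memory and can change length, while a memoryview
merely references someone else's memory and cannot. The helper name
resize_via_slice is hypothetical.

```python
def resize_via_slice(target):
    """Try the length-changing slice assignment; report whether it worked."""
    try:
        target[1:2] = b"abcdefgh"
    except (ValueError, BufferError):
        return False
    return True


# Case 1: the object owns its memory, so it is free to reallocate.
owned = bytearray(b"xyz")
owned_ok = resize_via_slice(owned)

# Case 2: the view only references memory owned by `backing`.
backing = bytearray(b"xyz")
view = memoryview(backing)
view_ok = resize_via_slice(view)

# While the view exists, even the owner refuses to be resized.
backing_ok = resize_via_slice(backing)
```

Only the first case succeeds; the other two are refused and leave
the underlying data untouched.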
 
> IMO, extensions should be using the buffer object for raw bytes. I know
> that Mark has been updating some of the Win32 extensions to do this.
> Python programs could use the objects if the buffer() builtin is tweaked
> to allow a bit more flexibility in the arguments.

Right.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/