
Greg Stein wrote:
[ damn, I wish people would pay more attention to changing the subject line to reflect the contents of the email ... I could not figure out if there were any further responses to this without opening most of those dang "Unicode debate" emails. sheesh... ]
On Tue, 2 May 2000, M.-A. Lemburg wrote:
Guido van Rossum wrote:
[MAL]
Let's not make the same mistake again: Unicode objects should *not* be used to hold binary data. Please use buffers instead.
Easier said than done -- Python doesn't really have a buffer data type.
The buffer object. We *do* have the type.
Or do you mean the array module? It's not trivial to read a file into an array (although it's possible, there are even two ways). Fact is, most of Python's standard library and built-in objects use (8-bit) strings as buffers.
For historical reasons only. It would be very easy to change these to use buffer objects, except for the simple fact that callers might expect a *string* rather than something with string-like behavior.
Would this be too drastic a change, then ? I think that we should at least make use of buffers in the standard lib.
...
BTW, I think that this behaviour should be changed:
>>> buffer('binary') + 'data'
'binarydata'
In several places, bufferobject.c uses PyString_FromStringAndSize(). It wouldn't be hard at all to use PyBuffer_New() to allocate the memory, then copy the data in. A new API could also help out here:
PyBuffer_CopyMemory(void *ptr, int size)
while:
>>> 'data' + buffer('binary')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: illegal argument type for built-in operation
The string object can't handle the buffer on the right side. Buffer objects use the buffer interface, so they can deal with strings on the right. Therefore: asymmetry :-(
IMHO, buffer objects should never coerce to strings, but instead return a buffer object holding the combined contents. The same applies to slicing buffer objects:
>>> buffer('binary')[2:5]
'nar'
should preferably be buffer('nar').
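The behaviour asked for here (a slice of a buffer stays a buffer) can be sketched in current Python. This assumes memoryview as a stand-in for the buffer() builtin discussed in this thread, which is not available in Python 3; the variable names are illustrative only.

```python
# Assumption: memoryview (Python 3) stands in for the old buffer() builtin.
data = b"binary"
view = memoryview(data)

# Slicing a view yields another view, not a bytes/string object.
piece = view[2:5]
is_view = isinstance(piece, memoryview)

# Getting the raw bytes back requires an explicit conversion.
raw = bytes(piece)
print(is_view, raw)
```

No coercion to a string type happens unless the caller asks for it, which is the rule proposed above for buffer objects.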
Sure. Wouldn't be a problem. The FromStringAndSize() thing.
Right. Before digging deeper into this, I think we should hear Guido's opinion on this again: he said that he wanted to use Java's binary arrays for binary data... perhaps we need to tweak the array type and make it more directly accessible (from C and Python) instead.
Note that a buffer object doesn't hold data! It's only a pointer to data. I can't off-hand explain the asymmetry though.
Dang, you're right...
Untrue. There is an API call which will construct a buffer object with its own memory:
PyObject * PyBuffer_New(int size)
The resulting buffer object will be read/write, and you can stuff values into it using the slice notation.
Yes, but that API is not reachable from within Python, AFAIK.
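For comparison, here is a minimal sketch of what "a read/write buffer that owns its memory and is filled via slice notation" looks like from Python code. It assumes bytearray as the closest Python-level analogue; it is not the PyBuffer_New() C API itself, and the names are illustrative.

```python
# Assumption: bytearray plays the role of a buffer created by PyBuffer_New(size).
size = 8
buf = bytearray(size)      # zero-filled block, owned by the object, read/write

# Same-length slice assignment stores values in place.
buf[0:4] = b"data"

print(len(buf), bytes(buf))
```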
Hmm, perhaps we need something like a data string object to get this 100% right ?!
Nope. The buffer object is intended to be exactly this.
...
Not clear. I'd rather do the equivalent of byte arrays in Java, for which no "string literal" notations exist.
Anyway, one way or another I think we should make it clear to users that they should start using some other type for storing binary data.
Buffer objects. There are a couple changes to make this a bit easier for people:
1) buffer(ob [,offset [,size]]) should be changed to allow buffer(size) to create a read/write buffer of a particular size. buffer() should create a zero-length read/write buffer.
This looks a lot like function overloading... I don't think we should get into this: how about having the buffer() API take keywords instead ?!

buffer(size=1024,mode='rw') - 1K of owned read write memory
buffer(obj) - read-only referenced memory from obj
buffer(obj,mode='rw') - read-write referenced memory in obj

etc. Or we could allow passing None as object to obtain an owned read-write memory block (much like passing NULL to the C functions).
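The keyword-style interface can be sketched as follows. Everything here is an assumption made for illustration: make_buffer() is a hypothetical helper, not an existing builtin, and it is built on memoryview and bytearray (Python 3.8 or later for toreadonly()) rather than on the buffer object discussed in this thread.

```python
def make_buffer(obj=None, size=None, mode="r"):
    """Hypothetical helper mirroring the keyword-based buffer() proposal."""
    if mode not in ("r", "rw"):
        raise ValueError("mode must be 'r' or 'rw'")
    if obj is None:
        # No object given: allocate an owned, zero-filled block.
        view = memoryview(bytearray(size or 0))
    else:
        if size is not None:
            raise TypeError("pass either obj or size, not both")
        # Object given: reference its memory without copying.
        view = memoryview(obj)
    if mode == "rw":
        if view.readonly:
            raise TypeError("underlying object is not writable")
        return view
    return view.toreadonly()


owned = make_buffer(size=1024, mode="rw")    # 1K of owned read/write memory
owned[0:4] = b"data"

ro = make_buffer(b"binary")                  # read-only referenced memory

backing = bytearray(b"binary")
rw = make_buffer(backing, mode="rw")         # read/write referenced memory
rw[0:1] = b"B"                               # writes through to backing
```

Passing obj=None here plays the part of the "pass None to get an owned block" variant mentioned above.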
2) if slice assignment is updated to allow changes to the length (for example: buf[1:2] = 'abcdefgh'), then the buffer object definition must change. Specifically: when the buffer object owns the memory, it does this by appending the memory after the PyObject_HEAD and setting its internal pointer to it; when the dealloc() occurs, the target memory goes with the object. A flag would need to be added to tell the buffer object to do a second free() for the case where a realloc has returned a new pointer. [ I'm not sure that I would agree with this change, however; but it does make them a bit easier to work with; on the other hand, people have been working with immutable strings for a long time, so they're okay with concatenation, so I'm okay with saying length-altering operations must simply be done thru concatenation. ]
I don't think I like this either: what happens when the buffer doesn't own the memory ?
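The ownership question can be illustrated with today's types. This is a sketch under the assumption that memoryview over a bytearray approximates a buffer object that references memory it does not own; the original buffer object may well have behaved differently.

```python
backing = bytearray(b"binary")
view = memoryview(backing)       # references backing's memory, does not own it

# A length-changing slice assignment through the view is refused.
try:
    view[1:2] = b"abcdefgh"
    view_error = None
except ValueError as exc:
    view_error = exc

# The owner cannot be resized either while a view still references it.
try:
    backing[1:2] = b"abcdefgh"
    owner_error = None
except BufferError as exc:
    owner_error = exc

# Once the view is released, the owner may change length again.
view.release()
backing[1:2] = b"abcdefgh"
print(bytes(backing))
```

The design choice shown is the conservative one: length-altering operations are only allowed on the object that owns the memory, and only while nothing else points into it.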
IMO, extensions should be using the buffer object for raw bytes. I know that Mark has been updating some of the Win32 extensions to do this. Python programs could use the objects if the buffer() builtin is tweaked to allow a bit more flexibility in the arguments.
Right.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/