[I18n-sig] Re: [Python-Dev] Unicode debate

Guido van Rossum guido@python.org
Tue, 02 May 2000 08:26:50 -0400


[MAL]
> Let's not make the same mistake again: Unicode objects should *not*
> be used to hold binary data. Please use buffers instead.

Easier said than done -- Python doesn't really have a buffer data
type.  Or do you mean the array module?  It's not trivial to read a
file into an array (although it's possible; there are even two ways).
Fact is, most of Python's standard library and built-in objects use
(8-bit) strings as buffers.
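
For illustration, here is a rough sketch of what two ways of reading a
file into an array could look like. The post does not name the two
ways, so this assumes they are the array methods fromfile() and
fromstring(). It is written for a present-day Python 3 interpreter,
where fromstring() is spelled frombytes(); the helper names and the
temporary demo file are made up for the example.

```python
import array
import os
import tempfile


def read_with_fromfile(path):
    """Way 1: let the array pull its items straight from the open file."""
    size = os.path.getsize(path)
    buf = array.array('B')            # 'B' = unsigned byte, one item per byte
    with open(path, 'rb') as f:
        buf.fromfile(f, size)         # raises EOFError if the file is shorter
    return buf


def read_with_frombytes(path):
    """Way 2: read the file into a string first, then copy it into the array."""
    buf = array.array('B')
    with open(path, 'rb') as f:
        buf.frombytes(f.read())       # spelled fromstring() in old Pythons
    return buf


# Hypothetical demo data, written to a throwaway file.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'binary\x00data')

a = read_with_fromfile(demo_path)
b = read_with_frombytes(demo_path)
os.remove(demo_path)
```

Either way you need to know the item count up front or go through an
intermediate string, which is the "not trivial" part.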

I agree there's no reason to extend this to Unicode strings.

> BTW, I think that this behaviour should be changed:
> 
> >>> buffer('binary') + 'data'
> 'binarydata'
> 
> while:
> 
> >>> 'data' + buffer('binary')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> TypeError: illegal argument type for built-in operation
> 
> IMHO, buffer objects should never coerce to strings, but instead
> return a buffer object holding the combined contents. The
> same applies to slicing buffer objects:
> 
> >>> buffer('binary')[2:5]
> 'nar'
> 
> should preferably be buffer('nar').

Note that a buffer object doesn't hold data!  It's only a pointer to
data.  I can't explain the asymmetry offhand, though.
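
To make the "only a pointer" point concrete, here is a small sketch.
It does not use the buffer() built-in discussed above; it assumes a
later Python 3 interpreter and uses memoryview as a stand-in, so treat
it as an analogy rather than a description of buffer() itself. The
variable names are illustrative.

```python
# The object that actually owns the bytes.
backing = bytearray(b'binary')

# A view onto those bytes; creating it copies nothing.
view = memoryview(backing)

# Change the underlying data in place (same length, so the view stays valid).
backing[0] = ord('B')

# The view sees the change, because it only points at backing's memory.
seen_through_view = view.tobytes()

# Slicing the view gives another view, not a string.
piece = view[2:5]
piece_bytes = bytes(piece)
```

In this stand-in, slicing stays within the view type, which is closer
to what MAL asks for than the string result shown above.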

> --
> 
> Hmm, perhaps we need something like a data string object
> to get this 100% right ?!
> 
> >>> d = data("...data...")
> or
> >>> d = d"...data..."
> >>> print type(d)
> <type 'data'>
> 
> >>> 'string' + d
> d"string...data..."
> >>> u'string' + d
> d"s\000t\000r\000i\000n\000g\000...data..."
> 
> >>> d[:5]
> d"...da"
> 
> etc.
> 
> Ideally, string and Unicode objects would then be subclasses
> of this type in Py3K.

Not clear.  I'd rather do the equivalent of byte arrays in Java, for
which no "string literal" notations exist.
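
As a hedged illustration of a byte-array type without literal syntax,
the sketch below uses the bytearray type found in later Python
versions (an assumption; nothing of the kind exists as of this post).
It is one possible shape for such a type, not a claim about what the
final design must be.

```python
# There is no literal notation: a byte array is always built by a call.
d = bytearray(b'...data...')

# Slicing stays within the type instead of coercing to a string.
head = d[:5]

# Concatenation with the byte array on the left also stays within the type.
combined = bytearray(b'binary') + b'data'

# Unlike a string, the object can be changed in place.
d[0:3] = b'___'
```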

--Guido van Rossum (home page: http://www.python.org/~guido/)