[Python-Dev] bytes type discussion

Wed Feb 15 01:56:00 CET 2006

On Feb 14, 2006, at 4:17 PM, Guido van Rossum wrote:

> On 2/14/06, Bob Ippolito <bob at redivi.com> wrote:
>> On Feb 14, 2006, at 3:13 PM, Guido van Rossum wrote:
>>> - we need a new PEP; PEP 332 won't cut it
>>>
>>> - no b"..." literal
>>>
>>> - bytes objects are mutable
>>>
>>> - bytes objects are composed of ints in range(256)
>>>
>>> - you can pass any iterable of ints to the bytes constructor, as  
>>> long
>>> as they are in range(256)
>>
>> Sounds like array.array('B').
>
> Sure.
>
>> Will the bytes object support the buffer interface?
>
> Do you want them to?
>
> I suppose they should *not* support the *text* part of that API.

I would imagine that it'd be convenient for integrating with existing  
extensions... e.g. initializing an array or Numeric array with one.

>> Will it accept
>> objects supporting the buffer interface in the constructor (or a
>> class method)?  If so, will it be a copy or a view?  Current
>> array.array behavior says copy.
>
> bytes() should always copy -- thanks for asking.

I only really ask because it's worth fully specifying these things.   
Copy seems a lot more sensible given the rest of the interpreter and  
stdlib (e.g. buffer(x) seems to always return a read-only buffer).

>>> - longs or anything with an __index__ method should do, too
>>>
>>> - when you index a bytes object, you get a plain int
>>
>> When slicing a bytes object, do you get another bytes object or a
>> list? If its a bytes object, is it a copy or a view?  Current
>> array.array behavior says copy.
>
> Another bytes object which is a copy.
>
> (Why would you even think about views here? They are evil.)

I mention views because that's what numpy/Numeric/numarray/etc.  
do...  It's certainly convenient at times to have that functionality,  
for example, to work with only the alpha channel in an RGBA image.   
Probably too magical for the bytes type.

 >>> import numpy
 >>> image = numpy.array(list('RGBARGBARGBA'))
 >>> alpha = image[3::4]
 >>> alpha
array([A, A, A], dtype=(string,1))
 >>> alpha[:] = 'X'
 >>> image
array([R, G, B, X, R, G, B, X, R, G, B, X], dtype=(string,1))

>>> Very controversial:
>>>
>>> - bytes("abc", "encoding") == bytes("abc") # ignores the "encoding"
>>> argument
>>>
>>> - bytes(u"abc") == bytes("abc") # for ASCII at least
>>>
>>> - bytes(u"\x80\xff") raises UnicodeError
>>>
>>> - bytes(u"\x80\xff", "latin-1") == bytes("\x80\xff")
>>>
>>> Martin von Loewis's alternative for the "very controversial" set  
>>> is to
>>> disallow an encoding argument and (I believe) also to disallow  
>>> Unicode
>>> arguments. In 3.0 this would leave us with s.encode(<encoding>)  
>>> as the
>>> only way to convert a string (which is always unicode) to bytes. The
>>> problem with this is that there's no code that works in both 2.x and
>>> 3.0.
>>
>> Given a base64 or hex string, how do you get a bytes object out of
>> it?  Currently str.decode('base64') and str.decode('hex') are good
>> solutions to this... but you get a str object back.
>
> I don't know -- you can propose an API you like here. base64 is as
> likely to encode text as binary data, so I don't think it's wrong for
> those things to return strings.

That's kinda true I guess -- but you'd still need an encoding in py3k  
to turn base64 -> text.  A lot of the current codecs infrastructure  
doesn't make sense in py3k -- for example, the 'zlib' encoding, which  
is really a bytes transform, or 'unicode_escape' which is a text  
transform.

I suppose there aren't too many different ways you'd want to encode  
or decode data to binary (beyond the text codecs), they should  
probably just live in a module -- something like the binascii we have  
now.  I do find the codecs infrastructure to be convenient at times  
(maybe too convenient), but since you're not interested in adding  
functions to existing types then a module seems like the best approach.

-bob