harmful str(bytes)

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Fri Oct 8 19:50:10 CEST 2010

On Fri, 08 Oct 2010 15:31:27 +0200, Hallvard B Furuseth wrote:

> Arnaud Delobelle writes:
>>Hallvard B Furuseth <h.b.furuseth at usit.uio.no> writes:
>>> I've been playing a bit with Python3.2a2, and frankly its charset
>>> handling looks _less_ safe than in Python 2. (...)
>>> With 2.<late> conversion Unicode <-> string the equivalent operation
>>> did not silently produce garbage: it raised UnicodeError instead. 
>>> With old raw Python strings that was not a problem in applications
>>> which did not need to convert any charsets, with python3 they can
>>> break.
>>> I really wish bytes.__str__ would at least by default fail.
>> I think you misunderstand the purpose of str().  It is to provide a
>> (unicode) string representation of an object and has nothing to do with
>> converting it to unicode:
> That's not the point - the point is that for 2.* code which _uses_ str
> vs unicode, the equivalent 3.* code uses str vs bytes.  Yet not the same
> way - a 2.* 'str' will sometimes be 3.* bytes, sometimes str.  So
> upgraded old code will have to expect both str and bytes.

I'm sorry, this makes no sense to me. I've read it repeatedly, and I 
still don't understand what you're trying to say.

> In 2.*, str<->unicode conversion failed or produced the equivalent
> character/byte data.  Yes, there could be charset problems if the
> defaults were set up wrong, but that's a smaller problem than in 3.*. In
> 3.*, the bytes->str conversion always _silently_ produces garbage.

So you say, but I don't see it. Why is this garbage?

>>> b = b'abc\xff'
>>> str(b)
"b'abc\\xff'"

That's what I would expect from the str() function called with a bytes 
argument. Since decoding bytes requires a codec, which you haven't given, 
it can only return a string representation of the bytes.
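
A minimal sketch of that point, assuming Python 3 run without the -b or -bb option: with no encoding argument, str() does not attempt to decode at all and gives back the same text as repr(). The variable names are only for illustration.

```python
b = b'abc\xff'

# With no encoding given, str() cannot decode the bytes; it returns the
# printable representation of the object, the same text repr() produces.
text = str(b)
same_as_repr = (text == repr(b))

# The 4 bytes become a 10-character string: b'abc\xff' spelled out,
# including the b prefix, the quotes and the escape sequence.
length = len(text)
```

The result is a faithful description of the bytes object, not a decoding of its contents.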

If you want to decode bytes into a string, you need to specify a codec:

>>> str(b, 'latin-1')
'abcÿ'
>>> b.decode('latin-1')
'abcÿ'
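
As a rough illustration, again assuming Python 3, the two spellings above give the same result, and a codec that cannot handle the data raises an exception rather than guessing. The choice of 'utf-8' as the failing codec is my own example, not something from the earlier messages.

```python
b = b'abc\xff'

# Both spellings decode the bytes with an explicitly named codec.
via_str = str(b, 'latin-1')
via_decode = b.decode('latin-1')

# Byte 0xff is never valid in UTF-8, so strict decoding fails loudly
# with UnicodeDecodeError instead of silently producing something else.
try:
    b.decode('utf-8')
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
```

Once a codec is named, the conversion either yields the decoded characters or raises an error.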

