harmful str(bytes)

Fri Oct 8 17:40:45 EDT 2010

On 10/8/2010 9:45 AM, Hallvard B Furuseth wrote:

>> Actually, the implicit contract of __str__ is that it never fails, so
>> that everything can be printed out (for debugging purposes, etc.).
>
> Nope:
>
> $ python2 -c 'str(u"\u1000")'
> Traceback (most recent call last):
>    File "<string>", line 1, in ?
> UnicodeEncodeError: 'ascii' codec can't encode character u'\u1000' in position 0: ordinal not in range(128)

This could be considered a design bug due to 'str' being used both to
produce readable string representations of objects (perhaps one that
could be eval'ed) and to convert unicode objects to equivalent string
objects, which is not the same operation!

The above really should have produced '\u1000' (the equivalent of what
str(bytes) does today)! The 'conversion to equivalent str object' option
should have required an explicit encoding arg rather than defaulting to
the ascii codec. This mistake has been corrected in 3.x, so Yep.
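
For illustration, a minimal sketch of the two operations as 3.x keeps them
apart. The sample byte and the latin-1 encoding are assumptions chosen for
the example, not anything from the case above:

```python
data = b"\xa0"  # sample byte, assumed for illustration

# No encoding arg: str() returns a printable representation and never fails.
as_repr = str(data)

# Explicit encoding arg: str() decodes to the equivalent text object.
# latin-1 is an assumed encoding; use whatever the data really is.
as_text = str(data, "latin-1")

print(as_repr)             # b'\xa0'
print(as_text == "\xa0")   # True
```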

> And the equivalent:
>
> $ python2 -c 'unicode("\xA0")'
> Traceback (most recent call last):
>    File "<string>", line 1, in ?
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

This is an application bug: either a bad string or a missing encoding arg.
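
A sketch of the fix on the application side, written for 3.x, assuming the
byte was meant to be latin-1 (the quoted example does not say which encoding
was intended):

```python
raw = b"\xa0"

# Decoding with the wrong codec fails, just as the implicit ascii default did.
try:
    raw.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Naming the real encoding (latin-1 is assumed here) gives the intended text.
text = raw.decode("latin-1")

print(type(error).__name__)  # UnicodeDecodeError
print(text == "\u00a0")      # True
```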

> In Python 2, these two Unicode errors made our data safe from code
> which used str and unicode objects without checking too carefully which
> was which.  Code which did not sort the types out carefully enough would fail.
>
> In Python 3, that safety only exists for bytes(str), not str(bytes).

If you prefer the buggy 2.x design (and there are *many* tracker bug 
reports that were fixed by the 3.x change), stick with it.
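
For reference, a short sketch of the 3.x asymmetry described in the quoted
text. The sample character and the utf-8 encoding are assumptions for the
example:

```python
text = "\u1000"  # sample character, assumed for illustration

# bytes(str) without an encoding is refused outright.
try:
    bytes(text)
    refused = False
except TypeError:
    refused = True

# With an explicit encoding (utf-8 assumed here) the conversion succeeds.
encoded = bytes(text, "utf-8")

# str(bytes) without an encoding succeeds, giving the printable form.
shown = str(encoded)

print(refused)  # True
print(shown)    # b'\xe1\x80\x80'
```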

-- 
Terry Jan Reedy