harmful str(bytes)

Mon Oct 11 17:45:36 EDT 2010

Terry Reedy writes:
>On 10/8/2010 9:45 AM, Hallvard B Furuseth wrote:
>>> Actually, the implicit contract of __str__ is that it never fails, so
>>> that everything can be printed out (for debugging purposes, etc.).
>>
>> Nope:
>>
>> $ python2 -c 'str(u"\u1000")'
>> Traceback (most recent call last):
>>    File "<string>", line 1, in ?
>> UnicodeEncodeError: 'ascii' codec can't encode character u'\u1000' in position 0: ordinal not in range(128)
>
> This could be considered a design bug due to 'str' being used both to
> produce readable string representations of objects (perhaps one that
> could be eval'ed) and to convert unicode objects to equivalent string
> objects, which is not the same operation!

Indeed, the eager str() and the lack of a narrower str function are
one root of the problem.  I'd put the second use more generally:
converting an object which represents a string to an actual str.  *And*
__str__ may be intended for Python-independent representations like
23 -> "23".

I expect that's why quite a bit of code calls str() just in case, which
is another root of the problem.  E.g.  urlencode(), as I said.  The code
might not need to, but str('string') is a noop so it doesn't hurt.
Maybe that's why %s does too, instead of demanding that the user calls
str() if needed.
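
A small sketch of that implicit call, assuming a Python 3 interpreter
started without the -b option.  The variable names are only for
illustration:

```python
# Python 3: "%s" and str.format() call str() on their arguments implicitly.
number_text = "%s" % 23              # Python-independent text: 23 -> "23"
string_text = "%s" % "abc"           # a str passes through unchanged
bytes_text = "%s" % b"caf\xc3\xa9"   # bytes silently become their repr()
format_text = "{}".format(b"abc")    # same thing via str.format()

print(number_text)
print(string_text)
print(bytes_text)
print(format_text)
```

The last two results are Python syntax for a bytes literal, not the
text the bytes encode.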

> The above really should have produced '\u1000'! (the equivalent of what
> str(bytes) does today). The 'conversion to equivalent str object' option
> should have required an explicit encoding arg rather than defaulting to
> the ascii codec. This mistake has been corrected in 3.x, so Yep.

If there were a __plain_str__() method which was supposed to fail rather
than start to babble Python syntax, and if there were not plenty of
Python code around which invoked __str__, I'd agree.
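
To make the idea concrete, here is a rough sketch of such a strict
conversion written as a plain function.  plain_str() is a hypothetical
name, not an existing Python API, and the set of accepted types is my
own assumption:

```python
def plain_str(obj):
    """Hypothetical strict conversion: return text or fail, never a repr()."""
    if isinstance(obj, str):
        return obj                      # already text
    if isinstance(obj, (bytes, bytearray)):
        # Octet strings need an explicit decode; refuse to guess.
        raise TypeError("cannot convert %s to str without an encoding"
                        % type(obj).__name__)
    if isinstance(obj, (int, float)):
        return str(obj)                 # Python-independent text, 23 -> "23"
    # Assumption: everything else has no plain text form.
    raise TypeError("no plain string form for %s" % type(obj).__name__)


print(plain_str("abc"))
print(plain_str(23))
try:
    plain_str(b"\xa0")
except TypeError as exc:
    print("refused:", exc)
```

Code that calls str() "just in case" would call something like this
instead and get an exception rather than Python syntax in its output.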

As it is, this "correction" instead is causing code which previously
produced the expected non-Python-related string output, to instead
produce Pythonesque repr() stuff.  See below.

>> And the equivalent:
>>
>> $ python2 -c 'unicode("\xA0")'
>> Traceback (most recent call last):
>>    File "<string>", line 1, in ?
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)
>
> This is an application bug: either bad string or missing decoding arg.

Exactly.  And Python 2 caught the bug.  (Since I had ASCII default
decoding, I'd forgotten Python could pick another default.)

For an app which handles Unicode vs. raw bytes, the equivalent Python 3
code is str(b"\xA0").  That's the *same* application bug, in equivalent
application code, and Python 3 does not catch it.  This time the bug is
spelled str() instead, which is much more likely than old unicode() to
happen somewhere thanks to the str()-related misdesign discussed above.
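
A sketch of the asymmetry, assuming Python 3 started without the -b
option (raw and as_text are just illustrative names):

```python
raw = b"\xa0"

# str(bytes) with no encoding does not raise; it returns the repr() text.
as_text = str(raw)
print(as_text)

# With an explicit encoding, str() decodes, and bad input is caught.
try:
    str(raw, "ascii")
except UnicodeDecodeError as exc:
    print("decode refused:", exc)

# The other direction, bytes(str) with no encoding, is refused outright.
try:
    bytes("\xa0")
except TypeError as exc:
    print("bytes(str) refused:", exc)
```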

Article <hbf.20101008cg74 at bombur.uio.no> in this thread has an example.

And that's the third root of the problem above.  Technically it's the
same problem as an application bug which does str(None) where it should
be using a string, and so produces garbage text.  The difference is that
Python forces programs to deal with these two different character/octet
string types, sometimes swapping back and forth between them.  And it's
not necessarily obvious from the code which type is in use where.
Python 3 has not changed that; it has strengthened it by removing the
default conversion.

Yet while the programmer now needs to be _more_ careful about this than
before, Python 3 has removed the exception which caught this particular
bug instead of doing something to make it easier to find such bugs.

That's why I suggested making bytes.__str__ fail by default, annoying
as it would be.  But I don't know how annoying it'd be.  Maybe there
could be an option to disable it.
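
For comparison, CPython 3 does have a switch in the opposite direction:
the -b command-line option issues a BytesWarning for str(bytes), and
-bb turns it into an error.  It is off by default, so it is opt-in
rather than the fail-by-default behaviour suggested above.  The sketch
below assumes CPython 3.7 or later and that a child interpreter may be
spawned; run_str_bytes() is a made-up helper name:

```python
import subprocess
import sys


def run_str_bytes(*flags):
    """Run str(b'\\xa0') in a child interpreter started with the given flags."""
    cmd = [sys.executable, *flags, "-c", "print(str(b'\\xa0'))"]
    return subprocess.run(cmd, capture_output=True, text=True)


default = run_str_bytes()       # no flag: str(bytes) quietly returns a repr
strict = run_str_bytes("-bb")   # -bb: str(bytes) raises BytesWarning

print("default:", default.returncode, default.stdout.strip())
print("strict: ", strict.returncode)
print(strict.stderr.strip())
```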

>> In Python 2, these two Unicode errors made our data safe from code
>> which used str and unicode objects without checking too carefully which
>> was which.  Code which did not sort the types out carefully enough
>> would fail.
>>
>> In Python 3, that safety only exists for bytes(str), not str(bytes).
>
> If you prefer the buggy 2.x design (and there are *many* tracker bug
> reports that were fixed by the 3.x change), stick with it.

Bugs even with ASCII default encoding?  Looking closer at setencoding()
in site.py, it doesn't seem to do anything: it's "if 0"ed out.

As I think I've made clear, I certainly don't feel like entrusting
Python 3 with my raw string data just yet.

-- 
Hallvard