2to3 chokes on bad character

Peter Otten __peter__ at web.de
Thu Feb 24 08:00:34 EST 2011


John Machin wrote:

> On Feb 23, 7:47 pm, "Frank Millman" <fr... at chagford.com> wrote:
>> Hi all
>>
>> I don't know if this counts as a bug in 2to3.py, but when I ran it on my
>> program directory it crashed, with a traceback but without any indication
>> of which file caused the problem.
>>
> [traceback snipped]
> 
>> UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 5055:
>> invalid start byte
>>
>> On investigation, I found some funny characters in docstrings that I
>> copy/pasted from a pdf file.
>>
>> Here are the details if they are of any use. Oddly, I found two instances
>> where characters 'look like' apostrophes when viewed in my text editor,
>> but one of them was accepted by 2to3 and the other caused the crash.
>>
>> The one that was accepted consists of three bytes - 226, 128, 153 (as
>> reported by python 2.6)
> 
> How did you incite it to report like that? Just use repr(the_3_bytes).
> It'll show up as '\xe2\x80\x99'.
> 
>  >>> from unicodedata import name as ucname
>  >>> ''.join(chr(i) for i in (226, 128, 153)).decode('utf8')
>  u'\u2019'
>  >>> ucname(_)
>  'RIGHT SINGLE QUOTATION MARK'
> 
> What you have there is the UTF-8 representation of U+2019 RIGHT SINGLE
> QUOTATION MARK. That's OK.
> 
>> or 226, 8364, 8482 (as reported by python3.2).
> 
> Sorry, but you have instructed Python 3.2 to commit a nonsense:
> 
>  >>> [ord(chr(i).decode('cp1252')) for i in (226, 128, 153)]
>  [226, 8364, 8482]
> 
> In other words, you have taken that 3-byte sequence, decoded each byte
> separately using cp1252 (aka "the usual suspect") into a meaningless
> Unicode character and printed its ordinal.
> 
> In Python 3, don't use repr(); it has undergone the MHTP
> transformation and become ascii().
> 
>>
>> The one that crashed consists of a single byte - 146 (python 2.6) or 8217
>> (python 3.2).
> 
>  >>> chr(146).decode('cp1252')
>  u'\u2019'
>  >>> hex(8217)
>  '0x2019'
> 
> 
>> The issue is not that 2to3 should handle this correctly, but that it
>> should give a more informative error message to the unsuspecting user.
> 
> Your Python 2.x code should be TESTED before you poke 2to3 at it. In
> this case just trying to run or import the offending code file would
> have given an informative syntax error (you have declared the .py file
> to be encoded in UTF-8 but it's not).
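
For readers on Python 3, the byte-level facts in the quoted sessions can be
re-checked with a short sketch. The variable names are mine, not from the
thread; the byte values are the ones Frank reported.

```python
import unicodedata

# The two "apostrophes" from the thread, as raw bytes.
good = bytes([226, 128, 153])  # the sequence 2to3 accepted
bad = bytes([146])             # the single byte that made 2to3 crash

# The three-byte sequence is valid UTF-8 for U+2019.
good_char = good.decode("utf-8")
good_name = unicodedata.name(good_char)

# The single byte is U+2019 only when read as cp1252.
bad_char = bad.decode("cp1252")

# Reading the UTF-8 bytes as cp1252 yields three unrelated characters,
# which is where the numbers 226, 8364, 8482 come from.
mojibake = [ord(c) for c in good.decode("cp1252")]

# As UTF-8, the lone byte 0x92 is simply invalid.
try:
    bad.decode("utf-8")
    bad_is_utf8 = True
except UnicodeDecodeError:
    bad_is_utf8 = False

print(good_name, mojibake, bad_is_utf8)
```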

The problem is that Python 2.x accepts arbitrary bytes in string constants. 
No error message or warning:
 
$ python
Python 2.6.4 (r264:75706, Dec  7 2009, 18:43:55)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> with open("tmp.py", "w") as f: # prepare the broken script
...     f.write("# -*- coding: utf-8 -*-\nprint 'bogus char: \x92'\n")
...
>>>
$ cat tmp.py
# -*- coding: utf-8 -*-
print 'bogus char: �'
$ python2.6 tmp.py
bogus char: �
$ 2to3-3.2 tmp.py
[traceback snipped]
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 43: 
invalid start byte
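
Since the traceback does not name the file, one workaround is to scan the
source tree yourself before running 2to3. The following is only a sketch for
Python 3, assuming every .py file is meant to be UTF-8; find_undecodable() is
a hypothetical helper, not part of 2to3 or lib2to3.

```python
import os
import tempfile


def find_undecodable(root, encoding="utf-8"):
    """Yield (path, offset, byte) for each .py file under root that
    cannot be decoded with the given encoding (assumed: UTF-8)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                data = f.read()
            try:
                data.decode(encoding)
            except UnicodeDecodeError as exc:
                # exc.start is the offset of the first offending byte
                yield path, exc.start, data[exc.start]


# Demo: recreate the broken tmp.py from above in a scratch directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "tmp.py"), "wb") as f:
        f.write(b"# -*- coding: utf-8 -*-\nprint 'bogus char: \x92'\n")
    hits = list(find_undecodable(tmp))

for path, offset, byte in hits:
    print("%s: byte 0x%02x at position %d" % (path, byte, offset))
```

The sketch reports only the first bad byte per file and ignores any coding
declaration other than the assumed one.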

In theory 2to3 could be changed to take the same approach as os.listdir() 
in Python 3, which passes undecodable bytes through as surrogate escapes, 
but, as in the OP's example, occurrences of the problem are likely to be 
editing accidents.
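
For illustration, this is roughly what that approach looks like in Python 3.
os.listdir() decodes undecodable file names with the "surrogateescape" error
handler (PEP 383); applying the same handler to source text is purely
hypothetical here and not something 2to3 does.

```python
raw = b"print 'bogus char: \x92'\n"

# Strict decoding fails, which is what 2to3 runs into.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte 0xNN to the lone
# surrogate U+DCNN instead of raising.
text = raw.decode("utf-8", errors="surrogateescape")
escaped = text[19]  # the character standing in for byte 0x92

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode("utf-8", errors="surrogateescape")

print(strict_ok, ascii(escaped), roundtrip == raw)
```

The round trip is lossless, but the bad byte would survive silently into the
converted file, which is a reason to prefer a clear error message.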


