2to3 chokes on bad character

Thu Feb 24 06:45:00 EST 2011

On Feb 23, 7:47 pm, "Frank Millman" <fr... at chagford.com> wrote:
> Hi all
>
> I don't know if this counts as a bug in 2to3.py, but when I ran it on my
> program directory it crashed, with a traceback but without any indication of
> which file caused the problem.
>
[traceback snipped]

> UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 5055:
> invalid start byte
>
> On investigation, I found some funny characters in docstrings that I
> copy/pasted from a pdf file.
>
> Here are the details if they are of any use. Oddly, I found two instances
> where characters 'look like' apostrophes when viewed in my text editor, but
> one of them was accepted by 2to3 and the other caused the crash.
>
> The one that was accepted consists of three bytes - 226, 128, 153 (as
> reported by python 2.6)

How did you incite it to report like that? Just use repr(the_3_bytes).
It'll show up as '\xe2\x80\x99'.

 >>> from unicodedata import name as ucname
 >>> ''.join(chr(i) for i in (226, 128, 153)).decode('utf8')
 u'\u2019'
 >>> ucname(_)
 'RIGHT SINGLE QUOTATION MARK'

What you have there is the UTF-8 representation of U+2019 RIGHT SINGLE
QUOTATION MARK. That's OK.

 or 226, 8364, 8482 (as reported by python3.2).

Sorry, but you have instructed Python 3.2 to commit a nonsense:

 >>> [ord(chr(i).decode('cp1252')) for i in (226, 128, 153)]
 [226, 8364, 8482]

In other words, you have taken that 3-byte sequence, decoded each byte
separately using cp1252 (aka "the usual suspect") into a meaningless
Unicode character and printed its ordinal.

In Python 3, don't use repr(); it has undergone the MHTP
transformation and become ascii().

>
> The one that crashed consists of a single byte - 146 (python 2.6) or 8217
> (python 3.2).

 >>> chr(146).decode('cp1252')
 u'\u2019'
 >>> hex(8217)
 '0x2019'

> The issue is not that 2to3 should handle this correctly, but that it should
> give a more informative error message to the unsuspecting user.

Your Python 2.x code should be TESTED before you poke 2to3 at it. In
this case just trying to run or import the offending code file would
have given an informative syntax error (you have declared the .py file
to be encoded in UTF-8 but it's not).

> BTW I have always waited for 'final releases' before upgrading in the past,
> but this makes me realise the importance of checking out the beta versions -
> I will do so in future.

I'm willing to bet that the same would happen with Python 3.1, if a
3.1 to 3.2 upgrade is what you are talking about