[Python-Dev] utf8 issue

M.-A. Lemburg mal@egenix.com
Thu, 05 Sep 2002 11:14:06 +0200

Guido van Rossum wrote:
>>Guido van Rossum <guido@python.org> writes:
>>>This might belong on SF, except it's already been solved in Python
>>>2.3, and I need guidance about what to do for Python 2.2.2.
>>>In 2.2.1, a lone surrogate encoded into utf8 gives a utf8 string that
>>>cannot be decoded back.  In 2.3, this is fixed.  Should this be fixed
>>>in 2.2.2 as well?
>>I think this was discussed really quite a long time ago, like six
>>months or so.
>>>I'm asking because it caused problems with reading .pyc files: if
>>>there's a Unicode literal containing a lone surrogate, reading the
>>>.pyc file causes an exception:
>>>UnicodeError: UTF-8 decoding error: unexpected code byte
>>>It looks like revision 2.128 fixed this for 2.3, but that patch
>>>doesn't cleanly apply to the 2.2 maintenance branch.  Can someone
>>I think the reason this didn't get fixed in 2.2.1 is that it
>>necessitates bumping MAGIC.
>>I can probably dig up more references if you want.
> Please do.  Bumping MAGIC is a no-no between dot releases.  But I
> don't understand why that is necessary?

It would be necessary since marshal uses UTF-8 for storing
Unicode literals. Even though it's highly unlikely that the
problem cases are used in Python Unicode literals, there's
a tiny chance. Without the MAGIC change this could result
in PYC files failing to load.
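
To make the failure mode concrete, here is a rough sketch in present-day
Python 3 (not the 2.2/2.3 code under discussion). It uses str.encode with
the "surrogatepass" error handler as a stand-in for an encoder that lets a
lone surrogate through, and a strict UTF-8 decode as a stand-in for a reader
that rejects it. The helper name roundtrips_strict is made up for this
illustration, and the sketch does not go through marshal itself.

```python
# Illustration only: modern Python 3, not the 2.2/2.3 marshal code.
LONE_SURROGATE = "\ud800"  # a high surrogate with no matching low surrogate


def roundtrips_strict(text):
    """Return True if text survives a UTF-8 round trip when the
    decoder is strict about surrogates (hypothetical helper)."""
    # "surrogatepass" lets the encoder emit bytes for a lone surrogate.
    raw = text.encode("utf-8", "surrogatepass")
    try:
        # The default (strict) decoder rejects those bytes.
        return raw.decode("utf-8") == text
    except UnicodeDecodeError:
        return False


raw = LONE_SURROGATE.encode("utf-8", "surrogatepass")
print(raw)                                # b'\xed\xa0\x80'
print(roundtrips_strict(LONE_SURROGATE))  # False
print(roundtrips_strict("abc"))           # True
```

Data written by one side and read by the other fails in just this way,
which is why the writer and the reader have to agree on how surrogates
are handled.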

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/