[Python-Dev] utf8 issue

M.-A. Lemburg mal@lemburg.com
Fri, 06 Sep 2002 09:55:13 +0200


Guido van Rossum wrote:
>>>Please do.  Bumping MAGIC is a no-no between dot releases.  But I
>>>don't understand why that is necessary?
>>
>>It would be necessary since marshal uses UTF-8 for storing
>>Unicode literals.
> 
> 
> Do you mean that in 2.2 it doesn't?

Marshal has used it since 1.6. The point is that the fix for the
lone surrogate problem changed the output of the UTF-8 codec.
PYCs from unpatched and patched versions wouldn't interoperate
if they contain lone surrogates in Unicode literals. We
usually bump the PYC magic in such a case to avoid these
issues. Since that's not possible in a patch level release,
we have two choices:

1. leave things as they are

2. apply the fix and live with the consequences of having
    to regenerate PYCs by hand
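
For readers unfamiliar with the magic check, here is a rough sketch of what "bumping the PYC magic" protects against. It assumes a current Python 3 interpreter (not 2.2), where the first four bytes of a .pyc are compared against importlib.util.MAGIC_NUMBER; the helper name pyc_magic_matches is made up for illustration.

```python
import importlib.util
import os
import py_compile
import tempfile


def pyc_magic_matches(pyc_path):
    """Return True if the .pyc header starts with this interpreter's magic."""
    with open(pyc_path, "rb") as f:
        header = f.read(4)
    return header == importlib.util.MAGIC_NUMBER


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.py")
    pyc = os.path.join(tmp, "demo.pyc")
    with open(src, "w") as f:
        f.write("x = 1\n")

    # A freshly compiled file carries the current magic.
    py_compile.compile(src, cfile=pyc, doraise=True)
    fresh_ok = pyc_magic_matches(pyc)

    # Simulate a file written under a different magic by
    # overwriting the first four header bytes.
    with open(pyc, "r+b") as f:
        f.write(b"\x00\x00\x00\x00")
    stale_ok = pyc_magic_matches(pyc)

print(fresh_ok, stale_ok)
```

When the magic differs, the interpreter ignores the stale file and recompiles from source, which is why a bump sidesteps the interop problem and why its absence means regenerating by hand.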

Just to give an example of the problem:

Python 2.2:
-------------
u'\ud800'.encode('utf-8') == '\xa0\x80'

 >>> unicode('\xa0\x80', 'utf-8')
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: unexpected code byte

 >>> unicode('\xed\xa0\x80', 'utf-8')
Traceback (most recent call last):
   File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: illegal encoding

Current CVS Python:
---------------------
u'\ud800'.encode('utf-8') == '\xed\xa0\x80'

 >>> unicode('\xed\xa0\x80', 'utf-8')
u'\ud800'
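
The same byte sequences can be reproduced on a current interpreter. This is only a sketch under the assumption of Python 3, where strict UTF-8 rejects lone surrogates and the 'surrogatepass' error handler has to be requested explicitly; encode_3byte is a hypothetical helper that spells out the generic three-byte UTF-8 layout.

```python
lone = "\ud800"

# Strict UTF-8 in Python 3 refuses to encode a lone surrogate.
try:
    lone.encode("utf-8")
    strict_rejected = False
except UnicodeEncodeError:
    strict_rejected = True

# 'surrogatepass' emits the three-byte form and reads it back.
encoded = lone.encode("utf-8", "surrogatepass")
roundtrip = encoded.decode("utf-8", "surrogatepass")


def encode_3byte(cp):
    """Encode a code point in U+0800..U+FFFF with the 3-byte UTF-8 layout."""
    return bytes([
        0xE0 | (cp >> 12),          # lead byte: 1110xxxx
        0x80 | ((cp >> 6) & 0x3F),  # continuation: 10xxxxxx
        0x80 | (cp & 0x3F),         # continuation: 10xxxxxx
    ])


by_hand = encode_3byte(0xD800)
print(strict_rejected, encoded, by_hand)
```

Note that the last two bytes of the three-byte form are exactly the '\xa0\x80' shown for Python 2.2 above, so the 2.2 output looks like the same sequence with the lead byte missing.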

>>Even though it's highly unlikely that the problem cases are used in
>>Python Unicode literals, there's a tiny chance. Without the MAGIC
>>change this could result in PYC files failing to load.
> 
> 
> Ha.  You may have missed the start of this thread, but the whole
> problem was that a PYC file *did* fail to load!  (The .py file had a
> lone surrogate in it.)  So I'm not sure this argument holds much
> water.

Interesting. I wouldn't have expected that.

> Can someone please explain what change would be necessary to what part
> of the code to prevent a lone surrogate in a string literal from
> creating a PYC file from blowing up?

One possibility would be to:

1. change the UTF-8 encoder in Python 2.2 to produce correct
    output

2. let the UTF-8 decoder in Python 2.2 accept both the correct
    output *and* the malformed output

I am not sure whether 2. would introduce a security problem.
Perhaps there is a way to restrict the work-around so that
we don't run into UTF-8 encoding attack problems.
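
To make option 2 concrete, here is a rough sketch of a restricted work-around, written for Python 3 rather than as a patch to the 2.2 codec. It assumes (this is not confirmed above beyond the single U+D800 example) that the malformed output is always the three-byte surrogate encoding with the 0xED lead byte dropped. The name decode_compat is hypothetical. Only a stray pair whose first byte is in 0xA0..0xBF is repaired, so nothing outside U+D800..U+DFFF can be produced by the fallback; everything else still raises.

```python
def decode_compat(data):
    """Decode UTF-8, also accepting surrogates whose lead byte is missing.

    Assumption: the malformed form is <0xA0..0xBF> <0x80..0xBF>, i.e. the
    two trailing bytes of ED A0..BF 80..BF without the ED lead byte.
    """
    out = []
    pos = 0
    while pos < len(data):
        try:
            out.append(data[pos:].decode("utf-8", "surrogatepass"))
            break
        except UnicodeDecodeError as exc:
            bad = pos + exc.start  # absolute offset of the offending byte
            # Keep everything that decoded cleanly before the error.
            out.append(data[pos:bad].decode("utf-8", "surrogatepass"))
            pair = data[bad:bad + 2]
            if (len(pair) == 2
                    and 0xA0 <= pair[0] <= 0xBF
                    and 0x80 <= pair[1] <= 0xBF):
                # Rebuild the code point as if the ED lead byte were present.
                cp = 0xD000 | ((pair[0] & 0x3F) << 6) | (pair[1] & 0x3F)
                out.append(chr(cp))
                pos = bad + 2
            else:
                raise  # anything else stays an error
    return "".join(out)


print(repr(decode_compat(b"\xed\xa0\x80")))    # correct form
print(repr(decode_compat(b"ab\xa0\x80cd")))    # malformed form
```

Whether even this narrow fallback is safe is exactly the open question: a decoder that accepts two byte sequences for one character gives an attacker a second spelling to slip past byte-level filters.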

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/