[Python-Dev] utf8 issue
M.-A. Lemburg
mal@lemburg.com
Fri, 06 Sep 2002 09:55:13 +0200
Guido van Rossum wrote:
>>>Please do. Bumping MAGIC is a no-no between dot releases. But I
>>>don't understand why that is necessary?
>>
>>It would be necessary since marshal uses UTF-8 for storing
>>Unicode literals.
>
>
> Do you mean that in 2.2 it doesn't?
Marshal has used it since 1.6. The point is that the fix for
the lone surrogate problem changed the output of the UTF-8
codec. PYCs from unpatched and patched versions wouldn't
interoperate if they use lone surrogates in Unicode literals.
We usually bump the PYC magic in such a case to avoid these
issues. Since that's not possible for a patch level release,
we have two choices:
1. leave things as they are
2. apply the fix and live with the consequences of having
to regenerate PYCs by hand
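For readers unfamiliar with the magic check mentioned above: a PYC file starts with a version-specific magic number, and a file whose magic does not match the running interpreter is not loaded. The following is only a minimal sketch, assuming a modern Python 3 interpreter, where the value is exposed as importlib.util.MAGIC_NUMBER (the 2.2 sources discussed here predate that API). The helper name pyc_magic_matches is hypothetical.

```python
import importlib.util


def pyc_magic_matches(path):
    """Return True if the file at `path` starts with this interpreter's PYC magic.

    Hypothetical helper for illustration; it inspects only the leading
    magic bytes and ignores the rest of the PYC header.
    """
    magic = importlib.util.MAGIC_NUMBER
    with open(path, "rb") as f:
        return f.read(len(magic)) == magic
```

Because the magic cannot change in a patch level release, a check like this cannot tell PYCs written by the unpatched codec from those written by the patched one, which is why choice 2 means regenerating them by hand.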
Just to give an example of the problem:
Python 2.2:
-------------
u'\ud800'.encode('utf-8') == '\xa0\x80'
>>> unicode('\xa0\x80', 'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: unexpected code byte
>>> unicode('\xed\xa0\x80', 'utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: illegal encoding
Current CVS Python:
---------------------
u'\ud800'.encode('utf-8') == '\xed\xa0\x80'
>>> unicode('\xed\xa0\x80', 'utf-8')
u'\ud800'
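For comparison, here is a sketch of how a current Python 3 interpreter treats the same code point. This assumes the reader runs Python 3; it is not the 2.2 or CVS behaviour shown above. The strict UTF-8 codec refuses lone surrogates, the surrogatepass error handler produces the three-byte form, the truncated two-byte form is rejected either way, and marshal round-trips the string.

```python
import marshal

lone = "\ud800"  # a lone high surrogate

# The strict UTF-8 codec refuses to encode a lone surrogate.
try:
    lone.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# The surrogatepass handler emits the three-byte form.
encoded = lone.encode("utf-8", "surrogatepass")

# The two-byte form (lead byte missing) is not decodable,
# even with surrogatepass.
try:
    b"\xa0\x80".decode("utf-8", "surrogatepass")
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False

# marshal can store and reload a string containing a lone surrogate.
roundtrip = marshal.loads(marshal.dumps(lone))
```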
>>Even though it's highly unlikely that the problem cases are used in
>>Python Unicode literals, there's a tiny chance. Without the MAGIC
>>change this could result in PYC files failing to load.
>
>
> Ha. You may have missed the start of this thread, but the whole
> problem was that a PYC file *did* fail to load! (The .py file had a
> lone surrogate in it.) So I'm not sure this argument holds much
> water.
Interesting. I wouldn't have expected that.
> Can someone please explain what change would be necessary to what part
> of the code to prevent a lone surrogate in a string literal from
> creating a PYC file from blowing up?
One possibility would be to:
1. change the UTF-8 encoder in Python 2.2 to produce correct
output
2. let the UTF-8 decoder in Python 2.2 accept the correct
output *and* the malformed output
I am not sure whether 2. would introduce a security problem.
Perhaps there is a way to restrict the work-around so that
we don't run into UTF-8 encoding attack problems.
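One way such a restricted work-around might look, as a sketch only. It is written for Python 3 bytes and str rather than the 2.2 C codec, and it assumes the malformed output always has the shape shown in the example above: the three-byte surrogate encoding with its 0xED lead byte missing. The function name decode_utf8_tolerant is hypothetical, and this is not the fix that was applied to Python.

```python
def decode_utf8_tolerant(data):
    """Decode UTF-8 bytes, also accepting the truncated surrogate form.

    Assumption: the malformed form is exactly two bytes, 0xA0..0xBF
    followed by 0x80..0xBF, i.e. a surrogate's encoding without the
    0xED lead byte. Anything else that is invalid still raises.
    """
    out = []
    pos = 0
    while pos < len(data):
        try:
            out.append(data[pos:].decode("utf-8", "surrogatepass"))
            break
        except UnicodeDecodeError as exc:
            bad = pos + exc.start  # exc.start is relative to the slice
            # Everything before the offending byte decoded cleanly.
            out.append(data[pos:bad].decode("utf-8", "surrogatepass"))
            pair = data[bad:bad + 2]
            if (len(pair) == 2
                    and 0xA0 <= pair[0] <= 0xBF
                    and 0x80 <= pair[1] <= 0xBF):
                # Restore the missing lead byte; the result can only
                # be a code point in U+D800..U+DFFF.
                out.append((b"\xed" + pair).decode("utf-8", "surrogatepass"))
                pos = bad + 2
            else:
                raise
    return "".join(out)
```

Since a repaired pair can only decode to a surrogate code point, this particular path cannot be used to smuggle in ASCII characters the way overlong encodings can. Whether that is enough to rule out other attacks is the open question raised above.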
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH