Re: [Python-Dev] utf8 issue

Guido van Rossum <guido@python.org> writes:
> This might belong on SF, except it's already been solved in Python 2.3, and I need guidance about what to do for Python 2.2.2.
> In 2.2.1, a lone surrogate encoded into utf8 gives a utf8 string that cannot be decoded back. In 2.3, this is fixed. Should this be fixed in 2.2.2 as well?
I think this was discussed really quite a long time ago, like six months or so.
> I'm asking because it caused problems with reading .pyc files: if there's a Unicode literal containing a lone surrogate, reading the .pyc file causes an exception:
> UnicodeError: UTF-8 decoding error: unexpected code byte
> It looks like revision 2.128 fixed this for 2.3, but that patch doesn't cleanly apply to the 2.2 maintenance branch. Can someone help?
I think the reason this didn't get fixed in 2.2.1 is that it necessitates bumping MAGIC. I can probably dig up more references if you want.
Cheers,
M.
--
34. The string is a stark data structure and everywhere it is passed there is much duplication of process. It is a perfect vehicle for hiding information.
   -- Alan Perlis, http://www.cs.yale.edu/homes/perlis-alan/quotes.html
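
The round-trip failure described above can be illustrated with a small sketch. It assumes a modern Python 3 interpreter rather than the 2.2/2.3 releases under discussion, so it only shows the general shape of the problem: the strict UTF-8 codec rejects a lone surrogate, while the `surrogatepass` error handler emits the three-byte form and decodes it back.

```python
# Python 3 sketch; the thread itself concerns Python 2.2 and 2.3.
lone = "\ud800"  # a lone high surrogate, like the problem literal

# The strict UTF-8 codec refuses to encode a lone surrogate at all.
try:
    lone.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The 'surrogatepass' handler emits the three-byte form and can
# decode that form back to the original code point.
encoded = lone.encode("utf-8", "surrogatepass")
decoded = encoded.decode("utf-8", "surrogatepass")

print(strict_ok, encoded, decoded == lone)
```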

> I think the reason this didn't get fixed in 2.2.1 is that it necessitates bumping MAGIC.
> I can probably dig up more references if you want.
Please do. Bumping MAGIC is a no-no between dot releases. But I don't understand why that is necessary?
--Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
>> I think the reason this didn't get fixed in 2.2.1 is that it necessitates bumping MAGIC.
>> I can probably dig up more references if you want.
> Please do. Bumping MAGIC is a no-no between dot releases. But I don't understand why that is necessary?
It would be necessary since marshal uses UTF-8 for storing Unicode literals. Even though it's highly unlikely that the problem cases are used in Python Unicode literals, there's a tiny chance. Without the MAGIC change this could result in PYC files failing to load.
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/
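
As a rough illustration of the two pieces mentioned above (marshal storing Unicode literals as UTF-8, and the magic number that guards PYC files), here is a sketch that assumes a modern Python 3 interpreter. It does not reproduce the 2.2 behaviour; it only shows that the marshalled form of a string embeds its UTF-8 bytes, and that the interpreter exposes a four-byte magic value.

```python
import importlib.util
import marshal

# Four-byte tag written at the start of every .pyc file; a mismatch
# makes the interpreter ignore the file and recompile the source.
magic = importlib.util.MAGIC_NUMBER

# A string holding a lone surrogate, as in the problem literal.
literal = "\ud800"

# marshal serialises the string; the payload carries its UTF-8 bytes.
payload = marshal.dumps(literal)
restored = marshal.loads(payload)

print(len(magic), b"\xed\xa0\x80" in payload, restored == literal)
```

If the bytes an encoder writes for the same literal change between two releases that share a magic number, a PYC written by one may not load in the other, which is the interoperability concern raised here.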

>> Please do. Bumping MAGIC is a no-no between dot releases. But I don't understand why that is necessary?
> It would be necessary since marshal uses UTF-8 for storing Unicode literals.
Do you mean that in 2.2 it doesn't?
> Even though it's highly unlikely that the problem cases are used in Python Unicode literals, there's a tiny chance. Without the MAGIC change this could result in PYC files failing to load.
Ha. You may have missed the start of this thread, but the whole problem was that a PYC file *did* fail to load! (The .py file had a lone surrogate in it.) So I'm not sure this argument holds much water.
Can someone please explain what change would be necessary to what part of the code to keep a lone surrogate in a string literal from producing a PYC file that blows up on load?
--Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
>>> Please do. Bumping MAGIC is a no-no between dot releases. But I don't understand why that is necessary?
>> It would be necessary since marshal uses UTF-8 for storing Unicode literals.
> Do you mean that in 2.2 it doesn't?
Marshal has used it since 1.6. The point is that the fix to the lone surrogate problem resulted in a change of the UTF-8 codec output. PYCs from unpatched and patched versions wouldn't interoperate if they use lone surrogates in Unicode literals. We usually bump the PYC magic in such a case, to avoid these issues. Since that's not possible for a patch level release, we have two choices:
1. leave things as they are
2. apply the fix and live with the consequences of having to regenerate PYCs by hand
Just to give an example of the problem:
Python 2.2:
-------------
    u'\ud800'.encode('utf-8') == '\xa0\x80'
    unicode('\xa0\x80', 'utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeError: UTF-8 decoding error: unexpected code byte
    unicode('\xed\xa0\x80', 'utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
    UnicodeError: UTF-8 decoding error: illegal encoding
Current CVS Python:
---------------------
    u'\ud800'.encode('utf-8') == '\xed\xa0\x80'
    unicode('\xed\xa0\x80', 'utf-8')
    u'\ud800'
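
To see where the bytes in this example come from, the standard three-byte UTF-8 bit pattern can be worked out by hand. The sketch below is written for Python 3 and uses a hypothetical helper, `utf8_three_byte`, that is not part of any release. It suggests that the 2.2 output equals the correct encoding with its lead byte missing; the thread does not state the cause, so treat that reading as an assumption.

```python
def utf8_three_byte(code_point):
    """Encode a code point in U+0800..U+FFFF with the 3-byte UTF-8 pattern."""
    if not 0x800 <= code_point <= 0xFFFF:
        raise ValueError("the 3-byte form covers U+0800..U+FFFF only")
    lead = 0xE0 | (code_point >> 12)           # 1110xxxx: top 4 bits
    mid = 0x80 | ((code_point >> 6) & 0x3F)    # 10xxxxxx: middle 6 bits
    last = 0x80 | (code_point & 0x3F)          # 10xxxxxx: low 6 bits
    return bytes([lead, mid, last])


full = utf8_three_byte(0xD800)
print(full)      # the three-byte form reported for current CVS
print(full[1:])  # the same bytes without the lead byte
```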
>> Even though it's highly unlikely that the problem cases are used in Python Unicode literals, there's a tiny chance. Without the MAGIC change this could result in PYC files failing to load.
> Ha. You may have missed the start of this thread, but the whole problem was that a PYC file *did* fail to load! (The .py file had a lone surrogate in it.) So I'm not sure this argument holds much water.
Interesting. I wouldn't have expected that.
> Can someone please explain what change would be necessary to what part of the code to keep a lone surrogate in a string literal from producing a PYC file that blows up on load?
One possibility would be to:
1. change the UTF-8 encoder in Python 2.2 to produce correct output
2. let the UTF-8 decoder in Python 2.2 accept the correct output *and* the malformed output
I am not sure whether 2. would introduce a security problem. Perhaps there is a way to restrict the work-around so that we don't run into UTF-8 encoding attack problems.
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/
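
The second step of that proposal, a decoder that accepts both the correct and the malformed output, might look roughly like the sketch below. It is Python 3, not the actual patch against release22-maint, and `decode_lenient` is a hypothetical name. It also assumes the malformed form is always the correct three-byte sequence with the `0xED` lead byte missing, which is inferred from the single example in the thread.

```python
def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, also accepting surrogates written without 0xED."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:
            width = 1
        elif 0xC2 <= b <= 0xDF:
            width = 2
        elif 0xE0 <= b <= 0xEF:
            width = 3
        elif 0xF0 <= b <= 0xF4:
            width = 4
        elif 0xA0 <= b <= 0xBF and i + 1 < n and 0x80 <= data[i + 1] <= 0xBF:
            # Assumed malformed form: a surrogate whose lead byte was
            # dropped. Put 0xED back and decode the repaired sequence.
            repaired = b"\xed" + data[i:i + 2]
            out.append(repaired.decode("utf-8", "surrogatepass"))
            i += 2
            continue
        else:
            raise UnicodeDecodeError("utf-8", data, i, i + 1, "invalid start byte")
        # Well-formed sequences, including the correct surrogate form.
        out.append(data[i:i + width].decode("utf-8", "surrogatepass"))
        i += width
    return "".join(out)


print(decode_lenient(b"\xed\xa0\x80") == decode_lenient(b"\xa0\x80"))
```

Accepting stray continuation bytes at a character boundary is exactly the kind of leniency the security question above is about, so a real fix would probably want to confine it to the marshal code path.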

[MAL, on UTF-8 for unicode]
> Marshal has used it since 1.6. The point is that the fix to the lone surrogate problem resulted in a change of the UTF-8 codec output. PYCs from unpatched and patched versions wouldn't interoperate if they use lone surrogates in Unicode literals. We usually bump the PYC magic in such a case, to avoid these issues. Since that's not possible for a patch level release, we have two choices:
> 1. leave things as they are
> 2. apply the fix and live with the consequences of having to regenerate PYCs by hand
[but then later]
> One possibility would be to:
> 1. change the UTF-8 encoder in Python 2.2 to produce correct output
> 2. let the UTF-8 decoder in Python 2.2 accept the correct output *and* the malformed output
This sounds like the right solution. I hope you can produce a patch against the release22-maint branch.
> I am not sure whether 2. would introduce a security problem. Perhaps there is a way to restrict the work-around so that we don't run into UTF-8 encoding attack problems.
I don't see what this vulnerability (if it is one) adds to the already laughable security of marshal and .pyc files. If someone you don't trust can write your .pyc files, they can cause your interpreter to crash by inserting bogus bytecode. So I'd say this is a non-issue.
--Guido van Rossum (home page: http://www.python.org/~guido/)
Participants: Guido van Rossum, M.-A. Lemburg, Michael Hudson