Re: [Python-Dev] utf8 issue

Guido van Rossum <guido@python.org> writes:
I think this was discussed really quite a long time ago, like six months or so.
I think the reason this didn't get fixed in 2.2.1 is that it necessitates bumping MAGIC. I can probably dig up more references if you want.

Cheers,
M.

--
34. The string is a stark data structure and everywhere it is passed there is much duplication of process. It is a perfect vehicle for hiding information.
  -- Alan Perlis, http://www.cs.yale.edu/homes/perlis-alan/quotes.html

Please do. Bumping MAGIC is a no-no between dot releases. But I don't understand why that is necessary?

--Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
It would be necessary since marshal uses UTF-8 for storing Unicode literals. Even though it's highly unlikely that the problem cases are used in Python Unicode literals, there's a tiny chance. Without the MAGIC change this could result in PYC files failing to load.

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/
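To make the MAGIC argument concrete, here is a minimal sketch of the check that a magic bump relies on. It is written for modern Python 3 and uses `importlib.util.MAGIC_NUMBER`, which is not how Python 2.2 exposed the value; the helper `pyc_magic_matches` and the "bumped" value are hypothetical and exist only for illustration.

```python
import importlib.util
import struct


def pyc_magic_matches(header: bytes) -> bool:
    """Return True if a .pyc header starts with this interpreter's magic."""
    return header[:4] == importlib.util.MAGIC_NUMBER


# The magic is a 16-bit little-endian version word followed by b"\r\n".
current = importlib.util.MAGIC_NUMBER
version = struct.unpack("<H", current[:2])[0]

# Hypothetical bumped magic: the same layout with the version word
# incremented, standing in for a release that changed the marshal output.
bumped = struct.pack("<H", (version + 1) & 0xFFFF) + current[2:]

# Fake headers (magic plus padding), not real .pyc files.
same_release_header = current + b"\x00" * 12
other_release_header = bumped + b"\x00" * 12
```

With a bump, `pyc_magic_matches(other_release_header)` is false, so the stale file is recompiled from source instead of being unmarshalled. Without a bump, both releases accept each other's files and any difference in how Unicode literals were stored goes unnoticed until loading fails.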

Do you mean that in 2.2 it doesn't?
Ha. You may have missed the start of this thread, but the whole problem was that a PYC file *did* fail to load! (The .py file had a lone surrogate in it.) So I'm not sure this argument holds much water. Can someone please explain what change would be necessary, and to what part of the code, to prevent a lone surrogate in a string literal from producing a PYC file that blows up on load?

--Guido van Rossum (home page: http://www.python.org/~guido/)

Guido van Rossum wrote:
Marshal has used it since 1.6. The point is that the fix to the lone surrogate problem resulted in a change of the UTF-8 codec output. PYCs from unpatched and patched versions wouldn't interoperate if they use lone surrogates in Unicode literals. We usually bump the PYC magic in such a case to avoid these issues. Since that's not possible for a patch level release, we have two choices:

1. leave things as they are
2. apply the fix and live with the consequences of having to regenerate PYCs by hand

Just to give an example of the problem:

Python 2.2:
-------------
u'\ud800'.encode('utf-8') == '\xa0\x80'

Current CVS Python:
---------------------
u'\ud800'.encode('utf-8') == '\xed\xa0\x80'
unicode('\xed\xa0\x80', 'utf-8') == u'\ud800'
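The two encoder outputs quoted above can be compared with a small sketch in modern Python 3 (not the 2.2 or CVS interpreters from the thread). The helper name `encode_bmp_three_byte` is hypothetical; the `surrogatepass` error handler is the standard way to ask today's UTF-8 codec for the three-byte form of a lone surrogate, which it otherwise refuses to encode.

```python
def encode_bmp_three_byte(code_point: int) -> bytes:
    """Apply the generic 3-byte UTF-8 pattern to a code point in U+0800..U+FFFF."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("code point outside the 3-byte UTF-8 range")
    return bytes([
        0xE0 | (code_point >> 12),          # lead byte: 1110xxxx
        0x80 | ((code_point >> 6) & 0x3F),  # continuation: 10xxxxxx
        0x80 | (code_point & 0x3F),         # continuation: 10xxxxxx
    ])


def strict_encode_fails(text: str) -> bool:
    """Return True if the default (strict) UTF-8 encoder rejects the text."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


LONE = "\ud800"
by_hand = encode_bmp_three_byte(ord(LONE))
via_codec = LONE.encode("utf-8", "surrogatepass")
```

Both `by_hand` and `via_codec` are the three bytes ED A0 80, matching the CVS output. The 2.2 output quoted above, '\xa0\x80', is the same sequence with the lead byte missing, which is why a decoder cannot read it back.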
Interesting. I wouldn't have expected that.
One possibility would be to:

1. change the UTF-8 encoder in Python 2.2 to produce correct output
2. let the UTF-8 decoder in Python 2.2 accept the correct output *and* the malformed output

I am not sure whether 2. would introduce a security problem. Perhaps there is a way to restrict the work-around so that we don't run into UTF-8 encoding attack problems.

--
Marc-Andre Lemburg
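As a rough illustration of the second point, the sketch below shows one way a decoder could accept both forms while keeping the work-around narrow. It is modern Python 3, not a patch for 2.2, and it rests on an assumption: that the malformed form is always the three-byte sequence with its 0xED lead byte dropped, as in the single '\xa0\x80' example above. The handler name `lenient-surrogates` and the function `decode_lenient` are hypothetical.

```python
import codecs


def _is_continuation(byte: int) -> bool:
    return 0x80 <= byte <= 0xBF


def _lenient_surrogates(exc):
    """Error handler that accepts surrogates in 3-byte or lead-less 2-byte form."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    data, start = exc.object, exc.start
    chunk = data[start:start + 3]
    # Well-formed form: ED A0..BF 80..BF
    if (len(chunk) == 3 and chunk[0] == 0xED
            and 0xA0 <= chunk[1] <= 0xBF and _is_continuation(chunk[2])):
        cp = 0xD000 | ((chunk[1] & 0x3F) << 6) | (chunk[2] & 0x3F)
        return chr(cp), start + 3
    # Assumed malformed form: the same sequence without the ED lead byte.
    if (len(chunk) >= 2 and 0xA0 <= chunk[0] <= 0xBF
            and _is_continuation(chunk[1])):
        cp = 0xD000 | ((chunk[0] & 0x3F) << 6) | (chunk[1] & 0x3F)
        return chr(cp), start + 2
    # Anything else stays an error, so other invalid input is still rejected.
    raise exc


codecs.register_error("lenient-surrogates", _lenient_surrogates)


def decode_lenient(data: bytes) -> str:
    return data.decode("utf-8", "lenient-surrogates")
```

Because the handler only maps bytes into the surrogate range U+D800..U+DFFF and re-raises for everything else, it does not open the door to overlong encodings of ASCII characters, which is the usual UTF-8 encoding attack. Whether that restriction is sufficient for a given use is a separate judgment.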

[MAL, on UTF-8 for unicode]
[but then later]
This sounds like the right solution. I hope you can produce a patch against the release22-maint branch.
I don't see what this vulnerability (if it is one) adds to the already laughable security of marshal and .pyc files. If someone you don't trust can write your .pyc files, they can cause your interpreter to crash by inserting bogus bytecode. So I'd say this is a non-issue.

--Guido van Rossum (home page: http://www.python.org/~guido/)

participants (4)
- Guido van Rossum
- M.-A. Lemburg
- M.-A. Lemburg
- Michael Hudson