[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

Adam Olsen report at bugs.python.org
Sat Jul 12 21:03:52 CEST 2008


Adam Olsen <rhamph at gmail.com> added the comment:

Marc, perhaps Unicode has refined their definitions since you last looked?

Valid UTF-8 *cannot* contain surrogates[1].  If it does, you have
CESU-8[2][3], not UTF-8.
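
To make the difference concrete, here is a minimal sketch in Python 3. It is not code from the tracker: `cesu8_encode` is a hypothetical helper, and it relies on the `surrogatepass` error handler, which postdates this report. It encodes U+10000 both ways, so the two byte sequences can be compared.

```python
def cesu8_encode(text):
    """Hypothetical helper: encode text as CESU-8 (see [2])."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp >= 0x10000:
            # Split the supplementary code point into a UTF-16 surrogate pair,
            # then encode each surrogate as its own 3-byte sequence.
            cp -= 0x10000
            pair = chr(0xD800 + (cp >> 10)) + chr(0xDC00 + (cp & 0x3FF))
            out += pair.encode("utf-8", "surrogatepass")
        else:
            # BMP characters are encoded exactly as in UTF-8.
            out += ch.encode("utf-8")
    return bytes(out)

ch = "\U00010000"
utf8_bytes = ch.encode("utf-8")   # one 4-byte sequence
cesu8_bytes = cesu8_encode(ch)    # two 3-byte sequences

print(utf8_bytes.hex(" "))   # f0 90 80 80
print(cesu8_bytes.hex(" "))  # ed a0 80 ed b0 80
```

The two encodings agree for everything in the BMP and differ only for code points above U+FFFF.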

So there are two bugs.  First, the UTF-8 codec should refuse to load
surrogates.  Second, since the original bug showed up before the .pyc
was created, something in the parsing or compilation stage is producing
CESU-8.
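
As a rough illustration of what "refuse to load surrogates" means, the sketch below checks a byte string against a strict UTF-8 decoder. It assumes a current Python 3 interpreter, whose built-in "utf-8" codec rejects encoded surrogates; the interpreters discussed in this report did not behave this way. `is_strict_utf8` is a hypothetical name used only for illustration.

```python
def is_strict_utf8(data):
    """Return True if data decodes as UTF-8 under the strict error handler."""
    try:
        data.decode("utf-8")  # the default error handler is "strict"
    except UnicodeDecodeError:
        return False
    return True

valid = b"\xf0\x90\x80\x80"            # U+10000 as proper UTF-8
cesu8 = b"\xed\xa0\x80\xed\xb0\x80"    # the same character as a surrogate pair

print(is_strict_utf8(valid))  # True
print(is_strict_utf8(cesu8))  # False
```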


[1] 4th bullet point of D92 in
http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf
[2] http://unicode.org/reports/tr26/
[3] http://en.wikipedia.org/wiki/CESU-8

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue3297>
_______________________________________

