[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created
Adam Olsen
report at bugs.python.org
Sat Jul 12 21:03:52 CEST 2008
Adam Olsen <rhamph at gmail.com> added the comment:
Marc, perhaps Unicode has refined their definitions since you last looked?
Valid UTF-8 *cannot* contain surrogates[1]. If it does, you have
CESU-8[2][3], not UTF-8.
So there are two bugs: first, the UTF-8 codec should refuse to load
surrogates. Second, since the original bug showed up before the .pyc is
created, something in the parse/compilation/whatever stage is producing
CESU-8.
[1] 4th bullet point of D92 in
http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf
[2] http://unicode.org/reports/tr26/
[3] http://en.wikipedia.org/wiki/CESU-8
_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue3297>
_______________________________________
More information about the Python-bugs-list
mailing list