[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created

Sun Jul 6 08:58:44 CEST 2008

New submission from Ezio Melotti <ezio.melotti at gmail.com>:

Problem: when you have Unicode characters with a code point greater than
U+FFFF written directly in the source file (that is, not in the form
u'\Uxxxxxxxx' but as normal chars in a u'' string) the interpreter uses
surrogate pairs for representing these characters only if the pyc
doesn't exist. When the pyc is created it uses a "normal" character
(\Uxxxxxxxx instead of the pair \uxxxx\uxxxx). This could lead to an
unexpected behavior while comparing Unicode strings or in other
situations (even if it could be solved without problems in different
ways - using u'\Uxxxxxxx' or u'\uxxx' instead of the characters,
encoding them before comparing - there shouldn't be differences between
a py and its pyc).

Tested on:
Ubuntu 8.04 with python 2.4: Uses a surrogate pair.
Ubuntu 8.04 with python 2.5: Uses a surrogate pair.
Windows XP SP2 with python 2.4: Uses a "normal" character.

Steps to reproduce the problem:
 1a. download the attached file or create it following the next step;
 1b. in a UTF-8-aware console write `print 
unichr(int('10123', 16))` (or any codepoint >= 10000), copy the printed
character (depending on the console it could be a box, two box or a
character) in a file with the lines `# -*- coding: utf-8 -*-`, `print
'Result:', u'<paste here the char>' == u'\U00010123'` and `print
'Repr:', repr(u'<paste here the char>'), repr(u'\U00010123')`. Save the
file in UTF-8;
 2. open a python interpreter and import the file (`import
unicodetest`). It should print `Result: False` and `Repr:
u'\ud800\udd23' u'\U00010123'` (the character is represented as a
surrogate pair). During this step the pyc file is created.
 3. from the python interpreter write `reload(unicodetest)`. Now it
should print `Result: True` and `Repr: u'\U00010123' u'\U00010123'` (the
char is represented as a "normal" character). Any other reload will
print True. If you delete the pyc and reload again it will print False.

(Instead of using reload() is also possible to create a function and
call it from the module when it's loaded and again with
unicodetest.func(), the result will be the same.)

Expected behavior:
The interpreter should use the same representation in both the situation
(and print True in both the tests).
Another solution could be to change the behavior of == to return True if
a normal char is compared with its surrogate pair (if it makes sense).

Further informations:
The character used for the test is part of the "Unicode Plane 1" (see
http://en.wikipedia.org/wiki/Basic_Multilingual_Plane).
More information about the surrogate pairs can be found here:
http://en.wikipedia.org/wiki/Surrogate_pair#Encoding_of_characters_outside_the_BMP

----------
components: Unicode
files: unicodetest.py
messages: 69321
nosy: ezio.melotti, lemburg
severity: normal
status: open
title: Python interpreter uses Unicode surrogate pairs only before the pyc is created
type: behavior
versions: Python 2.4, Python 2.5
Added file: http://bugs.python.org/file10826/unicodetest.py

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue3297>
_______________________________________