[pypy-issue] [issue1094] Non-BMP unicode literals decoded as surrogate pairs

Tue Mar 20 15:36:19 CET 2012

New submission from Simon Sapin <simon.sapin at kozea.fr>:

PyPy decodes each non-BMP character (i.e. U+10000 and beyond) in unicode string
literals as a surrogate pair (as used in UTF-16) instead of a single character.

Test case:

# coding: utf8
string1 = u'\U00010083'
string2 = u'𐂃'  # should be the same as string1
codepoints1 = [ord(c) for c in string1]
codepoints2 = [ord(c) for c in string2.encode('utf16').decode('utf16')]
codepoints3 = [ord(c) for c in string2]
assert codepoints1 == [0x10083], codepoints1
assert codepoints2 == [0x10083], codepoints2
assert codepoints3 == [0x10083], codepoints3

This test passes on CPython compiled with "wide unicode". On PyPy 1.8.0,
codepoints3 is [0xd800, 0xdc83] (codepoints1 and codepoints2 are correct).
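
For reference, the two values seen in codepoints3 are exactly the UTF-16
surrogate pair for U+10083. The sketch below only illustrates the standard
UTF-16 arithmetic; the helper names to_surrogate_pair and from_surrogate_pair
are made up for this example and are not taken from PyPy or CPython.

```python
def to_surrogate_pair(codepoint):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("not a non-BMP code point: %#x" % codepoint)
    offset = codepoint - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair reported for codepoints3 round-trips to the expected code point.
pair = to_surrogate_pair(0x10083)
assert pair == (0xD800, 0xDC83)
assert from_surrogate_pair(*pair) == 0x10083
```

So the literal appears to be decoded into UTF-16 code units that are then
never merged back into one character.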

----------
messages: 4117
nosy: SimonSapin, pypy-issue
priority: bug
status: unread
title: Non-BMP unicode literals decoded as surrogate pairs

________________________________________
PyPy bug tracker <tracker at bugs.pypy.org>
<https://bugs.pypy.org/issue1094>
________________________________________