[ python-Bugs-1704793 ] incorrect return value of unicodedata.lookup() - beyond BMP

Fri Jul 27 20:33:28 CEST 2007

Bugs item #1704793, was opened at 2007-04-21 12:52
Message generated for change (Comment added) made by loewis
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1704793&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Unicode
Group: Python 2.5
Status: Open
Resolution: None
Priority: 6
Private: No
Submitted By: vbr (vlbrom)
Assigned to: Martin v. Löwis (loewis)
Summary: incorrect return value of unicodedata.lookup() - beyond BMP

Initial Comment:
The lookup() function in the unicodedata module seems to handle unicode characters beyond the BMP (code points higher than 0xFFFF) incorrectly on narrow unicode python builds (python 2.5.1, Windows XPh).

>>> import unicodedata
>>> unicodedata.lookup("GOTHIC LETTER FAIHU")
u'\u0346'
(This should be u'\U00010346'. The leading part of the code point is truncated, which makes the result ambiguous: in this case u'\u0346' is the combining diacritic "COMBINING BRIDGE ABOVE".)

By contrast, \N{name} escapes in unicode string literals work correctly.

>>> u"\N{GOTHIC LETTER FAIHU}"
u'\U00010346'

Unfortunately, I haven't been able to find the problematic pieces of source code, so I'm not able to fix it.

It seems that the correct information for the given code point is used initially, but in the end only the last four hex digits of the code point value are taken into account, as if the "narrow" unicode literal form \uxxxx were used instead of \Uxxxxxxxx, while the same task is handled correctly by the unicodeescape codec used for unicode string literals.
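
The truncation described above can be reproduced arithmetically. The following is only a sketch, written for Python 3 (where strings always hold full code points), and it assumes that the narrow build simply keeps the low 16 bits of the code point; the actual C code path has not been identified here. The names code_point and truncated are illustrative.

```python
import unicodedata

# Full code point of the character from the report.
code_point = ord(unicodedata.lookup("GOTHIC LETTER FAIHU"))

# Assumption: a narrow build stores the value in one 16-bit unit,
# which keeps only the low 16 bits (the last four hex digits).
truncated = code_point & 0xFFFF

print(hex(code_point))                   # 0x10346
print(hex(truncated))                    # 0x346
print(unicodedata.name(chr(truncated)))  # COMBINING BRIDGE ABOVE
```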

vbr

----------------------------------------------------------------------

>Comment By: Martin v. Löwis (loewis)
Date: 2007-07-27 20:33

Message:
Logged In: YES 
user_id=21627
Originator: NO

I'm skeptical about applying this to 2.5.x: I think it could be surprising
if you suddenly get length-two results. How about raising a ValueError
instead if the resulting character is out of range?

----------------------------------------------------------------------

Comment By: Georg Brandl (gbrandl)
Date: 2007-06-13 08:37

Message:
Logged In: YES 
user_id=849994
Originator: NO

Indeed, it is UCS-2, sorry.

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2007-06-13 06:25

Message:
Logged In: YES 
user_id=21627
Originator: NO

gbrandl: what precisely can you confirm? In any UCS-4 build, the lookup
should return the correct result, and it does so on my machine.

An alternative solution to the change proposed by perky would be to raise
a ValueError, similar to unichr().
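
A rough sketch of the ValueError alternative, for illustration only. It is written in Python 3 and emulates the narrow-build limit with a constant; lookup_narrow and NARROW_MAX are hypothetical names, not part of unicodedata, and the real change would be made in the C implementation.

```python
import unicodedata

# Largest code point that fits in a single 16-bit unit on a narrow build.
NARROW_MAX = 0xFFFF

def lookup_narrow(name):
    """Look up a character by name, refusing results beyond the BMP."""
    char = unicodedata.lookup(name)
    if ord(char) > NARROW_MAX:
        # Fail loudly instead of returning a truncated character.
        raise ValueError(
            "character U+%04X is out of range for a narrow build" % ord(char)
        )
    return char
```

With this sketch, lookup_narrow("COMBINING BRIDGE ABOVE") returns the BMP character as before, while lookup_narrow("GOTHIC LETTER FAIHU") raises ValueError.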

----------------------------------------------------------------------

Comment By: Hye-Shik Chang (perky)
Date: 2007-06-12 13:12

Message:
Logged In: YES 
user_id=55188
Originator: NO

I attached a working fix for the problem. The patch encodes non-BMP
characters as a surrogate pair in the lookup function.
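
For reference, the surrogate pair computation for a non-BMP code point can be sketched as follows. This is the standard UTF-16 arithmetic, not the contents of the attached patch; to_surrogate_pair is a hypothetical helper name.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

high, low = to_surrogate_pair(0x10346)  # GOTHIC LETTER FAIHU
print(hex(high), hex(low))              # 0xd800 0xdf46
```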

Surrogate pair encoding could arguably be part of the standard unicode
API.  How about providing UTF-32 codecs in the Python C-API to help with
this kind of use?
File Added: unicodedata-lookup-ucs2fix.diff

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2007-06-12 11:35

Message:
Logged In: YES 
user_id=38388
Originator: NO

Martin, please have a look. Thanks.

----------------------------------------------------------------------

Comment By: Georg Brandl (gbrandl)
Date: 2007-04-21 22:29

Message:
Logged In: YES 
user_id=849994
Originator: NO

Confirmed with a linux-x86 UCS-4 build here.

----------------------------------------------------------------------
