[ python-Feature Requests-1706460 ] access to unicodedata (via codepoints or 2-char surrogates)

SourceForge.net noreply at sourceforge.net
Tue Apr 24 12:47:21 CEST 2007


Feature Requests item #1706460, was opened at 2007-04-24 12:47
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=355470&aid=1706460&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Unicode
Group: None
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: vbr (vlbrom)
Assigned to: Nobody/Anonymous (nobody)
Summary: access to unicodedata (via codepoints or 2-char surrogates)

Initial Comment:
Currently, most functions of the unicodedata module require a unichr (a unicode string of length 1) as a parameter. For most uses this is fine, but when working with characters outside the BMP (code points above FFFF) on a narrow Python build, it would be quite handy to access the properties of these characters directly by code point (ordinal). On a narrow build, unichr(x) only works for x <= FFFF, so the other Unicode planes are inaccessible this way.

I believe the Unicode database may already be indexed internally by some numerical value such as the code point; is that not the case?

With this improvement, the whole database would be effectively accessible on narrow Python builds too, where it is not possible to pass a one-character string for code points above FFFF. Even if the explicit limitation of unichr is bypassed, e.g. by writing a unicode literal u'\Uxxxxxxxx', the resulting string consists of a surrogate pair and therefore has length 2.
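To illustrate why such a literal ends up with length 2, here is a minimal sketch of the standard UTF-16 arithmetic that splits a code point above FFFF into a surrogate pair. The function name to_surrogate_pair is only illustrative and is not part of unicodedata or of any existing API.

```python
def to_surrogate_pair(codepoint):
    """Split a code point above 0xFFFF into (high, low) UTF-16 surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = codepoint - 0x10000           # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1D11E)
print("%04X %04X" % (high, low))
```

On a narrow build, a string holding such a character stores these two code units, which is why unicodedata functions that expect a single character reject it.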

Alternatively, the respective functions could also accept a two-character string, provided that the sequence can be correctly interpreted as a surrogate-pair representation of a valid Unicode code point.
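A rough sketch of what such an interface might look like, accepting an integer code point, a one-character string, or a two-character surrogate pair. The names codepoint_of and name_of are hypothetical and only show the intended behaviour. The sketch is written for a Python version where chr() covers all code points, so it illustrates the interface only; on a narrow build the lookup itself would have to happen inside the module.

```python
import unicodedata


def codepoint_of(key):
    """Return the code point for an int, a 1-char string, or a surrogate pair.

    Hypothetical helper, not an existing unicodedata function.
    """
    if isinstance(key, int):
        return key
    if len(key) == 1:
        return ord(key)
    if len(key) == 2:
        high, low = ord(key[0]), ord(key[1])
        # Accept only a well-formed pair: high surrogate, then low surrogate.
        if 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF:
            return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    raise TypeError("need a code point, a character, or a surrogate pair")


def name_of(key):
    """Look up the character name via the assumed wrapper above."""
    return unicodedata.name(chr(codepoint_of(key)))


print(name_of(0x1D11E))           # by code point
print(name_of("\ud834\udd1e"))    # by two-character surrogate pair
```

Both calls resolve to the same database entry, so either calling convention would avoid duplicating the data in a custom dataset.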

Currently, such behaviour (e.g. code point access) can be emulated with custom datasets derived from the Unicode database, but I believe it should be possible to access the already present data somehow (also on narrow builds) rather than having to duplicate it.



----------------------------------------------------------------------
