[Python-ideas] Support Unicode code point notation

Stephen J. Turnbull stephen at xemacs.org
Sun Jul 28 20:46:54 CEST 2013


Steven D'Aprano writes:

 > Earlier, MRAB suggested that unicodedata.name() could return the U+
 > code point in the case of unnamed characters. I think it would be
 > better to have a separate unicodedata function to return the code
 > point, and leave the current behaviour of name() alone.

His point, and I agree, is that it's not useful for name() to raise
ValueError, as it does for unicodedata.name(chr(65535)).  In that case I
would prefer that it return "U+FFFF NOT A CHARACTER" or something like that.
And for chr(65535*2) it would return "U+1FFFE UNASSIGNED IN VERSION
<whatever version Python 3.3 happens to be using>".  Similarly for
unassigned private use area code points and surrogates (with their
blocks being mentioned).  It would be nice if assigned private use
area code points could have names added to the database.  If a private
use character wasn't named, it could have its name algorithmically
determined as "U+XXXX PRIVATE USE: UNNAMED".
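
To make the proposal concrete, here is a rough sketch of such a fallback
as a wrapper around the existing function.  The name describe() and the
exact wording of the fallback strings (including "SURROGATE" and the
final "UNNAMED" case) are my assumptions for illustration, not anything
that exists in unicodedata.  It relies only on unicodedata.name() with a
default, unicodedata.category(), and unicodedata.unidata_version.

```python
import unicodedata

def describe(c):
    """Return the Unicode name of c, or a descriptive fallback.

    Hypothetical helper: the fallback strings are illustrative only.
    """
    # Passing a default avoids the ValueError for unnamed code points.
    name = unicodedata.name(c, None)
    if name is not None:
        return name
    cp = ord(c)
    label = 'U+{:04X}'.format(cp)
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points
    # of every plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ...).
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return label + ' NOT A CHARACTER'
    category = unicodedata.category(c)
    if category == 'Co':   # private use
        return label + ' PRIVATE USE: UNNAMED'
    if category == 'Cs':   # surrogate
        return label + ' SURROGATE'
    if category == 'Cn':   # unassigned in this database version
        return (label + ' UNASSIGNED IN VERSION '
                + unicodedata.unidata_version)
    # Anything else without a name, e.g. control characters.
    return label + ' UNNAMED'
```

With this sketch, describe(chr(65535)) gives "U+FFFF NOT A CHARACTER" and
describe(chr(0xE000)) gives "U+E000 PRIVATE USE: UNNAMED".  Note that
U+1FFFE, being one of the last two code points of its plane, lands in the
noncharacter branch rather than the unassigned one.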

 > def codepoint(c):
 >      return 'U+{:04X}'.format(ord(c))
 > 
 > This should always succeed for any character.

Or for any code point: it will succeed for things that aren't characters, such
as chr(65535).  As one-liners go, this does seem a reasonable
candidate for the stdlib.
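
For reference, a quick check of the quoted one-liner on a few inputs,
including a noncharacter and a code point outside the BMP.  The function
is the one from the quoted message; only the docstring and the sample
calls are added here.

```python
def codepoint(c):
    """Format the code point of c in U+ notation, at least 4 hex digits."""
    return 'U+{:04X}'.format(ord(c))

print(codepoint('A'))            # U+0041
print(codepoint(chr(65535)))     # U+FFFF
print(codepoint(chr(0x1F600)))   # U+1F600
```

The {:04X} format pads to four digits but does not truncate, so
supplementary-plane code points come out with five or six digits.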

Steve

