Differences between \N escapes and unicodedata
eryk sun
eryksun at gmail.com
Sat Aug 6 00:25:26 EDT 2016
On Sat, Aug 6, 2016 at 3:13 AM, Chris Angelico <rosuav at gmail.com> wrote:
>>>> unicodedata.lookup("NULL")
> '\x00'
>>>> "\N{NULL}"
> '\x00'
>>>> unicodedata.name(_)
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> ValueError: no such name
>
> Tested on 3.4, 3.5, and 3.6. Extremely odd.
U+0000 has a legacy name and alias names in the standard, but no primary name:
http://www.unicode.org/Public/8.0.0/ucd/UnicodeData.txt
http://www.unicode.org/Public/8.0.0/ucd/NameAliases.txt
lookup() also searches the aliases, which Python maps into a private
use area range (U+F0000 - U+F01CB), and of course maps each alias back
to the correct character code. name() only returns primary names, which
is why it raises ValueError for U+0000.
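The asymmetry can be seen with an unmodified unicodedata. The following is only a minimal sketch: the helper name char_and_primary_name is made up for illustration, and it assumes Python 3.3 or later, where lookup() accepts aliases.

```python
import unicodedata

def char_and_primary_name(name):
    """Return (character, primary name or None) for a name or alias.

    Hypothetical helper for illustration; not part of unicodedata.
    """
    ch = unicodedata.lookup(name)          # accepts primary names and aliases
    return ch, unicodedata.name(ch, None)  # default value avoids ValueError

for n in ("NULL", "NUL", "LINE FEED", "LATIN SMALL LETTER A"):
    ch, primary = char_and_primary_name(n)
    print("%-22s U+%04X %s" % (n, ord(ch), primary))
```

The control characters resolve through their aliases but report None as the primary name, while an ordinary letter round-trips.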
For the following I hacked unicodedata.name() to allow returning names
for the alias range. Notice that there are multiple aliases for a
given character, straight from the above-mentioned NameAliases
database.
>>> import textwrap, unicodedata
>>> names = [unicodedata.name(chr(i)) for i in range(0xf0000, 0xf01cb)]
>>> print(*textwrap.wrap(', '.join(names[:80])), sep='\n')
NULL, NUL, START OF HEADING, SOH, START OF TEXT, STX, END OF TEXT,
ETX, END OF TRANSMISSION, EOT, ENQUIRY, ENQ, ACKNOWLEDGE, ACK, ALERT,
BEL, BACKSPACE, BS, CHARACTER TABULATION, HORIZONTAL TABULATION, HT,
TAB, LINE FEED, NEW LINE, END OF LINE, LF, NL, EOL, LINE TABULATION,
VERTICAL TABULATION, VT, FORM FEED, FF, CARRIAGE RETURN, CR, SHIFT
OUT, LOCKING-SHIFT ONE, SO, SHIFT IN, LOCKING-SHIFT ZERO, SI, DATA
LINK ESCAPE, DLE, DEVICE CONTROL ONE, DC1, DEVICE CONTROL TWO, DC2,
DEVICE CONTROL THREE, DC3, DEVICE CONTROL FOUR, DC4, NEGATIVE
ACKNOWLEDGE, NAK, SYNCHRONOUS IDLE, SYN, END OF TRANSMISSION BLOCK,
ETB, CANCEL, CAN, END OF MEDIUM, EOM, SUBSTITUTE, SUB, ESCAPE, ESC,
INFORMATION SEPARATOR FOUR, FILE SEPARATOR, FS, INFORMATION SEPARATOR
THREE, GROUP SEPARATOR, GS, INFORMATION SEPARATOR TWO, RECORD
SEPARATOR, RS, INFORMATION SEPARATOR ONE, UNIT SEPARATOR, US, SP,
DELETE, DEL
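The same grouping can be reproduced without hacking unicodedata by reading NameAliases.txt directly. This is a rough sketch that assumes the file's "code point;alias;type" layout; SAMPLE is a short hand-copied excerpt and parse_aliases is a hypothetical helper, not anything Python provides.

```python
from collections import defaultdict

# Short excerpt in the NameAliases.txt layout: code point;alias;type
SAMPLE = """\
0000;NULL;control
0000;NUL;abbreviation
0009;CHARACTER TABULATION;control
0009;HORIZONTAL TABULATION;control
0009;HT;abbreviation
0009;TAB;abbreviation
"""

def parse_aliases(text):
    """Map each code point to its list of (alias, type) pairs, in file order."""
    aliases = defaultdict(list)
    for line in text.splitlines():
        line = line.split('#', 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        code, alias, kind = line.split(';')
        aliases[int(code, 16)].append((alias, kind))
    return dict(aliases)

aliases = parse_aliases(SAMPLE)
for code, entries in sorted(aliases.items()):
    print('U+%04X' % code, ', '.join(alias for alias, _ in entries))
```

Each alias collected this way should resolve through unicodedata.lookup() to chr(code), which is a quick way to check a parsed table against the interpreter's own database.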