[Python-ideas] Support Unicode code point notation

Sat Jul 27 12:01:43 CEST 2013

Unicode's standard notation for code points is U+ followed by a 4, 5 or 6 hex digit string, such as π = U+03C0. This notation is found throughout the Unicode Consortium's website, e.g.:

http://www.unicode.org/versions/corrigendum2.html

as well as on third-party sites that have reason to discuss Unicode code points, e.g.:

https://en.wikipedia.org/wiki/Eth#Computer_input

I propose that Python strings support this as the preferred escape notation for Unicode code points:

'\U+03C0'
=> 'π'

The existing \U and \u variants must be kept for backwards compatibility, but should be (mildly) discouraged in new code.

Doesn't this violate "Only One Way To Do It"?
---------------------------------------------

That's not what the Zen says. The Zen says there should be One Obvious Way to do it, not Only One. It is my hope that we can agree that the One Obvious Way to refer to a Unicode character by its code point is by using the same notation that the Unicode Consortium uses:

d <=> U+0064

and leave legacy escape sequences as the not-so-obvious ways to do it:

\x64 \144 \u0064 \U00000064
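
As a quick check that these four legacy spellings really are the same one-character string, here is a minimal sketch. It uses only existing Python 3 syntax; the variable names are just for illustration:

```python
# All four existing escapes spell the same character, U+0064 (LATIN SMALL LETTER D).
legacy_forms = ['\x64', '\144', '\u0064', '\U00000064']

all_equal_d = all(form == 'd' for form in legacy_forms)
print(all_equal_d)

# The Unicode Consortium's notation for the same code point.
notation = 'U+{:04X}'.format(ord('d'))
print(notation)
```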

Why do we need yet another way of writing escape sequences?
-----------------------------------------------------------

We don't need another one, we need a better one. U+xxxx is the standard Unicode notation, while existing Python escapes have various problems.

One-byte hex and oct escapes are a throwback to the old one-byte ASCII days, and reflect an obsolete idea of strings being equivalent to bytes. Backwards compatibility requires that we continue to support them, but they shouldn't be encouraged in strings.

Two-byte \u escapes are harmless, so long as you imagine that Unicode is a 16-bit character set. Unfortunately, it is not. \u does not support code points in the supplementary planes (those with ordinal value greater than 0xFFFF), and can silently give the wrong result if you make a mistake in counting digits:

# I want EGYPTIAN HIEROGLYPH D010 (Eye of Horus)
s = '\u13080'
=> oops, I get 'ገ0' (ETHIOPIC SYLLABLE GA, ZERO)
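
The miscount can be confirmed with the standard unicodedata module. This is only a sketch of the check, with illustrative variable names:

```python
import unicodedata

s = '\u13080'      # parsed as '\u1308' followed by the literal digit '0'
wrong_names = [unicodedata.name(c) for c in s]

t = '\U00013080'   # the eight-digit spelling that currently works for U+13080
right_name = unicodedata.name(t)

print(len(s), wrong_names)
print(len(t), right_name)
```

No error or warning is raised for the first literal; the only symptom is a string that is one character longer than intended.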

Four-byte \U escape sequences support the entire Unicode character set, but they are terribly verbose, and the first two digits are *always* zero. Python doesn't (and shouldn't) support \U escapes beyond 10FFFF, so the first two digits of the eight digit hex value are pointless.
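
The upper limit can be seen directly. The following is a small sketch, assuming Python 3.3 or later (where sys.maxunicode is always 0x10FFFF); the names are illustrative:

```python
import sys

highest = '\U0010FFFF'   # the largest code point Unicode defines
is_max = ord(highest) == sys.maxunicode

# A \U escape beyond 10FFFF is rejected when the literal is compiled.
try:
    compile(r"'\U00110000'", '<example>', 'eval')
    beyond_accepted = True
except SyntaxError:
    beyond_accepted = False

print(is_max, beyond_accepted)
```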

What is the U+ escape specification?
------------------------------------

http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals

lists the escape sequences, including:

\uxxxx         Character with 16-bit hex value xxxx
\Uxxxxxxxx     Character with 32-bit hex value xxxxxxxx

To this should be added:

\U+xxxx        Character at code point xxxx (hex)

with the note:

Exactly 4, 5 or 6 hexadecimal digits are required.
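
To make the rule concrete, here is a rough sketch of how such an escape could be expanded. It is not an implementation of the proposal: expand_u_plus is a hypothetical helper that works on ordinary (raw) strings at run time, and it assumes the digits are matched greedily, up to six, which the text above does not spell out.

```python
import re

# Hypothetical helper, not part of the proposal or of Python itself.
# Assumption: the hex digits after "\U+" are matched greedily, 4 to 6 of them.
_U_PLUS = re.compile(r'\\U\+([0-9A-Fa-f]{4,6})')

def expand_u_plus(text):
    """Replace each U+ escape in text with the character it names."""
    def _replace(match):
        # chr() raises ValueError for values above 0x10FFFF.
        return chr(int(match.group(1), 16))
    return _U_PLUS.sub(_replace, text)

print(expand_u_plus(r'\U+03C0'))
print(expand_u_plus(r'\U+13080'))
```

Under the greedy assumption, a hex-looking character directly after the escape is absorbed into it: r'\U+0064b' is read as U+0064B, not as 'd' followed by 'b'.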

Upper or lower case?
--------------------

Uppercase should be preferred, as the Unicode Consortium uses it, but both should be accepted.

Variable number of digits? Isn't that a bad thing?
--------------------------------------------------

It's neither good nor bad. Octal escapes already support from 1 to 3 octal digits. In some languages (but not Python), hex escapes support from 1 to an unlimited number of hex digits.
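
For comparison, this is how the existing variable-length octal escapes behave in Python 3 (a short sketch, with variable names chosen for illustration):

```python
# Octal escapes take one, two or three digits; a fourth digit is a separate character.
one_digit = '\7'        # U+0007
two_digits = '\12'      # U+000A, newline
three_digits = '\123'   # U+0053, 'S'
four_chars = '\1234'    # '\123' followed by the literal '4'

print([len(x) for x in (one_digit, two_digits, three_digits, four_chars)])
```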

Is this backwards compatible?
-----------------------------

I believe it is. As of Python 3.3, strings using \U+ give a syntax error:

py> '\U+13080'
   File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-7: end of string in escape sequence
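
The same check can be scripted rather than typed at the prompt. The sketch below uses a hypothetical helper, literal_compiles, and only tests whether compilation fails; the exact error message may differ between Python versions:

```python
def literal_compiles(source):
    """Return True if source (the text of a Python expression) compiles."""
    try:
        compile(source, '<example>', 'eval')
        return True
    except SyntaxError:
        return False

# Raw strings keep the backslash, so compile() sees the escape itself.
u_plus_ok = literal_compiles(r"'\U+13080'")      # proposed form, currently rejected
existing_ok = literal_compiles(r"'\U00013080'")  # existing eight-digit form

print(u_plus_ok, existing_ok)
```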

What deprecation schedule are you proposing?
--------------------------------------------

I'm not. At least, the existing features should not be considered for removal before Python 4000. In the meantime, the U+ form should be noted as the preferred way, and perhaps blessed in PEP 8.

Should string reprs use the U+ form?
------------------------------------

\u escapes are sometimes used in string reprs, e.g. for private-use characters:

py> chr(0xE034)
'\ue034'

Should this change to '\U+E034'? My personal preference is that it should, but I fear backwards compatibility may prevent it. Even if the exact form of str.__repr__ is not guaranteed, changing the repr would break (e.g.) some doctests.
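
For reference, this is the kind of current output a doctest might depend on. It is a small sketch of existing Python 3 behaviour only, not of any proposed change:

```python
private_use = chr(0xE034)   # private-use character, not printable
printable = chr(0x03C0)     # GREEK SMALL LETTER PI

private_repr = repr(private_use)   # escaped with \u
printable_repr = repr(printable)   # shown as the character itself
printable_ascii = ascii(printable) # ascii() always escapes non-ASCII

print(private_repr, printable_repr, printable_ascii)
```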

This proposal defers any discussion of changing the repr of strings to use U+ escapes.

-- 
Steven