[Python-ideas] Support Unicode code point notation

Sat Jul 27 12:22:56 CEST 2013

On 27/07/13 11:01, Steven D'Aprano wrote:
> Unicode's standard notation for code points is U+ followed by a 4, 5 or
> 6 hex digit string, such as π = U+03C0. This notation is found
> throughout the Unicode Consortium's website, e.g.:
>
> http://www.unicode.org/versions/corrigendum2.html
>
> as well as in third party sites that have reason to discuss Unicode code
> points, e.g.:
>
> https://en.wikipedia.org/wiki/Eth#Computer_input
>
> I propose that Python strings support this as the preferred escape
> notation for Unicode code points:
>
> '\U+03C0'
> => 'π'
>
> The existing \U and \u variants must be kept for backwards
> compatibility, but should be (mildly) discouraged in new code.
>
>
> Doesn't this violate "Only One Way To Do It"?
> ---------------------------------------------
>
> That's not what the Zen says. The Zen says there should be One Obvious
> Way to do it, not Only One. It is my hope that we can agree that the One
> Obvious Way to refer to a Unicode character by its code point is by
> using the same notation that the Unicode Consortium uses:
>
> d <=> U+0064
>
> and leave legacy escape sequences as the not-so-obvious ways to do it:
>
> \x64 \144 \u0064 \U00000064
>
>
> Why do we need yet another way of writing escape sequences?
> -----------------------------------------------------------
>
> We don't need another one, we need a better one. U+xxxx is the standard
> Unicode notation, while existing Python escapes have various problems.
>
> One-byte hex and oct escapes are a throwback to the old one-byte ASCII
> days, and reflect an obsolete idea of strings being equivalent to bytes.
> Backwards compatibility requires that we continue to support them, but
> they shouldn't be encouraged in strings.
>
> Two-byte \u escapes are harmless, so long as you imagine that Unicode is
> a 16-bit character set. Unfortunately, it is not. \u does not support
> code points in the supplementary planes (those with ordinal value
> greater than 0xFFFF), and can silently give the wrong result if you
> make a mistake in counting digits:
>
> # I want EGYPTIAN HIEROGLYPH D010 (Eye of Horus)
> s = '\u13080'
> => oops, I get 'ገ0' (ETHIOPIC SYLLABLE GA, ZERO)
>
> Four-byte \U escape sequences support the entire Unicode character set,
> but they are terribly verbose, and the first two digits are *always*
> zero. Python doesn't (and shouldn't) support \U escapes beyond 10FFFF,
> so the first two digits of the eight digit hex value are pointless.
>
>
> What is the U+ escape specification?
> ------------------------------------
>
> http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
>
>
> lists the escape sequences, including:
>
> \uxxxx         Character with 16-bit hex value xxxx
> \Uxxxxxxxx     Character with 32-bit hex value xxxxxxxx
>
>
> To this should be added:
>
> \U+xxxx        Character at code point xxxx (hex)
>
>
> with the note:
>
> Exactly 4, 5 or 6 hexadecimal digits are required.
>
>
> Upper or lower case?
> --------------------
>
> Uppercase should be preferred, as the Unicode Consortium uses it, but
> both should be accepted.
>
>
> Variable number of digits? Isn't that a bad thing?
> --------------------------------------------------
>
> It's neither good nor bad. Octal escapes already support from 1 to 3
> octal digits. In some languages (but not Python), hex escapes support
> from 1 to an unlimited number of hex digits.
>
>
> Is this backwards compatible?
> -----------------------------
>
> I believe it is. As of Python 3.3, strings using \U+ give a syntax error:
>
> py> '\U+13080'
>    File "<stdin>", line 1
> SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in
> position 0-7: end of string in escape sequence
>
>
> What deprecation schedule are you proposing?
> --------------------------------------------
>
> I'm not. At least, the existing features should not be considered for
> removal before Python 4000. In the meantime, the U+ form should be noted
> as the preferred way, and perhaps blessed in PEP 8.
>
>
> Should string reprs use the U+ form?
> ------------------------------------
>
> \u escapes are sometimes used in string reprs, e.g. for private-use
> characters:
>
> py> chr(0xE034)
> '\ue034'
>
> Should this change to '\U+E034'? My personal preference is that it
> should, but I fear backwards compatibility may prevent it. Even if the
> exact form of str.__repr__ is not guaranteed, changing the repr would
> break (e.g.) some doctests.
>
> This proposal defers any discussion of changing the repr of strings to
> use U+ escapes.
>
>
>

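For reference, the behaviour of the existing escapes described in the quoted message can be checked directly. This is only a small sketch against current Python 3; the helper name is_syntax_error is made up for illustration and is not part of any library.

```python
# Existing escape behaviour in Python 3 (no proposed syntax is used here).

# Intended: U+13080 EGYPTIAN HIEROGLYPH D010.
# \u consumes exactly four hex digits, so the trailing "0" is left over
# as an ordinary character.
wrong = '\u13080'
# \U needs all eight hex digits.
right = '\U00013080'

print(len(wrong), [hex(ord(c)) for c in wrong])
print(len(right), hex(ord(right)))


def is_syntax_error(literal_source):
    """Return True if compiling the given source text raises SyntaxError.

    Hypothetical helper, used only to probe how a literal is parsed.
    """
    try:
        compile(literal_source, '<example>', 'eval')
    except SyntaxError:
        return True
    return False


# A raw string is used so the backslash reaches the compiler untouched.
u_plus_rejected = is_syntax_error(r"'\U+13080'")
print(u_plus_rejected)
```

The last check only shows that the literal is rejected today; it says nothing about how a future parser would read it.
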
What should '\U+12345' be? U+12345 CUNEIFORM SIGN URU TIMES KI or U+1234
ETHIOPIC SYLLABLE SEE and a digit 5?

-1 without a clear way to disambiguate.
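
To make the ambiguity concrete, here is a rough sketch of two possible reading rules for the proposed escape. Neither rule comes from the proposal: decode_u_plus and its greedy flag are hypothetical names, and the regular expressions are just one way such rules might be written.

```python
import re

# Hypothetical decoders for the proposed \U+ escape; nothing here exists
# in Python itself.
# Rule A: take as many hex digits as possible, between 4 and 6.
_U_PLUS_GREEDY = re.compile(r'\\U\+([0-9A-Fa-f]{4,6})')
# Rule B: always take exactly 4 hex digits.
_U_PLUS_SHORT = re.compile(r'\\U\+([0-9A-Fa-f]{4})')


def decode_u_plus(text, greedy=True):
    """Replace each \\U+XXXX escape in text with the character it names.

    Note: under rule A a six-digit value above 10FFFF makes chr() raise
    ValueError; this sketch does not handle that case.
    """
    pattern = _U_PLUS_GREEDY if greedy else _U_PLUS_SHORT
    return pattern.sub(lambda m: chr(int(m.group(1), 16)), text)


source = r'\U+12345'
greedy = decode_u_plus(source, greedy=True)   # one character, U+12345
short = decode_u_plus(source, greedy=False)   # U+1234 followed by "5"

print([hex(ord(c)) for c in greedy])
print([hex(ord(c)) for c in short])
```

Both readings satisfy "4, 5 or 6 hexadecimal digits", which is why the same source text can produce two different strings.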

Regards,
Ian