[Python-ideas] Support Unicode code point notation

Sat Jul 27 13:01:47 CEST 2013

On Jul 27, 2013, at 12:22, Ian Foote <ian at feete.org> wrote:

> On 27/07/13 11:01, Steven D'Aprano wrote:
>> Unicode's standard notation for code points is U+ followed by a 4, 5 or
>> 6 hex digit string, such as π = U+03C0. This notation is found
>> throughout the Unicode Consortium's website, e.g.:
>> 
>> http://www.unicode.org/versions/corrigendum2.html
>> 
>> as well as in third party sites that have reason to discuss Unicode code
>> points, e.g.:
>> 
>> https://en.wikipedia.org/wiki/Eth#Computer_input
>> 
>> I propose that Python strings support this as the preferred escape
>> notation for Unicode code points:
>> 
>> '\U+03C0'
>> => 'π'
>> 
>> The existing \U and \u variants must be kept for backwards
>> compatibility, but should be (mildly) discouraged in new code.
>> 
>> 
>> Doesn't this violate "Only One Way To Do It"?
>> ---------------------------------------------
>> 
>> That's not what the Zen says. The Zen says there should be One Obvious
>> Way to do it, not Only One. It is my hope that we can agree that the One
>> Obvious Way to refer to a Unicode character by its code point is by
>> using the same notation that the Unicode Consortium uses:
>> 
>> d <=> U+0064
>> 
>> and leave legacy escape sequences as the not-so-obvious ways to do it:
>> 
>> \x64 \144 \u0064 \U00000064
>> 
>> 
>> Why do we need yet another way of writing escape sequences?
>> -----------------------------------------------------------
>> 
>> We don't need another one, we need a better one. U+xxxx is the standard
>> Unicode notation, while existing Python escapes have various problems.
>> 
>> One-byte hex and oct escapes are a throwback to the old one-byte ASCII
>> days, and reflect an obsolete idea of strings being equivalent to bytes.
>> Backwards compatibility requires that we continue to support them, but
>> they shouldn't be encouraged in strings.
>> 
>> Two-byte \u escapes are harmless, so long as you imagine that Unicode is
>> a 16-bit character set. Unfortunately, it is not. \u does not support
>> code points in the supplementary planes (those with ordinal
>> value greater than 0xFFFF), and can silently give the wrong result if
>> you make a mistake in counting digits:
>> 
>> # I want EGYPTIAN HIEROGLYPH D010 (Eye of Horus)
>> s = '\u13080'
>> => oops, I get 'ገ0' (ETHIOPIC SYLLABLE GA, ZERO)
>> 
>> Four-byte \U escape sequences support the entire Unicode character set,
>> but they are terribly verbose, and the first two digits are *always*
>> zero. Python doesn't (and shouldn't) support \U escapes beyond 10FFFF,
>> so the first two digits of the eight digit hex value are pointless.
>> 
>> 
>> What is the U+ escape specification?
>> ------------------------------------
>> 
>> http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
>> 
>> 
>> lists the escape sequences, including:
>> 
>> \uxxxx         Character with 16-bit hex value xxxx
>> \Uxxxxxxxx     Character with 32-bit hex value xxxxxxxx
>> 
>> 
>> To this should be added:
>> 
>> \U+xxxx        Character at code point xxxx (hex)
>> 
>> 
>> with the note:
>> 
>> Exactly 4, 5 or 6 hexadecimal digits are required.
>> 
>> 
>> Upper or lower case?
>> --------------------
>> 
>> Uppercase should be preferred, as the Unicode Consortium uses it, but
>> both should be accepted.
>> 
>> 
>> Variable number of digits? Isn't that a bad thing?
>> --------------------------------------------------
>> 
>> It's neither good nor bad. Octal escapes already support from 1 to 3 oct
>> digits. In some languages (but not Python), hex escapes support from 1
>> to an unlimited number of hex digits.
>> 
>> 
>> Is this backwards compatible?
>> -----------------------------
>> 
>> I believe it is. As of Python 3.3, strings using \U+ give a syntax error:
>> 
>> py> '\U+13080'
>>   File "<stdin>", line 1
>> SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in
>> position 0-7: end of string in escape sequence
>> 
>> 
>> What deprecation schedule are you proposing?
>> --------------------------------------------
>> 
>> I'm not. At least, the existing features should not be considered for
>> removal before Python 4000. In the meantime, the U+ form should be noted
>> as the preferred way, and perhaps blessed in PEP 8.
>> 
>> 
>> Should string reprs use the U+ form?
>> ------------------------------------
>> 
>> \u escapes are sometimes used in string reprs, e.g. for private-use
>> characters:
>> 
>> py> chr(0xE034)
>> '\ue034'
>> 
>> Should this change to '\U+E034'? My personal preference is that it
>> should, but I fear backwards compatibility may prevent it. Even if the
>> exact form of str.__repr__ is not guaranteed, changing the repr would
>> break (e.g.) some doctests.
>> 
>> This proposal defers any discussion of changing the repr of strings to
>> use U+ escapes.
> 
> What should '\U+12345' be? U+12345 CUNEIFORM SIGN URU TIMES KI or U+1234 ETHIOPIC SYLLABLE SEE and a digit 5?
> 
> -1 without a clear way to disambiguate.

We already have the exact same problem with octal escapes. They can be one to three digits long, ending at the first character that is not an octal digit (or at the end of the string). So '\123' is unambiguously 'S', while '\128' is unambiguously '\n8'. Not exactly beautiful, but simple, and a precedent going back to the earliest days of Python, and beyond it to C.
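
A quick check of the octal rule described above, as a minimal sketch for Python 3. The variable names are just for illustration:

```python
# Octal escapes take one to three octal digits and stop at the first
# character that is not an octal digit (or at the end of the string).
three_digits = '\123'      # 0o123 == 83 == ord('S')
two_then_eight = '\128'    # '8' is not an octal digit, so the escape is \12
one_digit = '\7'           # a single digit is already a complete escape

assert three_digits == 'S'
assert two_then_eight == '\n8'
assert len(two_then_eight) == 2
assert one_digit == '\a'
```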

So if we followed the same rule, '\U+12345' would unambiguously be character U+12345, while '\U+1234@' would be U+1234 and a @.
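
To make that rule concrete, here is a rough sketch of a decoder that follows it. Since \U+ is only a proposal, this works on raw text containing a literal backslash rather than on real Python escapes; the name decode_u_plus and the regular expression are my own assumptions, not part of the proposal:

```python
import re

# Hypothetical pattern: backslash, "U+", then 4 to 6 hex digits,
# taking as many digits as are available (like octal escapes do).
_U_PLUS = re.compile(r'\\U\+([0-9A-Fa-f]{4,6})')

def decode_u_plus(raw):
    """Replace each \\U+XXXX run in raw text with the character it names."""
    def substitute(match):
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF:
            raise ValueError('code point out of range: U+%X' % code_point)
        return chr(code_point)
    return _U_PLUS.sub(substitute, raw)

assert decode_u_plus(r'\U+12345') == '\U00012345'   # five digits consumed
assert decode_u_plus(r'\U+1234@') == '\u1234@'      # stops at the '@'
assert decode_u_plus(r'\U+03C0') == 'π'
```

One consequence of taking as many digits as possible: under this sketch, r'\U+123456' is read as the single out-of-range code point U+123456 and rejected, so U+1234 followed by the digits "56" would have to be spelled some other way.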

That doesn't mean it's necessarily a good idea. After all, we don't allow 1-char hex escapes. And octal escapes are already pretty weird: they aren't limited to a single byte's range (as in C), and they don't cover all of Unicode either, but everything up to 511 (because that happens to be the largest value three octal digits can hold). So maybe they're not a great precedent to follow.
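
Both points can be checked with a small sketch. The helper parse_literal is hypothetical, and the last assertion assumes a Python 3 that still accepts octal escapes above \377 (recent versions warn about them, which is why warnings are silenced here):

```python
import ast
import warnings

def parse_literal(source):
    """Evaluate the source text of a string literal; None if rejected."""
    try:
        with warnings.catch_warnings():
            # Newer Pythons warn about octal escapes above \377.
            warnings.simplefilter('ignore')
            return ast.literal_eval(source)
    except SyntaxError:
        return None

# \x needs exactly two hex digits, so a one-digit form is rejected.
assert parse_literal(r"'\x4'") is None
assert parse_literal(r"'\x41'") == 'A'

# Three octal digits top out at 0o777 == 511.
assert ord(parse_literal(r"'\777'")) == 511
```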

> 
> Regards,
> Ian