[Python-ideas] Support Unicode code point notation

Sat Jul 27 13:17:00 CEST 2013

On Sat, Jul 27, 2013 at 12:01 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> Unicode's standard notation for code points is U+ followed by a 4, 5 or 6
> hex digit string, such as π = U+03C0. This notation is found throughout the
> Unicode Consortium's website, e.g.:
>
> http://www.unicode.org/versions/corrigendum2.html
>
> as well as in third party sites that have reason to discuss Unicode code
> points, e.g.:
>
> https://en.wikipedia.org/wiki/Eth#Computer_input
>
> I propose that Python strings support this as the preferred escape notation
> for Unicode code points:
>
> '\U+03C0'
> => 'π'
>
> The existing \U and \u variants must be kept for backwards compatibility,
> but should be (mildly) discouraged in new code.

As Marc-Andre Lemburg said, C, C++ and Java use the same notation as
Python does.

And, as far as I know, there is NO programming language implementing
the U+ syntax.  Why should we?  Why should we violate de-facto standards?

Existing programming languages use one or more of:

a) \uHHHH
b) \UHHHHHHHH
c) \u{H..HHHHHH} (e.g. Ruby)
d) \xH..HH
e) \x{H..HHHHHH}
f) \O..OOO

and probably some more variants I am not aware of or forgot about, but
there is probably no programming language that does \U+H..HHHHHH, so
why should we?
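
For reference, here is a quick sketch of how the forms Python 3 already
accepts line up.  Nothing in it is new syntax; the list names (pi_forms,
d_forms) are only mine, for illustration.

```python
# All of these spell U+03C0 GREEK SMALL LETTER PI as a Python 3 str.
pi_forms = [
    '\u03c0',                     # (a) \uHHHH, exactly four hex digits
    '\U000003c0',                 # (b) \UHHHHHHHH, exactly eight hex digits
    '\N{GREEK SMALL LETTER PI}',  # by character name
    chr(0x3C0),                   # by number, computed at run time
]
assert len(set(pi_forms)) == 1

# Below U+0100 the \xHH and octal escapes work in str literals as well.
d_forms = ['\x64', '\144', '\u0064', '\U00000064']
assert all(form == 'd' for form in d_forms)
```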

> Doesn't this violate "Only One Way To Do It"?
> ---------------------------------------------
>
> That's not what the Zen says. The Zen says there should be One Obvious Way
> to do it, not Only One. It is my hope that we can agree that the One Obvious
> Way to refer to a Unicode character by its code point is by using the same
> notation that the Unicode Consortium uses:
>
> d <=> U+0064
>
> and leave legacy escape sequences as the not-so-obvious ways to do it:
>
> \x64 \144 \u0064 \U00000064

For C, C++, Java and many other programmers, the ABOVE ways are the
obvious ways to do it.  \U+ definitely is not.  Even something as
basic as GNU echo uses the \u and \U syntax.

> Why do we need yet another way of writing escape sequences?
> -----------------------------------------------------------
>
> We don't need another one, we need a better one. U+xxxx is the standard
> Unicode notation, while existing Python escapes have various problems.

…standard notation that NO programming language uses.  In English,
sure thing — go for those fancy U+H..HHHHHH things, that’s what they
are for.

> One-byte hex and oct escapes are a throwback to the old one-byte ASCII days,
> and reflect an obsolete idea of strings being equivalent to bytes. Backwards
> compatibility requires that we continue to support them, but they shouldn't
> be encouraged in strings.

Py2k’s str or Py3k’s bytes still exist and are used.  This is also
where you would use \xHH or \OOO.
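
A small sketch of what I mean, assuming Python 3 (the variable name
pi_utf8 is only for illustration): in a bytes literal, \xHH and \OOO name
byte values, not code points.

```python
# In a bytes literal, hex and octal escapes each produce exactly one byte.
assert b'\x64' == b'\144' == b'd'

# Encoded non-ASCII text takes several bytes; \xHH then spells each byte
# of the encoding, not the code point itself.
pi_utf8 = '\u03c0'.encode('utf-8')
assert pi_utf8 == b'\xcf\x80'
assert pi_utf8.decode('utf-8') == '\u03c0'
```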

> Two-byte \u escapes are harmless, so long as you imagine that Unicode is a
> 16-bit character set. Unfortunately, it is not. \u does not support code
> points in the Supplementary Multilingual Planes (those with ordinal value
> greater than 0xFFFF), and can silently give the wrong result if you make a
> mistake in counting digits:
>
> # I want EGYPTIAN HIEROGLYPH D010 (Eye of Horus)
> s = '\u13080'
> => oops, I get 'ገ0' (ETHIOPIC SYLLABLE GA, ZERO)
>
> Four-byte \U escape sequences support the entire Unicode character set, but
> they are terribly verbose, and the first three digits are *always* zero.
> Python doesn't (and shouldn't) support \U escapes beyond 10FFFF, so the
> first three digits of the eight digit hex value are pointless.

Ruby handles this wonderfully with what I called syntax (c) above.
So, maybe instead of this, let’s get working on \u{H..HHHHHH}?
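
To show why the braces help, here is a rough sketch of how such an escape
could be expanded.  Python has no such syntax; expand_braced_escapes is a
hypothetical helper working on raw strings, and the limit of 1 to 6 hex
digits is my assumption.

```python
import re

# Hypothetical helper, not an existing Python feature: expands Ruby-style
# \u{H..HHHHHH} escapes (1 to 6 hex digits inside braces) in a raw string.
_BRACED = re.compile(r'\\u\{([0-9A-Fa-f]{1,6})\}')

def expand_braced_escapes(text):
    # chr() raises ValueError for values above 0x10FFFF, so out-of-range
    # code points are rejected rather than silently misread.
    return _BRACED.sub(lambda m: chr(int(m.group(1), 16)), text)

# The closing brace ends the escape, so a following digit is just a digit.
eye_of_horus = expand_braced_escapes(r'\u{13080}')
ga_and_zero = expand_braced_escapes(r'\u{1308}0')
assert eye_of_horus == '\U00013080'
assert ga_and_zero == '\u1308' + '0'
```

Both the five-digit code point and the four-digit one followed by a
literal '0' can be written without counting digits or padding with zeroes.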

> [snip]
>
> Variable number of digits? Isn't that a bad thing?
> --------------------------------------------------
>
> It's neither good nor bad. Octal escapes already support from 1 to 3 oct
> digits. In some languages (but not Python), hex escapes support from 1 to an
> unlimited number of hex digits.

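This is bad, because the digits are hex: the letters a-f are digits too,
so a variable-length escape can swallow the start of the text after it.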

Consider this:

'\U+0002two'

This would get us START OF TEXT (aka ^B), followed by the letters 't',
'w' and 'o'.

But if we wanted to go with French,

'\U+0002deux'

we would find ourselves with MODIFIER LETTER RHOTIC HOOK (U+02DE), 'u'
and 'x'.  Uh-oh!

(The example above is based on one from the Unicode mailing list archives.)
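
The problem can be simulated in today's Python.  This is only a sketch:
expand_uplus is a hypothetical helper, and I am assuming the escape would
greedily take 4 to 6 hex digits, as the "4, 5 or 6 hex digit string"
wording of the proposal suggests.

```python
import re
import unicodedata

# Hypothetical: simulate a greedy \U+ escape taking 4 to 6 hex digits.
# Values above 0x10FFFF would make chr() raise ValueError; not handled here.
_UPLUS = re.compile(r'\\U\+([0-9A-Fa-f]{4,6})')

def expand_uplus(text):
    return _UPLUS.sub(lambda m: chr(int(m.group(1), 16)), text)

english = expand_uplus(r'\U+0002two')
french = expand_uplus(r'\U+0002deux')

# 't' is not a hex digit, so the escape stops after four digits.
assert english == '\x02' + 'two'

# 'd' and 'e' are hex digits, so the escape eats them: U+0002DE, then 'ux'.
assert french == '\u02de' + 'ux'
assert unicodedata.name(french[0]) == 'MODIFIER LETTER RHOTIC HOOK'
```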

> [snip]

Overall, huge nonsense.  If you care about some wasted zeroes, why not
propose to steal Ruby’s syntax, denoted as (c) in this message?

-- 
Chris “Kwpolska” Warrick <http://kwpolska.tk>
PGP: 5EAAEA16
stop html mail | always bottom-post | only UTF-8 makes sense