[Python-ideas] Support Unicode code point notation

Steven D'Aprano steve at pearwood.info
Sat Jul 27 17:47:57 CEST 2013


On 27/07/13 21:17, Chris “Kwpolska” Warrick wrote:


> And there is NO programming language implementing the U+ syntax.

That is incorrect.

LispWorks and MIT Scheme both support #\U+ syntax:

http://www.lispworks.com/documentation/lw50/LWUG/html/lwuser-352.htm
http://web.mit.edu/scheme_v9.0.1/doc/mit-scheme-ref/External-Representation-of-Characters.html

as does a project "CLforJava":

https://groups.google.com/forum/#!topic/comp.lang.lisp/pUjKLYLgrVA

(The leading # is Lisp syntax to create a character.)


CSS supports U+ syntax for both individual characters and ranges:

http://www.w3.org/TR/css3-fonts/#unicode-range-desc


BitC does something similar to what I am suggesting:

http://www.bitc-lang.org/docs/bitc/spec.html#stringlit


There may be others I am unaware of. So if you're worried about Python breaking new ground by supporting the standard Unicode notation for code points, don't worry: others have done so first.



> Why should we?  Why should we violate de-facto standards?

"The great thing about standards is there are so many to choose from." You listed six. Here are a few more:

http://billposer.org/Software/ListOfRepresentations.html

None of them are language-independent standards. There is only one language-independent standard for representing code points, and that is the U+xxxx standard used by the Unicode Consortium.

There is a whole universe of Unicode discussion that makes no reference to C or Java escapes, but does reference U+xxxx code points. U+xxxx is the language-independent standard that *any* person familiar with Unicode should be able to understand, regardless of what programming language they use.

We're not "violating" anything. Python doesn't support Ruby's \u{xxxxxx} escape; does that mean we're "violating" Ruby's de facto standard? Or are they violating ours? No to both, of course. Python is not Ruby, and nothing we do can violate Ruby's standard. Or C's, or Java's.



[...]
>> Doesn't this violate "Only One Way To Do It"?
>> ---------------------------------------------
>>
>> That's not what the Zen says. The Zen says there should be One Obvious Way
>> to do it, not Only One. It is my hope that we can agree that the One Obvious
>> Way to refer to a Unicode character by its code point is by using the same
>> notation that the Unicode Consortium uses:
>>
>> d <=> U+0064
>>
>> and leave legacy escape sequences as the not-so-obvious ways to do it:
>>
>> \x64 \144 \u0064 \U00000064
>
> For a C, C++, Java or some other programmers, the ABOVE ways are the
> obvious ways to do it.  \U+ definitely is not.  Even something as
> basic as GNU echo uses the \u \U syntax.

If you want C, C++, Java, Pascal, Forth, ... you know where to get them. This is Python, not C or Java, and we're discussing what is right for the Python language, not for C or Java. (Java still treats Unicode as a 16-bit charset. It isn't.)

While we can, and should, consider what other languages do, we should neither slavishly follow them into bad decisions, nor should we be scared to introduce features that they don't have. Whether my proposal is good or bad, it is what it is regardless of what other languages do. C programmers find braces obvious. If you're unaware of the Pythonic response to the argument "we should do what C does", try this:

from __future__ import braces
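
For anyone who hasn't tried it, that import is a long-standing Easter egg, and the interpreter's answer is the Pythonic response in full:

>>> from __future__ import braces
  File "<stdin>", line 1
SyntaxError: not a chance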



[...]
>> One-byte hex and oct escapes are a throwback to the old one-byte ASCII days,
>> and reflect an obsolete idea of strings being equivalent to bytes. Backwards
>> compatibility requires that we continue to support them, but they shouldn't
>> be encouraged in strings.
>
> Py2k’s str or Py3k’s bytes still exist and are used.  This is also
> where you would use \xHH or \OOO.

This proposal has nothing to do with bytes or with Python 2. Python 2 is closed to new features, and Unicode escapes are irrelevant to bytes. Your comment here is a red herring.



>> Four-byte \U escape sequences support the entire Unicode character set, but
>> they are terribly verbose, and the first three digits are *always* zero.
>> Python doesn't (and shouldn't) support \U escapes beyond 10FFFF, so the
>> first three digits of the eight digit hex value are pointless.

Correction: I obviously can't count; it is only the first two digits that are always zero.
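
To make the waste concrete, here is the current eight-digit form in an interpreter session (any astral character would do):

>>> import sys
>>> hex(sys.maxunicode)   # the largest code point Python accepts
'0x10ffff'
>>> '\U0001D11E'          # MUSICAL SYMBOL G CLEF: eight digits, two of them dead weight
'𝄞'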


> Ruby handles this wonderfully with what I called syntax (c) above.
> So, maybe instead of this, let’s get working on \u{H..HHHHHH}?

Aside: you keep writing H..HHHHHH for Unicode code points. Unicode code points go up to hex 10FFFF, so an absolute maximum of six digits, not seven or more as you keep writing (four times, not that I'm counting :-)
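
You can check that ceiling from Python itself, since chr() enforces it:

>>> chr(0x10FFFF)   # the last valid code point: six hex digits
'\U0010ffff'
>>> chr(0x110000)   # one past the end
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(0x110000)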

As for Ruby's syntax, by your own argument, it "violates the de facto standard" of C, C++, Java, and, yes, Python. Perhaps you would like to tell Matz that it's a terrible idea because it is violating Python's standard?

But seriously, the biggest benefit I see from the Ruby syntax is you can write a sequence of code points:

\u{00E0 00E9 00EE 00F5 00FC}
=> àéîõü

but that's not my proposal. If somebody else wants to champion that, be my guest.
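
For comparison, the closest spelling in today's Python is a join over chr(), which works but hardly reads like a string literal:

>>> ''.join(chr(cp) for cp in (0xE0, 0xE9, 0xEE, 0xF5, 0xFC))
'àéîõü'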



>> [snip]
>>
>> Variable number of digits? Isn't that a bad thing?
>> --------------------------------------------------
>>
>> It's neither good nor bad. Octal escapes already support from 1 to 3 oct
>> digits. In some languages (but not Python), hex escapes support from 1 to an
>> unlimited number of hex digits.
>
> This is bad, because of hex digits.

I have covered this objection in my reply to Chris Angelico. In short, you are no worse off than you already are if you use octal escapes. A U+ hex escape will, at most, need two extra leading zeroes to avoid running past the end, so to speak. Your example:

> '\U+0002deux'

could be written as '\U+000002deux'. (Or any of the existing ways of writing it would continue to work. Since U+0002 is an ASCII control character, I would not object to it being written as '\x02' or '\2'.)

While I acknowledge the issue you raise, I don't think much of this example. Surely in nearly any real-world example there would be some sort of separator between the control character and the word?

'\U+0002 deux'

Yes, the issue of digits following octal or U+ escapes is a real issue, but it is not a common issue, and the solution is *exactly* the same in both cases: add one or two extra zeroes.
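
The interpreter shows the existing octal parallel, padding and all:

>>> '\112'     # an octal escape greedily takes up to three digits
'J'
>>> '\00112'   # pad to three digits to stop it swallowing the '12'
'\x0112'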



>> [snip]
>
> Overall, huge nonsense.  If you care about some wasted zeroes, why not
> propose to steal Ruby’s syntax, denoted as (c) in this message?

I don't merely care about wasted zeroes. I care about improving Python's Unicode model.

What we call "characters" in Python actually are code points, and I believe we should support the standard notation for code points, even if we support other notation as well.

If you go to the Unicode.org website, or Wikipedia, or any other site that actually understands Unicode, they invariably talk about code points and use the U+ notation. But in Python, we use our own notation that is *just slightly different*, for little or no good reason. ("C programmers use it" is not a good reason for a language which is not a variant of C.)

While we must continue to support existing ways of escaping Unicode characters, I'd like to be able to tell people:

"To enter a Unicode code point in a string, put a backslash in front of it."

instead of telling them to count the number of hex digits, then use either \u or \U, and don't forget to pad it to eight digits if you use \U but not \u. Oh, and if you're tempted to copy and paste the code point from somewhere, you have to drop the U+ or it won't work.
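
As a rough sketch of the intended semantics (not an implementation; the helper name and the regular expression are mine, not part of the proposal), the escape could even be approximated in userspace today:

import re

# Hypothetical approximation of the proposed \U+ escape.
_UPLUS = re.compile(r'\\U\+([0-9A-Fa-f]{1,6})')

def expand_uplus(text):
    # A code point is at most six hex digits; chr() rejects anything
    # past 0x10FFFF, which matches the proposal's ceiling.
    return _UPLUS.sub(lambda m: chr(int(m.group(1), 16)), text)

print(expand_uplus(r'd is \U+64'))        # d is d
print(expand_uplus(r'snowman: \U+2603'))  # snowman: ☃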

Unicode's notation is nice and simple. If we had it first, would we prefer \uxxxx and \U00xxxxxx over it? I don't think so.



-- 
Steven

