[Python-ideas] Support Unicode code point notation

Sat Jul 27 14:37:58 CEST 2013

On Sat, Jul 27, 2013 at 12:22 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> On 27/07/13 20:22, Ian Foote wrote:
>>
>> On 27/07/13 11:01, Steven D'Aprano wrote:
>
>
>>> Variable number of digits? Isn't that a bad thing?
>>> --------------------------------------------------
>>>
>>> It's neither good nor bad. Octal escapes already support from 1 to 3 oct
>>> digits. In some languages (but not Python), hex escapes support from 1
>>> to an unlimited number of hex digits.
>
>
>> What should 'U+12345' be? U+12345 CUNEIFORM SIGN URU TIMES KI or U+1234
>> ETHIOPIC SYLLABLE SEE and a digit 5?
>
>
>
> There is no ambiguity. Just like oct escapes, the longest valid sequence (up
> to the maximum) would be used. If you used the shortest, then there would be
> no way to specify 5 or 6 digit sequences.

In a vacuum, \U+12345 seems like a good thing. But two issues dog it:
incompatibility with *every other language*, and the inability to
follow it with a hex digit. With octal escapes, there's a limit of
three digits, so you can simply stuff in an extra zero or two:

>>> "\1234"
'S4'
>>> "\01234"
'\n34'
>>> "\001234"
'\x01234'
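
For comparison, a quick sketch of how the existing escapes behave (plain Python 3, nothing hypothetical here; the variable names are just for illustration). Every escape has a fixed maximum width, so you can always tell where it ends:

```python
# Octal escapes consume at most three octal digits, so zero-padding
# separates the escape from any digits that follow it.
octal_short = "\1234"      # \123 is 'S', then a literal '4'
octal_padded = "\001234"   # \001, then the literal text '234'
assert octal_short == "S" + "4"
assert octal_padded == "\x01" + "234"

# \x takes exactly two hex digits, \u exactly four, \U exactly eight,
# so a following hex digit is never absorbed into the escape.
hex_then_digit = "\x415"           # 'A' then '5'
bmp_then_digit = "\u12345"         # U+1234 then '5'
astral_then_digit = "\U000123456"  # U+12345 then '6'
assert hex_then_digit == "A" + "5"
assert bmp_then_digit == "\u1234" + "5"
assert astral_then_digit == "\U00012345" + "6"
```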

Granted, this isn't the case in all languages, but it's a reasonable
convention to stick to. How many digits should be permitted in \U+
notation? Six? Eight? Will someone glancing at a string literal be
able to work out the correct interpretation of "\U+0012345678"? Also,
far more characters cause trouble here than with octal, which
unambiguously stops at any character outside 0-7; in hex, there are
two additional digits (8, 9) and twelve very common ASCII letters
(A-F, a-f) which can extend the escape. I foresee issues like those
with Windows paths in non-raw strings:

>>> "c:\qwer"
'c:\\qwer'
>>> "c:\asdf"
'c:\x07sdf'

Some work, some don't. You'll put in a convenient four- or five-digit
Unicode escape, follow it with a non-hex letter, and then later on
come back to edit it and confuse yourself no end.
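
To make the digit-count question concrete, here is a rough sketch in Python. The decode_u_plus helper and its max_digits parameter are made up for illustration; nothing like them exists in Python, and the greedy rule is only one reading of the "longest valid sequence" suggestion quoted above:

```python
import re

def decode_u_plus(text, max_digits):
    """Expand hypothetical \\U+ escapes, taking the longest run of hex
    digits up to max_digits (greedy matching, as with octal escapes)."""
    pattern = re.compile(r"\\U\+([0-9A-Fa-f]{1,%d})" % max_digits)
    return pattern.sub(lambda m: chr(int(m.group(1), 16)), text)

literal = r"\U+0012345678"

# With a six-digit limit: U+1234 followed by the text "5678".
six = decode_u_plus(literal, 6)
assert six == "\u1234" + "5678"

# With an eight-digit limit: 0x00123456 is above 0x10FFFF, so chr()
# rejects it with ValueError.
try:
    decode_u_plus(literal, 8)
except ValueError:
    eight_is_error = True
else:
    eight_is_error = False
assert eight_is_error

# A non-hex letter ends the escape where the author intended...
assert decode_u_plus(r"\U+20ACs", 6) == "\u20ac" + "s"
# ...but a hex letter is silently absorbed into the code point.
assert decode_u_plus(r"\U+20ACe", 6) == "\U00020ACE"
```

The same literal means two different things depending on a limit the reader has to remember, and changing a trailing "s" to an "e" changes which character you get.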

I'm -1 on the proposal, primarily because it's different from
everything else without being a significant improvement over what
already exists.

On Sat, Jul 27, 2013 at 12:25 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> On 27/07/13 20:07, M.-A. Lemburg wrote:
>
>> The \u and \U notations are standard in several programming
>> languages, e.g. Java and C++, so we're in good company.
>
>
> Given the problems with both \u and \U escapes, I think it is better to say
> we're in bad company.

Good or bad, it's a large company, and that *in itself* is of value.

ChrisA