[Python-ideas] Support Unicode code point notation

Steven D'Aprano steve at pearwood.info
Sat Jul 27 15:57:47 CEST 2013


On 27/07/13 22:37, Chris Angelico wrote:

> In a vacuum, \U+12345 seems like a good thing. But two issues dog it:
> incompatibility with *every other language*,

Every language is incompatible with every other language. That's why they are different languages. Some languages happen to share a few (or many) similarities, but they are dwarfed by the differences. And yet we manage.

Do you really mean to suggest that a C programmer is capable of interpreting U+2345 when reading about code points on Wikipedia, but will be confused when reading '\U+2345' in Python code? Surely not. But if so, I suggest that Python's \x escapes will also confuse him, since Python's \x is incompatible with C's \x. (We even mention that difference in the docs.) So will Python's significant indentation, duck typing, and, most of all, lack of braces.


> and the inability to
> follow it with a hex digit. With octal escapes, there's a limit of
> three digits, so you can simply stuff in an extra zero or two:

You would do exactly what you already do for octal escapes, and stuff in an extra zero or two:

'\U+0003B82'
=> U+03B8 followed by 2

There's never any need to add more than two zeroes, since you can't use fewer than four or more than six digits in total.
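
For comparison, here is how the same padding trick already works with the escapes Python has today. This is only a small sketch of existing Python 3 behaviour, not of the proposed \U+ notation, and the variable names are purely illustrative:

```python
# Octal escapes consume at most three octal digits.
unpadded = '\12'      # a single character: chr(0o12), i.e. a newline
padded = '\0012'      # \001 followed by the literal character '2'

# The existing \U escape always consumes exactly eight hex digits,
# so anything after the eighth digit is ordinary text.
theta_then_two = '\U000003B82'   # U+03B8 followed by '2'

print(repr(unpadded), repr(padded), repr(theta_then_two))
```

The proposed notation would need the same kind of padding only when the character that follows the escape happens to be a hex digit.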


> How many digits should be permitted in \U+
> notation? Six? Eight?

The Unicode standard uses exactly four, five or six hex digits for code points. The smallest code point is U+0000, and the largest is U+10FFFF. So:

'\U+FFpq' will be a SyntaxError, just like '\uFFpq' today;

'\U+FFFFFF' will be a SyntaxError, just like '\U00FFFFFF' today;

'\U+00F2' will be unambiguously interpreted as a four digit hex escape;

'\U+00FF2' will be unambiguously interpreted as a five digit hex escape;

'\U+00FFF2' will be unambiguously interpreted as a six digit hex escape;

'\U+00FFFF2' will be unambiguously interpreted as U+FFFF followed by 2.
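
The rules above can be sketched in a few lines of Python. This is a rough illustration only, under the assumption that the parser greedily takes up to six hex digits and then checks the range; the helper name expand_uplus and the use of ValueError in place of a compile-time SyntaxError are inventions for this sketch, not part of any implementation:

```python
import re

# Hypothetical pattern: the proposed \U+ prefix, then greedily up to six hex digits.
_UPLUS = re.compile(r'\\U\+([0-9A-Fa-f]{0,6})')

def expand_uplus(text):
    """Expand proposed \\U+XXXX escapes found in text (illustrative helper)."""
    def replace(match):
        digits = match.group(1)
        if len(digits) < 4:
            # Fewer than four hex digits: treated as an error, like '\uFFpq' today.
            raise ValueError('truncated \\U+ escape: %r' % match.group(0))
        code_point = int(digits, 16)
        if code_point > 0x10FFFF:
            # Beyond the largest code point: an error, like '\U00FFFFFF' today.
            raise ValueError('code point out of range: %r' % match.group(0))
        return chr(code_point)
    return _UPLUS.sub(replace, text)

# Raw strings are used so that today's parser leaves the backslashes alone.
print(ascii(expand_uplus(r'\U+00FFFF2')))   # U+FFFF followed by '2'
```

Feeding it r'\U+FFpq' or r'\U+FFFFFF' raises ValueError, which stands in for the SyntaxError a real parser would report.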



> Will a quick eyeball of a string literal be able
> to figure out the correct interpretation of "\U+0012345678"?

I don't think that the existing hex escapes pass the "quick eyeball" test:

'M\u00fcller'

but your example above will be parsed as U+1234 followed by 5678.


> Also,
> this is a problem with a lot more characters than it is with octal,
> which unambiguously stops after any non-digit; in hex, there are two
> additional digits (8, 9) and twelve very common ASCII letters (A-F,
> a-f) which can cause problems. I foresee issues like with Windows
> paths in non-raw strings:
>
>>>> "c:\qwer"
> 'c:\\qwer'
>>>> "c:\asdf"
> 'c:\x07sdf'
>
> Some work, some don't. You'll put in a convenient four or five digit
> Unicode escape, follow it with a non-hex letter, and then later on
> come and edit and confuse yourself no end.

'C:\Products\Umbrellas'

has the same problem. This is an issue with Windows path names, not with my proposal. You don't even need Unicode to be bitten by this issue, just a name starting with n, t, x, etc.
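
A quick sketch of that existing behaviour in Python 3, with no new notation involved (the paths and variable names are made up for illustration):

```python
# In a normal (non-raw) literal, \U must be followed by eight hex digits,
# so this source text fails to compile at all.
source = r"'C:\Products\Umbrellas'"
try:
    compile(source, '<example>', 'eval')
    compiled = True
except SyntaxError:
    compiled = False

# Names starting with t or n are silently mangled instead.
mangled = 'c:\temp\new'          # contains a tab and a newline

# Raw strings or doubled backslashes avoid the problem.
raw_path = r'C:\Products\Umbrellas'
escaped_path = 'C:\\Products\\Umbrellas'

print(compiled, repr(mangled), raw_path == escaped_path)
```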



-- 
Steven

