[Python-ideas] Support Unicode code point notation

Sat Jul 27 23:58:44 CEST 2013

On 7/27/2013 7:22 AM, Steven D'Aprano wrote:
> On 27/07/13 20:22, Ian Foote wrote:
>> On 27/07/13 11:01, Steven D'Aprano wrote:
>
>>> Variable number of digits? Isn't that a bad thing?
>>> --------------------------------------------------
>>>
>>> It's neither good nor bad.

It is wretched. In the unicode standard, the U+ notation is used for 
single codepoints and as near as I can tell from checking a few 
chapters, always has a trailing delimiter (space or punctuation). This 
is true even for successive codepoints. For example: "katakana letter 
ainu to can simply be mapped to the Unicode character sequence <U+30C8, 
U+309A>". Note that the authors did not simply write "U+30C8U+309A" as 
in this proposal. In other words, the proposal does not conform to the 
usage of the notation in the standard.
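
To make the delimited form concrete, here is a rough sketch of a reader 
for sequences written the way the standard writes them. The name 
parse_delimited and the choice of delimiters (whitespace, commas, angle 
brackets) are my own assumptions for illustration, not anything defined 
by the standard or by Python.

```python
import re

# Hypothetical helper: one 'U+' token is 4 to 6 hex digits and nothing else.
TOKEN = re.compile(r"U\+([0-9A-Fa-f]{4,6})")

def parse_delimited(notation):
    """Turn '<U+30C8, U+309A>' style text into a str, one code point per token."""
    tokens = [t for t in re.split(r"[\s,<>]+", notation) if t]
    chars = []
    for tok in tokens:
        m = TOKEN.fullmatch(tok)
        if m is None:
            # Jammed-together input such as 'U+30C8U+309A' lands here.
            raise ValueError(f"not a single delimited code point: {tok!r}")
        chars.append(chr(int(m.group(1), 16)))
    return "".join(chars)

ainu_to = parse_delimited("<U+30C8, U+309A>")
```

With this reading, the undelimited "U+30C8U+309A" is rejected rather than 
guessed at.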

In tables, the 'U+' is omitted. Sequential codepoints are separated by 
spaces for readability. For instance,
'0069 0307 0301' in one table stands for the single grapheme 'i̇́' 
(Lithuanian char) == '\u0069\u0307\u0301'
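
As a small check of the table notation, the following sketch converts a 
space-separated table cell into a string and compares it with the '\u' 
form. The helper name from_table is hypothetical.

```python
import unicodedata

def from_table(cell):
    """Convert a table cell like '0069 0307 0301' into the str it denotes."""
    return "".join(chr(int(h, 16)) for h in cell.split())

s = from_table("0069 0307 0301")
same = (s == "\u0069\u0307\u0301")          # the escape form spells the same string
names = [unicodedata.name(c) for c in s]    # one name per code point
print(len(s), names)
```

The string has three code points even though it displays as one grapheme.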

Even though a computer could parse 'U+0069U+0307U+0301' correctly, most 
human eyes will see '+' as the separator. I find this more painful to 
read than the '\' form.

>>> Octal escapes already support from 1 to 3 oct digits.

And they are awful to use in string literals, as opposed to numbers.
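
A few literals show why. This is just current Python behavior for octal 
escapes, with variable names picked for the example.

```python
# An octal escape consumes up to 3 octal digits and then stops,
# so the reader has to count digits to know where the escape ends.
a = "\1234"   # '\123' (octal 123, i.e. 'S') followed by the character '4'
b = "\08"     # '\0' (NUL) followed by '8', because 8 is not an octal digit
c = "\101"    # octal 101 is 65, i.e. 'A'

print(len(a), len(b), len(c))
```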

>>> In some languages (but not Python), hex escapes support from 1
>>> to an unlimited number of hex digits.

That is fine for numbers. For strings, 2*n hex digits often (typically?) 
means n bytes.
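
For instance, under that common reading (an illustration only, reusing 
the katakana code points from above):

```python
# 8 hex digits read as 4 bytes, not as one or two numbers.
raw = bytes.fromhex("30C8309A")
n_bytes = len(raw)

# Interpreted as big-endian UTF-16, those bytes are U+30C8 U+309A.
text = raw.decode("utf-16-be")
print(n_bytes, [f"U+{ord(ch):04X}" for ch in text])
```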

>> What should 'U+12345' be? U+12345 CUNEIFORM SIGN URU TIMES KI or
>> U+1234 ETHIOPIC SYLLABLE SEE and a digit 5?

> There is no ambiguity.

But there is a problem. What if a person (an Ethiopian?) *wants* to 
write U+1234 ETHIOPIC SYLLABLE SEE and a digit 5 as a 2-character 
identifier? You really expect someone to translate '5' into 'U+00xx'?

> Just like oct escapes, the longest valid sequence
> (up to the maximum) would be used. If you used the shortest, then there
> would be no way to specify 5 or 6 digit sequences.

As I said above, there is no ambiguity in the standard because they do 
not jam codepoints (with or without 'U+') together without 
non-alphanumeric delimiters.
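
To illustrate the longest-match rule under discussion, here is a sketch 
of how I understand it would behave. The function decode_greedy and the 
4-to-6-digit limit are assumptions made for this example; the proposal 
itself may differ in detail.

```python
import re

# Assumed rule: 'U+' followed by the longest run of 4 to 6 hex digits.
GREEDY = re.compile(r"U\+([0-9A-Fa-f]{4,6})")

def decode_greedy(text):
    """Replace each greedy 'U+XXXX' match with the code point it names."""
    def repl(m):
        digits = m.group(1)
        # "Longest valid": drop trailing digits while the value is out of range.
        while int(digits, 16) > 0x10FFFF:
            digits = digits[:-1]
        leftover = m.group(1)[len(digits):]
        return chr(int(digits, 16)) + leftover
    return GREEDY.sub(repl, text)

one = decode_greedy("U+12345")        # a single code point, U+12345
two = decode_greedy("U+1234U+0035")   # U+1234 followed by the digit 5
print(len(one), len(two))
```

So the writer who wants U+1234 followed by '5' cannot type the '5' 
directly and has to spell it as its own code point.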

-- 
Terry Jan Reedy