[Python-ideas] Support Unicode code point notation
Terry Reedy
tjreedy at udel.edu
Sat Jul 27 23:58:44 CEST 2013
On 7/27/2013 7:22 AM, Steven D'Aprano wrote:
> On 27/07/13 20:22, Ian Foote wrote:
>> On 27/07/13 11:01, Steven D'Aprano wrote:
>
>>> Variable number of digits? Isn't that a bad thing?
>>> --------------------------------------------------
>>>
>>> It's neither good nor bad.
It is wretched. In the unicode standard, the U+ notation is used for
single codepoints and as near as I can tell from checking a few
chapters, always has a trailing delimiter (space or punctuation). This
is true even for successive codepoints. For example: "katakana letter
ainu to can simply be mapped to the Unicode character sequence <U+30C8,
U+309A>". Note that the authors did not simply write "U+30C8U+309A" as
in this proposal. In other words, the proposal does not conform to the
usage of the notation in the standard.
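
A minimal sketch of reading the delimited form the standard uses, assuming 4 to 6 hex digits after 'U+' and a non-hex character (or end of text) after each codepoint. The names CODEPOINT and parse_delimited are made up for illustration; they are not part of any proposal or library.

```python
import re

# Hypothetical helper: 'U+' followed by 4-6 hex digits, and then something
# that is not a hex digit (a delimiter) or the end of the text.
CODEPOINT = re.compile(r"U\+([0-9A-Fa-f]{4,6})(?![0-9A-Fa-f])")

def parse_delimited(text):
    """Return the string spelled by the delimited U+ codepoints in text."""
    return "".join(chr(int(digits, 16)) for digits in CODEPOINT.findall(text))

# The sequence quoted from the standard: two codepoints, comma-separated.
ainu_to = parse_delimited("<U+30C8, U+309A>")
print(len(ainu_to))  # 2
```

Because every codepoint ends at a delimiter, the parser never has to guess where one ends and the next begins.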
In tables, the 'U+' is omitted. Sequential codepoints are separated by
spaces for readability. For instance,
'0069 0307 0301' in one table stands for the single grapheme 'i̇́'
(Lithuanian char) == '\u0069\u0307\u0301'
Even though a computer could parse 'U+0069U+0307U+0301' correctly, most
human eyes will see '+' as the separator. I find this more painful to
read than the '\' form.
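
As a rough illustration of the table form, here is a sketch that turns a space-separated cell into a string and checks it against the '\u' spelling. The helper name from_table_notation is made up for this example.

```python
import unicodedata

def from_table_notation(cell):
    """Convert space-separated hex codepoints (no 'U+') into a string."""
    return "".join(chr(int(token, 16)) for token in cell.split())

s = from_table_notation("0069 0307 0301")
print(s == "\u0069\u0307\u0301")  # True
print(len(s))                     # 3 codepoints, displayed as one grapheme
for ch in s:
    print("U+%04X" % ord(ch), unicodedata.name(ch))
```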
>>> Octal escapes already support from 1 to 3 oct digits.
And they are awful to use in string literals, as opposed to numbers.
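
A short sketch of why variable-length octal escapes are awkward in string literals: the escape takes up to three octal digits, so one more digit silently changes from extending the escape to being a separate character.

```python
# Octal escapes consume 1 to 3 octal digits; a 4th digit is a literal character.
samples = ["\1", "\12", "\123", "\1234"]
lengths = [len(s) for s in samples]
print(lengths)             # [1, 1, 1, 2]
print("\12" == "\n")       # True: octal 12 is decimal 10, a newline
print("\1234" == "S" "4")  # True: '\123' is 'S', followed by the digit '4'
```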
>>> In some languages (but not Python), hex escapes support from 1
>>> to an unlimited number of hex digits.
That is fine for numbers. For strings, 2*n hex digits often (typically?)
means n bytes.
>> What should 'U+12345' be? U+12345 CUNEIFORM SIGN URU TIMES KI or
>> U+1234 ETHIOPIC SYLLABLE SEE and a digit 5?
> There is no ambiguity.
But there is a problem. What if a person (an Ethiopian?) *wants* to
write U+1234 ETHIOPIC SYLLABLE SEE and a digit 5 as a 2-character
identifier? Do you really expect someone to translate '5' into 'U+00xx'?
> Just like oct escapes, the longest valid sequence
> (up to the maximum) would be used. If you used the shortest, then there
> would be no way to specify 5 or 6 digit sequences.
As I said above, there is no ambiguity in the standard because they do
not jam codepoints (with or without 'U+') together without
non-alphanumeric delimiters.
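
A sketch of the longest-match rule applied to undelimited text, assuming 4 to 6 hex digits after 'U+' and a maximum of U+10FFFF. The function decode_greedy is hypothetical and only illustrates the rule as described in this thread; it is not an implementation of the proposal.

```python
import string

HEX = set(string.hexdigits)

def decode_greedy(text):
    """Replace each 'U+' escape with the longest valid 4-6 digit codepoint."""
    out = []
    i = 0
    while i < len(text):
        if text.startswith("U+", i):
            for n in (6, 5, 4):  # try the longest candidate first
                digits = text[i + 2:i + 2 + n]
                if (len(digits) == n and all(c in HEX for c in digits)
                        and int(digits, 16) <= 0x10FFFF):
                    out.append(chr(int(digits, 16)))
                    i += 2 + n
                    break
            else:
                out.append(text[i])  # not a valid escape; keep the 'U'
                i += 1
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

print(len(decode_greedy("U+12345")))       # 1: the cuneiform sign
print(len(decode_greedy("U+1234U+0035")))  # 2: the syllable, then '5'
```

Under this rule the only way to get U+1234 followed by a digit 5 is to spell the digit itself as U+0035.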
--
Terry Jan Reedy