Unicode's standard notation for code points is U+ followed by a 4, 5 or 6 hex digit string, such as π = U+03C0. This notation is found throughout the Unicode Consortium's website, e.g.:

http://www.unicode.org/versions/corrigendum2.html

as well as in third party sites that have reason to discuss Unicode code points, e.g.:

https://en.wikipedia.org/wiki/Eth#Computer_input

I propose that Python strings support this as the preferred escape notation for Unicode code points:

'\U+03C0' => 'π'

The existing \U and \u variants must be kept for backwards compatibility, but should be (mildly) discouraged in new code.

Doesn't this violate "Only One Way To Do It"?
---------------------------------------------

That's not what the Zen says. The Zen says there should be One Obvious Way to do it, not Only One. It is my hope that we can agree that the One Obvious Way to refer to a Unicode character by its code point is by using the same notation that the Unicode Consortium uses:

d <=> U+0064

and leave legacy escape sequences as the not-so-obvious ways to do it:

\x64 \144 \u0064 \U00000064

Why do we need yet another way of writing escape sequences?
-----------------------------------------------------------

We don't need another one, we need a better one. U+xxxx is the standard Unicode notation, while existing Python escapes have various problems.

One-byte hex and oct escapes are a throwback to the old one-byte ASCII days, and reflect an obsolete idea of strings being equivalent to bytes. Backwards compatibility requires that we continue to support them, but they shouldn't be encouraged in strings.

Two-byte \u escapes are harmless, so long as you imagine that Unicode is a 16-bit character set. Unfortunately, it is not. \u does not support code points in the Supplementary Multilingual Planes (those with ordinal value greater than 0xFFFF), and can silently give the wrong result if you make a mistake in counting digits:

# I want EGYPTIAN HIEROGLYPH D010 (Eye of Horus)
s = '\u13080'
=> oops, I get 'ገ0' (ETHIOPIC SYLLABLE GA, ZERO)

Four-byte \U escape sequences support the entire Unicode character set, but they are terribly verbose, and the first three digits are *always* zero. Python doesn't (and shouldn't) support \U escapes beyond 10FFFF, so the first three digits of the eight digit hex value are pointless.

What is the U+ escape specification?
------------------------------------

http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-li...

lists the escape sequences, including:

\uxxxx      Character with 16-bit hex value xxxx
\Uxxxxxxxx  Character with 32-bit hex value xxxxxxxx

To this should be added:

\U+xxxx     Character at code point xxxx (hex)

with the note:

Exactly 4, 5 or 6 hexadecimal digits are required.

Upper or lower case?
--------------------

Uppercase should be preferred, as the Unicode Consortium uses it, but both should be accepted.

Variable number of digits? Isn't that a bad thing?
--------------------------------------------------

It's neither good nor bad. Octal escapes already support from 1 to 3 oct digits. In some languages (but not Python), hex escapes support from 1 to an unlimited number of hex digits.

Is this backwards compatible?
-----------------------------

I believe it is. As of Python 3.3, strings using \U+ give a syntax error:

py> '\U+13080'
  File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-7: end of string in escape sequence

What deprecation schedule are you proposing?
--------------------------------------------

I'm not. At least, the existing features should not be considered for removal before Python 4000. In the meantime, the U+ form should be noted as the preferred way, and perhaps blessed in PEP 8.

Should string reprs use the U+ form?
------------------------------------

\u escapes are sometimes used in string reprs, e.g. for private-use characters:

py> chr(0xE034)
'\ue034'

Should this change to '\U+E034'? My personal preference is that it should, but I fear backwards compatibility may prevent it. Even if the exact form of str.__repr__ is not guaranteed, changing the repr would break (e.g.) some doctests.

This proposal defers any discussion of changing the repr of strings to use U+ escapes.

-- Steven
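[Editorial note: the \u truncation pitfall and the \U verbosity described in the proposal can be checked in any Python 3 session; a minimal demonstration:]

```python
# Demonstrating the pitfalls described above with today's escapes.
# '\u' consumes exactly four hex digits, so a five-digit code point
# is silently split into a character plus a literal digit:
s = '\u13080'                    # intended: EGYPTIAN HIEROGLYPH D010
assert s == '\u1308' + '0'       # actually ETHIOPIC SYLLABLE GA + '0'
assert len(s) == 2

# The working spelling today needs the verbose 8-digit \U form,
# with leading zeroes that carry no information:
eye_of_horus = '\U00013080'
assert len(eye_of_horus) == 1
assert ord(eye_of_horus) == 0x13080
```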
Steven D'Aprano wrote:
Unicode's standard notation for code points is U+ followed by a 4, 5 or 6 hex digit string, such as π = U+03C0. This notation is found throughout the Unicode Consortium's website, e.g.:
http://www.unicode.org/versions/corrigendum2.html
as well as in third party sites that have reason to discuss Unicode code points, e.g.:
https://en.wikipedia.org/wiki/Eth#Computer_input
I propose that Python strings support this as the preferred escape notation for Unicode code points:
'\U+03C0' => 'π'
-1.

The \u and \U notations are standard in several programming languages, e.g. Java and C++, so we're in good company.

--
Marc-Andre Lemburg
eGenix.com
On 27/07/13 11:01, Steven D'Aprano wrote:
[snip]
What should 'U+12345' be?

U+12345 CUNEIFORM SIGN URU TIMES KI, or U+1234 ETHIOPIC SYLLABLE SEE and a digit 5?

-1 without a clear way to disambiguate.

Regards,
Ian
On Jul 27, 2013, at 12:22, Ian Foote wrote:
[snip]
What should 'U+12345' be? U+12345 CUNEIFORM SIGN URU TIMES KI or U+1234 ETHIOPIC SYLLABLE SEE and a digit 5?
-1 without a clear way to disambiguate.
We already have the exact same problem with octal escapes. They can be one to three digits, ending at the first non-octal-digit character (or end of string). So '\123' is unambiguously 'S', while '\128' is unambiguously '\n8'. Not exactly beautiful, but simple, and a precedent going back to the earliest days of Python, and beyond it to C. So if we followed the same rule, '\U+12345' would unambiguously be character U+12345, while '\U+1234@' would be U+1234 and a @.

That doesn't mean it's necessarily a good idea. After all, we don't allow 1-char hex escapes. And octal escapes are already pretty weird, in that they don't encode only characters up to 127 (as in C) or all of Unicode, but everything up to 511 (because that happens to be the max you can fit into the rules), so maybe they're not a great precedent to follow.
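[Editorial note: the octal precedent described here is easy to verify in Python 3:]

```python
# Octal escapes take up to three octal digits, stopping early at the
# first non-octal character -- the longest-match rule described above.
assert '\123' == 'S'        # \123 -> chr(0o123) == 'S'
assert '\128' == '\n8'      # '8' is not octal, so only \12 (newline) is consumed
assert '\777' == chr(511)   # octal escapes reach 511, past C's one-byte limit
```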
On Sat, Jul 27, 2013 at 12:01 PM, Steven D'Aprano wrote:
Unicode's standard notation for code points is U+ followed by a 4, 5 or 6 hex digit string, such as π = U+03C0. This notation is found throughout the Unicode Consortium's website, e.g.:
http://www.unicode.org/versions/corrigendum2.html
as well as in third party sites that have reason to discuss Unicode code points, e.g.:
https://en.wikipedia.org/wiki/Eth#Computer_input
I propose that Python strings support this as the preferred escape notation for Unicode code points:
'\U+03C0' => 'π'
The existing \U and \u variants must be kept for backwards compatibility, but should be (mildly) discouraged in new code.
As Marc-Andre Lemburg said, C, C++ and Java use the same notation as Python does. And there is NO programming language implementing the U+ syntax. Why should we? Why should we violate de-facto standards?

Existing programming languages use one or more of:

a) \uHHHH
b) \UHHHHHHHH
c) \u{H..HHHHHH} (e.g. Ruby)
d) \xH..HH
e) \x{H..HHHHHH}
f) \O..OOO

and probably some more variants I am not aware of or forgot about, but there is probably no programming language that does \U+{H..HHHHHH}, so why should we?
Doesn't this violate "Only One Way To Do It"? ---------------------------------------------
That's not what the Zen says. The Zen says there should be One Obvious Way to do it, not Only One. It is my hope that we can agree that the One Obvious Way to refer to a Unicode character by its code point is by using the same notation that the Unicode Consortium uses:
d <=> U+0064
and leave legacy escape sequences as the not-so-obvious ways to do it:
\x64 \144 \u0064 \U00000064
For C, C++, Java and other programmers, the ABOVE ways are the obvious ways to do it. \U+ definitely is not. Even something as basic as GNU echo uses the \u \U syntax.
Why do we need yet another way of writing escape sequences? -----------------------------------------------------------
We don't need another one, we need a better one. U+xxxx is the standard Unicode notation, while existing Python escapes have various problems.
…standard notation that NO programming language uses. In English, sure thing — go for those fancy U+H..HHHHHH things, that’s what they are for.
One-byte hex and oct escapes are a throwback to the old one-byte ASCII days, and reflect an obsolete idea of strings being equivalent to bytes. Backwards compatibility requires that we continue to support them, but they shouldn't be encouraged in strings.
Py2k’s str or Py3k’s bytes still exist and are used. This is also where you would use \xHH or \OOO.
Two-byte \u escapes are harmless, so long as you imagine that Unicode is a 16-bit character set. Unfortunately, it is not. \u does not support code points in the Supplementary Multilingual Planes (those with ordinal value greater than 0xFFFF), and can silently give the wrong result if you make a mistake in counting digits:
# I want EGYPTIAN HIEROGLYPH D010 (Eye of Horus) s = '\u13080' => oops, I get 'ገ0' (ETHIOPIC SYLLABLE GA, ZERO)
Four-byte \U escape sequences support the entire Unicode character set, but they are terribly verbose, and the first three digits are *always* zero. Python doesn't (and shouldn't) support \U escapes beyond 10FFFF, so the first three digits of the eight digit hex value are pointless.
Ruby handles this wonderfully with what I called syntax (c) above. So, maybe instead of this, let’s get working on \u{H..HHHHHH}?
[snip]
Variable number of digits? Isn't that a bad thing? --------------------------------------------------
It's neither good nor bad. Octal escapes already support from 1 to 3 oct digits. In some languages (but not Python), hex escapes support from 1 to an unlimited number of hex digits.
This is bad, because of hex digits. Consider this:

'\U+0002two'

would get us Start of Text (aka ^B), and the letters 't', 'w' and 'o'. And when we wanted to go with French,

'\U+0002deux'

we would find ourselves with MODIFIER LETTER RHOTIC HOOK, 'u' and 'x'. Uh-oh!

(Example above based on another one from the Unicode mailing list archives.)
[snip]
Overall, huge nonsense. If you care about some wasted zeroes, why not propose to steal Ruby's syntax, denoted as (c) in this message?

--
Chris “Kwpolska” Warrick
On 27/07/13 20:22, Ian Foote wrote:
On 27/07/13 11:01, Steven D'Aprano wrote:
Variable number of digits? Isn't that a bad thing? --------------------------------------------------
It's neither good nor bad. Octal escapes already support from 1 to 3 oct digits. In some languages (but not Python), hex escapes support from 1 to an unlimited number of hex digits.
What should 'U+12345' be? U+12345 CUNEIFORM SIGN URU TIMES KI or U+1234 ETHIOPIC SYLLABLE SEE and a digit 5?
There is no ambiguity. Just like oct escapes, the longest valid sequence (up to the maximum) would be used. If you used the shortest, then there would be no way to specify 5 or 6 digit sequences. -- Steven
On Sat, Jul 27, 2013 at 12:22 PM, Steven D'Aprano wrote:
On 27/07/13 20:22, Ian Foote wrote:
On 27/07/13 11:01, Steven D'Aprano wrote:
Variable number of digits? Isn't that a bad thing? --------------------------------------------------
It's neither good nor bad. Octal escapes already support from 1 to 3 oct digits. In some languages (but not Python), hex escapes support from 1 to an unlimited number of hex digits.
What should 'U+12345' be? U+12345 CUNEIFORM SIGN URU TIMES KI or U+1234 ETHIOPIC SYLLABLE SEE and a digit 5?
There is no ambiguity. Just like oct escapes, the longest valid sequence (up to the maximum) would be used. If you used the shortest, then there would be no way to specify 5 or 6 digit sequences.
In a vacuum, \U+12345 seems like a good thing. But two issues dog it: incompatibility with *every other language*, and the inability to follow it with a hex digit. With octal escapes, there's a limit of three digits, so you can simply stuff in an extra zero or two:
"\1234"
'S4'
"\01234"
'\n34'
"\001234"
'\x01234'
Granted, this isn't the case in all languages, but it's a reasonable convention to stick to. How many digits should be permitted in \U+ notation? Six? Eight? Will a quick eyeball of a string literal be able to figure out the correct interpretation of "\U+0012345678"? Also, this is a problem with a lot more characters than it is with octal, which unambiguously stops after any non-digit; in hex, there are two additional digits (8, 9) and twelve very common ASCII letters (A-F, a-f) which can cause problems. I foresee issues like with Windows paths in non-raw strings:
"c:\qwer"
'c:\\qwer'
"c:\asdf"
'c:\x07sdf'
Some work, some don't. You'll put in a convenient four or five digit
Unicode escape, follow it with a non-hex letter, and then later on
come and edit and confuse yourself no end.
I'm -1 on the proposal, primarily because it's different from
everything else without being a significant improvement over them.
On Sat, Jul 27, 2013 at 12:25 PM, Steven D'Aprano wrote:
On 27/07/13 20:07, M.-A. Lemburg wrote:
The \u and \U notations are standard in several programming languages, e.g. Java and C++, so we're in good company.
Given the problems with both \u and \U escapes, I think it is better to say we're in bad company.
Good or bad, it's a large company, and that *in itself* is of value. ChrisA
On 27/07/13 22:37, Chris Angelico wrote:
In a vacuum, \U+12345 seems like a good thing. But two issues dog it: incompatibility with *every other language*,
Every language is incompatible with every other language. That's why they are different languages. Some languages happen to share a few (or many) similarities, but they are dwarfed by the differences. And yet we manage. Do you really mean to suggest that a C programmer is capable of interpreting U+2345 when reading about code points on Wikipedia, but will be confused when reading '\U+2345' in Python code? Surely not. But if so, I suggest that Python's \x escapes will also confuse him, since Python's \x is incompatible with C's \x. (We even mention that difference in the docs.) As well as Python's significant indentation, duck typing, and, most of all, lack of braces.
and the inability to follow it with a hex digit. With octal escapes, there's a limit of three digits, so you can simply stuff in an extra zero or two:
You would simply do the same as you already do for octal escapes: stuff in an extra zero or two: '\U+0003B82' => U+03B8 followed by 2 There's never any need to add more than two zeroes, since you can't use fewer than four or more than six digits in total.
How many digits should be permitted in \U+ notation? Six? Eight?
The Unicode standard uses exactly four, five or six hex digits for code points. The smallest code point is U+0000, and the largest is U+10FFFF. So:

'\U+FFpq' will be a SyntaxError, just like '\uFFpq' today;
'\U+FFFFFF' will be a SyntaxError, just like '\U00FFFFFF' today;
'\U+00F2' will be unambiguously interpreted as a four digit hex escape;
'\U+00FF2' will be unambiguously interpreted as a five digit hex escape;
'\U+00FFF2' will be unambiguously interpreted as a six digit hex escape;
'\U+00FFFF2' will be unambiguously interpreted as U+FFFF followed by 2.
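[Editorial note: the greedy 4-to-6-digit rule described here can be captured in a few lines. The sketch below is hypothetical — the function name is invented, and ValueError stands in for the SyntaxError a real tokenizer would raise:]

```python
import re

def parse_uplus(tail):
    """Parse the proposed \\U+ escape: greedily take 4-6 hex digits
    from the start of `tail`; the value must not exceed U+10FFFF.
    Returns (character, number_of_digits_consumed)."""
    m = re.match(r'[0-9A-Fa-f]{4,6}', tail)
    if m is None:
        # fewer than four hex digits, e.g. 'FFpq'
        raise ValueError('\\U+ escape needs at least 4 hex digits')
    value = int(m.group(), 16)
    if value > 0x10FFFF:
        # e.g. 'FFFFFF' -- no backtracking, just an error
        raise ValueError('\\U+%s is past U+10FFFF' % m.group())
    return chr(value), m.end()

# The cases from the text:
assert parse_uplus('00F2')    == ('\u00f2', 4)   # four digit escape
assert parse_uplus('00FF2')   == ('\u0ff2', 5)   # five digit escape
assert parse_uplus('00FFF2')  == ('\ufff2', 6)   # six digit escape
assert parse_uplus('00FFFF2') == ('\uffff', 6)   # U+FFFF, then a literal '2'
```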
Will a quick eyeball of a string literal be able to figure out the correct interpretation of "\U+0012345678"?
I don't think that the existing hex escapes pass the "quick eyeball" test:

'M\u00fcller'

but your example above will be parsed as U+1234 followed by 5678.
Also, this is a problem with a lot more characters than it is with octal, which unambiguously stops after any non-digit; in hex, there are two additional digits (8, 9) and twelve very common ASCII letters (A-F, a-f) which can cause problems. I foresee issues like with Windows paths in non-raw strings:
"c:\qwer"
'c:\\qwer'
"c:\asdf"
'c:\x07sdf'
Some work, some don't. You'll put in a convenient four or five digit Unicode escape, follow it with a non-hex letter, and then later on come and edit and confuse yourself no end.
'C:\Products\Umbrellas' has the same problem. This is an issue with Windows path names, not my proposal. You don't even need Unicode to be bitten by this issue, just a name starting with n, t, x, etc. -- Steven
On 27/07/2013 12:17, Chris “Kwpolska” Warrick wrote:
[snip]
As Marc-Andre Lemburg said, C, C++ and Java use the same notation as Python does.
And there is NO programming language implementing the U+ syntax. Why should we? Why should we violate de-facto standards?
Existing programming languages use one or more of:
a) \uHHHH
b) \UHHHHHHHH
c) \u{H..HHHHHH} (e.g. Ruby)
d) \xH..HH
e) \x{H..HHHHHH}
f) \O..OOO
and probably some more variants I am not aware of or forgot about, but there is probably no programming language that does \U+{H..HHHHHH}, so why should we?
[snip]

Perl supports "\N{U+1234}" and "\x{1234}", and I believe that some languages also support "\x{41 42 43}" as an abbreviation for "\x{41}\x{42}\x{43}".

As others have said, "\U+1234" suffers from the same problem as octal escapes.

-1
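[Editorial note: worth noting alongside Perl's \N{U+...}: Python already has a \N{name} escape for spelling code points by name, which a quick session confirms:]

```python
import unicodedata

# Python's existing \N{...} escape takes a Unicode character name,
# braces and all, much like Perl's \N{...} form:
assert '\N{GREEK SMALL LETTER PI}' == '\u03c0'
assert unicodedata.name('\u03c0') == 'GREEK SMALL LETTER PI'
# The braces delimit the escape, so no run-on ambiguity is possible:
assert '\N{LATIN SMALL LETTER A}bc' == 'abc'
```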
On 27/07/13 21:17, Chris “Kwpolska” Warrick wrote:
And there is NO programming language implementing the U+ syntax.
That is incorrect. LispWorks and MIT Scheme both support #\U+ syntax:

http://www.lispworks.com/documentation/lw50/LWUG/html/lwuser-352.htm
http://web.mit.edu/scheme_v9.0.1/doc/mit-scheme-ref/External-Representation-...

as does a project "CLforJava":

https://groups.google.com/forum/#!topic/comp.lang.lisp/pUjKLYLgrVA

(The leading # is Lisp syntax to create a character.)

CSS supports U+ syntax for both individual characters and ranges:

http://www.w3.org/TR/css3-fonts/#unicode-range-desc

BitC does something similar to what I am suggesting:

http://www.bitc-lang.org/docs/bitc/spec.html#stringlit

There may be others I am unaware of. So if you're worried about Python breaking new ground by supporting the standard Unicode notation for code points, don't worry, others have done so first.
Why should we? Why should we violate de-facto standards?
"The great thing about standards is there are so many to choose from." You listed six. Here are a few more:

http://billposer.org/Software/ListOfRepresentations.html

None of them are language-independent standards. There is only one language-independent standard for representing code points, and that is the U+xxxx standard used by the Unicode Consortium. There is a whole universe of Unicode discussion that makes no reference to C or Java escapes, but does reference U+xxxx code points. U+xxxx is the language-independent standard that *any* person familiar with Unicode should be able to understand, regardless of what programming language they use.

We're not "violating" anything. Python doesn't support Ruby's \u{xxxxxx} escape, does that mean we're "violating" Ruby's de facto standard? Or are they violating ours? No to both of those, of course. Python is not Ruby, and nothing we do can violate Ruby's standard. Or C's, or Java's.

[...]
Doesn't this violate "Only One Way To Do It"? ---------------------------------------------
That's not what the Zen says. The Zen says there should be One Obvious Way to do it, not Only One. It is my hope that we can agree that the One Obvious Way to refer to a Unicode character by its code point is by using the same notation that the Unicode Consortium uses:
d <=> U+0064
and leave legacy escape sequences as the not-so-obvious ways to do it:
\x64 \144 \u0064 \U00000064
For C, C++, Java and other programmers, the ABOVE ways are the obvious ways to do it. \U+ definitely is not. Even something as basic as GNU echo uses the \u \U syntax.
If you want C, C++, Java, Pascal, Forth, ... you know where to get them. This is Python, not C or Java, and we're discussing what is right for the Python language, not for C or Java. (Java still treats Unicode as a 16-bit charset. It isn't.)

While we can, and should, consider what other languages do, we should neither slavishly follow them into bad decisions, nor should we be scared to introduce features that they don't have. Whether my proposal is good or bad, it is what it is regardless of what other languages do.

C programmers find braces obvious. If you're unaware of the Pythonic response to the argument "we should do what C does", try this:

from __future__ import braces

[...]
One-byte hex and oct escapes are a throwback to the old one-byte ASCII days, and reflect an obsolete idea of strings being equivalent to bytes. Backwards compatibility requires that we continue to support them, but they shouldn't be encouraged in strings.
Py2k’s str or Py3k’s bytes still exist and are used. This is also where you would use \xHH or \OOO.
This proposal has nothing to do with bytes nor Python 2. Python 2 is closed to new features. Unicode escapes are irrelevant to bytes. Your comment here is a red herring.
Four-byte \U escape sequences support the entire Unicode character set, but they are terribly verbose, and the first three digits are *always* zero. Python doesn't (and shouldn't) support \U escapes beyond 10FFFF, so the first three digits of the eight digit hex value are pointless.
Correction: I obviously can't count, it is only the first two digits that are always zero.
Ruby handles this wonderfully with what I called syntax (c) above. So, maybe instead of this, let’s get working on \u{H..HHHHHH}?
Aside: you keep writing H..HHHHHH for Unicode code points. Unicode code points go up to hex 10FFFF, so an absolute maximum of six digits, not seven or more as you keep writing (four times, not that I'm counting :-)

As for Ruby's syntax, by your own argument, it "violates the de facto standard" of C, C++, Java, and, yes, Python. Perhaps you would like to tell Matz that it's a terrible idea because it is violating Python's standard?

But seriously, the biggest benefit I see from the Ruby syntax is you can write a sequence of code points:

\u{00E0 00E9 00EE 00F5 00FC} => àéîõü

but that's not my proposal. If somebody else wants to champion that, be my guest.
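[Editorial note: the Ruby-style sequence form discussed here is easy to emulate. The helper below is a hypothetical sketch — the function name is invented, and this is text post-processing, not a change to Python's literal syntax:]

```python
import re

def expand_ruby_u(text):
    """Hypothetical helper: expand Ruby-style \\u{...} escapes, which may
    hold several whitespace-separated hex code points, into characters."""
    def repl(m):
        return ''.join(chr(int(h, 16)) for h in m.group(1).split())
    return re.sub(r'\\u\{([0-9A-Fa-f ]+)\}', repl, text)

# The sequence example from the text:
assert expand_ruby_u(r'\u{00E0 00E9 00EE 00F5 00FC}') == 'àéîõü'
```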
[snip]
Variable number of digits? Isn't that a bad thing? --------------------------------------------------
It's neither good nor bad. Octal escapes already support from 1 to 3 oct digits. In some languages (but not Python), hex escapes support from 1 to an unlimited number of hex digits.
This is bad, because the characters that follow the escape may themselves be hex digits and get absorbed into it.
I have covered this objection in my reply to Chris Angelico. In short, you are no worse off than you already are if you use octal escapes. A U+ hex escape will, at most, need two extra leading zeroes to avoid running past the end, so to speak. Your example:
'\U+0002deux'
could be written as '\U+000002deux'. (Or any of the existing ways of writing it would continue to work. Since U+0002 is an ASCII control character, I would not object to it being written as '\x02' or '\2'.) While I acknowledge the issue you raise, I don't think much of this example. Surely in nearly any real-world example there would be some sort of separator between the control character and the word? '\U+0002 deux' Yes, the issue of digits following octal or U+ escapes is a real issue, but it is not a common issue, and the solution is *exactly* the same in both cases: add one or two extra zeroes.
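The octal precedent claimed above can be checked in today's Python; the padding trick works the same way there:

```python
# Octal escapes already use the longest valid run of digits (up to three),
# so an extra leading zero disambiguates, just as proposed for \U+:
assert '\21' == chr(0o21)        # both digits consumed: U+0011
assert '\0021' == chr(2) + '1'   # padded to three digits, then a literal '1'
```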
[snip]
Overall, huge nonsense. If you care about some wasted zeroes, why not propose to steal Ruby’s syntax, denoted as (c) in this message?
I don't merely care about wasted zeroes. I care about improving Python's Unicode model. What we call "characters" in Python actually are code points, and I believe we should support the standard notation for code points, even if we support other notation as well.

If you go to the Unicode.org website, or Wikipedia, or any other site that actually understands Unicode, they invariably talk about code points and use the U+ notation. But in Python, we use our own notation that is *just slightly different*, for little or no good reason. ("C programmers use it" is not a good reason for a language which is not a variant of C.)

While we must continue to support existing ways of escaping Unicode characters, I'd like to be able to tell people: "To enter a Unicode code point in a string, put a backslash in front of it." instead of telling them to count the number of hex digits, then either use \u or \U, and don't forget to pad it to eight digits if you use \U but not \u. Oh, and if you're tempted to copy and paste the code point from somewhere, you have to drop the U+ or it won't work.

Unicode's notation is nice and simple. If we had it first, would we prefer \uxxxx and \U00xxxxxx over it? I don't think so. -- Steven
On Sat, Jul 27, 2013 at 4:47 PM, Steven D'Aprano
Unicode's notation is nice and simple. If we had it first, would we prefer \uxxxx and \U00xxxxxx over it? I don't think so.
Almost certainly not. Like I said, I think your idea is great *in a vacuum*. Obviously the removal of the current notations is out of the question, which means that this is yet another way to specify a codepoint; and it's one that most programmers won't be looking for. (I stand corrected, though: I had thought that there were *no* other languages using this notation. Of course, this is a silly thought. There is almost nothing that hasn't already been done, somewhere.) If Python had supported this notation from the beginning of Unicode strings, or at least since 3.0, then adding \uxxxx would have been purely as a sop to C/Java/etc programmers, and it would likely have gone nowhere. How much value is gained by creating a new syntax, which now Python programmers have to understand in addition to the existing ones? Consistency across languages is fairly important; have you ever used \123 notation in a BIND file? http://rosuav.blogspot.com/2012/12/i-want-my-octal.html Maybe Python will start a new trend, and \U+1234 will become the new convention. Maybe that's a good thing. But how beneficial will it be, and how complicating? I'm weakening my stance to -0. ChrisA
Steven D'Aprano writes:
I propose that Python strings support this as the preferred escape notation for Unicode code points:
'\U+03C0' => 'π'
-1. Because:
The existing \U and \u variants must be kept for backwards compatibility, but should be (mildly) discouraged in new code.
OTOH, supporting "\N{U+03C0}" seems harmless, if not particularly useful, to me. However, I don't find it hard to imagine that some people would use it in preference to the \U and \u escapes, despite being somewhat verbose.
On 27/07/2013 17:46, Stephen J. Turnbull wrote:
Steven D'Aprano writes:
I propose that Python strings support this as the preferred escape notation for Unicode code points:
'\U+03C0' => 'π'
-1. Because:
The existing \U and \u variants must be kept for backwards compatibility, but should be (mildly) discouraged in new code.
OTOH, supporting "\N{U+03C0}" seems harmless, if not particularly useful, to me. However, I don't find it hard to imagine that some people would use it in preference to the \U and \u escapes, despite being somewhat verbose.
I think the point of "\N{U+03C0}" is that it lets you name all of the codepoints, even those that are as yet unnamed. :-)
On 27 July 2013 17:46, Stephen J. Turnbull
Steven D'Aprano writes:
I propose that Python strings support this as the preferred escape notation for Unicode code points:
'\U+03C0' => 'π'
-1. Because:
The existing \U and \u variants must be kept for backwards compatibility, but should be (mildly) discouraged in new code.
OTOH, supporting "\N{U+03C0}" seems harmless, if not particularly useful, to me. However, I don't find it hard to imagine that some people would use it in preference to the \U and \u escapes, despite being somewhat verbose.
As a quick guess, I would. I don't like counting.
On 7/27/2013 7:22 AM, Steven D'Aprano wrote:
On 27/07/13 20:22, Ian Foote wrote:
On 27/07/13 11:01, Steven D'Aprano wrote:
Variable number of digits? Isn't that a bad thing? --------------------------------------------------
It's neither good nor bad.
It is wretched. In the unicode standard, the U+ notation is used for single codepoints and as near as I can tell from checking a few chapters, always has a trailing delimiter (space or punctuation). This is true even for successive codepoints. For example: "katakana letter ainu to can simply be mapped to the Unicode character sequence <U+30C8, U+309A>". Note that the authors did not simply write "U+30C8U+309A" as in this proposal. In other words, the proposal does not conform to the usage of the notation in the standard.

In tables, the 'U+' is omitted. Sequential codepoints are separated by spaces for readability. For instance, '0069 0307 0301' in one table stands for the single grapheme 'i̇́' (Lithuanian char) == '\u0069\u0307\u0301'. Even though a computer could parse 'U+0069U+0307U+0301' correctly, most human eyes will see '+' as the separator. I find this more painful to read than the '\' form.
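The Lithuanian example can be checked directly: the single grapheme is three separate code points.

```python
# one user-perceived character, three code points
s = '\u0069\u0307\u0301'
assert len(s) == 3
assert s == 'i' + '\N{COMBINING DOT ABOVE}' + '\N{COMBINING ACUTE ACCENT}'
```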
Octal escapes already support from 1 to 3 oct digits.
And there are awful to use in string literals, as opposed to numbers.
In some languages (but not Python), hex escapes support from 1 to an unlimited number of hex digits.
That is fine for numbers. For strings, 2*n hex digits often (typically?) means n bytes.
What should 'U+12345' be? U+12345 CUNEIFORM SIGN URU TIMES KI or U+1234 ETHIOPIC SYLLABLE SEE and a digit 5?
There is no ambiguity.
But there is a problem. What if a person (an Ethiopian?) *wants* to write U+1234 ETHIOPIC SYLLABLE SEE and a digit 5 as a 2 character identifier? You really expect someone to translate '5' into 'U+00xx'?
Just like oct escapes, the longest valid sequence (up to the maximum) would be used. If you used the shortest, then there would be no way to specify 5 or 6 digit sequences.
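One possible reading of the longest-match rule described above can be sketched in Python (a hypothetical `parse_u_plus` helper for illustration; the real implementation would live in the tokenizer):

```python
import re

_HEX = re.compile(r'([0-9A-Fa-f]{4,6})')

def parse_u_plus(rest):
    # Hypothetical sketch: consume the longest run of 4-6 hex digits that
    # still names a valid code point, backing off past 10FFFF.
    m = _HEX.match(rest)
    if not m:
        raise ValueError('need at least 4 hex digits after \\U+')
    digits = m.group(1)
    while int(digits, 16) > 0x10FFFF:
        digits = digits[:-1]
    return chr(int(digits, 16)), rest[len(digits):]

assert parse_u_plus('13080') == (chr(0x13080), '')       # 5 digits consumed
assert parse_u_plus('12345') == (chr(0x12345), '')       # longest match wins
assert parse_u_plus('0002deux') == (chr(0x2DE), 'ux')    # pitfall: d, e are hex too
assert parse_u_plus('000002deux') == ('\x02', 'deux')    # padding to 6 fixes it
```

Note that the third assertion illustrates exactly the '\U+0002deux' ambiguity raised earlier in the thread.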
As I said above, there is no ambiguity in the standard because they do not jam codepoints (with or without 'U+') together without non-alphanumeric delimiters. -- Terry Jan Reedy
Steven D'Aprano wrote:
Aside: you keep writing H..HHHHHH for Unicode code points. Unicode code points go up to hex 10FFFF,
They do *now*, but we can't be sure that they will stay that way in the future. This isn't a problem for the U+XXXX notation in informal usage, since it's usually written with surrounding whitespace or punctuation that makes it clear where the digits end. But the \U+XXXX syntax as currently proposed would bake in an absolute 6-digit limit that's impossible to ever extend.
I'd like to be able to tell people:
"To enter a Unicode code point in a string, put a backslash in front of it."
instead of telling them to count the number of hex digits,
But they're *still* going to have to count hex digits, and pad to 6 if it happens to be followed by a problematic character. If we're going to introduce something new, we might as well design it not to have silly, awkward properties like that. The Ruby \U{...} syntax has the following advantages: * Very clear, not prone to editing errors * No fixed limit on number of digits * Extends easily to multiple code points * Can optionally accept U+ for those who like that * Precedent exists in at least one other language Or we could invent something of our own, such as using another backslash as a delimiter: \U+1234\ Multiple characters could be written as: \U+1234+5678+9abc\ -- Greg
On Sun, Jul 28, 2013 at 12:14 AM, Greg Ewing
Steven D'Aprano wrote:
Aside: you keep writing H..HHHHHH for Unicode code points. Unicode code points go up to hex 10FFFF,
They do *now*, but we can't be sure that they will stay that way in the future.
They will for as long as UTF-16 is supported. Really, it would have been better all round if UTF-16 had never existed, and everyone just had to switch up to UTF-32; sure, memory would have been wasted, but concepts like PEP 393 would have been devised to deal with that, and we wouldn't have stupid bugs in 99% of programming languages. ChrisA
On Jul 28, 2013, at 1:18, Chris Angelico
On Sun, Jul 28, 2013 at 12:14 AM, Greg Ewing
wrote: Steven D'Aprano wrote:
Aside: you keep writing H..HHHHHH for Unicode code points. Unicode code points go up to hex 10FFFF,
They do *now*, but we can't be sure that they will stay that way in the future.
They will for as long as UTF-16 is supported. Really, it would have been better all round if UTF-16 had never existed, and everyone just had to switch up to UTF-32; sure, memory would have been wasted, but concepts like PEP 393 would have been devised to deal with that, and we wouldn't have stupid bugs in 99% of programming languages.
UTF-16 wouldn't have been a problem if it weren't almost compatible with UCS2, allowing all kinds of Unicode 1.0 software to misleadingly claim Unicode 2.0 support. (For example, for a long time, both Windows and Java "supported" UTF-16 by treating surrogate pairs as two characters instead of one, which is like "supporting" UTF-8 by treating it like ASCII--except that the bugs are much less likely to hit developers early in the cycle.) There are use cases for which UTF-16 is perfectly reasonable. For example, strings with lots of BMP CJK characters and an occasional non-BMP character aren't helped by PEP 393, or by UTF-8, but they are helped by UTF-16. (So long as you can rely on software not treating it as UCS2…) But anyway, this is pretty far off topic. Unicode could go past 10FFFF without dropping UTF-16, either by adding more surrogate pair ranges, or by adding surrogate triplets. It's really no different from extending UTF-8, which is no problem. The problem is that we have no way to predict how they will extend UTF-16, UTF-8, or code point notation if that ever happens. Assuming that the max length for a code point is six nibbles does sound like assuming nobody will ever need more than 640k characters.
On 28 Jul 2013 10:34, "Andrew Barnert"
On Jul 28, 2013, at 1:18, Chris Angelico
wrote: On Sun, Jul 28, 2013 at 12:14 AM, Greg Ewing
wrote: Steven D'Aprano wrote:
Aside: you keep writing H..HHHHHH for Unicode code points. Unicode code points go up to hex 10FFFF,
They do *now*, but we can't be sure that they will stay that way in the future.
They will for as long as UTF-16 is supported. Really, it would have been better all round if UTF-16 had never existed, and everyone just had to switch up to UTF-32; sure, memory would have been wasted, but concepts like PEP 393 would have been devised to deal with that, and we wouldn't have stupid bugs in 99% of programming languages.
UTF-16 wouldn't have been a problem if it weren't almost compatible with UCS2, allowing all kinds of Unicode 1.0 software to misleadingly claim Unicode 2.0 support. (For example, for a long time, both Windows and Java "supported" UTF-16 by treating surrogate pairs as two characters instead of one, which is like "supporting" UTF-8 by treating it like ASCII--except that the bugs are much less likely to hit developers early in the cycle.) There are use cases for which UTF-16 is perfectly reasonable. For example, strings with lots of BMP CJK characters and an occasional non-BMP character aren't helped by PEP 393, or by UTF-8, but they are helped by UTF-16. (So long as you can rely on software not treating it as UCS2…) But anyway, this is pretty far off topic.
Unicode could go past 10FFFF without dropping UTF-16, either by adding
more surrogate pair ranges, or by adding surrogate triplets. It's really no different from extending UTF-8, which is no problem.
The problem is that we have no way to predict how they will extend
UTF-16, UTF-8, or code point notation if that ever happens. Assuming that the max length for a code point is six nibbles does sound like assuming nobody will ever need more than 640k characters.

The idea of enhancing name based lookup by accepting the "U+" prefix as specifying a code point sounds good to me. It's already a delimited notation, doesn't require a new escape and, as someone else pointed out, allows \N to be used consistently, even if a code point doesn't have a name yet.

Cheers, Nick.
_______________________________________________ Python-ideas mailing list Python-ideas@python.org http://mail.python.org/mailman/listinfo/python-ideas
On 28/07/13 09:14, Greg Ewing wrote:
Steven D'Aprano wrote:
Aside: you keep writing H..HHHHHH for Unicode code points. Unicode code points go up to hex 10FFFF,
They do *now*, but we can't be sure that they will stay that way in the future.
Yes we can. The Unicode Consortium have guaranteed that Unicode will never be extended past code point U+10FFFF. I quote: Q: Will UTF-16 ever be extended to more than a million characters? A: No. Both Unicode and ISO 10646 have policies in place that formally limit future code assignment to the integer range that can be expressed with current UTF-16 (0 to 1,114,111). http://www.unicode.org/faq/utf_bom.html#utf16-6 Supporting some hypothetical "Super-hyper-mega-Code" in 2035 will be as big a change as adding Unicode in the first place. It will probably require a PEP :-) [...]
I'd like to be able to tell people:
"To enter a Unicode code point in a string, put a backslash in front of it."
instead of telling them to count the number of hex digits,
But they're *still* going to have to count hex digits, and pad to 6 if it happens to be followed by a problematic character.
Most uses of hex escapes aren't followed by another hex digit: there are in excess of a million Unicode code points, and less than 50 are hex digits (less than 30 if you exclude East-Asian full-width forms). To return to the example that keeps being given, if you're writing Ethiopian text, I don't think it is actually very likely that you will want to follow ETHIOPIC SYLLABLE SEE by a Latin digit 5 with no separator between them. Yes, it "might" happen, but there are trivial ways to deal with that, in no particular order:

- pad the code point to six digits
- don't use \U+, use a fixed-width \u or \U escape
- use string concatenation '\U+1234' '5'
- use string substitutions (% or format or $ templates).
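The same workarounds already apply to today's fixed-width escapes, which makes them easy to demonstrate:

```python
# ETHIOPIC SYLLABLE SEE followed by a literal digit 5, three ways
a = '\u1234' '5'            # implicit string concatenation
b = '\u12345'               # fixed-width \u stops after four digits anyway
c = '{}5'.format('\u1234')  # string substitution
assert a == b == c == chr(0x1234) + '5'
```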
If we're going to introduce something new, we might as well design it not to have silly, awkward properties like that.
The Ruby \U{...} syntax has the following advantages:
* Very clear, not prone to editing errors * No fixed limit on number of digits * Extends easily to multiple code points * Can optionally accept U+ for those who like that * Precedent exists in at least one other language
As I said earlier, if someone wants to champion that idea, I won't object.
Or we could invent something of our own, such as using another backslash as a delimiter:
\U+1234\
Multiple characters could be written as:
\U+1234+5678+9abc\
Another suggestion which was made is: \N{U+xxxx} (Sorry, I have forgotten who made that suggestion originally.) That could be extended to allow multiple space-separated code points: \N{U+xxxx U+yyyy U+zzzzz} or \N{U+xxxx yyyy zzzzz} -- Steven
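For reference, in the CPython of this thread (and in current releases) \N{} accepts only character names from the Unicode database, so the U+ form would be a pure extension:

```python
# current behaviour: \N{} takes a name from the UCD
assert '\N{GREEK SMALL LETTER PI}' == '\u03c0'

# the proposed \N{U+03C0} is a SyntaxError today; eval() lets us show
# that without breaking this snippet at compile time
try:
    eval(r"'\N{U+03C0}'")
except SyntaxError:
    pass
else:
    raise AssertionError('expected a SyntaxError')
```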
On 28/07/13 10:30, Andrew Barnert wrote:
Unicode could go past 10FFFF without dropping UTF-16, either by adding more surrogate pair ranges, or by adding surrogate triplets. It's really no different from extending UTF-8, which is no problem.
The problem is that we have no way to predict how they will extend UTF-16, UTF-8, or code point notation if that ever happens. Assuming that the max length for a code point is six nibbles does sound like assuming nobody will ever need more than 640k characters.
The Unicode Consortium formally guarantees stability of the character range U+0000 - U+10FFFF. http://www.unicode.org/faq/utf_bom.html#utf16-6 -- Steven
Steven D'Aprano, 28.07.2013 05:43:
Another suggestion which was made is:
\N{U+xxxx}
+1
That could be extended to allow multiple space-separated code points:
\N{U+xxxx U+yyyy U+zzzzz}
or
\N{U+xxxx yyyy zzzzz}
If I were up for bike shedding, I'd suggest to rather use comma separated code point values here. I don't think I have a preference regarding the repetition of the "U+" prefix (it looks less clear without it and feels redundant if you require it), but thinking of the cases where a sequence of two or more code points combines into one character makes it seem like a useful thing to support in general. Stefan
On Sun, Jul 28, 2013 at 4:57 AM, Steven D'Aprano
On 28/07/13 10:30, Andrew Barnert wrote:
Unicode could go past 10FFFF without dropping UTF-16, either by adding more surrogate pair ranges, or by adding surrogate triplets. It's really no different from extending UTF-8, which is no problem.
The problem is that we have no way to predict how they will extend UTF-16, UTF-8, or code point notation if that ever happens. Assuming that the max length for a code point is six nibbles does sound like assuming nobody will ever need more than 640k characters.
The Unicode Consortium formally guarantees stability of the character range U+0000 - U+10FFFF.
And to add to this: Surrogate triplets would majorly break one of the fundamentals of UTF-16, namely that it guarantees synchronizability. You can look at any 16-bit code unit and know whether it's a lead or trail surrogate. (Obviously if you write to a file or other byte stream, you have to have some out-of-band way to synchronize on bytes, that's separate.) So there's unlikely ever to be a scheme that extends UTF-16 to more characters. UTF-8 can in theory handle longer codes (and some encoders can simply use the same mathematical technique to encode numbers larger than 10FFFF, as we've already seen). The only way would be to declare UTF-16 as a flawed system, just as UCS-2 is. It's a system that can encode only the first 17 planes of Unicode. I doubt it'll ever happen, though, as there's no need for more space. ChrisA
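The self-synchronization property described here is easy to state in code: every 16-bit code unit can be classified in isolation, with no context.

```python
def classify_utf16_unit(unit):
    # every UTF-16 code unit identifies its own role on its own
    if 0xD800 <= unit <= 0xDBFF:
        return 'lead surrogate'
    if 0xDC00 <= unit <= 0xDFFF:
        return 'trail surrogate'
    return 'single unit (BMP code point)'

assert classify_utf16_unit(0x0041) == 'single unit (BMP code point)'
assert classify_utf16_unit(0xD80C) == 'lead surrogate'   # lead half of U+13080
assert classify_utf16_unit(0xDC80) == 'trail surrogate'  # trail half of U+13080
```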
Greg Ewing writes:
Steven D'Aprano wrote:
Aside: you keep writing H..HHHHHH for Unicode code points. Unicode code points go up to hex 10FFFF,
They do *now*, but we can't be sure that they will stay that way in the future.
In Unicode, they will. Blood was shed over the issue in the ISO 10646 committees before the standards could be unified. Huge amounts of software validate UTF-8 and UTF-16 including staying within the range, and won't easily be converted to accept extended ranges. So Unicode and ISO 10646 will stay within the current 17 planes. To go beyond that they'll need a new standard. In any case, it seems really unlikely that more than 1,000,000 code points will ever be needed, unless there's a mutation that makes all of *us* obsolete.
The Ruby \U{...} syntax has the following advantages:
So does the \N{U+XXXX} proposal, and it has the further advantage of indicating the obvious semantics as a name for this character/code point, which is consistent with the actual usage of the U+XXXX syntax in the standard.
Steven D'Aprano writes:
if you're writing Ethiopian text, I don't think it is actually very likely that you will want to follow ETHIOPIC SYLLABLE SEE by a Latin digit 5 with no separator between them. Yes, it "might" happen,
If you're writing Ethiopic text, I doubt you'll be using escape sequences to denote Ethiopic characters in the first place. I think it's hard to predict how these sequences are going to be used in the future. What I would worry about is not whether writers would "want" to use such sequences, but whether they'll bother to clean them up if they occur in the first place. The writer knows what she wants; it's the reader who has to parse the resulting mess.
(Sorry, I have forgotten who made that suggestion originally.) That could be extended to allow multiple space-separated code points:
\N{U+xxxx U+yyyy U+zzzzz}
or
\N{U+xxxx yyyy zzzzz}
This is a modal encoding, which has proved to be a really bad idea in its past incarnations. I hope that extension is never added to Python.
A bit of clarification:
On Sat, Jul 27, 2013 at 5:47 PM, Steven D'Aprano
Aside: you keep writing H..HHHHHH for Unicode code points. Unicode code points go up to hex 10FFFF, so an absolute maximum of six digits, not seven or more as you keep writing (four times, not that I'm counting :-)
My fancy syntax meant “up to six hex digits”. And 10FFFF is six digits long.
~~~
On Sun, Jul 28, 2013 at 1:14 AM, Greg Ewing
The Ruby \U{...} syntax has the following advantages:
It’s \u{}. "\U{}" results in "U{}", i.e. does not work.
* No fixed limit on number of digits
Are we still speaking of the Ruby implementation?

irb(main):002:0> "\u{1234567}"
SyntaxError: (irb):2: invalid Unicode codepoint (too large)
"\u{1234567}"
       ^
	from /usr/bin/irb:12:in `<main>'

-- Chris “Kwpolska” Warrick http://kwpolska.tk PGP: 5EAAEA16 stop html mail | always bottom-post | only UTF-8 makes sense
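Python's chr() enforces the same ceiling, so the two implementations agree on the limit:

```python
# chr() accepts exactly the range 0..0x10FFFF
assert chr(0x10FFFF)      # largest valid code point, no error
try:
    chr(0x1234567)        # the same too-large value from the Ruby example
except ValueError as e:
    msg = str(e)          # "chr() arg not in range(0x110000)"
assert 'range' in msg
```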
On 28/07/13 17:41, Stephen J. Turnbull wrote:
(Sorry, I have forgotten who made that suggestion originally.) That could be extended to allow multiple space-separated code points:
\N{U+xxxx U+yyyy U+zzzzz}
or
\N{U+xxxx yyyy zzzzz}
This is a modal encoding, which has proved to be a really bad idea in its past incarnations. I hope that extension is never added to Python.
Could you elaborate please? What do you mean "modal encoding", and what past incarnations are you referring to? -- Steven
On 28 July 2013 18:21, Steven D'Aprano
On 28/07/13 17:41, Stephen J. Turnbull wrote:
(Sorry, I have forgotten who made that suggestion originally.) That could be extended to allow multiple space-separated code points:
\N{U+xxxx U+yyyy U+zzzzz}
or
\N{U+xxxx yyyy zzzzz}
This is a modal encoding, which has proved to be a really bad idea in its past incarnations. I hope that extension is never added to Python.
Could you elaborate please? What do you mean "modal encoding", and what past incarnations are you referring to?
I believe what Stephen means is that it changes the \N{} notation from a relatively straightforward key lookup (where everything inside the "{}" refers to a single code point), to a two level parser, where the contents of the "{}" need to be further parsed to see if they refer to one code point or many. It doesn't bother me that much personally, especially if it was a general comma delimited capability that also worked with other code point names, but my inclination is to call YAGNI on the additional complexity. Using "modal encoding" to refer to that change isn't really valid though - Python string syntax is already modal, since "\N{" switches modes to "any characters until the next '}' are part of a code point name rather than part of the string contents", and similar statements can be made about the other escape sequences (especially the other Unicode related ones). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Steven D'Aprano writes:
On 28/07/13 17:41, Stephen J. Turnbull wrote:
(Sorry, I have forgotten who made that suggestion originally.) That could be extended to allow multiple space-separated code points:
\N{U+xxxx U+yyyy U+zzzzz}
or
\N{U+xxxx yyyy zzzzz}
This is a modal encoding, which has proved to be a really bad idea in its past incarnations. I hope that extension is never added to Python.
Could you elaborate please? What do you mean "modal encoding", and what past incarnations are you referring to?
A "modal encoding" is one in which the same combination of code units (here, ASCII characters) is interpreted differently depending on arbitrarily distant context. One only has to look at certain web pages or mail messages to see similar encodings (SGML numeric character entities, quoted-printable encoding of text using non-Latin character sets) abused to represent many lines of text. In such (ab)uses, it's very easy to corrupt the whole stream accidentally by losing one of the braces or by interpolating text encoded differently. Sure, it's easy for humans to recognize what's going on, and recover, when they encounter corrupted text interactively, but this is obviously not a convention that's intended for interactive human use! The main past incarnation is the ISO 2022 family. I see no advantage in "readability" of "\N{U+xxxx U+yyyy U+zzzzz}" or "\N{U+xxxx yyyy zzzzz}" over "\N{U+xxxx}\N{U+yyyy}\N{U+zzzzz}", and very little space savings. Worst, it violates the basic understanding that "\N{...}" is the name of one character or code point.
On 28 July 2013 19:05, Stephen J. Turnbull
Steven D'Aprano writes:
On 28/07/13 17:41, Stephen J. Turnbull wrote:
(Sorry, I have forgotten who made that suggestion originally.) That could be extended to allow multiple space-separated code points:
\N{U+xxxx U+yyyy U+zzzzz}
or
\N{U+xxxx yyyy zzzzz}
This is a modal encoding, which has proved to be a really bad idea in its past incarnations. I hope that extension is never added to Python.
Could you elaborate please? What do you mean "modal encoding", and what past incarnations are you referring to?
A "modal encoding" is one in which the same combination of code units (here, ASCII characters) is interpreted differently depending on arbitrarily distant context.
Ah, I had missed the "arbitrarily distant" sense you intended for modal encoding. Agreed, the fact that unicode escapes (including \N{}) are limited in length to a single code point is a definite win in that regard. Cheers, Nick. P.S. It occurs to me that the str.format mini-language has no such limitation, though:
>>> def hexchr(x):
...     return chr(int(x, 16))
...
>>> def hex2str(s):
...     return "".join(hexchr(x) for x in s.split())
...
>>> class chrformat:
...     def __format__(self, fmt):
...         return hex2str(fmt)
...
>>> "{:40 60 1234 e9}".format(chrformat())
'@`ሴé'
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Nick Coghlan writes:
It doesn't bother me that much personally, especially if it was a general comma delimited capability that also worked with other code point names,
I think it should bother you, though. It's not a problem for Python core developers, it's true. Similarly, ISO 2022 was a great idea in theory, and works fine for communication of text over streams. The problem is when you want to embed that stream in some higher-level protocol. So, for example, the original space-separated syntax breaks one-argument split-string, while your comma-separated version breaks CSV. You could fix both of those by using no separator and simply finishing the current code point on encountering "U+" or "}", but I doubt anybody would find that variant appealing. Now, for program literals this isn't going to matter because a string will be converted to internal representation by the compiler, and the program never sees that syntax. But what about applications like web frameworks which often eval client-supplied strings? I hope we are not going to recommend they eval them before validating them!<wink/>
but my inclination is to call YAGNI on the additional complexity.
"Using 'complexity' to refer to this syntax isn't really valid though - what it is, is 'complicated'."<wink/>
Using "modal encoding" to refer to that change isn't really valid though
No, it's quite correct, at least in ISO-land. There, a modal encoding is one which must maintain state across *code points*. The single- code-point "\N" syntax needs to maintain state across *code units*, but when it's done with a code *point*, it's done - there's no state to worry about before starting to parse the next one. By your definition, UTF-8 is modal, but that doesn't seem a very useful categorization to me.
On 28 July 2013 22:00, Stephen J. Turnbull
Nick Coghlan writes:
Using "modal encoding" to refer to that change isn't really valid though
No, it's quite correct, at least in ISO-land. There, a modal encoding is one which must maintain state across *code points*. The single- code-point "\N" syntax needs to maintain state across *code units*, but when it's done with a code *point*, it's done - there's no state to worry about before starting to parse the next one. By your definition, UTF-8 is modal, but that doesn't seem a very useful categorization to me.
My bytes-oriented comms background is showing ;) I agree, preserving the property that "one escape sequence = one code point" is valuable, so the proposal should just be to make this resolve to the right value: "\N{U+<code-point>}" It would also be more consistent if unicodedata.lookup() was updated to handle numeric code point names. Something like:
>>> import unicodedata
>>> def enhanced_lookup(name):
...     if name.startswith("U+"):
...         return chr(int(name[2:], 16))
...     return unicodedata.lookup(name)
...
>>> enhanced_lookup("GREEK SMALL LETTER ALPHA")
'α'
>>> enhanced_lookup("U+03B1")
'α'
Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
On 28/07/13 23:06, Nick Coghlan wrote:
It would also be more consistent if unicodedata.lookup() was updated to handle numeric code point names. Something like:
>>> import unicodedata
>>> def enhanced_lookup(name):
...     if name.startswith("U+"):
...         return chr(int(name[2:], 16))
...     return unicodedata.lookup(name)
...
>>> enhanced_lookup("GREEK SMALL LETTER ALPHA")
'α'
>>> enhanced_lookup("U+03B1")
'α'
Earlier, MRAB suggested that unicodedata.name() could return the U+ code point in the case of unnamed characters. I think it would be better to have a separate unicodedata function to return the code point, and leave the current behaviour of name() alone.

def codepoint(c):
    return 'U+{:04X}'.format(ord(c))

This should always succeed for any character. -- Steven
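A quick check of the proposed helper (restated here so the snippet is self-contained):

```python
def codepoint(c):
    return 'U+{:04X}'.format(ord(c))

assert codepoint('\u03c0') == 'U+03C0'        # pads to four digits
assert codepoint(chr(0x13080)) == 'U+13080'   # grows naturally past four
assert codepoint(chr(0xFFFF)) == 'U+FFFF'     # works for noncharacters too
```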
On 28/07/2013 18:29, Steven D'Aprano wrote:
On 28/07/13 23:06, Nick Coghlan wrote:
It would also be more consistent if unicodedata.lookup() was updated to handle numeric code point names. Something like:
>>> import unicodedata
>>> def enhanced_lookup(name):
...     if name.startswith("U+"):
...         return chr(int(name[2:], 16))
...     return unicodedata.lookup(name)
...
>>> enhanced_lookup("GREEK SMALL LETTER ALPHA")
'α'
>>> enhanced_lookup("U+03B1")
'α'
Earlier, MRAB suggested that unicodedata.name() could return the U+ code point in the case of unnamed characters.
What I said was: """I think the point of "\N{U+03C0}" is that it lets you name all of the codepoints, even those that are as yet unnamed.""" Whether unicodedata.name() could have a fallback is something I've never considered. Until now... :-)
I think it would be better to have a separate unicodedata function to return the code point, and leave the current behaviour of name() alone.
def codepoint(c):
    return 'U+{:04X}'.format(ord(c))
This should always succeed for any character.
Steven D'Aprano writes:
Earlier, MRAB suggested that unicodedata.name() could return the U+ code point in the case of unnamed characters. I think it would be better to have a separate unicodedata function to return the code point, and leave the current behaviour of name() alone.
His point, and I agree, is that it's not useful to have name() error,
as it does for unicodedata.name(chr(65535)). In that case I would
prefer that it return "U+FFFF NOT A CHARACTER" or something like that.
And for chr(65535*2) it would return "U+1FFFE UNASSIGNED IN VERSION
    def codepoint(c):
        return 'U+{:04X}'.format(ord(c))
This should always succeed for any character.
Or code point: it will succeed for things that aren't characters, such as chr(65535). As one-liners go, this does seem a reasonable candidate for the stdlib.

Steve
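For reference, a minimal sketch of that one-liner as a function (`codepoint` is the name Steven used above; the docstring is illustrative):

```python
def codepoint(c):
    """Return the standard U+NNNN designation for any code point."""
    return 'U+{:04X}'.format(ord(c))

# Succeeds even where unicodedata.name() raises:
print(codepoint('A'))           # U+0041
print(codepoint(chr(0xFFFF)))   # U+FFFF (a noncharacter)
print(codepoint(chr(0x1FFFF)))  # U+1FFFF
```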
I have raised an issue on the tracker for this: http://bugs.python.org/issue18614

-- Steven
I wonder if this should also support the special labels for characters without names:

    control-NNNN
    reserved-NNNN
    noncharacter-NNNN
    private-use-NNNN
    surrogate-NNNN

see p. 138 of http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf

I would think that unicodedata.name should not return these, but perhaps unicodedata.lookup should accept them. Note that the doc says that these are frequently displayed enclosed in <>, so perhaps

    unicodedata.lookup('U+0001') == unicodedata.lookup('control-0001') == unicodedata.lookup('<control-0001>') == '\x01'

--- Bruce
I'm hiring: http://www.cadencemd.com/info/jobs
Latest blog post: Alice's Puzzle Page http://www.vroospeak.com
Learn how hackers think: http://j.mp/gruyere-security
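A rough sketch of the extended lookup Bruce describes, assuming the five label prefixes and optional angle brackets (the name `lookup_with_labels` and the regex are illustrative, not a proposed API):

```python
import re
import unicodedata

# Constructed code point labels: optional <...>, one of the five
# prefixes, a hyphen-minus, and 4-6 hex digits.
_LABEL = re.compile(
    r'<?(control|reserved|noncharacter|private-use|surrogate)'
    r'-([0-9A-Fa-f]{4,6})>?$')

def lookup_with_labels(name):
    """Accept code point labels in addition to ordinary character names."""
    m = _LABEL.match(name)
    if m:
        return chr(int(m.group(2), 16))
    return unicodedata.lookup(name)
```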
On 8/1/2013 1:14 PM, Bruce Leban wrote:
I wonder if this should also support the special labels for characters without names:
control-NNNN reserved-NNNN noncharacter-NNNN private-use-NNNN surrogate-NNNN
see p. 138 of http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf
I would think that unicodedata.name should not return these, but perhaps unicodedata.lookup should accept them. Note that the doc says that these are frequently displayed enclosed in <>, so perhaps
unicodedata.lookup('U+0001') == unicodedata.lookup('control-0001') == unicodedata.lookup('<control-0001>') == '\x01'
That is a lot of added complication of both doc and code for what seems like little gain. Why would someone write 'control-' instead of 'U+'?

-- Terry Jan Reedy
On Thu, Aug 1, 2013 at 6:09 PM, Terry Reedy
Why would someone write 'control-' instead of 'U+'?
Because this is the recommended way to form the code-point labels:

"For each code point type without character names, code point labels are constructed by using a lowercase prefix derived from the code point type, followed by a hyphen-minus and then a 4- to 6-digit hexadecimal representation of the code point."

"To avoid any possible confusion with actual, non-null Name property values, constructed Unicode code point labels are often displayed between angle brackets: <control-0009>, <noncharacter-FFFF>, and so on. This convention is used consistently in the data files for the Unicode Character Database."

"A constructed code point label is distinguished from the designation of the code point itself (for example, “U+0009” or “U+FFFF”), which is also a unique identifier, as described in Appendix A, Notational Conventions."

<http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf>

I would rather see unicodedata.lookup() extended to accept code-point labels rather than "the designation of the code point itself." The same applies to the \N escape: I would rather see \N{control-NNNN} or \N{surrogate-NNNN} in string literals than some mysterious \N{U+NNNN}.
On 2 Aug 2013 09:00, "Alexander Belopolsky"
On Thu, Aug 1, 2013 at 6:09 PM, Terry Reedy
wrote:

Why would someone write 'control-' instead of 'U+'?
Because this is the recommended way to form the code-point labels:
"For each code point type without character names, code point labels are constructed by using a lowercase prefix derived from the code point type, followed by a hyphen-minus and then a 4- to 6-digit hexadecimal representation of the code point."

"To avoid any possible confusion with actual, non-null Name property values, constructed Unicode code point labels are often displayed between angle brackets: <control-0009>, <noncharacter-FFFF>, and so on. This convention is used consistently in the data files for the Unicode Character Database."

"A constructed code point label is distinguished from the designation of the code point itself (for example, “U+0009” or “U+FFFF”), which is also a unique identifier, as described in Appendix A, Notational Conventions."

<http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf>
I would rather see unicodedata.lookup() extended to accept code-point labels rather than "the designation of the code point itself." The same applies to the \N escape: I would rather see \N{control-NNNN} or \N{surrogate-NNNN} in string literals than some mysterious \N{U+NNNN}.

-1. I'd never even heard of code point labels before this thread, while the "U+" notation is incredibly common.

Cheers,
Nick.
On Thu, Aug 1, 2013 at 7:20 PM, Nick Coghlan
I'd never even heard of code point labels before this thread, while the "U+" notation is incredibly common.
Nick,

Did you see this part: "A constructed code point label is distinguished from the designation of the code point itself (for example, “U+0009” or “U+FFFF”), which is also a unique identifier"?

The purpose of unicodedata.lookup() is to look up the unicode code point by name, and "U+NNNN" is not a name - it is "the designation of the code point itself." There is no need to look up anything if you want to process an occasional s = "U+FFFF" string: chr(int(s[2:], 16)) will do the job.

The original proposal was to allow a \U+NNNN escape as a shortcut for \U0000NNNN. This is a clear readability improvement, while \N{U+001B}, for example, is not an improvement over \N{ESCAPE}. However, for more obscure control characters, \N{control-NNNN} may be clearer than any currently available spelling. For example, \N{control-001E} is easier to understand than \036, \x1e, \u001E, \N{RS} or even the most verbose \N{INFORMATION SEPARATOR TWO}.
On Thu, Aug 1, 2013 at 4:55 PM, Alexander Belopolsky < alexander.belopolsky@gmail.com> wrote:
On Thu, Aug 1, 2013 at 7:20 PM, Nick Coghlan
wrote: I'd never even heard of code point labels before this thread, while the "U+" notation is incredibly common.
<snip>
The original proposal was to allow \U+NNNN escape as a shortcut for \U0000NNNN. This is a clear readability improvement while \N{U+001B}, for example, is not an improvement over \N{ESCAPE}. However, for more obscure control characters, \N{control-NNNN} may be clearer than any currently available spelling. For example, \N{control-001E} is easier to understand than \036, \x1e, \u001E, \N{RS} or even the most verbose \N{INFORMATION SEPARATOR TWO}.
My reason to suggest including it is that it's in the standard as the label for these characters, so it's reasonable to expect lookup to know about these labels just as it knows about 'EXCLAMATION MARK'. If someone has created data using the standard and passes it to unicodedata.lookup, it should work.

I'm +/-0 on having 'control-' and 'reserved-' etc. simply being different spellings of 'U+' so that '\N{control-0021}' == '\N{U+0021}' == '\x21' == '!' even though that isn't a control character. That is, if the data doesn't conform to the standard, it wouldn't necessarily be terrible if it did something reasonable rather than raising an exception. And, I'm only suggesting this be supported on the reading side.

--- Bruce
On Thu, Aug 1, 2013 at 8:04 PM, Bruce Leban
I'm +/-0 on having 'control-' and 'reserved-' etc. simply being different spellings of 'U+' so that '\N{control-0021}' == '\N{U+0021}' == '\x21' == '!' even though that isn't a control character.
This misses the point of adding the code point type prefix. If you fat-finger \N{control-0021} instead of intended \N{control-0012} you would want a quick syntax error rather than an obscure bug. Similarly, when you are reading someone else's code, you don't want to consult the code table every time you see \N{control-NNNN} to assure that this is really a control character rather than a surrogate- or private-use- one.
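The strict check Alexander has in mind could be sketched like this, using the General_Category from unicodedata (the mapping, the function name and the ValueError are illustrative only; noncharacters need a special case because they carry category Cn):

```python
import unicodedata

# Which General_Category values each label prefix may legitimately name.
_PREFIX_CATEGORIES = {
    'control': {'Cc'},
    'surrogate': {'Cs'},
    'private-use': {'Co'},
    'reserved': {'Cn'},
}

def check_label(prefix, cp):
    """Verify that a code point label's prefix matches the code point type."""
    if prefix == 'noncharacter':
        # U+FDD0..U+FDEF plus the last two code points of every plane.
        ok = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
    else:
        ok = unicodedata.category(chr(cp)) in _PREFIX_CATEGORIES[prefix]
    if not ok:
        raise ValueError('{}-{:04X} does not match the code point type'
                         .format(prefix, cp))
    return chr(cp)
```

With such a check, \N{control-0021} would be rejected immediately, while \N{control-0012} would succeed.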
On Sat, Jul 27, 2013 at 6:01 AM, Steven D'Aprano
Why do we need yet another way of writing escape sequences?
-----------------------------------------------------------
We don't need another one, we need a better one. U+xxxx is the standard Unicode notation, while existing Python escapes have various problems.
The current situation with \u and \U escapes can hardly qualify as an obvious way to do it. There is nothing obvious about either the \u limitation to four digits or the \U requirement to have eight. (I remember discovering that after first trying something like \u1FFFF, then \U1FFFF, and then checking the reference manual to discover \U0001FFFF. I don't think my experience was unique.)

I have a counter-proposal that may improve the situation: allow 4, 5, 6 or 8 hex digits after \U, optionally surrounded by braces. When used without braces, the maximal munch rule applies: the escape sequence ends at the first non-hex-digit. I would allow only upper-case A-F in 4-6 digit escapes to minimize the need for braces.
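A rough model of how that parsing rule would behave (the regex and the `expand` name are illustrative only; 8-digit escapes are tried before the 4-6 digit maximal munch, and lower-case hex is only allowed inside braces, per the proposal):

```python
import re

_PROPOSED_U = re.compile(
    r'\\U(?:\{([0-9A-Fa-f]{4,8})\}'   # braced: digits of either case
    r'|([0-9A-F]{8})'                 # bare: exactly 8 digits, or...
    r'|([0-9A-F]{4,6}))')             # ...maximal munch of 4-6 digits

def expand(text):
    """Model of the proposed \\U escape: braces optional, 4-8 hex digits."""
    def repl(m):
        return chr(int(m.group(1) or m.group(2) or m.group(3), 16))
    return _PROPOSED_U.sub(repl, text)
```

For example, `expand('\\U1FFFFx')` would stop at the `x` and yield the single code point U+1FFFF followed by `x`, which is exactly where the existing fixed-width escapes trip people up.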
Alexander Belopolsky writes:
On Thu, Aug 1, 2013 at 8:04 PM, Bruce Leban
wrote:
I'm +/-0 on having 'control-' and 'reserved-' etc. simply being different spellings of 'U+' so that '\N{control-0021}' == '\N{U+0021}' == '\x21' == '!' even though that isn't a control character.
This misses the point of adding the code point type prefix.
Not really. That would just pass the responsibility for enforcing consistency to linters, instead of the translator. You can't just make this a syntax error because a code point may be reserved one Python version and a letter in another, depending on which versions of the Unicode tables are being used by those versions of Python. That would conflict with Unicode itself, which says that unknown code points must be treated as characters. This is way too fragile to be allowed to cause syntax errors.
If you fat-finger \N{control-0021} instead of intended \N{control-0012} you would want a quick syntax error rather than an obscure bug. Similarly, when you are reading someone else's code, you don't want to consult the code table every time you see \N{control-NNNN} to assure that this is really a control character rather than a surrogate- or private-use- one.
+0 on Bruce's idea, -1 on syntax errors.

It might on rare occasions be useful to be strict about fixed-for-all-time types like surrogate and private use. (But even those weren't fixed for all time in the past!) Really, this is an editor or linter function.
On 02/08/2013 02:08, Alexander Belopolsky wrote:
On Sat, Jul 27, 2013 at 6:01 AM, Steven D'Aprano
wrote:

Why do we need yet another way of writing escape sequences?
-----------------------------------------------------------
We don't need another one, we need a better one. U+xxxx is the standard Unicode notation, while existing Python escapes have various problems.
The current situation with \u and \U escapes can hardly qualify as an obvious way to do it. There is nothing obvious about either \u limitation to four digits nor \U requirement to have eight. (I remember discovering that after first trying something like \u1FFFF, then \U1FFFF and then checking the reference manual to discover \U0001FFFF. I don't think my experience was unique.)
I have a counter-proposal that may improve the situation: allow 4, 5, 6 or 8 hex digits after \U optionally surrounded by braces. When used without braces, maximal munch rule applies: the escape sequence ends at the first non-hex-digit. I would allow only upper-case A-F in 4-6 digits escapes to minimize the need for braces.
Perl has \x{...}. Ruby has \u{...}. Python would have \U{...}. We could follow Perl or Ruby, or both of them, or even allow braces with any of the hex escapes.
On Thu, Aug 1, 2013 at 9:15 PM, Stephen J. Turnbull
Alexander Belopolsky writes:
On Thu, Aug 1, 2013 at 8:04 PM, Bruce Leban
wrote: .. This misses the point of adding the code point type prefix. Not really. That would just pass the responsibility for enforcing consistency to linters, instead of the translator.
I have not seen a linter yet that would suggest that "\x41" should be written as "A". The choice of the best literal syntax requires human judgement. A linter cannot tell you when 1.00 is better than 1.0 or 1. I would choose a more verbose \N{control-NNNN} over shorter \uNNNN when I want to make it obvious to the human reader of my code that I use a control character rather than anything else.
You can't just make this a syntax error because a code point may be reserved one Python version and a letter in another, depending on which versions of the Unicode tables are being used by those versions of Python.
That's true, but why would you write \N{reserved-NNNN} instead of \uNNNN to begin with? I would assume you would only choose a longer spelling when it is important for your program that you use a reserved character and your program will not work correctly with the UCD version where the NNNN code point is assigned.
That would conflict with Unicode itself, which says that unknown code points must be treated as characters. This is way too fragile to be allowed to cause syntax errors.
You can always avoid syntax errors by using \uNNNN. If you choose to specify the character type you hopefully do it for a good reason.
..
It might be on rare occasions be useful to be strict about fixed-for- all-time types like surrogate and private use.
There are only five type prefixes: control-, reserved-, noncharacter-, private-use-, and surrogate-. With the possible exception of reserved-, on a rare occasion when you want to be explicit about the character type, it is useful to be strict. In case of reserved-, I cannot think of any legitimate use for a reserved character in a string literal, so if strictness is a problem in this case, I would disallow \N{reserved-NNNN} altogether.
(But even those weren't fixed for all time in the past!)
Now they are: control- property is immutable since version 1.1.5, surrogate- and private-use- since 2.0, and noncharacter- since 3.1.0. (See http://www.unicode.org/policies/stability_policy.html.) Moreover, since 2.1.0, "The enumeration of General_Category property values is fixed. No new values will be added."
On Thu, Aug 1, 2013 at 9:46 PM, MRAB
We could follow Perl or Ruby, or both of them, or even allow braces with any of the hex escapes.
That choice is unfortunately precluded by backwards compatibility because both "\u1FFFF" and "\x1FFFF" are valid strings. (Are braces optional in Perl's \x{..} or Ruby's \u{..}?) Also, the upper-case U is more in line with the U+ notation and the \N escape. If we are looking for "one obvious way," I think it should be \U, with \x and \u remaining the other less obvious ways.
Alexander Belopolsky writes:
On Thu, Aug 1, 2013 at 9:46 PM, MRAB
wrote:
We could follow Perl or Ruby, or both of them, or even allow braces with any of the hex escapes.
That choice is unfortunately precluded by backwards compatibility because both "\u1FFFF" and "\x1FFFF" are valid strings. (Are braces optional in Perl's \x{..} or Ruby's \u{..}?) Also, the upper-case U is more in-line with U+ notation and \N escape. If we are looking for "one obvious way," I think it should be \U with \x and \u remaining the other less obvious ways.
-1. The obvious way forward is \N{U+1FFFF}. That *looks* like an algorithmically generated name, and (wow!) that's what it *is*.[1]

The existing \U, \u, and \x escapes are fine as they are. They can't really be deprecated because they're needed for portability to older Python versions which won't have any of the proposed extensions.

Changing the syntax of \U to allow braces with a variable-width hexadecimal argument is only a minor compatibility break, but please have pity on the folks who support python-list. They'll forever be dealing with questions like "I know I've seen other people write '\U3bb', why do I get a weird syntax error?" and "I use Python 3.3. Why do I get a syntax error with '\U{3BB}'?"

On the other hand, \N{U+1FFFF} will currently get a lookup failure. I think that's OK, since current code needs to be prepared for that to fail anyway since it raises an error, and users will be used to it because it's easy to typo Unicode names when typing from memory -- they're pretty regular but not 100% so.

Footnotes:
[1] Of course, it's also an invalid code point in any Unicode stream. ;-)
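At the library level, such a \N{U+NNNN} fallback might behave like the sketch below (the real \N escape is resolved by the compiler, not by a function; `n_lookup` and its error message are purely illustrative):

```python
import re
import unicodedata

def n_lookup(name):
    """Resolve a \\N{...}-style name, treating U+NNNN as a generated name."""
    m = re.fullmatch(r'U\+([0-9A-Fa-f]{4,6})', name)
    if m:
        value = int(m.group(1), 16)
        # Six hex digits can exceed the Unicode range, so guard it.
        if value > 0x10FFFF:
            raise ValueError('code point out of range: ' + name)
        return chr(value)
    return unicodedata.lookup(name)   # may raise KeyError, as today
```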
On 02/08/13 09:55, Alexander Belopolsky wrote:
The original proposal was to allow \U+NNNN escape as a shortcut for \U0000NNNN. This is a clear readability improvement while \N{U+001B}, for example, is not an improvement over \N{ESCAPE}. However, for more obscure control characters, \N{control-NNNN} may be clearer than any currently available spelling. For example, \N{control-001E} is easier to understand than \036, \x1e, \u001E, \N{RS} or even the most verbose \N{INFORMATION SEPARATOR TWO}.
Despite the vigorous objections to a variable-length escape sequence[1], I still consider that the One Obvious Way to refer to a Unicode code point numerically is U+NNNN with 4-6 hex digits. Add a backslash to turn it into an escape sequence, and we have \U+NNNN.

If I'm still around when Python 4000 is under development, I'll propose that syntax as an outright replacement for the legacy escapes \NNN, \xNN, \uNNNN and \U00NNNNNN (for strings, but not bytes, where \xNN is still the OOWTDI). But that's a *long* way away. In the meantime, we're constrained by backward compatibility to keep existing escape formats.

There is considerable opposition to another variable-length escape sequence without delimiters, and \N{U+NNNN} seems a reasonable compromise to me even though it is actually longer than the current \U00NNNNNN escape. I consider this proposal to be about two things, conformity with Unicode notation and clarity, not length.

If somebody wishes to champion the proposal to support code-point labels, please start a separate thread. The two features are independent.

[1] None of which persuade me -- many languages have variable-length octal escapes, and this is the first time I've ever heard anyone complain about them being harmful.

-- Steven
On Thu, Aug 1, 2013 at 11:30 PM, Stephen J. Turnbull
-1. The obvious way forward is \N{U+1FFFF}. That *looks* like an algorithmically generated name, and (wow!) that's what it *is*.[1]
The only problem is that this is not a conforming name according to the Unicode standard. The standard is very explicit in its recommendation on how the names should be generated:

"Use in APIs. APIs which return the value of a Unicode “character name” for a given code point might vary somewhat in their behavior. An API which is defined as strictly returning the value of the Unicode Name property (the “na” attribute), should return a null string for any Unicode code point other than graphic or format characters, as that is the actual value of the property for such code points. On the other hand, an API which returns a name for Unicode code points, but which is expected to provide useful, unique labels for unassigned, reserved code points and other special code point types, should return the value of the Unicode Name property for any code point for which it is non-null, but should otherwise construct a code point label to stand in for a character name."

The recommendation on what should be accepted as a valid name is more relaxed:

"... it can be more effective for a user interface to use names that were translated or otherwise adjusted to meet the expectations of the targeted user community. By also listing the formal character name, a user interface could ensure that users can unambiguously refer to the character by the name documented in the Unicode Standard."

This does not literally preclude treating U+NNNN as a character name, but it looks like such use is discouraged:

"A constructed code point label is distinguished from the designation of the code point itself (for example, “U+0009” or “U+FFFF”), which is also a unique identifier."
[1] Of course, it's also an invalid code point in any Unicode stream. ;-)
This is not accurate. U+1FFFF is a valid code point and its generated label is <noncharacter-1FFFF>. Noncharacters "are forbidden for use in open interchange of Unicode text data. ... Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them." (See Chapter 16.7, Noncharacters.) In Python, 0x1FFFF is a valid code point:
>>> chr(0x1FFFF)
'\U0001ffff'
An application written in Python can use strings containing '\U0001ffff' internally, but should not interchange such strings with other applications.
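The 66 noncharacters Alexander refers to are easy to enumerate mechanically; a small predicate (illustrative only, not a stdlib function) makes the two rules concrete:

```python
def is_noncharacter(cp):
    """True for the 66 Unicode noncharacters: U+FDD0..U+FDEF plus
    the last two code points (nFFFE, nFFFF) of each of the 17 planes."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

print(is_noncharacter(0x1FFFF))   # True
print(is_noncharacter(0xFFFD))    # False: REPLACEMENT CHARACTER
```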
Alexander Belopolsky writes:
On Thu, Aug 1, 2013 at 9:15 PM, Stephen J. Turnbull
wrote: Alexander Belopolsky writes:
>>> This misses the point of adding the code point type prefix.
Not really. That would just pass the responsibility for enforcing consistency to linters, instead of the translator.
I have not seen a linter yet that would suggest that "\x41" should be written as "A".
Irrelevant. All I suggest the linter do is the "is \N{control-0021} consistent in the sense that U+0021 is a control character?" check. That's what you said is the point. I just want that check done outside of the compiler.

You can't just make this a syntax error because a code point may be reserved one Python version and a letter in another, depending on which versions of the Unicode tables are being used by those versions of Python.
That's true, but why would you write \N{reserved-NNNN} instead of \uNNNN to begin with?
I wouldn't. The problem isn't writing "\N{reserved-50000}". It's the other way around: I want to *write* "\N{control-50000}" which expresses my intent in Python 3.5 and not have it blow up in Python 3.4 which uses an older UCD where U+50000 is unassigned.
With the possible exception of reserved-, on a rare occasion when you want to be explicit about the character type, it is useful to be strict.
As explained above, strictness is not backward compatible with older versions of the UCD that might be in use in older versions of Python.
Alexander Belopolsky writes:
On Thu, Aug 1, 2013 at 11:30 PM, Stephen J. Turnbull
wrote:
-1. The obvious way forward is \N{U+1FFFF}. That *looks* like an algorithmically generated name, and (wow!) that's what it *is*.
The only problem is that this is not a conforming name according to the Unicode standard. The standard is very explicit in its recommendation on how the names should be generated: "Use in APIs. APIs which return the value of a Unicode “character name” [...]
This whole section of the standard is irrelevant. Of course unicodedata.name('A') should *return* 'LATIN CAPITAL LETTER A', but we're discussing the possibility of extending what unicodedata.lookup() should *accept*.
The recommendation on what should be accepted as a valid name is more relaxed: "... it can be more effective for a user interface to use names that were translated or otherwise adjusted to meet the expectations of the targeted user community."
It seems to me that's exactly what those of us who advocate using \N{} are saying.
This does not literally preclude treating U+NNNN as a character name, but it looks like such use is discouraged: "A constructed code point label is distinguished from the designation of the code point itself (for example, “U+0009” or “U+FFFF”), which is also a unique identifier."
I don't see any such implication. What's being said here is that an application should not expect a conforming implementation to treat "U+0009" and "control-0009" identically in all respects. For example, "control-0009" might be subjected to the kind of consistency check you want. Or only one of the two might be acceptable to a name lookup function. Or you might have to use different functions to convert them to characters. Steve
participants (15)
- Alexander Belopolsky
- Andrew Barnert
- Bruce Leban
- Chris Angelico
- Chris “Kwpolska” Warrick
- Greg Ewing
- Ian Foote
- Joshua Landau
- M.-A. Lemburg
- MRAB
- Nick Coghlan
- Stefan Behnel
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy