On 28/07/13 09:14, Greg Ewing wrote:
Steven D'Aprano wrote:
Aside: you keep writing H..HHHHHH for Unicode code points. Unicode code points go up to hex 10FFFF,
They do *now*, but we can't be sure that they will stay that way in the future.
Yes we can. The Unicode Consortium have guaranteed that Unicode will never be extended past code point U+10FFFF. I quote:

    Q: Will UTF-16 ever be extended to more than a million characters?

    A: No. Both Unicode and ISO 10646 have policies in place that formally limit future code assignment to the integer range that can be expressed with current UTF-16 (0 to 1,114,111).

http://www.unicode.org/faq/utf_bom.html#utf16-6

Supporting some hypothetical "Super-hyper-mega-Code" in 2035 will be as big a change as adding Unicode in the first place. It will probably require a PEP :-)

[...]
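For anyone who wants to check the limit from the interpreter, here is a quick sketch. It assumes Python 3.3 or later, where sys.maxunicode is the same on every build; the name LAST_CODE_POINT is mine, purely for illustration.

```python
import sys

# 17 planes of 65,536 code points each: the range current UTF-16 can express.
LAST_CODE_POINT = 17 * 0x10000 - 1

print(hex(LAST_CODE_POINT))               # 0x10ffff
print(LAST_CODE_POINT)                    # 1114111
print(sys.maxunicode == LAST_CODE_POINT)  # True on Python 3.3+

# chr() accepts the last code point but refuses anything beyond it.
last = chr(LAST_CODE_POINT)
try:
    chr(LAST_CODE_POINT + 1)
    rejected = False
except ValueError:
    rejected = True
print(rejected)                           # True
```

The figure 1,114,111 quoted in the FAQ is the same number as hex 10FFFF.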
I'd like to be able to tell people:
"To enter a Unicode code point in a string, put a backslash in front of it."
instead of telling them to count the number of hex digits,
But they're *still* going to have to count hex digits, and pad to 6 if it happens to be followed by a problematic character.
Most uses of hex escapes aren't followed by another hex digit: there are in excess of a million Unicode code points, and fewer than 50 are hex digits (fewer than 30 if you exclude East-Asian full-width forms). To return to the example that keeps being given, if you're writing Ethiopian text, I don't think it is actually very likely that you will want to follow ETHIOPIC SYLLABLE SEE by a Latin digit 5 with no separator between them.

Yes, it "might" happen, but there are trivial ways to deal with that, in no particular order:

- pad the code point to six digits
- don't use \U+, use a fixed-width \u or \U escape
- use string concatenation: '\U+1234' '5'
- use string substitutions (% or format or $ templates).
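As a rough illustration, these workarounds can be tried today using the existing fixed-width escapes. The \U+ form is only a proposal and is not valid Python, so this sketch substitutes \u and \U for it; the variable names are just for illustration.

```python
import unicodedata

see = "\u1234"  # fixed-width escape: exactly four hex digits
print(unicodedata.name(see))  # ETHIOPIC SYLLABLE SEE

# Each of these is the syllable followed by a Latin digit 5.
fixed_u = "\u12345"           # \u stops after four digits, so '5' is literal
fixed_U = "\U000012345"       # \U stops after eight digits, so '5' is literal
concatenated = "\u1234" "5"   # implicit string literal concatenation
percent = "%s5" % see         # % substitution
formatted = "{}5".format(see) # str.format substitution

variants = [fixed_u, fixed_U, concatenated, percent, formatted]
print(all(v == see + "5" for v in variants))  # True
print(len(fixed_u))                           # 2
```

With a fixed-width escape the trailing digit is never ambiguous, which is why the problem only arises for a variable-width form such as the proposed \U+.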
If we're going to introduce something new, we might as well design it not to have silly, awkward properties like that.
The Ruby \U{...} syntax has the following advantages:
* Very clear, not prone to editing errors
* No fixed limit on number of digits
* Extends easily to multiple code points
* Can optionally accept U+ for those who like that
* Precedent exists in at least one other language
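To make the braced form concrete, here is a minimal sketch of how such an escape could be expanded at run time. It is not existing Python syntax: the function name expand_braced_escapes, the space-separated multi-code-point form and the optional U+ prefix are all assumptions made for illustration, and a real implementation would live in the tokenizer rather than in a regex.

```python
import re

# Hypothetical syntax: \U{...} holding one or more space-separated hex
# code points, each optionally prefixed with "U+".
_BRACED = re.compile(r"\\U\{([^{}]*)\}")


def expand_braced_escapes(text):
    """Replace each braced escape group in text with the characters it names."""
    def _replace(match):
        chars = []
        for token in match.group(1).split():
            if token.upper().startswith("U+"):
                token = token[2:]
            value = int(token, 16)    # ValueError if the token is not hex
            chars.append(chr(value))  # ValueError if beyond U+10FFFF
        return "".join(chars)
    return _BRACED.sub(_replace, text)


# Raw strings keep the backslash away from the real parser.
print(expand_braced_escapes(r"\U{1234}5") == "\u12345")                       # True
print(expand_braced_escapes(r"\U{U+1234 5678 9abc}") == "\u1234\u5678\u9abc")  # True
```

Because the closing brace ends the escape, a following digit needs no padding or concatenation.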
As I said earlier, if someone wants to champion that idea, I won't object.
Or we could invent something of our own, such as using another backslash as a delimiter:
\U+1234\
Multiple characters could be written as:
\U+1234+5678+9abc\
Another suggestion which was made is:

\N{U+xxxx}

(Sorry, I have forgotten who made that suggestion originally.) That could be extended to allow multiple space-separated code points:

\N{U+xxxx U+yyyy U+zzzzz}

or

\N{U+xxxx yyyy zzzzz}

-- Steven