On 28 Jul 2013 10:34, "Andrew Barnert" <abarnert@yahoo.com> wrote:
>
> On Jul 28, 2013, at 1:18, Chris Angelico <rosuav@gmail.com> wrote:
>
> > On Sun, Jul 28, 2013 at 12:14 AM, Greg Ewing
> > <greg.ewing@canterbury.ac.nz> wrote:
> >> Steven D'Aprano wrote:
> >>>
> >>> Aside: you keep writing H..HHHHHH for Unicode code points. Unicode code
> >>> points go up to hex 10FFFF,
> >>
> >> They do *now*, but we can't be sure that they will stay that
> >> way in the future.
> >
> > They will for as long as UTF-16 is supported. Really, it would have
> > been better all round if UTF-16 had never existed, and everyone just
> > had to switch up to UTF-32; sure, memory would have been wasted, but
> > concepts like PEP 393 would have been devised to deal with that, and
> > we wouldn't have stupid bugs in 99% of programming languages.
>
> UTF-16 wouldn't have been a problem if it weren't almost compatible with UCS-2, allowing all kinds of Unicode 1.0 software to misleadingly claim Unicode 2.0 support. (For example, for a long time, both Windows and Java "supported" UTF-16 by treating surrogate pairs as two characters instead of one, which is like "supporting" UTF-8 by treating it like ASCII--except that the bugs are much less likely to hit developers early in the cycle.) There are use cases for which UTF-16 is perfectly reasonable. For example, strings with lots of BMP CJK characters and an occasional non-BMP character aren't helped by PEP 393, or by UTF-8, but they are helped by UTF-16. (So long as you can rely on software not treating it as UCS-2…) But anyway, this is pretty far off topic.
>
> Unicode could go past 10FFFF without dropping UTF-16, either by adding more surrogate pair ranges, or by adding surrogate triplets. It's really no different from extending UTF-8, which is no problem.
>
> The problem is that we have no way to predict how they will extend UTF-16, UTF-8, or code point notation if that ever happens. Assuming that the max length for a code point is six nibbles does sound like assuming nobody will ever need more than 640k characters.
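
For reference, the current ceiling falls straight out of the surrogate arithmetic. This is only a small sketch: the names HIGH, LOW and utf16_units are made up for illustration, and it describes UTF-16 as it stands today, not any future extension.

```python
# UTF-16 as currently defined: one high surrogate followed by one low
# surrogate addresses a code point above the BMP.
HIGH = range(0xD800, 0xDC00)  # 1024 high (lead) surrogates
LOW = range(0xDC00, 0xE000)   # 1024 low (trail) surrogates

# The pairs cover everything from U+10000 upward, which gives the limit.
max_code_point = 0x10000 + len(HIGH) * len(LOW) - 1
print(hex(max_code_point))


def utf16_units(ch):
    """Number of 16-bit code units needed to encode ch (illustrative helper)."""
    return len(ch.encode("utf-16-le")) // 2


print(utf16_units("A"))           # a BMP character
print(utf16_units("\U0001F600"))  # a non-BMP character
```

Software that treats UTF-16 as UCS-2 is effectively counting those code units as characters.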

The idea of enhancing name-based lookup by accepting the "U+" prefix as specifying a code point sounds good to me. It's already a delimited notation, doesn't require a new escape and, as someone else pointed out, allows \N to be used consistently, even if a code point doesn't have a name yet.
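
As a rough runtime sketch of those semantics (lookup_extended is a hypothetical name, and this says nothing about how the \N escape itself would be implemented in the compiler):

```python
import unicodedata


def lookup_extended(name):
    """Resolve a Unicode character name or a 'U+XXXX' code point (hypothetical)."""
    if name.upper().startswith("U+"):
        # int(..., 16) takes any number of hex digits, so nothing here
        # assumes a six-nibble maximum; chr() enforces the current range
        # and raises ValueError for anything outside it.
        return chr(int(name[2:], 16))
    # Anything else goes through the existing name-based lookup.
    return unicodedata.lookup(name)


print(lookup_extended("LATIN SMALL LETTER A"))
print(lookup_extended("U+0041"))
```

Since the braces already delimit the notation, the number of hex digits doesn't need to be fixed in advance.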

Cheers,
Nick.

>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas@python.org
> http://mail.python.org/mailman/listinfo/python-ideas