[Python-ideas] Support Unicode code point notation
Nick Coghlan
ncoghlan at gmail.com
Sun Jul 28 02:57:50 CEST 2013
On 28 Jul 2013 10:34, "Andrew Barnert" <abarnert at yahoo.com> wrote:
>
> On Jul 28, 2013, at 1:18, Chris Angelico <rosuav at gmail.com> wrote:
>
> > On Sun, Jul 28, 2013 at 12:14 AM, Greg Ewing
> > <greg.ewing at canterbury.ac.nz> wrote:
> >> Steven D'Aprano wrote:
> >>>
> >>> Aside: you keep writing H..HHHHHH for Unicode code points. Unicode code
> >>> points go up to hex 10FFFF,
> >>
> >> They do *now*, but we can't be sure that they will stay that
> >> way in the future.
> >
> > They will for as long as UTF-16 is supported. Really, it would have
> > been better all round if UTF-16 had never existed, and everyone just
> > had to switch up to UTF-32; sure, memory would have been wasted, but
> > concepts like PEP 393 would have been devised to deal with that, and
> > we wouldn't have stupid bugs in 99% of programming languages.
>
> UTF-16 wouldn't have been a problem if it weren't almost compatible with
> UCS2, allowing all kinds of Unicode 1.0 software to misleadingly claim
> Unicode 2.0 support. (For example, for a long time, both Windows and Java
> "supported" UTF-16 by treating surrogate pairs as two characters instead of
> one, which is like "supporting" UTF-8 by treating it like ASCII--except
> that the bugs are much less likely to hit developers early in the cycle.)
> There are use cases for which UTF-16 is perfectly reasonable. For example,
> strings with lots of BMP CJK characters and an occasional non-BMP character
> aren't helped by PEP 393, or by UTF-8, but they are helped by UTF-16. (So
> long as you can rely on software not treating it as UCS2…) But anyway, this
> is pretty far off topic.
>
> Unicode could go past 10FFFF without dropping UTF-16, either by adding
> more surrogate pair ranges, or by adding surrogate triplets. It's really no
> different from extending UTF-8, which is no problem.
>
> The problem is that we have no way to predict how they will extend
> UTF-16, UTF-8, or code point notation if that ever happens. Assuming that
> the max length for a code point is six nibbles does sound like assuming
> nobody will ever need more than 640k characters.
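
For reference, the current U+10FFFF ceiling discussed above falls straight out of the UTF-16 surrogate arithmetic. The following is only a rough sketch under the present encoding rules; the constant and function names (HIGH_START, LOW_START, to_surrogates) are made up for illustration, and it assumes Python 3.3 or later, where a non-BMP character is a single code point in a str.

```python
# Sketch: why UTF-16 as currently defined caps code points at U+10FFFF,
# and how a non-BMP character is split into a surrogate pair.
# All names here are illustrative, not from any library.

HIGH_START = 0xD800            # first high (lead) surrogate
LOW_START = 0xDC00             # first low (trail) surrogate
PAIRS = 0x400 * 0x400          # 1024 high x 1024 low combinations
MAX_CODE_POINT = 0x10000 + PAIRS - 1


def to_surrogates(cp):
    """Split a supplementary-plane code point into (high, low) code units."""
    if not 0x10000 <= cp <= MAX_CODE_POINT:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


if __name__ == "__main__":
    ch = "\U0001F600"
    print(hex(MAX_CODE_POINT))
    # One code point, but two 16-bit code units once encoded as UTF-16.
    print(len(ch), len(ch.encode("utf-16-le")) // 2)
    print([hex(unit) for unit in to_surrogates(ord(ch))])
```

Software that treats UTF-16 as UCS2 would count that character as two, which is the class of bug described above.
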
The idea of enhancing name-based lookup by accepting the "U+" prefix as
specifying a code point sounds good to me. It's already a delimited
notation, doesn't require a new escape, and, as someone else pointed out,
allows \N to be used consistently, even if a code point doesn't have a name
yet.
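
To make the idea concrete, here is a rough run-time sketch of the lookup rule. It is only an illustration: lookup_char is a hypothetical helper, not an existing unicodedata function, and today's \N{...} escape does not accept this form. Since no character name can contain "+", the "U+" prefix cannot collide with a real name. The digit count is deliberately not fixed, because the braces already delimit the notation.

```python
import string
import unicodedata


def lookup_char(spec):
    """Resolve 'U+XXXX' or a Unicode character name to a one-character str.

    Hypothetical helper sketching the proposed lookup rule; it is not
    part of the unicodedata module.
    """
    if spec[:2].upper() == "U+":
        digits = spec[2:]
        # Accept any number of hex digits rather than assuming a maximum.
        if not digits or any(d not in string.hexdigits for d in digits):
            raise KeyError("invalid code point notation: %r" % spec)
        try:
            return chr(int(digits, 16))
        except (ValueError, OverflowError):
            # chr() rejects values outside the currently valid range.
            raise KeyError("code point out of range: %r" % spec)
    # Otherwise fall back to the existing name-based lookup.
    return unicodedata.lookup(spec)


if __name__ == "__main__":
    print(repr(lookup_char("U+0041")))
    print(repr(lookup_char("LATIN SMALL LETTER A")))
```

Raising KeyError in both failure cases mirrors what unicodedata.lookup does for an unknown name, so callers only need one except clause.
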
Cheers,
Nick.
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas